# ask-metaflow
a
Hey. Since metaflow 2.15.8+ we experience some interesting S3 errors on our setup
2025-05-06 15:39:35.945 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #1) -- total success: 2, last attempt 2/4 -- remaining: 2
2025-05-06 15:39:39.125 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #2) -- total success: 2, last attempt 0/2 -- remaining: 2
2025-05-06 15:39:44.889 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #3) -- total success: 2, last attempt 0/2 -- remaining: 2
2025-05-06 15:39:53.515 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #4) -- total success: 2, last attempt 0/2 -- remaining: 2
which eventually brings the workflow to a complete halt. Our setup consists of an on-prem MinIO S3 instance and Metaflow, all running on a Kubernetes cluster. Switching back to 2.15.7 magically resolves the error. Any ideas? Cheers
a
@hundreds-rainbow-67050 this might be related to the recent metaflow.s3 change?
h
Possibly. Although according to the logs, the same 2 failed files continued to fail for 3 more attempts. @adorable-oxygen-86530 do you have a minimal reproducing example?
also was that log cut short? there should be 9 attempts
a
Hey @hundreds-rainbow-67050, yes, the log was cut short. The simplest example seems to be this one:
Copy code
from metaflow import FlowSpec, Parameter, step, metaflow_current, kubernetes, environment, card, pypi, conda_base


@conda_base(python='3.9.13', packages={'pandas': '1.2.1'})
class HelloWorld(FlowSpec):

    @step
    def start(self):
        import pandas as pd

        print("Hello...")
        self.df = pd.DataFrame([1, 2, 3, 4, 5], columns=["test"])
        self.batches = [1, 2, 3, 4, 5, 6]
        # fan out over the batches
        self.next(self.process_batch, foreach='batches')

    @step
    def process_batch(self):
        self.next(self.joiner)

    @step
    def joiner(self, inputs):
        # join step -- this is where the transient S3 failures show up
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    HelloWorld()
I do believe the error appears when I use foreach loops. When I remove that part of the source code, it seems to work without errors.
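As a side check, here is a minimal sketch (the bucket prefix is an assumed placeholder) that exercises the same metaflow.S3 client outside of a flow; the join step fetches its input datastores through this machinery, so if this already fails against the MinIO endpoint, the foreach/join structure is probably not the culprit:
Copy code
from metaflow import S3

# Assumed test prefix on the MinIO-backed datastore -- replace with a real one.
TEST_ROOT = "s3://my-metaflow-bucket/connectivity-test"

with S3(s3root=TEST_ROOT) as s3:
    url = s3.put("probe", "hello from metaflow.S3")  # upload a small object
    obj = s3.get("probe")                            # fetch it back
    print("round trip ok:", url, obj.text)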
e
Hi @adorable-oxygen-86530, I tried to reproduce the error by running pip install metaflow==2.15.8 and the flow you provided, but it completed successfully.
a
Hi @echoing-napkin-23329, many thanks for checking on that issue. It's interesting that it works on your machine. Can you do me a favor and try one last thing? For testing purposes we used the Docker container "python:3.13.3-slim-bullseye", where the problem arises consistently in the "joiner" step with the code given above. Can you test this as well?
h
@adorable-oxygen-86530 can you show the full stack trace? The joiner step doesn't access any artifacts so not sure what it's actually trying to download from S3
e
@adorable-oxygen-86530 I cannot fully reproduce your case without knowing your Dockerfile and how you run your container.
a
Hi there, @hundreds-rainbow-67050 The full stack trace is really boring 🙂 and it takes quite a while for the entire process to finish. In the end I had to stop it manually, because I couldn't wait any longer:
Copy code
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2025-05-13 06:53:22.909 Creating local datastore in current directory (/root/.metaflow)
2025-05-13 06:53:22.909 Bootstrapping virtual environment(s) ...
2025-05-13 06:54:04.709 Virtual environment(s) bootstrapped!
2025-05-13 06:54:11.834 Workflow starting (run-id 878):
2025-05-13 06:54:13.668 [878/start/8324 (pid 8491)] Task is starting.
2025-05-13 06:54:15.535 [878/start/8324 (pid 8491)] Hello...
2025-05-13 06:54:18.249 [878/start/8324 (pid 8491)] Foreach yields 6 child steps.
2025-05-13 06:54:18.249 [878/start/8324 (pid 8491)] Task finished successfully.
2025-05-13 06:54:19.511 [878/process_batch/8325 (pid 8849)] Task is starting.
2025-05-13 06:54:19.936 [878/process_batch/8326 (pid 8857)] Task is starting.
2025-05-13 06:54:20.269 [878/process_batch/8327 (pid 8865)] Task is starting.
2025-05-13 06:54:20.638 [878/process_batch/8328 (pid 8886)] Task is starting.
2025-05-13 06:54:21.091 [878/process_batch/8329 (pid 8894)] Task is starting.
2025-05-13 06:54:21.478 [878/process_batch/8330 (pid 8909)] Task is starting.
2025-05-13 06:54:23.674 [878/process_batch/8325 (pid 8849)] Task finished successfully.
2025-05-13 06:54:25.004 [878/process_batch/8326 (pid 8857)] Task finished successfully.
2025-05-13 06:54:26.015 [878/process_batch/8327 (pid 8865)] Task finished successfully.
2025-05-13 06:54:27.020 [878/process_batch/8328 (pid 8886)] Task finished successfully.
2025-05-13 06:54:28.948 [878/process_batch/8329 (pid 8894)] Task finished successfully.
2025-05-13 06:54:29.934 [878/process_batch/8330 (pid 8909)] Task finished successfully.
2025-05-13 06:54:31.080 [878/joiner/8331 (pid 9729)] Task is starting.
2025-05-13 06:54:32.636 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #1) -- total success: 108, last attempt 108/216 -- remaining: 108
2025-05-13 06:54:35.871 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #2) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:54:41.153 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #3) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:54:50.995 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #4) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:55:08.992 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #5) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:55:40.197 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #6) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:56:39.532 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #7) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:58:40.952 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #8) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:59:13.986 1 task is running: joiner (1 running; 0 done).
2025-05-13 06:59:13.987 No tasks are waiting in the queue.
2025-05-13 06:59:13.987 end step has not started
2025-05-13 07:03:05.177 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #9) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:04:14.245 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:04:14.245 No tasks are waiting in the queue.
2025-05-13 07:04:14.245 end step has not started
2025-05-13 07:09:14.543 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:09:14.543 No tasks are waiting in the queue.
2025-05-13 07:09:14.543 end step has not started
2025-05-13 07:09:32.498 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #10) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:14:14.779 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:14:14.779 No tasks are waiting in the queue.
2025-05-13 07:14:14.779 end step has not started
2025-05-13 07:15:19.813 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #11) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:19:15.048 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:19:15.048 No tasks are waiting in the queue.
2025-05-13 07:19:15.048 end step has not started
2025-05-13 07:20:48.775 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #12) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:24:15.981 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:24:15.982 No tasks are waiting in the queue.
2025-05-13 07:24:15.982 end step has not started
2025-05-13 07:27:23.725 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #13) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:29:16.837 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:29:16.837 No tasks are waiting in the queue.
2025-05-13 07:29:16.837 end step has not started
2025-05-13 07:33:47.695 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #14) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:34:17.724 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:34:17.724 No tasks are waiting in the queue.
2025-05-13 07:34:17.724 end step has not started
2025-05-13 07:39:18.022 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:39:18.022 No tasks are waiting in the queue.
2025-05-13 07:39:18.022 end step has not started
2025-05-13 07:39:30.754 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #15) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:44:18.066 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:44:18.066 No tasks are waiting in the queue.
2025-05-13 07:44:18.067 end step has not started
2025-05-13 07:45:30.339 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #16) -- total success: 108, last attempt 0/108 -- remaining: 108
....
....
2025-05-13 08:19:52.585 [878/joiner/8331 (pid 9729)] Data missing:
2025-05-13 08:19:52.586 [878/joiner/8331 (pid 9729)] Some input datastores are missing. Expected: 6 Actual: 0
2025-05-13 08:19:52.649 [878/joiner/8331 (pid 9729)] 
2025-05-13 08:19:53.059 [878/joiner/8331 (pid 9729)] Task failed.
@echoing-napkin-23329 Alright, sorry for the confusion - I'll try to be as precise as possible. My metaflowconfig.json looks like this (some values have been redacted):
Copy code
{
  "METAFLOW_DEFAULT_METADATA": "service",
  "METAFLOW_KUBERNETES_NAMESPACE": "<REDACTED>",
  "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "default",
  "METAFLOW_SERVICE_INTERNAL_URL": "<REDACTED>",
  "METAFLOW_SERVICE_URL": "<REDACTED>",
  "METAFLOW_S3_RETRY_COUNT": "0",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DATASTORE_ROOT": "<REDACTED>",
  "METAFLOW_DATASTORE_SYSROOT_S3": "<REDACTED>",
  "METAFLOW_S3_ENDPOINT_URL": "<REDACTED>",
  "METAFLOW_CONDA_CHANNELS": "conda-forge",
  "METAFLOW_USER": "<REDACTED>",
  "METAFLOW_DEFAULT_ENVIRONMENT": "conda",
  "METAFLOW_DEFAULT_CONTAINER_REGISTRY": "<REDACTED>",
  "METAFLOW_DEFAULT_PACKAGE_SUFFIXES": ".txt,.sql,.sh,.py,.toml,.json,.ini,.gz,.csv",
  "METAFLOW_KUBERNETES_IMAGE_PULL_POLICY": "Always",
  "METAFLOW_ARGO_EVENTS_EVENT_BUS": "default",
  "METAFLOW_ARGO_EVENTS_EVENT_SOURCE": "argo-events-webhook",
  "METAFLOW_ARGO_EVENTS_SERVICE_ACCOUNT": "operate-workflow-sa",
  "METAFLOW_ARGO_EVENTS_EVENT": "metaflow-event",
  "METAFLOW_ARGO_EVENTS_WEBHOOK_URL": "<REDACTED>",
  "METAFLOW_ARGO_WORKFLOWS_UI_URL": "<REDACTED>",
  "METAFLOW_CONDA_DEPENDENCY_RESOLVER": "micromamba",
  "METAFLOW_DEBUG_CONDA": "1"
}
I started the docker container like this:
docker run -it -v ${PWD}:/root --network host python:3.13.3-slim-bullseye
My .aws/credentials file looks regular, without any special cases. ${PWD} contains the hello_world.py shown above, .metaflowconfig/config.json, and .aws/credentials. I use VSCode to attach to the running instance and install Metaflow within VSCode using pip install metaflow==2.15.7 or pip install metaflow==2.15.8 - with 2.15.7 everything works as expected, with 2.15.8+ the error shown above appears. Hope this helps
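For reference, a quick sanity check that can be run inside the container (a sketch; it assumes these names are importable from metaflow.metaflow_config, as in recent releases) to confirm that the MinIO endpoint from the config is actually being picked up:
Copy code
# Print what Metaflow resolved from .metaflowconfig/config.json and the environment.
from metaflow.metaflow_config import (
    DATASTORE_SYSROOT_S3,
    DATATOOLS_CLIENT_PARAMS,  # should include the custom endpoint_url
    S3_ENDPOINT_URL,
)

print("S3_ENDPOINT_URL:        ", S3_ENDPOINT_URL)
print("DATASTORE_SYSROOT_S3:   ", DATASTORE_SYSROOT_S3)
print("DATATOOLS_CLIENT_PARAMS:", DATATOOLS_CLIENT_PARAMS)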
n
The same thing happened to me; I traced it back to this change: https://github.com/Netflix/metaflow/pull/2329/files#diff-71988f4dbd50f965814243cf5077668d38709078fb4c2facf604b18c16e22789R1098. Changing
Copy code
json.loads(s3clientparams) if s3clientparams else DEFAULT_S3_CLIENT_PARAMS,
to
Copy code
json.loads(s3clientparams) if s3clientparams else None,
solves the issue for me. In my case I'm targeting a custom S3-compatible provider (23m).
This is the content of those default client params:
Copy code
{
  "_user_provided_options": {
    "retries": {
      "max_attempts": 10,
      "mode": "adaptive"
    }
  },
  "_inject_host_prefix": "UNSET",
  "region_name": null,
  "signature_version": null,
  "user_agent": null,
  "user_agent_extra": null,
  "user_agent_appid": null,
  "connect_timeout": 60,
  "read_timeout": 60,
  "parameter_validation": true,
  "max_pool_connections": 10,
  "proxies": null,
  "proxies_config": null,
  "s3": null,
  "retries": {
    "max_attempts": 10,
    "mode": "adaptive"
  },
  "client_cert": null,
  "endpoint_discovery_enabled": null,
  "use_dualstack_endpoint": null,
  "use_fips_endpoint": null,
  "ignore_configured_endpoint_urls": null,
  "defaults_mode": null,
  "tcp_keepalive": null,
  "request_min_compression_size_bytes": null,
  "disable_request_compression": null,
  "client_context_params": null,
  "sigv4a_signing_region_set": null,
  "request_checksum_calculation": null,
  "response_checksum_validation": null,
  "account_id_endpoint_mode": null
}
h
Hmm I’m confused. Previously, the default was None. Is the problem that the new default conflicts with your custom provider? @numerous-alarm-85001
@adorable-oxygen-86530 can you try the suggested change and see if that also fixes it for you?
n
After debugging: if you use None, later in the code the client params for the S3 client come from DATATOOLS_CLIENT_PARAMS, which in my case looks like:
{endpoint_url ...}
By using the other variable, DEFAULT_S3_CLIENT_PARAMS, we are not passing the endpoint to the S3 client. Indeed, it seems to be a totally different object: DATATOOLS_CLIENT_PARAMS is a dictionary that includes the custom endpoint URL, while DEFAULT_S3_CLIENT_PARAMS is an object (embedded in a dictionary) that contains what I wrote above.
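To make the difference concrete, a minimal boto3 sketch (the endpoint URL is a made-up placeholder, not the actual provider): a client built from a DATATOOLS-style dictionary that carries endpoint_url talks to the custom provider, while one built without it resolves to the regular AWS endpoints, where on-prem access keys are unknown:
Copy code
import boto3

# Client params as a plain dict that includes the custom endpoint (DATATOOLS-style).
custom_params = {"endpoint_url": "http://minio.example.internal:9000"}  # placeholder
client_custom = boto3.client("s3", **custom_params)  # requests go to the custom provider

# Without endpoint_url the client falls back to the default AWS S3 endpoints,
# so credentials issued by the on-prem provider are rejected.
client_default = boto3.client("s3")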
h
I see. That would explain why no objects can be retrieved, but in one of Dan's earlier logs it was able to make progress, so that doesn't fully explain that case (perhaps another config option is in play). Thanks for identifying this issue though! I will submit a PR reverting that line later today. Our internal setup is slightly different, so it's a bit difficult for us to reproduce these issues.
n
You're welcome! After a lot of debugging, this is what gave me the clue - executing the command by hand rather than through the Metaflow code (the subprocess call obscures the warning):
Copy code
/Users/alvarodurantovar/source/yazio/ml/pricing/venv/bin/python /Users/alvarodurantovar/source/yazio/ml/pricing/venv/lib/python3.10/site-packages/metaflow/plugins/datatools/s3/s3op.py get --verify --no-verbose --no-info --listing --recursive --inputs /Users/alvarodurantovar/source/yazio/ml/pricing/metaflow.s3.mc5fn8yl/metaflow.s3.inputs.uwp8cw7t
Copy code
[WARNING] S3 datastore operation list_prefix failed (An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.). Retrying 2 more times..
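(For anyone repeating this trick, a small sketch that re-runs the s3op command Metaflow launches, with output captured so the underlying botocore error is visible instead of being folded into "Transient S3 failure"; the paths are illustrative and depend on your virtualenv and on the temporary inputs file Metaflow writes for the failing task.)
Copy code
import subprocess
import sys

# Illustrative paths -- s3op.py lives inside the installed metaflow package and
# the --inputs listing is a temporary file created by metaflow.s3 for the task.
cmd = [
    sys.executable,
    "/path/to/venv/lib/python3.10/site-packages/metaflow/plugins/datatools/s3/s3op.py",
    "get", "--verify", "--no-verbose", "--no-info", "--listing", "--recursive",
    "--inputs", "/path/to/metaflow.s3.XXXXXXXX/metaflow.s3.inputs.XXXXXXXX",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)  # the real error (e.g. InvalidAccessKeyId) shows up here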
h
looks like we should add InvalidAccessKeyId to our list of fatal exceptions -- it's unlikely that will magically start working after a retry
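(As an illustration only, not Metaflow's actual retry code: one way to classify such error codes as fatal instead of retrying them like transient failures.)
Copy code
from botocore.exceptions import ClientError

# Credential/signature errors that a retry will never fix.
FATAL_S3_ERROR_CODES = {"InvalidAccessKeyId", "SignatureDoesNotMatch"}

def is_fatal_s3_error(err: ClientError) -> bool:
    """Return True if the operation should abort immediately instead of retrying."""
    code = err.response.get("Error", {}).get("Code", "")
    return code in FATAL_S3_ERROR_CODES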
smart 1
a
@numerous-alarm-85001 Many thanks for your support here - your investigation and your post did the trick. The problem was solved by rolling back those three lines. FYI: @hundreds-rainbow-67050 - Looking forward to the newest version 😉
fox yay 2
llama yay 2