adorable-oxygen-86530
05/06/2025, 3:43 PM
2025-05-06 15:39:35.945 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #1) -- total success: 2, last attempt 2/4 -- remaining: 2
2025-05-06 15:39:39.125 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #2) -- total success: 2, last attempt 0/2 -- remaining: 2
2025-05-06 15:39:44.889 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #3) -- total success: 2, last attempt 0/2 -- remaining: 2
2025-05-06 15:39:53.515 [759/start/6468 (pid 139695)] Transient S3 failure (attempt #4) -- total success: 2, last attempt 0/2 -- remaining: 2
which eventually leads to a complete halt of the workflow.
Our setup consists of an on-prem MinIO S3 instance and Metaflow, all running on a Kubernetes cluster.
Switching back to 2.15.7 magically resolves the error. Any ideas?
Cheers

ancient-application-36103
05/07/2025, 12:26 AM

hundreds-rainbow-67050
05/07/2025, 12:30 AM

hundreds-rainbow-67050
05/07/2025, 12:31 AM

adorable-oxygen-86530
05/07/2025, 8:03 AM
from metaflow import FlowSpec, Parameter, step, metaflow_current, kubernetes, environment, card, pypi, conda_base

@conda_base(python='3.9.13', packages={'pandas': '1.2.1'})
class HelloWorld(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        print("Hello...")
        self.df = pd.DataFrame([1, 2, 3, 4, 5], columns=["test"])
        self.batches = [1, 2, 3, 4, 5, 6]
        self.next(self.process_batch, foreach='batches')

    @step
    def process_batch(self):
        self.next(self.joiner)

    @step
    def joiner(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    HelloWorld()
I believe the error appears when I use foreach loops. When I remove that part of the source code, it seems to work without errors.

echoing-napkin-23329
05/09/2025, 6:33 PM

adorable-oxygen-86530
05/12/2025, 9:43 AM

hundreds-rainbow-67050
05/12/2025, 2:12 PM

echoing-napkin-23329
05/12/2025, 5:58 PM

adorable-oxygen-86530
05/13/2025, 7:48 AM
Validating your flow...
The graph looks good!
Running pylint...
Pylint not found, so extra checks are disabled.
2025-05-13 06:53:22.909 Creating local datastore in current directory (/root/.metaflow)
2025-05-13 06:53:22.909 Bootstrapping virtual environment(s) ...
2025-05-13 06:54:04.709 Virtual environment(s) bootstrapped!
2025-05-13 06:54:11.834 Workflow starting (run-id 878):
2025-05-13 06:54:13.668 [878/start/8324 (pid 8491)] Task is starting.
2025-05-13 06:54:15.535 [878/start/8324 (pid 8491)] Hello...
2025-05-13 06:54:18.249 [878/start/8324 (pid 8491)] Foreach yields 6 child steps.
2025-05-13 06:54:18.249 [878/start/8324 (pid 8491)] Task finished successfully.
2025-05-13 06:54:19.511 [878/process_batch/8325 (pid 8849)] Task is starting.
2025-05-13 06:54:19.936 [878/process_batch/8326 (pid 8857)] Task is starting.
2025-05-13 06:54:20.269 [878/process_batch/8327 (pid 8865)] Task is starting.
2025-05-13 06:54:20.638 [878/process_batch/8328 (pid 8886)] Task is starting.
2025-05-13 06:54:21.091 [878/process_batch/8329 (pid 8894)] Task is starting.
2025-05-13 06:54:21.478 [878/process_batch/8330 (pid 8909)] Task is starting.
2025-05-13 06:54:23.674 [878/process_batch/8325 (pid 8849)] Task finished successfully.
2025-05-13 06:54:25.004 [878/process_batch/8326 (pid 8857)] Task finished successfully.
2025-05-13 06:54:26.015 [878/process_batch/8327 (pid 8865)] Task finished successfully.
2025-05-13 06:54:27.020 [878/process_batch/8328 (pid 8886)] Task finished successfully.
2025-05-13 06:54:28.948 [878/process_batch/8329 (pid 8894)] Task finished successfully.
2025-05-13 06:54:29.934 [878/process_batch/8330 (pid 8909)] Task finished successfully.
2025-05-13 06:54:31.080 [878/joiner/8331 (pid 9729)] Task is starting.
2025-05-13 06:54:32.636 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #1) -- total success: 108, last attempt 108/216 -- remaining: 108
2025-05-13 06:54:35.871 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #2) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:54:41.153 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #3) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:54:50.995 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #4) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:55:08.992 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #5) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:55:40.197 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #6) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:56:39.532 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #7) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:58:40.952 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #8) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 06:59:13.986 1 task is running: joiner (1 running; 0 done).
2025-05-13 06:59:13.987 No tasks are waiting in the queue.
2025-05-13 06:59:13.987 end step has not started
2025-05-13 07:03:05.177 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #9) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:04:14.245 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:04:14.245 No tasks are waiting in the queue.
2025-05-13 07:04:14.245 end step has not started
2025-05-13 07:09:14.543 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:09:14.543 No tasks are waiting in the queue.
2025-05-13 07:09:14.543 end step has not started
2025-05-13 07:09:32.498 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #10) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:14:14.779 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:14:14.779 No tasks are waiting in the queue.
2025-05-13 07:14:14.779 end step has not started
2025-05-13 07:15:19.813 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #11) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:19:15.048 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:19:15.048 No tasks are waiting in the queue.
2025-05-13 07:19:15.048 end step has not started
2025-05-13 07:20:48.775 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #12) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:24:15.981 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:24:15.982 No tasks are waiting in the queue.
2025-05-13 07:24:15.982 end step has not started
2025-05-13 07:27:23.725 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #13) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:29:16.837 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:29:16.837 No tasks are waiting in the queue.
2025-05-13 07:29:16.837 end step has not started
2025-05-13 07:33:47.695 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #14) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:34:17.724 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:34:17.724 No tasks are waiting in the queue.
2025-05-13 07:34:17.724 end step has not started
2025-05-13 07:39:18.022 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:39:18.022 No tasks are waiting in the queue.
2025-05-13 07:39:18.022 end step has not started
2025-05-13 07:39:30.754 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #15) -- total success: 108, last attempt 0/108 -- remaining: 108
2025-05-13 07:44:18.066 1 task is running: joiner (1 running; 0 done).
2025-05-13 07:44:18.066 No tasks are waiting in the queue.
2025-05-13 07:44:18.067 end step has not started
2025-05-13 07:45:30.339 [878/joiner/8331 (pid 9729)] Transient S3 failure (attempt #16) -- total success: 108, last attempt 0/108 -- remaining: 108
....
....
2025-05-13 08:19:52.585 [878/joiner/8331 (pid 9729)] Data missing:
2025-05-13 08:19:52.586 [878/joiner/8331 (pid 9729)] Some input datastores are missing. Expected: 6 Actual: 0
2025-05-13 08:19:52.649 [878/joiner/8331 (pid 9729)]
2025-05-13 08:19:53.059 [878/joiner/8331 (pid 9729)] Task failed.
@echoing-napkin-23329 Alright, sorry for the confusion - I'll try to be as precise as possible.
My metaflowconfig.json looks like this (some values are redacted):
{
"METAFLOW_DEFAULT_METADATA": "service",
"METAFLOW_KUBERNETES_NAMESPACE": "<REDACTED>",
"METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "default",
"METAFLOW_SERVICE_INTERNAL_URL": "<REDACTED>",
"METAFLOW_SERVICE_URL": "<REDACTED>",
"METAFLOW_S3_RETRY_COUNT": "0",
"METAFLOW_DEFAULT_DATASTORE": "s3",
"METAFLOW_DATASTORE_ROOT": "<REDACTED>",
"METAFLOW_DATASTORE_SYSROOT_S3": "<REDACTED>",
"METAFLOW_S3_ENDPOINT_URL": "<REDACTED>",
"METAFLOW_CONDA_CHANNELS": "conda-forge",
"METAFLOW_USER": "<REDACTED>",
"METAFLOW_DEFAULT_ENVIRONMENT": "conda",
"METAFLOW_DEFAULT_CONTAINER_REGISTRY": "<REDACTED>",
"METAFLOW_DEFAULT_PACKAGE_SUFFIXES": ".txt,.sql,.sh,.py,.toml,.json,.ini,.gz,.csv",
"METAFLOW_KUBERNETES_IMAGE_PULL_POLICY": "Always",
"METAFLOW_ARGO_EVENTS_EVENT_BUS": "default",
"METAFLOW_ARGO_EVENTS_EVENT_SOURCE": "argo-events-webhook",
"METAFLOW_ARGO_EVENTS_SERVICE_ACCOUNT": "operate-workflow-sa",
"METAFLOW_ARGO_EVENTS_EVENT": "metaflow-event",
"METAFLOW_ARGO_EVENTS_WEBHOOK_URL": "<REDACTED>",
"METAFLOW_ARGO_WORKFLOWS_UI_URL": "<REDACTED>",
"METAFLOW_CONDA_DEPENDENCY_RESOLVER": "micromamba",
"METAFLOW_DEBUG_CONDA": "1"
}
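For reference, the setting that matters most for an on-prem MinIO setup is METAFLOW_S3_ENDPOINT_URL: the custom endpoint has to reach the underlying S3 client, or requests silently go to AWS S3 instead. A minimal sketch of that mapping (the helper function and the endpoint value are hypothetical illustrations, not Metaflow's internal code):

```python
def s3_client_kwargs(cfg):
    """Hypothetical helper: translate Metaflow-style config keys into
    keyword arguments for an S3 client, e.g. boto3.client('s3', **kwargs)."""
    kwargs = {}
    endpoint = cfg.get("METAFLOW_S3_ENDPOINT_URL")
    if endpoint:
        # Without endpoint_url the client defaults to AWS S3 -- exactly
        # the failure mode this thread is chasing against on-prem MinIO.
        kwargs["endpoint_url"] = endpoint
    return kwargs

# Example with a made-up on-prem endpoint:
kwargs = s3_client_kwargs({"METAFLOW_S3_ENDPOINT_URL": "http://minio.internal:9000"})
```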
I started the docker container like this:
docker run -it -v ${PWD}:/root --network host python:3.13.3-slim-bullseye
My .aws/credentials file is a regular one, without any special cases.
${PWD} contains the hello_world.py shown above, .metaflowconfig/config.json, and .aws/credentials.
I use VSCode to attach to the running container and install Metaflow inside it with pip install metaflow==2.15.7 or pip install metaflow==2.15.8. With 2.15.7 everything works as expected; with 2.15.8+ the error shown above appears.
Hope this helps

numerous-alarm-85001
05/13/2025, 3:35 PM
Changing

json.loads(s3clientparams) if s3clientparams else DEFAULT_S3_CLIENT_PARAMS,

to

json.loads(s3clientparams) if s3clientparams else None,

solves the issue for me. In my case I'm targeting a custom S3-compatible provider (23m).

numerous-alarm-85001
05/13/2025, 3:37 PM
{
"_user_provided_options": {
"retries": {
"max_attempts": 10,
"mode": "adaptive"
}
},
"_inject_host_prefix": "UNSET",
"region_name": null,
"signature_version": null,
"user_agent": null,
"user_agent_extra": null,
"user_agent_appid": null,
"connect_timeout": 60,
"read_timeout": 60,
"parameter_validation": true,
"max_pool_connections": 10,
"proxies": null,
"proxies_config": null,
"s3": null,
"retries": {
"max_attempts": 10,
"mode": "adaptive"
},
"client_cert": null,
"endpoint_discovery_enabled": null,
"use_dualstack_endpoint": null,
"use_fips_endpoint": null,
"ignore_configured_endpoint_urls": null,
"defaults_mode": null,
"tcp_keepalive": null,
"request_min_compression_size_bytes": null,
"disable_request_compression": null,
"client_context_params": null,
"sigv4a_signing_region_set": null,
"request_checksum_calculation": null,
"response_checksum_validation": null,
"account_id_endpoint_mode": null
}
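To illustrate the effect of the one-line change a few messages up (a simplified sketch: the variable names come from this thread, but the surrounding logic and values are assumptions, not code copied from Metaflow's s3op.py): with the None fallback, an empty s3clientparams argument no longer silently pulls in DEFAULT_S3_CLIENT_PARAMS, which lacks the custom endpoint, so the caller can fall back to params that do carry it.

```python
import json

# Stand-ins for the two objects discussed in this thread (values illustrative):
DEFAULT_S3_CLIENT_PARAMS = {"config": {"retries": {"max_attempts": 10, "mode": "adaptive"}}}
DATATOOLS_CLIENT_PARAMS = {"endpoint_url": "http://minio.internal:9000"}  # carries the custom endpoint

def resolve_client_params(s3clientparams):
    # The fixed line: fall back to None instead of DEFAULT_S3_CLIENT_PARAMS,
    # so the absence of CLI-provided params stays visible to the caller.
    return json.loads(s3clientparams) if s3clientparams else None

# A caller can then prefer explicit params but fall back to the
# endpoint-aware dictionary rather than the endpoint-less default:
params = resolve_client_params(None) or DATATOOLS_CLIENT_PARAMS
assert "endpoint_url" in params
```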
hundreds-rainbow-67050
05/13/2025, 4:14 PM

hundreds-rainbow-67050
05/13/2025, 4:16 PM

numerous-alarm-85001
05/13/2025, 4:26 PM
{endpoint_url ...}
By using the other variable, DEFAULT_S3_CLIENT_PARAMS, we are not passing the endpoint to the S3 client. Indeed, it seems to be a totally different object:
DATATOOLS_CLIENT_PARAMS is a dictionary that includes the custom endpoint URL.
DEFAULT_S3_CLIENT_PARAMS is an object (embedded in a dictionary) that contains what I wrote above.

hundreds-rainbow-67050
05/13/2025, 4:40 PM

numerous-alarm-85001
05/13/2025, 5:22 PM
/Users/alvarodurantovar/source/yazio/ml/pricing/venv/bin/python /Users/alvarodurantovar/source/yazio/ml/pricing/venv/lib/python3.10/site-packages/metaflow/plugins/datatools/s3/s3op.py get --verify --no-verbose --no-info --listing --recursive --inputs /Users/alvarodurantovar/source/yazio/ml/pricing/metaflow.s3.mc5fn8yl/metaflow.s3.inputs.uwp8cw7t
[WARNING] S3 datastore operation list_prefix failed (An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.). Retrying 2 more times..
hundreds-rainbow-67050
05/13/2025, 6:11 PM
We should add InvalidAccessKeyId to our list of fatal exceptions -- it's unlikely that it will magically start working after a retry.

adorable-oxygen-86530
05/14/2025, 7:01 AM
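The retry policy discussed in the thread above -- treating errors like InvalidAccessKeyId as fatal instead of looping on them as "transient" -- can be sketched like this. All names, error-code sets, and the helper below are illustrative assumptions, not Metaflow's actual retry implementation:

```python
import time

# Illustrative classification: credential/permission errors will not heal
# with retries, while throttling or server hiccups might.
FATAL_ERROR_CODES = {"InvalidAccessKeyId", "SignatureDoesNotMatch", "AccessDenied"}

class S3OpError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code

def with_retries(op, max_attempts=5, base_delay=0.0):
    """Run op(), retrying transient S3 errors with backoff but
    failing fast on error codes known to be fatal."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except S3OpError as exc:
            if exc.code in FATAL_ERROR_CODES or attempt == max_attempts:
                raise  # fatal, or out of retries: surface immediately
            time.sleep(base_delay * attempt)  # transient: back off and retry
```

This fails fast on a bad access key instead of spending hours in the retry loop shown in the logs earlier in the thread.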