witty-energy-7748
11/14/2025, 1:56 PM
poetry run python metaflow_jobs/flows/train.py --with kubernetes --environment=pypi run
Metaflow 2.19.7 executing CovtypeTrainingPipelineFlow for user:chen.liang
Project: covtype, Branch: user.chen.liang
Validating your flow...
The graph looks good!
Running pylint...
Pylint not found, so extra checks are disabled.
2025-11-14 08:51:19.058 Bootstrapping virtual environment(s) ...
2025-11-14 08:51:19.213 Virtual environment(s) bootstrapped!
2025-11-14 08:51:19.577 Workflow starting (run-id 2), see it in the UI at http://localhost:3000/CovtypeTrainingPipelineFlow/2
2025-11-14 08:51:20.588 [2/start/4 (pid 81134)] Task is starting.
2025-11-14 08:51:21.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Task is starting (Pod is pending, Container is waiting - ContainerCreating)...
2025-11-14 08:51:21.689 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Setting up task environment.
2025-11-14 08:51:26.033 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Downloading code package...
2025-11-14 08:51:26.498 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Code package downloaded.
2025-11-14 08:51:26.526 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Task is starting.
2025-11-14 08:51:26.913 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Bootstrapping virtual environment...
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Bootstrap failed while executing: set -e;
2025-11-14 08:51:34.444 [2/start/4 (pid 81134)] Kubernetes error:
2025-11-14 08:51:34.444 [2/start/4 (pid 81134)] Error: Setting up task environment.
2025-11-14 08:51:34.541 [2/start/4 (pid 81134)] Downloading code package...
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Code package downloaded.
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Task is starting.
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Bootstrapping virtual environment...
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Bootstrap failed while executing: set -e;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] tmpfile=$(mktemp);
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] echo "@EXPLICIT" > "$tmpfile";
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] ls -d /metaflow/.pkgs/conda/*/* >> "$tmpfile";
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export PATH=$PATH:$(pwd)/micromamba;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export CONDA_PKGS_DIRS=$(pwd)/micromamba/pkgs;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export MAMBA_NO_LOW_SPEED_LIMIT=1;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export MAMBA_USE_INDEX_CACHE=1;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export MAMBA_NO_PROGRESS_BARS=1;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] export CONDA_FETCH_THREADS=1;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] micromamba create --yes --offline --no-deps --safety-checks=disabled --no-extra-safety-checks --prefix /metaflow/linux-aarch64/538301e1d2a9acc --file "$tmpfile" --no-pyc --no-rc --always-copy;
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] rm "$tmpfile"
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Stdout:
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] Stderr: ls: cannot access '/metaflow/.pkgs/conda/*/*': No such file or directory
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)]
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)] (exit code 1). This could be a transient error. Use @retry to retry.
2025-11-14 08:51:34.542 [2/start/4 (pid 81134)]
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] tmpfile=$(mktemp);
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] echo "@EXPLICIT" > "$tmpfile";
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] ls -d /metaflow/.pkgs/conda/*/* >> "$tmpfile";
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export PATH=$PATH:$(pwd)/micromamba;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export CONDA_PKGS_DIRS=$(pwd)/micromamba/pkgs;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export MAMBA_NO_LOW_SPEED_LIMIT=1;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export MAMBA_USE_INDEX_CACHE=1;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export MAMBA_NO_PROGRESS_BARS=1;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] export CONDA_FETCH_THREADS=1;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] micromamba create --yes --offline --no-deps --safety-checks=disabled --no-extra-safety-checks --prefix /metaflow/linux-aarch64/538301e1d2a9acc --file "$tmpfile" --no-pyc --no-rc --always-copy;
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] rm "$tmpfile"
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Stdout:
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29] Stderr: ls: cannot access '/metaflow/.pkgs/conda/*/*': No such file or directory
2025-11-14 08:51:33.352 [2/start/4 (pid 81134)] [pod t-07ea3803-xkdgx-jth29]
2025-11-14 08:51:34.553 [2/start/4 (pid 81134)] Task failed.
2025-11-14 08:51:34.562 Workflow failed.
2025-11-14 08:51:34.562 Terminating 0 active tasks...
2025-11-14 08:51:34.562 Flushing logs...
Step failure:
Step start (task-id 4) failed.
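A minimal, hypothetical sketch of the @retry suggestion from the error message above, reusing the flow and step names shown in the log; whether a retry actually helps is unclear, since the missing /metaflow/.pkgs/conda/*/* package cache does not look transient.

# Hypothetical sketch only: adds @retry to the failing start step, as the
# "Use @retry to retry" hint in the log suggests; the real flow body and
# its other steps are not shown in the log.
from metaflow import FlowSpec, step, retry

class CovtypeTrainingPipelineFlow(FlowSpec):

    @retry(times=3, minutes_between_retries=1)
    @step
    def start(self):
        # original step body omitted
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    CovtypeTrainingPipelineFlow()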
quick-carpet-67110
11/12/2025, 12:22 PM
jackieob/metadata_service:gcp.rc1
link to code
Is this image still maintained? I was looking for the source repository of this image and it appears to be this one, but most of the commits are from years ago.
We are trying to take advantage of the newly released spin feature, but it looks like that requires a newer version of the metadata service.
shy-refrigerator-15055
11/12/2025, 10:28 AM
able-alligator-13951
11/11/2025, 9:20 PM
argo retry).
When I retry the workflow, Argo changes the workflow status to Running and deletes all failed pods, but it gets stuck without rerunning any step. This is happening especially with workflows that use the foreach instruction.
Is this the expected current behavior? Should I instead rely on the Metaflow retry config (--with retry) only?
handsome-book-90365
11/11/2025, 7:18 AM
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyG. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
and without these flags it is throwing a bus error, basically unable to allocate memory for a large graph.
How do I set --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 in our Batch jobs?
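Not an authoritative answer, but a hedged sketch for the shared-memory part: to the best of my knowledge Metaflow's @batch decorator accepts a shared_memory value (size of /dev/shm in MiB), which is the rough analogue of tuning --ipc/--shm-size on docker run. The ulimit flags are not covered here and may need to be set on the Batch job definition itself (containerProperties.ulimits); step names and resource values below are placeholders.

# Sketch under the assumptions stated above.
from metaflow import FlowSpec, step, batch

class LargeGraphFlow(FlowSpec):

    @batch(memory=32000, cpu=8, gpu=1, shared_memory=8192)  # 8 GiB /dev/shm
    @step
    def start(self):
        # load the large PyG graph here
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    LargeGraphFlow()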
modern-summer-61066
11/10/2025, 8:03 PM
@retry on the foreach step and with --max-workers 1, Metaflow will process one child step at a time. If one of the child steps fails: when I run the flow locally, it retries the failed child step with the interval, but when I run the flow with Argo Workflows, it skips the retry, runs the other child steps, and then comes back to retry the failed step. Is this expected behavior with Argo Workflows?
colossal-tent-96436
11/06/2025, 2:55 PM
@timeout decorator combined with @catch. According to the documentation:
It will cause the step to be retried if needed and the exception will be caught by the @catch decorator, if present.
But that is not the behaviour I'm observing. When the timeout is reached, the whole process is aborted and the pipeline fails.
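A minimal sketch of the combination being described, with a trivial step body that deliberately exceeds the timeout; the decorator names and the quoted behaviour come from the Metaflow docs, the rest is illustrative.

# Illustrative only: @timeout limits the step's runtime, and @catch (if the
# documented behaviour holds) should store the failure in an artifact
# instead of failing the whole run.
from metaflow import FlowSpec, step, timeout, catch

class TimeoutCatchFlow(FlowSpec):

    @catch(var="start_failure")
    @timeout(seconds=10)
    @step
    def start(self):
        import time
        time.sleep(60)  # deliberately exceeds the 10-second timeout
        self.next(self.end)

    @step
    def end(self):
        print("caught:", self.start_failure)

if __name__ == "__main__":
    TimeoutCatchFlow()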
shy-refrigerator-15055
11/06/2025, 2:52 PM
2.19.5, mainly for the conditional step transitions, but now we run into an issue that makes absolutely no sense to us. Details in 🧵, thanks in advance.
fast-vr-44972
11/06/2025, 11:34 AM
adamant-eye-92351
11/05/2025, 11:03 AM
@conda and @pypi decorators. In our case, every flow that is run on AWS Batch, whether it's deployed or not, will use the custom Docker image, which creates two main pain points:
• When adding new dependencies, the inability to run a flow with Batch without first having updated the pyproject.toml on the main branch (so that the CI/CD triggers itself and the Docker image is rebuilt, including the freshly added packages)
• The obvious bottlenecks it will create as more projects are created, depending on larger and larger images. The alternative seems to be the @conda or @pypi decorators, is that correct?
◦ Is there an official alternative if for some reason someone would like to stick with poetry or uv?
• If it's not yet possible, I was wondering what everyone's opinion is on eventually having an official decorator, let's say @uv(pyproject="pyproject.toml") or @uv(lock="uv.lock"), that would allow specifying the dependencies at the flow level while allowing the creation of the venv at runtime.
◦ Similar to, from the doc about Netflix's Metaflow extensions: "A more full-fledged environment command allowing you to resolve environments using external requirements.txt or environment.yml files as well as inspect and rehydrate environments used in any previously run step."?
In any case, happy to hear your thoughts and recommendations on that topic, thanks a lot in advance!
PS: the spin feature looks amazing 🤩
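For comparison, a minimal sketch of the existing flow-level mechanism that the hypothetical @uv decorator is being contrasted with; package names and versions are placeholders, and the flow would be run with --environment=pypi so the environment is resolved at runtime.

# Placeholder packages: @pypi_base resolves and builds the step environments
# at runtime from the pinned specs, without a custom Docker image.
from metaflow import FlowSpec, pypi_base, step

@pypi_base(packages={"pandas": "2.2.2", "scikit-learn": "1.5.0"}, python="3.12.6")
class PypiManagedFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PypiManagedFlow()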
witty-ability-67045
11/05/2025, 12:37 AM
boundless-sugar-55740
11/04/2025, 2:05 PM
@schedule to our modern Argo Workflows cluster.
The core problem is a schema incompatibility in the generated Argo CronWorkflow manifest.
Key Issue: Singular vs. Plural Fields
1. Metaflow Output: Metaflow is generating the CronWorkflow using the deprecated singular field: spec.schedule: '*/15 * * * *' (a string).
2. Argo Controller Requirement: Our Argo Controller (v3.6+) requires the current plural field: spec.schedules: ['*/15 * * * *'] (a list).
3. The Failure: As a result, the Argo Controller sees an empty list (schedules: []) and throws the error: "cron workflow must have at least one schedule".
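For context, a hedged sketch of the kind of flow that produces the manifest described above; the flow name is a placeholder and the cron expression is the one from the report.

# Illustrative flow: @schedule(cron=...) is what ends up as spec.schedule
# (singular) in the generated Argo CronWorkflow.
from metaflow import FlowSpec, schedule, step

@schedule(cron="*/15 * * * *")
class EveryFifteenMinutesFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    EveryFifteenMinutesFlow()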
billions-memory-41337
11/03/2025, 7:30 PM
early-nest-89176
11/03/2025, 12:03 PM
fast-honey-9693
10/31/2025, 12:34 AM
host_volumes is working properly.
i've got an AWS setup, using fargate. the recent change is that i'm using a custom container. the batch job is able to launch the ec2 instance, and the task starts, but for some reason the mount isn't passed through to the container properly.
more details and things in thread.
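For reference, a minimal sketch of how host_volumes is typically passed to @batch (the path and resource values are placeholders); to my understanding host-path bind mounts are an EC2-launch-type feature, so it is worth verifying they are expected to work at all on Fargate.

# Sketch with placeholder paths: host_volumes asks AWS Batch to mount the
# given host directories into the task container at the same paths.
from metaflow import FlowSpec, step, batch

class HostVolumeFlow(FlowSpec):

    @batch(cpu=2, memory=4096, host_volumes=["/data"])
    @step
    def start(self):
        import os
        print(os.listdir("/data"))  # should list the host directory's contents
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    HostVolumeFlow()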
chilly-cat-54871
10/30/2025, 1:58 PM
quick-carpet-67110
10/30/2025, 1:27 PM
Is the metaflow_metadata_service Docker image the same in ECR (under the Outerbounds org) and in DockerHub (under the netflix-oss org), or are there differences?
https://gallery.ecr.aws/outerbounds/metaflow_metadata_service
https://hub.docker.com/r/netflixoss/metaflow_metadata_service/tags
happy-journalist-26770
10/30/2025, 10:00 AM
gorgeous-florist-65298
10/29/2025, 5:02 PM
abundant-wolf-81413
10/28/2025, 9:16 PM
@pypi_base(
    packages={
        "dalex": "1.7.2",
        ...
    },
    python="3.12.6",
)
But when Metaflow bootstraps the virtual environment, I get:
ERROR: Could not find a version that satisfies the requirement dalex==1.7.2
(from versions: 0.1.0, 0.1.2, ..., 0.2.0)
It seems that Micromamba / Metaflow is only seeing the old dalex versions (0.1.x → 0.2.0), even though dalex 1.7.2 exists on PyPI and installs fine outside Metaflow.
For example, if I run locally:
pip install dalex==1.7.2
It works perfectly, but inside Metaflow's Micromamba environment it fails.
It looks like Metaflow 2.12.5 with Python 3.12 + Micromamba cannot find the dalex 1.7.x wheel!!
plain-carpenter-99052
10/28/2025, 5:43 PM
@batch(memory=512, cpu=1, gpu=1) the associated job remains in the RUNNABLE state indefinitely:
• I have double-checked that compute resources are sufficient
• I have an ECS Cluster with a g4dn.xlarge instance attached to it, launched by the Auto Scaling Group that Metaflow/Batch created.
• I use the AMI ami-02124cf261ef1e336, so I have CUDA installed
My CloudTrail shows a RunTask event that fails with the response element:
"responseElements": {
"tasks": [],
"failures": [
{
"arn": "arn:aws:ecs:eu-west-3:*******:container-instance/62da090445a44320811473cd2c0e4055",
"reason": "RESOURCE:GPU"
}
I can't understand why; it's like Batch couldn't see the GPU resource attached to my EC2 instance, and there is no log to help...
When I launch the same step without gpu=1, it works perfectly well.
abundant-quill-72601
10/28/2025, 4:35 PM
quick-carpet-67110
10/28/2025, 8:38 AM
It seems the google-api-core release from yesterday is not compatible with Metaflow, since we have started seeing massive failures of all our pipelines this morning with this message:
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/bin/download-gcp-object", line 5, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from simple_gcp_object_downloader.download_gcp_object import main
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/simple_gcp_object_downloader/download_gcp_object.py", line 2, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from google.cloud import storage
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/google/cloud/storage/__init__.py", line 35, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from google.cloud.storage.batch import Batch
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/google/cloud/storage/batch.py", line 43, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from google.cloud import exceptions
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/google/cloud/exceptions/__init__.py", line 24, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from google.api_core import exceptions
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/google/api_core/__init__.py", line 20, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from google.api_core import _python_package_support
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] File "/usr/local/lib/python3.12/site-packages/google/api_core/_python_package_support.py", line 28, in <module>
[2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base] from packaging.version import parse as parse_version
[2025-10-28, 07:02:00 UTC] {pod_manager.py:471} INFO - [base] ModuleNotFoundError: No module named 'packaging'
The failure is coming from this line. I've looked through the Metaflow codebase and it looks like none of the GCP dependencies here are pinned to specific versions, so they are probably pulling in the latest google-api-core version.
future-crowd-14830
10/24/2025, 2:38 PM
chilly-cat-54871
10/23/2025, 4:03 PM
join? Is it absolutely required, or is there some type of no-op I can specify to avoid loading a million artifacts from the datastore if I don't need to? Am I missing something / thinking about this incorrectly?
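A minimal sketch of a "no-op" join after a foreach (names are placeholders): a join step is required to close the split, but to my understanding artifacts from the inputs are loaded lazily, so a join body that never touches them stays cheap.

# Placeholder flow: the join deliberately ignores `inputs` and does not call
# merge_artifacts, so nothing is pulled from the datastore there.
from metaflow import FlowSpec, step

class NoOpJoinFlow(FlowSpec):

    @step
    def start(self):
        self.items = list(range(5))
        self.next(self.process, foreach="items")

    @step
    def process(self):
        print("processing", self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    NoOpJoinFlow()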
cool-notebook-79020
10/23/2025, 10:59 AM
Timeout: loading cards
Most of the time it works after refreshing the page a few times, but not always.
Running on AWS ECS.
lively-lunch-9285
10/21/2025, 7:37 PM
abundant-byte-82093
10/20/2025, 11:54 AM
from metaflow import FlowSpec, Parameter, step

def building_block_echo(input: str) -> None:
    print(f"echo... {(input)}.")

class MyEchoPipeline(FlowSpec):
    test = Parameter(
        "test",
        help="A test parameter to show mypy typing works",
        default=1,
        type=int)

    @step
    def start(self):
        print(type(self.test))
        building_block_echo(self.test)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyEchoPipeline()
hundreds-rainbow-67050
10/20/2025, 3:56 AM
future-crowd-14830
10/16/2025, 5:44 PM
--package-suffixes and I'm also using a uv environment via --environment=uv. Since these arguments come before the Metaflow commands (run, show, etc.), it doesn't appear that I can set them in a configuration somewhere. Is there some way to set these via configuration, or even hard-code them into the flow script? I tried argument injection in __main__ but it didn't work.