
    colossal-tent-96436

    11/06/2025, 2:55 PM
Hi, I got a question about combining the @timeout decorator with @catch. According to the documentation: "It will cause the step to be retried if needed and the exception will be caught by the @catch decorator, if present." But that is not the behaviour I'm observing: when the timeout is reached, the whole process is aborted and the pipeline fails.
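For reference, a minimal sketch of the combination I'd expect to work, based on the documented decorator ordering (flow name, step body, and timeout values are made up):

from metaflow import FlowSpec, step, timeout, catch, retry

class TimeoutFlow(FlowSpec):

    @catch(var="timeout_error")  # catch the exception once retries are exhausted
    @retry(times=2)              # the timeout raises an exception, triggering retries
    @timeout(seconds=5)
    @step
    def start(self):
        import time
        time.sleep(60)  # exceeds the 5-second timeout
        self.next(self.end)

    @step
    def end(self):
        print("caught:", self.timeout_error)

if __name__ == "__main__":
    TimeoutFlow()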

    shy-refrigerator-15055

    11/06/2025, 2:52 PM
Hello, we are running Metaflow flows in Argo and have been doing so successfully for quite a while. We upgraded to Metaflow 2.19.5, mainly for the conditional step transitions, but now we're running into an issue that makes absolutely no sense to us. Details in 🧵, thanks in advance.

    fast-vr-44972

    11/06/2025, 11:34 AM
Hi, while using the baseflow pattern with flow mutators, I get an error. Details in 🧵

    adamant-eye-92351

    11/05/2025, 11:03 AM
Hi everyone, I have a couple of questions regarding dependency management in the case of a monorepo. What we have:
• A monorepo with different projects that rely on a common set of dependencies defined in a pyproject.toml at the root
• A CI/CD pipeline that builds a custom Docker image, installing all common dependencies using poetry, every time a PR is approved
• When a flow is deployed using AWS Step Functions, it uses the custom Docker image where the (global) virtual environment has already been created
From my understanding (please correct me if I'm wrong), Metaflow will install custom sets of dependencies at run time, for a given flow, only when using the @conda and @pypi decorators. In our case, every flow that runs on AWS Batch, whether it's deployed or not, will use the custom Docker image, which creates two main pain points:
• When adding new dependencies, the inability to run a flow on Batch without first updating the pyproject.toml on the main branch (so that the CI/CD triggers itself and the Docker image is rebuilt, including the freshly added packages)
• The obvious bottleneck this will create as more projects are added, each depending on larger and most likely conflicting sets of dependencies
What I would like / my questions are:
• The ability to specify dependencies at the project level
◦ From what I've seen, the recommended way is essentially to use the @conda or @pypi decorators, is that correct? (See the sketch after this message.)
◦ Is there an official alternative if for some reason someone would like to stick with poetry or uv?
• If it's not yet possible, I was wondering what everyone's opinion is on eventually having an official decorator, say @uv(pyproject="pyproject.toml") or @uv(lock="uv.lock"), that would allow specifying dependencies at the flow level while still creating the venv at runtime.
◦ Similar to, from the docs about Netflix's Metaflow extensions: "A more full-fledged environment command allowing you to resolve environments using external requirements.txt or environment.yml files as well as inspect and rehydrate environments used in any previously run step."
In any case, happy to hear your thoughts and recommendations on that topic, thanks a lot in advance! PS: the spin feature looks amazing 🤩
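For reference, a minimal sketch of the flow-level @pypi route (package names and versions are just examples):

from metaflow import FlowSpec, step, pypi, pypi_base

@pypi_base(python="3.12.6")  # a flow-level environment, resolved at runtime
class ProjectFlow(FlowSpec):

    @pypi(packages={"pandas": "2.2.2"})  # step-level additions on top of the base
    @step
    def start(self):
        import pandas as pd
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ProjectFlow()

Run with --environment=pypi so the environment is created per flow instead of being baked into the image.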

    witty-ability-67045

    11/05/2025, 12:37 AM
Has anyone set up a Metaflow pipeline to read off an SQS queue on AWS Batch? The benefit I see in doing this is that we can keep a pipeline alive while there are still things to process on the queue, instead of having to spin up an entirely new AWS Batch job.
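Roughly what I have in mind, as a sketch (the queue URL and the processing logic are hypothetical):

from metaflow import FlowSpec, step, batch

class QueueDrainFlow(FlowSpec):

    @batch(cpu=1, memory=2048)
    @step
    def start(self):
        import boto3
        sqs = boto3.client("sqs")
        queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical
        while True:
            resp = sqs.receive_message(
                QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
            )  # long poll
            messages = resp.get("Messages", [])
            if not messages:
                break  # queue drained: let the task (and the Batch job) finish
            for msg in messages:
                self.process(msg["Body"])  # hypothetical work
                sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        self.next(self.end)

    def process(self, body):
        print("processing", body)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    QueueDrainFlow()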

    boundless-sugar-55740

    11/04/2025, 2:05 PM
Hi, we're facing a persistent scheduling issue for Metaflow flows deployed with @schedule to our modern Argo Workflows cluster. The core problem is a schema incompatibility in the generated Argo CronWorkflow manifest. Key issue: singular vs. plural fields.
1. Metaflow output: Metaflow generates the CronWorkflow using the deprecated singular field spec.schedule: '*/15 * * * *' (a string).
2. Argo controller requirement: our Argo controller (v3.6+) requires the current plural field spec.schedules: ['*/15 * * * *'] (a list).
3. The failure: as a result, the Argo controller sees an empty list (schedules: []) and throws the error "cron workflow must have at least one schedule".

    billions-memory-41337

    11/03/2025, 7:30 PM
Are there recommendations for setting up a high-availability deployment of Metaflow (metadata service + UI backend) on k8s? Are they both horizontally scalable? Any recommended resource requests/limits? Apologies if this is already documented somewhere; I was having a tough time finding it.

    early-nest-89176

    11/03/2025, 12:03 PM
Hi, I am currently trying to debug why our Metaflow steps are sometimes slow during the "bootstrap of virtual environment" phase. Is there a debugging variable I can set to get more logging out of the bootstrapping process? I am trying to locate a potential bottleneck in the network or in k8s cluster resources. In the attached screenshot the bootstrapping takes minutes, but other times it is closer to 20 seconds. We are using the --environment=pypi flag and the @pypi decorator in the example.

    fast-honey-9693

    10/31/2025, 12:34 AM
I'm running into a strange issue and am a bit stumped: it doesn't seem like host_volumes is working properly. I've got an AWS setup using Fargate. The recent change is that I'm using a custom container. The Batch job is able to launch the EC2 instance and the task starts, but for some reason the mount isn't passed through to the container properly. More details and things in thread.
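For context, this is roughly how the mount is declared (paths and flow names simplified; host_volumes is the parameter in question):

from metaflow import FlowSpec, step, batch

class MountFlow(FlowSpec):

    @batch(cpu=1, memory=2048, host_volumes=["/data"])  # mount /data from the host
    @step
    def start(self):
        import os
        print(os.listdir("/data"))  # fails if the mount didn't make it into the container
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MountFlow()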

    chilly-cat-54871

    10/30/2025, 1:58 PM
When running Metaflow on ECS Fargate, is there a client hook to add AWS tags to the Metaflow-generated AWS Batch JobDefinition? The goal is to propagate tags to the generated Fargate workers (I see propagateTags is already set to true on the JobDefinition).

    quick-carpet-67110

    10/30/2025, 1:27 PM
Hey team! Is the metaflow_metadata_service Docker image the same in ECR (under the Outerbounds org) and in DockerHub (under the netflix-oss org), or are there differences? https://gallery.ecr.aws/outerbounds/metaflow_metadata_service https://hub.docker.com/r/netflixoss/metaflow_metadata_service/tags

    happy-journalist-26770

    10/30/2025, 10:00 AM
Hi, is there a way to resume/rerun a flow after a fix with argo-workflows?

    gorgeous-florist-65298

    10/29/2025, 5:02 PM
Hey folks! Nice to meet you all. I am relatively new to working with Metaflow and was hoping to pick the collective brain. We're currently running Metaflow with Argo Workflows on GCP GKE. We are seeing increasingly more occurrences of pods getting preempted as part of Autoscaler pod-consolidation events. More often than not these are preempted before even registering in Metaflow, so I thought to implement Argo Workflows retries to deal with this, but I'm being told by the team that it is possibly not a great solution. I suppose the closest issue I found was here. Have any of you run into this on GKE, and do you have any suggestions for a more resilient deployment?
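In case it's useful to compare, task-level retries can also be declared in Metaflow itself rather than at the Argo layer; a minimal sketch (counts and resources are arbitrary), though it only helps for tasks that have already registered in Metaflow:

from metaflow import FlowSpec, step, retry, kubernetes

class ResilientFlow(FlowSpec):

    @retry(times=3, minutes_between_retries=2)  # resubmit the task if its pod is preempted
    @kubernetes(cpu=1, memory=4096)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ResilientFlow()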

    abundant-wolf-81413

    10/28/2025, 9:16 PM
Hello! I'm facing a problem installing dalex in Metaflow 2.12.5 using @pypi_base. I wrote:

@pypi_base(
    packages={
        "dalex": "1.7.2",
        ...
    },
    python="3.12.6",
)

But when Metaflow bootstraps the virtual environment, I get:

ERROR: Could not find a version that satisfies the requirement dalex==1.7.2
(from versions: 0.1.0, 0.1.2, ..., 0.2.0)

It seems that Micromamba / Metaflow is only seeing the old dalex versions (0.1.x → 0.2.0), even though dalex 1.7.2 exists on PyPI and installs fine outside Metaflow. For example, if I run locally:

pip install dalex==1.7.2

it works perfectly, but inside Metaflow's Micromamba environment it fails. It looks like Metaflow 2.12.5 with Python 3.12 + Micromamba cannot find the dalex 1.7.x wheel!

    plain-carpenter-99052

    10/28/2025, 5:43 PM
Hello! I'm facing a problem with AWS Batch running on GPU. When I use @batch(memory=512, cpu=1, gpu=1), the associated job remains in the RUNNABLE state indefinitely:
• I have double-checked that compute resources are sufficient
• I have an ECS cluster with a g4dn.xlarge instance attached to it, launched by the Auto Scaling Group that Metaflow or Batch created
• I use the AMI ami-02124cf261ef1e336, so I have CUDA installed
My CloudTrail shows a RunTask event that fails with this response element:

"responseElements": {
        "tasks": [],
        "failures": [
            {
                "arn": "arn:aws:ecs:eu-west-3:*******:container-instance/62da090445a44320811473cd2c0e4055",
                "reason": "RESOURCE:GPU"
            }

I can't understand why; it's like Batch can't see the GPU resource attached to my EC2 instance, and there is no log to help... When I launch the same step without gpu=1, it works perfectly well.

    abundant-quill-72601

    10/28/2025, 4:35 PM
Hi team, I am facing a storage issue with the Metaflow-Argo Workflows integration. When users submit jobs directly to k8s, there is no explicit ephemeral-storage request, so k8s lets them use the full allocatable storage. With Argo Workflows, pods explicitly request ephemeral-storage, limited to 10GB, and k8s enforces that. So even if the job gets scheduled onto a good-enough instance, there is this ephemeral-storage limitation... Has anyone faced this before? Any insight on the best way to solve this would be appreciated! My plan is to change the workflow defaults in Argo Workflows or inject them into the workflow template, but I'm not sure that will work.
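If the 10GB comes from the default task request, one thing worth trying is raising the request from the Metaflow side; a sketch, assuming @kubernetes exposes a disk parameter in MB (please verify against your Metaflow version):

from metaflow import FlowSpec, step, kubernetes

class BigDiskFlow(FlowSpec):

    # disk in MB (assumption): request ~50GB of ephemeral storage for this pod
    @kubernetes(cpu=2, memory=8192, disk=50000)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    BigDiskFlow()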

    quick-carpet-67110

    10/28/2025, 8:38 AM
Hello! It appears that something in the google-api-core release yesterday is not compatible with Metaflow, since we started seeing massive failures across all our pipelines this morning with this message:
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/bin/download-gcp-object", line 5, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from simple_gcp_object_downloader.download_gcp_object import main
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/simple_gcp_object_downloader/download_gcp_object.py", line 2, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from google.cloud import storage
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/google/cloud/storage/__init__.py", line 35, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from google.cloud.storage.batch import Batch
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/google/cloud/storage/batch.py", line 43, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from google.cloud import exceptions
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/google/cloud/exceptions/__init__.py", line 24, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from google.api_core import exceptions
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/google/api_core/__init__.py", line 20, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from google.api_core import _python_package_support
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]   File "/usr/local/lib/python3.12/site-packages/google/api_core/_python_package_support.py", line 28, in <module>
    [2025-10-28, 07:01:50 UTC] {pod_manager.py:471} INFO - [base]     from packaging.version import parse as parse_version
    [2025-10-28, 07:02:00 UTC] {pod_manager.py:471} INFO - [base] ModuleNotFoundError: No module named 'packaging'
The failure is coming from this line. I've looked through the Metaflow codebase and it looks like none of the GCP dependencies here are pinned to specific versions, so they are probably pulling in the latest google-api-core version.

    future-crowd-14830

    10/24/2025, 2:38 PM
Hello. I have a quick question about the @trigger_on_finish decorator. Can I use it to essentially re-trigger the same flow and keep it running continuously, instead of using a scheduled approach? I was going to attempt it, but thought I'd check with the group in case anyone knows whether this would work.
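For concreteness, this is the shape of what I mean; whether the deployment accepts a flow that names itself is exactly the open question (flow name is made up):

from metaflow import FlowSpec, step, trigger_on_finish

@trigger_on_finish(flow="LoopFlow")  # re-trigger on this same flow's completion
class LoopFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    LoopFlow()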

    chilly-cat-54871

    10/23/2025, 4:03 PM
Hello! Brand-new Metaflow user here with a potentially boneheaded question. I'm hoping to use Metaflow for a long-running petabyte-scale data transformation job with potentially up to a couple million steps/tasks, depending on how I chunk my manifest. (I've used similar approaches with Step Functions DistributedMap & ECS Fargate before, so I feel like Metaflow will be able to handle this, but please correct me.) The DAG is a simple fanout to workers, with no additional joins/reductions necessary. My question is: since I don't have any need to interrogate step artifacts after the foreach/map branches complete, how should I handle my join? Is it absolutely required, or is there some type of no-op I can specify to avoid loading a million artifacts from the datastore if I don't need to? Am I missing something / thinking about this incorrectly?
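Something like this is what I'm picturing, a join that is a pure no-op; my understanding is that artifacts are loaded lazily, so a join that never touches its inputs shouldn't pull them from the datastore (please correct me):

from metaflow import FlowSpec, step

class FanoutFlow(FlowSpec):

    @step
    def start(self):
        self.chunks = list(range(1000))  # stand-in for the real manifest chunks
        self.next(self.work, foreach="chunks")

    @step
    def work(self):
        # process self.input here; store nothing that the join would need
        self.next(self.join)

    @step
    def join(self, inputs):
        # deliberately ignore inputs: no merge_artifacts, no artifact access
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanoutFlow()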

    cool-notebook-79020

    10/23/2025, 10:59 AM
Hi, did anyone else have issues with cards timing out in the Metaflow UI ("Timeout: loading cards")? Most of the time it works after refreshing the page a few times, but not always. Running on AWS ECS.

    lively-lunch-9285

    10/21/2025, 7:37 PM
    Anyone want to move to Utah and use metaflow with me? Job 😆

    abundant-byte-82093

    10/20/2025, 11:54 AM
Is it possible to add typing support, or use pydantic models, for parameter definitions? For example, the fact that the parameter is an int, and not the string the function expects, does not get picked up by mypy:
    def building_block_echo(input: str) -> None:
        print(f"echo... {(input)}.")
    
    class MyEchoPipeline(FlowSpec):
        test = Parameter(
            "test",
            help="A test parameter to show mypy typing works",
            default=1,
            type=int)
    
        @step
        def start(self):
            print(type(self.test))
            building_block_echo(self.test)
    
            self.next(self.end)
    
        @step
        def end(self):
            pass
    
    if __name__ == "__main__":
        MyEchoPipeline()
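One workaround I can think of, a pattern rather than an official Metaflow feature, is to give mypy a typed view of the parameter while keeping the runtime definition unchanged:

from typing import TYPE_CHECKING

from metaflow import FlowSpec, Parameter, step

def building_block_echo(input: str) -> None:
    print(f"echo... {input}.")

class MyTypedPipeline(FlowSpec):  # hypothetical flow name
    if TYPE_CHECKING:
        # what mypy sees: a plain int attribute
        test: int
    else:
        # what actually runs: the Metaflow parameter descriptor
        test = Parameter("test", default=1, type=int)

    @step
    def start(self):
        building_block_echo(self.test)  # mypy now reports: int is not str
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyTypedPipeline()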

    hundreds-rainbow-67050

    10/20/2025, 3:56 AM
📣 Metaflow Office Hours 📆 When: Tuesday Oct 28th, 9am PST 📍 RSVP: https://luma.com/6q3m64uf Title: Metaflow for Platform Engineers and Agent Builders: Mutators, Custom Decorators, and Beyond! Speaker: Romain Cledat (Netflix). When Metaflow was first released, it was built for data scientists and ML researchers. As adoption grew, it evolved into a framework that can be extended and customized to meet each company's needs. In this session, Romain will show how platform teams can use mutators, user-defined decorators, and other advanced features to make Metaflow work seamlessly for their own users, unlocking new power for platform engineers and MLEs.

    future-crowd-14830

    10/16/2025, 5:44 PM
Hello. I have a question about running flows. My flows need to include some configuration files and other important artifacts, so I use --package-suffixes, and I'm also using a uv environment via --environment=uv. Since these arguments come before the Metaflow commands (run, show, etc.), it doesn't appear that I can set them in a configuration somehow. Is there some way to set these via a configuration, or even hard-code them into the flow script? I tried argument injection in __main__ but it didn't work.
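One avenue that might work, sketched under the assumption that these top-level options map to Metaflow's METAFLOW_DEFAULT_* config variables (worth verifying against metaflow_config.py for your version), is to set the corresponding environment variables before metaflow is imported:

import os

# Assumed config names; both must be set before the metaflow import below.
os.environ.setdefault("METAFLOW_DEFAULT_ENVIRONMENT", "uv")
os.environ.setdefault("METAFLOW_DEFAULT_PACKAGE_SUFFIXES", ".py,.yaml,.json")

from metaflow import FlowSpec, step

class ConfiguredFlow(FlowSpec):  # hypothetical flow name

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ConfiguredFlow()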

    billions-memory-41337

    10/16/2025, 2:17 AM
Is there a recommended way to implement the workflow-of-workflows pattern with Metaflow? I gather I could make it happen by chaining flow executions together with @trigger_on_finish, but it's a bit hard to visualize the high-level workflow since it's cobbled together across files. I could write a helper for this, but before I do, I wanted to check in with the community to see if there is a better way or existing helpers. Thank you in advance 😃
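For anyone following along, the chaining I mean looks like this (flow names invented); the visualization concern is that nothing in a single file ties the chain together:

from metaflow import FlowSpec, step, trigger_on_finish

@trigger_on_finish(flow="FirstFlow")  # deployed, this runs whenever FirstFlow finishes
class SecondFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    SecondFlow()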

    quick-carpet-67110

    10/15/2025, 9:17 AM
Micromamba errors. Hey everyone! I was wondering if anyone has encountered errors such as the ones shown below when running this command:

python transformed_metaflow_pipelines/some_metaflow_file.py --environment=pypi --branch something airflow create --generate-new-token pipelines/some_metaflow_file.py

Error:

2025-10-13 11:32:58.300 Bootstrapping virtual environment(s) ...
    Micromamba ran into an error while setting up environment:
    command '/home/runner/.metaflowconfig/micromamba/bin/micromamba create --yes --no-deps --download-only --safety-checks=disabled --no-extra-safety-checks --repodata-ttl=86400 --prefix=/tmp/tmpixz_p440/prefix --quiet
(omitting a bunch of package names that get dumped to the stack trace)
    returned error (1)
    critical libmamba Unable to read repo solv file 'conda-forge/noarch', error was: unexpected EOF, depth = 3

Unfortunately, this error does not happen every time the command is run, and thus far we have not been able to pin down the exact conditions under which it happens, but I'm wondering if someone else has seen this before.

    delightful-actor-70552

    10/14/2025, 1:39 PM
Hey everyone! I'm super new to Metaflow and have a question re. building with/extending it: how easy is it to swap out pieces of the Metaflow internals for experimentation? For example:
• I'd like a bit more control over the directory uploaded as part of the code package; perhaps I could customise the snapshot class?
• I'd be quite interested in experimenting with a custom datastore class to serialise certain Python objects in a specific way.
These objects are quite neatly abstracted into classes, but I'm not sure how easy it is to configure Metaflow to override them when developing a new flow. Is this something you support?

    delightful-zebra-65925

    10/13/2025, 1:24 PM
Hi everyone! I've been trying to load my sandbox in the browser, but it gets stuck on the loading page. Does anyone have an idea why that might be happening?

    fast-vr-44972

    10/13/2025, 10:14 AM
Hi, is it possible to identify from inside a flow whether the run was triggered by a cron schedule or not?

    narrow-waitress-79414

    10/10/2025, 6:29 PM
Is it possible to run part of a DAG in Metaflow? E.g., I want to run just the start step, and then the other steps in a second run with resume.