
    future-crowd-14830

    10/16/2025, 5:44 PM
Hello. I have a question about running flows. My flows need to include some configuration files and other important artifacts, so I use --package-suffixes, and I'm also using a uv environment via --environment=uv. Since these arguments come before the Metaflow commands (run, show, etc.), it doesn't appear that I can set them in a configuration somehow. Is there some way to set these via a configuration, or even hard-code them into the flow script? I tried argument injection in __main__ but it didn't work.
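For what it's worth, one workaround until the CLI supports this natively is a small launcher script that injects the top-level options before the subcommand. A minimal sketch, assuming a wrapper is acceptable (the flow file name and suffix list below are placeholders, not from the thread):

```python
import subprocess
import sys

# Top-level options that must appear before the Metaflow subcommand
# (run, show, ...). Placeholder values for illustration.
TOP_LEVEL_ARGS = ["--environment=uv", "--package-suffixes", ".yaml,.json"]

def build_cmd(flow_file, user_args):
    """Place the top-level options between the flow file and the
    subcommand, where Metaflow's CLI expects them."""
    return [sys.executable, flow_file, *TOP_LEVEL_ARGS, *user_args]

def main():
    # `python wrapper.py run --with retry` then behaves like
    # `python my_flow.py --environment=uv --package-suffixes .yaml,.json run --with retry`
    return subprocess.call(build_cmd("my_flow.py", sys.argv[1:]))
```

This keeps the flow script itself untouched, at the cost of teaching users to launch via the wrapper.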

    billions-memory-41337

    10/16/2025, 2:17 AM
Is there a recommended way to implement the workflow-of-workflows pattern with Metaflow? I gather I could make it happen by chaining flow executions together with @trigger_on_finish, but it's a bit hard to visualize the high-level workflow since it's cobbled together across files. I could write a helper for this, but before I did, I wanted to check in with the community to see if there is a better way or an existing helper. Thank you in advance 😃
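If you do end up writing the helper, one lightweight approach is to mirror the @trigger_on_finish wiring as plain data so the high-level chain can be inspected in one place, then topologically sort it with the stdlib. A sketch (flow names are made up):

```python
from graphlib import TopologicalSorter

# Hypothetical helper: each flow maps to the set of flows that trigger it,
# mirroring the @trigger_on_finish(flow=...) declarations scattered
# across files.
TRIGGERS = {
    "IngestFlow": set(),          # e.g. runs on a schedule
    "TrainFlow": {"IngestFlow"},  # @trigger_on_finish(flow="IngestFlow")
    "EvalFlow": {"TrainFlow"},    # @trigger_on_finish(flow="TrainFlow")
}

def execution_order(triggers):
    """Return the flows in the order they would fire."""
    return list(TopologicalSorter(triggers).static_order())
```

The same dict could also feed a Graphviz rendering for the high-level view.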
    ✅ 1

    quick-carpet-67110

    10/15/2025, 9:17 AM
Micromamba errors
Hey everyone! I was wondering if anyone has encountered errors such as the ones shown below when running this command:
Metaflow command:
    python transformed_metaflow_pipelines/some_metaflow_file.py --environment=pypi  --branch something airflow create --generate-new-token pipelines/some_metaflow_file.py
    Error:
    2025-10-13 11:32:58.300 Bootstrapping virtual environment(s) ...
        Micromamba ran into an error while setting up environment:
        command '/home/runner/.metaflowconfig/micromamba/bin/micromamba create --yes --no-deps --download-only --safety-checks=disabled --no-extra-safety-checks --repodata-ttl=86400 --prefix=/tmp/tmpixz_p440/prefix --quiet
(omitting a bunch of package names that get dumped to the stack trace)
returned error (1)
        critical libmamba Unable to read repo solv file 'conda-forge/noarch', error was: unexpected EOF, depth = 3
Unfortunately, this error does not happen every time this command is run, and thus far we have not been able to pin down the exact conditions under which it happens, but I'm wondering if someone else has seen this before.

    delightful-actor-70552

    10/14/2025, 1:39 PM
Hey everyone! I'm super new to Metaflow and have a question re. building with/extending it - how easy is it for us to swap out pieces of the Metaflow internals for experimentation? For example:
• I'd like to have a bit more control over the directory uploaded as part of the code package - perhaps I could customise the snapshot class?
• I'd be quite interested in experimenting with a custom datastore class to serialise certain Python objects in a specific way
These objects are quite neatly abstracted into classes, but I'm not super sure how easy it is to configure Metaflow to override them when developing a new flow. Is this something you support?
    ➕ 1

    delightful-zebra-65925

    10/13/2025, 1:24 PM
Hi everyone! I've been trying to load my sandbox in the browser, but it gets stuck on the loading page. Does anyone have an idea why that might be happening?

    fast-vr-44972

    10/13/2025, 10:14 AM
    Hi, is it possible to identify from inside a flow whether a flow run is a cron job or not?

    narrow-waitress-79414

    10/10/2025, 6:29 PM
Is it possible to run part of a DAG in Metaflow? E.g., I want to run just the start step, and the other steps in a second run with resume.

    stale-ambulance-25084

    10/09/2025, 5:45 PM
Hi team! I'm diving into Metaflow and have a question for the experienced users here. I've read almost all the documentation and tried some examples, but I'm still trying to understand the core problem it was originally designed to solve and its underlying philosophy. From a practical standpoint, what are the compelling reasons for MLEs/DSs/MLOps engineers to choose Metaflow if they're currently Airflow users, for example? I've mapped out our typical ML development lifecycle (from experimentation to production), and I'm finding it hard to pinpoint specific stages where Metaflow offers a significant improvement in convenience or capability, but I believe one should exist.

    rich-agent-87730

    10/08/2025, 1:48 PM
Hi Team, I'm currently using the Config and config_expr mechanisms (which are pretty cool) to load in all of the data for my decorators. There is a pain point where it is difficult to resume a failed flow with a fix in the config file. One example is the @batch decorator: if I get an OOM error, I can just up the amount of memory in the configs, but if I resume the flow, since the old config has been cached, that gets pulled instead of the fixed new value. I currently have to edit the flow and manually update the @batch decorator with the new values. Is there a better way to do this?
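One possible workaround to sketch (this is not a built-in Metaflow feature, and the variable name is invented): have the value handed to @batch fall back to an environment variable, so a resume can be nudged without touching the cached config. Whether this interacts cleanly with config caching on resume would need testing.

```python
import os

# Hypothetical escape hatch: BATCH_MEMORY_OVERRIDE is a made-up variable
# name. If set in the shell before `resume`, it wins over the (possibly
# stale, cached) config value passed in.
def batch_memory(config_value):
    """Value to hand to @batch(memory=...)."""
    return int(os.environ.get("BATCH_MEMORY_OVERRIDE", config_value))
```

Usage would be `BATCH_MEMORY_OVERRIDE=16000 python flow.py resume ...`, with the flow calling `batch_memory(...)` where it currently reads the config directly.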

    happy-journalist-26770

    10/08/2025, 1:35 PM
Hi, is there a way (or any workaround) to run some custom setup/functions before Metaflow installs packages in the pod (Argo workflow, default image)? Basically, I'm trying to create a .netrc file or set up Git config for GitHub auth - I need to install a package from a private GitHub repo using uv.

    silly-megabyte-67326

    10/07/2025, 11:15 PM
    Is there a simple example that shows how to integrate a new kind of cluster for deployment? I have a cli that I can use to submit scripts to a cluster and I have S3 to store code and inputs/outputs. I put together a decorator as a proof of concept that will execute a step on my cluster, but I'm probably not using metaflow as intended.

    straight-shampoo-11124

    10/07/2025, 7:39 PM
    x-posting here since this will be a really fun Metaflow gathering that you don't want to miss

    abundant-quill-72601

    10/03/2025, 8:41 PM
Hi team, is there a way to inject pod priority specs in the Metaflow-Argo WF integration? I want to add a default priority class to all Metaflow jobs flowing through Argo WF, but I cannot seem to find a way to inject it. My analysis/tests so far:
• Metaflow's Argo integration doesn't support priorityClassName
• Tried adding a default priority class for Argo WF; it is also not supported
◦ Argo Workflows supports the podPriorityClassName field, but unless the upstream tool (Metaflow) exposes this in its generated Argo manifests, there is no clean injection route
• Tried adding METAFLOW_KUBERNETES_POD_SPEC_OVERRIDE; this works for direct k8s jobs, but we are running through Argo WF, and my test failed...
• Adding priorityClassName to a WorkflowTemplate (which we can create) as a direct field is also not natively available in the current Metaflow-Argo integration...
• And, of course, users cannot add pod specs in their Metaflow jobs. Not supported yet.
Let me know your recommendations, thanks a bunch!!!

    alert-needle-14247

    10/03/2025, 3:07 PM
I have a very long list of packages that I install via pypi, and I get the following error:
    "StatusReason": "Container Overrides length must be at most 8192"
    I’m happy to see that this has been fixed in version 2.18.3, using the command:
    Deployer(flow, environment="pypi").step_functions().create(max_workers=5, compress_state_machine=True)
    But now I get the following error:
    1759499526663,Downloading code package...
    1759499526675,"  File ""<string>"", line 1"
    1759499526675,"    import boto3, os; ep=os.getenv(\""METAFLOW_S3_ENDPOINT_URL\""); boto3.client(\""s3\"", **({\""endpoint_url\"":ep} if ep else {})).download_file(\""***\"", \""***/data/c3/c365cb66d0989afafb2a3ec7ee7bcb01504793a1\"", \""job.tar\"")"
    1759499526675,                                    ^
    1759499526675,SyntaxError: unexpected character after line continuation character
    1759499536683,/tmp/step_command.sh: line 1: 2025-10-03T13:52:16.681827Z: command not found
    1759499536683,/tmp/step_command.sh: line 1: task: command not found
    1759499536684,/tmp/step_command.sh: line 1: 0: command not found
    1759499536684,/tmp/step_command.sh: line 1: 2025-10-03T13:52:16.681827065+00:00]Failed: command not found
    1759499536684,Failed to download code package from s3://***/c3/c365cb66d0989afafb2a3ec7ee7bcb01504793a1 after 6 tries. Exiting...
I replaced the S3 bucket name with *, but both the bucket name and key are correct. Does anyone know why this is happening?

    millions-barista-45672

    10/03/2025, 5:15 AM
Hey, I seem to be seeing the same error popping up from the metadata-service-v2: ERROR AsyncPostgresDB global: Exception occurred. I can't see any resource constraints on the metadata service nor on our Postgres instance in RDS. We're running a t3.small at the moment. What I did notice was that ReadIOPS were hitting limits, so I upgraded from GP2 -> GP3 storage and also created indexes to help eliminate some of the huge scans I was seeing, and this has lowered the read IOPS significantly. I see a few others have mentioned this issue previously, but thought it best to ask, as each use case is sometimes different. I also restarted the service about a week ago, so we should be on the latest metadata service.

    great-egg-84692

    10/02/2025, 9:50 PM
Does anyone know what such an error could be about?

    crooked-camera-86023

    10/01/2025, 7:10 PM
Hello, has anyone encountered an issue when starting Metaflow locally? The PostgreSQL pod seems to fail to start:
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       50m                    default-scheduler  Successfully assigned default/postgresql-0 to minikube
  Normal   Pulling         49m (x4 over 50m)      kubelet            Pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Warning  Failed          49m (x4 over 50m)      kubelet            Failed to pull image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7": Error response from daemon: manifest for bitnami/postgresql:15.3.0-debian-11-r7 not found: manifest unknown: manifest unknown
  Warning  Failed          49m (x4 over 50m)      kubelet            Error: ErrImagePull
  Warning  Failed          48m (x6 over 50m)      kubelet            Error: ImagePullBackOff
  Normal   BackOff         45m (x18 over 50m)     kubelet            Back-off pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Normal   SandboxChanged  43m (x2 over 43m)      kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          42m (x3 over 43m)      kubelet            Error: ErrImagePull
  Normal   BackOff         42m (x6 over 43m)      kubelet            Back-off pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Warning  Failed          42m (x6 over 43m)      kubelet            Error: ImagePullBackOff
  Normal   Pulling         41m (x4 over 43m)      kubelet            Pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Warning  Failed          41m (x4 over 43m)      kubelet            Failed to pull image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7": Error response from daemon: manifest for bitnami/postgresql:15.3.0-debian-11-r7 not found: manifest unknown: manifest unknown
  Normal   SandboxChanged  35m (x3 over 35m)      kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          35m (x3 over 35m)      kubelet            Failed to pull image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7": Error response from daemon: manifest for bitnami/postgresql:15.3.0-debian-11-r7 not found: manifest unknown: manifest unknown
  Warning  Failed          35m (x3 over 35m)      kubelet            Error: ErrImagePull
  Warning  Failed          34m (x6 over 35m)      kubelet            Error: ImagePullBackOff
  Normal   Pulling         34m (x4 over 35m)      kubelet            Pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Normal   BackOff         10m (x109 over 35m)    kubelet            Back-off pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Normal   SandboxChanged  5m59s (x2 over 6m4s)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          5m15s (x3 over 5m59s)  kubelet            Error: ErrImagePull
  Warning  Failed          4m36s (x6 over 5m59s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling         4m23s (x4 over 6m4s)   kubelet            Pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
  Warning  Failed          4m22s (x4 over 5m59s)  kubelet            Failed to pull image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7": Error response from daemon: manifest for bitnami/postgresql:15.3.0-debian-11-r7 not found: manifest unknown: manifest unknown
  Normal   BackOff         52s (x22 over 5m59s)   kubelet            Back-off pulling image "docker.io/bitnami/postgresql:15.3.0-debian-11-r7"
    ✅ 1

    future-crowd-14830

    10/01/2025, 5:24 PM
Hello, sorry to create a new thread, but I think this is different from my previous timeout issue. I've noticed that when I use s3.get_many vs. s3.get with files that contain non-English characters in the file name (e.g. Japanese Kanji), I see a perpetual S3 transient error that doesn't resolve. I can confirm that the S3 file exists by doing an s3 ls command and seeing the name exactly as I'm providing it to the s3.get command. When I use s3.get("filename") it works fine. When I use s3.get_many(["filename"]) I see the following transient S3 failure, which doesn't resolve.
    Transient S3 failure (attempt #1) -- total success: 18, last attempt 18/20 -- remaining: 2
    Transient S3 failure (attempt #2) -- total success: 18, last attempt 0/2 -- remaining: 2
    Transient S3 failure (attempt #3) -- total success: 18, last attempt 0/2 -- remaining: 2
    Transient S3 failure (attempt #4) -- total success: 18, last attempt 0/2 -- remaining: 2
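One stdlib check worth running while debugging this: Unicode normalization. macOS and some tools emit NFD-normalized names while the key stored in S3 may be NFC; the two print identically but compare unequal, which could make a batch lookup miss a key that `s3 ls` appears to show. A sketch (the example name is made up):

```python
import unicodedata

# The same visible string in its two normalization forms: in NFD, the
# voiced katakana デ is stored as テ plus a combining voicing mark.
nfc = unicodedata.normalize("NFC", "データ.csv")
nfd = unicodedata.normalize("NFD", "データ.csv")

# They render identically but are different byte sequences, so an S3 key
# in one form will not match a locally-constructed key in the other.
```

If this turns out to be the cause, normalizing keys to one form before calling get_many should make the behavior consistent with get.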

    few-dress-69520

    10/01/2025, 4:53 PM
Hey there. I'm running into a problem when trying to use metaflow-netflixext that I haven't seen before. It's a new deployment using the Kubernetes/Argo stack, which we haven't used before, so it might have to do with some faulty configuration. I'm testing it with a very simple flow like this one. When I don't have the extensions package installed, everything works as expected: I can run it using --with kubernetes, and I can deploy it as an Argo workflow and trigger it without problems. When I install metaflow-netflixext, I can still run it locally; however, when I try to run it remotely in a container, I get an error that some 20 packages were not found in the cache. All of these look like they are required by Metaflow itself, not by the specific flow I'm running. E.g.:
    'ld_impl_linux-64-2.44-h1423503_1': not found at packages/conda/conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.44-h1423503_1.conda/ld_impl_linux-64-2.44-h1423503_1.conda/0be7c6e070c19105f966d3758448d018/ld_impl_linux-64-2.44-h1423503_1.conda
    'libgomp-15.1.0-h767d61c_4': not found at packages/conda/conda.anaconda.org/conda-forge/linux-64/libgomp-15.1.0-h767d61c_4.conda/libgomp-15.1.0-h767d61c_4.conda/3baf8976c96134738bba224e9ef6b1e5/libgomp-15.1.0-h767d61c_4.conda
    ...
    'uv-0.7.8-h2f11bb8_0': not found at packages/conda/conda.anaconda.org/conda-forge/linux-64/uv-0.7.8-h2f11bb8_0.conda/uv-0.7.8-h2f11bb8_0.conda/aff01745ebc7e711904866ee2e762a42/uv-0.7.8-h2f11bb8_0.conda
When I look at our datastore location on S3, this is true: the packages are indeed not there. Why are they not there, and what do I have to do to make them available?

    future-crowd-14830

    10/01/2025, 3:40 PM
Hello, I was wondering if anyone has suggestions for isolating branched steps using the @catch decorator. Currently I am doing some data processing in a step, and this processing can be problematic and cause various issues. To prevent a failure in this step from causing the entire flow to fail, I am using @catch and @timeout to limit the allowed execution time and catch exceptions, and I am also executing the processing in a subprocess to handle segmentation faults. The issue I'm running into is that when a memory leak occurs, which sometimes happens, the step hard-fails (it seems like Kubernetes is killing the pod), causing the entire flow to eventually fail and not complete. I am using Argo to start my Metaflow runs. I can provide more detail if it helps. Would using a custom step decorator monitoring memory usage be a good way to go? Thanks!
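The memory-monitoring idea can be sketched without any Metaflow machinery: a plain helper the step (or the subprocess driver) calls periodically, raising a catchable MemoryError before the pod-level OOM kill fires. The threshold and the call-it-periodically strategy are assumptions, not an existing decorator:

```python
import resource
import sys

LIMIT_MB = 4096  # assumed budget, below the pod's memory request

def rss_mb():
    """Peak resident set size of this process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def check_memory(limit_mb=LIMIT_MB):
    """Raise a catchable MemoryError if the budget is exceeded."""
    used = rss_mb()
    if used > limit_mb:
        raise MemoryError(f"step used ~{used:.0f} MB, limit {limit_mb} MB")
    return used
```

Since @catch handles ordinary exceptions, a MemoryError raised this way inside the step would be caught where a kernel OOM kill cannot be.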

    some-nail-13772

    10/01/2025, 9:02 AM
Hi, I am facing an issue when running metaflow_ray with JobSet. This is the error:
    1 FailedCreatePodSandBox: Failed to create pod sandbox: failed to construct FQDN from pod hostname and cluster domain, FQDN js-5b374c0-control-0-0.js-5b374c0.xxxxxxxxxxxxxxx.svc.cluster.local is too long (64 characters is the max, 74 characters requested)
1. Is there any way to fix this without changing our namespace?
2. If I set set_hostname_as_fqdn to False, is there any impact?
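For question 1, some back-of-the-envelope arithmetic with the names from the log shows how much room the namespace has (the 64-character limit is taken from the error message; the namespace in the log is redacted, so the exact overflow below is illustrative):

```python
# Pod and service names copied from the error; the FQDN template is the
# standard <pod>.<service>.<namespace>.svc.cluster.local form it reports.
POD = "js-5b374c0-control-0-0"
SERVICE = "js-5b374c0"
LIMIT = 64  # max FQDN length reported by the error

def fqdn(namespace):
    return f"{POD}.{SERVICE}.{namespace}.svc.cluster.local"

def max_namespace_len():
    """Longest namespace that keeps the FQDN within the limit."""
    return LIMIT - len(fqdn(""))
```

With these pod/service names, any namespace longer than 12 characters overflows the limit, so shortening the generated jobset name prefix (if metaflow_ray allows it) would buy room just as renaming the namespace would.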

    echoing-camera-27293

    10/01/2025, 8:09 AM
Hi everyone. Metaflow is a great product, and I succeeded in starting the dev stack. However, I did not understand from the docs how I can manage binary data. Here is my use case:
1. I load a CSV into a pandas DataFrame. One column is a filename.
2. I have to load each filename and process it (calculate the length, for instance).
So I created the following flow:
1. Load the DataFrame -> iterate over the filename column into the next step, with id as input.
2. Load the file using self.input, compute the length, store it in self.len and the id in self.id.
3. Converge into a join step that reconciles the lengths by iterating over the inputs: self.df.loc[input.id, 'len'] = input.len
However, the number of files to process is huge, and I'm thinking of moving the processing to the cloud. But the files are only on the host... What's the best way to proceed?
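The join-step bookkeeping described above can be sketched without Metaflow at all, with plain dicts standing in for `inputs` (file names are made up); for the cloud question, a common pattern is to first upload the host-only files to object storage so remote steps can fetch them by key instead of by local path:

```python
# Each dict plays the role of one foreach branch's artifacts
# (what the flow stores as self.id and self.len per branch).
inputs = [
    {"id": "scan_001.bin", "len": 10},
    {"id": "scan_002.bin", "len": 7},
]

# In the join step this is the reconciliation loop:
#   for inp in inputs: self.df.loc[inp.id, "len"] = inp.len
lengths = {inp["id"]: inp["len"] for inp in inputs}
```

Building the id -> len mapping once and assigning it into the DataFrame in one go also avoids repeated .loc writes when the file count is huge.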

    agreeable-ambulance-71794

    10/01/2025, 6:38 AM
Hey guys, good morning. I am having an issue when running an Argo template generated from a Metaflow flow. The flow includes conditional steps, recently released in Metaflow 2.18. I am able to create the template and trigger it from the Metaflow CLI. However, when I check the execution in the UI, I see an error saying the flow is not found. Any ideas/suggestions?

    bland-garden-80695

    09/30/2025, 11:37 PM
Hey guys, I have been working on building a RAG system with Metaflow (leveraging conditions and loops). Metaflow became a great fit for ingestion and embedding: I am ingesting/parsing multiple types of documents, and each parser becomes a step that can be processed in parallel. I want to know how Metaflow will behave for retrieval. For retrieval, I get the request through a REST API, and it is processed through FastAPI. I intend to replace that with Metaflow, but I'm not sure if Metaflow is a great tool for scalable real-time inference. Below is the link to the GitHub repo; the system diagram should be enough to understand the concept. https://github.com/patel-lay/contex-aware-search

    acoustic-river-26222

    09/30/2025, 9:59 PM
Hi everyone! I have a question regarding the DynamoDB table that must be configured in METAFLOW_SFN_DYNAMO_DB_TABLE. I created a flow and deployed it to AWS Step Functions; however, I don't see any items in DynamoDB. Every flow has run normally. I share an image of the table and the step functions used to run a simple flow.
    ✅ 1

    happy-journalist-26770

    09/30/2025, 6:01 AM
Hi Team, I'm trying to limit parallelism (limit the pods that spin up when using foreach) when running on k8s via Argo Workflows. The CLI option works fine:
    uv run example_flow.py --branch dev --environment=uv --with retry argo-workflows create --max-workers 2
but the env/config.json approach is not working:
    # /home/sln/.metaflowconfig/config.json:
    METAFLOW_ARGO_WORKFLOWS_CREATE_MAX_WORKERS=2
    
    # Tries these aswell:
    METAFLOW_ARGO_WORKFLOWS_MAX_WORKERS
    METAFLOW_ARGO_CREATE_MAX_WORKERS
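One thing worth double-checking, independent of which key name is correct: ~/.metaflowconfig/config.json is JSON, so entries need key/value form rather than env-file `KEY=value` syntax. A sketch, reusing the first key name from the message above (not verified as a supported option):

```json
{
  "METAFLOW_ARGO_WORKFLOWS_CREATE_MAX_WORKERS": 2
}
```

The same names can also be exported as shell environment variables, where the `KEY=value` form does apply.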
    ✅ 1

    hundreds-rainbow-67050

    09/29/2025, 4:41 AM
📢 PSA: Metaflow Office Hours
We've got an exciting talk coming up! DoorDash: 📆 Sep 30 ⏰ 9am PT RSVP: https://luma.com/office-hours-with-doordash
Why is Metaflow such a natural fit for powering food delivery AI? DoorDash will share how they've standardized their ML stack with Metaflow, creating a unified UX that boosts MLE/DS productivity while keeping their globally distributed ML Platform team tightly aligned. Hear what they've built, what they've extended, and what's next on their roadmap.
    🙌 2
    👀 2

    bland-fountain-92046

    09/27/2025, 11:17 PM
    Hi Metaflow team, I’m exploring building a REST backend in Python and I’m very interested in learning best practices from large-scale projects like Metaflow. I understand this might be a bit different from the typical questions you usually get, but this goes beyond a personal project—it’s part of an important technology decision at my company. I would greatly appreciate any guidance or references regarding project structure, modularity, and REST endpoint management in multi-service environments. Thank you very much for your time and advice.

    crooked-camera-86023

    09/27/2025, 6:22 PM
Hello Outerbounds Team: We (at Roblox) are currently evaluating Metaflow to run on a Kubernetes cluster that is managed by an Istio service mesh. As we scale our usage, we're exploring best practices for securing access to Metaflow's components and UI in a multi-tenant environment. Our security model for other applications on the cluster leverages the Istio Ingress Gateway with an OIDC provider for end-user authentication. We are interested in understanding how this pattern can be applied to Metaflow. Specifically, we have a few questions:
1. What is the recommended approach for user authentication and authorization when multiple data scientists are using Metaflow on a shared Kubernetes cluster?
2. For the Metaflow UI, if we expose it via an ingress, have you seen customers successfully place it behind an OIDC-aware proxy for authentication?
3. Are there any reference architectures for integrating Metaflow with a service mesh like Istio, particularly concerning identity propagation for jobs and fine-grained access control for the metadata service?
We are essentially looking to achieve a similar end-user authentication experience as one might find with a platform like Kubeflow, and we would appreciate any guidance or best practices you can share. Thank you for your time and help.
    ✅ 1

    nutritious-magazine-38839

    09/26/2025, 9:35 AM
The Outerbounds Sandbox doesn't seem to be working for a day or so (it gets stuck at SandboxProvisioning). @square-wire-39606 @limited-tomato-18674, could you please check if something is wrong globally? I tried with two of my email addresses, so I doubt it's specific to my account. Would love to get it working for a demo I'm doing today at 2:30pm (CET), but I understand if it's not doable (given the late night in the USA), and I'm working on a plan B.
    ✅ 1