# ask-metaflow
s
Hello. We have a multi-flow data pipeline that is currently manual, i.e. a person runs the commands to deploy and trigger each flow as an Argo workflow, one after the other. Because it has been developed this way, each flow in the process has ended up with its own parameters, some of which are required. We want to move to using @trigger_on_finish to launch the next flow without the manual intervention of a person, while still allowing us to trigger flows manually if needed, but the parameters are causing an issue. Is there a way to pass parameters when creating the template, or some other way that doesn't force the previous flow in the chain to know about all the downstream parameters?
Worst case, we'll end up doing the work to change all the parameter handling to work around this, but it would be nice if there is a way to avoid that work.
s
Hello Emily! These parameters that you mention: are they dependent on any of the steps that you run in an upstream flow?
Or are they just unique to the flow you would want to run as part of the triggered flow?
I guess what I am asking is: are the values of these parameters generated as part of running an upstream flow, and do they then need to find their way into a downstream flow?
s
No, they are things like the Google Drive folder ID to put an output report in, or the name of a dataset to use.
I get what you are saying; we do have some like that, but we also already have a parameter for the upstream flow, so I can see how we can do that already. 🙂
s
In your particular use case, is it acceptable to pass the universe of parameters to the first flow in the daisy chain?
s
We could do; that is one of the options I'm looking at. The fun bit is still being able to start the process from any of the flows and to handle the parameters well.
The thing I'm trying to avoid is all of the flows having to know about all of their downstream dependencies and what all their parameters are.
s
Oh interesting, so in addition to the daisy chain you do want to retain the ability for the flows to run independently as well?
s
yeah
s
> The thing I'm trying to avoid is all of the flows having to know about all of their downstream dependencies and what all their parameters are

Makes sense. The option I was thinking about was passing the universe of parameters into the first flow in the daisy chain. You could assign these parameters to `self` and then access them downstream: https://docs.metaflow.org/production/event-triggering/flow-events
> When using @trigger_on_finish, you can access information about the triggering runs through current.trigger.run or current.trigger.runs in the case of multiple flows, which return one or more Run objects. Use the Run object to access artifacts as you do when using the Client API directly.
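A minimal sketch of that pattern, assuming made-up flow, parameter, and artifact names (in practice each flow lives in its own module, and the fallback used for manual runs is up to you):

```python
from metaflow import FlowSpec, Parameter, current, step, trigger_on_finish


# first_flow.py: accepts the whole universe of parameters up front and
# stashes the ones it does not use itself as artifacts on self.
class FirstFlow(FlowSpec):
    doc_folder_root = Parameter(
        "doc-folder-root", default="", help="Passed through for downstream flows"
    )

    @step
    def start(self):
        self.doc_folder_root_value = self.doc_folder_root  # persisted as an artifact
        self.next(self.end)

    @step
    def end(self):
        pass


# second_flow.py: triggered when FirstFlow finishes; reads the value back
# from the triggering run, or keeps its own parameter when run manually.
@trigger_on_finish(flow="FirstFlow")
class SecondFlow(FlowSpec):
    doc_folder_root = Parameter(
        "doc-folder-root", default="", help="Used when the flow is run on its own"
    )

    @step
    def start(self):
        if current.trigger:  # set only when the run was event-triggered
            self.folder = current.trigger.run.data.doc_folder_root_value
        else:
            self.folder = self.doc_folder_root
        self.next(self.end)

    @step
    def end(self):
        pass
```

This keeps each flow runnable on its own while letting the triggered path inherit values from the upstream run instead of redeclaring them everywhere.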
The parameters themselves: if they do not have any dynamic portions, then you could consider passing them into the workflow template as env vars?
s
Ooh, that's an interesting idea. Thank you, that sounds like it might work. I'll have a play and see how we get on.
👍🏽 1
Got something that looks hopeful using a combination of environment variables and deploy-time parameters: https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-aws-step-functions#deploy-time-parameters I've created a function that returns a function that gets the named env var or a default value.
```python
import os

from metaflow import Parameter


def env_var_parameter(default: str = None):
    # Returns a deploy-time function: Metaflow calls it when the workflow
    # template is created, so the value is read from the deploying
    # environment rather than at run time.
    def get_env_var(context):
        return os.getenv(context.parameter_name, default)

    return get_env_var


# Inside the FlowSpec class:
doc_folder_root = Parameter(
    "doc-folder-root",
    help="The docs folder to write the report to",
    default=env_var_parameter("UID_HERE"),
)
```
Then I modified our Metaflow launch script to create the environment variables if it's a deploy and they are provided.
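For reference, a hedged sketch of what such a launch script could look like; the file name, env var names, and the helper are hypothetical rather than the actual script from this thread. It relies on deploy-time parameter functions being evaluated while `argo-workflows create` runs, so the variables only need to exist in the environment of the deploy command:

```python
import os
import subprocess
import sys


def deploy(flow_file, env_overrides):
    # Hypothetical launch-script helper: only set the variables that were
    # actually provided, so parameters whose env var is absent fall back to
    # the default baked into env_var_parameter().
    env = os.environ.copy()
    env.update({name: value for name, value in env_overrides.items() if value})

    # Deploy-time parameter functions run while the Argo workflow template is
    # being created, so the values are captured at this point.
    subprocess.run(
        [sys.executable, flow_file, "argo-workflows", "create"],
        env=env,
        check=True,
    )


if __name__ == "__main__":
    # Example: bake a folder ID into the template at deploy time.
    deploy("report_flow.py", {"doc-folder-root": os.getenv("DOC_FOLDER_ROOT")})
```

Parameters can still be overridden when a flow is triggered manually, so the values baked in at deploy time only matter when nothing else supplies one.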