# dev-metaflow
u
Hi dev team, I have a question regarding the Run class's 'finished' property. Is it supposed to be 'True' when a run fails prior to the end step? In my testing, it remained 'False' after I ran a flow that failed in an intermediate step. I brought up the issue on GitHub, where I go into more detail. I thought this might also be a good place to discuss it. My apologies if the discussion should be left to the GitHub issue. Thank you for your time!
1
a
Hey Chris! Thanks for creating the issue! This (or GH) is indeed the right place to discuss this issue.
Let me quickly read through the issue.
As you pointed out in the issue, it is often difficult to determine that a run has failed without some heuristic. We should definitely fix the docstring for `finished` to reflect this.
`finished`, as currently defined in the API docs:
Indicates whether or not the run completed.
A run completed if its 'end' step completed.
may not be what you need, since this definition of `finished` is not equivalent to `not running`.
We can create utility methods (which can live even outside of Metaflow) to trace the status of workflows that are executed through Airflow, Argo Workflows, and Step Functions. We have been looking into introducing a `running`-like property that relies on heartbeats emitted at the task level and run level to deduce the liveness of any task or run.
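To illustrate why `finished` is not the same as "not running", here is a minimal sketch. It assumes only that a run object exposes a boolean `finished` attribute, as the quoted docstring describes; the `describe_run` helper and the stand-in objects are hypothetical and not part of Metaflow.

```python
from types import SimpleNamespace


def describe_run(run):
    """Summarize what `finished` alone can tell us about a run.

    `run` is any object with a boolean `finished` attribute
    (for example, a Metaflow client Run).
    """
    if run.finished:
        return "completed"
    # The 'end' step has not completed. The run may still be executing,
    # or it may have failed in an earlier step; `finished` cannot say which.
    return "failed or still running"


# Hypothetical stand-ins used here instead of real Metaflow runs.
print(describe_run(SimpleNamespace(finished=True)))
print(describe_run(SimpleNamespace(finished=False)))
```

The second case is the ambiguity discussed in this thread: a run that failed mid-flow and a run that is still in progress look identical through `finished`.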
u
For context, I am trying to write a function that takes in the information needed to identify a flow and then triggers a run. One way to implement this would be for the function to trigger the run and then return a Run object so that the user can monitor the flow.
a
What capabilities would you like to provide the user to monitor the execution?
u
Chris is working on flow-triggering-flow support in the Argo plugin.
u
I built a kind of hacky version that determines a failed flow to be finished by iterating through each Task of each Step, looking for any Task with an exception. This works for my simple FailureFlow, but I am concerned that it would have trouble handling steps with retries, especially if it is difficult to know how many retries were specified in the decorator.
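The heuristic described above, extended to account for retries, might look roughly like the sketch below. It is duck-typed rather than tied to Metaflow: it assumes a run iterates over steps, each step has an `id` and iterates over tasks, and each task has an `exception` (None when nothing was raised) and a 0-based `current_attempt`. The `run_has_failed` helper, the `FakeStep` class, and the `max_retries_by_step` mapping are hypothetical names for illustration.

```python
from types import SimpleNamespace


def run_has_failed(run, max_retries_by_step):
    """Return True if any task raised on its last allowed attempt."""
    for step in run:
        # Steps without an entry are assumed to have no retries.
        allowed_retries = max_retries_by_step.get(step.id, 0)
        for task in step:
            raised = task.exception is not None
            out_of_retries = task.current_attempt >= allowed_retries
            if raised and out_of_retries:
                return True
    return False


class FakeStep(list):
    """Hypothetical stand-in: a list of tasks that also carries a step id."""

    def __init__(self, step_id, tasks):
        super().__init__(tasks)
        self.id = step_id


failing_task = SimpleNamespace(exception="ValueError", current_attempt=0)
run = [FakeStep("train", [failing_task])]

print(run_has_failed(run, {}))            # no retries configured
print(run_has_failed(run, {"train": 2}))  # two retries still available
```

A task that raised but still has retries left is treated as "not failed yet", which is the case the simple exception scan cannot distinguish.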
u
Mostly, it would be nice if the user could tell whether a flow was:
• successful
• failed
• running
Chris is also working on the Argo Workflows plugin specifically.
❤️ 2
👍 1
u
I would like the user to be able to check the status of the flow. This would be helpful so that I can add an optional "wait" feature, which would wait to return from the function until the flow is done running.
u
Also, a user might want to run flows in parallel, and it would be necessary to know when flows finish so the user can move on to any downstream flows.
a
`task['_graph_info'].data` includes all the information about the various decorators, which should allow you to determine the retries specified by the user for a step.
Regarding the wait feature - what happens after the wait returns?
If you are specifically looking at the argo plugin, then you can rely on the status of workflow returned by Argo - that would be simplest.
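As a rough sketch of reading retry counts out of that graph information: the dictionary layout below is an assumption, not something confirmed in this thread. It assumes a `steps` mapping whose entries carry a `decorators` list of dicts with `name` and `attributes` keys, and that the retry decorator stores its count under `times`. Check the actual contents of `task['_graph_info'].data` before relying on it. The `retries_by_step` helper is hypothetical.

```python
def retries_by_step(graph_info):
    """Map each step name to the retry count declared on it (0 if none).

    ASSUMED layout: graph_info["steps"][name]["decorators"] is a list of
    {"name": ..., "attributes": {...}} dicts.
    """
    retries = {}
    for step_name, step_info in graph_info.get("steps", {}).items():
        retries[step_name] = 0
        for deco in step_info.get("decorators", []):
            if deco.get("name") == "retry":
                attributes = deco.get("attributes", {})
                # Values may be stored as strings, so convert explicitly.
                retries[step_name] = int(attributes.get("times", 0))
    return retries


# Hand-written sample in the assumed layout, not real Metaflow output.
sample_graph_info = {
    "steps": {
        "start": {"decorators": []},
        "train": {"decorators": [{"name": "retry", "attributes": {"times": "2"}}]},
    }
}

print(retries_by_step(sample_graph_info))
```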
u
Regardless of whether wait is turned on, it should return a Run object (or, alternatively, a new LiveRun object that lives in Metaflow, or an ArgoRun object that lives in the Argo plugin)
u
Yeah, I made an implementation that works using the ArgoClient class. I think the interest in using Metaflow's run class is that it better abstracts the functionality from users
u
Wait is like a programmatic way of monitoring a run. Think of it as an orchestrator that tells you how your flow ran. It keeps checking flow status until it is successful or failed. Then you can run other code following wait.
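A minimal sketch of such a wait loop is shown below. It is not the implementation from the plugin: `wait_for_flow`, the status strings, and the `get_status` callable are hypothetical, and in practice `get_status` would wrap whatever the orchestrator reports (for example, the workflow status returned by Argo).

```python
import time

TERMINAL_STATES = {"successful", "failed"}


def wait_for_flow(get_status, poll_seconds=5.0, timeout_seconds=None):
    """Poll `get_status()` until it returns a terminal state, then return it."""
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"flow still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Canned statuses standing in for a real orchestrator query.
canned = iter(["running", "running", "successful"])
result = wait_for_flow(lambda: next(canned), poll_seconds=0)
print(result)
```

Passing the status source as a callable keeps the loop independent of any one backend, so the same wait logic could sit on top of different orchestrators.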
👍 1
u
But maybe Metaflow just isn't the place for this.
u
Signing off for the night. Thanks for your help! I'll be back on tomorrow.
👍 1
a
How about if there was a CLI command `python flow.py argo-workflows status --run-id foo`?
u
Wait in the Argo Workflows plugin is like a less verbose version of `argo watch`.
👍 1
u
Actually if you want more context here’s Chris’s code: https://github.com/zillow/metaflow/pull/185/files
👀 1
u
We’d prefer not to go through the CLI. We may not have the flow spec file locally. We expect that upstream and downstream flows may be published by different teams.
u
Say the downstream flow is a general model metrics flow. We can trigger it using its Argo workflow template name. So one use case can be an upstream flow that:
• Runs some modeling
• Triggers the downstream metrics flow with parameters pointing to the model path
• Checks the metrics result and runs some other code
a
And an event based approach (using argo events) wouldn't work?
u
It’s mostly for backward compatibility, as we already support this push-style triggering for our internal teams. It also provides some more flexibility, as it’s a dynamic trigger that can be run multiple times per flow, or even used outside Metaflow flow code to trigger a flow.
a
Makes sense. Let me go through the PR and collect some context before suggesting anything concrete 🙂
thanks ty 1
u
btw, `get_workflow` used in the PR above was added to argo_client in our branch: https://github.com/zillow/metaflow/blob/zillow/argo/metaflow/plugins/argo/argo_client.py#L60-L74
u
Also signing off for today. Thanks for looking!
a
Sure - let's pick it up tomorrow
👍 1
u
@ancient-application-36103 What are your thoughts on the proposed solutions to my issue? I'm interested to know how open you and Outerbounds are to adding to the main branch of Metaflow in the ways described below (copied from the GitHub issue I created, linked at the beginning of the thread).
1. Change the ‘finished’ property for Run (and potentially for Step, too) so that it returns ‘True’ in the case of a failed run that will not retry anymore.
   • This could be problematic because it changes code users depend on.
2. Add an additional ‘running’ or ‘active’ property for Run (and Step?) that returns a boolean based on whether the Run is in progress or not.
   • This could be difficult because Metaflow may not currently have access to enough information to determine the correct response.
   • A pro is that it won’t affect functionality users depend on.
3. Don’t change the Run class and use existing functionality at the plugin level.
   • For example, the Argo plugin has an ArgoClient object that can provide users with up-to-date information on run status.
4. Add a LiveRun class to Metaflow that takes in the name of an under-layer (e.g., “Argo” or “Kubeflow Pipelines”) and a dictionary with the info necessary to trigger a run, and returns an object from the under-layer’s plugin.
   • This LiveRun object would have access to current information on the run, such as run status. This would help users know whether a run has failed or is still running.
   • This would only work for certain under-layers that have plugins that support this functionality.
   • A benefit of this model is that users would not have to get as familiar with what’s going on behind the scenes to access the information.
5. Same as 4, except that this functionality is added to the Metaflow Run class.
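To make option 4 more concrete, here is one possible shape for it, as a sketch only. Nothing below exists in Metaflow: `LiveRun`, `register_backend`, `trigger_live_run`, and the `FakeLiveRun` backend are all hypothetical names, and a real backend would call into the under-layer's client instead of returning a fixed status.

```python
from abc import ABC, abstractmethod

_BACKENDS = {}


class LiveRun(ABC):
    """Hypothetical handle on a triggered run with up-to-date status."""

    def __init__(self, trigger_info):
        self.trigger_info = trigger_info

    @abstractmethod
    def status(self):
        """Return 'running', 'successful', or 'failed'."""


def register_backend(name):
    """Class decorator that registers a LiveRun subclass for an under-layer."""
    def decorator(cls):
        _BACKENDS[name.lower()] = cls
        return cls
    return decorator


def trigger_live_run(under_layer, trigger_info):
    """Look up the backend for `under_layer` and return its LiveRun object."""
    try:
        backend_cls = _BACKENDS[under_layer.lower()]
    except KeyError:
        raise ValueError(f"no LiveRun backend registered for {under_layer!r}") from None
    return backend_cls(trigger_info)


@register_backend("fake")
class FakeLiveRun(LiveRun):
    """Stand-in backend; a real one would query the orchestrator."""

    def status(self):
        return "running"


live_run = trigger_live_run("Fake", {"template": "model-metrics"})
print(live_run.status())
```

An under-layer without a registered backend raises a clear error, which matches the caveat that this would only work where a plugin supports it.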
a
@User Given our strong promise of backwards compatibility, we wouldn't want to change the behavior of `finished` (also, it works as documented today). A `running` or `liveness` property that tracks the heartbeats can be added, but the returned value will only be eventually consistent, which wouldn't address your use case. The Metaflow Client right now has no notion of or visibility into plugins, and the logic to determine whether an Argo Workflow is currently executing has no dependency on the actual Metaflow code. At this point, we don't have a programmatic API to Metaflow commands implemented. We can definitely introduce `python flow.py argo-workflows status --run-id foo`, but that wouldn't address your use case either. You should be able to introduce a `LiveRun` class within `metaflow-extensions` directly, and we can introduce a similar capability when we ship support for a programmatic API for Metaflow commands.
thanks ty 1
u
Thanks, @ancient-application-36103. To clarify, when you say "A `running` or `liveness` property can be added which tracks the heart beats (but the returned value will only be eventually consistent which wouldn't address your use case)", do you mean that the heartbeat information is not yet available programmatically? My understanding is that heartbeats are used by the Metaflow UI but cannot yet be used to determine run status programmatically.
👍 1
a
The heart beat information is recorded in the database, but is not currently exposed directly through the metaflow service. The Metaflow UI relies on these heart beats to track run/task liveness.
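For intuition on why a heartbeat-based liveness check is only eventually consistent, here is a minimal sketch. It does not reflect how the Metaflow UI or service computes liveness; the `is_alive` helper and the 60-second threshold are assumptions chosen for illustration.

```python
def is_alive(last_heartbeat, now, max_silence_seconds=60.0):
    """Treat a task or run as alive if its last heartbeat is recent enough.

    Timestamps are seconds on any shared clock (e.g. Unix time).
    """
    return (now - last_heartbeat) <= max_silence_seconds


# A task that crashed right after its heartbeat at t=100 still looks
# alive at t=130, and is only reported dead once the threshold passes.
print(is_alive(last_heartbeat=100.0, now=130.0))
print(is_alive(last_heartbeat=100.0, now=200.0))
```

The gap between the crash and the threshold expiring is the window in which a liveness property would still report a dead run as running.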
👍 1
u
Thanks, @ancient-application-36103!