Outerbounds #dev-metaflow

user

08/09/2022, 4:11 AM

Hi dev team, I have a question regarding the Run class's 'finished' property. Is it supposed to be 'True' when a run fails prior to the end step? In my testing, it remained 'False' after I ran a flow that failed in an intermediate step. I brought up the issue on GitHub and went into more detail there. I thought here might also be a good place to discuss. My apologies if discussion should be left to the GitHub issue. Thank you for your time!

✅ 1

ancient-application-36103

08/09/2022, 4:12 AM

Hey Chris! Thanks for creating the issue! This (or GH) is indeed the right place to discuss this issue.

ancient-application-36103

08/09/2022, 4:13 AM

Let me quickly read through the issue.

ancient-application-36103

08/09/2022, 4:17 AM

As you pointed out in the issue - determining that a run has indeed failed can often be difficult to assess without any heuristic. We should definitely fix the doc string for `finished` to reflect this issue.

ancient-application-36103

08/09/2022, 4:20 AM

`finished` as currently defined in the API docs -

Indicates whether or not the run completed.
A run completed if its 'end' step completed.

may not be what you need since this definition of `finished` is not equivalent to `not running`
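A minimal sketch of why this matters, assuming only the `finished` property quoted above. The `SimpleNamespace` objects are hypothetical stand-ins for `metaflow.Run` objects so that the snippet runs without a Metaflow datastore, and `describe_run` is an illustrative helper, not part of Metaflow.

```python
from types import SimpleNamespace


def describe_run(run):
    """Return a label based only on the run's `finished` property."""
    if run.finished:
        return "finished"
    # False covers two situations that look identical from here:
    # the run is still in progress, or it failed before reaching 'end'.
    return "not finished (running, or failed before 'end')"


# Hypothetical stand-ins for metaflow.Run objects.
completed = SimpleNamespace(finished=True)
failed_midway = SimpleNamespace(finished=False)
still_running = SimpleNamespace(finished=False)

print(describe_run(completed))
print(describe_run(failed_midway) == describe_run(still_running))
```

The last line shows that a failed run and a running run cannot be told apart with `finished` alone.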

ancient-application-36103

08/09/2022, 4:25 AM

We can create utility methods (that can live even outside of Metaflow) to trace the status of workflows that are executed through Airflow, Argo Workflows & Step Functions. We have been looking into introducing a `running`-like property that relies on heartbeats emitted at the task-level and run-level to deduce the liveness of any task/run.
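A rough sketch of how liveness could be deduced from a heartbeat timestamp. Everything here is an assumption made for illustration: the thread does not describe how Metaflow stores heartbeats or how often they are emitted, and `is_alive` and `max_silence_seconds` are hypothetical names.

```python
import time


def is_alive(last_heartbeat_ts, now=None, max_silence_seconds=60.0):
    """Treat a task or run as alive if its last heartbeat is recent enough.

    last_heartbeat_ts and now are Unix timestamps in seconds.
    max_silence_seconds is an arbitrary threshold chosen for this sketch.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat_ts) <= max_silence_seconds


# A heartbeat seen 30 seconds ago still counts as alive; one seen 100 seconds ago does not.
print(is_alive(last_heartbeat_ts=100.0, now=130.0))
print(is_alive(last_heartbeat_ts=100.0, now=200.0))
```

With a check like this, a task that has just died still looks alive until the threshold elapses, so the answer can lag behind reality.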

user

08/09/2022, 4:27 AM

For context, I am trying to make a function that takes in information to identify a flow, and then triggers a run. One way to implement would be for the function to trigger the run, then return a Run object so that the user can monitor the flow.

ancient-application-36103

08/09/2022, 4:28 AM

What capabilities would you like to provide the user to monitor the execution?

user

08/09/2022, 4:29 AM

Chris is working on flow-triggering-flow support in the Argo plugin.

user

08/09/2022, 4:29 AM

I built a kind of hacky version that determines a failed flow to be finished by iterating through each Task of each Step, looking for any Task with an exception. This works for my simple FailureFlow, but I am concerned that it would have trouble handling steps with retries, especially if it is difficult to know how many retries were specified in the decorator.
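The approach described above can be sketched roughly as follows. It relies on the Metaflow Client convention that iterating a Run yields Steps, iterating a Step yields Tasks, and a Task exposes an `exception` property. `FakeNode` and the sample data are hypothetical stand-ins so that the snippet runs without a datastore, and `tasks_with_exceptions` is an illustrative name, not the code from the PR.

```python
from types import SimpleNamespace


def tasks_with_exceptions(run):
    """Return (step id, task id, exception) for every task that recorded an exception."""
    failures = []
    for step in run:        # iterating a Run yields its Steps
        for task in step:   # iterating a Step yields its Tasks
            if task.exception is not None:
                failures.append((step.id, task.id, task.exception))
    return failures


class FakeNode(list):
    """Hypothetical stand-in for a Run or Step: iterable, with an `id`."""

    def __init__(self, id, children=()):
        super().__init__(children)
        self.id = id


run = FakeNode("42", [
    FakeNode("start", [SimpleNamespace(id="1", exception=None)]),
    FakeNode("train", [SimpleNamespace(id="2", exception="ValueError('boom')")]),
])

print(tasks_with_exceptions(run))
```

As noted in the message, an exception on a task does not by itself prove the run is over, because the step may still have retries left.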

user

08/09/2022, 4:30 AM

Mostly, it would be nice if the user could tell whether a flow was:
• successful
• failed
• running
Chris is also working on the Argo Workflows plugin specifically.

❤️ 2

👍 1

user

08/09/2022, 4:30 AM

I would like the user to be able to check the status of the flow. This would be helpful so that I can add an optional "wait" feature, which would wait to return from the function until the flow is done running.

user

08/09/2022, 4:31 AM

Also, a user might want to run flows in parallel, and it would be necessary to know when flows finish so the user can move on to any downstream flows.

ancient-application-36103

08/09/2022, 4:32 AM

`task['_graph_info'].data` includes all the information about various decorators - that should allow you to impute the retries specified by the user for a step.
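A hedged sketch of reading a retry count out of that structure. The dictionary layout used here (`steps` → step name → `decorators` → `name` / `attributes`) is an assumption about what `_graph_info` contains and should be checked against a real run; `retries_for_step` and the sample data are hypothetical.

```python
def retries_for_step(graph_info, step_name):
    """Return the `times` attribute of a retry decorator on the step, or 0 if none is found.

    Assumes graph_info looks like:
    {"steps": {"<step>": {"decorators": [{"name": ..., "attributes": {...}}]}}}
    """
    step = graph_info.get("steps", {}).get(step_name, {})
    for deco in step.get("decorators", []):
        if deco.get("name") == "retry":
            # Attribute values may be stored as strings, so normalise to int.
            # Falling back to 0 when `times` is absent is a choice made for this sketch.
            return int(deco.get("attributes", {}).get("times", 0))
    return 0


# Hypothetical sample shaped like the assumed layout.
sample_graph_info = {
    "steps": {
        "start": {"decorators": [{"name": "retry", "attributes": {"times": "3"}}]},
        "end": {"decorators": []},
    }
}

print(retries_for_step(sample_graph_info, "start"))
print(retries_for_step(sample_graph_info, "end"))
```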

ancient-application-36103

08/09/2022, 4:35 AM

Regarding the wait feature - what happens after the wait returns?

ancient-application-36103

08/09/2022, 4:36 AM

If you are specifically looking at the argo plugin, then you can rely on the status of workflow returned by Argo - that would be simplest.
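A small sketch of that idea, assuming the workflow is available as the manifest dictionary returned by the Kubernetes API, where Argo Workflows reports a phase such as `Pending`, `Running`, `Succeeded`, `Failed` or `Error` under `status.phase`. The function name and the three output labels are hypothetical, chosen to match the successful / failed / running states asked for earlier in the thread.

```python
def classify_argo_phase(workflow):
    """Map an Argo Workflow manifest (a dict) to 'successful', 'failed' or 'running'."""
    phase = workflow.get("status", {}).get("phase", "")
    if phase == "Succeeded":
        return "successful"
    if phase in ("Failed", "Error"):
        return "failed"
    # Pending, Running, or no phase reported yet are all treated as
    # still in progress; that grouping is an assumption of this sketch.
    return "running"


print(classify_argo_phase({"status": {"phase": "Succeeded"}}))
print(classify_argo_phase({"status": {"phase": "Error"}}))
print(classify_argo_phase({}))
```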

user

08/09/2022, 4:36 AM

Regardless of whether wait is turned on, it should return a Run object (or, alternatively, a new LiveRun object that lives in Metaflow, or an ArgoRun object that lives in the Argo plugin)

user

08/09/2022, 4:37 AM

Yeah, I made an implementation that works using the ArgoClient class. I think the interest in using Metaflow's run class is that it better abstracts the functionality from users

user

08/09/2022, 4:37 AM

Wait is like a programmatic way of monitoring a run. Think of it as an orchestrator that tells you how your flow ran. It keeps checking flow status until it is successful or failed. Then you can run other code following wait.
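The wait behaviour described here can be sketched as a polling loop. This is not the code from the PR: `wait_for_completion`, its parameters, and the status strings are hypothetical, and `get_status` stands for any callable that reports the current state (for example one backed by the Argo workflow status).

```python
import time


def wait_for_completion(get_status, poll_seconds=5.0, timeout_seconds=None,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'successful' or 'failed'.

    Raises TimeoutError if timeout_seconds elapses first.
    sleep and clock are injectable so the loop can be exercised without real delays.
    """
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("successful", "failed"):
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("flow did not reach a terminal state in time")
        sleep(poll_seconds)


# Simulated status source: two polls report 'running', the third reports success.
statuses = iter(["running", "running", "successful"])
result = wait_for_completion(lambda: next(statuses), poll_seconds=0)
print(result)
```

Code that should run after the flow completes can then branch on the returned status.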

👍 1

user

08/09/2022, 4:37 AM

But maybe Metaflow just isn't the place for this.

user

08/09/2022, 4:37 AM

Signing off for the night. Thanks for your help! I'll be back on tomorrow.

👍 1

ancient-application-36103

08/09/2022, 4:39 AM

How about if there was a CLI command `python flow.py argo-workflows status --run-id foo`

user

08/09/2022, 4:39 AM

Wait in argo workflow plugin is like a less verbose version of argo watch

👍 1

user

08/09/2022, 4:40 AM

Actually if you want more context here’s Chris’s code: https://github.com/zillow/metaflow/pull/185/files

👀 1

user

08/09/2022, 4:41 AM

We’d prefer not going through the CLI. We may not have the flow spec file locally. We expect that upstream and downstream flows may be published by different teams.

user

08/09/2022, 4:42 AM

Say the downstream flow is a general model-metrics flow. We can trigger it using its Argo workflow template name. So one use case for the upstream flow can be:
• Run some modeling
• Trigger the downstream metrics flow with parameters pointing to the model path
• Check the metrics result and run some other code

ancient-application-36103

08/09/2022, 4:43 AM

And an event based approach (using argo events) wouldn't work?

user

08/09/2022, 4:45 AM

It’s mostly for backward compatibility, as we already support this push-style triggering for our internal teams. It also provides some more flexibility, as it’s a dynamic trigger that can be run multiple times per flow, or even used outside Metaflow flow code to trigger a flow.

ancient-application-36103

08/09/2022, 4:46 AM

Makes sense. Let me go through the PR and collect some context before suggesting anything concrete 🙂

thanks ty 1

user

08/09/2022, 4:50 AM

btw `get_workflow` used in the PR above was added to argo_client in our branch: https://github.com/zillow/metaflow/blob/zillow/argo/metaflow/plugins/argo/argo_client.py#L60-L74

user

08/09/2022, 4:51 AM

Also signing off for today. Thanks for looking!

ancient-application-36103

08/09/2022, 4:52 AM

Sure - let's pick it up tomorrow

👍 1

user

08/09/2022, 9:02 PM

@ancient-application-36103 What are your thoughts on the proposed solutions to my issue? I'm interested to know how open you and Outerbounds are to adding to the main branch of Metaflow in the ways described below (copied from the github issue I created, linked at beginning of thread).

1. Change the ‘finished’ property for Run (and potentially for Step, too) so that it returns ‘True’ in the case of a failed run that will not retry anymore.
   • This could be problematic because it changes code users depend on.
2. Add additional ‘running’ or ‘active’ property for Run (and Step?) that returns a boolean based on whether the Run is in progress or not.
   • This could be difficult because Metaflow may not currently have access to enough information to determine the correct response.
   • A pro is that it won’t affect functionality users depend on.
3. Don’t change Run class and use existing functionality at plugin level.
   • For example, the Argo plugin has an ArgoClient object that can provide users with up-to-date information on run status.
4. Add a LiveRun class to Metaflow that takes in the name of an under-layer (eg, “Argo” or “Kubeflow Pipelines”) and a dictionary with info necessary to trigger a run, and returns an object from the under-layer’s plugin.
   • This LiveRun object would have access to information on the run that is current, such as run status. This would help users know whether a run has failed or is still running.
   • This would only work for certain under-layers that have plugins that support this functionality.
   • A benefit to this model is that users would not have to get as familiar with what’s going on behind the scenes to access the information.
5. Same as 4, except that this functionality is added to the Metaflow Run class.

ancient-application-36103

08/09/2022, 9:32 PM

@User Given our strong promise of backwards compatibility, we wouldn't want to change the behavior of `finished` (also, it works as documented today). A `running` or `liveness` property can be added which tracks the heart beats (but the returned value will only be eventually consistent which wouldn't address your use case). The Metaflow Client right now has no notion/visibility into plugins and the logic to determine whether an Argo Workflow is currently executing or not has no dependency on the actual Metaflow code. At this point, we don't have a programmatic API to Metaflow commands implemented, but we can definitely introduce `python flow.py argo-workflows status --run-id foo` - but that wouldn't address your use case either. You should be able to introduce a `LiveRun` class within `metaflow-extensions` directly and we can introduce a similar capability when we ship support for programmatic API for Metaflow commands.

Programmatic API for Managing Metaflow Commands

thanks ty 1

user

08/09/2022, 9:42 PM

Thanks, @ancient-application-36103. To clarify, when you say

"A `running` or `liveness` property can be added which tracks the heart beats (but the returned value will only be eventually consistent which wouldn't address your use case)."

do you mean that the heart beat information is not yet available programmatically? My understanding is yes, that heart beats are used by the Metaflow UI, but are not yet able to help determine run status programmatically.

👍 1

ancient-application-36103

08/10/2022, 11:33 PM

The heart beat information is recorded in the database, but is not currently exposed directly through the metaflow service. The Metaflow UI relies on these heart beats to track run/task liveness.

👍 1

user

08/11/2022, 12:13 AM

Thanks, @ancient-application-36103!
