# dev-metaflow
m
Hello everyone, I wanted to know if anyone can help me with two issues I'm facing. When I run pipelines with triggers on Argo Workflows on AWS EKS, the Metaflow UI shows the pipeline as failed every time it switches steps, and then, when the next step begins, it shows it as running again. This is quite bothersome because you never know whether the pipeline actually failed or is just changing steps. It doesn't happen when we submit pipelines directly to Kubernetes. Perhaps the following is related: with Argo Workflows we are using the `--no-auto-emit-argo-events` flag because otherwise it generates an error (the error is in the image). I would like to better understand how the Metaflow UI knows whether a pipeline has failed, whether using `--no-auto-emit-argo-events` is related to my problem, and whether anyone knows how to solve these issues. I would be very grateful. Regards.
d
The issue is knowing whether something is live or not, and that is not, in general, an easy problem. Metaflow relies on heartbeats to determine this. There are two levels of heartbeats: a run-level one and a task-level one. Unfortunately, with orchestrators there is no run-level heartbeat, because Metaflow doesn't have a "run head node" on which to run it, so it can only rely on task-level heartbeats. Those are only active, obviously, while a task is running. With orchestrators, tasks may take a while to be scheduled, and during that time there is no heartbeat, so Metaflow does not know whether the run is dead or still in progress (or rather, it can't tell the difference). For example, if a step sits in the scheduler queue for a few minutes and the UI's cutoff is shorter than that, the run will briefly show as failed until the next task's first heartbeat arrives. There is a setting for the UI that tells it how long to wait for a heartbeat before marking a run as dead, but unfortunately, right now, there isn't much else you can do to correct this issue. It's something we are aware of, but it is unclear how to solve it. I am not super familiar with the Argo implementation, but based on what I know of Metaflow and the UI in general, I doubt it is related to how Argo emits events.
f
Thanks for the explanation! Is it possible to configure that setting via environment variables?
d
yes, see here: https://github.com/Netflix/metaflow-service/blob/master/services/ui_backend_service/docs/environment.md#heartbeat-intervals. @bulky-afternoon-92433 can confirm, but I believe the one you are looking for is `RUN_INACTIVE_CUTOFF_TIME`, and `HEARTBEAT_THRESHOLD` as well.
b
Correct on both. For the specific case of the run-level status appearing failed while tasks are stuck in the scheduler, you only need to tune `RUN_INACTIVE_CUTOFF_TIME` (in seconds) to a rough estimate of how long tasks can take to start. The default for this is 6 minutes in recent releases, but a while ago it was still quite optimistically set to 1 minute.
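To make the tuning concrete, here is a minimal sketch of how a cutoff like this can be interpreted. It is not the actual metaflow-service code; the function name is hypothetical, and the 360-second fallback simply mirrors the 6-minute default mentioned above.

```python
import os
import time

# Sketch only (not the real metaflow-service logic): RUN_INACTIVE_CUTOFF_TIME
# is read in seconds; 360 s (6 minutes) mirrors the recent default noted above.
RUN_INACTIVE_CUTOFF_TIME = int(os.environ.get("RUN_INACTIVE_CUTOFF_TIME", 360))

def run_should_be_marked_failed(last_task_heartbeat: float, run_finished: bool) -> bool:
    """Treat a still-unfinished run as failed only once no task has sent a
    heartbeat for longer than the cutoff. Tasks waiting in the scheduler
    queue produce no heartbeats, so the cutoff has to cover that gap."""
    if run_finished:
        return False
    return (time.time() - last_task_heartbeat) > RUN_INACTIVE_CUTOFF_TIME
```

So if your steps can sit in the Argo Workflows queue for, say, ten minutes, setting `RUN_INACTIVE_CUTOFF_TIME` above 600 should stop the UI from flapping between failed and running while the next step is being scheduled.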
The rough architecture for heartbeats is as follows: • metaflow client runtime starts a task ◦ it also runs a sidecar process that periodically does a heartbeat
POST
to the metadata service for the running task • metadata service updates heartbeat for task, and the associated run for the task • once the metadata service updates a task record in the database with a new heartbeat, a postgres table trigger sends a message for this to any listeners • the MetaflowUI backend is listening to these updates, and is running its own 'keepalive' process for running runs/tasks, which are refreshed based on received heartbeats ◦ if a keepalive fails, the real status will be inferred and broadcast to any open UI websocket that has subscribed to the affected resource As at the moment the only source of info for run level freshness comes from executing tasks, this is there reason we need a threshold for run level failures, instead of being able to provide completely accurate fail statuses
m
Hey guys, thanks for your help! Changing those variables fixed the problem I was having, thank you!