# ask-metaflow
@User Some more research was done by @aloof-butcher-73366; here's what he found:
I heavily instrumented my fork of metaflow and tracked down the cause of the delay. The code is getting stuck in `tail_logs` here: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes.py#L357C9-L357C9
Digging deeper, the reason is that `has_log_updates` (which calls `self._job.is_running`) continues to be true after the task has finished.
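For context, here's a minimal sketch of the polling pattern he's describing (the names `has_log_updates` and `is_running` come from his message; the helper `read_new_log_lines` is hypothetical, and this is not the actual Metaflow source):

```python
import time


def tail_logs(job, print_line, poll_interval=1.0):
    """Keep streaming log lines while the job still reports 'running'.

    Illustrative only: mirrors the structure described above, where
    tail_logs loops for as long as has_log_updates() stays true.
    """

    def has_log_updates():
        # In the real plugin this delegates to self._job.is_running; if the
        # pod's container status never flips to terminated, this stays True
        # and the loop below never exits.
        return job.is_running

    while has_log_updates():
        for line in job.read_new_log_lines():  # hypothetical helper
            print_line(line)
        time.sleep(poll_interval)
```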
Digging deeper, both `is_done` and `_are_pod_containers_done` continue to be false (once either one is true, the job will finish).
Digging deeper, the fields checked by `is_done` never seem to indicate that the job is done. So the cause is with `_are_pod_containers_done`.
Checking `_are_pod_containers_done`, the cause is that the container status continues to be “running” long after it has printed the last statement of the function: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes_job.py#L598
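If anyone wants to reproduce that container-status observation outside of Metaflow, this is roughly the kind of check `_are_pod_containers_done` is doing, written directly against the official `kubernetes` Python client (pod name and namespace below are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="my-metaflow-pod", namespace="default")  # placeholders

# The containers count as "done" only once every container status reports a
# terminated state; while state.running is still set, the check described
# above keeps returning False.
statuses = pod.status.container_statuses or []
all_done = bool(statuses) and all(s.state.terminated is not None for s in statuses)
print(all_done, [(s.name, s.state) for s in statuses])
```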
The container status seems to be set outside of metaflow, so I'm not sure what can be done to reduce the time from when the job completes to when that status gets set to terminated. Usually it gets set within 1-2 seconds, but in my testing I had multiple instances where it took much longer.
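One way to measure that lag independently of Metaflow is to poll the pod and time how long it takes for the container state to flip to terminated (again just a sketch; pod name and namespace are placeholders):

```python
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

POD, NS = "my-metaflow-pod", "default"  # placeholders

start = time.time()
while True:
    pod = v1.read_namespaced_pod(name=POD, namespace=NS)
    states = [s.state for s in (pod.status.container_statuses or [])]
    if states and all(st.terminated is not None for st in states):
        # finished_at is the exit time the kubelet recorded; the gap between
        # that timestamp and the moment this poll first observes 'terminated'
        # is the lag described above.
        finished_at = max(st.terminated.finished_at for st in states)
        print(f"terminated observed after {time.time() - start:.1f}s "
              f"(finished_at={finished_at})")
        break
    time.sleep(1)
```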
Here's his fork, which is based on the latest release with a few added debugging print statements: https://github.com/chadaeschliman/metaflow/tree/instrument_delays
I also recreated the issue a few minutes ago. For this latest particular job (gist here), the job is reported in the UI as completed at `07-25-2023 16:33:26`. In the standard out, the "Task finished with exit code 0" printout appears at `16:39:17`. In Kubernetes, that same pod says it completed at `Finished: Tue, 25 Jul 2023 11:33:28 -0500`. For whatever reason there's a ~6 min delay from when the pod finishes in Kubernetes to when that task is reported as complete to Metaflow.