# ask-metaflow
@User Some more research was done by @aloof-butcher-73366; here's what he found:
I heavily instrumented my fork of metaflow and tracked down the cause of the delay. The code is getting stuck in `tail_logs` here: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes.py#L357C9-L357C9
Digging deeper, the reason is that `has_log_updates` (which calls `self._job.is_running`) continues to be true after the task has finished.
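For context, here's a minimal sketch of the polling pattern he's describing (the names `has_log_updates` and `is_running` come from his message; the helper `read_new_log_lines` is hypothetical, and this is not the actual Metaflow source):

```python
import time


def tail_logs(job, print_line, poll_interval=1.0):
    """Keep streaming log lines while the job still reports 'running'.

    Illustrative only: mirrors the structure described above, where
    tail_logs loops for as long as has_log_updates() stays true.
    """

    def has_log_updates():
        # In the real plugin this delegates to self._job.is_running; if the
        # pod's container status never flips to terminated, this stays True
        # and the loop below never exits.
        return job.is_running

    while has_log_updates():
        for line in job.read_new_log_lines():  # hypothetical helper
            print_line(line)
        time.sleep(poll_interval)
```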
Digging deeper, both `is_done` and `_are_pod_containers_done` continue to be false (once either one is true, the job will finish).
Digging deeper, the fields checked by `is_done` never seem to indicate that the job is done. So the cause is with `_are_pod_containers_done`.
Checking `_are_pod_containers_done`, the cause is that the container status continues to be “running” long after it has printed the last statement of the function: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes_job.py#L598
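If anyone wants to reproduce that container-status observation outside of Metaflow, this is roughly the kind of check `_are_pod_containers_done` is doing, written directly against the official `kubernetes` Python client (pod name and namespace below are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="my-metaflow-pod", namespace="default")  # placeholders

# The containers count as "done" only once every container status reports a
# terminated state; while state.running is still set, the check described
# above keeps returning False.
statuses = pod.status.container_statuses or []
all_done = bool(statuses) and all(s.state.terminated is not None for s in statuses)
print(all_done, [(s.name, s.state) for s in statuses])
```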
The container status seems to be set outside of metaflow, so I'm not sure what can be done to reduce the time from when the job completes to when that status gets set to terminated. Usually it gets set within 1-2 seconds, but in my testing I had multiple instances where it took much longer.
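One way to measure that lag independently of Metaflow is to poll the pod and time how long it takes for the container state to flip to terminated (again just a sketch; pod name and namespace are placeholders):

```python
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

POD, NS = "my-metaflow-pod", "default"  # placeholders

start = time.time()
while True:
    pod = v1.read_namespaced_pod(name=POD, namespace=NS)
    states = [s.state for s in (pod.status.container_statuses or [])]
    if states and all(st.terminated is not None for st in states):
        # finished_at is the exit time the kubelet recorded; the gap between
        # that timestamp and the moment this poll first observes 'terminated'
        # is the lag described above.
        finished_at = max(st.terminated.finished_at for st in states)
        print(f"terminated observed after {time.time() - start:.1f}s "
              f"(finished_at={finished_at})")
        break
    time.sleep(1)
```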
Here's his fork, which is based on the latest release with a few added debugging print statements: https://github.com/chadaeschliman/metaflow/tree/instrument_delays
I also recreated the issue a few minutes ago. For this latest particular job (gist here), the job is reported in the UI as completed at `07-25-2023 16:33:26`. In the standard out, the "Task finished with exit code 0" printout appears at `16:39:17`. In Kubernetes, that same pod says it completed at `Finished: Tue, 25 Jul 2023 11:33:28 -0500`. For whatever reason there's a ~6 min delay from when the pod finishes in Kubernetes to when that task is reported as complete to Metaflow.