elegant-beach-10818
07/25/2023, 4:47 PM
I heavily instrumented my fork of metaflow and tracked down the cause of the delay. The code is getting stuck in `tail_logs` here: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes.py#L357C9-L357C9
Digging deeper, the reason is that `has_log_updates` (which is calling `self._job.is_running`) continues to be true after the task has finished.
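The control flow here can be sketched roughly like this (a simplified paraphrase with made-up helper names, not the actual Metaflow source): the tailing loop only exits once `has_log_updates()` turns false, so a stale "running" signal keeps it spinning long after the task is done.

```python
import time

def tail_logs(has_log_updates, copy_log_chunk, poll_interval=0.1):
    # Keep copying log chunks while the job still claims to be running;
    # the loop cannot exit until has_log_updates() flips to False.
    while has_log_updates():
        copy_log_chunk()
        time.sleep(poll_interval)

# Simulate a job whose "running" status stays stale for 3 extra polls
# after it actually finished -- the loop spins for all 3 of them.
lines = []
remaining = [3]

def fake_has_updates():
    remaining[0] -= 1
    return remaining[0] >= 0

tail_logs(fake_has_updates, lambda: lines.append("chunk"), poll_interval=0)
```

This is why a delayed container-status update translates directly into a delayed task completion: every extra second the status reads "running" is another iteration of this loop.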
Digging deeper, both `is_done` and `_are_pod_containers_done` continue to be false (once either one is true, the job will finish).
Digging deeper, the fields checked by `is_done` never seem to indicate that the job is done. So the cause is with `_are_pod_containers_done`.
Checking `_are_pod_containers_done`, the cause is that the container status continues to be "running" long after the function has printed its last statement: https://github.com/Netflix/metaflow/blob/ec8fd6c4a161b13a0f3765efb1e7fea92cdc2baa/metaflow/plugins/kubernetes/kubernetes_job.py#L598
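For illustration, the shape of such a "containers done" check (my own sketch written against plain dicts mimicking the Kubernetes V1Pod status, not Metaflow's actual implementation) looks like:

```python
# Hypothetical predicate over a pod status dict shaped like the
# Kubernetes V1Pod status; function and variable names are mine.

def all_containers_terminated(pod_status):
    """True only when every container status carries a 'terminated' state."""
    statuses = pod_status.get("containerStatuses") or []
    if not statuses:
        return False  # kubelet hasn't reported any status yet
    return all("terminated" in (s.get("state") or {}) for s in statuses)

# While kubelet still reports "running" -- even after the process has
# printed its last line -- the check stays False and the tail loop keeps
# polling; only the "terminated" state flips it to True.
running = {"containerStatuses": [{"state": {"running": {}}}]}
done = {"containerStatuses": [{"state": {"terminated": {"exitCode": 0}}}]}
```

The key point is that this predicate is entirely at the mercy of when the kubelet updates the container status, which is what makes the delay below hard to fix from inside Metaflow.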
The container status seems to be set outside of Metaflow, so I'm not sure what can be done to reduce the time from the job completing to when that status gets set to terminated. Usually it gets set within 1-2 seconds, but in my testing I had multiple instances where it took much longer.

Here's his fork, which is based on the latest release with a few added debugging print statements: https://github.com/chadaeschliman/metaflow/tree/instrument_delays

I've also recreated the issue a few minutes ago. For this latest particular job (gist here), the job is reported in the UI as completed at 07-25-2023 16:33:26. In the standard out, the "Task finished with exit code 0" printout appears at 16:39:17. In Kubernetes, that same pod says it completed at "Finished: Tue, 25 Jul 2023 11:33:28 -0500". For whatever reason there's a ~6 min delay from when the pod finishes in Kubernetes to when that task is reported as complete to Metaflow.
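The ~6 min figure can be checked directly from the two timestamps above (assuming the stdout printout is on the same UTC clock as the UI, which lines up with the -0500 pod time):

```python
from datetime import datetime, timezone

# Pod's own completion time, as reported by Kubernetes (UTC-5 local time).
pod_finished = datetime.strptime(
    "Tue, 25 Jul 2023 11:33:28 -0500", "%a, %d %b %Y %H:%M:%S %z"
)

# "Task finished with exit code 0" printout, assumed to be UTC like the UI.
task_reported = datetime(2023, 7, 25, 16, 39, 17, tzinfo=timezone.utc)

delay = task_reported - pod_finished
print(delay)  # 0:05:49 -- roughly the ~6 min gap described above
```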