image.png

We are seeing intermittent, excessively long delays between steps in a flow. They seem to be caused at least partly by odd behavior around task completion times on Kubernetes.

For example, the last print statement from the Python file is a memory printout at `16:07:56`. One second later, at `16:07:57`, the task is marked as finished. Then, about 4 minutes later at `16:11:59`, we get the standard output `Task finished with exit code 0.`
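To make the gap concrete, here is a minimal sketch that computes the two delays from the three timestamps above. The variable names are only illustrative, and it assumes all three timestamps fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the task log described above.
last_print = datetime.strptime("16:07:56", FMT)
marked_finished = datetime.strptime("16:07:57", FMT)
exit_message = datetime.strptime("16:11:59", FMT)

# Delay between the last user print and the task being marked finished.
finish_delay = (marked_finished - last_print).total_seconds()

# Delay between the task being marked finished and the exit-code message.
exit_delay = (exit_message - marked_finished).total_seconds()

print(f"last print -> marked finished: {finish_delay:.0f}s")
print(f"marked finished -> exit message: {exit_delay:.0f}s")
```

The second delay is the one that looks suspicious to us, since the user code has already completed by then.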

In flows where parallelism is limited, this causes the entire flow to hang until the stuck tasks are cleared out. Attached is the effect on the task graph.

Any thoughts on potential causes? This is running on Metaflow version `2.9.9`.