Question about the expected behavior of the heartbeat daemon Outerbounds #ask-metaflow

prehistoric-salesclerk-95013

10/02/2024, 6:16 PM

Question about the expected behavior of the heartbeat-daemon launched under Argo Workflows in the just-released Metaflow 2.12.23 package. Specifically, the heartbeat-daemon template hard-codes the memory request/limit to 100Mi, which causes the heartbeat pod to be OOMKilled within 30s. Increasing that memory limit (I edited the template on the fly and re-triggered) allows the pod to launch and run, but its eventual exit code is 143. I would like to confirm that this is the expected exit behavior.

prehistoric-salesclerk-95013

10/02/2024, 6:18 PM

I think the 143 is expected: once the task has finished successfully, Kubernetes (via the kubelet) sends a SIGTERM (kill -15) to the heartbeat-daemon pod to shut it down, since the daemon has no natural termination of its own. But the error messages are a little alarming to me and our users, so I'm wondering if there's a configuration I'm missing that would allow the heartbeat-daemon to exit more gracefully, with less scary-looking exit codes and logs.
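
For reference, the 143 matches the common convention that a process terminated by a signal is reported with exit code 128 plus the signal number, so SIGTERM (15) gives 143. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical and not part of Metaflow, Argo, or Kubernetes, and the signal numbers assume Linux:

```python
import signal


def exit_code_for_signal(signum: int) -> int:
    """Exit code conventionally reported for a process killed by `signum`."""
    return 128 + signum


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: turn a container exit code into a readable label."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            # Codes above 128 conventionally encode the terminating signal.
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code}"
        return f"terminated by {name}"
    return f"exit code {code}"


if __name__ == "__main__":
    print(exit_code_for_signal(signal.SIGTERM))
    print(describe_exit_code(143))
```

Under the same convention, a container killed with SIGKILL (signal 9), which is typically what happens on an OOM kill, would show up as 137 rather than 143.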

square-wire-39606

10/02/2024, 11:56 PM

@prehistoric-salesclerk-95013 we are also looking into some of the issues with the daemon container implementation within Argo Workflows and trying to work around them. The heartbeat daemons are best-effort and have no impact on the execution of Metaflow itself. The next release of Metaflow will actually turn them off by default.
