purple-sundown-40080
10/31/2024, 10:10 PM
We're seeing "Job has reached the specified backoff limit" on all of our GPU pods (g5.8xlarge, so a single NVIDIA GPU). The pods start, but then the job is killed mid-run after several minutes. We're running the latest version of Metaflow (2.12.27). There are no errors displayed in the job events or in the pod or job logs.
Has anyone else run into this? Any suggestions for troubleshooting would be super helpful 🙏
prehistoric-salesclerk-95013
11/05/2024, 7:01 PM
Perhaps there's a spec.templates[].retryStrategy.limit
along with a .backoff.maxDuration and you've hit that. So there may be some pods in the job that are failing (or a really short maxDuration), which causes the whole workflow to be cancelled.
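For reference, a retryStrategy of that shape in an Argo Workflows template looks roughly like the sketch below; the template name, image, and the limit/backoff values are illustrative placeholders, not values taken from this cluster:

spec:
  templates:
    - name: gpu-step                # placeholder template name
      retryStrategy:
        limit: "3"                  # retries allowed before the node is marked failed
        backoff:
          duration: "1m"            # initial wait between retries
          factor: "2"               # multiplier applied to the wait on each retry
          maxDuration: "10m"        # total time budget for retries; a short value cancels early
      container:
        image: my-gpu-image         # placeholder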
purple-sundown-40080
11/05/2024, 7:05 PM
purple-sundown-40080
11/06/2024, 2:23 PM
square-wire-39606
11/06/2024, 6:46 PM
square-wire-39606
11/06/2024, 6:48 PM
"Job has reached the specified backoff limit"
is indicative of all retries being exhausted, so it is likely that something else is preventing the job from starting up.
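The limit in question is the Kubernetes Job's spec.backoffLimit: it caps how many failed pods the Job controller will create before marking the whole Job Failed with reason BackoffLimitExceeded. A minimal sketch (the job name, image, and limit value here are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-task                    # placeholder name
spec:
  backoffLimit: 4                   # failed pod attempts tolerated before BackoffLimitExceeded
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: my-gpu-image       # placeholder
          resources:
            limits:
              nvidia.com/gpu: 1     # single GPU, as on g5.8xlarge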
purple-sundown-40080
11/06/2024, 7:02 PM
[MFLOG|0|2024-11-01T18:54:22.325076Z|runtime|98469d86-8b54-4d5a-bbf4-8d1f0cd0af26] Kubernetes error:
[MFLOG|0|2024-11-01T18:54:22.528574Z|runtime|d83e93a3-41d9-4267-89be-9640147627b3] Task crashed. This could be a transient error. Use @retry to retry.
[MFLOG|0|2024-11-01T18:54:22.530131Z|runtime|ed97ceb1-c4b7-45ef-8865-9b4c011ba1a3]
[MFLOG|0|2024-11-01T18:54:22.817185Z|runtime|75dc6b5e-ab0e-451d-b502-5f4d2a9f8e93] Task failed.
We've also been using the @platform_retry decorator shared in Slack.
There are two failure modes we've observed for the locally orchestrated pipeline:
1. Flow fails b/c the GPU spot instance took a while to start (even with @platform_retry).
a. Re-running the flow then results in the tasks at least trying to schedule on the GPU node.
2. GPU task pod starts and then dies with a "Job has reached the specified backoff limit" warning in the cluster event logs (and that's it).
a. We have a hypothesis that any issues on the Mac (e.g., VPN connectivity, Python driver RAM, AWS token expiration, etc.) might be causing the local orchestration failures, but we don't have any evidence for this and we were seeing it across three different devs.
ancient-application-36103
11/11/2024, 10:29 AM
ancient-application-36103
11/11/2024, 10:30 AM
important-address-85681
01/14/2025, 7:14 AM
ancient-application-36103
01/14/2025, 4:32 PM
important-address-85681
01/14/2025, 8:26 PM
Sleeping 2 minutes before the next retry
Kubernetes error:
Task crashed. This could be a transient error. Use @retry to retry.
Task failed.
And I cannot find the cause of the problem; in the kubectl job description I'm getting Pod errors: BackoffLimitExceeded:
status:
  conditions:
  - lastProbeTime: "2025-01-14T11:26:56Z"
    lastTransitionTime: "2025-01-14T11:26:56Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2025-01-14T11:26:57Z"
    lastTransitionTime: "2025-01-14T11:26:57Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
square-wire-39606
01/14/2025, 8:27 PM
important-address-85681
01/14/2025, 8:27 PM
square-wire-39606
01/14/2025, 8:28 PM
square-wire-39606
01/14/2025, 8:29 PM
important-address-85681
01/14/2025, 8:46 PM
important-address-85681
01/14/2025, 8:47 PM
important-address-85681
01/25/2025, 7:35 AM
prehistoric-salesclerk-95013
01/28/2025, 10:50 PM
We have a karpenter.sh/do-not-disrupt: true
annotation on our pods, as set by Argo Workflows. Maybe look to see if there's a similar k8s autoscaler annotation to prevent the scaler from terminating some pods?
important-address-85681
02/02/2025, 8:06 PM
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
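For context, that annotation sits in the pod's metadata (sketched here on a Job's pod template; the job name and image are placeholders) and tells the cluster autoscaler not to evict the pod when it scales nodes down:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-task                    # placeholder
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # exclude this pod from scale-down eviction
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: my-gpu-image       # placeholder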