# ask-metaflow
p
Howdy folks, we're getting started on metaflow using Argo/EKS and running into `Job has reached the specified backoff limit` on all of our GPU pods (`g5.8xlarge`, so single nvidia gpu). The pods start but then the job is killed mid-run after several minutes. We're running the latest version of metaflow (`2.12.27`). There are no errors displayed in job events or in the pods or job logs. Has anyone else run into this? Any suggestions for troubleshooting would be super helpful 🙏
p
I believe the Argo Workflow template for your job will specify a `spec.templates[].retryStrategy.limit` along with a `.backoff.maxDuration`, and you've hit that. So there may be some pods in the job that are failing (or a really short `maxDuration`), which causes the whole workflow to be cancelled.
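For reference, a minimal sketch of where those fields sit in an Argo WorkflowTemplate (names and values are illustrative, not what Metaflow actually generates):
```yaml
# Illustrative sketch only -- not the exact template Metaflow generates.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example-flow                # hypothetical name
spec:
  templates:
    - name: train-step              # hypothetical GPU step
      retryStrategy:
        limit: "3"                  # retries allowed before this node is marked failed
        retryPolicy: Always
        backoff:
          duration: "1m"
          factor: "2"
          maxDuration: "10m"        # a short maxDuration cuts retries off early
      container:
        image: python:3.11
        command: ["python", "step.py"]
```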
👀 1
🙏 1
p
We have only been scheduling the flows locally (for testing since this is our first k8s deployment of metaflow). I will try scheduling the flow via Argo to see if we get the same results
To close the loop on this, the issue appears to be attributable to the differences between orchestrating flows locally and orchestrating them remotely (with Argo Workflows) when the tasks run remotely on k8s. Specifically, we're using GPU spot instances on AWS, and Argo Workflows appears to better handle the slower creation and initialization of these nodes, as well as the pod lifecycle. We didn't observe this issue on any of our CPU workloads. While we cannot run full-scale GPU flows that use remote k8s with local orchestration (e.g., from our macs), the flows complete when orchestrated with Argo Workflows. Perhaps this behavior is expected and local orchestration of full-size remote GPU workloads is not part of metaflow's development happy path? I'd be curious to learn more about how other folks develop, test, and iterate on metaflow pipelines locally, since this would seem to not support our data science workflow well (e.g., a data scientist would expect to be able to run a long-running GPU task in interactive mode / using local orchestration during prototyping).
s
@purple-sundown-40080 the job waits till either the compute instance is available or a timeout is reached (default is 5 days). can you help me with the full console logs of the issue that you are running into? `Job has reached the specified backoff limit` is indicative of all retries being exhausted, so it is likely that something else is preventing the job from starting up
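For context, that message is emitted by the Kubernetes Job controller once the Job's `backoffLimit` is exhausted; a minimal sketch of where that field lives, with illustrative names and values:
```yaml
# Minimal sketch of the relevant Job field (names/values illustrative).
apiVersion: batch/v1
kind: Job
metadata:
  name: metaflow-task-example       # hypothetical name
spec:
  backoffLimit: 0                   # pod retries allowed before the Job is marked Failed
                                    # with reason BackoffLimitExceeded; 0 means a single
                                    # pod failure fails the Job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: python:3.11
          command: ["python", "step.py"]
```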
p
Yes, happy to provide all relevant code and logs. Can you provide more guidance on what would be useful? For example, no error message appears in the cluster, pod, or job logs, and the metaflow logs show:
```
[MFLOG|0|2024-11-01T18:54:22.325076Z|runtime|98469d86-8b54-4d5a-bbf4-8d1f0cd0af26]    Kubernetes error:
[MFLOG|0|2024-11-01T18:54:22.528574Z|runtime|d83e93a3-41d9-4267-89be-9640147627b3]    Task crashed. This could be a transient error. Use @retry to retry.
[MFLOG|0|2024-11-01T18:54:22.530131Z|runtime|ed97ceb1-c4b7-45ef-8865-9b4c011ba1a3]
[MFLOG|0|2024-11-01T18:54:22.817185Z|runtime|75dc6b5e-ab0e-451d-b502-5f4d2a9f8e93]Task failed.
```
We've also been using the `@platform_retry` decorator shared in Slack. There are two failure modes we've observed for the locally orchestrated pipeline:
1. Flow fails because the GPU spot instance took a while to start (even with `@platform_retry`).
   a. Re-running the flow then results in the tasks at least trying to schedule on the GPU node.
2. GPU task pod starts and dies with a `Job has reached the specified backoff limit` warning in the cluster event logs (and that's it).
   a. We have a hypothesis that issues on the mac (e.g., VPN connectivity, Python driver RAM, AWS token expiration, etc.) might be causing the local orchestration failure, but we don't have any evidence for this and we were seeing it across three different devs.
a
Where are you seeing these metaflow logs? It seems that something is not configured as expected - you shouldn't see the `MFLOG|0` bits anywhere
We use spot instances for all our internal metaflow dev and have never seen an error like this pop up.
i
@purple-sundown-40080 did you solve this problem? I'm getting the same error but with CPU workloads.
a
@important-address-85681 what do your console logs and pod logs look like?
i
@square-wire-39606, I assume this is a problem with the k8s configuration. I moved from autopilot to standard GKE. And under high load (300+ workers) I'm getting errors like:
```
Sleeping 2 minutes before the next retry
    Kubernetes error:
    Task crashed. This could be a transient error. Use @retry to retry.

Task failed.
```
And I cannot find the cause of the problem; in the kubectl job description I'm getting Pod errors: BackoffLimitExceeded
```yaml
status:
  conditions:
  - lastProbeTime: "2025-01-14T11:26:56Z"
    lastTransitionTime: "2025-01-14T11:26:56Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2025-01-14T11:26:57Z"
    lastTransitionTime: "2025-01-14T11:26:57Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
```
s
do you have any pod logs? 300 is not much scale at all
i
Some jobs work for several minutes and then randomly crash
s
yeah - maybe the logs will help in figuring out the issue
also is this the job status or the pod status?
i
Job status. I will try to get logs tomorrow; right now the cluster is busy with another workload....
The pod logs are hard to get because they are deleted as soon as the pod crashes.
Hi all! After a long investigation I've found the issue that caused the "Task crashed. This could be a transient error." failures. Usually this is caused by OOM (or a silent OOM when running child processes inside the container), and that's what I expected to find. But in my case the problem was the k8s autoscaler and its configuration. Because some of the tasks finished earlier than others, the k8s autoscaler was triggering its scale-down procedure and trying to pack the workload as tightly as possible with respect to the resource requirements, evicting pods from nodes about to be deleted. At large scale this happens much more often, and some steps get kicked off their nodes more times than any reasonable `@retry` can handle. On spot nodes this problem is much more severe than on on-demand nodes.
• One way to solve this is to run in a 1 pod / 1 node configuration (rough sketch below), but that wastes resources. It also causes a mess when we have multiple steps with different resource requirements in the mapped (parallel) section.
• Another way is to control the scale manually (from python), but that is not provider agnostic, and it also gets messy for the map/reduce nodes.
Right now I'm reading the k8s documentation to understand what else can be done. @square-wire-39606 any thoughts about this?
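A rough, untested sketch of the 1 pod / 1 node idea using pod anti-affinity (labels and names are made up):
```yaml
# Rough sketch: at most one pod carrying this label per node, via pod anti-affinity.
apiVersion: v1
kind: Pod
metadata:
  name: heavy-step-example              # hypothetical name
  labels:
    app: heavy-step                     # hypothetical label shared by the step's pods
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: heavy-step
          topologyKey: kubernetes.io/hostname   # i.e. "one per node"
  containers:
    - name: main
      image: python:3.11
```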
p
I don't know the k8s autoscaler, but we use karpenter, and to prevent the scale-in of nodes still running jobs that shouldn't be interrupted we use the `karpenter.sh/do-not-disrupt: true` annotation on our pods, as set by Argo Workflows. Maybe look to see if there's a similar k8s autoscaler annotation to prevent the scaler from terminating some pods?
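On the pod it ends up looking roughly like this (sketch; names illustrative, and Argo Workflows can apply it for you via `spec.podMetadata`):
```yaml
# Sketch of the pod-level annotation.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-task-example                # hypothetical name
  annotations:
    karpenter.sh/do-not-disrupt: "true" # Karpenter will not voluntarily disrupt the
                                        # node while this pod is running
spec:
  containers:
    - name: main
      image: python:3.11
```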
👍 1
i
We solved this problem with:
```yaml
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
among us party 1