# ask-metaflow
p
Howdy folks, we're getting started on metaflow using Argo/EKS and running into `Job has reached the specified backoff limit` on all of our GPU pods (`g5.8xlarge`, so single nvidia gpu). The pods start but then the job is killed mid-run after several minutes. We're running the latest version of metaflow (`2.12.27`). There are no errors displayed in job events or in the pods or job logs. Has anyone else run into this? Any suggestions for troubleshooting would be super helpful 🙏
p
I believe the Argo Workflow template for your job will specify a `spec.templates[].retryStrategy.limit` along with a `.backoff.maxDuration`, and you've hit that. So there may be some pods in the job that are failing (or a really short `maxDuration`), which causes the whole workflow to be cancelled.
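For reference, a minimal sketch of where those fields sit in an Argo WorkflowTemplate (names and values are illustrative, not what Metaflow actually generates):
```yaml
# Illustrative sketch only -- not the exact template Metaflow generates.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example-flow                # hypothetical name
spec:
  templates:
    - name: train-step              # hypothetical GPU step
      retryStrategy:
        limit: "3"                  # retries allowed before this node is marked failed
        retryPolicy: Always
        backoff:
          duration: "1m"
          factor: "2"
          maxDuration: "10m"        # a short maxDuration cuts retries off early
      container:
        image: python:3.11
        command: ["python", "step.py"]
```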
👀 1
🙏 1
p
We have only been scheduling the flows locally (for testing since this is our first k8s deployment of metaflow). I will try scheduling the flow via Argo to see if we get the same results
To close the loop on this, the issue appears to be attributable to the differences between orchestrating flows locally and orchestrating them remotely (with Argo Workflows) when the tasks run remotely on k8s. Specifically, we're using GPU spot instances on AWS, and Argo Workflows appears to better handle the slower creation and initialization of these nodes, as well as the pod lifecycle. We didn't observe this issue on any of our CPU workloads. While we cannot run full-scale GPU flows that use remote k8s with local orchestration (e.g., from our macs), the flows complete when orchestrated with Argo Workflows. Perhaps this behavior is expected and local orchestration of full-size remote GPU workloads is not part of metaflow's development happy path? I'd be curious to learn more about how other folks develop, test, and iterate on metaflow pipelines locally, since this would seem to not support our data science workflow well (e.g., a data scientist would expect to be able to run a long-running GPU task in interactive mode / using local orchestration during prototyping).
s
@purple-sundown-40080 the job waits till either the compute instance is available or a timeout is reached (default is 5 days). can you help me with the full console logs of the issue that you are running into? `Job has reached the specified backoff limit` is indicative of all retries being exhausted, so it is likely that something else is preventing the job from starting up
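For context, that message is emitted by the Kubernetes Job controller once the Job's `backoffLimit` is exhausted; a minimal sketch of where that field lives, with illustrative names and values:
```yaml
# Minimal sketch of the relevant Job field (names/values illustrative).
apiVersion: batch/v1
kind: Job
metadata:
  name: metaflow-task-example       # hypothetical name
spec:
  backoffLimit: 0                   # pod retries allowed before the Job is marked Failed
                                    # with reason BackoffLimitExceeded; 0 means a single
                                    # pod failure fails the Job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: python:3.11
          command: ["python", "step.py"]
```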
p
Yes, happy to provide all relevant code and logs. Can you provide more guidance on what would be useful? For example, no error message appears in the cluster, pod, or job logs, and the metaflow logs show:
```
[MFLOG|0|2024-11-01T18:54:22.325076Z|runtime|98469d86-8b54-4d5a-bbf4-8d1f0cd0af26]    Kubernetes error:
[MFLOG|0|2024-11-01T18:54:22.528574Z|runtime|d83e93a3-41d9-4267-89be-9640147627b3]    Task crashed. This could be a transient error. Use @retry to retry.
[MFLOG|0|2024-11-01T18:54:22.530131Z|runtime|ed97ceb1-c4b7-45ef-8865-9b4c011ba1a3]
[MFLOG|0|2024-11-01T18:54:22.817185Z|runtime|75dc6b5e-ab0e-451d-b502-5f4d2a9f8e93]Task failed.
```
We've also been using the `@platform_retry` decorator shared in Slack. There are two failure modes we've observed for the locally orchestrated pipeline:
1. Flow fails because the GPU spot instance took a while to start (even with `@platform_retry`).
   a. Re-running the flow then results in the tasks at least trying to schedule on the GPU node.
2. GPU task pod starts and dies with a `Job has reached the specified backoff limit` warning in the cluster event logs (and that's it).
   a. We have a hypothesis that issues on the mac (e.g., VPN connectivity, Python driver RAM, AWS token expiration, etc.) might be causing the local orchestration failure, but we don't have any evidence for this and we were seeing it across three different devs.
a
Where are you seeing these metaflow logs? It seems that something is not configured as expected - you shouldn't see the `MFLOG|0` bits anywhere
We use spot instances for all our internal metaflow dev and have never seen an error like this pop up.
i
@purple-sundown-40080 did you solve this problem? I'm getting the same error but with CPU workloads.
a
@important-address-85681 what do your console logs and pod logs look like?
i
@square-wire-39606, I assume this is a problem with the k8s configuration. I moved from autopilot to standard GKE. And under high load (300+ workers) I'm getting errors like:
```
Sleeping 2 minutes before the next retry
    Kubernetes error:
    Task crashed. This could be a transient error. Use @retry to retry.

Task failed.
```
And I cannot find the cause of the problem; in the kubectl job description I'm getting Pod errors: BackoffLimitExceeded
```yaml
status:
  conditions:
  - lastProbeTime: "2025-01-14T11:26:56Z"
    lastTransitionTime: "2025-01-14T11:26:56Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2025-01-14T11:26:57Z"
    lastTransitionTime: "2025-01-14T11:26:57Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
```
s
do you have any pod logs? 300 is not much scale at all
i
Some jobs work for several minutes and then randomly crash
s
yeah - maybe the logs will help in figuring out the issue
also is this the job status or the pod status?
i
Job status. I will try to get logs tomorrow; right now the cluster is busy with another workload....
The pod logs are hard to get because they are deleted as soon as the pod crashes.
Hi all! After a long investigation I've found the issue that caused the "Task crashed. This could be a transient error." failures. Usually this is caused by OOM (or a silent OOM when running child processes inside the container), and that's what I expected to find. But in my case the problem was the k8s autoscaler and its configuration. Because some of the tasks finished earlier than others, the k8s autoscaler was triggering its scale-down procedure and trying to pack the workload as tightly as possible with respect to the resource requirements, evicting pods from nodes about to be deleted. At large scale this happens much more often, and some steps get kicked off their nodes more times than any reasonable `@retry` can handle. On spot nodes this problem is much more severe than on on-demand nodes.
• One way to solve this is to run in a 1 pod / 1 node configuration (rough sketch below), but that wastes resources. It also causes a mess when we have multiple steps with different resource requirements in the mapped (parallel) section.
• Another way is to control the scale manually (from python), but that is not provider agnostic, and it also gets messy for the map/reduce nodes.
Right now I'm reading the k8s documentation to understand what else can be done. @square-wire-39606 any thoughts about this?
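A rough, untested sketch of the 1 pod / 1 node idea using pod anti-affinity (labels and names are made up):
```yaml
# Rough sketch: at most one pod carrying this label per node, via pod anti-affinity.
apiVersion: v1
kind: Pod
metadata:
  name: heavy-step-example              # hypothetical name
  labels:
    app: heavy-step                     # hypothetical label shared by the step's pods
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: heavy-step
          topologyKey: kubernetes.io/hostname   # i.e. "one per node"
  containers:
    - name: main
      image: python:3.11
```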
p
I don't know the k8s autoscaler, but we use karpenter, and to prevent the scale-in of nodes still running jobs that shouldn't be interrupted we use the `karpenter.sh/do-not-disrupt: true` annotation on our pods, as set by Argo Workflows. Maybe look to see if there's a similar k8s autoscaler annotation to prevent the scaler from terminating some pods?
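On the pod it ends up looking roughly like this (sketch; names illustrative, and Argo Workflows can apply it for you via `spec.podMetadata`):
```yaml
# Sketch of the pod-level annotation.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-task-example                # hypothetical name
  annotations:
    karpenter.sh/do-not-disrupt: "true" # Karpenter will not voluntarily disrupt the
                                        # node while this pod is running
spec:
  containers:
    - name: main
      image: python:3.11
```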
👍 1
i
We solved this problem with:
```yaml
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
among us party 1