# ask-metaflow
b
Hi team, we have an AWS Batch flow that uses a `for_each` split into roughly 540 tasks in one step. The odd part is, we have a random occurrence where some tasks will show a 0.0s runtime and be marked as failed, but we see no logs related to those tasks. The occurrence varies: one run will have 4/540 fail, one run might have 11/540 fail, one run might have 0/540 fail. I’m trying to figure out how to diagnose the issue when we have a 0.0s failure, since there are no runtime logs related to the task itself to understand what happened
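For context, a minimal sketch of the kind of flow being described, with hypothetical names (`FanOutFlow`, `shards`) and arbitrary `@batch` resources; the real flow's details will differ:
```
from metaflow import FlowSpec, step, batch

class FanOutFlow(FlowSpec):
    # Rough shape: one step fans out into ~540 foreach tasks,
    # each running as its own AWS Batch job.

    @step
    def start(self):
        self.shards = list(range(540))   # hypothetical: one entry per downstream task
        self.next(self.process, foreach="shards")

    @batch(cpu=1, memory=4096)           # example resources, not the real ones
    @step
    def process(self):
        self.shard = self.input          # each foreach task sees its own element
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanOutFlow()
```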
a
it's likely that the failure happened on aws batch's end - leaving metaflow with no logs to scrape. what does your aws batch job console say for these workloads?
b
let me try and take a look back on the batch side
oh it’s been a bit so prob missing batch logs, let me re-run this job to trigger the failure again (hopefully) and get back to you. thanks @square-wire-39606
👍🏼 1
@square-wire-39606 interesting, i finally got around to digging back into this, i re-ran the job from scratch, i get to the failing step (which should run 209 for-each splits), and when i look into the batch jobs, there are no failed tasks
the experience is almost like the tasks never get started, the UI shows 0.0s of task execution time
i have 15/209 that didn’t run, and these 15 essentially never get an instance or any batch metadata, it’s just like they get lost. the failure then happens on the `end` step because it tries to go to the next step but realizes there were “failed” steps from before
a
are you around for a quick screen share sess?
b
got a couple meetings to round out today before the long weekend, think we can do sometime next week? maybe tuesday or thursday?
s
how is thursday at 9a PT? i am at savin@outerbounds.co. please invite sakari@outerbounds.co too
b
that works, i’ll shoot it over
thanks!
s
btw what is the max workers that you are specifying?
i wonder if this is happening because you are running into aws throttling your calls - it should be handled gracefully but there could be a regression
cc @bulky-afternoon-92433
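For reference, a pointer at where that knob lives, assuming the run is launched with the standard Metaflow CLI and a hypothetical `flow.py`:
```
# --max-workers caps how many foreach tasks run (and get submitted/polled)
# concurrently, which directly affects how hard the AWS APIs are hit.
python flow.py run --max-workers 16
```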
b
oh i wonder, you mean EC2 ICE (insufficient capacity) issues?
we do encounter those from time to time but i think we usually catch it eventually, this flow has been problematic pretty consistently
so i haven’t felt like it’s ICE related
s
You can hack metaflow to print the AWS batch job calls and check what happens
b
happy to try that before we meet next week
would we need to do a custom metaflow build?
h
are you launching the batch jobs from SFN?
b
no this one is from an EC2
if you have some guidance on hacking metaflow to expose the batch calls happy to give it a try 🙂
h
you'd have to add something in here. i think `launch_job()` should do it
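If patching a local checkout feels heavy, a lower-touch alternative (standard boto3/botocore logging, not the Metaflow hook linked above) is to enable botocore debug logging in the flow file so every Batch API request/response gets printed by the process that submits the jobs:
```
import logging
import boto3

# Dump every botocore request/response (SubmitJob, DescribeJobDefinitions,
# DescribeJobs, ...) to stderr. Very noisy, so enable only for a debug run.
boto3.set_stream_logger("botocore", logging.DEBUG)
```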
b
and then just do a local build of the library right?
h
yea
👍 1
b
i’ll poke around and give it a whirl
h
if you deploy to SFN and run from there you should be able to see the calls from there too (or at least tell what happened)
👀 1
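For reference, the standard Metaflow CLI route for that, assuming a hypothetical `flow.py`; the execution history for each task is then visible in the Step Functions console:
```
python flow.py step-functions create    # compile and deploy the flow as a state machine
python flow.py step-functions trigger   # kick off a run on Step Functions
```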
b
interesting, can maybe give that a shot
@square-wire-39606 @bulky-afternoon-92433 thanks for jumping on with me today, that job I was running ended up not creating 16 jobs in batch, which lined up with the # of jobs that failed with `DescribeJobDefinitions` too-many-requests errors. I’m wondering if we could try to patch that failure on a feature branch to try it out? I believe the 4 failing retries cause the job to not get registered.
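A sketch of what a patch in that direction could look like at the boto3 level, assuming the throttled call goes through a plain boto3 Batch client; where exactly to thread the config into Metaflow's client construction depends on the version being patched:
```
import boto3
from botocore.config import Config

# Stronger client-side retry policy than the default: more attempts, and
# "adaptive" mode adds client-side rate limiting on top of exponential
# backoff, which helps when many tasks hit DescribeJobDefinitions at once.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
batch = boto3.client("batch", config=retry_config)

# Hypothetical job definition name, just to show the call that was throttled.
resp = batch.describe_job_definitions(jobDefinitionName="my-job-def", status="ACTIVE")
```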
if i want to fully validate this, do i need to create an issue before i create a PR?
s
nope - you can just create a pr directly
b
ha, been a bit since i’ve done a community contribution, forgot i needed to fork first