# ask-metaflow
b
Hi team, we have an AWS Batch flow that uses a `for_each` split into roughly 540 tasks in one step. The odd part is, we have a random occurrence where some tasks will show a 0.0s runtime and be marked as failed, but we see no logs related to those tasks. The occurrence varies: one run will have 4/540 fail, one run might have 11/540 fail, one run might have 0/540 fail. I’m trying to figure out how to diagnose the issue when we have a 0.0s failure, since there are no runtime logs related to the task itself to understand what happened
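For context, a minimal sketch of the kind of flow being described, with hypothetical names (`FanOutFlow`, `shards`) and arbitrary `@batch` resources; the real flow's details will differ:
```
from metaflow import FlowSpec, step, batch

class FanOutFlow(FlowSpec):
    # Rough shape: one step fans out into ~540 foreach tasks,
    # each running as its own AWS Batch job.

    @step
    def start(self):
        self.shards = list(range(540))   # hypothetical: one entry per downstream task
        self.next(self.process, foreach="shards")

    @batch(cpu=1, memory=4096)           # example resources, not the real ones
    @step
    def process(self):
        self.shard = self.input          # each foreach task sees its own element
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanOutFlow()
```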
a
it's likely that the failure happened on aws batch's end - leaving metaflow with no logs to scrape. what does your aws batch job console say for these workloads?
b
let me try and take a look back on the batch side
oh it’s been a bit so prob missing batch logs, let me re-run this job to trigger the failure again (hopefully) and get back to you. thanks @square-wire-39606
👍🏼 1
@square-wire-39606 interesting, i finally got around to digging back into this, i re-ran the job from scratch, i get to the failing step (which should run 209 for-each splits), and when i look into the batch jobs, there are no failed tasks
the experience is almost like the tasks never get started, the UI shows 0.0s of task execution time
i have 15/209 that didn’t run, and these 15 essentially never get an instance or any batch metadata, it’s just like they get lost. the failure then happens on the `end` step because it tries to go to the next step but realizes there were “failed” steps from before
a
are you around for a quick screen share sess?
b
got a couple meetings to round out today before the long weekend, think we can do sometime next week? maybe tuesday or thursday?
s
how is thursday at 9a PT? i am at savin@outerbounds.co. please invite sakari@outerbounds.co too
b
that works, i’ll shoot it over
thanks!
s
btw what is the max workers that you are specifying?
i wonder if this is happening because you are running into aws throttling your calls - it should be handled gracefully but there could be a regression
cc @bulky-afternoon-92433
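For reference, a pointer at where that knob lives, assuming the run is launched with the standard Metaflow CLI and a hypothetical `flow.py`:
```
# --max-workers caps how many foreach tasks run (and get submitted/polled)
# concurrently, which directly affects how hard the AWS APIs are hit.
python flow.py run --max-workers 16
```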
b
oh i wonder, you mean EC2 ICE (insufficient capacity) issues?
we do encounter those from time to time but i think we usually catch it eventually, this flow has been problematic pretty consistently
so i haven’t felt like it’s ICE related
s
You can hack metaflow to print the AWS batch job calls and check what happens
b
happy to try that before we meet next week
would we need to do a custom metaflow build?
h
are you launching the batch jobs from SFN?
b
no this one is from an EC2
if you have some guidance on hacking metaflow to expose the batch calls happy to give it a try 🙂
h
you'd have to add something in here. i think `launch_job()` should do it
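If patching a local checkout feels heavy, a lower-touch alternative (standard boto3/botocore logging, not the Metaflow hook linked above) is to enable botocore debug logging in the flow file so every Batch API request/response gets printed by the process that submits the jobs:
```
import logging
import boto3

# Dump every botocore request/response (SubmitJob, DescribeJobDefinitions,
# DescribeJobs, ...) to stderr. Very noisy, so enable only for a debug run.
boto3.set_stream_logger("botocore", logging.DEBUG)
```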
b
and then just do a local build of the library right?
h
yea
👍 1
b
i’ll poke around and give it a whirl
h
if you deploy to SFN and run from there you should be able to see the calls from there too (or at least tell what happened)
👀 1
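For reference, the standard Metaflow CLI route for that, assuming a hypothetical `flow.py`; the execution history for each task is then visible in the Step Functions console:
```
python flow.py step-functions create    # compile and deploy the flow as a state machine
python flow.py step-functions trigger   # kick off a run on Step Functions
```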
b
interesting, can maybe give that a shot
@square-wire-39606 @bulky-afternoon-92433 thanks for jumping on with me today, that job I was running ended up not creating 16 jobs in batch, which lined up with the # of jobs that failed with `DescribeJobDefinitions` too-many-requests errors. I’m wondering if we could try to patch that failure on a feature branch to try it out? I believe the 4 failing retries cause the job to not get registered.
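A sketch of what a patch in that direction could look like at the boto3 level, assuming the throttled call goes through a plain boto3 Batch client; where exactly to thread the config into Metaflow's client construction depends on the version being patched:
```
import boto3
from botocore.config import Config

# Stronger client-side retry policy than the default: more attempts, and
# "adaptive" mode adds client-side rate limiting on top of exponential
# backoff, which helps when many tasks hit DescribeJobDefinitions at once.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
batch = boto3.client("batch", config=retry_config)

# Hypothetical job definition name, just to show the call that was throttled.
resp = batch.describe_job_definitions(jobDefinitionName="my-job-def", status="ACTIVE")
```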
if i want to fully validate this, do i need to create an issue before i create a PR?
s
nope - you can just create a pr directly
b
ha, been a bit since i’ve done a community contribution, forgot i needed to fork first