# ask-metaflow
We've run into an issue causing flows to deadlock when running on AWS Batch:

- We have a flow with a `foreach` step that spawns 20 or so workers.
- Each worker gets a ~11GB file from S3 with `s3.get()` and does some (quick) computation on it.
- Some workers finish near-instantly; the others run perpetually (>1h instead of ~10 seconds).
- Any new flows get stuck in "Starting", then time out after four minutes.

My theory is that the workers are competing for the 100GB of disk space provisioned by the launch template and deadlock while waiting for disk space to free up during the `s3.get()` call. Is this a likely cause? If so, is there a good way to avoid this situation, aside from preventing the root cause (downloading large files at the start of foreach-ed steps)?
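Roughly, the flow looks like the sketch below. The bucket, URLs, resource numbers, and the `do_quick_computation` helper are placeholders, not our real code:

```python
import os

from metaflow import FlowSpec, S3, batch, step


def do_quick_computation(path):
    # Stand-in for the quick per-file computation; placeholder only.
    return os.path.getsize(path)


class LargeFileFlow(FlowSpec):

    @step
    def start(self):
        # Placeholder URLs: ~20 S3 objects of ~11GB each.
        self.urls = ["s3://example-bucket/part-%02d.bin" % i for i in range(20)]
        self.next(self.process, foreach="urls")

    @batch(cpu=2, memory=16000)
    @step
    def process(self):
        # Each foreach worker downloads its ~11GB file to local disk.
        # s3.get() materializes the object in a temp directory on the
        # instance's root volume, so workers packed onto the same Batch
        # instance all draw from the launch template's 100GB of disk.
        with S3() as s3:
            obj = s3.get(self.input)
            self.result = do_quick_computation(obj.path)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    LargeFileFlow()
```

We launch it the usual way (e.g. `python large_file_flow.py run`); the `foreach` fan-out is what drives the ~20 concurrent Batch tasks.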