# ask-metaflow
We've run into an issue causing flows to deadlock when running on AWS Batch:

- We have a flow with a `foreach` step that spawns 20 or so workers.
- Each worker gets a ~11GB file from S3 with `s3.get()` and does some (quick) computation on it.
- Some workers finish near-instantly; the others run perpetually (>1h instead of ~10 seconds).
- Any new flows get stuck in "Starting", then time out after four minutes.

My theory is that the workers are competing for the 100GB of disk space provisioned by the launch template and deadlock while waiting for disk space to free up during the `s3.get()` call. Is this a likely cause? If so, is there a good way to avoid this situation, aside from preventing the root cause (downloading large files at the start of foreach-ed steps)?
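Roughly, the flow looks like the sketch below. The bucket, URLs, resource numbers, and the `do_quick_computation` helper are placeholders, not our real code:

```python
import os

from metaflow import FlowSpec, S3, batch, step


def do_quick_computation(path):
    # Stand-in for the quick per-file computation; placeholder only.
    return os.path.getsize(path)


class LargeFileFlow(FlowSpec):

    @step
    def start(self):
        # Placeholder URLs: ~20 S3 objects of ~11GB each.
        self.urls = ["s3://example-bucket/part-%02d.bin" % i for i in range(20)]
        self.next(self.process, foreach="urls")

    @batch(cpu=2, memory=16000)
    @step
    def process(self):
        # Each foreach worker downloads its ~11GB file to local disk.
        # s3.get() materializes the object in a temp directory on the
        # instance's root volume, so workers packed onto the same Batch
        # instance all draw from the launch template's 100GB of disk.
        with S3() as s3:
            obj = s3.get(self.input)
            self.result = do_quick_computation(obj.path)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    LargeFileFlow()
```

We launch it the usual way (e.g. `python large_file_flow.py run`); the `foreach` fan-out is what drives the ~20 concurrent Batch tasks.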