shy-refrigerator-15055
05/26/2023, 2:01 PM
I have a foreach step that spawns 20 or so workers.
• Each worker gets a ~11GB file from S3 with s3.get() and does some (quick) computation on it.
• Some workers finish almost instantly, while the others run indefinitely (>1 hour instead of ~10 seconds).
• Any new flows get stuck in "Starting" and then time out after four minutes.
My theory is that the workers are competing for the 100GB of disk space from the launch template and deadlock while waiting for disk space to free up during the s3.get() call. Is this a likely cause? If so, is there a good way to avoid this situation, other than eliminating the root cause (downloading large files at the start of foreach-ed steps)?
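For context, here is a rough sketch of the shape of the flow, not the real code: FILE_URLS and quick_computation are placeholders standing in for the actual inputs and computation.
```
import os

from metaflow import FlowSpec, S3, step

# Placeholder list of ~11GB objects, one per worker (the real URLs differ).
FILE_URLS = ["s3://my-bucket/part-%03d.bin" % i for i in range(20)]


def quick_computation(path):
    # Stand-in for the quick computation each worker does on its file.
    return os.path.getsize(path)


class BigFileFlow(FlowSpec):

    @step
    def start(self):
        self.urls = FILE_URLS
        # Fan out: one worker per file.
        self.next(self.process, foreach="urls")

    @step
    def process(self):
        # Each worker downloads its ~11GB file with s3.get(); the local copy
        # sits on the instance's disk until the `with S3()` block exits.
        with S3() as s3:
            obj = s3.get(self.input)
            self.result = quick_computation(obj.path)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BigFileFlow()
```
As far as I understand, the temp copy from s3.get() lives on the instance's local disk until the S3 context exits, which is why I suspect several workers landing on the same instance could together exceed the 100GB from the launch template.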