Curious if anyone has pointers on where to look when flows deployed to Step Functions (SFN) don't reach the expected level of concurrency in foreach fanouts?

A couple of things I've already checked:
• SFN deployed with sufficient concurrency via `step-functions create --max-workers 200`
    ◦ Verified the `Map` state in the SFN definition has the expected `"MaxConcurrency": 200`
• Batch Compute Environment has sufficient capacity headroom to provision more instances
• EC2 Service Quota for that particular instance type has sufficient headroom
• The run itself has over 4000 splits in the fanout, so there's no shortage on that side
• Nothing is stuck in pending/runnable in Batch
Despite that, it only seems to run 50 Batch jobs in parallel at a time. I also ran two SFN executions at the same time to check whether the Batch Compute Environment could scale further, which it did, although each execution was still capped at 50 concurrent jobs (100 jobs total across the two SFN executions).

This leads me to believe that nothing within the Batch service is restricting the scale-out, and that the limit is likely coming from Step Functions :thinking_face:

Running a bit dry on what the potential snags could be, and would appreciate it if anyone has ideas!
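
For anyone wanting to repeat the `MaxConcurrency` check on their own deployment, here's a minimal sketch of how it could be scripted. It assumes the state machine definition has already been exported as JSON (for example, the `definition` field returned by `aws stepfunctions describe-state-machine`). The helper name `map_concurrency` and the state names in the sample (`start`, `fanout`, `train`) are made up for illustration and are not what Metaflow actually generates.

```python
import json


def map_concurrency(definition):
    """Return {state_name: MaxConcurrency} for every Map state, nested ones included."""
    found = {}

    def walk(states):
        for name, state in states.items():
            kind = state.get("Type")
            if kind == "Map":
                # In Amazon States Language, a missing MaxConcurrency defaults to 0,
                # which means no explicit limit was set on the Map state.
                found[name] = state.get("MaxConcurrency", 0)
                # The Map body sits under "ItemProcessor" (newer) or "Iterator" (older).
                body = state.get("ItemProcessor") or state.get("Iterator") or {}
                walk(body.get("States", {}))
            elif kind == "Parallel":
                for branch in state.get("Branches", []):
                    walk(branch.get("States", {}))

    walk(definition.get("States", {}))
    return found


# Hypothetical, trimmed-down definition used only to exercise the helper.
sample = json.loads("""
{
  "StartAt": "start",
  "States": {
    "start": {"Type": "Task", "Next": "fanout"},
    "fanout": {
      "Type": "Map",
      "MaxConcurrency": 200,
      "Iterator": {
        "StartAt": "train",
        "States": {"train": {"Type": "Task", "End": true}}
      },
      "End": true
    }
  }
}
""")

print(map_concurrency(sample))  # {'fanout': 200}
```

This only confirms what the definition requests. It says nothing about any limit Step Functions applies at runtime, which is the part I'm still unsure about.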