# ask-metaflow
h
Given AWS batch setup like so (see below). For some reason, my supposedly parallel jobs do not run in parallel. In fact, given the below example, I only get one Running job and the remaining are “stuck” in Runnable.
from metaflow import FlowSpec, step, batch, retry

class Postprocessing(FlowSpec):

    @retry(times=0)
    @batch(cpu=8, memory=64000, queue='metaflow-684414486554-55ff636')
    @step
    def start(self):
        self.regions = ['reg1', 'reg2', 'reg3']
        self.next(self.post_process_by_region, foreach='regions')

    @batch(cpu=48, memory=384000, queue='metaflow-684414486554-55ff636')
    @step
    def post_process_by_region(self):
        current_region_name = self.input
        # do some really exciting stuff!
        self.next(self.join)

    @step
    def join(self, inputs):
        # do some even more really exciting stuff!
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
   Postprocessing()
I run it as a (scheduled) step function like so:
python Postprocessing.py --package-suffixes .sql --environment=conda --with retry step-functions create
python Postprocessing.py --package-suffixes .sql --environment=conda step-functions trigger
Any idea why it does not run in parallel? Thanks!
s
Max vCPUs in the Batch config is set to 96. An instance with 48 vCPUs offers slightly fewer than 48 vCPUs to tasks, since AWS Batch reserves a bit for itself. So running a task that requests 48 vCPUs forces an instance with more than 48 vCPUs, and within the 96-vCPU cap there is then no room for a second concurrent 48-vCPU task.
h
It's actually the memory, not the CPU. Try requesting something like 90% of the instance's memory instead of the full amount: https://docs.aws.amazon.com/batch/latest/userguide/ecs-reserved-memory.html
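As a sketch, this is roughly what a ~90% memory request could look like in the decorator. The flow and step names are placeholders, and the ~345000 MB value is just a rule of thumb (roughly 90% of 384000), not an exact figure:
from metaflow import FlowSpec, step, batch

class MemoryHeadroomSketch(FlowSpec):

    @step
    def start(self):
        self.next(self.heavy_step)

    # Request ~90% of the 384 GB instance memory (384000 -> ~345000 MB) so the
    # memory ECS reserves for itself still fits on the same instance.
    @batch(cpu=48, memory=345000, queue='metaflow-684414486554-55ff636')
    @step
    def heavy_step(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MemoryHeadroomSketch()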
s
Ah right. The CPU reservation applies to EKS and not AWS Batch
h
Ok thanks, I will try. Mind you, I do not think I can do @batch(cpu=8, memory=32000, queue='metaflow-684414486554-55ff636') instead of @batch(cpu=8, memory=64000, queue='metaflow-684414486554-55ff636'). IIRC it has to align exactly with these "permutations": https://aws.amazon.com/ec2/instance-types/r7i/ Or maybe you mean something else?
w
it doesn't have to align exactly - we can pack multiple jobs across flows onto a single instance if there is space
👍 1
h
Batch will allocate an instance that has at least what was requested. So you can request odd numbers like 7 cores and it will spin up an instance with at least 8 cores, which means it can then schedule another job that only requested 1 core on that same instance (assuming the memory is sufficient)
👍 1
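A small illustration of the packing described above; the instance size and memory figures here are made-up assumptions, not real Batch numbers:
# Bin-packing sketch: one "odd" 7-vCPU job plus a 1-vCPU job (possibly from
# another flow) sharing the same 8-vCPU instance, memory permitting.
instance = {'vcpus': 8, 'memory': 16000}   # hypothetical instance Batch picks
job_a = {'cpu': 7, 'memory': 12000}        # hypothetical odd-sized request
job_b = {'cpu': 1, 'memory': 2000}         # smaller job from another flow

fits = (job_a['cpu'] + job_b['cpu'] <= instance['vcpus']
        and job_a['memory'] + job_b['memory'] <= instance['memory'])
print(fits)  # True -> both jobs can land on the same instance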
h
Thanks, I have to admit I do not follow 100%. I have these available:
c7i.large, r7i.xlarge, r7i.16xlarge, r7i.12xlarge, r7i.8xlarge, r7i.24xlarge, r7i.large, r7i.4xlarge, r7i.2xlarge
and my Maximum vCPUs is set to 96. I cannot see a max memory. Let us say, ideally, I want to run N >> 10 jobs in parallel using foreach steps, each requiring about 32 GB of memory like my local machine. What would I have to spec? I do not care too much about speed/CPU. Thus far, something like @batch(cpu=4, memory=32000, queue='metaflow-1234') has often resulted in N << 10 parallel jobs, with the rest "stuck" in Runnable.
h
Do you see any messages in the ASG target group logs?
h
Thanks. Will have a look. As it stands, I just tried low CPU (2) and my local machine's memory (32 GB) and I have spun up 17 "parallel" and running (!) jobs. So being less "greedy" CPU-wise appears to help …
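For reference, a sketch of that less-greedy request; the flow/step names and the region list are placeholders:
from metaflow import FlowSpec, step, batch

class RegionFanoutSketch(FlowSpec):

    @step
    def start(self):
        self.regions = ['reg1', 'reg2', 'reg3']   # in practice N >> 10 regions
        self.next(self.post_process_by_region, foreach='regions')

    # 2 vCPUs + 32 GB per task keeps the aggregate vCPU demand well under the
    # 96-vCPU ceiling, so Batch can pack several tasks per instance
    # (memory permitting).
    @batch(cpu=2, memory=32000, queue='metaflow-684414486554-55ff636')
    @step
    def post_process_by_region(self):
        current_region_name = self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    RegionFanoutSketch()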
h
17*2 is still much less than 96 though
h
yes it seems to have worked