# ask-metaflow
h
Given AWS batch setup like so (see below). For some reason, my supposedly parallel jobs do not run in parallel. In fact, given the below example, I only get one Running job and the remaining are “stuck” in Runnable.
from metaflow import FlowSpec, step, batch, retry

class Postprocessing(FlowSpec):

    @retry(times=0)
    @batch(cpu=8, memory=64000, queue='metaflow-684414486554-55ff636')
    @step
    def start(self):
        self.regions = ['reg1', 'reg2', 'reg3']
        self.next(self.post_process_by_region, foreach='regions')

    @batch(cpu=48, memory=384000, queue='metaflow-684414486554-55ff636')
    @step
    def post_process_by_region(self):
        current_region_name = self.input
        # do some really exciting stuff!
        self.next(self.join)

    @step
    def join(self, inputs):
        # do some even more really exciting stuff!
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
   Postprocessing()
I run it as a (scheduled) step function like so:
python Postprocessing.py --package-suffixes .sql --environment=conda --with retry step-functions create
python Postprocessing.py --package-suffixes .sql --environment=conda step-functions trigger
Any idea why it does not run in parallel? Thanks!
s
Max vCPUs in the Batch config is set to 96. An instance with 48 vCPUs offers slightly fewer than 48 vCPUs to tasks, since AWS Batch reserves a bit for itself. So running a task that requests 48 vCPUs forces an instance with more than 48 vCPUs, and within the 96-vCPU cap there is then no room for a second concurrent 48-vCPU task.
h
It's actually the memory, not the CPU. Try requesting something like 90% of the instance's memory instead of the full amount: https://docs.aws.amazon.com/batch/latest/userguide/ecs-reserved-memory.html
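As a sketch, this is roughly what a ~90% memory request could look like in the decorator. The flow and step names are placeholders, and the ~345000 MB value is just a rule of thumb (roughly 90% of 384000), not an exact figure:
from metaflow import FlowSpec, step, batch

class MemoryHeadroomSketch(FlowSpec):

    @step
    def start(self):
        self.next(self.heavy_step)

    # Request ~90% of the 384 GB instance memory (384000 -> ~345000 MB) so the
    # memory ECS reserves for itself still fits on the same instance.
    @batch(cpu=48, memory=345000, queue='metaflow-684414486554-55ff636')
    @step
    def heavy_step(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MemoryHeadroomSketch()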
s
Ah right. The CPU reservation applies to EKS and not AWS Batch
h
Ok thanks, I will try. Mind you, I do not think I can do @batch(cpu=8, memory=32000, queue='metaflow-684414486554-55ff636') instead of @batch(cpu=8, memory=64000, queue='metaflow-684414486554-55ff636'). IIRC it has to align exactly with these "permutations": https://aws.amazon.com/ec2/instance-types/r7i/ Or maybe you mean something else?
w
it doesn't have to align exactly - we can pack multiple jobs across flows onto a single instance if there is space
👍 1
h
Batch will allocate an instance that has at least what was requested. So you can request odd numbers like 7 cores and it will spin up an instance with at least 8 cores, which means it can then schedule another job that only requested 1 core on that same instance (assuming the memory is sufficient)
👍 1
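A small illustration of the packing described above; the instance size and memory figures here are made-up assumptions, not real Batch numbers:
# Bin-packing sketch: one "odd" 7-vCPU job plus a 1-vCPU job (possibly from
# another flow) sharing the same 8-vCPU instance, memory permitting.
instance = {'vcpus': 8, 'memory': 16000}   # hypothetical instance Batch picks
job_a = {'cpu': 7, 'memory': 12000}        # hypothetical odd-sized request
job_b = {'cpu': 1, 'memory': 2000}         # smaller job from another flow

fits = (job_a['cpu'] + job_b['cpu'] <= instance['vcpus']
        and job_a['memory'] + job_b['memory'] <= instance['memory'])
print(fits)  # True -> both jobs can land on the same instance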
h
Thanks, I have to admit I do not follow 100%. I have these available:
c7i.large, r7i.xlarge, r7i.16xlarge, r7i.12xlarge, r7i.8xlarge, r7i.24xlarge, r7i.large, r7i.4xlarge, r7i.2xlarge
and my Maximum vCPUs is set to 96. I cannot see a max memory. Let us say, ideally, I want to run N >> 10 jobs in parallel using foreach steps, each requiring about 32 GB of memory like my local machine. What would I have to spec? I do not care too much about speed/CPU. Thus far, something like @batch(cpu=4, memory=32000, queue='metaflow-1234') has often resulted in N << 10 parallel jobs, with the rest "stuck" in Runnable.
h
Do you see any messages in the ASG target group logs?
h
Thanks. Will have a look. As it stands, I just tried low CPU (2) and my local machine's memory (32 GB) and I have spun up 17 "parallel" and running (!) jobs. So being less "greedy" CPU-wise appears to help …
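For reference, a sketch of that less-greedy request; the flow/step names and the region list are placeholders:
from metaflow import FlowSpec, step, batch

class RegionFanoutSketch(FlowSpec):

    @step
    def start(self):
        self.regions = ['reg1', 'reg2', 'reg3']   # in practice N >> 10 regions
        self.next(self.post_process_by_region, foreach='regions')

    # 2 vCPUs + 32 GB per task keeps the aggregate vCPU demand well under the
    # 96-vCPU ceiling, so Batch can pack several tasks per instance
    # (memory permitting).
    @batch(cpu=2, memory=32000, queue='metaflow-684414486554-55ff636')
    @step
    def post_process_by_region(self):
        current_region_name = self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    RegionFanoutSketch()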
h
17*2 is still much less than 96 though
h
yes it seems to have worked