So I try to schedule a flow that uses aws batch as...
# ask-metaflow
h
So I try to schedule a flow that uses aws batch as compute and aws step functions to schedule. It seems to hang at some point but looking at cloud watch (e.g. in aws batch) I cannot see anything obvious wrong but in the metaflow ui I get the below. So something must have failed and it also retries for whatever reason. How could I best log/identify what is wrong more explicitly? PS: I was told by devops that: “AWS Batch is running on managed ec2 instances. But they do not have any cloudwatch agent installed.” if I understand there is a metaflow “custom terraform module” …. which we would need to debug to get the agent installed? maybe there is an easier fix?
h
If you don't want to edit that terraform module you can directly edit the launch template associated with the Batch ASG to install the CW agent, but making manual changes isn't really recommended if you're managing your infrastructure via code
h
Thanks. Is there no way to see what led to the failure shown in metflow’s ui and the resulting retries? There is nothing in any logs (e.g. AWS batch ones …) it may be some weird time out as some parts of the code are not (at the moment) efficient. - e.g. they use “apply” via pandas … I can see/have access to a lot in AWS (batch, step, cloud watch you name it and our metaflow env. was installed using “your” TF etc.) - any pointers welcome.
h
The UI just shows the same logs from Batch. You can still check the job status code in Batch though. If it's OOM it will show 137
❤️ 1
h
oops thanks - had a look at the aws batch log only - see below. I used: @batch(cpu=16, memory=30000, kind of the the same as my local machine memory wise (Mac M1 …) will try something larger:
So I have marked my step with:
Copy code
@batch(cpu=36, memory=60000, queue='metaflow-684414486554-55ff636')
but when I look at the running job’s container it says - see screenshot below. Could this be the issue? My use case is that i schedule the flow like so:
Copy code
python AmazingFlow.py --environment=pypi --package-suffixes .sql --with retry step-functions create
python AmazingFlow.py --environment=pypi --package-suffixes .sql step-functions trigger
Maybe the above decorator is ignored?
h
that's weird. can you look at the definition in SFN?