# dev-metaflow
m
Hi, looking for some advice from the community or input from the Outerbounds team. This is an issue that was brought up around a year ago and it's still causing us problems, so I was wondering whether there is any new advice or potential solutions out there. I know there was an open PR purporting to solve this with an exponential backoff solution, but there was a suggestion that it would be superseded by a less hacky solution from the Outerbounds team. Our Metaflow setup is all on AWS with AWS Batch as our compute, and our orchestration is a custom in-house tool. We run a relatively large number of parallel jobs, sometimes around 1000 at once. What we experience, however, is that everything appears to work nicely, but once we get to the end, maybe 5-10% of jobs have failed silently due to this error:
```
botocore.exceptions.ClientError: An error occurred (TooManyRequestsException) when calling the DescribeJobDefinitions operation (reached max retries: 4): Too Many Requests
```
This, as has been discussed here previously, is caused by the heartbeat of each parallel process on the orchestrator hitting the request rate limit on the AWS Batch API. We do have a workaround that involves logging which jobs start and then restarting only those that have failed, but we would love not to have to run the flow multiple times to get it to succeed, hence looking for advice. Thanks in advance
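For context, here is a minimal sketch of the exponential-backoff idea mentioned above, expressed with botocore's built-in retry configuration. This is not Metaflow's internal client setup, just an illustration of the mechanism; the retry count and the query parameters are arbitrary example values.

```python
# Sketch only: configure boto3/botocore to retry throttled Batch API calls
# with backoff instead of failing after the default 4 attempts.
import boto3
from botocore.config import Config

# "adaptive" mode adds client-side rate limiting on top of exponential
# backoff; "standard" gives plain exponential backoff with jitter.
# max_attempts=10 is an arbitrary example value.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

batch = boto3.client("batch", config=retry_config)

# The same call that throttles in the traceback above; with this config
# botocore backs off and retries rather than surfacing
# TooManyRequestsException so early.
response = batch.describe_job_definitions(status="ACTIVE", maxResults=10)
print(f"Fetched {len(response['jobDefinitions'])} job definitions")
```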
a
@melodic-train-1526 given that these limits are global in nature, have you spoken to your AWS TAMs to raise the quota for these API calls?
m
@ancient-application-36103 Hey Savin, I think we did before but nothing came of it. We are considering chasing again, but wanted to check there was nothing on the Metaflow side that we could do
a
Unfortunately, there isn't a proper fix for this issue given that it is a global limit. We can definitely reduce the probability of occurrence a bit (by relying on AWS Batch Array jobs), but you can imagine that if you have 100s of workflows in flight at the same time, you are still likely to run into this issue with AWS Batch. Another way to reduce the probability would be to set max-workers to a lower value, reducing the total number of in-flight jobs.
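For reference, capping concurrency can be done on the run command itself; a minimal example, assuming a hypothetical flow file parallel_flow.py (--max-workers is the standard Metaflow run option, and 16 is an arbitrary value):

```
python parallel_flow.py run --with batch --max-workers 16
```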
m
Okay, thanks a lot Savin, appreciate your input
Hi @ancient-application-36103 @straight-shampoo-11124 - just an update on this: we have had confirmation from AWS that they are unable to change these limits. Given this, it would be good to discuss the possibility of offering tuning of the calls to AWS Batch, particularly giving users the option to reduce the frequency of the calls that each job makes to the API. Is this something you would consider?
a
Totally - integrating with Batch Array jobs will be useful here
s
@melodic-train-1526 totally. Maybe most efficient to chat over Zoom. Would any day next week at 4:30pm London time / 8:30am SF time work for you?
m
Hi @straight-shampoo-11124, maybe towards the end of the week; Thursday could work?
s
cool. Invite sent for Thu!