# dev-metaflow
m
Hi, looking for some advice from the community or input from the Outerbounds team. This is an issue that was brought up around a year ago and it's still causing us problems, so I was wondering whether there is any new advice or potential solutions out there. I know there was an open PR purporting to solve this with an exponential backoff solution, but there was a suggestion that it would be superseded by a less hacky solution from the Outerbounds team. Our Metaflow setup is all on AWS with AWS Batch as our compute, and our orchestration is a custom in-house tool. We run a relatively large number of parallel jobs, sometimes around 1000 at once. What we experience, however, is that everything appears to work nicely, but once we get to the end, maybe 5-10% of jobs have failed silently due to this error:
```
botocore.exceptions.ClientError: An error occurred (TooManyRequestsException) when calling the DescribeJobDefinitions operation (reached max retries: 4): Too Many Requests
```
This, as has been discussed here previously, is caused by the heartbeat of each parallel process on the orchestrator hitting the request rate limit on the AWS Batch API. We do have a workaround that involves logging which jobs start and then restarting only those that have failed, but we would love not to have to run the flow multiple times to get it to succeed, hence looking for advice. Thanks in advance
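For context, here is a minimal sketch of the exponential-backoff idea mentioned above, expressed with botocore's built-in retry configuration. This is not Metaflow's internal client setup, just an illustration of the mechanism; the retry count and the query parameters are arbitrary example values.

```python
# Sketch only: configure boto3/botocore to retry throttled Batch API calls
# with backoff instead of failing after the default 4 attempts.
import boto3
from botocore.config import Config

# "adaptive" mode adds client-side rate limiting on top of exponential
# backoff; "standard" gives plain exponential backoff with jitter.
# max_attempts=10 is an arbitrary example value.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

batch = boto3.client("batch", config=retry_config)

# The same call that throttles in the traceback above; with this config
# botocore backs off and retries rather than surfacing
# TooManyRequestsException so early.
response = batch.describe_job_definitions(status="ACTIVE", maxResults=10)
print(f"Fetched {len(response['jobDefinitions'])} job definitions")
```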
a
@melodic-train-1526 given that these limits are global in nature, have you spoken to your AWS TAMs to raise the quota for these API calls?
m
@ancient-application-36103 Hey Savin, I think we did before but nothing came of it. We are considering chasing again, but wanted to check there was nothing on the Metaflow side that we could do
a
Unfortunately, there isn't a proper fix for this issue given that it is a global limit. We can definitely reduce the probability of occurrence a bit (by relying on AWS Batch Array jobs), but you can imagine that if you have 100s of workflows in flight at the same time, you are still likely to run into this issue with AWS Batch. Another way to reduce the probability would be to set max-workers to a lower value, reducing the total number of in-flight jobs.
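For reference, capping concurrency can be done on the run command itself; a minimal example, assuming a hypothetical flow file parallel_flow.py (--max-workers is the standard Metaflow run option, and 16 is an arbitrary value):

```
python parallel_flow.py run --with batch --max-workers 16
```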
m
Okay, thanks a lot Savin, appreciate your input
Hi @ancient-application-36103 @straight-shampoo-11124 - just an update on this: we have had confirmation from AWS that they are unable to change these limits. Given this, it would be good to discuss the possibility of offering tuning of the calls to AWS Batch, particularly giving users the option to reduce the frequency of the calls that each job makes to the API. Is this something you would consider?
a
Totally - integrating with Batch Array jobs will be useful here
s
@melodic-train-1526 totally. Maybe most efficient to chat over Zoom. Would any day next week at 4:30pm London time / 8:30am SF time work for you?
m
Hi @straight-shampoo-11124, maybe towards the end of the week; Thursday could work?
s
cool. Invite sent for Thu!