I am running into the infamous `An error occurred ...
# ask-metaflow
b
I am running into the infamous `An error occurred (TooManyRequestsException) when calling the DescribeJobDefinitions operation (reached max retries: 4): Too Many Requests` when fanning out to ~100 steps. I see some open PRs about the subject. I am wondering about two things:
• Are there config values (max retries, backoff?) that I can play with?
• Can I catch this issue happening in the Runner API? I would be happy if the whole run fails immediately (or if I can make it fail). The annoying thing is that 2 out of 100 jobs fail to even be registered, yet the whole run keeps going and crashes before the join because some jobs were never registered, wasting a lot of time.
Similar with other step-level errors, like running into a limit when requesting secrets via `@secrets`. How can I catch this / make note of it? Currently I can only figure out why 2 / 100 tasks are stillborn because I happen to see the error fly by in my console output.
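One way to avoid relying on spotting the error by eye is to scan the captured console output for the throttling markers quoted above. The following is only a minimal sketch: `find_throttling_lines` and `THROTTLE_PATTERN` are hypothetical names, and it assumes the run's output is already available as a plain string (how you capture it, for example from the Runner API, is not shown or verified here).

```python
import re
from typing import List

# Markers taken from the error message quoted in this thread.
# Hypothetical pattern; extend it with other errors you care about.
THROTTLE_PATTERN = re.compile(r"TooManyRequestsException|Too Many Requests")


def find_throttling_lines(log_text: str) -> List[str]:
    """Return every log line that looks like an AWS throttling error."""
    return [
        line
        for line in log_text.splitlines()
        if THROTTLE_PATTERN.search(line)
    ]
```

If the returned list is non-empty, the calling script could abort the run early instead of waiting for the join step to fail.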
a
`TooManyRequestsException` for `DescribeJobDefinitions` is unfortunately due to a global AWS limit - a better bet would be to get those limits raised by working with your AWS TAM.
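For readers wondering what "max retries" and "backoff" mean in practice, here is a generic sketch of retrying a throttled call with exponential backoff and full jitter. It is not Metaflow's or botocore's implementation; `call_with_backoff`, `is_throttled`, and all default values are hypothetical and chosen only for illustration. Retrying more patiently can lower the odds of a failure but does not lift the underlying AWS limit.

```python
import random
import time


def call_with_backoff(fn, is_throttled, max_attempts=5,
                      base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while it is throttled.

    fn           -- zero-argument callable to invoke
    is_throttled -- predicate deciding whether an exception is retryable
    sleep        -- injectable sleep function (handy for testing)
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-throttling errors or when attempts are exhausted.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

botocore itself documents the `AWS_MAX_ATTEMPTS` and `AWS_RETRY_MODE` environment variables for its built-in retries; whether they affect the client Metaflow creates for this call is not confirmed in this thread.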
Can you help us with the error that you are running into with `@secrets`?
Also - unfortunately never found time to implement this, which would reduce the probability of running into this issue.