The human-centric platform for production ML & AI

Outerbounds

Have a perplexing problem...

Intermittently in the environment bootstrapping I'm seeing errors about `Temporary failure in name resolution` full error log <https://gist.github.com/tylerpotts/bf089dcaf5daf10639c31e3a62667712|here>

My initial theory was that we were overloading a particular instance. The job we were running was set to have 2cpu and 2GB RAM and schedule ~20 jobs in its parallel step. However whether these jobs all get scheduled on the same node or multiple nodes, the same error pops up.

It's also odd in that it doesn't fail at the same point every time. Occasionally it will fail when pip installing `requests`, other times when installing `awscli` etc.

I tried to see if it was an issue with <https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits|route53 DNS throttling>. I followed the steps <https://repost.aws/knowledge-center/vpc-find-cause-of-failed-dns-queries|described here>, but again didn't see an indication that this was the problem

Any thoughts on where to go from here? Right now our jobs are having about a 10% failure rate