elegant-beach-10818
08/01/2023, 8:18 PM
Temporary failure in name resolution
full error log here
My initial theory was that we were overloading a particular instance. The job we were running was set to have 2 CPUs and 2 GB of RAM and to schedule ~20 jobs in its parallel step. However, whether these jobs all get scheduled on the same node or across multiple nodes, the same error pops up.
It's also odd in that it doesn't fail at the same point every time. Occasionally it will fail when pip installing requests, other times when installing awscli, etc.
I tried to see if it was an issue with Route 53 DNS throttling. I followed the steps described here, but again didn't see any indication that this was the problem.
Any thoughts on where to go from here? Right now our jobs have about a 10% failure rate.
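In case it helps with reproducing this outside of pip, here is a minimal sketch of a probe that could run inside a job step to count how often name resolution fails. The function name probe_dns and its parameters are made up for illustration, and it assumes the failure surfaces in Python as socket.gaierror, which is how getaddrinfo errors are normally reported.

```python
import socket
import time


def probe_dns(hosts, attempts=20, delay=0.5, resolver=socket.getaddrinfo):
    """Resolve each host repeatedly and count failed lookups per host.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the probe can be exercised without network access.
    """
    failures = {host: 0 for host in hosts}
    for _ in range(attempts):
        for host in hosts:
            try:
                resolver(host, 443)
            except socket.gaierror:
                # Resolver errors such as "Temporary failure in name
                # resolution" are raised as socket.gaierror.
                failures[host] += 1
        time.sleep(delay)
    return failures
```

Calling probe_dns(["pypi.org", "files.pythonhosted.org"]) at the start of a job step (those hostnames are an assumption based on pip's default index) and logging the returned counts per node should show whether lookups fail at roughly the same rate everywhere or only on particular nodes.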