# ask-metaflow
Getting the following error when I am trying to run a multi-gpu training run using metaflow
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response (429) from ECS metadata: You have reached maximum request limit.
The issue I am facing: the dataloader object is created in each GPU process, so get_many() is called once per GPU (I am using torch.distributed.launch for multi-GPU training). This results in multiple simultaneous calls and sometimes in OOM errors.
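One possible way to avoid the repeated calls is to let only one process per node do the download while the others wait for it to finish. The snippet below is a minimal sketch of that idea, not a confirmed fix. It assumes the launcher exposes a LOCAL_RANK environment variable (if yours passes a --local_rank argument instead, pass that value in explicitly) and that all processes on a node share a local directory. fetch_once_per_node, download_fn, and the .download_complete marker file are hypothetical names made up for illustration; download_fn stands in for whatever code wraps the get_many() call.

```python
import os
import time
from pathlib import Path


def fetch_once_per_node(download_fn, cache_dir, local_rank=None,
                        poll_seconds=0.5, timeout_seconds=600.0):
    """Run download_fn(cache_dir) on local rank 0 only; other ranks wait.

    download_fn is a hypothetical callable that would wrap the get_many()
    call and write its results into cache_dir.
    """
    if local_rank is None:
        # Assumption: the launcher sets LOCAL_RANK in the environment.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    cache = Path(cache_dir)
    marker = cache / ".download_complete"  # hypothetical marker file name

    if local_rank == 0:
        cache.mkdir(parents=True, exist_ok=True)
        # Skip the download if an earlier call already completed it.
        # Note: a stale marker left by a previous run would also skip it.
        if not marker.exists():
            download_fn(cache)
            marker.touch()
    else:
        # Non-zero ranks make no download calls; they poll for the marker.
        deadline = time.monotonic() + timeout_seconds
        while not marker.exists():
            if time.monotonic() > deadline:
                raise TimeoutError(f"no completed download in {cache}")
            time.sleep(poll_seconds)

    return cache
```

With this shape, each GPU process would call fetch_once_per_node(...) before building its dataloader and then read from the returned directory, so the credential lookup and the download happen once per node rather than once per GPU.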