some-noon-7401
04/09/2023, 9:40 AM
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response (429) from ECS metadata: You have reached maximum request limit.
The issue I am facing:
• The dataloader object is created in each GPU process (I am using torch.distributed.launch for multi-GPU training), so get_many() is called once per GPU. This results in multiple calls and sometimes in OOM errors.
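One possible way to cut down the duplicate work, sketched under assumptions: if each GPU process only needs its own slice of the data (as with a distributed sampler), the list of keys could be split by rank before anything is fetched. The helper name `shard_for_rank` is hypothetical, and the sketch assumes the launcher exports the `RANK` and `WORLD_SIZE` environment variables to each process.

```python
import os


def shard_for_rank(keys, rank=None, world_size=None):
    """Return the subset of `keys` that this process should fetch.

    Hypothetical helper: rank and world size default to the RANK and
    WORLD_SIZE environment variables (assumed to be set by the launcher),
    falling back to a single-process setup when they are absent.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", 0))
    if world_size is None:
        world_size = int(os.environ.get("WORLD_SIZE", 1))
    # Strided slicing gives every rank a disjoint, near-equal share.
    return keys[rank::world_size]


if __name__ == "__main__":
    all_keys = [f"part-{i}" for i in range(10)]  # placeholder keys
    # Show which keys each of 4 hypothetical ranks would fetch.
    for r in range(4):
        print(r, shard_for_rank(all_keys, rank=r, world_size=4))
```

Each process would then pass only its own shard to get_many(), so no process holds the full dataset in memory. This does not by itself remove the per-process credential lookups behind the 429 error, so it may need to be combined with retries or staggered start-up.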