dry-angle-21635
01/30/2025, 11:21 AM
gpu=1 (the job queue being duly connected to a compute environment that launches an instance with a GPU), the job associated with the decorated step remains in the RUNNABLE state indefinitely (I never waited for the timeout, though, so I don't have the status reason). Moreover, the memory I'm asking for (8 GB) is entirely acceptable given the instance family launched (g5).
If I remove gpu=1 from the decorator call, the job does enter the STARTING state, but the GPU is not available (torch.cuda.is_available() is False).
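To tell the two symptoms apart, a small check like the one below could be run inside the step. It is only a sketch: gpu_report is a hypothetical helper name, it assumes nvidia-smi would be on PATH whenever the GPU is exposed to the container, and it says nothing about why the job stays in RUNNABLE.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic facts about GPU visibility in the current environment."""
    report = {
        "torch_cuda_available": None,  # stays None if PyTorch cannot be imported
        "nvidia_smi_found": False,
        "nvidia_smi_output": None,
    }

    # PyTorch is optional for this check; skip it if it is missing or fails to load.
    try:
        import torch
        report["torch_cuda_available"] = bool(torch.cuda.is_available())
    except (ImportError, OSError):
        pass

    # Assumption: nvidia-smi is on PATH only when the NVIDIA tooling is exposed here.
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        report["nvidia_smi_found"] = True
        try:
            # "nvidia-smi -L" lists the GPUs the driver can see.
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_output"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi_output"] = f"nvidia-smi failed: {exc}"

    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If nvidia_smi_found is False when the step runs without gpu=1, the container probably has no GPU device exposed to it at all, which would be consistent with torch.cuda.is_available() returning False.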
Thank you for your help 😄
hundreds-rainbow-67050
01/30/2025, 2:31 PM
dry-angle-21635
01/30/2025, 2:39 PM
hundreds-rainbow-67050
01/30/2025, 2:52 PM
hundreds-rainbow-67050
01/30/2025, 2:56 PM
dry-angle-21635
01/30/2025, 3:41 PM
sudo systemctl status ecs
and it returned
"Unit ecs.service could not be found."
hundreds-rainbow-67050
02/05/2025, 2:39 PM