Hey :wave: I'm facing a weird problem with Aws Ba...
# ask-metaflow
d
Hey 👋 I'm facing a weird problem with AWS Batch. When using the @batch decorator with
gpu=1
(the job queue is connected to a compute environment that launches an instance with a GPU), the job associated with the decorated step remains in the RUNNABLE state indefinitely (I never waited for the timeout though, so I don't have the status reason...). Moreover, the memory I'm requesting (8 GB) is well within what the launched instance family (g5) offers. If I remove the
gpu=1
inside the decorator call, the job does enter the STARTING state, but the gpu is not available (as
torch.cuda.is_available()
is False). Thank you for your help 😄
h
This usually happens because Batch is unable to allocate an EC2 instance due to capacity limits. If you look at the ASG target group logs you should be able to see the reason why. There's also a bug in Batch where a job stuck in RUNNABLE (e.g. because it requested impossible resources) prevents any other job from launching. In that case you should kill the earliest stuck job in the queue.
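For reference, both checks can be scripted. This is a rough sketch rather than a definitive tool: the helper names `oldest_runnable_job` and `failed_scaling_activities` are made up for illustration, and the clients are assumed to be boto3 `batch` and `autoscaling` clients passed in by the caller. It assumes the ASG's scaling activities are where a capacity error would surface.

```python
def oldest_runnable_job(batch_client, job_queue):
    """Return the summary of the oldest RUNNABLE job in job_queue, or None."""
    jobs = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        jobs.extend(page.get("jobSummaryList", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # fetch the next page of results
    if not jobs:
        return None
    # createdAt is a Unix timestamp in milliseconds; smallest means oldest.
    return min(jobs, key=lambda job: job["createdAt"])


def failed_scaling_activities(autoscaling_client, asg_name, max_records=20):
    """Return (StatusCode, StatusMessage) pairs for failed or cancelled activities."""
    page = autoscaling_client.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=max_records
    )
    return [
        (activity["StatusCode"], activity.get("StatusMessage", ""))
        for activity in page.get("Activities", [])
        if activity["StatusCode"] in ("Failed", "Cancelled")
    ]


# Example use against a real account (queue and ASG names are placeholders):
#   import boto3
#   job = oldest_runnable_job(boto3.client("batch"), "my-job-queue")
#   print(job and (job["jobId"], job.get("statusReason")))
#   print(failed_scaling_activities(boto3.client("autoscaling"), "my-asg"))
```

If the oldest RUNNABLE job turns out to be one that can never be placed, it can be cancelled with the Batch client's `cancel_job(jobId=..., reason=...)`.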
d
I'm not sure this is it, because I see the right instance getting launched and running properly. I'm assuming no instance would be launched if this were due to a capacity limit, right?
h
To confirm, you see the G5 instance launched? It’s just not starting the job?
Is the ECS agent running on the instance?
d
I would say no. I logged in to the launched instance and checked with:
sudo systemctl status ecs
and it returned "Unit ecs.service could not be found."
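The same question can also be answered from outside the instance by asking ECS whether the instance ever registered with the cluster behind the compute environment. This is a minimal sketch: `ecs_agent_status` is a hypothetical helper, `ecs_client` is assumed to be a boto3 `ecs` client, and only the first page of container instances (up to 100) is inspected.

```python
def ecs_agent_status(ecs_client, cluster, ec2_instance_id):
    """Return 'connected', 'disconnected', or 'not registered' for an EC2 instance."""
    listed = ecs_client.list_container_instances(cluster=cluster)
    arns = listed.get("containerInstanceArns", [])
    if not arns:
        # No instance has registered with this cluster at all.
        return "not registered"
    described = ecs_client.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    for instance in described.get("containerInstances", []):
        if instance.get("ec2InstanceId") == ec2_instance_id:
            return "connected" if instance.get("agentConnected") else "disconnected"
    return "not registered"


# Example use (cluster ARN and instance id are placeholders):
#   import boto3
#   print(ecs_agent_status(boto3.client("ecs"), "my-cluster-arn", "i-0123456789abcdef0"))
```

The cluster ARN is reported in the `ecsClusterArn` field returned by the Batch client's `describe_compute_environments`. A result of "not registered" would be consistent with the missing `ecs.service` unit. One possible cause (an assumption, not confirmed in this thread) is that the compute environment uses an AMI that does not ship the ECS agent.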
h
Sorry forgot to respond earlier. Did you manage to get it working?