Hey wave I m facing a weird problem with Aws Batch when usin Outerbounds #ask-metaflow


dry-angle-21635

01/30/2025, 11:21 AM

Hey 👋 I'm facing a weird problem with AWS Batch. When using the @batch decorator with `gpu=1` (the job queue being duly connected to a compute environment that launches an instance with a GPU), the job for the decorated step stays in the RUNNABLE state indefinitely (I never waited for the timeout, so I don't have the status reason). The memory I'm requesting (8 GB) is well within what the launched instance family (g5) offers. If I remove `gpu=1` from the decorator call, the job does enter the STARTING state, but the GPU is not available (`torch.cuda.is_available()` returns False). Thank you for your help 😄

hundreds-rainbow-67050

01/30/2025, 2:31 PM

This is usually because Batch is unable to allocate an EC2 instance due to capacity limits. If you look at the ASG target group logs you should be able to see the reason why. There's also a bug in Batch where a job stuck in RUNNABLE (e.g. because it requested impossible resources) blocks any other jobs from launching. In that case you should kill the earliest stuck job in the queue.
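For anyone following along, here is a rough sketch of how the "kill the earliest stuck job" step could be scripted. It assumes boto3 is installed and credentials are configured; the function names and the queue name passed in are hypothetical, and the selection logic is kept separate from the AWS calls so it can be checked without touching a real queue.

```python
from typing import Iterable, Optional


def earliest_runnable(job_summaries: Iterable[dict]) -> Optional[dict]:
    """Return the oldest job summary whose status is RUNNABLE, or None.

    Each summary is expected to look like an entry of the
    `jobSummaryList` returned by Batch's ListJobs call
    (keys used here: jobId, status, createdAt in epoch milliseconds).
    """
    runnable = [j for j in job_summaries if j.get("status") == "RUNNABLE"]
    if not runnable:
        return None
    return min(runnable, key=lambda j: j["createdAt"])


def terminate_earliest_runnable(queue_name: str,
                                reason: str = "Stuck in RUNNABLE") -> Optional[str]:
    """Terminate the oldest RUNNABLE job in `queue_name` (hypothetical helper).

    Returns the terminated job id, or None if nothing was RUNNABLE.
    """
    import boto3  # imported lazily so the pure helper above has no dependency

    batch = boto3.client("batch")
    summaries = []
    paginator = batch.get_paginator("list_jobs")
    for page in paginator.paginate(jobQueue=queue_name, jobStatus="RUNNABLE"):
        summaries.extend(page["jobSummaryList"])

    job = earliest_runnable(summaries)
    if job is None:
        return None
    batch.terminate_job(jobId=job["jobId"], reason=reason)
    return job["jobId"]
```

Calling `terminate_earliest_runnable("my-gpu-queue")` (queue name made up) would remove only the single oldest RUNNABLE job, so it is worth checking the job's status reason first in case it is about to be scheduled.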

dry-angle-21635

01/30/2025, 2:39 PM

I'm not sure this is it, because I see the right instance getting launched and running properly. I'm assuming no instance would be launched if this were due to a capacity limit, right?

hundreds-rainbow-67050

01/30/2025, 2:52 PM

To confirm, you see the G5 instance launched? It’s just not starting the job?

👌 1

hundreds-rainbow-67050

01/30/2025, 2:56 PM

Is the ECS agent running on the instance?

dry-angle-21635

01/30/2025, 3:41 PM

I would say no. After logging in to the launched instance, I checked with:

sudo systemctl status ecs

and it returned "Unit ecs.service could not be found."
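A "could not be found" answer from systemctl usually means the ECS agent is not installed on the AMI at all, which would fit a job that never leaves RUNNABLE even though the instance is up, since Batch places jobs through that agent. Below is a small sketch for checking this from the instance itself. The function names are hypothetical; the local introspection endpoint on port 51678 is the one the ECS agent exposes by default, assuming it has not been disabled or remapped.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# Default local introspection endpoint of the ECS container agent.
ECS_INTROSPECTION_URL = "http://localhost:51678/v1/metadata"


def interpret_systemctl_status(output: str) -> str:
    """Classify the text printed by `systemctl status ecs`.

    Returns "not-installed", "running", or "installed-not-running".
    """
    text = output.lower()
    if "could not be found" in text:
        return "not-installed"
    if "active: active (running)" in text:
        return "running"
    return "installed-not-running"


def agent_metadata(url: str = ECS_INTROSPECTION_URL,
                   timeout: float = 2.0) -> Optional[dict]:
    """Return the agent's metadata as a dict, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "no agent".
        return None
```

If `agent_metadata()` returns None and the unit is "not-installed", the compute environment is probably using an AMI that is not ECS-optimized, so switching it to an ECS-optimized GPU AMI (or installing the agent on the custom one) would be the thing to try.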

hundreds-rainbow-67050

02/05/2025, 2:39 PM

Sorry forgot to respond earlier. Did you manage to get it working?
