Outerbounds

Hey y'all, we've been migrating to Metaflow, and I'm running into an issue with GPU instances. This may not be a Metaflow problem, but I figured I would throw it out there. The default image fails because it isn't configured correctly, but when I switch to a different AMI (right now I'm trying the Deep Learning PyTorch AMI, and also a custom AMI), the job just hangs. I can see the instance being created, but the job stays in RUNNABLE. I've configured it to use the right VPC and the right security group, and I'm not seeing any failures anywhere. Any help is appreciated :pray: (this is with Batch)

can you confirm that the instance is running the ECS agent?
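
For anyone hitting the same thing, here is a rough sketch of one way to check this with boto3. It assumes you know the name of the ECS cluster that Batch created for the compute environment (the `cluster` argument is a placeholder you have to fill in), and the helper names are made up for illustration. If the instance never shows up in the list at all, it most likely never registered with ECS, which would be consistent with a job sitting in RUNNABLE.

```python
def summarize_agent_status(response):
    """Return (EC2 instance id, agentConnected, status) for each container instance.

    `response` is a dict shaped like the result of ECS
    describe_container_instances.
    """
    return [
        (ci.get("ec2InstanceId"), ci.get("agentConnected", False), ci.get("status"))
        for ci in response.get("containerInstances", [])
    ]


def fetch_container_instances(cluster):
    """Fetch container instances for `cluster` (assumed name, supplied by you).

    Only reads the first page of results; a large cluster would need pagination.
    """
    import boto3  # imported lazily so the helper above works without AWS access

    ecs = boto3.client("ecs")
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        # Nothing registered: the instance may not be running the ECS agent.
        return {"containerInstances": []}
    return ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
```

Usage would be something like `summarize_agent_status(fetch_container_instances("my-batch-cluster"))`, where the cluster name is hypothetical.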

hey <@U07K6BVS6HX> thanks for the reply! I will take a look at that. I did some more digging this morning, and I saw that the GPU isn't available on the instance. I need to double-check that the AMI and the instance type are compatible; it seems like there might be a version mismatch with the NVIDIA drivers, or something similar. I don't think it's a Metaflow issue, but I figured I would ask if anyone had a similar experience.
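
A small sketch of how the "GPU isn't available" part could be checked from the ECS side, assuming the instance did register with the cluster. As far as I understand the ECS API, each entry returned by `describe_container_instances` carries a `registeredResources` list, and GPU instances get a resource named `GPU` whose `stringSetValue` holds the GPU IDs. The function name is hypothetical, and the input is just one container instance dict from that response.

```python
def registered_gpu_ids(container_instance):
    """Return the GPU IDs a container instance registered with ECS, if any.

    `container_instance` is one element of the "containerInstances" list
    from an ECS describe_container_instances response.
    """
    for resource in container_instance.get("registeredResources", []):
        if resource.get("name") == "GPU":
            return list(resource.get("stringSetValue", []))
    # No GPU resource registered: the agent did not see a usable GPU.
    return []
```

An empty result on a GPU instance type would point toward the AMI or driver setup rather than toward Metaflow.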