The human-centric platform for production ML & AI

Outerbounds

Hello fellow Metaflow users,

Could I please kindly survey how long it takes you to provision an AWS Batch GPU instance? For us, it's about 8 minutes overhead (the time elapsed from API runtime call to Batch Instance status transitioning to "STARTING") and it looks like most of that time is due to EC2 initialization. We suspect it's because we are using an internal security hardened AMI that has a lot of bloat. We are trying to think of ways to reduce that overhead because 8 minutes to start up a batch instance is quite a long time and some of our users are finding that a bit frustrating when debugging.

I can clarify that we set min cpus and desired cpus to both zero for the launch template