<@U01P6EGHFDW> Resurrecting this discussion because we are running into the limits of the Kubernetes scheduler on EKS. We’re currently running 1000+ jobs, and the scheduler seems to be struggling to keep up even though we have enough resources available within our nodegroups/limits.

Some of the things we are currently implementing:
• lowering the TTL of the job objects, since we’ve seen a demonstrated slowdown with 50k+ jobs (the current 7 days is too long)
• having the data scientists coordinate manually so they don’t schedule simultaneously
How would I go about leveraging Volcano with Metaflow? I see you mentioned it in this thread, but my guess is that it’s not just plug-and-play.
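
For reference, a rough sketch of the TTL change from the first bullet. It assumes the jobs are plain `batch/v1` Job objects and relies on the standard `spec.ttlSecondsAfterFinished` field. The helper name `with_ttl`, the example manifest, and the one-hour value are all hypothetical, picked only for illustration. It only patches a manifest dict, and how the value would actually be passed through Metaflow is not shown here.

```python
import copy
import json


def with_ttl(job_manifest: dict, ttl_seconds: int) -> dict:
    """Return a copy of a batch/v1 Job manifest with spec.ttlSecondsAfterFinished set.

    Hypothetical helper: once a Job finishes, the Kubernetes TTL-after-finished
    controller deletes it after this many seconds.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    patched = copy.deepcopy(job_manifest)  # leave the caller's manifest untouched
    patched.setdefault("spec", {})["ttlSecondsAfterFinished"] = ttl_seconds
    return patched


# Hypothetical minimal Job manifest, for illustration only.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-step"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "main", "image": "python:3.11", "command": ["python", "-V"]}
                ],
            }
        }
    },
}

ONE_HOUR = 60 * 60  # assumed value; pick whatever retention you actually need
patched_job = with_ttl(example_job, ONE_HOUR)
print(json.dumps(patched_job["spec"], indent=2))
```

A shorter TTL means finished Job objects get garbage-collected sooner, so the API server and scheduler have fewer objects to track.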