@User Resurrecting this discussion because we are running into the limits of the Kubernetes scheduler on EKS. We're currently running 1000+ jobs, and the scheduler seems to be struggling to keep up even though we have enough resources available within our nodegroups/limits.
Some of the things we are currently implementing:
• lowering the TTL of the job objects, since we've seen a demonstrated slowdown with 50k+ jobs (the current 7 days is too long)
• having the data scientists coordinate manually so they don't schedule jobs simultaneously
How would I go about leveraging Volcano with Metaflow? I see you mentioned it in this thread, but my guess is that it's not just plug-and-play.