Hi Team,
I hope you're all doing well!
We've been encountering recurring issues with jobs in some AWS Batch queues getting stuck in the RUNNABLE state. This seems to be caused by resource availability constraints or potential misconfigurations in our workflows. To address this, I'd like to propose implementing a Job State Limit for these jobs in the metaflow-computation module. According to the AWS documentation, the job queue resource can be extended to support this.
Here's the suggested approach:
• Introduce an optional variable that allows us to configure the timeout and define the action to take when a job exceeds the allowed state duration.
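To make the proposal concrete, here is a rough Python sketch of the payload such a variable could map to. It assumes the AWS Batch job queue API accepts a `jobStateTimeLimitActions` list with `reason`, `state`, `maxTimeSeconds`, and `action` fields, and that the allowed timeout range is 600 to 86400 seconds; please check both against the current AWS documentation. The helper names (`build_job_state_time_limit_actions`, `apply_to_queue`) and the default reason are hypothetical and not part of the module.

```python
from typing import Dict, List, Optional

# Assumed bounds for maxTimeSeconds (10 minutes to 24 hours); verify in the AWS docs.
MIN_SECONDS = 600
MAX_SECONDS = 86_400


def build_job_state_time_limit_actions(
    max_time_seconds: Optional[int],
    reason: str = "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",  # assumed default
    action: str = "CANCEL",
) -> List[Dict[str, object]]:
    """Return the job state limit payload, or an empty list when the option is unset."""
    if max_time_seconds is None:
        # Optional variable left unset: keep today's behaviour (no limit).
        return []
    if not MIN_SECONDS <= max_time_seconds <= MAX_SECONDS:
        raise ValueError(
            f"max_time_seconds must be between {MIN_SECONDS} and {MAX_SECONDS}"
        )
    return [
        {
            "reason": reason,
            "state": "RUNNABLE",
            "maxTimeSeconds": max_time_seconds,
            "action": action,
        }
    ]


def apply_to_queue(job_queue: str, actions: List[Dict[str, object]]) -> None:
    """Hypothetical helper: push the limit to an existing queue via boto3."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("batch").update_job_queue(
        jobQueue=job_queue,
        jobStateTimeLimitActions=actions,
    )
```

For example, `build_job_state_time_limit_actions(3600)` would describe cancelling jobs that sit in RUNNABLE for more than an hour because of insufficient capacity. In the module itself the same shape would be expressed through the optional variable rather than Python.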
The benefits of this change are clear:
• It would prevent jobs from staying stuck for excessively long periods (in our case, one run lasted 40+ hours over a weekend), which has been blocking new executions and impacting production workloads.
• It addresses an issue several teams have mentioned in Slack threads, making it a valuable improvement for our broader community.
I'm happy to take the lead on implementing this change if there's consensus. Please let me know your thoughts and any feedback on the proposed approach.
Looking forward to collaborating on this!