Outerbounds #dev-metaflow

microscopic-plastic-70677

12/05/2024, 8:19 AM

Hi Team, I hope you're all doing well! We've been encountering recurring issues with some AWS Batch queues where jobs get stuck in the Runnable state. This seems to be caused by resource availability constraints or potential misconfigurations in our workflows.

To address this, I’d like to propose implementing a Job State Limit for these jobs in the metaflow-computation module. According to the AWS documentation, the job queue resource can be extended with this capability. Here’s the suggested approach:
• Introduce an optional variable that allows us to configure the timeout and define the action to take when a job exceeds the allowed state duration.

The benefits of this change are clear:
• It would prevent jobs from running excessively long (in our case, we had a 40+ hour run over a weekend), which has been blocking new executions and impacting production workloads.
• It addresses an issue several teams have mentioned in Slack threads, making it a valuable improvement for our broader community.

I'm happy to take the lead on implementing this change if there's consensus. Please let me know your thoughts and any feedback on the proposed approach. Looking forward to collaborating on this!
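For anyone following along, here is a rough Python sketch of what the proposed optional setting could translate to on the AWS Batch side. It is not the code from the PR. It assumes the job queue's `jobStateTimeLimitActions` field, and the reason strings and the 600 to 86400 second range are as I recall them from the AWS Batch docs, so please verify them before relying on this. The helper name `build_job_state_time_limit_actions` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union

# Reasons AWS Batch reports for jobs stuck in RUNNABLE.
# Assumption: recalled from the AWS Batch docs; verify before use.
RUNNABLE_REASONS = (
    "CAPACITY:INSUFFICIENT_INSTANCE_TYPE",
    "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
    "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
)


def build_job_state_time_limit_actions(
    max_time_seconds: Optional[int],
    action: str = "CANCEL",
    reasons: Sequence[str] = RUNNABLE_REASONS,
) -> List[Dict[str, Union[str, int]]]:
    """Hypothetical helper: build one time-limit entry per RUNNABLE reason.

    Passing None keeps the feature off, mirroring the "optional variable"
    idea in the proposal.
    """
    if max_time_seconds is None:
        return []
    # Assumption: AWS documents a 600..86400 second range for maxTimeSeconds.
    if not 600 <= max_time_seconds <= 86400:
        raise ValueError("max_time_seconds must be between 600 and 86400")
    return [
        {
            "reason": reason,
            "state": "RUNNABLE",
            "maxTimeSeconds": max_time_seconds,
            "action": action,
        }
        for reason in reasons
    ]


# Example: cancel jobs that sit in RUNNABLE for more than one hour.
actions = build_job_state_time_limit_actions(3600)

# Applying it with boto3 would look roughly like this (the queue name is a
# placeholder, and the call is commented out so the sketch runs offline):
# import boto3
# boto3.client("batch").update_job_queue(
#     jobQueue="my-metaflow-queue",
#     jobStateTimeLimitActions=actions,
# )
```

In the Terraform module itself, the same values would presumably be passed through to the job queue resource, with the variable left unset by default so existing deployments keep their current behavior.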

👀 1

🙌 1

microscopic-plastic-70677

12/11/2024, 11:57 AM

I took the lead and have already sent a PR with the suggested changes. Looking forward to your response 😊

fresh-accountant-22910

01/24/2025, 3:49 PM

I’ve made a comment on the PR.

microscopic-plastic-70677

02/03/2025, 11:55 AM

Thanks @fresh-accountant-22910, I implemented your suggestions.
