Hello!
Need some help with failed runs. This might be more of an AWS question, but any help is appreciated.
I have a large batch step with 7000+ splits (I will look into grouping them into a smaller number of splits). After a few smooth tasks of this step, it fails with a Docker timeout. I have seen other threads about checking the EBS burst balance of failed tasks and about using custom launch templates.
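For the grouping idea, one common pattern is to chunk the item list before the foreach so each Batch task handles a batch of items instead of one. A minimal sketch of the chunking helper (the 7000-item list and chunk size of 100 are just illustrative assumptions, not from my actual flow):

```python
def chunk(items, size):
    """Split a list into consecutive sub-lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical example: 7000 work items -> 70 foreach splits of 100 items,
# so the fan-out (and number of Batch jobs) drops by 100x.
work_items = list(range(7000))
splits = chunk(work_items, 100)
```

Passing `splits` to `self.next(..., foreach=...)` would then launch one task per chunk, with each task iterating over its chunk internally.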
1. The EBS burst balances seem to recover quickly, so is re-running the flow and monitoring until tasks fail again the only way to catch this?
2. A more important Metaflow-related question: the console output of my flow script showed the step failed and the tasks marked "killed by orchestrator". The UI still showed one of these tasks as running, and it was in fact still running when I looked at the Batch Jobs console. Any idea why or when this happens, i.e. Metaflow says it's killed but the job is not actually killed?
Thank you!