Has anyone seen Metaflow and Argo pods get deleted during scale-down events on Kubernetes? It seems that when the cluster autoscaler scales down, it picks nodes seemingly at random. If Metaflow/Argo pods happen to be running on one of those nodes, the pods get terminated and the job fails.

We've tried adding the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: false` to the Argo pods, but that doesn't prevent the issue.
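
For reference, here is a rough sketch of what that annotation looks like on a pod's metadata. It is only an illustration, not our actual setup: the manifest, the pod name, the image, and the helper `with_safe_to_evict_false` are all made up. The one detail it highlights is that Kubernetes annotation values are strings, so the value is the string `"false"` rather than a boolean, and it needs to end up on the running pod itself.

```python
import json

# Annotation key read by the Kubernetes cluster autoscaler.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def with_safe_to_evict_false(pod_manifest: dict) -> dict:
    """Return a copy of the manifest with the annotation set to the string "false".

    Hypothetical helper, for illustration only.
    """
    patched = json.loads(json.dumps(pod_manifest))  # cheap deep copy
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[SAFE_TO_EVICT] = "false"  # annotation values must be strings
    return patched


# Hypothetical pod manifest (name and image are placeholders).
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-step-pod"},
    "spec": {"containers": [{"name": "main", "image": "python:3.11"}]},
}

patched_pod = with_safe_to_evict_false(example_pod)
print(json.dumps(patched_pod["metadata"], indent=2))
```

Comparing this against the output of `kubectl get pod <pod-name> -o yaml` for a running step pod would show whether the annotation actually made it onto the pod.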