fast-pizza-24629
06/14/2023, 6:30 PM
I'm running: `python my_flow.py run --with kubernetes:namespace=metaflow,service_account=metaflow --max-num-splits 500`
The flow has 3 steps: `start` (no resources definition), `a_middle_step` (`@resources(gpu=1, memory=16 * 1024)`), and `end` (no resources definition). The `start` step has a `self.next(self.a_middle_step, foreach="keys")`. I can see that the `start` step is executed in a pod with no errors and triggers the subsequent foreach jobs/pods. However, all the pods for the `a_middle_step` step stay in Pending status.
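Roughly, the flow looks like this (the step bodies and the foreach values are simplified placeholders, not the real code):

```python
from metaflow import FlowSpec, step, resources


class MyFlow(FlowSpec):

    @step
    def start(self):
        # no resources definition on this step
        self.keys = ["a", "b", "c"]  # placeholder foreach values
        self.next(self.a_middle_step, foreach="keys")

    @resources(gpu=1, memory=16 * 1024)  # 1 GPU, 16 GB of memory per task
    @step
    def a_middle_step(self):
        self.key = self.input
        self.next(self.end)

    @step
    def end(self, inputs):
        # no resources definition; joins the foreach branches
        pass


if __name__ == "__main__":
    MyFlow()
```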
In our cluster we have a node group with the instance types [g4dn.xlarge, g4dn.2xlarge, g5.16xlarge, g5.2xlarge, g5.4xlarge, g5.8xlarge, g5.xlarge]. However, I get this output in EKS:
Warning FailedScheduling 2 minutes ago default-scheduler 0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory, 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Normal NotTriggerScaleUp 7 minutes ago cluster-autoscaler pod didn't trigger scale-up: 3 Insufficient memory, 3 Insufficient nvidia.com/gpu, 5 node(s) had untolerated taint {dedicated: gpu}, 2 Insufficient ephemeral-storage
And the pods just stay in Pending for a long time. Any suggestion on how to address this issue? Thanks a lot for the great work 🙏