Hi guys. I am running a flow using a command like ...
# ask-metaflow
Hi guys. I am running a flow using a command like this:
python my_flow.py run --with kubernetes:namespace=metaflow,service_account=metaflow --max-num-splits 500
The flow has 3 steps: `start` (no resources definition), `a_middle_step` (`@resources(gpu=1, memory=16 * 1024)`), and `end` (no resources definition). The `start` step has a `self.next(self.a_middle_step, foreach="keys")`. I can see that the `start` step is executed in a pod with no errors and triggers the subsequent foreach jobs/pods. However, all the pods for `a_middle_step` stay in Pending status. In our cluster we have a node group with instance types: [g4dn.xlarge, g4dn.2xlarge, g5.16xlarge, g5.2xlarge, g5.4xlarge, g5.8xlarge, g5.xlarge]. However, I get this output in EKS:
Warning	FailedScheduling	2 minutes ago	default-scheduler	0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory, 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Normal	NotTriggerScaleUp	7 minutes ago	cluster-autoscaler	pod didn't trigger scale-up: 3 Insufficient memory, 3 Insufficient nvidia.com/gpu, 5 node(s) had untolerated taint {dedicated: gpu}, 2 Insufficient ephemeral-storage
The pods just stay in Pending for a long time. Any suggestions on how to address this issue? Thanks a lot for the great work 🙏