Outerbounds #ask-metaflow

fast-pizza-24629

06/14/2023, 6:30 PM

Hi guys. I am running a flow using a command like this:

python my_flow.py run --with kubernetes:namespace=metaflow,service_account=metaflow --max-num-splits 500

The flow has 3 steps: `start` (no resources definition), `a_middle_step` (`@resources(gpu=1, memory=16 * 1024)`), and `end` (no resources definition). The `start` step has a `self.next(self.a_middle_step, foreach="keys")`. I can see that the `start` step is executed in a pod with no errors and triggers the subsequent `foreach` jobs/pods. However, all the pods for `a_middle_step` stay in Pending status. In our cluster we have a node group with instance types: [g4dn.xlarge, g4dn.2xlarge, g5.16xlarge, g5.2xlarge, g5.4xlarge, g5.8xlarge, g5.xlarge]. However, I get this output in EKS:

Warning	FailedScheduling	2 minutes ago	default-scheduler	0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory, 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Normal	NotTriggerScaleUp	7 minutes ago	cluster-autoscaler	pod didn't trigger scale-up: 3 Insufficient memory, 3 Insufficient nvidia.com/gpu, 5 node(s) had untolerated taint {dedicated: gpu}, 2 Insufficient ephemeral-storage
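
To make event messages like the two above easier to compare, here is a minimal sketch that tallies how many nodes (or node groups) were rejected for each reason. The helper name `tally_reasons` is hypothetical and not part of Metaflow, kubectl, or the cluster autoscaler. It assumes the reasons are comma-separated and each starts with a count, as in the output above.

```python
import re


def tally_reasons(message: str) -> dict:
    """Map each rejection reason in a scheduler/autoscaler message to its count.

    Hypothetical helper for illustration; assumes "N reason, N reason, ..."
    after an "available:" or "scale-up:" prefix.
    """
    # Keep only the text after the first "available:" or "scale-up:" prefix.
    body = re.split(r"(?:available|scale-up):\s*", message, maxsplit=1)[-1]
    # Drop the trailing "preemption: ..." clause if the scheduler added one.
    body = body.split(". preemption:")[0].rstrip(".")
    reasons = {}
    for part in body.split(", "):
        match = re.match(r"(\d+)\s+(.*)", part.strip())
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons


# Message copied from the NotTriggerScaleUp event above.
autoscaler_msg = (
    "pod didn't trigger scale-up: 3 Insufficient memory, "
    "3 Insufficient nvidia.com/gpu, "
    "5 node(s) had untolerated taint {dedicated: gpu}, "
    "2 Insufficient ephemeral-storage"
)
for reason, count in tally_reasons(autoscaler_msg).items():
    print(count, reason)
```

Read this way, the autoscaler event lists the untolerated `{dedicated: gpu}` taint as a separate reason alongside the insufficient memory, GPU, and ephemeral-storage counts.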

The pods just stay in Pending for a long time. Any suggestions on how to address this issue? Thanks a lot for the great work 🙏
