Hello All I ve noticed a slight discrepancy in how the `kube Outerbounds #ask-metaflow

mammoth-rainbow-82717

10/10/2024, 2:15 PM

Hello All, I've noticed a slight discrepancy in how the `kubernetes` and `resources` decorators handle GPUs. The `kubernetes` decorator defaults to `None`, while the `resources` decorator defaults to `0`. Additionally, the `kubernetes` decorator updates the value of `gpu` if it is not `None` in the `resources` decorator (see https://github.com/Netflix/metaflow/blob/master/metaflow/plugins/kubernetes/kubernetes_decorator.py#L280). Having a GPU limit of `0` results in the ExtendedResourceToleration admission controller adding a toleration to such steps, which means that they can run on GPU nodes. This is a general issue with the admission controller, but it would be good if this default were `None` in the `resources` decorator too. Is it set to `0` intentionally? Or is it a bug?
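
For anyone skimming the thread, here is a rough sketch of the behaviour described above. It is not Metaflow's actual code: `merge_gpu` and `gpu_limits` are hypothetical helpers, and the `nvidia.com/gpu` resource name is an assumption used only for illustration. It just shows why a default of `0` ends up as an explicit GPU limit on the pod, while a default of `None` would leave the limit out.

```python
from typing import Dict, Optional

# Defaults as described in the thread (assumed here, not copied from Metaflow).
KUBERNETES_GPU_DEFAULT: Optional[int] = None
RESOURCES_GPU_DEFAULT: Optional[int] = 0


def merge_gpu(kubernetes_gpu: Optional[int], resources_gpu: Optional[int]) -> Optional[int]:
    """Hypothetical merge: the @resources value wins whenever it is not None."""
    if resources_gpu is not None:
        return resources_gpu
    return kubernetes_gpu


def gpu_limits(gpu: Optional[int]) -> Dict[str, str]:
    """Hypothetical helper: build the GPU part of a container's resource limits.

    The resource name "nvidia.com/gpu" is an assumption for illustration.
    """
    if gpu is None:
        return {}  # no GPU key at all, so nothing for an admission controller to act on
    return {"nvidia.com/gpu": str(gpu)}


# Current defaults: 0 is "not None", so it overrides the kubernetes default.
current = gpu_limits(merge_gpu(KUBERNETES_GPU_DEFAULT, RESOURCES_GPU_DEFAULT))

# Proposed default of None in @resources: the GPU limit is omitted entirely.
proposed = gpu_limits(merge_gpu(KUBERNETES_GPU_DEFAULT, None))

print(current)   # {'nvidia.com/gpu': '0'}
print(proposed)  # {}
```

With the current defaults the pod carries a GPU limit of `0`, which is the case where the toleration gets added; with `None` the key never appears.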

👀 1

square-wire-39606

10/14/2024, 3:07 PM

we are looking into this. cc @bulky-afternoon-92433

thankyou 1

bulky-afternoon-92433

10/15/2024, 6:28 PM

currently pitching for changing the default to `None` as it fixes part of the issues here. Still need to verify that some internal compute decorators are not affected. A related issue to this seems to be https://github.com/Netflix/metaflow/issues/2005 where GKE fails to schedule such pods altogether, whereas EKS runs them just fine. Out of curiosity, @mammoth-rainbow-82717 are you running on some managed Kubernetes service, or self-hosted?

mammoth-rainbow-82717

10/16/2024, 7:02 AM

We are running them on EKS.

mammoth-rainbow-82717

10/16/2024, 7:03 AM

currently pitching for changing the default to None as it fixes part of the issues here.

-> This would make sense to me.
