mammoth-rainbow-82717
10/10/2024, 2:15 PM
kubernetes and resources decorators handle GPUs. The kubernetes decorator defaults to None, while the resources decorator defaults to 0.
Additionally, the kubernetes decorator updates its gpu value from the resources decorator whenever the latter is not None - https://github.com/Netflix/metaflow/blob/master/metaflow/plugins/kubernetes/kubernetes_decorator.py#L280
Having a GPU limit of 0 results in the ExtendedResourceToleration admission controller adding a toleration to such steps, which means that they can run on GPU nodes. This is a general issue with the admission controller, but it would be good if this default were None in the resources decorator too.
Is it set to 0 intentionally, or is it a bug?
square-wire-39606
10/14/2024, 3:07 PM
bulky-afternoon-92433
10/15/2024, 6:28 PM
currently pitching for changing the default to None as it fixes part of the issues here. Still need to verify that some internal compute decorators are not affected.
A related issue to this seems to be https://github.com/Netflix/metaflow/issues/2005 where GKE fails to schedule such pods altogether, whereas EKS runs them just fine. Out of curiosity, @mammoth-rainbow-82717 are you running on some managed Kubernetes service, or self-hosted?
mammoth-rainbow-82717
10/16/2024, 7:02 AM
mammoth-rainbow-82717
10/16/2024, 7:03 AM
"currently pitching for changing the default to None as it fixes part of the issues here."
-> This would make sense to me.
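
For anyone reading this thread later, here is a minimal Python sketch of the behaviour described above. It is not Metaflow's actual code: merge_gpu, container_limits and would_get_gpu_toleration are hypothetical helpers, the nvidia.com/gpu resource name is an assumption, and the toleration check only models what was reported in this thread (a GPU limit that is present, even at 0, attracts the toleration).

```python
from typing import Dict, Optional

# Assumed extended-resource name; a cluster may use a different vendor key.
GPU_RESOURCE = "nvidia.com/gpu"


def merge_gpu(kubernetes_gpu: Optional[int], resources_gpu: Optional[int]) -> Optional[int]:
    """Hypothetical, simplified merge: a non-None gpu value from the
    resources decorator replaces the kubernetes decorator's value."""
    return resources_gpu if resources_gpu is not None else kubernetes_gpu


def container_limits(gpu: Optional[int]) -> Dict[str, str]:
    """Build a limits mapping; the GPU key is emitted whenever gpu is not None,
    which is the assumption that makes a default of 0 visible to the cluster."""
    limits: Dict[str, str] = {}
    if gpu is not None:
        limits[GPU_RESOURCE] = str(gpu)
    return limits


def would_get_gpu_toleration(limits: Dict[str, str]) -> bool:
    """Model of the reported admission-controller behaviour: the mere presence
    of the extended resource, regardless of quantity, triggers a toleration."""
    return GPU_RESOURCE in limits


# Current defaults from the thread: kubernetes -> None, resources -> 0.
current = container_limits(merge_gpu(None, 0))
# Proposed defaults: both None.
proposed = container_limits(merge_gpu(None, None))

print(current, would_get_gpu_toleration(current))
print(proposed, would_get_gpu_toleration(proposed))
```

Under these assumptions, the 0 default ends up as an explicit GPU limit on a step that never asked for a GPU, while a None default leaves the key out entirely.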