# dev-metaflow
m
Sorry, an additional (completely unrelated) question. It seems that all Kubernetes-related functionality in Metaflow, e.g., Argo Workflows, relies only on resource requests and makes no use of limits. Ideally we would like to be able to at least specify limits on resources. Is there any appetite in Metaflow to allow for the specification of limits in Kubernetes?
s
@mammoth-rainbow-82717 great point - ideally we would only like to support limits for specific attributes. one way to introduce them could be through a flag in metaflow config. what do you think of that idea?
m
Hey, thanks for the response!
So the idea would be to use a flag to specify whether to use requests or limits in your workflow configuration? I guess (certain) input arguments to the Kubernetes decorator would then be used to set the corresponding request or limit. Did I understand correctly? If so, then yes, I think this would cover our requirements.
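To make the idea concrete, something along these lines - to be clear, the `KUBERNETES_RESOURCES_AS_LIMITS` config key below is a made-up placeholder just for illustration; only the `@kubernetes` decorator arguments exist in Metaflow today:

```python
from metaflow import FlowSpec, kubernetes, step


class TrainFlow(FlowSpec):

    # Hypothetical behaviour: with a config flag such as
    #   METAFLOW_KUBERNETES_RESOURCES_AS_LIMITS=true
    # these values would be submitted to Kubernetes as resource limits
    # rather than (only) as resource requests.
    @kubernetes(cpu=2, memory=8192)  # existing decorator arguments (memory in MB)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TrainFlow()
```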
s
correct
m
Cool. Yeah, it makes sense to me. Would be happy to make an issue for it, if you think it is worth recording.
s
totally!
m
Also happy to have a stab at it, if you are open to that. 😄 No problem, if not. 🙂
s
that will be great!
m
ok, cool! I'll make the issue now and then pick up the work soon.
🙌 1
t
@mammoth-rainbow-82717 Did you create an issue on this? (Found it) While this might solve a few people's problem, incl. Thomas's, I think it is still not sufficient for most people's needs. Requests are “I need at least X”, limits are “I want this pod to use at most Y”. These should not really be exclusive; when using Kubernetes, you often want to specify both. So I am a little worried that the solution suggested by @ancient-application-36103 will move us to a worse position for making a good solution in the future, because then we end up with a flag that needs to be deprecated.
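For reference, at the Kubernetes API level the two sit side by side on the same container - a quick sketch using the official `kubernetes` Python client (the container name and numbers are just illustrative):

```python
from kubernetes import client

# Requests: "I need at least X" - used by the scheduler for placement.
# Limits:   "use at most Y"     - enforced by the kubelet (CPU throttling,
#                                 OOM-kill for memory). They are independent
#                                 knobs and are usually both set.
resources = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "4Gi"},
    limits={"cpu": "4", "memory": "8Gi"},
)

container = client.V1Container(
    name="metaflow-step",   # illustrative
    image="python:3.11",
    resources=resources,
)
```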
m
Hi, I have made a PR with the proposed approach. Would be great to get a review, if you have time. There is an alternative approach in a related PR and some open comments on the issue too btw. Both are linked in the PR. Thanks!
Just bumping this one. Would be great to get a review on it, if you have the time. Thanks! 🙂
👀 1
Hi again, just looping back on the Kubernetes limits functionality in Metaflow. I was wondering if we could pick this discussion up again at some point? It would be really helpful on our side! 🙂 There is either this PR (quite old now, so it might need updating) or another that is linked in it. Thanks!
💯 4
a
@mammoth-rainbow-82717 thanks for the reminder! one proposal here is to introduce a `qos` argument - `guaranteed`, `burstable` and `besteffort` - similar in spirit to what kubernetes enables today. this will avoid the complexity of having to think about requests and limits (and what do they even mean for gpus)
let me know your thoughts!
m
Hi, thanks for the response. Slightly confused by the suggestion though. Unless I am missing something, the docs suggest that `qos` is automatically assigned (by Kubernetes) depending on the limit/request specification of the pod. Am I missing something here? For context, one of our main use cases for limits is to put LimitRanges in place for our namespaces, which requires all pods to have limits set. One could try to set a default limit, but there is no guarantee it will be at least as large as the request, which would lead to pods that can't be scheduled.
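For what it's worth, this is roughly the rule Kubernetes applies when it assigns a QoS class - simplified here to a single container and ignoring the defaulting of requests from limits:

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Simplified sketch of how Kubernetes derives a pod's QoS class
    from a single container's resource spec."""
    if not requests and not limits:
        return "BestEffort"
    guaranteed = all(
        resource in limits and limits[resource] == requests.get(resource)
        for resource in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class({"cpu": "2", "memory": "4Gi"}, {"cpu": "2", "memory": "4Gi"}))  # Guaranteed
print(qos_class({"cpu": "2", "memory": "4Gi"}, {"cpu": "4", "memory": "8Gi"}))  # Burstable
print(qos_class({}, {}))                                                        # BestEffort
```

So a `qos` argument would still have to be translated into explicit requests and limits on the pod spec - which is exactly what a namespace LimitRange forces anyway.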
m
Agree with @mammoth-rainbow-82717. I don't think `qos` is the solution here because we may not know the request & limit at declaration time - pushing this decision to the step's decorator fits more use cases.
This links directly to the problem we faced here: https://outerbounds-community.slack.com/archives/C02116BBNTU/p1710781444044259 with noisy neighbors on Kubernetes. As teams start maturing their autoscalers and scheduling tooling (like Karpenter on AWS), this problem gets worse. We see about 15-20% of our flows hit this issue, which comes down to how pods are scheduled on Kubernetes when no limits are set.
m
Indeed, this week we've had more issues related to not being able to set limits. Would love this to get prioritised.