# ask-metaflow
h
Hi, when using the `metaflow_ray` decorator, is there any way to set automatic resource cleanup for the transient Ray cluster? Ray offers that capability via the `shutdownAfterJobFinishes` setting when submitting a job through a RayJob manifest, but I'm not sure whether this is possible when using this extension. If not, are there other ways to clean up the leftover pods after the workflow has completed? https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayjob-quick-start.html#rayjob-configuration
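For reference, the RayJob setting mentioned above looks roughly like this when submitting a job directly to KubeRay. This is a minimal sketch assuming the `kubernetes` Python client is installed and the KubeRay CRDs are present; the job name, entrypoint, and elided cluster spec are placeholders, not anything metaflow-ray generates.

```python
# Sketch: a RayJob with automatic cluster teardown, submitted with the
# kubernetes Python client. The names below are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "rayjob-sample"},  # placeholder name
    "spec": {
        # The cleanup knob from the linked docs: delete the transient
        # Ray cluster once the job finishes.
        "shutdownAfterJobFinishes": True,
        "entrypoint": "python my_script.py",  # placeholder entrypoint
        "rayClusterSpec": {},  # head/worker group specs omitted here
    },
}

api.create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="default",
    plural="rayjobs", body=ray_job,
)
```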
s
Hi! The pods are automatically cleaned up. We don't use Ray clusters for running Ray workloads; we just spin up an ephemeral gang of pods that can run many distributed frameworks.
h
At least in my testing on Kubernetes, the pods stick around in the Completed or Error state. Is there a mechanism that cleans up these pods?
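One way to see what is lingering: list finished pods with the `kubernetes` Python client (a minimal sketch; the namespace is a placeholder). Note that `kubectl` displays the pod phases `Succeeded` and `Failed` as `Completed` and `Error`.

```python
# Sketch: list pods that have finished but not yet been garbage-collected.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default").items:  # placeholder namespace
    if pod.status.phase in ("Succeeded", "Failed"):
        print(pod.metadata.name, pod.status.phase)
```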
s
Can you help me understand the infrastructure you are running this on? Have you deployed JobSets and Kueue?
h
We are using JobSets but not Kueue
We have plans to use Kueue, but it is not yet configured for these jobs.
s
That could explain the issue. JobSets by themselves do not provide gang scheduling and gang execution semantics.
h
Interesting; the examples in the metaflow-ray repo do seem to work correctly without Kueue. I was just concerned that the pods were not being cleaned up after the flow completed. Is there anything written about how to optimally configure this feature on Kubernetes?
After some investigation I found the `ttlSecondsAfterFinished` property set on the JobSet object; I just did not wait long enough for it to be triggered. Thanks for the help.
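For anyone who hits the same thing, that TTL can be confirmed directly on the JobSet object. A minimal sketch with the `kubernetes` Python client; the namespace and JobSet name are placeholders.

```python
# Sketch: read ttlSecondsAfterFinished from a JobSet custom resource.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

js = api.get_namespaced_custom_object(
    group="jobset.x-k8s.io", version="v1alpha2",
    namespace="default", plural="jobsets",
    name="my-jobset",  # placeholder: the JobSet backing the run
)

# Pods are garbage-collected this many seconds after the JobSet finishes,
# which is why they linger in Completed/Error for a while first.
print(js["spec"].get("ttlSecondsAfterFinished"))
```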