# ask-metaflow
Hi, I want to run a multi-node training job using Metaflow. I intend to use the parallel decorator, and I have set up my flow similarly to the flow and training code found here: https://github.com/rileyhun/llm_finetuning_metaflow/tree/main/pytorch-deepspeed. We are currently on a Kubernetes cluster, and I'm using the Kubernetes decorator to provision my compute. However, when I use the parallel decorator together with the Kubernetes decorator, I get the following error:
Kubernetes error:
    @kubernetes does not support parallel execution currently.
I was able to trace the error to https://github.com/Netflix/metaflow/blob/0acf15a0efaa51a5421e953ae8e587f734539618/metaflow/plugins/kubernetes/kubernetes_decorator.py#L219, which indicates that we can't use the parallel decorator (or any of its variants, such as pytorch_parallel or ray_parallel) with the Kubernetes decorator. What would be required to get multi-node training supported with the Kubernetes decorator? I see that the AWS Batch decorator has a function specifically for setting up the multi-node environment here: https://github.com/Netflix/metaflow/blob/0acf15a0efaa51a5421e953ae8e587f734539618/metaflow/plugins/aws/batch/batch_decorator.py#L360. Would we need to do something similar for the Kubernetes decorator to support the parallel decorator? Thanks!
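For illustration, here is a rough sketch of what I understand that Batch setup function to be doing: translating the multi-node environment variables that AWS Batch injects into generic variables that the parallel tasks can consume. This is my own simplified reading, not Metaflow's actual code, and the `MF_PARALLEL_*` names and the Kubernetes details in the comment are assumptions on my part:

```python
import os


def setup_multinode_environment(environ=os.environ):
    """Sketch (illustrative only, not Metaflow's actual implementation):
    map the env vars AWS Batch injects into multi-node parallel jobs
    onto backend-agnostic variables a parallel task could read."""
    # AWS Batch multi-node parallel jobs expose these variables natively.
    environ["MF_PARALLEL_MAIN_IP"] = environ[
        "AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS"
    ]
    environ["MF_PARALLEL_NUM_NODES"] = environ["AWS_BATCH_JOB_NUM_NODES"]
    environ["MF_PARALLEL_NODE_INDEX"] = environ["AWS_BATCH_JOB_NODE_INDEX"]


# A hypothetical Kubernetes equivalent would need other sources for the
# same three facts, e.g. resolving the main node's address via a headless
# service DNS name and deriving the node index from a StatefulSet ordinal
# or an indexed-Job completion index.
```

So presumably the missing piece for Kubernetes is an equivalent way to launch N coordinated pods and populate this kind of rendezvous information, is that right?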