chilly-midnight-51093
10/10/2023, 3:58 PM
Kubernetes error:
@kubernetes does not support parallel execution currently.
I was able to trace the error to https://github.com/Netflix/metaflow/blob/0acf15a0efaa51a5421e953ae8e587f734539618/metaflow/plugins/kubernetes/kubernetes_decorator.py#L219, which indicates we can't use the parallel decorator or any of its variants (pytorch_parallel or ray_parallel) together with the Kubernetes decorator. What would be required to get support for multi-node training with the Kubernetes decorator? I see that the AWS Batch decorator has a function specifically for setting this up here: https://github.com/Netflix/metaflow/blob/0acf15a0efaa51a5421e953ae8e587f734539618/metaflow/plugins/aws/batch/batch_decorator.py#L360. Would we need to do something similar in the Kubernetes decorator to get the parallel decorator supported? Thanks!
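For anyone reading along: the linked line is a decorator-compatibility guard that raises during flow validation. The sketch below is an illustrative reconstruction of that kind of check, not Metaflow's actual code; the exception class and function name here are made up for the example.

```python
# Illustrative sketch of a decorator-compatibility guard like the one at
# kubernetes_decorator.py#L219. Names below (FlowSpecError,
# check_decorator_compatibility) are hypothetical, not Metaflow's real API.

class FlowSpecError(Exception):
    pass

# The parallel decorator and its variants, as named in the question.
PARALLEL_VARIANTS = {"parallel", "pytorch_parallel", "ray_parallel"}

def check_decorator_compatibility(step_decorator_names):
    """Raise if a parallel-style decorator is combined with @kubernetes."""
    names = set(step_decorator_names)
    if "kubernetes" in names and names & PARALLEL_VARIANTS:
        raise FlowSpecError(
            "@kubernetes does not support parallel execution currently."
        )

# Combining @kubernetes with a parallel variant trips the guard:
try:
    check_decorator_compatibility(["kubernetes", "pytorch_parallel"])
except FlowSpecError as e:
    print(e)
```

The AWS Batch decorator avoids this by wiring up the multi-node environment (rank, main-node address, etc.) before the step runs; presumably the Kubernetes decorator needs an equivalent setup step before the guard above can be lifted.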