# dev-metaflow
a
Looks like this has been discussed a little bit in the past, but are native Ray support and distributed training on the roadmap for Metaflow?
v
yep, we have investigated distributed training in the past, especially in the context of PyTorch Lightning and Horovod
using Ray is certainly possible with their SDK (see discussion here) but there isn't any special integration for it
what kind of distributed training / Ray use cases do you have in mind?
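(For reference, a minimal sketch of what "using Ray with their SDK" from a regular step could look like; the flow name, resource sizing, and workload here are illustrative assumptions, not an official integration.)

```python
# Sketch: driving Ray from inside an ordinary Metaflow step, with no special
# integration -- Ray runs as a single-node cluster within the task.
from metaflow import FlowSpec, step, resources
import ray


@ray.remote
def square(x):
    # Trivial stand-in for a real Ray task (e.g. a training shard).
    return x * x


class RayInStepFlow(FlowSpec):

    @resources(cpu=8)
    @step
    def start(self):
        ray.init(num_cpus=8)  # local Ray cluster scoped to this task
        self.results = ray.get([square.remote(i) for i in range(10)])
        ray.shutdown()
        self.next(self.end)

    @step
    def end(self):
        print(self.results)


if __name__ == "__main__":
    RayInStepFlow()
```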
a
None yet. The customer team we are working with has deprioritized distributed training as a pain point; training via vertical scaling is more common amongst our teams, and we do have access to SageMaker training. I'm asking because the leaders in our organization are VERY interested in "foundational models", and I want to make sure Metaflow has the necessary tools to let data scientists train really large transformer models with billions of parameters should that use case arise. Ray (particularly Ray Train) was mentioned as something that has been used to facilitate distributed training.
v
yep, we have a `@parallel` decorator implemented in preparation for such use cases that will require gang scheduling. We have tested it with distributed PyTorch/TF in the past and it works. We haven't announced it as a stable feature yet since it seems most everyone is in the same situation as you: they love the optionality of being able to do it in the future, but practically it is not needed today 🙂 Today you can get really far with large multi-GPU instances, which provide unbeatable performance anyway compared to many basic distributed setups. Let us know when you actually want to start testing it and we are happy to explore `@parallel` and other related features with you.
💯 1
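(For reference, a rough sketch of what a gang-scheduled training step might look like with the experimental `@parallel` decorator; the top-level import, the `num_parallel` fan-out, and the `current.parallel` fields are assumptions about the unreleased API and may differ.)

```python
# Sketch: a gang-scheduled fan-out where each worker learns its rank and
# world size, which would feed torch.distributed / Horovod initialization.
from metaflow import FlowSpec, step, parallel, current


class DistTrainFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into a gang-scheduled group of 4 workers.
        self.next(self.train, num_parallel=4)

    @parallel
    @step
    def train(self):
        # Each worker sees its own index and the total node count.
        print(current.parallel.node_index, current.parallel.num_nodes)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DistTrainFlow()
```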
a
Absolutely perfect! Thank you so much, Ville. Our leaders will be thrilled to see that this capability already exists within Metaflow.
🙌 1