# dev-metaflow
a
Looks like this has been discussed a little bit in the past, but are native Ray support and distributed training on the roadmap for Metaflow?
v
yep, we have investigated distributed training in the past, especially in the context of PyTorch Lightning and Horovod
using Ray is certainly possible with their SDK (see discussion here) but there isn't any special integration for it
what kind of distributed training / Ray use cases do you have in mind?
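(For reference, a minimal sketch of what "using Ray with their SDK" from a regular step could look like; the flow name, resource sizing, and workload here are illustrative assumptions, not an official integration.)

```python
# Sketch: driving Ray from inside an ordinary Metaflow step, with no special
# integration -- Ray runs as a single-node cluster within the task.
from metaflow import FlowSpec, step, resources
import ray


@ray.remote
def square(x):
    # Trivial stand-in for a real Ray task (e.g. a training shard).
    return x * x


class RayInStepFlow(FlowSpec):

    @resources(cpu=8)
    @step
    def start(self):
        ray.init(num_cpus=8)  # local Ray cluster scoped to this task
        self.results = ray.get([square.remote(i) for i in range(10)])
        ray.shutdown()
        self.next(self.end)

    @step
    def end(self):
        print(self.results)


if __name__ == "__main__":
    RayInStepFlow()
```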
a
None yet. The customer team we are working with has deprioritized distributed training as a pain point; training via vertical scaling is more common amongst our teams, and we do have access to SageMaker training. I'm asking because the leaders in our organization are VERY interested in "foundational models", and I want to make sure Metaflow has the necessary tools to let data scientists train really large transformer models with billions of parameters should that use case arise. Ray (particularly Ray Train) was mentioned as something that has been used to facilitate distributed training.
v
yep, we have a `@parallel` decorator implemented in preparation for such use cases that will require gang scheduling. We have tested it with distributed PyTorch/TF in the past and it works. We haven't announced it as a stable feature yet since it seems most everyone is in the same situation as you: they love the optionality of being able to do it in the future, but practically it is not needed today 🙂 Today you can get really far with large multi-GPU instances, which provide unbeatable performance anyway compared to many basic distributed setups. Let us know when you actually want to start testing it and we are happy to explore `@parallel` and other related features with you.
💯 1
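(For reference, a rough sketch of what a gang-scheduled training step might look like with the experimental `@parallel` decorator; the top-level import, the `num_parallel` fan-out, and the `current.parallel` fields are assumptions about the unreleased API and may differ.)

```python
# Sketch: a gang-scheduled fan-out where each worker learns its rank and
# world size, which would feed torch.distributed / Horovod initialization.
from metaflow import FlowSpec, step, parallel, current


class DistTrainFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into a gang-scheduled group of 4 workers.
        self.next(self.train, num_parallel=4)

    @parallel
    @step
    def train(self):
        # Each worker sees its own index and the total node count.
        print(current.parallel.node_index, current.parallel.num_nodes)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DistTrainFlow()
```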
a
Absolutely perfect! Thank you so much, Ville. Our leaders will be thrilled to see that this capability already exists within Metaflow.
🙌 1