# ask-metaflow
Hello all, I'm happy to share that I got Ray working on Metaflow. The idea is to use Metaflow as the orchestration layer for creating the AWS Batch multi-node, gang-scheduled cluster, and then to transform that cluster into a Ray cluster. To accomplish this, I wrote a Metaflow plugin/decorator, `@ray_parallel`, here, which creates the Ray cluster from AWS Batch. I have tested this and have a working example here: https://github.com/rileyhun/llm_finetuning_metaflow/tree/main/ray-distributed

There are some caveats I want to discuss, though, and I would welcome assistance from the team:
• You have to ensure the worker nodes are kept alive, so in your flow you have to add a "heartbeat" for all the worker nodes. The heartbeat looks something like this (which is kind of ugly). Essentially, the worker nodes ping the Metaflow client API and stay alive until the head node's task is finished. If anyone has a more elegant way of handling this, I'm all ears!
• You need to `pip install ray` on all the nodes. I'm not sure whether the best place to do this is in the `@ray_parallel` decorator or somewhere else. I tried installing `ray` via `@conda_base`, but I ran into issues.
• I'm not sure if this `time` sleep is necessary here. It is only there to make sure the nodes are initialized with `ray`.
• I also needed to install the `pip` packages into the Ray runtime env; otherwise I would encounter an error. That didn't make sense to me, because I had already installed those `pip` packages on the nodes via Metaflow's custom `pip` decorator.
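
For readers following along, here is a minimal sketch of two of the caveats above: the worker heartbeat and the Ray runtime env. It is not the code from the linked repo. The names `keep_worker_alive`, `head_finished`, and `build_runtime_env` are hypothetical, and the polling interval and timeout are placeholder values.

```python
import time
from typing import Callable, Dict, List


def keep_worker_alive(
    head_finished: Callable[[], bool],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block a worker node until the head node's task reports finished.

    `head_finished` is a hypothetical callable supplied by the caller; it
    should return True once the head node's task is done.
    Returns True if the head finished, False if the timeout elapsed first.
    """
    waited = 0.0
    while not head_finished():
        if waited >= timeout_seconds:
            return False  # give up rather than keep the node alive forever
        sleep(poll_seconds)
        waited += poll_seconds
    return True


def build_runtime_env(pip_packages: List[str]) -> Dict[str, List[str]]:
    """Build a dict suitable for passing as ray.init(runtime_env=...)."""
    return {"pip": list(pip_packages)}
```

On a worker node, `head_finished` could wrap the Metaflow client API, for example `lambda: Task(head_pathspec).finished`, where `head_pathspec` is an assumed variable holding the head task's pathspec. On the head node, the second helper would be used as `ray.init(runtime_env=build_runtime_env([...]))` with whatever packages the job needs. The timeout is an addition of this sketch so that a crashed head node does not leave workers running indefinitely.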