acoustic-van-30942
07/31/2023, 11:15 PM@ray_parallel
here which creates the Ray Cluster from AWS Batch.
I have tested this and have a working example here:
https://github.com/rileyhun/llm_finetuning_metaflow/tree/main/ray-distributed
There are some caveats I want to discuss though and would welcome assistance from the team:
• You have to ensure the workers nodes are kept alive - so in your flow, you have to add a "heartbeat" for all the worker nodes. The heartbeat looks something like this (which is kind of ugly). But essentially, for the worker nodes, we ping the Metaflow client API and keep the nodes alive until the head node's task is finished. If anyone has a more elegant way of handling this, I'm all ears!
• You need to pip install ray
on all the nodes. I'm not sure if the best place to do this is in the @ray_parallel
decorator OR move this out somewhere else. I tried installing ray
on @conda_base
but I ran into issues
• Not sure if this time
sleep is necessary here. Just making sure the nodes are initialized with ray
.
• Also I needed to install the pip
packages into the Ray run-time env o/w I would encounter an error, which didn't make sense to me because I already installed those pip
packages into the nodes via Metaflow's custom pip
decorator