# dev-metaflow
u
Request for feedback: Enabling multi-node training in Metaflow
Hello, together with the Outerbounds team I have been working on implementing multi-node training support for Metaflow, starting with AWS Batch multi-node; Kubernetes would follow shortly. My example code uses PyTorch Lightning, since it has a very nice API for data-parallel training. In practice this would be used for training with (a) one node and multiple GPUs or (b) multiple nodes, each with multiple GPUs. No change to the code is needed.
Below is a screenshot of the code of my step that runs on multiple nodes, each with multiple GPUs. It actually works, and you will also see the logs of each node interleaved with one another (suffixed with #1, #2, etc.). I believe we could make these separate logs available in the Metaflow GUI as well, as they are just separate files in S3.
Questions:
• Does this API feel intuitive?
• Would you use something like this?
• The fact that the same step code runs in parallel on multiple machines can be confusing. This is solved by treating every machine except machine id 0 as a "shadow" or "replica", i.e. their changes to "self" are not saved. Does this feel like a reasonable approach, or should there be something more explicit? Note that data-parallel training always involves considerations like this, as each worker runs the same code.
• Note: this is not tied to PyTorch Lightning. You could use the same multi-node execution to run, say, distributed XGBoost training.
• 👍 or 🤦‍♂️ or 👎 ?
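(The screenshot of the step isn't reproduced in this log. Below is a rough, purely illustrative sketch of what such a step could look like; the num_nodes argument to @batch, the hypothetical my_model module, and the exact Trainer arguments are guesses, not the actual API from the branch.)
```python
from metaflow import FlowSpec, batch, step


class MultinodeTrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Hypothetical: "num_nodes" is a guess at how the multi-node Batch job
    # might be requested; "gpu" is the existing per-node GPU request.
    @batch(gpu=4, num_nodes=2)
    @step
    def train(self):
        import pytorch_lightning as pl
        from my_model import MyLightningModule  # hypothetical user module

        # Helper shown later in this thread: wires up the torch.distributed
        # environment variables from the AWS Batch multi-node metadata.
        setup_torch_distributed(num_local_devices=4)

        model = MyLightningModule()
        # Trainer arguments vary across Lightning versions; this uses the
        # devices/strategy style of newer releases.
        trainer = pl.Trainer(num_nodes=2, devices=4, strategy="ddp")
        trainer.fit(model)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    MultinodeTrainFlow()
```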
🚀 3
🔥 6
👍🏼 1
excited 5
❤️ 2
👀 1
👍 3
💪 4
s
Looks pretty straightforward. Looking forward to this feature
c
Exciting! How would it handle a node being lost, and retries?
u
Node failures & retries are on my list to figure out. Should not be too difficult, as we can poll the status of all the multinode tasks. Same with checkpointing support, so that a retry can find the latest checkpoint in S3.
u
Also, I want to make this work with PyTorch Elastic Training (PET), so that training can survive node failures and doesn't need to wait for all nodes to start.
u
But in general, in the basic setting, any node failure would cause the whole task to fail and incur a retry, which should resume from a checkpoint.
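(Not part of the branch, just a sketch of how a retried task could pick up the newest checkpoint from S3 and hand it to Lightning; the bucket and prefix here are made up, and reading s3:// paths directly assumes s3fs is installed.)
```python
import boto3


def latest_checkpoint(bucket, prefix):
    """Return the S3 URI of the most recently modified .ckpt under prefix, or None."""
    s3 = boto3.client("s3")
    newest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".ckpt") and (newest is None or obj["LastModified"] > newest["LastModified"]):
                newest = obj
    return f"s3://{bucket}/{newest['Key']}" if newest else None


# On (re)start: resume if a checkpoint exists, otherwise train from scratch.
# ckpt = latest_checkpoint("my-training-bucket", "checkpoints/my-run/")
# trainer.fit(model, ckpt_path=ckpt)  # newer Lightning; older versions use
#                                     # Trainer(resume_from_checkpoint=ckpt)
```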
u
This is great! What method do you use for initializing DDP?
u
Lightning takes care of calling torch.distributed.init_process_group. Environment variables are used to pass information to it, so I do this:
```python
def setup_torch_distributed(num_local_devices):
    os.environ["MASTER_PORT"] = "64398"  # arbitrary
    os.environ["MASTER_ADDR"] = str(get_main_node_ip())  # rendezvous address = main node
    os.environ["NODE_RANK"] = str(get_node_index())  # this node's index in the job
    os.environ["WORLD_SIZE"] = str(get_world_size(num_local_devices))  # total process count
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"  # backend for Lightning to use
    os.environ["METAFLOW_SHADOW_TASK"] = "1"
```
get_world_size() and get_node_index(), in turn, read environment variables set by AWS Batch: AWS_BATCH_JOB_NUM_NODES and AWS_BATCH_JOB_NODE_INDEX.
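(For completeness, a sketch of what those helpers could look like. The first two variables are the ones named above; AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS is the variable AWS Batch documents for the main node's address in multi-node parallel jobs, and the fallback to the node's own address is my guess.)
```python
import os
import socket


def get_node_index():
    # Index of this node within the Batch multi-node job (0 = main node).
    return int(os.environ["AWS_BATCH_JOB_NODE_INDEX"])


def get_world_size(num_local_devices):
    # Total process count = number of nodes * devices (GPUs) per node.
    return int(os.environ["AWS_BATCH_JOB_NUM_NODES"]) * num_local_devices


def get_main_node_ip():
    # Set by AWS Batch on the child nodes of a multi-node parallel job;
    # on the main node itself, fall back to its own address.
    return os.environ.get(
        "AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS",
        socket.gethostbyname(socket.gethostname()),
    )
```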
🙌 2
f
This looks excellent! If you’d like a guinea pig to help test this out I’ve got a chonky transformer model I could throw this on 🚀
🙌 1
s
this looks awesome. great work @User! have you tried / have any idea how to make it work for TensorFlow?
u
@stocky-twilight-23298 I have not tried TensorFlow yet, but I expect the code will look quite similar (perhaps with Horovod).
s
Thanks @User. Native distributed training in TensorFlow uses MultiWorkerMirroredStrategy, which requires each node to know every node's IP address. In AWS Batch multi-node, each node can see the main node's IP but not the other worker nodes'. I am keen to test this for TensorFlow. Does the latest Metaflow version already support this @batch decorator with multinode capability?
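(For context on why every node's address matters: MultiWorkerMirroredStrategy reads a TF_CONFIG environment variable that must list all workers, roughly like below. The IPs and port are made up, and get_node_index() is the helper sketched earlier in the thread.)
```python
import json
import os

# Every node must see the identical worker list; only the task index differs.
worker_hosts = ["10.0.0.11:12345", "10.0.0.12:12345"]  # made-up private IPs
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": worker_hosts},
    "task": {"type": "worker", "index": get_node_index()},
})
```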
u
No, this is just on my private branch. To let every node know all the workers' IP addresses, we would need to write some kind of handshaking server. I'll look into it soon.
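(Nothing like this exists in the branch yet; below is one rough sketch of such a handshake. The main node, whose IP every node already has, collects the workers' IPs over a plain TCP socket and broadcasts the full list back. The port is arbitrary and error handling/retries are omitted.)
```python
import json
import socket

HANDSHAKE_PORT = 64399  # made up; must be reachable between the Batch nodes


def collect_worker_ips(num_nodes):
    """Run on the main node: accept one connection per other node, then send
    the complete IP list back to every worker."""
    ips = [socket.gethostbyname(socket.gethostname())]  # main node's own IP
    conns = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", HANDSHAKE_PORT))
        srv.listen(num_nodes)
        while len(ips) < num_nodes:
            conn, (peer_ip, _) = srv.accept()
            conns.append(conn)
            ips.append(peer_ip)
        payload = json.dumps(ips).encode()
        for conn in conns:
            conn.sendall(payload)
            conn.close()
    # Note: the list is in accept order, not node-rank order; workers would
    # also need to send their node index if rank-ordered addresses matter.
    return ips


def fetch_worker_ips(main_node_ip):
    """Run on every other node: connect to the main node and wait for the list."""
    with socket.create_connection((main_node_ip, HANDSHAKE_PORT)) as conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode())
```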
s
awesome @User pls let me know if u need help in testing