square-wire-39606 08/17/2021, 5:31 PM
for-each (or AWS Batch array jobs)?
silly-motorcycle-86975 08/17/2021, 5:46 PM
@pytorch_distributed decorator: https://github.com/zillow/metaflow/pull/57
I haven't looked at it in detail, so I don't know if this works on multi-node or not.
silly-motorcycle-86975 08/17/2021, 5:48 PM
• DataParallel (accelerator='dp') (multiple GPUs, 1 machine)
• DistributedDataParallel (accelerator='ddp') (multiple GPUs across many machines, Python-script based)
• DistributedDataParallel (accelerator='ddp_spawn') (multiple GPUs across many machines, spawn based)
• DistributedDataParallel 2 (accelerator='ddp2') (DP in a machine, DDP across machines)
• Horovod (accelerator='horovod') (multi-machine, multi-GPU, configured at runtime)
• TPUs (tpu_cores=8|x) (TPU or TPU pod)
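[For reference, a minimal sketch of how these flags were passed to the Lightning Trainer around this time (PyTorch Lightning 1.4-era API; the argument names changed in later releases). MyLightningModule and train_loader are assumed to exist and are not from this thread.]

```python
# Minimal sketch, PyTorch Lightning ~1.4-era API.
# MyLightningModule and train_loader are assumed to be defined elsewhere.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,             # GPUs per machine
    num_nodes=2,        # number of machines
    accelerator="ddp",  # one process per GPU, synced via DistributedDataParallel
)
trainer.fit(MyLightningModule(), train_loader)
```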
silly-motorcycle-86975 08/17/2021, 5:49 PM
nccl and gloo as distributed backends
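[For context, the backend choice is made when the process group is initialized. A minimal sketch using the env:// rendezvous, which assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in every participating container:]

```python
# Minimal sketch: initialize the process group with the nccl (GPU) or
# gloo (CPU) backend. Assumes MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE are present in the environment of every container.
import os
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(
    backend=backend,
    init_method="env://",  # read rendezvous info from the environment
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```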
silly-motorcycle-86975 08/17/2021, 6:34 PM
> 1) All containers have to be thrown into the same VPC
I think this can be easily done with my Metaflow infra, as all of its compute envs/job queues are configured under the same VPC.
> 2) The master container where you are running initialization needs to have an IP which is known
Yeah, this sounds non-trivial to me.
> If you need DDP over many nodes, then write all the code under a foreach that throws all jobs in the same VPC; keep workers equal to the world_size; the IP resolution seems non-trivial but I think there may be a hack with CIDR specification for the Batch VPC
I don't understand what this means:
> write all the code under a foreach that throws all jobs in the same VPC
Do you mean write the model training code under a foreach Metaflow step? What should I foreach on?
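[A hypothetical sketch of one way the "known master IP" problem could be worked around, assuming all tasks can reach a shared S3 bucket. This is not a built-in Metaflow mechanism; the bucket, key, and helper names are made up for illustration.]

```python
# Hypothetical master-IP rendezvous via S3. Bucket/key names are
# illustrative only; this is not something Metaflow provides.
import socket
import time
import boto3

BUCKET, KEY = "my-rendezvous-bucket", "ddp/master_ip"  # hypothetical
s3 = boto3.client("s3")

def publish_master_ip():
    # Master container: publish this container's private IP.
    ip = socket.gethostbyname(socket.gethostname())
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=ip.encode())
    return ip

def wait_for_master_ip(timeout=600):
    # Worker containers: poll until the master has published its IP.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            obj = s3.get_object(Bucket=BUCKET, Key=KEY)
            return obj["Body"].read().decode()
        except s3.exceptions.NoSuchKey:
            time.sleep(5)
    raise TimeoutError("master IP was never published")
```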
little-apartment-49355 08/17/2021, 9:39 PM
```python
from metaflow import FlowSpec, step


class DDPExample(FlowSpec):

    @step
    def start(self):
        # Fan out into five parallel ddp_step tasks, one per worker.
        self.new_workers = [1, 2, 3, 4, 5]
        self.next(self.ddp_step, foreach='new_workers')

    @step
    def ddp_step(self):
        # There will be five instances of this step.
        # One of the container instances needs to be the master container;
        # all the other containers need the IP address of the master.
        # Doing this seems non-trivial, as you need to communicate the IP
        # address in some way or another.
        # Your DDP synchronization code comes here.
        ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == '__main__':
    DDPExample()
```
square-wire-39606 08/18/2021, 1:18 AM
… ddp_step tasks execute concurrently.
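[Since AWS Batch schedules jobs subject to compute-environment capacity, some foreach tasks may start later than others, which matters for a DDP rendezvous. One mitigation is a longer timeout when initializing the process group; a minimal sketch, with the 60-minute value chosen purely for illustration:]

```python
# Sketch: give straggler tasks more time to join the rendezvous before
# init_process_group gives up. Assumes env:// rendezvous variables are set.
# The timeout is honored directly by the gloo backend; nccl needs extra
# environment configuration for it to take effect.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=timedelta(minutes=60),  # illustrative value
)
```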