user
10/14/2021, 8:13 PM
silly-motorcycle-86975
10/14/2021, 8:30 PM
cuddly-rocket-69327
10/14/2021, 8:38 PM
user
10/14/2021, 8:45 PM
user
10/14/2021, 8:46 PM
user
10/14/2021, 8:54 PM
user
10/14/2021, 8:59 PM
user
10/14/2021, 9:11 PM
import os

def setup_torch_distributed(num_local_devices):
    os.environ["MASTER_PORT"] = "64398"  # arbitrary
    os.environ["MASTER_ADDR"] = str(get_main_node_ip())
    os.environ["NODE_RANK"] = str(get_node_index())
    os.environ["WORLD_SIZE"] = str(get_world_size(num_local_devices))
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
    os.environ["METAFLOW_SHADOW_TASK"] = "1"
Now, get_world_size() and get_node_index(), on the other hand, read environment variables set by AWS Batch: AWS_BATCH_JOB_NUM_NODES and AWS_BATCH_JOB_NODE_INDEX.
fresh-laptop-72652
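For reference, the two helpers mentioned above might look roughly like this. This is only a sketch, not the code from the thread: the bodies, the get_num_nodes() helper, the single-node defaults, and the assumption that every node has the same number of devices are all hypothetical. Only the two AWS Batch variable names come from the message.

```python
import os


def get_num_nodes() -> int:
    # AWS Batch sets AWS_BATCH_JOB_NUM_NODES for multi-node parallel jobs.
    # Falling back to 1 outside of Batch is an assumption of this sketch.
    return int(os.environ.get("AWS_BATCH_JOB_NUM_NODES", "1"))


def get_node_index() -> int:
    # AWS Batch sets AWS_BATCH_JOB_NODE_INDEX (0-based) on each node.
    # Falling back to 0 outside of Batch is an assumption of this sketch.
    return int(os.environ.get("AWS_BATCH_JOB_NODE_INDEX", "0"))


def get_world_size(num_local_devices: int) -> int:
    # Assumes every node exposes the same number of devices.
    return get_num_nodes() * num_local_devices
```

With 4 nodes and 8 devices per node this would give a world size of 32; run locally without the Batch variables, it degrades to a single node with index 0.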
10/15/2021, 12:04 AM
stocky-twilight-23298
10/15/2021, 4:52 AM
user
10/15/2021, 1:57 PM
stocky-twilight-23298
10/15/2021, 10:07 PM
user
10/15/2021, 11:14 PM
stocky-twilight-23298
10/16/2021, 9:31 AM