curved-island-17262
06/26/2023, 3:44 PM
Regarding @pytorch_parallel: my compute environment for GPU training jobs currently only supports g5.8xlarge instances, meaning I can only run single-GPU training jobs. If I use @pytorch_parallel, should I be able to train a model across multiple of these instances, similar to a foreach fan-out but in an unbounded manner?
Would it be better to just start with a g5.24xlarge, which has 4 GPUs, since I then would not need @pytorch_parallel, which is still quite experimental? There would also be more I/O and communication overhead between the smaller g5 instances than within the single larger instance.
Are there any known issues with the current implementation of @pytorch_parallel, and are there any examples I can refer to?