curved-island-17262
06/26/2023, 3:44 PM
Regarding @pytorch_parallel: my compute environment for GPU training jobs currently only supports a g5.8xlarge, meaning I can only run single-GPU training jobs. If I use the @pytorch_parallel decorator, should I be able to train a model across multiple of these instances, just as in a foreach but in an unbounded manner?
Would it be better to initially just use a g5.24xlarge, which has 4 GPUs, so that I would not need @pytorch_parallel, which is quite experimental? Also, there would be more I/O communication overhead between the smaller g5 instances compared to the single larger instance.
Are there any known issues with the current implementation of the @pytorch_parallel decorator, and are there any examples I can refer to?