# ask-metaflow
Hey, I have a question about the `@pytorch_parallel` decorator. My compute environment for GPU training jobs currently only supports a `g5.8xlarge`, which means I can only run single-GPU training jobs. If I use `@pytorch_parallel`, should I be able to train a model across multiple of these instances, just as in a `foreach` but in an unbounded manner? Or would it be better to start with a `g5.24xlarge`, which has 4 GPUs, so that I wouldn't need to rely on `@pytorch_parallel`, which is quite experimental? There would also be more I/O communication overhead between the smaller `g5` instances compared to keeping everything on the larger instance. Are there any known issues with the current implementation of `@pytorch_parallel`, and are there any examples I can refer to?
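For context, my understanding of how this would be wired up is something like the sketch below (untested; the import path for `@pytorch_parallel`, the `@batch` resource numbers, and the flow/parameter names are assumptions on my part):

```python
# Minimal sketch (untested). Assumes @pytorch_parallel is importable from
# metaflow like the other decorators (it is experimental, so the exact import
# path may differ) and that it is combined with num_parallel on self.next(),
# as with Metaflow's generic @parallel decorator.
from metaflow import FlowSpec, Parameter, batch, pytorch_parallel, step


class DistributedTrainFlow(FlowSpec):
    # Hypothetical parameter: number of g5.8xlarge workers to gang-schedule.
    num_nodes = Parameter("num_nodes", default=4)

    @step
    def start(self):
        # num_parallel launches N copies of the next step concurrently,
        # one per node -- similar to a foreach, but gang-scheduled.
        self.next(self.train, num_parallel=self.num_nodes)

    @batch(gpu=1, memory=120000)  # assumed sizing for one g5.8xlarge
    @pytorch_parallel
    @step
    def train(self):
        import torch.distributed as dist

        # Assumption: the decorator sets MASTER_ADDR / MASTER_PORT /
        # RANK / WORLD_SIZE so init_process_group can pick them up.
        dist.init_process_group(backend="nccl")
        # ... wrap the model in DistributedDataParallel and train here ...
        dist.destroy_process_group()
        self.next(self.join)

    @step
    def join(self, inputs):
        # Parallel tasks are joined like foreach branches.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DistributedTrainFlow()
```

Happy to be corrected if this isn't how the decorator is meant to be used.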