# ask-metaflow
a
Hey everyone, I currently have a training setup that uses PT Lightning + Hydra, and I want to migrate it to Metaflow to orchestrate training jobs on a Kubernetes cluster. I haven't seen many examples that use these tools together (Lightning in particular), so I wanted to see if anyone can point me in the right direction for integrating them. I've seen that there is a `@torchrun` decorator. Is this the recommended way to do distributed training across multiple nodes with Lightning (I know they have their own way, by setting `num_nodes` in the `Trainer`)? Thanks for the help!
s
Sorry for the delay! Were you able to find your way? Otherwise, maybe @flat-television-23413 has thoughts on this.
f
Hey @acoustic-lamp-86528, we have a minimal example with torchrun and Hydra working together. Let us know if this doesn't hit the mark.
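
On the `num_nodes` part of the question, here is a rough sketch rather than a definitive recipe. It assumes the training script is started by torchrun, which exports `WORLD_SIZE` and `LOCAL_WORLD_SIZE` to each worker, and that every node runs the same number of processes. The helper name `trainer_topology_from_env` is made up for illustration and is not part of Lightning, Hydra, or Metaflow. It only derives values that could then be passed to the `Trainer`, so they stay consistent with whatever launched the job.

```python
import os
from typing import Mapping, Optional


def trainer_topology_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Derive num_nodes / devices from torchrun-style environment variables.

    Hypothetical helper for illustration only. Assumes every node runs
    the same number of worker processes.
    """
    env = os.environ if env is None else env

    # torchrun sets WORLD_SIZE (total worker processes across all nodes)
    # and LOCAL_WORLD_SIZE (worker processes on the current node).
    # Fall back to 1 so the script still works when run without torchrun.
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", "1"))

    if local_world_size < 1 or world_size % local_world_size != 0:
        raise ValueError(
            f"WORLD_SIZE={world_size} is not a multiple of "
            f"LOCAL_WORLD_SIZE={local_world_size}"
        )

    return {
        "num_nodes": world_size // local_world_size,
        "devices": local_world_size,
    }


if __name__ == "__main__":
    # Example: 2 nodes with 4 processes each, simulated with a plain dict.
    print(trainer_topology_from_env({"WORLD_SIZE": "8", "LOCAL_WORLD_SIZE": "4"}))
```

Inside the training entrypoint, the result could be unpacked into the `Trainer`, e.g. `Trainer(**trainer_topology_from_env())`, alongside whatever other arguments the Hydra config supplies.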