acoustic-van-30942
05/20/2023, 1:54 AM
I'm using the `@pytorch_parallel` decorator. I'm still rather new to distributed training, so I'm learning the ropes. I've tried `fsdp` for sharded training, but it was having trouble with the model parameters; I got this error: `ValueError: optimizer got an empty parameter list`. When I printed out the params, though, they're definitely there. I also tried `ddp` as the strategy, which didn't give any explicit error, but it just stalled and didn't produce any progress output in the stdout logs...
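In case it's useful, here's a tiny self-contained sketch of the kind of failure I mean for the `fsdp` case. This is not my actual flow code (that's in the repo linked below), just a minimal illustration of how that exact `ValueError` shows up when the optimizer ends up with an empty parameter list, e.g. if every parameter is frozen or filtered out before the optimizer is built:

```python
# Toy example only -- plain PyTorch, no Metaflow/FSDP involved.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)

# Freeze everything (similar in spirit to freezing a quantized base model
# before attaching trainable adapter layers).
for p in model.parameters():
    p.requires_grad = False

# Build the optimizer only from trainable params -- the filter comes up empty.
trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable params: {len(trainable)}")  # prints 0

# Raises: ValueError: optimizer got an empty parameter list
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

In my actual run the printed parameter list is non-empty, so this is only meant to show the exact error text, not what I think is happening under the hood.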
I have a reproducible example here - https://github.com/rileyhun/llm_finetuning_metaflow/blob/main/gpt-j-8bit-flow.py. Any pointers or guidance would be greatly appreciated.