acoustic-van-30942
05/20/2023, 1:54 AM
I'm using the `@pytorch_parallel` decorator. I'm still rather new to distributed training, so I'm learning the ropes. I've tried `fsdp` for sharded training, but it was having trouble with the model parameters; I got this error: `ValueError: optimizer got an empty parameter list`. When I printed out the params, though, they're definitely there. I also tried `ddp` as the strategy, which didn't give any explicit error, but it just stalled and didn't produce any progress output in the stdout logs...
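In case it's useful, here's a tiny self-contained sketch of the kind of failure I mean for the `fsdp` case. This is not my actual flow code (that's in the repo linked below), just a minimal illustration of how that exact `ValueError` shows up when the optimizer ends up with an empty parameter list, e.g. if every parameter is frozen or filtered out before the optimizer is built:

```python
# Toy example only -- plain PyTorch, no Metaflow/FSDP involved.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)

# Freeze everything (similar in spirit to freezing a quantized base model
# before attaching trainable adapter layers).
for p in model.parameters():
    p.requires_grad = False

# Build the optimizer only from trainable params -- the filter comes up empty.
trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable params: {len(trainable)}")  # prints 0

# Raises: ValueError: optimizer got an empty parameter list
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

In my actual run the printed parameter list is non-empty, so this is only meant to show the exact error text, not what I think is happening under the hood.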
I have a reproducible example here - https://github.com/rileyhun/llm_finetuning_metaflow/blob/main/gpt-j-8bit-flow.py. Any pointers or guidance would be greatly appreciated.