Hi all, new here and have a few questions: - I've ...
# ask-metaflow
r
Hi all, new here and have a few questions:
• I've been working on a production build of Metaflow on Argo Workflows (Kubernetes) and attempting to weave in Karpenter and KEDA. Wanted to see if anyone here has a Terraform repo incorporating this kind of autoscaling that they would be willing to part with.
• How does Metaflow handle PyTorch FSDP? I didn't find much about this when I searched online.
h
Hey Jack! Metaflow itself doesn't change any of the semantics of using FSDP. If you are running single-node, multi-GPU training, you can use FSDP out of the box in your torch code. If you want to do FSDP training with a multi-node setup, you will need the @torchrun decorator. This decorator creates a gang-scheduled multi-node cluster where you can run your distributed training job using FSDP. It also provides syntactic sugar over torchrun so you can call your training scripts directly. The @kubernetes integration with @torchrun uses JobSets under the hood to create gang-scheduled clusters on the fly.
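For anyone following along, here is a rough sketch of how a training script might detect whether it is running single-node or multi-node. It assumes the script is launched by torchrun, which exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker process. `DistContext` and `read_dist_context` are hypothetical helper names made up for this example; they are not part of Metaflow or PyTorch.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistContext:
    """Hypothetical container for the process layout that torchrun exports."""
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node, typically used as the GPU index
    world_size: int  # total number of worker processes

    @property
    def is_distributed(self) -> bool:
        # More than one worker means collective ops (and FSDP sharding) apply.
        return self.world_size > 1

    @property
    def is_main(self) -> bool:
        # Convention: global rank 0 handles logging and checkpoint writes.
        return self.rank == 0


def read_dist_context(env=None) -> DistContext:
    """Read torchrun's environment variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return DistContext(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

The same script should then work unchanged in both cases described above: with one node and several GPUs, and with several nodes, since only the values of these variables differ. The actual FSDP wrapping of the model would happen in your torch code after this point.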
a
we don't have an official karpenter deployment template with metaflow but if you end up working on one, i would love a community contribution 🙂
r
Great, thank you @hallowed-glass-14538! And will do, @ancient-application-36103. We may end up switching platforms for now, just because we do not have a lot of room for infra development overhead, but I will keep this in the back of my mind! We may still use these notes for a pre-training deployment; we just may not get to building a Karpenter/KEDA setup in TF right away.
a
sure! if you are doing multi-node training, then the stock auto-scaler may well suffice. also happy to chat more about our internal setup if that's useful