Outerbounds #ask-metaflow


rich-wire-55966

01/09/2025, 7:40 AM

Hi all, new here and have a few questions:
• I've been working on a production build of Metaflow on Argo Workflows (Kubernetes) and attempting to weave in Karpenter and KEDA. Wanted to see if anyone here had a Terraform repo incorporating this kind of autoscaling that they would be willing to part with.
• How does Metaflow handle PyTorch FSDP? I didn't find much about this online when I searched.

hallowed-glass-14538

01/09/2025, 9:17 AM

Hey Jack! Metaflow itself doesn't change any semantics for using FSDP. If you are running single-node multi-GPU training, you can use FSDP out of the box in your torch code. If you want to do FSDP training with a multi-node setup, you will need to use the @torchrun decorator. This decorator creates a gang-scheduled multi-node cluster where you can run your distributed training job using FSDP. The decorator also provides syntactic sugar over torchrun functionality to directly call your training scripts. The @kubernetes integration with @torchrun uses JobSets under the hood to create gang-scheduled clusters on the fly.
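To make the "syntactic sugar over torchrun" point concrete, here is a minimal sketch of the kind of per-node command a multi-node launcher has to assemble. It is not the decorator's actual implementation. The helper `build_torchrun_cmd`, the script name `train_fsdp.py`, the host name, and the `epochs` argument are all hypothetical. Only the `torchrun` flags themselves (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--master_addr`, `--master_port`) are standard.

```python
import shlex


def build_torchrun_cmd(entrypoint, nnodes, nproc_per_node, node_rank,
                       master_addr, master_port=29500, script_args=None):
    """Assemble the torchrun argv for one node of a multi-node job.

    Hypothetical helper for illustration only; a launcher such as a
    decorator would normally fill in these values for you.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes in the gang
        f"--nproc_per_node={nproc_per_node}",  # usually one process per GPU
        f"--node_rank={node_rank}",            # this node's index within the gang
        f"--master_addr={master_addr}",        # rendezvous host (node 0)
        f"--master_port={master_port}",
        entrypoint,                            # the training script that wraps the model in FSDP
    ]
    # Arguments after the entrypoint are passed through to the training script.
    for key, value in (script_args or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command each of two 8-GPU nodes would run (assumed values).
    for rank in range(2):
        argv = build_torchrun_cmd(
            "train_fsdp.py", nnodes=2, nproc_per_node=8, node_rank=rank,
            master_addr="node-0.example.internal", script_args={"epochs": 3},
        )
        print(shlex.join(argv))
```

Every node has to run the same command with a different `--node_rank`, and all nodes must be up at the same time for the rendezvous to succeed, which is why gang scheduling matters for this kind of job.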

ancient-application-36103

01/10/2025, 12:47 AM

we don't have an official karpenter deployment template with metaflow but if you end up working on one, i would love a community contribution 🙂

rich-wire-55966

01/10/2025, 7:28 AM

Great, thank you @hallowed-glass-14538! And will do, @ancient-application-36103. We may end up switching platforms for now, since we do not have a lot of room for infra development overhead, but I will keep this in the back of my mind! We may still use these notes for a pre-training deployment; we just may not get to building a Karpenter/KEDA setup in TF right away.

ancient-application-36103

01/10/2025, 5:27 PM

sure! if you are doing multi-node training, then the stock auto-scaler may also suffice. also happy to chat more about our internal setup if that's useful
