# ask-metaflow
m
Hello, I stumbled upon Metaflow today and I was wondering if it supports heterogeneous clusters. The use case is as follows: say you want to process videos with a deep learning model on GPUs. Decoding videos requires a lot of CPU compute, and is currently cheapest on ARM cores with instances like c7g on AWS, while the deep learning model needs GPUs to run. Using a GPU instance to both decode and process the video often leads to underprovisioning of CPU compute. You can dramatically reduce the cost by decoding videos on dedicated CPU instances and transferring the decoded video through a socket to GPU instances for processing. This in turn means that you need to set up a heterogeneous cluster with a hybrid workflow: one part on CPU instances and one on GPU instances. Is this scenario supported by Metaflow?
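The decode-and-stream idea above can be sketched with plain TCP sockets. This is a minimal, illustrative sketch: the length-prefixed framing, the `send_frames`/`recv_frames` helpers, and the placeholder byte-string "frames" are all assumptions for demonstration, not part of any real decoding pipeline.

```python
# CPU side decodes frames and streams them; GPU side consumes them.
# Frames are length-prefixed so the receiver knows message boundaries;
# here "decoded frames" are just placeholder byte strings.
import socket
import struct
import threading

def send_frames(conn, frames):
    """Send each frame as a 4-byte big-endian length prefix + payload."""
    for frame in frames:
        conn.sendall(struct.pack(">I", len(frame)) + frame)
    conn.sendall(struct.pack(">I", 0))  # zero length = end of stream

def recv_frames(conn):
    """Yield frames until the zero-length end-of-stream marker."""
    def recv_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed mid-frame")
            buf += chunk
        return buf
    while True:
        (length,) = struct.unpack(">I", recv_exact(4))
        if length == 0:
            return
        yield recv_exact(length)

def demo():
    """Run producer and consumer over localhost to show the framing works."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    frames = [b"frame-%d" % i for i in range(3)]

    def producer():
        with socket.create_connection(("127.0.0.1", port)) as conn:
            send_frames(conn, frames)

    t = threading.Thread(target=producer)
    t.start()
    conn, _ = server.accept()
    with conn:
        received = list(recv_frames(conn))
    t.join()
    server.close()
    return received
```

In a real deployment the producer would run on the c7g decoder nodes and the consumer on the GPU nodes, with the port exchanged out of band.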
v
that's a fun scenario! I can quickly imagine three ways you could do it with Metaflow. 1. Two flows: a. `MainDecodingFlow` runs on CPUs. It gets everything ready for decoding (you could use `foreach` or `parallel` for distributed decoding), and then uses event triggering to launch the `TrainingFlow`, pausing all tasks until the receiving end is ready (easy to do with sockets). b. `TrainingFlow` runs on GPUs - it's event-triggered by `MainDecodingFlow` and waits for the incoming video stream. Once the stream starts flowing, it can spin up training on GPUs. 2. Leverage Metaflow's support for ephemeral compute clusters with interconnected nodes, but you'd need to wait for support for heterogeneous clusters, which is on the roadmap. 3. Leverage Metaflow's support for various clouds, including GPU clouds like Nebius, which lets you find nodes outside AWS with a suitable balance of CPUs and GPUs, so you can just pack everything into one big box (maybe leveraging GPU-accelerated decoding too)
m
> Leverage Metaflow's support for various clouds, including GPU clouds like Nebius, which allows you to find nodes outside AWS with a suitable balance of CPUs and GPUs so you can just pack everything in a big box (maybe leveraging GPU-accelerated decoding too)

The problem with this is that CPU compute on GPU boxes is always pricier than on CPU-only instances, especially ARM instances, and the speedup from GPU decoding is not enough to make it cheaper than CPU decoding
> you'd need to wait for support for heterogeneous clusters which is on the roadmap.

Is there a way I can track the progress of this feature, so that I get a notification when it is ready?
v
right, you'd have to do the math on the total cost. A single big box avoids networking overhead, spin-up overhead, etc., so while the cost per CPU-second may be higher, the total cost of getting the job done could be lower - much depends on the GPU type too, and on the nature of your training.
m
So, for our type of training, we already did the math, and we regularly run heterogeneous clusters with a custom Terraform-based implementation. Using cluster placement groups on AWS makes the networking overhead almost negligible, and the cost reduction from a heterogeneous cluster is very significant.
v
meanwhile you can try the first approach if you are curious - it should work
m
Thank you !
> meanwhile you can try the first approach if you are curious - it should work

Do you know if I can make sure that the two flows are in the same placement group on AWS?
v
if you use `@batch`, create two compute environments with the same placement group. I am not sure if you can have CPU and GPU instances in the same placement group, though. But even without a placement group, if you stay within an AZ you should be able to get 10-50 Gbps, which is pretty decent for uncompressed video, unless you have multiple HD streams being decoded in parallel
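A hedged sketch of pointing a step at a specific AWS Batch job queue via `@batch`: the queue name below is hypothetical, and each queue would map to a compute environment configured with the desired instance types (and, where possible, a placement group).

```python
# "cpu-decode-queue" is a hypothetical AWS Batch job queue backed by a
# compute environment of c7g (ARM/CPU) instances; a second queue backed
# by GPU instances would serve the training flow.
from metaflow import FlowSpec, batch, step

class DecodeFlow(FlowSpec):
    @batch(cpu=16, memory=32000, queue="cpu-decode-queue")
    @step
    def start(self):
        # runs on the CPU compute environment
        self.next(self.end)

    @step
    def end(self):
        pass
```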
if you use EKS, you can set up node groups accordingly
feel free to follow up here if you need help 🙂
m
thank you for the tips, I'll try it when I have the time 🙂
🙌 1