# ask-metaflow
m
Hey team, is there any comparison analysis between Metaflow and Temporal for ML Workflow orchestration?
v
I haven't seen one. In general it's a bit apples to oranges: Temporal is great for managing even millions of concurrent lightweight DAGs with low latency. Metaflow is more about high-throughput, compute-heavy DAGs, each of which may contain thousands of tasks.
while you could use Temporal to orchestrate a DAG, it won't provision compute for you or manage terabytes of data
c
I believe that Temporal would be a great plugin for Metaflow, where Metaflow gets the best of both worlds. Temporal's design is based on worker groups consuming task queues, which lends itself well to high requests per second. Hence the state of the workflow is handled by task queues, avoiding state-transition orchestration bugs, and it is on the engineer to provision the scalable worker groups, which could be anywhere (Kubernetes, AWS Lambda, a web service, etc.). If a Temporal Metaflow plugin provisioned the workers on a set of supported compute (e.g. k8s, AWS Lambda), then Metaflow would expand its scope from high-throughput to event-based ML scenarios at millions of requests per second. ML is moving to event-based workloads, some of which are high-compute (minutes to an hour).
Has there been thought towards a Metaflow plugin for Temporal where it would provision the compute?
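For readers unfamiliar with the "worker groups consuming task queues" pattern mentioned above, here is a minimal sketch of the general idea using only the Python standard library. It is not Temporal's API or anything from Metaflow; `run_worker_group`, `handler`, and `num_workers` are hypothetical names chosen for illustration, and a real system would add durable queues, retries, and remote workers.

```python
import queue
import threading


def run_worker_group(tasks, handler, num_workers=4):
    """Process tasks with a pool of workers that all poll one shared queue.

    Hypothetical illustration of the task-queue pattern, not a Temporal API.
    Returns the handler results in the same order as the input tasks.
    """
    task_queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: no more work for this worker
                return
            task_id, payload = item
            output = handler(payload)
            with lock:
                results[task_id] = output

    # Start the worker group; workers pull work instead of being pushed to.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    # Enqueue the tasks, then one sentinel per worker to shut the group down.
    for task_id, payload in enumerate(tasks):
        task_queue.put((task_id, payload))
    for _ in threads:
        task_queue.put(None)

    for thread in threads:
        thread.join()

    return [results[i] for i in range(len(tasks))]
```

For example, `run_worker_group([1, 2, 3], lambda x: x * x, num_workers=2)` squares each input using two workers. The point of the pattern is that the queue, not a central controller, decides which worker picks up which task, so the worker group can be scaled independently of the code that submits work.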
v
yeah, we are working on a number of things related to agentic use cases so this has come up
we have been working on Argo to support much higher scale workloads than what you can do otherwise, which addresses a number of high-throughput cases
c
I’m hitting possibly similar issues with Argo Workflows at high scale (the controller handling state issues, and lots of pods starting and stopping)… What are you considering for the higher scale in Argo? https://github.com/numaproj/numaflow ? 🙂
but considering Temporal…
v
we have been patching Argo
I don't think throughput / scale is the bottleneck (at least with patched Argo) currently - the biggest benefit of something like Temporal (or Step Functions) would be lower latency. Argo hasn't been designed for low-latency use cases