Our team has a cloud-based, multi-tenant SaaS application that helps users determine where best to spend their maintenance budgets to get the most reliability out of their facilities. In our model, we have multiple tenants, which contain facilities, which contain units, which are broken down into assets. Users configure datasets, including a few time series and some properties that are not date/time-effective. Within the scope of each individual asset we run a calculation pipeline, which we currently host in Azure with Service Bus messaging and Azure Durable Functions for orchestration. This triggers a couple of downstream workflows that aggregate data up to the unit level and then up to the facility level. So all data is scoped to the ID of one particular entity.
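To make the scoping concrete, here is a minimal Python sketch of the roll-up shape described above. The function names (`run_asset_pipeline`, `roll_up`), the IDs, and the use of a simple mean are all hypothetical placeholders; the real calculations are more involved, and this only illustrates how every value is keyed by the entity it belongs to.

```python
from collections import defaultdict


def run_asset_pipeline(readings):
    """Hypothetical per-asset calculation: just the mean of one time series."""
    return sum(readings) / len(readings)


def roll_up(scores):
    """Average child scores into their parent scope (drop the last ID in the key)."""
    groups = defaultdict(list)
    for scope, score in scores.items():
        groups[scope[:-1]].append(score)
    return {parent: sum(vals) / len(vals) for parent, vals in groups.items()}


# Keys are (tenant, facility, unit, asset) ID tuples, so every value is
# scoped to one specific entity. All IDs and readings are made up.
asset_scores = {
    ("t1", "f1", "u1", "a1"): run_asset_pipeline([1.0, 3.0]),
    ("t1", "f1", "u1", "a2"): run_asset_pipeline([4.0]),
    ("t1", "f1", "u2", "a3"): run_asset_pipeline([6.0]),
}
unit_scores = roll_up(asset_scores)     # keyed by (tenant, facility, unit)
facility_scores = roll_up(unit_scores)  # keyed by (tenant, facility)
```

The point is only that each stage consumes results keyed by a child entity and emits results keyed by its parent, rather than sweeping one large shared dataset.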
We are looking at porting the calculation pipelines to Metaflow to gain the benefits of simpler versioning, visibility in the Metaflow and Argo dashboards, debugging/resume support, and simpler management of our data science code and deployments. Other things that are important to us are individual calculation performance, scaling infrastructure up and down to control cost, error handling for long-running workflow orchestrations, and traceability across many different composed workflows. One of my initial concerns has been the startup time of steps: we may have calculation pipelines kicked off for 1000 assets at a time, each with 5 or so steps, so adding tens of seconds of environment setup per step adds up for our use case. I know @User has been doing some excellent work in this area, and we can alternatively manage our own Docker images per Metaflow flow, with Python and dependencies pre-configured, to mitigate this, so this is less of a concern now.
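As a rough back-of-envelope illustration of the startup concern, the sketch below multiplies out the numbers from the paragraph above. The 10 seconds per step is an assumed figure (the low end of "tens of seconds"), and the result is cumulative compute time across all steps, not wall-clock time, since many steps would run in parallel.

```python
def startup_overhead_seconds(assets, steps_per_asset, setup_seconds_per_step):
    """Cumulative environment-setup time summed over every step of every asset."""
    return assets * steps_per_asset * setup_seconds_per_step


# 1000 assets and 5 steps come from the text; 10 s per step is an assumption.
total = startup_overhead_seconds(assets=1000, steps_per_asset=5, setup_seconds_per_step=10)
print(f"{total} s of cumulative setup, about {total / 3600:.1f} h")
# prints: 50000 s of cumulative setup, about 13.9 h
```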
My real question is: what are we not yet seeing that could be a blocker for our use case? All the examples we have seen involve a single large dataset that is periodically processed in bulk, rather than our multi-tenant, dataset-per-entity type of use case. Are we looking at the right framework? Would Metaflow be a good fit for what we are trying to achieve? I am not looking to make Metaflow fit my specific cases perfectly, as I know it needs to be general purpose; I mostly need to know whether we are heading in two fundamentally different directions. Thanks for any insights you are able to provide!