Hello again!
I have a question regarding optimal resource usage.
Here is the setup:
We have ~7k 3D models in an S3 bucket, and we would like to run a data transformation flow over them using Step Functions / AWS Batch.
The flow currently looks something like this:
Start -> Collect model URLs from S3
foreach model URL (potentially 7,000) -> Transform 3D model
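For reference, here is a minimal sketch of what the flow looks like today (TransformFlow, transform_model, and the bucket name are placeholders for our actual code):

```python
from metaflow import FlowSpec, step

def transform_model(url):
    ...  # placeholder for our actual 10-30 s transformation

class TransformFlow(FlowSpec):

    @step
    def start(self):
        import boto3
        # collect all model keys from the bucket (bucket name is made up)
        s3 = boto3.client("s3")
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket="our-model-bucket")
        self.model_urls = [obj["Key"] for page in pages
                           for obj in page.get("Contents", [])]
        # fan out: one Batch task per model, potentially ~7000 branches
        self.next(self.transform, foreach="model_urls")

    @step
    def transform(self):
        # self.input is the single model URL assigned to this branch
        self.result = transform_model(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TransformFlow()
```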
The transform task itself is quite light: it takes around 10-30 seconds per model.
Running this flow via Step Functions, we notice two issues:
Performance: launching a container and bootstrapping the cached conda environment takes far longer than the actual work to be done. Every step spends about 3 minutes just starting up and shutting down.
Robustness: if one of the branches chokes on bad data or goes OOM, the whole workflow fails, and I cannot resume it with modified code or resources (right?).
That makes us wonder whether we are using Metaflow correctly. Should the transformation of the data happen in one step instead of being fanned out?
On my local machine this works out perfectly, of course: Metaflow uses the available CPU cores and parallelizes the steps as much as possible. In the distributed environment, the same fan-out incurs a huge performance penalty.
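To make the question concrete, this is the kind of restructuring we are considering, assuming chunking is even the right idea: fan out over chunks of URLs rather than individual models, and use the container's cores inside each task. The chunk size and helper names here are made up:

```python
from concurrent.futures import ProcessPoolExecutor
from metaflow import FlowSpec, step

def transform_model(url):
    ...  # placeholder, as above

def collect_model_urls():
    ...  # same S3 listing as in the first sketch

class ChunkedTransformFlow(FlowSpec):

    @step
    def start(self):
        self.model_urls = collect_model_urls()
        chunk = 100  # hypothetical chunk size: ~70 branches instead of ~7000
        self.url_chunks = [self.model_urls[i:i + chunk]
                           for i in range(0, len(self.model_urls), chunk)]
        self.next(self.transform_chunk, foreach="url_chunks")

    @step
    def transform_chunk(self):
        # one container per chunk; amortise the ~3 min startup over ~100
        # models and use all of the container's cores for them
        with ProcessPoolExecutor() as pool:
            self.results = list(pool.map(transform_model, self.input))
        self.next(self.join)

    @step
    def join(self, inputs):
        # flatten the per-chunk results back into one list
        self.results = [r for inp in inputs for r in inp.results]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ChunkedTransformFlow()
```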
Questions:
• Is it possible to make Metaflow / Batch re-use the same container for executing the steps?
◦ And even parallelize within the container using multiple cores?
• Or is the recommended approach not to fan out tasks that take less time than launching and bootstrapping a container?
• I assumed that for a large workflow one could inspect a faulty branch, fix the data or the code, and resume. If that is not the case, how does one write robust Metaflow code? Do I need to resort to wrapping everything in a try/except block (see the sketch below) if I want to avoid re-running the whole workflow on every iteration?
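For context, this is the pattern I imagine if @catch and @retry are the intended tools here; the decorators are from the Metaflow docs, but whether this is the recommended way to handle bad data / OOM is exactly what I am asking (helper names are placeholders):

```python
from metaflow import FlowSpec, catch, retry, step

def transform_model(url):
    ...  # placeholder, as above

def chunked_model_urls():
    ...  # S3 listing plus chunking, as in the previous sketch

class RobustTransformFlow(FlowSpec):

    @step
    def start(self):
        self.url_chunks = chunked_model_urls()
        self.next(self.transform_chunk, foreach="url_chunks")

    # retry transient failures first; if a branch still fails
    # (bad data, OOM), record the exception instead of failing the run
    @catch(var="transform_error")
    @retry(times=2)
    @step
    def transform_chunk(self):
        self.results = [transform_model(url) for url in self.input]
        self.next(self.join)

    @step
    def join(self, inputs):
        # transform_error is None where the branch succeeded
        self.failures = [inp.transform_error for inp in inputs
                         if inp.transform_error is not None]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RobustTransformFlow()
```

I have also seen `resume --origin-run-id sfn-...` mentioned for debugging scheduled runs; is that the intended recovery path here, or does it not apply once a branch has hard-failed?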