# ask-metaflow
Hello again! I have a question regarding optimal resource usage. Here is the setup: we have ~7k 3D models in an S3 bucket and we would like to run a data transformation flow on them using Step Functions / Batch. The flow currently looks something like this:

Start -> Collect model URLs from S3 -> foreach model URL (potentially 7,000) -> Transform 3D model

The transform task itself is quite light and takes around 10-30 seconds. Running this flow via Step Functions, we notice two issues:

• Performance: launching a container and bootstrapping the cached conda environment takes far longer than the actual work to be done. Every step takes 3 minutes just to start up and shut down.
• Robustness: if one of the branches chokes on bad data or goes OOM, the whole workflow crashes and I cannot resume with modified code or resources (right?)

That makes us wonder whether we are utilising Metaflow correctly. Should the transformation of the data happen in one step instead of being fanned out? On my local machine this works out perfectly, of course: it utilises my available CPU cores and parallelizes the steps as much as possible. In the distributed environment that leads to a huge performance penalty.

Questions:
• Is it possible to make Metaflow / Batch re-use the same container for executing the steps?
  ◦ And even parallelize within the container using multiple cores?
• Or is the recommended way to not fan out tasks that take less time than launching and bootstrapping a container?
• I assumed that for a large workflow one could inspect a faulty branch, fix the data or the code, and resume. If that is not the case, how does one write robust Metaflow code? Do I need to wrap everything in a try/catch block to avoid re-running the whole workflow on every iteration?
There is no way to reuse the same container currently, as we guarantee a clean execution environment for each task when running remotely.
A recommended approach is to create shards/batches of items so that each task takes at least a few minutes to process: that's enough work that the startup latency isn't a huge issue, but not so much that a failed task wastes a lot of work.
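The sharding idea can be sketched with a small helper (hypothetical names; shard size is a tunable assumption, not a Metaflow requirement):

```python
# Hypothetical helper: group ~7k model URLs into shards so each foreach
# task gets a few minutes of work instead of 10-30 seconds.
def make_shards(items, shard_size):
    """Split items into consecutive shards of at most shard_size items."""
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]


urls = ["s3://bucket/model-%04d.glb" % i for i in range(7000)]

# At ~30 s per model, shards of 10 give ~5 min of work per task,
# amortizing the ~3 min container startup reported above.
shards = make_shards(urls, 10)  # 700 shards of 10 URLs each
```

In the flow you would then fan out over `shards` instead of individual URLs, i.e. `self.next(self.transform, foreach="shards")`, and loop over `self.input` inside the transform step.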
You can also look into optimizing S3 access: see this previous thread and the performance tips here.
For starters, definitely add `@retry` (or run `--with retry`) to handle transient failures. If you want to prevent the whole workflow from failing when a single task fails, use `@catch`.
You can certainly use `resume` to resume a past execution.
Also, you can take a look at this custom decorator, which you can run repeatedly to process items incrementally.