Hello again!
I have a question regarding optimal resource usage.
Here is the setup:
We have ~7k 3D models in an S3 bucket, and we would like to run a data transformation flow over them using Step Functions / AWS Batch.
The flow currently looks something like this:
Start -> Collect model URLs from S3
foreach model URL (potentially 7,000) -> Transform 3D model
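For reference, here is a minimal sketch of what the flow looks like today (TransformFlow, transform_model, and the bucket name are placeholders for our actual code):

```python
from metaflow import FlowSpec, step

def transform_model(url):
    ...  # placeholder for our actual 10-30 s transformation

class TransformFlow(FlowSpec):

    @step
    def start(self):
        import boto3
        # collect all model keys from the bucket (bucket name is made up)
        s3 = boto3.client("s3")
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket="our-model-bucket")
        self.model_urls = [obj["Key"] for page in pages
                           for obj in page.get("Contents", [])]
        # fan out: one Batch task per model, potentially ~7000 branches
        self.next(self.transform, foreach="model_urls")

    @step
    def transform(self):
        # self.input is the single model URL assigned to this branch
        self.result = transform_model(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TransformFlow()
```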
The transform task itself is quite light: it takes around 10-30 seconds per model.
Running this flow via Step Functions, we notice two issues:
Performance: launching a container and bootstrapping the cached conda environment takes far longer than the actual work to be done. Every step spends about 3 minutes just starting up and shutting down.
Robustness: if one of the branches chokes on bad data or goes OOM, the whole workflow fails, and I cannot resume it with modified code or resources (right?).
That makes us wonder whether we are using Metaflow correctly. Should the transformation of the data happen in one step instead of being fanned out?
On my local machine this works out perfectly, of course: Metaflow uses the available CPU cores and parallelizes the steps as much as possible. In the distributed environment, the same fan-out incurs a huge performance penalty.
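To make the question concrete, this is the kind of restructuring we are considering, assuming chunking is even the right idea: fan out over chunks of URLs rather than individual models, and use the container's cores inside each task. The chunk size and helper names here are made up:

```python
from concurrent.futures import ProcessPoolExecutor
from metaflow import FlowSpec, step

def transform_model(url):
    ...  # placeholder, as above

def collect_model_urls():
    ...  # same S3 listing as in the first sketch

class ChunkedTransformFlow(FlowSpec):

    @step
    def start(self):
        self.model_urls = collect_model_urls()
        chunk = 100  # hypothetical chunk size: ~70 branches instead of ~7000
        self.url_chunks = [self.model_urls[i:i + chunk]
                           for i in range(0, len(self.model_urls), chunk)]
        self.next(self.transform_chunk, foreach="url_chunks")

    @step
    def transform_chunk(self):
        # one container per chunk; amortise the ~3 min startup over ~100
        # models and use all of the container's cores for them
        with ProcessPoolExecutor() as pool:
            self.results = list(pool.map(transform_model, self.input))
        self.next(self.join)

    @step
    def join(self, inputs):
        # flatten the per-chunk results back into one list
        self.results = [r for inp in inputs for r in inp.results]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ChunkedTransformFlow()
```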
Questions:
• Is it possible to make Metaflow / Batch re-use the same container for executing the steps?
◦ And even parallelize within the container using multiple cores?
• Or is the recommended approach not to fan out tasks that take less time than launching and bootstrapping a container?
• I assumed that for a large workflow one could inspect a faulty branch, fix the data or the code, and resume. If that is not the case, how does one write robust Metaflow code? Do I need to resort to wrapping everything in a try/except block (see the sketch below) if I want to avoid re-running the whole workflow on every iteration?
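For context, this is the pattern I imagine if @catch and @retry are the intended tools here; the decorators are from the Metaflow docs, but whether this is the recommended way to handle bad data / OOM is exactly what I am asking (helper names are placeholders):

```python
from metaflow import FlowSpec, catch, retry, step

def transform_model(url):
    ...  # placeholder, as above

def chunked_model_urls():
    ...  # S3 listing plus chunking, as in the previous sketch

class RobustTransformFlow(FlowSpec):

    @step
    def start(self):
        self.url_chunks = chunked_model_urls()
        self.next(self.transform_chunk, foreach="url_chunks")

    # retry transient failures first; if a branch still fails
    # (bad data, OOM), record the exception instead of failing the run
    @catch(var="transform_error")
    @retry(times=2)
    @step
    def transform_chunk(self):
        self.results = [transform_model(url) for url in self.input]
        self.next(self.join)

    @step
    def join(self, inputs):
        # transform_error is None where the branch succeeded
        self.failures = [inp.transform_error for inp in inputs
                         if inp.transform_error is not None]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RobustTransformFlow()
```

I have also seen `resume --origin-run-id sfn-...` mentioned for debugging scheduled runs; is that the intended recovery path here, or does it not apply once a branch has hard-failed?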