# ask-metaflow
c
I have a large data processing pipeline that may either:
• be executed locally on a huge node (400 CPUs, 16 GPUs), or
• be executed in Batch on g4dn.xlarge instances (or similar).
What are good practices when writing workflows for such different types of compute? Is it recommended to write two different workflows? If not, how can I shard my data accordingly in `foreach` steps?
v
hi Corentin! In that case, two separate workflows might make the most sense. You could have an additional "bootstrapper" workflow that contains the logic to decide which path to take; it would then trigger one of the worker flows through an event.
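As a rough illustration of that bootstrapper pattern (a sketch only, assuming the worker flows are deployed to an orchestrator that supports event triggering, such as Argo Workflows; the flow name, event names, and decision logic below are hypothetical):

```python
# Sketch of a "bootstrapper" flow that decides which worker flow to trigger.
# Assumes the worker flows are deployed with @trigger(event=...) on an
# event-capable orchestrator; names and thresholds are placeholders.
from metaflow import FlowSpec, step
from metaflow.integrations import ArgoEvent


class BootstrapperFlow(FlowSpec):

    def estimate_workload(self):
        # Placeholder: in practice, inspect the dataset size, config, etc.
        return 500_000

    @step
    def start(self):
        run_locally = self.estimate_workload() < 1_000_000
        # Each worker flow would declare @trigger(event=<name>) for its event.
        event_name = "run_local_worker" if run_locally else "run_batch_worker"
        ArgoEvent(name=event_name).publish(payload={"requested_by": "bootstrapper"})
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BootstrapperFlow()
```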
another approach, if you are planning to run the flow locally on a huge node, would be to have two static branches and make the path that's not needed a no-op on the fly. A downside is that you need to occupy a node for roughly tens of seconds just to run the no-op task.
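A sketch of that static-branch approach might look like the following; the flow, step, and parameter names are hypothetical, and `use_local` stands in for whatever run-time signal decides which branch does real work:

```python
# Sketch: two static branches, one of which becomes a no-op at run time.
from metaflow import FlowSpec, Parameter, step


class DualPathFlow(FlowSpec):

    use_local = Parameter("use_local", default=True,
                          help="If true, the local path does the work; "
                               "otherwise the Batch path does.")

    @step
    def start(self):
        # Static branch: both paths always exist in the DAG.
        self.next(self.local_path, self.batch_path)

    @step
    def local_path(self):
        if self.use_local:
            pass  # heavy processing sized for the 400-CPU / 16-GPU node
        self.next(self.join)

    @step
    def batch_path(self):
        if not self.use_local:
            pass  # processing sized for g4dn.xlarge-style instances
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DualPathFlow()
```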
c
Would you recommend against decorating a method with `@batch` on the fly depending on CLI args? Alternatively, is it trivial to tell from `current` whether the current method is running under `@batch` or not?
Also, is a sample-level `foreach` the intended method for horizontally scaling nodes? I'm asking because `--max-num-splits` being set to 100 by default makes me think that I should not increase it to 1M samples. In that case, is mini-batching or sharding the dataset the recommended method?
v
yeah, you could attach a `@batch` decorator on the fly too. There are a few ways to do it:
• You can pass `--with batch` as a CLI option.
• You can set the `METAFLOW_DECOSPECS=batch` environment variable.
• Or you could even make a simple custom decorator (see inspiration here) that attaches `@batch` conditionally only to certain steps, depending on an environment variable, a config file, etc. (a sketch of this option follows below).
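As a rough sketch of that conditional-decorator idea (the `maybe_batch` helper and the `USE_BATCH` environment variable are hypothetical names, not part of Metaflow's API):

```python
# Sketch: attach @batch to a step only when USE_BATCH=1 is set.
import os

from metaflow import batch


def maybe_batch(**batch_kwargs):
    """Apply @batch(**batch_kwargs) only if USE_BATCH=1 is in the environment."""
    def wrapper(step_func):
        if os.environ.get("USE_BATCH") == "1":
            return batch(**batch_kwargs)(step_func)
        return step_func
    return wrapper


# Usage inside a FlowSpec (the regular @step decorator is still required):
#
#     @maybe_batch(cpu=8, gpu=1, memory=16000)
#     @step
#     def train(self):
#         ...
```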
re: how to know if you are running on `@batch`, you can use this snippet:

```python
import os

# AWS Batch sets AWS_BATCH_JOB_ID inside every Batch job, so its presence
# means the task is running remotely on Batch.
is_running_on_batch = bool(os.environ.get('AWS_BATCH_JOB_ID'))
```
re: horizontal scaling - you can have tens of thousands of splits, with up to thousands of tasks running in parallel (depending on your compute environment). There is some latency in starting a task, so it wouldn't make sense to launch a separate task to process a single sample if processing it takes less than, say, 30 seconds. Hence you usually want to shard samples so that you end up with at most thousands of shards in total, or fewer, depending on how expensive it is to process a sample (see the sketch below).
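For example, a sharded `foreach` could look roughly like this; the flow name, the shard count, and the per-sample work are placeholders:

```python
# Sketch: shard samples into a bounded number of chunks and fan out over the
# shards with foreach, instead of creating one task per sample.
from metaflow import FlowSpec, Parameter, step


class ShardedFlow(FlowSpec):

    num_shards = Parameter("num_shards", default=1000,
                           help="Upper bound on the foreach fan-out")

    @step
    def start(self):
        samples = list(range(100_000))  # placeholder for real sample IDs
        n = min(self.num_shards, len(samples))
        # Round-robin split: each foreach task gets a whole shard of samples.
        self.shards = [samples[i::n] for i in range(n)]
        self.next(self.process, foreach="shards")

    @step
    def process(self):
        # self.input is one shard; do the per-sample work here.
        self.result = [s * 2 for s in self.input]  # placeholder work
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [r for inp in inputs for r in inp.result]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShardedFlow()
```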
c
Thank you for these answers, this is very helpful
🙌 1