# ask-metaflow
c
I have a large data processing pipeline that may either:
• be executed locally on a huge node (400 CPUs, 16 GPUs), or
• be executed in Batch on g4dn.xlarge instances (or similar).
What are good practices when writing workflows for such different types of compute? Is it recommended to write two different workflows? If not, how can I shard my data accordingly in `foreach` steps?
v
hi Corentin! In that case, two separate workflows might make the most sense. You could have an additional "bootstrapper" workflow that contains the logic to decide which path to take; it would then trigger one of the worker flows through an event.
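As a rough illustration of that bootstrapper pattern (a sketch only, assuming the worker flows are deployed to an orchestrator that supports event triggering, such as Argo Workflows; the flow name, event names, and decision logic below are hypothetical):

```python
# Sketch of a "bootstrapper" flow that decides which worker flow to trigger.
# Assumes the worker flows are deployed with @trigger(event=...) on an
# event-capable orchestrator; names and thresholds are placeholders.
from metaflow import FlowSpec, step
from metaflow.integrations import ArgoEvent


class BootstrapperFlow(FlowSpec):

    def estimate_workload(self):
        # Placeholder: in practice, inspect the dataset size, config, etc.
        return 500_000

    @step
    def start(self):
        run_locally = self.estimate_workload() < 1_000_000
        # Each worker flow would declare @trigger(event=<name>) for its event.
        event_name = "run_local_worker" if run_locally else "run_batch_worker"
        ArgoEvent(name=event_name).publish(payload={"requested_by": "bootstrapper"})
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BootstrapperFlow()
```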
another approach, if you are planning to run the flow locally on a huge node, would be to have two static branches and make the path that's not needed a no-op on the fly. A downside is that you need to occupy a node for roughly tens of seconds just to run the no-op task.
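A sketch of that static-branch approach might look like the following; the flow, step, and parameter names are hypothetical, and `use_local` stands in for whatever run-time signal decides which branch does real work:

```python
# Sketch: two static branches, one of which becomes a no-op at run time.
from metaflow import FlowSpec, Parameter, step


class DualPathFlow(FlowSpec):

    use_local = Parameter("use_local", default=True,
                          help="If true, the local path does the work; "
                               "otherwise the Batch path does.")

    @step
    def start(self):
        # Static branch: both paths always exist in the DAG.
        self.next(self.local_path, self.batch_path)

    @step
    def local_path(self):
        if self.use_local:
            pass  # heavy processing sized for the 400-CPU / 16-GPU node
        self.next(self.join)

    @step
    def batch_path(self):
        if not self.use_local:
            pass  # processing sized for g4dn.xlarge-style instances
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DualPathFlow()
```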
c
Would you recommend against decorating a method with `@batch` on the fly depending on CLI args? Alternatively, is it trivial to tell from `current` whether the current method is running under `@batch` or not?
Also, is a sample-level `foreach` the intended method for horizontally scaling nodes? I'm asking because `--max-num-splits` being set to 100 by default makes me think that I should not increase it to 1M samples. In that case, is mini-batching or sharding the dataset the recommended method?
v
yeah, you could attach a `@batch` decorator on the fly too. There are a few ways to do it:
• You can pass `--with batch` as a CLI option.
• You can set the `METAFLOW_DECOSPECS=batch` environment variable.
• Or you could even make a simple custom decorator (see inspiration here) that attaches `@batch` conditionally only to certain steps, depending on an environment variable, a config file, etc. (a sketch of this option follows below).
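As a rough sketch of that conditional-decorator idea (the `maybe_batch` helper and the `USE_BATCH` environment variable are hypothetical names, not part of Metaflow's API):

```python
# Sketch: attach @batch to a step only when USE_BATCH=1 is set.
import os

from metaflow import batch


def maybe_batch(**batch_kwargs):
    """Apply @batch(**batch_kwargs) only if USE_BATCH=1 is in the environment."""
    def wrapper(step_func):
        if os.environ.get("USE_BATCH") == "1":
            return batch(**batch_kwargs)(step_func)
        return step_func
    return wrapper


# Usage inside a FlowSpec (the regular @step decorator is still required):
#
#     @maybe_batch(cpu=8, gpu=1, memory=16000)
#     @step
#     def train(self):
#         ...
```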
re: how to know if you are running on `@batch`, you can use this snippet:

```python
import os

# AWS Batch sets AWS_BATCH_JOB_ID inside every Batch job, so its presence
# means the task is running remotely on Batch.
is_running_on_batch = bool(os.environ.get('AWS_BATCH_JOB_ID'))
```
re: horizontal scaling - you can have tens of thousands of splits, with up to thousands of tasks running in parallel (depending on your compute environment). There is some latency in starting a task, so it wouldn't make sense to launch a separate task to process a single sample if processing it takes less than, say, 30 seconds. Hence you usually want to shard samples so that you end up with at most thousands of shards in total, or fewer, depending on how expensive it is to process a sample (see the sketch below).
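For example, a sharded `foreach` could look roughly like this; the flow name, the shard count, and the per-sample work are placeholders:

```python
# Sketch: shard samples into a bounded number of chunks and fan out over the
# shards with foreach, instead of creating one task per sample.
from metaflow import FlowSpec, Parameter, step


class ShardedFlow(FlowSpec):

    num_shards = Parameter("num_shards", default=1000,
                           help="Upper bound on the foreach fan-out")

    @step
    def start(self):
        samples = list(range(100_000))  # placeholder for real sample IDs
        n = min(self.num_shards, len(samples))
        # Round-robin split: each foreach task gets a whole shard of samples.
        self.shards = [samples[i::n] for i in range(n)]
        self.next(self.process, foreach="shards")

    @step
    def process(self):
        # self.input is one shard; do the per-sample work here.
        self.result = [s * 2 for s in self.input]  # placeholder work
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [r for inp in inputs for r in inp.result]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShardedFlow()
```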
c
Thank you for these answers, this is very helpful
🙌 1