# ask-metaflow
n
Hi team, I have a general design question around loading Parquet data from S3 for training. I'm working on moving some of our model training over to Metaflow using the AWS Batch stack. One of our models uses ~1TB of partitioned Parquet files (~300MB each) stored in S3. The model itself is small and blazes through batches, so we're often I/O bound.

Currently, our ML scientists run this on an on-prem machine and just download all the data to disk before training. In theory, I could do something similar with Metaflow by adding more EBS to the compute environment and downloading everything in the training step, but that feels a bit clunky. I've played around with the fast data loading approach using tmpfs, and it's really fast (nice work!). So I was wondering if that's something I should be leaning on for data staging, or if it makes more sense to stick with a traditional dataset/dataloader setup and scale up instance RAM, cores, and num_workers to get past any data-loading bottlenecks.

I guess my main question is: is there anything Metaflow-specific I should be taking advantage of here?
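For reference, the "traditional dataset/dataloader" setup I have in mind is roughly the sketch below - assuming PyTorch (hence the num_workers knob) plus pyarrow streaming Parquet straight from S3. The class name, bucket path, and tuning numbers are all placeholders:

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class S3ParquetStream(IterableDataset):
    """Streams record batches from S3; each DataLoader worker reads its own slice of files."""

    def __init__(self, parquet_paths, rows_per_batch=4096):
        # pyarrow's S3FileSystem takes "bucket/key" paths without the s3:// prefix
        self.parquet_paths = parquet_paths
        self.rows_per_batch = rows_per_batch

    def __iter__(self):
        fs = pafs.S3FileSystem()  # created per worker; credentials/region come from the environment
        info = get_worker_info()
        # shard the file list across DataLoader workers
        paths = (self.parquet_paths if info is None
                 else self.parquet_paths[info.id::info.num_workers])
        for path in paths:
            with fs.open_input_file(path) as f:
                for batch in pq.ParquetFile(f).iter_batches(batch_size=self.rows_per_batch):
                    # one tensor per record batch; adapt to the model's real feature columns
                    yield torch.from_numpy(batch.to_pandas().to_numpy(dtype="float32"))


# placeholder path and numbers: scale num_workers / prefetch_factor until the model stops starving
paths = ["my-bucket/training-data/part-00001.parquet"]
loader = DataLoader(S3ParquetStream(paths), batch_size=None,
                    num_workers=16, prefetch_factor=4)
```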
s
fast data loading using `metaflow.S3` + tmpfs should work in this case - that will avoid the need for more EBS.
also, you might want to look into the `foreach` construct - that might help in sharding your work across more nodes
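something along the lines of this sketch - assuming the `@batch` tmpfs knobs (`use_tmpfs`, `tmpfs_size`) and `S3.get_many`; the flow name, s3 prefix, chunk size, and resource numbers are placeholders you'd tune:

```python
from metaflow import FlowSpec, step, batch, S3


class ParquetTrainFlow(FlowSpec):

    @step
    def start(self):
        # enumerate the partitioned Parquet files once, up front
        with S3(s3root="s3://my-bucket/training-data/") as s3:  # placeholder prefix
            self.parquet_urls = [obj.url for obj in s3.list_recursive()]
        self.next(self.train)

    # use_tmpfs mounts a memory-backed volume for the task; tmpfs_size is in MB and
    # has to fit inside the task's memory allocation, so pull the data through it in
    # chunks rather than all 1TB at once
    @batch(memory=128000, cpu=32, use_tmpfs=True, tmpfs_size=64000)
    @step
    def train(self):
        import pandas as pd
        chunk = 100  # ~300MB files -> ~30GB per chunk; tune against tmpfs_size
        for i in range(0, len(self.parquet_urls), chunk):
            with S3() as s3:
                # get_many downloads in parallel at high throughput; with tmpfs
                # enabled the temp files land in RAM rather than on EBS
                objs = s3.get_many(self.parquet_urls[i:i + chunk])
                frames = [pd.read_parquet(o.path) for o in objs]
            df = pd.concat(frames)
            # ... run training iterations over df ...
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ParquetTrainFlow()
```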
n
thanks, Savin. not sure I'm following how `foreach` would work here. I can definitely split the list of parquet files into N groups and run those in parallel, but wouldn't that just train a bunch of separate models? or is there something obvious I'm missing here?
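to make the concern concrete, this is roughly the shape I had in mind (names, paths, and the shard count are placeholders) - each parallel train task only sees its own shard via `self.input`, so it would fit one model per shard:

```python
from metaflow import FlowSpec, step


class ShardedTrainFlow(FlowSpec):

    @step
    def start(self):
        all_urls = ["s3://my-bucket/training-data/part-00001.parquet"]  # placeholder file list
        self.shards = [all_urls[i::4] for i in range(4)]                # split into N=4 groups
        self.next(self.train, foreach="shards")

    @step
    def train(self):
        self.my_shard = self.input  # each parallel task sees only its own group of files
        # training here would produce one model per shard - the "separate models" concern
        self.next(self.join)

    @step
    def join(self, inputs):
        # would somehow have to merge N independently trained models here
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShardedTrainFlow()
```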
s
ah - i misread the question as wanting to do inference