# ask-metaflow
n
Hi team, I have a general design question around loading Parquet data from S3 for training. I'm working on moving some of our model training over to Metaflow using the AWS Batch stack. One of our models uses ~1TB of partitioned Parquet files (~300MB each) stored in S3. The model itself is small and blazes through batches, so we're often I/O bound.

Currently, our ML scientists run this on an on-prem machine and just download all the data to disk before training. In theory, I could do something similar with Metaflow by adding more EBS to the compute environment and downloading everything in the training step, but that feels a bit clunky. I've played around with the fast data loading approach using tmpfs, and it's really fast (nice work!). So I was wondering if that's something I should be leaning on for data staging, or if it makes more sense to stick with a traditional dataset/dataloader setup and scale up instance RAM, cores, and num_workers to get past any data-loading bottlenecks.

I guess my main question is: is there anything Metaflow-specific I should be taking advantage of here?
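For reference, the "traditional dataset/dataloader" setup I have in mind is roughly the sketch below - assuming PyTorch (hence the num_workers knob) plus pyarrow streaming Parquet straight from S3. The class name, bucket path, and tuning numbers are all placeholders:

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class S3ParquetStream(IterableDataset):
    """Streams record batches from S3; each DataLoader worker reads its own slice of files."""

    def __init__(self, parquet_paths, rows_per_batch=4096):
        # pyarrow's S3FileSystem takes "bucket/key" paths without the s3:// prefix
        self.parquet_paths = parquet_paths
        self.rows_per_batch = rows_per_batch

    def __iter__(self):
        fs = pafs.S3FileSystem()  # created per worker; credentials/region come from the environment
        info = get_worker_info()
        # shard the file list across DataLoader workers
        paths = (self.parquet_paths if info is None
                 else self.parquet_paths[info.id::info.num_workers])
        for path in paths:
            with fs.open_input_file(path) as f:
                for batch in pq.ParquetFile(f).iter_batches(batch_size=self.rows_per_batch):
                    # one tensor per record batch; adapt to the model's real feature columns
                    yield torch.from_numpy(batch.to_pandas().to_numpy(dtype="float32"))


# placeholder path and numbers: scale num_workers / prefetch_factor until the model stops starving
paths = ["my-bucket/training-data/part-00001.parquet"]
loader = DataLoader(S3ParquetStream(paths), batch_size=None,
                    num_workers=16, prefetch_factor=4)
```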
s
fast data loading using `metaflow.S3` + tmpfs should work in this case - that will avoid the need for more EBS.
also, you might want to look into the `foreach` construct - that might help in sharding your work across more nodes
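something along the lines of this sketch - assuming the `@batch` tmpfs knobs (`use_tmpfs`, `tmpfs_size`) and `S3.get_many`; the flow name, s3 prefix, chunk size, and resource numbers are placeholders you'd tune:

```python
from metaflow import FlowSpec, step, batch, S3


class ParquetTrainFlow(FlowSpec):

    @step
    def start(self):
        # enumerate the partitioned Parquet files once, up front
        with S3(s3root="s3://my-bucket/training-data/") as s3:  # placeholder prefix
            self.parquet_urls = [obj.url for obj in s3.list_recursive()]
        self.next(self.train)

    # use_tmpfs mounts a memory-backed volume for the task; tmpfs_size is in MB and
    # has to fit inside the task's memory allocation, so pull the data through it in
    # chunks rather than all 1TB at once
    @batch(memory=128000, cpu=32, use_tmpfs=True, tmpfs_size=64000)
    @step
    def train(self):
        import pandas as pd
        chunk = 100  # ~300MB files -> ~30GB per chunk; tune against tmpfs_size
        for i in range(0, len(self.parquet_urls), chunk):
            with S3() as s3:
                # get_many downloads in parallel at high throughput; with tmpfs
                # enabled the temp files land in RAM rather than on EBS
                objs = s3.get_many(self.parquet_urls[i:i + chunk])
                frames = [pd.read_parquet(o.path) for o in objs]
            df = pd.concat(frames)
            # ... run training iterations over df ...
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ParquetTrainFlow()
```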
n
thanks, Savin. not sure I'm following how `foreach` would work here. I can definitely split the list of parquet files into N groups and run those in parallel, but wouldn't that just train a bunch of separate models? or is there something obvious I'm missing here?
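to make the concern concrete, this is roughly the shape I had in mind (names, paths, and the shard count are placeholders) - each parallel train task only sees its own shard via `self.input`, so it would fit one model per shard:

```python
from metaflow import FlowSpec, step


class ShardedTrainFlow(FlowSpec):

    @step
    def start(self):
        all_urls = ["s3://my-bucket/training-data/part-00001.parquet"]  # placeholder file list
        self.shards = [all_urls[i::4] for i in range(4)]                # split into N=4 groups
        self.next(self.train, foreach="shards")

    @step
    def train(self):
        self.my_shard = self.input  # each parallel task sees only its own group of files
        # training here would produce one model per shard - the "separate models" concern
        self.next(self.join)

    @step
    def join(self, inputs):
        # would somehow have to merge N independently trained models here
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShardedTrainFlow()
```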
s
ah - i misread the question as wanting to do inference