# dev-metaflow
Hey all, I have a bit of a paradigm question: the output of one of our steps is a few large Parquet files (probably 5 GB in total) in a tree format: there's a folder at the top called data.parquet, which contains files 00.parquet, 01.parquet, etc., and Pandas knows how to read the folder as one table. We would like to keep all of these files on S3 and in the same flow output, since they come from the same data source. We would also like to be able to read these files by just pointing Pandas at the S3 location, rather than downloading everything just to open the tables. What's the best way to store all the data? It seems like the way to go would be to write a helper function that moves the entire folder up to S3 using the metaflow.S3 tool and then saves the paths in an artifact, but I want to get an expert opinion on it.
something like this could work
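(a minimal sketch along those lines, assuming a local data.parquet/ folder of parts; the step layout and the `download()` helper name are placeholders, not an official API)
```python
import os
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from metaflow import FlowSpec, S3, step


def download(s3_urls):
    # get_many fetches all parts in parallel; obj.blob hands back the raw
    # bytes, so the data goes straight from S3 into memory.
    with S3() as s3:
        parts = [pq.read_table(BytesIO(obj.blob)) for obj in s3.get_many(s3_urls)]
    return pa.concat_tables(parts)


class ParquetFlow(FlowSpec):

    @step
    def start(self):
        local_dir = "data.parquet"  # folder with 00.parquet, 01.parquet, ...
        key_paths = [
            (f"data.parquet/{name}", os.path.join(local_dir, name))
            for name in sorted(os.listdir(local_dir))
            if name.endswith(".parquet")
        ]
        # S3(run=self) uploads under this run's datastore prefix, so the
        # artifact only needs to remember the resulting S3 URLs.
        with S3(run=self) as s3:
            self.parquet_urls = [url for _, url in s3.put_files(key_paths)]
        self.next(self.end)

    @step
    def end(self):
        # read the whole folder back as one in-memory table
        table = download(self.parquet_urls)
        print(table.num_rows, "rows")


if __name__ == "__main__":
    ParquetFlow()
```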
this example stores the Parquet locations as artifacts, which is a good pattern
note that if your data fits in memory, which is probably the case if the data is 5 GB in total, the `download()` function doesn't actually store anything on disk but moves data directly from S3 to memory, which is very fast, especially on a large cloud instance
this example uses `pyarrow` to load data, which is very efficient and which Pandas also uses behind the scenes in many cases
especially with large amounts of data, `metaflow.S3` can be faster than the native S3 loading in Pandas, since it is more heavily parallelized
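for the "point Pandas at the S3 location" part of the question, both routes below work; the bucket path here is hypothetical
```python
from io import BytesIO

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from metaflow import S3

PREFIX = "s3://my-bucket/some/run/data.parquet"  # hypothetical location

# native Pandas route: point read_parquet at the folder; a simple one-liner
# that relies on fsspec/s3fs (or pyarrow's S3 filesystem) being installed
df = pd.read_parquet(PREFIX)

# metaflow.S3 route: pull every part under the prefix in parallel, then
# hand the in-memory bytes to pyarrow
with S3() as s3:
    parts = [pq.read_table(BytesIO(obj.blob)) for obj in s3.get_recursive([PREFIX])]
df = pa.concat_tables(parts).to_pandas()
```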