# dev-metaflow
Hey all, I have a bit of a paradigm question: the output of one of our steps is a few large Parquet files (probably 5 GB in total) in a tree format: there's a folder at the top called data.parquet, which contains files 00.parquet, 01.parquet, etc., and Pandas knows how to read the folder as one table. We would like to keep all of these files on S3 and in the same flow output, since they come from the same data source. We would also like to be able to read these files by just pointing Pandas at the S3 location, rather than downloading everything just to open the tables. What's the best way to store all the data? It seems like the way to go would be to write a helper function that moves the entire folder up to S3 using the metaflow.S3 tool and then saves the paths in an artifact, but I want to get an expert opinion on it.
something like this could work
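(a minimal sketch along those lines, assuming a local data.parquet/ folder of parts; the step layout and the `download()` helper name are placeholders, not an official API)
```python
import os
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from metaflow import FlowSpec, S3, step


def download(s3_urls):
    # get_many fetches all parts in parallel; obj.blob hands back the raw
    # bytes, so the data goes straight from S3 into memory.
    with S3() as s3:
        parts = [pq.read_table(BytesIO(obj.blob)) for obj in s3.get_many(s3_urls)]
    return pa.concat_tables(parts)


class ParquetFlow(FlowSpec):

    @step
    def start(self):
        local_dir = "data.parquet"  # folder with 00.parquet, 01.parquet, ...
        key_paths = [
            (f"data.parquet/{name}", os.path.join(local_dir, name))
            for name in sorted(os.listdir(local_dir))
            if name.endswith(".parquet")
        ]
        # S3(run=self) uploads under this run's datastore prefix, so the
        # artifact only needs to remember the resulting S3 URLs.
        with S3(run=self) as s3:
            self.parquet_urls = [url for _, url in s3.put_files(key_paths)]
        self.next(self.end)

    @step
    def end(self):
        # read the whole folder back as one in-memory table
        table = download(self.parquet_urls)
        print(table.num_rows, "rows")


if __name__ == "__main__":
    ParquetFlow()
```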
this example stores the Parquet locations as artifacts, which is a good pattern
note that if your data fits in memory, which is probably the case if the data is 5 GB in total, the `download()` function doesn't actually store anything on disk but moves data directly from S3 to memory, which is very fast, especially on a large cloud instance
this example uses `pyarrow` to load data, which is very efficient and which Pandas also uses behind the scenes in many cases
especially with large amounts of data, `metaflow.S3` can be faster than the native S3 loading in Pandas, since it is more heavily parallelized
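for the "point Pandas at the S3 location" part of the question, both routes below work; the bucket path here is hypothetical
```python
from io import BytesIO

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from metaflow import S3

PREFIX = "s3://my-bucket/some/run/data.parquet"  # hypothetical location

# native Pandas route: point read_parquet at the folder; a simple one-liner
# that relies on fsspec/s3fs (or pyarrow's S3 filesystem) being installed
df = pd.read_parquet(PREFIX)

# metaflow.S3 route: pull every part under the prefix in parallel, then
# hand the in-memory bytes to pyarrow
with S3() as s3:
    parts = [pq.read_table(BytesIO(obj.blob)) for obj in s3.get_recursive([PREFIX])]
df = pa.concat_tables(parts).to_pandas()
```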