# dev-metaflow
i
are there still plans to release Netflix's internal Pandas implementation? or is that up to Netflix? if not, are there plans to integrate with another performant dataframe library like modin?
d
we don’t have an internal “pandas” implementation but we do have more data tools. There is nothing Netflix-specific about it, but there's no specific timeline yet since releasing it may be a bit more complex (it’s not pure Python). If there is interest (and depending on what specifically that interest is), we can see if we can release at least part of it easily. @enough-nest-7788 for visibility.
e
Hey @important-jewelry-43189 - Can you give a sense of what types of operations you'd be interested in? Our current version is geared around providing a stable platform for getting data from the warehouse fast, decoding parquet, and doing some basic table manipulations (e.g. fast filtering of rows). A common pattern at Netflix is to use our tools to retrieve and trim your data before converting to pandas.
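A rough sketch of that retrieve-and-trim pattern using only open-source pieces (not the internal Netflix utilities): pull the parquet file with Metaflow's S3 client, decode just the columns and rows you need with pyarrow, then convert the trimmed table to pandas. The bucket path and column names are hypothetical.

```python
import pyarrow.parquet as pq
from metaflow import S3

# Fetch the parquet file from S3 (hypothetical path).
with S3() as s3:
    obj = s3.get("s3://my-bucket/warehouse/events.parquet")
    # Decode only the columns we care about and filter rows while reading.
    table = pq.read_table(
        obj.path,
        columns=["user_id", "score"],
        filters=[("score", ">", 0.5)],
    )

# Hand the (much smaller) table to pandas for the rest of the analysis.
df = table.to_pandas()
```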
i
For starters, Pandas uses a single core, so if there were a multicore drop-in replacement, that would already be great.
One painful op in Pandas is mapping a function to all rows using apply(). Writing a >1 TB sharded parquet dataset also takes forever on a single core.
Sometimes, even after trimming our tables, our dataframes are pretty big: >2000 dims and millions of samples.
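A common stopgap for the single-core apply() problem is to shard the frame yourself and fan the chunks out over processes. A minimal sketch, assuming a per-row function score_row that is made up for illustration:

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def score_row(row):
    # Placeholder for whatever per-row logic you would pass to apply().
    return row["a"] * 2 + row["b"]


def apply_chunk(chunk):
    # Each worker runs plain pandas apply() on its own shard.
    return chunk.apply(score_row, axis=1)


if __name__ == "__main__":
    df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.arange(1_000_000)})
    chunks = np.array_split(df, mp.cpu_count())
    with mp.Pool() as pool:
        result = pd.concat(pool.map(apply_chunk, chunks))
```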
e
Makes sense. Thanks for the input. We do offer some multi-core utilities, but it's not a drop-in replacement for pandas. We've been exploring ways to make it easy to write your own high-performance functions for the ops that matter. `apply` is tricky since it can be arbitrary Python that could be very slow. We'll keep discussing internally whether we can make the data utilities public.
🔥 1
i
@enough-nest-7788 FYI I was referencing this comment by @square-wire-39606 https://github.com/Netflix/metaflow/issues/4#issuecomment-565079725 and the "MetaFlow DataFrame" section on the roadmap https://docs.metaflow.org/introduction/roadmap#metaflow-dataframe
e
Yep, that's it. We only provide about 0.01% of pandas functionality today :)
👍 1
u
I think the only Metaflow-specific part of this work is the integration of Arrow/parquet at the datastore layer for large dataframe artifacts. For reading external data (outside the Metaflow datastore), like parquet from S3, pyarrow is already performant and feature-rich. There are already a number of out-of-core dataframe libraries (Vaex, Modin, etc., and more coming regularly) that can be used in a Metaflow step once you've loaded your data, so there's probably not a lot of value added by Metaflow trying to integrate with them directly.
👍 2
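To illustrate the point above: nothing Metaflow-specific is needed to use an out-of-core dataframe library inside a step. A hedged sketch, assuming Modin (backed by Ray or Dask) and s3fs are installed; the S3 path is made up.

```python
from metaflow import FlowSpec, step


class ModinFlow(FlowSpec):

    @step
    def start(self):
        import modin.pandas as mpd

        # Modin mirrors the pandas API but spreads work across cores.
        df = mpd.read_parquet("s3://my-bucket/big-dataset/")
        self.row_count = len(df)
        self.next(self.end)

    @step
    def end(self):
        print("rows:", self.row_count)


if __name__ == "__main__":
    ModinFlow()
```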
e
Right, we could provide other converters (we have `to_pandas()`, `to_arrow()`) today. Internally we don't use `pyarrow` but rather maintain our own build of Arrow C++ and a custom Python interface. We find `pyarrow` to be too unstable for production workflows on its own, so we built the required functionality and package it with metaflow.
👍 1
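For reference, the stock pyarrow equivalents of such converters look roughly like this (plain open-source pyarrow, not the internal Arrow C++ build or its custom interface):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3]})
table = pa.Table.from_pandas(df)           # pandas -> Arrow
df_back = table.to_pandas()                # Arrow -> pandas
pq.write_table(table, "example.parquet")   # Arrow table -> parquet on disk
```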