# dev-metaflow
i
are there still plans to release Netflix's internal Pandas implementation? or is that up to Netflix? if not, are there plans to integrate with another performant dataframe library like modin?
d
we don’t have an internal “pandas” implementation but we do have more data tools. There is nothing Netflix-specific about it, but there's no specific timeline yet since releasing it may be a bit more complex (it’s not pure Python). If there is interest (and depending on what specifically that interest is), we can see if we can release at least part of it easily. @enough-nest-7788 for visibility.
e
Hey @important-jewelry-43189 - Can you give a sense of what types of operations you'd be interested in? Our current version is geared around providing a stable platform for getting data from the warehouse fast, decoding parquet, and doing some basic table manipulations (e.g. fast filtering of rows). A common pattern at Netflix is to use our tools to retrieve and trim your data before converting to pandas.
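A rough sketch of that retrieve-and-trim pattern using only open-source pieces (not the internal Netflix utilities): pull the parquet file with Metaflow's S3 client, decode just the columns and rows you need with pyarrow, then convert the trimmed table to pandas. The bucket path and column names are hypothetical.

```python
import pyarrow.parquet as pq
from metaflow import S3

# Fetch the parquet file from S3 (hypothetical path).
with S3() as s3:
    obj = s3.get("s3://my-bucket/warehouse/events.parquet")
    # Decode only the columns we care about and filter rows while reading.
    table = pq.read_table(
        obj.path,
        columns=["user_id", "score"],
        filters=[("score", ">", 0.5)],
    )

# Hand the (much smaller) table to pandas for the rest of the analysis.
df = table.to_pandas()
```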
i
For starters, Pandas uses a single core, so if there were a multicore drop-in replacement, that would already be great.
One painful op in Pandas is mapping a function to all rows using apply(). Writing a >1 TB sharded parquet dataset also takes forever on a single core.
Sometimes, even after trimming our tables, our dataframes are pretty big: >2000 dims and millions of samples.
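A common stopgap for the single-core apply() problem is to shard the frame yourself and fan the chunks out over processes. A minimal sketch, assuming a per-row function score_row that is made up for illustration:

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def score_row(row):
    # Placeholder for whatever per-row logic you would pass to apply().
    return row["a"] * 2 + row["b"]


def apply_chunk(chunk):
    # Each worker runs plain pandas apply() on its own shard.
    return chunk.apply(score_row, axis=1)


if __name__ == "__main__":
    df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.arange(1_000_000)})
    chunks = np.array_split(df, mp.cpu_count())
    with mp.Pool() as pool:
        result = pd.concat(pool.map(apply_chunk, chunks))
```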
e
Makes sense. Thanks for the input. We do offer some multi-core utilities, but it's not a drop-in replacement for pandas. We've been exploring ways to make it easy to write your own high-performance functions for the ops that matter. `apply` is tricky since it can be arbitrary Python that could be very slow. We'll keep discussing internally whether we can make the data utilities public.
🔥 1
i
@enough-nest-7788 FYI I was referencing this comment by @square-wire-39606 https://github.com/Netflix/metaflow/issues/4#issuecomment-565079725 and the "MetaFlow DataFrame" section on the roadmap https://docs.metaflow.org/introduction/roadmap#metaflow-dataframe
e
Yep, that's it. We only provide about 0.01% of pandas functionality today :)
👍 1
u
I think the only Metaflow-specific part of this work is the integration of Arrow/parquet at the datastore layer for large dataframe artifacts. For reading external data (outside the Metaflow datastore), like parquet from S3, pyarrow is already performant and feature-rich. There are already a number of out-of-core dataframe libraries (Vaex, Modin, etc., and more coming regularly) that can be used in a Metaflow step once you've loaded your data, so there's probably not a lot of value added by Metaflow trying to integrate with them directly.
👍 2
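To illustrate the point above: nothing Metaflow-specific is needed to use an out-of-core dataframe library inside a step. A hedged sketch, assuming Modin (backed by Ray or Dask) and s3fs are installed; the S3 path is made up.

```python
from metaflow import FlowSpec, step


class ModinFlow(FlowSpec):

    @step
    def start(self):
        import modin.pandas as mpd

        # Modin mirrors the pandas API but spreads work across cores.
        df = mpd.read_parquet("s3://my-bucket/big-dataset/")
        self.row_count = len(df)
        self.next(self.end)

    @step
    def end(self):
        print("rows:", self.row_count)


if __name__ == "__main__":
    ModinFlow()
```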
e
Right, we could provide other converters (we have `to_pandas()`, `to_arrow()`) today. Internally we don't use `pyarrow` but rather maintain our own build of Arrow C++ and a custom Python interface. We find `pyarrow` to be too unstable for production workflows on its own, so we built the required functionality and package it with metaflow.
👍 1
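For reference, the stock pyarrow equivalents of such converters look roughly like this (plain open-source pyarrow, not the internal Arrow C++ build or its custom interface):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3]})
table = pa.Table.from_pandas(df)           # pandas -> Arrow
df_back = table.to_pandas()                # Arrow -> pandas
pq.write_table(table, "example.parquet")   # Arrow table -> parquet on disk
```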