# dev-metaflow
e
A friendly note for all those that use Apache Arrow — it looks like Apache Arrow 4.0 may have a memory bloat issue with `to_pandas` (maybe more). I'll investigate and drop a note here.
🙏 3
🐼 1
u
when you say memory bloat issue, you mean separate and away from the usual doubling that occurs in most cases if you don't specify `split_blocks` and `self_destruct`?
u
(fwiw, we've had luck dropping down to Cython and using the C++ wrappers to do more efficient cast from Table to Dataframe for certain types of data that are common for us)
u
in the end there's no escaping the pandas block manager though
e
Right @User. I didn't test explicitly (but will do so later today). I just noticed some workflows started using a lot more memory and that they were using different versions of Apache Arrow (3.0 vs 4.0). Given some prior experience, I had a strong suspicion. I'll investigate further when I get a chance. For now we've used Metaflow's `@conda` decorator to pin `pyarrow` and are waiting for results.
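(A sketch of that kind of pin, assuming a flow run with `--environment=conda`; the flow name and the exact version string are illustrative, not from the thread:)

```python
from metaflow import FlowSpec, step, conda


class ArrowFlow(FlowSpec):

    # Pin pyarrow to the known-good 3.0 line for this step;
    # Metaflow resolves and caches the conda environment at deploy time.
    @conda(libraries={"pyarrow": "3.0.0"})
    @step
    def start(self):
        import pyarrow as pa
        print(pa.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ArrowFlow()
```

Run with `python arrow_flow.py --environment=conda run` so the pinned environment is used instead of the ambient one. (Dependency-pinning configuration sketch; not executed here.)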
@User That's interesting about your use of Cython. I used Cython in the past to bind some custom C++ code (using the Apache Arrow C++ API) to Python for fast operators.
We are now taking a different approach to accomplish the same thing: distributing a static object with Metaflow that contains our C functions, then loading it at runtime using CFFI.
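(The CFFI runtime-loading pattern can be sketched without any Metaflow-specific pieces. Here the standard C library stands in for the shipped object — the actual object path and function names are internal to that setup and not given in the thread:)

```python
from cffi import FFI

ffi = FFI()

# Declare the signature of the C function we want to call.
ffi.cdef("double cos(double x);")

# dlopen(None) loads the standard C library on POSIX systems; in the
# Metaflow case this would instead be the path to the shipped shared
# object containing the project's C functions.
lib = ffi.dlopen(None)

result = lib.cos(0.0)
print(result)  # → 1.0
```

This ABI-level `dlopen` mode needs no compiler at runtime, which is what makes shipping a prebuilt object with the package practical.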
🤔 1