# dev-metaflow
e
A friendly note for all those that use Apache Arrow — it looks like Apache Arrow 4.0 may have a memory bloat issue with `to_pandas` (maybe more). I'll investigate and drop a note here.
🙏 3
🐼 1
u
when you say memory bloat issue, you mean separate and away from the usual doubling that occurs in most cases if you don't specify `split_blocks` and `self_destruct`?
u
(fwiw, we've had luck dropping down to Cython and using the C++ wrappers to do more efficient cast from Table to Dataframe for certain types of data that are common for us)
u
in the end there's no escaping the pandas block manager though
e
Right @User. I didn't test explicitly (but will do so later today). I just noticed some workflows started using a lot more memory and that they were using different versions of Apache Arrow (3.0 vs 4.0). Given some prior experience, I had a strong suspicion. I'll investigate further when I get a chance. For now we've used Metaflow's `@conda` decorator to pin `pyarrow` and are waiting for results.
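(A sketch of that kind of pin, assuming a flow run with `--environment=conda`; the flow name and the exact version string are illustrative, not from the thread:)

```python
from metaflow import FlowSpec, step, conda


class ArrowFlow(FlowSpec):

    # Pin pyarrow to the known-good 3.0 line for this step;
    # Metaflow resolves and caches the conda environment at deploy time.
    @conda(libraries={"pyarrow": "3.0.0"})
    @step
    def start(self):
        import pyarrow as pa
        print(pa.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ArrowFlow()
```

Run with `python arrow_flow.py --environment=conda run` so the pinned environment is used instead of the ambient one. (Dependency-pinning configuration sketch; not executed here.)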
@User That's interesting about your use of Cython. I used Cython in the past to bind some custom C++ code (using the Apache Arrow C++ API) to Python for fast operators.
We are now taking a different approach to accomplish the same thing: distributing a static object with Metaflow that contains our C functions, then loading it at runtime using CFFI.
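(The CFFI runtime-loading pattern can be sketched without any Metaflow-specific pieces. Here the standard C library stands in for the shipped object — the actual object path and function names are internal to that setup and not given in the thread:)

```python
from cffi import FFI

ffi = FFI()

# Declare the signature of the C function we want to call.
ffi.cdef("double cos(double x);")

# dlopen(None) loads the standard C library on POSIX systems; in the
# Metaflow case this would instead be the path to the shipped shared
# object containing the project's C functions.
lib = ffi.dlopen(None)

result = lib.cos(0.0)
print(result)  # → 1.0
```

This ABI-level `dlopen` mode needs no compiler at runtime, which is what makes shipping a prebuilt object with the package practical.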
🤔 1