# dev-metaflow
f
hey guys, looking into the way metaflow persists data via pickle. I was trying to see how to change the source code so generators can be used. Looking at this code https://github.com/Netflix/metaflow/blob/master/metaflow/datastore/task_datastore.py#L262 that seems to dump everything into a blob. That said, I think another type of serialization is going on somewhere, because at that part of the code my generators are already empty. Does anybody know the data flow before the pickle write op? Any docs are welcome 🙂
d
Hello. I can send you the “path” the data follows before that when I get to my computer. What do you mean by “generators can be used”, out of curiosity?
f
I’m looking into ways to reduce the memory footprint by using generators. In our case we load some metadata from a 3rd-party API (it involves pagination), then we launch a step with a bunch of parallel jobs to download binaries and construct TFRecords. The download part is fine, but I would like to see if I can update Metaflow so that, for things like pagination, a single-instance step can work by yielding on each request. Rough sketch of the flow below.
I hope my explanation is not too messy.
anyway, I see that you get the attributes from the flow; I’m trying to see why the generators are always empty after that step
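Roughly, the shape of our flow is like this (simplified sketch; `api_client` and `fetch_and_build_tfrecord` are hypothetical stand-ins, only the Metaflow structure is real):

```python
from metaflow import FlowSpec, step

class TFRecordFlow(FlowSpec):

    @step
    def start(self):
        # paginated metadata fetch from the 3rd-party API; today this
        # materializes every page in memory, ideally it would stay a
        # generator yielding one page per request
        self.records = []
        page = api_client.list_records(page=0)  # hypothetical client
        while page:
            self.records.extend(page.items)
            page = page.next_page()
        self.next(self.download, foreach="records")

    @step
    def download(self):
        # one parallel task per record: download the binary, build a TFRecord
        self.result = fetch_and_build_tfrecord(self.input)  # hypothetical helper
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TFRecordFlow()
```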
d
I’m still not 100% sure I get the use case, but to answer the “where is the data pickled” bit, it goes like this:
• at the end of the Metaflow task, we call `persist` here: https://github.com/Netflix/metaflow/blob/master/metaflow/task.py#L665
• this leads to here: https://github.com/Netflix/metaflow/blob/master/metaflow/datastore/task_datastore.py#L672, where we collect all the artifacts to save. Note that some things are excluded (methods, functions, certain names, etc.). In that function you will see we have a generator over the attributes of the flow to pickle. It’s a destructive iterator in the sense that it tries to be somewhat memory-conscious.
• we then reach the function you were pointing to here: https://github.com/Netflix/metaflow/blob/master/metaflow/datastore/task_datastore.py#L236, which is responsible for encoding and dumping each artifact. This is again a generator function, `pickle_iter`.
• we then end up here: https://github.com/Netflix/metaflow/blob/master/metaflow/datastore/task_datastore.py#L236, with another generator function, `packing_iter`, which is responsible for compressing the pickled blob (in the current implementation).
• you then end up in one of the backends; for S3, it would be here: https://github.com/Netflix/metaflow/blob/master/metaflow/datastore/s3_storage.py#L77. In this implementation, you can see that it’s at this point that all the previous generators basically get “resolved” and things are pushed to S3. For a few artifacts, it’s one at a time; for many artifacts, it’ll happen in parallel.
I hope this helps; there’s a rough sketch of the generator chain below. Let me know if you have more questions.
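To make that chain concrete, here is an illustrative sketch of its shape (not the actual Metaflow code; the real functions live in `task_datastore.py`):

```python
import gzip
import pickle

def artifact_iter(flow, names):
    # walk the flow's attributes one at a time (memory-conscious iterator)
    for name in names:
        yield name, getattr(flow, name)

def pickle_iter(artifacts):
    # encode each artifact into a pickled blob
    for name, obj in artifacts:
        yield name, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

def packing_iter(blobs):
    # compress each pickled blob
    for name, blob in blobs:
        yield name, gzip.compress(blob)

def save_all(storage, flow, names):
    # nothing above executes until the backend consumes the chain here;
    # this is where the generators get "resolved" and pushed to storage
    for name, packed in packing_iter(pickle_iter(artifact_iter(flow, names))):
        storage[name] = packed
```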
a
@faint-hair-28386 I do not think it’s possible to pickle a generator unless you use PyPy.
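Quick demo in CPython:

```python
import pickle

gen = (x * x for x in range(3))
pickle.dumps(gen)  # raises: TypeError: cannot pickle 'generator' object
```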
f
I don’t want to pickle it; I want better logic, so that when there is a generator it is unrolled and each batch is pickled. So for the rest of the system it will look like just another array.
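Something like this (just a sketch of the idea, not existing Metaflow code):

```python
import itertools
import pickle

def pickled_batches(gen, batch_size=1000):
    # unroll the generator in fixed-size batches and pickle each batch,
    # so downstream code only ever sees plain lists
    while True:
        batch = list(itertools.islice(gen, batch_size))
        if not batch:
            break
        yield pickle.dumps(batch)
```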
thank you Romain, I will look into those steps. If you want to understand the use case, I’d be happy to jump on a Zoom call, or anything like that
I think I discovered this flow yesterday. The issue is that `getattr(flow, attr)` returns the generator object itself, and generators are the tricky ones: at that call it knows that it is a generator, but it cannot restore its state, and that’s why it’s empty
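Tiny illustration of what I mean (plain Python, no Metaflow needed):

```python
class Flow:
    pass

flow = Flow()
flow.items = (x for x in range(3))

list(flow.items)              # the step consumes it: [0, 1, 2]
list(getattr(flow, "items"))  # a later read gets the same exhausted object: []
```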
d
are you looking for artifacts to be persisted partway through a step?
f
not really. I’m looking for a way to avoid overloading memory with data that will be sunk into the persistence layer anyway. I want some sort of a sink in the `next` function that can work through generators
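Purely hypothetical sketch of the kind of sink I mean; nothing like this exists in Metaflow today:

```python
import itertools
import pickle

def sink(storage, name, gen, batch_size=1000):
    # hypothetical: drain the generator into the store part by part,
    # so the step never holds the whole thing in memory
    batches = iter(lambda: list(itertools.islice(gen, batch_size)), [])
    for i, batch in enumerate(batches):
        storage[f"{name}/part-{i}"] = pickle.dumps(batch)

def load(storage, name):
    # re-yield the parts so consumers see one logical stream
    i = 0
    while f"{name}/part-{i}" in storage:
        yield from pickle.loads(storage[f"{name}/part-{i}"])
        i += 1

store = {}
sink(store, "records", (x for x in range(2500)))
assert list(load(store, "records")) == list(range(2500))
```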
d
Ah. Got it finally, I think. So we do use generators in the process I described (although we materialize earlier, I think), but you want to be able to pass a generator into that. Let me think about it a bit, and also about the early materialization.