Some questions about moving to pickling artifacts ...
# dev-metaflow
q
Some questions about moving to pickling artifacts with protocol=4:
1. What level of backwards compatibility is needed? Just loading existing artifacts that were saved with protocol=2? Or do we need to be able to continue to write with protocol=2 to support organizations that might want to load new artifacts with old code? (cc @dry-beach-38304)
2. Can we expose a config option to use protocol=5? Or do we just want to focus on supporting 4 now?
d
Hey, @ancient-application-36103 may have other opinions too, but for me:
• For 1, just loading protocol 2 artifacts seems enough. I don’t think being able to load new artifacts in an older Metaflow is all that useful.
• For 2, I don’t mind exposing a maximum protocol level. I think we can also probably make 5 the default if Python >= 3.8. In other words, pickle with protocol 5 if Python 3.8 or later is used. The downside is that you then need Python 3.8 to access the artifact, but I don’t feel that is a huge issue given that 3.8 is not that recent. There are fairly significant improvements that may be worthwhile: https://github.com/numpy/numpy/pull/12011
The only thing with protocol 5, though, is out-of-band data: we should actually support that way of calling pickle to get the most benefit. That requires a bit more change to the datastore, but we are making some changes already (https://github.com/Netflix/metaflow/pull/1996, which introduces some support for multiple buffers in a different context). In short, I think it would make sense to support pickle protocol 5, but I would rather fully support it if we do it.
maybe a slight caveat to what I said: the current default version is 4. We could stick to that but still support 5 as an option. I think that for some large tables/ML data, it may actually be useful.
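For reference, a minimal sketch of what "out of band" means with protocol 5, using only the standard library (Python 3.8+). The helper names `dump_out_of_band` and `load_out_of_band` are made up for illustration and are not Metaflow APIs; how a datastore would actually store the separate buffers is not covered here.

```python
import pickle


def dump_out_of_band(obj):
    """Pickle obj with protocol 5, collecting large buffers separately."""
    buffers = []
    # buffer_callback is handed each PickleBuffer; because list.append
    # returns None, the buffer is kept out of the pickle stream.
    payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


def load_out_of_band(payload, buffers):
    """Rebuild the object from the stream plus the out-of-band buffers."""
    return pickle.loads(payload, buffers=buffers)


# Illustration: a 1 MB buffer wrapped in PickleBuffer so it is eligible
# for out-of-band transfer.
raw = bytearray(b"x" * 1_000_000)
payload, buffers = dump_out_of_band({"table": pickle.PickleBuffer(raw)})
restored = load_out_of_band(payload, buffers)
```

The pickle stream itself stays tiny because the large buffer travels next to it rather than inside it, which is why the datastore would need to handle several buffers per artifact.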
q
maybe a slight caveat to what I said: the current default version is 4.
Based on the code, as well as an inspection of some of our artifacts, my understanding is that the current default is 2, and that we only save to 4 if that fails (typically due to the 2GB limit). The main impact for us is that every time we have a big artifact we spend a good chunk of time on the doomed protocol 2 attempt, before the successful protocol 4 attempt. If that's indeed how it works, and it's not the intended behavior, perhaps it makes sense to do this in two stages:
1. move to 4 by default ASAP
2. support for higher protocols, including out-of-band data, as the datastore improvements are ready
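One way to check which protocol an existing pickle was written with is to read its first opcode. This is a rough sketch that assumes you already have the raw, uncompressed pickle bytes; `pickle_protocol` is a hypothetical helper, not something Metaflow provides.

```python
import pickle
import pickletools


def pickle_protocol(blob):
    """Return the protocol declared in a pickle's header, or None.

    Protocols 2 and later start with a PROTO opcode whose argument is
    the protocol number; protocols 0 and 1 have no such header.
    """
    opcode, arg, _pos = next(pickletools.genops(blob))
    return arg if opcode.name == "PROTO" else None


old_style = pickle.dumps({"rows": [1, 2, 3]}, protocol=2)
new_style = pickle.dumps({"rows": [1, 2, 3]}, protocol=4)
```

Both blobs load with a plain `pickle.loads`, since the reader detects the protocol from the stream itself.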
d
Sorry, I meant that Python's default pickle protocol is 4 (i.e., if you use pickle out of the box, it doesn’t use protocol 5). The Metaflow behavior you describe is correct (2 first and then 4). I do think we can move to:
• 4 by default if supported (i.e., almost all cases)
• 2 if 4 is not supported, erroring out for large artifacts
• And yes, in a second phase, we can look at supporting 5 with out-of-band buffers.
👍 1
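The first phase agreed on above could look roughly like the sketch below. It is an illustration of the selection rule only, not the code in Metaflow or in any PR; `serialize_artifact` and its `highest` parameter are hypothetical, and the exact exceptions raised for oversized objects under protocol 2 are an assumption.

```python
import pickle


def serialize_artifact(obj, highest=pickle.HIGHEST_PROTOCOL):
    """Pickle obj with protocol 4 when available, else protocol 2.

    Returns (blob, protocol_used). `highest` is a parameter only so the
    fallback branch can be exercised on a modern interpreter.
    """
    if highest >= 4:
        # Common case: no doomed protocol 2 attempt first.
        return pickle.dumps(obj, protocol=4), 4
    try:
        return pickle.dumps(obj, protocol=2), 2
    except (SystemError, OverflowError) as exc:
        # Assumed failure mode for objects too large for protocol 2.
        raise RuntimeError(
            "Artifact is too large for pickle protocol 2; "
            "protocol 4 is required"
        ) from exc


blob, used = serialize_artifact({"weights": list(range(10))})
legacy_blob, legacy_used = serialize_artifact({"weights": [1]}, highest=2)
```

Existing protocol 2 artifacts remain readable either way, because `pickle.loads` handles any protocol up to the interpreter's highest supported one.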
q
ah, that makes sense. I'll plan on implementing the first phase soon.
Okay, here's a first shot at moving to protocol 4 by default, though I haven't been able to test backwards compatibility with Python <= 3.3 because of a dependency issue (no suitable boto3 version is available): https://github.com/Netflix/metaflow/pull/2243