# ask-metaflow
p
Hey Metaflow folks, I am wondering if there are any thoughts about data/type validation for Metaflow's data serialization. Currently, artifacts are pickled and unpickled to/from remote storage like S3, and I noticed that there is no explicit data type or schema validation performed before pickling data or after unpickling it from storage. Because of that, there could be potential issues like silent data corruption, security risks (unpickling arbitrary data can execute malicious code if the stored data is compromised), and a lack of guarantees for downstream steps. I am thinking we could implement a StepDecorator that dynamically checks the types of variables against an expected schema, e.g.,
```python
@validate(input_schema={var: schema}, output_schema={var: schema})
@step
def step_func(self):
    ...
```
the decorator can hook the step function and perform validation at the start and end of the step function (after unpickling input vars and before pickling output vars). I wonder if this is something interesting to Metaflow, or if there is a more native way of doing validation as soon as the serialization happens. thanks!
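to give a rough idea, I'm picturing something like the sketch below as a plain Python wrapper (the `@validate` name, the schema format, and the `_check` helper are just placeholders, not how Metaflow's StepDecorator API actually works):
```python
import functools


def validate(input_schema=None, output_schema=None):
    """Sketch of the proposed decorator: check artifact types on `self`
    right after the step starts and right before it returns."""
    input_schema = input_schema or {}
    output_schema = output_schema or {}

    def _check(flow, schema, phase):
        for name, expected in schema.items():
            value = getattr(flow, name, None)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{phase}: artifact {name!r} is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )

    def decorator(step_func):
        @functools.wraps(step_func)
        def wrapper(self, *args, **kwargs):
            _check(self, input_schema, "before step")   # inputs are already unpickled here
            result = step_func(self, *args, **kwargs)
            _check(self, output_schema, "after step")   # outputs are not yet pickled here
            return result
        return wrapper
    return decorator
```
whether a plain wrapper like this survives Metaflow's own @step handling untouched is exactly the part I'd want guidance on.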
v
yeah, sounds like a useful idea if you want to validate "the data contract" at the step level! for inspiration, you can take a look at this example that validates a parameter using Pydantic
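roughly along these lines (not the linked example verbatim, just the shape of it, with a made-up TrainConfig model): a JSON parameter gets parsed and validated by Pydantic in the start step, so the run fails fast if it doesn't match the schema
```python
from metaflow import FlowSpec, Parameter, step
from pydantic import BaseModel, PositiveInt


class TrainConfig(BaseModel):  # made-up schema for illustration
    learning_rate: float
    epochs: PositiveInt


class ValidatedFlow(FlowSpec):
    config = Parameter("config", default='{"learning_rate": 0.01, "epochs": 5}')

    @step
    def start(self):
        # Pydantic (v2 API) raises a ValidationError if the parameter
        # doesn't match the schema
        self.cfg = TrainConfig.model_validate_json(self.config)
        self.next(self.end)

    @step
    def end(self):
        print(self.cfg)


if __name__ == "__main__":
    ValidatedFlow()
```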
in a few weeks we'll release official support for custom decorators, so it'll become even easier to create and distribute decorators like
@validate
that said, nothing stops you from implementing it today, like in the above example
here's how you can detect input and output artifacts (where outputs include all artifacts that are read (and possibly modified))
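and if you just want to see what a task ended up reading/writing after the fact, the Client API lets you walk its artifacts (the flow and step names below are placeholders):
```python
from metaflow import Flow

# latest run of some flow, artifacts of its "start" task
task = Flow("ValidatedFlow").latest_run["start"].task

# a Task is a container of DataArtifact objects, each with an id and the
# already-unpickled value, so you can sanity-check types post hoc
for artifact in task:
    print(artifact.id, type(artifact.data).__name__)
```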
p
thanks @victorious-lawyer-58417 for the example! I implemented a StepDecorator that automatically validates variables against their schemas in type annotations, at the start and end of each step. it's nothing complex - a wrapper around StepDecorator that uses the artifact and client APIs. So I wonder if this could be a PR that upstream Metaflow wants, or is there a design reason that data validation on pickling does not exist in Metaflow (e.g., performance, compatibility)?
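to give a rough idea, the check boils down to something like this (a simplified sketch, not the exact code; the names are made up and isinstance only covers plain classes, not parameterized generics):
```python
import typing


def check_flow_annotations(flow, names):
    """Simplified: compare artifacts on the flow against class-level type
    annotations (e.g. `df: pd.DataFrame` declared on the FlowSpec)."""
    hints = typing.get_type_hints(type(flow))
    for name in names:
        expected = hints.get(name)
        if expected is None or not hasattr(flow, name):
            continue  # skip unannotated or not-yet-set artifacts
        value = getattr(flow, name)
        # isinstance() works for plain classes only, not things like list[int]
        if not isinstance(value, expected):
            raise TypeError(
                f"artifact {name!r} is {type(value).__name__}, expected {expected}"
            )
```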