# ask-metaflow
I'm having trouble understanding the general workflow when using Metaflow. Say I need to clean and extract features from 3 different image datasets, each with its own peculiarities. The ML team expects outputs in a standardized format, regardless of the source dataset. I'd go ahead and write my pipelines using FlowSpecs, then get that running on our compute. How then:

- Are ML researchers supposed to access the pipeline results? Do they use the Metaflow Client API to collect runs of interest, and then obtain a local copy of the outputs stored on S3?
- Do we trace data lineage? E.g. if images fail to be parsed because of a header issue, how can I get a sense of how much data was dropped from my dataset due to that problem? And how does an ML researcher who gets a vector of NaNs as features trace it back to the step that caused the issue, when the whole pipeline may involve 20 different steps, some batched, some at a sample level? The UI seems limited to showing the details of compute for a run.
- Do we do any form of data versioning? If I figure out a way to fix this header issue, which affects 10% of our dataset, I'd like to be able to process these images again while still keeping the results from the previous 90% to avoid wasting compute. Our ML researchers should then be able to verify that the version number for these bad samples has increased.
good questions!
**How to access results** Use the Client API. Results can be stored as artifacts, so researchers can access them directly through the client - no need to touch S3 manually.
**How to trace data lineage** One approach is to store metadata about each vector in an artifact. Say you have an NxM matrix of feature vectors (N images); alongside it you could store a list/dataframe of N pointers (URLs) to the original images together with the result of decoding each one. Now if vector `k` is all NaNs, I'd look at item `k` in the metadata artifact to see what went wrong with decoding. You can access this information through the Client API again, and you could build a small wrapper library around it to make common operations straightforward. You could also summarize the information in the UI through a Metaflow card.
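To make the pattern concrete, here's a minimal sketch in plain Python. The names (`build_artifacts`, `decode_log`, the `header_error` status, the flow name) are all made up for illustration; in a real flow you'd assign both objects as artifacts (`self.features = ...`, `self.decode_log = ...`) in a step and read them back through the Client API, e.g. `Flow("ImageFeaturesFlow").latest_successful_run.data.decode_log`:

```python
import math

def build_artifacts(images):
    """Hypothetical feature-extraction step: for each (url, decodes_ok)
    pair, emit a feature row plus a parallel metadata record, so row k
    of `features` lines up with entry k of `decode_log`."""
    features, decode_log = [], []
    for url, decodes_ok in images:
        if decodes_ok:
            features.append([1.0, 2.0])  # stand-in feature vector
            decode_log.append({"url": url, "status": "ok"})
        else:
            features.append([math.nan, math.nan])
            decode_log.append({"url": url, "status": "header_error"})
    return features, decode_log

def explain(k, features, decode_log):
    """If row k is all NaNs, return the metadata record that says why;
    otherwise return None."""
    if all(math.isnan(x) for x in features[k]):
        return decode_log[k]
    return None
```

A researcher's wrapper library could expose `explain` so that a suspicious row maps back to its source image and failure reason in one call, instead of spelunking through 20 steps.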
**How to do data versioning / incremental data processing** If you have decoding metadata stored as an artifact as suggested above, you can use the Client API inside the flow to access past metadata and see which images have failed to process before. Then you can try processing them again, together with any new images, and store the updated metadata as an artifact once more. This way you get both data versioning - each run produces a snapshot of metadata showing how processing has progressed across runs - and incremental data processing. Take a look at this example `@resumable_processing` decorator that demonstrates the idea.
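The retry-and-version bookkeeping could look something like this sketch (the record shape and the `version` field are assumptions for illustration, not a Metaflow API; `previous` would come from the prior run's metadata artifact via the Client API):

```python
def failed_urls(metadata):
    """Which samples should the next run retry?"""
    return [url for url, rec in metadata.items() if rec["status"] != "ok"]

def merge_metadata(previous, reprocessed):
    """Merge per-image metadata from the previous run with freshly
    reprocessed samples. Untouched samples keep their old record (no
    wasted compute); reprocessed ones get a bumped version number, so
    researchers can verify that previously bad samples were redone."""
    merged = {url: dict(rec) for url, rec in previous.items()}
    for url, rec in reprocessed.items():
        old_version = merged.get(url, {}).get("version", 0)
        merged[url] = {**rec, "version": old_version + 1}
    return merged
```

Inside the flow you'd call `failed_urls` on the past metadata to decide what to reprocess, then store `merge_metadata(...)`'s result as the new metadata artifact.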
hope this helps!