# ask-metaflow
Hi, what's the best way to do dataset versioning inside of Metaflow? We currently have our datasets as individual files in S3.
```
dataset_A/
    file-1
    file-2
dataset_B/
    file-1
    file-2
```
The particular bucket is version-enabled. There are scenarios where only some files in a dataset folder get updated while others remain unchanged. During training, we read each file using Metaflow's S3 client.

I can't use tools like DVC, as we want the files to stay human-readable as normal S3 objects, and the datasets don't live in any individual repo.

I'm currently looking for a way to reproduce training runs, including the exact version of each file used, ideally reusing S3's versioning, but other options are fine too. I could think of a few ways to do this inside Metaflow runs (see the sketch below); just checking if there are any recommended best practices for it.
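For context, here's a minimal sketch of one approach I was considering. It uses boto3 directly (I'm not sure Metaflow's S3 client exposes version IDs), and the bucket and prefix names are placeholders: record each file's current `VersionId` at the start of the run, store the manifest as a Metaflow artifact so it's versioned with the run, then read the pinned versions during training.

```python
# Sketch only: pin S3 object versions per run for reproducibility.
# "my-data-bucket" and "dataset_A/" are hypothetical placeholders.
import boto3
from metaflow import FlowSpec, step

BUCKET = "my-data-bucket"
PREFIX = "dataset_A/"

class PinnedDatasetFlow(FlowSpec):

    @step
    def start(self):
        s3 = boto3.client("s3")
        # Record the current VersionId of every file in the dataset.
        # Stored as a Metaflow artifact, so it travels with the run.
        self.manifest = {}
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                head = s3.head_object(Bucket=BUCKET, Key=obj["Key"])
                # VersionId is present because the bucket is version-enabled.
                self.manifest[obj["Key"]] = head["VersionId"]
        self.next(self.train)

    @step
    def train(self):
        s3 = boto3.client("s3")
        for key, version_id in self.manifest.items():
            # Fetch the exact pinned version, even if the object has
            # since been overwritten in the versioned bucket.
            body = s3.get_object(
                Bucket=BUCKET, Key=key, VersionId=version_id
            )["Body"].read()
            # ... feed `body` into training ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PinnedDatasetFlow()
```

Re-running training against an old run would then just mean loading that run's `manifest` artifact and fetching the same version IDs. Is something like this reasonable, or is there a more idiomatic Metaflow pattern?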