Questions regarding the working of dataflow in Metaflow
# dev-metaflow
f
Questions regarding the working of dataflow in Metaflow: 1. In S3, the file 0.data.json is present, corresponding to the task ID of the step. That file contains a key called objects; those objects are the following:
"name": "941c65aaf9cb5bfc65491038c316b2cd35cc1313", 
"_transition": "c1b945a8d07ea84f63e82ce28d16b2b2e7b90498", 
"_task_ok": "69e77141c3eb7a8c9ce864251d70c02723f29332", 
"_success": "69e77141c3eb7a8c9ce864251d70c02723f29332", 
"_graph_info": "80e79fc09f06dd1b03a1f180d49e82297fcb8c7f", 
"_foreach_stack": "4ca058df2ea422cca260c585409d6ac9face7ebe", 
"status_dict": "89f8475b8276fe53f69afbb15a963015d9cc0016",
"base_path": "e9d9627102ec5832fc63c3d4e3d5853e1bb6fb9b", 
"_foreach_var": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_foreach_num_splits": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_exception": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_current_step": "01b168d0b2c70f906b295772af9efb98b0797cae"
What are the values associated with these keys? Under the info key of the same file, we have something like:
"status_dict": {
   "size": 80, 
   "type": "<class 'dict'>", 
   "encoding": "gzip+pickle-v2"
}
How is the value of size calculated? 2. Are the real values associated with the variables used in a step stored in RDS?
v
1. The values are content hashes of artifacts stored in your datastore (e.g. S3). That's how Metaflow keeps track of artifacts (anything in self.), i.e. the inputs and the outputs of each task. Using the hash, Metaflow is able to load the right artifact from the datastore.
2. The values are stored in the datastore (e.g. S3), not in RDS. In fact, the basic execution of Metaflow flows works without a central metadata service and RDS, as all the information is available in the datastore. This is deliberate, to avoid making RDS and the metadata service a bottleneck.
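To make that concrete, here is a minimal sketch of how anything assigned to self. in a step becomes an artifact that can later be read back; the flow and artifact names are made up for illustration:

from metaflow import FlowSpec, step

class ArtifactDemoFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self. is serialized (gzip+pickle by default),
        # content-hashed, and written to the datastore (e.g. S3).
        self.dataset = {"rows": 80, "source": "example"}
        self.next(self.end)

    @step
    def end(self):
        # Artifacts assigned in earlier steps are loaded back from the
        # datastore by looking up their content hashes.
        print(self.dataset)

if __name__ == "__main__":
    ArtifactDemoFlow()

After a run, the same artifact can be read back with the Client API, e.g. Flow("ArtifactDemoFlow").latest_run["start"].task.data.dataset.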
c
How do I recover the S3 path from the hash? I have to do this instead of using the Metaflow Client since I want to read artifacts across different environments.
This is fragile, but in case it helps others, I'm using something like this:
import gzip
import json
import pickle
import subprocess
from typing import Any

from s3path import S3Path  # assuming the s3path package

METAFLOW_BUCKET = "my-bucket"


def download_s3_artifact(s3_path: S3Path) -> Any:
    assert s3_path.exists(), f"{s3_path} does not exist"
    # Copy the gzip+pickle blob locally, then decompress and unpickle it.
    subprocess.run(['aws', 's3', 'cp', str(s3_path), f'/tmp/{s3_path.stem}.pkl'], check=True)
    with gzip.open(f'/tmp/{s3_path.stem}.pkl', 'rb') as f:
        data = pickle.load(f)
    return data


def download_run_artifact(step_name: str, artifact_name: str) -> Any:
    step_dir = S3Path(f"s3://{METAFLOW_BUCKET}/{step_name}")
    assert step_dir.exists(), f"step {step_name} does not exist"
    task_dir = list(step_dir.iterdir())[0]
    data_path = task_dir / "0.data.json"
    with data_path.open("r") as f:
        metadata = json.load(f)
    # 'objects' maps artifact names to the content hashes of their blobs.
    key = metadata['objects'][artifact_name]

    # Blobs live under <flow name>/data/<first two hash chars>/<hash>.
    artifact_path = S3Path(f"s3://{METAFLOW_BUCKET}/{step_name.split('/')[0]}/data/{key[:2]}/{key}")
    data = download_s3_artifact(artifact_path)
    return data
Example usage:
step_name = 'FlowName/RunName/StepName'
data = download_run_artifact(step_name=step_name, artifact_name='dataset')
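If shelling out to the aws CLI is undesirable, the same blob can also be fetched in memory with boto3. This is just an alternative sketch under the same assumptions as above (the bucket name, the <flow>/data/<hash prefix>/<hash> layout, and the default gzip+pickle encoding), not an official Metaflow API:

import gzip
import pickle
from typing import Any

import boto3


def load_artifact_blob(bucket: str, flow_name: str, content_hash: str) -> Any:
    # Fetch one artifact blob directly from S3 and decode it in memory.
    s3 = boto3.client("s3")
    key = f"{flow_name}/data/{content_hash[:2]}/{content_hash}"  # assumed datastore layout
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pickle.loads(gzip.decompress(body))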
f
Does .metaflow/flowname/data/ contain all the data artifacts of the flow?
If everything is stored in S3, why is it not recommended to store dataframes > 100 MB as data artifacts?
v
There's nothing inherently wrong with storing large dataframes as artifacts. The main issues are size and speed. If you store large amounts of data as artifacts, the space taken by artifacts can grow quickly; other objects are orders of magnitude smaller, so they rarely become an issue. The other issue is speed: it is often faster to serialize dataframes as Parquet files than as compressed pickle files, which is how artifacts are handled by default.
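As an illustration of that pattern, here is a hedged sketch of writing a large dataframe as Parquet and keeping only its location as an artifact; the bucket and path are placeholders, and writing straight to an s3:// URL with pandas assumes s3fs (or an equivalent filesystem backend) is installed:

import pandas as pd
from metaflow import FlowSpec, step

class ParquetDemoFlow(FlowSpec):

    @step
    def start(self):
        df = pd.DataFrame({"value": range(1_000_000)})
        # Write the large dataframe as Parquet instead of letting it be
        # gzip+pickled as an artifact; store only its path in self.
        self.dataset_path = "s3://my-bucket/datasets/demo.parquet"  # placeholder location
        df.to_parquet(self.dataset_path)
        self.next(self.end)

    @step
    def end(self):
        # Downstream steps reload the dataframe from Parquet on demand.
        df = pd.read_parquet(self.dataset_path)
        print(len(df))

if __name__ == "__main__":
    ParquetDemoFlow()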
f
Ok, thank you so much. I had this confusion for such a long time.
👍 1