Questions regarding the working of dataflow in Metaflow
# dev-metaflow
f
Questions regarding the working of dataflow in Metaflow: 1. In S3, the file 0.data.json is present, corresponding to the task ID of the step. That file contains a key called objects; those objects are the following:
"name": "941c65aaf9cb5bfc65491038c316b2cd35cc1313", 
"_transition": "c1b945a8d07ea84f63e82ce28d16b2b2e7b90498", 
"_task_ok": "69e77141c3eb7a8c9ce864251d70c02723f29332", 
"_success": "69e77141c3eb7a8c9ce864251d70c02723f29332", 
"_graph_info": "80e79fc09f06dd1b03a1f180d49e82297fcb8c7f", 
"_foreach_stack": "4ca058df2ea422cca260c585409d6ac9face7ebe", 
"status_dict": "89f8475b8276fe53f69afbb15a963015d9cc0016",
"base_path": "e9d9627102ec5832fc63c3d4e3d5853e1bb6fb9b", 
"_foreach_var": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_foreach_num_splits": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_exception": "f3627f46179fdd95bf0e83101840fd1d71b60e40", 
"_current_step": "01b168d0b2c70f906b295772af9efb98b0797cae"
What are the values associated with these keys? Under the info key of the same file, we have something like:
"status_dict": {
   "size": 80, 
   "type": "<class 'dict'>", 
   "encoding": "gzip+pickle-v2"
}
How is the value of size calculated? 2. Are the real values associated with the variables used in a step stored in RDS?
v
1. The values are content hashes of artifacts stored in your datastore (e.g. S3). That's how Metaflow keeps track of artifacts (anything in self.), i.e. the inputs and the outputs of each task. Using the hash, Metaflow is able to load the right artifact from the datastore.
2. The values are stored in the datastore (e.g. S3), not in RDS. In fact, the basic execution of Metaflow flows works without a central metadata service and RDS, as all the information is available in the datastore. This is deliberate, to avoid making RDS and the metadata service a bottleneck.
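To make that concrete, here is a minimal sketch of how anything assigned to self. in a step becomes an artifact that can later be read back; the flow and artifact names are made up for illustration:

from metaflow import FlowSpec, step

class ArtifactDemoFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self. is serialized (gzip+pickle by default),
        # content-hashed, and written to the datastore (e.g. S3).
        self.dataset = {"rows": 80, "source": "example"}
        self.next(self.end)

    @step
    def end(self):
        # Artifacts assigned in earlier steps are loaded back from the
        # datastore by looking up their content hashes.
        print(self.dataset)

if __name__ == "__main__":
    ArtifactDemoFlow()

After a run, the same artifact can be read back with the Client API, e.g. Flow("ArtifactDemoFlow").latest_run["start"].task.data.dataset.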
c
How do I recover the S3 path from the hash? I have to do this instead of using the Metaflow Client since I want to read artifacts across different environments.
This is fragile, but in case it helps others, I'm using something like this:
import gzip
import json
import pickle
import subprocess
from typing import Any

from s3path import S3Path  # assuming the s3path package

METAFLOW_BUCKET = "my-bucket"


def download_s3_artifact(s3_path: S3Path) -> Any:
    assert s3_path.exists(), f"{s3_path} does not exist"
    # Copy the gzip+pickle blob locally, then decompress and unpickle it.
    subprocess.run(['aws', 's3', 'cp', str(s3_path), f'/tmp/{s3_path.stem}.pkl'], check=True)
    with gzip.open(f'/tmp/{s3_path.stem}.pkl', 'rb') as f:
        data = pickle.load(f)
    return data


def download_run_artifact(step_name: str, artifact_name: str) -> Any:
    step_dir = S3Path(f"s3://{METAFLOW_BUCKET}/{step_name}")
    assert step_dir.exists(), f"step {step_name} does not exist"
    task_dir = list(step_dir.iterdir())[0]
    data_path = task_dir / "0.data.json"
    with data_path.open("r") as f:
        metadata = json.load(f)
    # 'objects' maps artifact names to the content hashes of their blobs.
    key = metadata['objects'][artifact_name]

    # Blobs live under <flow name>/data/<first two hash chars>/<hash>.
    artifact_path = S3Path(f"s3://{METAFLOW_BUCKET}/{step_name.split('/')[0]}/data/{key[:2]}/{key}")
    data = download_s3_artifact(artifact_path)
    return data
Example usage:
step_name = 'FlowName/RunName/StepName'
data = download_run_artifact(step_name=step_name, artifact_name='dataset')
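If shelling out to the aws CLI is undesirable, the same blob can also be fetched in memory with boto3. This is just an alternative sketch under the same assumptions as above (the bucket name, the <flow>/data/<hash prefix>/<hash> layout, and the default gzip+pickle encoding), not an official Metaflow API:

import gzip
import pickle
from typing import Any

import boto3


def load_artifact_blob(bucket: str, flow_name: str, content_hash: str) -> Any:
    # Fetch one artifact blob directly from S3 and decode it in memory.
    s3 = boto3.client("s3")
    key = f"{flow_name}/data/{content_hash[:2]}/{content_hash}"  # assumed datastore layout
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pickle.loads(gzip.decompress(body))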
f
Does .metaflow/flowname/data/ contain all the data artifacts of the flow?
If everything is stored in S3, why is it not recommended to store dataframes > 100 MB as data artifacts?
v
There's nothing inherently wrong with storing large dataframes as artifacts. The main issues are size and speed. If you store large amounts of data as artifacts, the space taken by artifacts can grow quickly; other objects are orders of magnitude smaller, so they rarely become an issue. The other issue is speed: it is often faster to serialize dataframes as Parquet files than as compressed pickle files, which is how artifacts are handled by default.
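As an illustration of that pattern, here is a hedged sketch of writing a large dataframe as Parquet and keeping only its location as an artifact; the bucket and path are placeholders, and writing straight to an s3:// URL with pandas assumes s3fs (or an equivalent filesystem backend) is installed:

import pandas as pd
from metaflow import FlowSpec, step

class ParquetDemoFlow(FlowSpec):

    @step
    def start(self):
        df = pd.DataFrame({"value": range(1_000_000)})
        # Write the large dataframe as Parquet instead of letting it be
        # gzip+pickled as an artifact; store only its path in self.
        self.dataset_path = "s3://my-bucket/datasets/demo.parquet"  # placeholder location
        df.to_parquet(self.dataset_path)
        self.next(self.end)

    @step
    def end(self):
        # Downstream steps reload the dataframe from Parquet on demand.
        df = pd.read_parquet(self.dataset_path)
        print(len(df))

if __name__ == "__main__":
    ParquetDemoFlow()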
f
Ok, thank you so much. I had this confusion for such a long time.
👍 1