# ask-metaflow
t
Hi, I'm trying to set up a training pipeline that saves four models; later the predictions of these models will be blended for final inference (I'm planning to write a separate InferenceFlow class). Please see the flow I'm trying to code. I wrote one with some dummy steps and redundant code, but it seems inefficient. I could combine them all in one step, but that won't isolate errors or benefit from parallelism. I'm a beginner in Metaflow. Please guide me on the best approach here.
s
Hi! Are the two merge and featureengineer steps executing the same code?
Also, are all the model training steps executing the same code too?
t
merge and featureengineer execute the same code. The model definition is different, but the training code is the same.
h
I may be wrong, but could you not simply branch as described here: https://docs.metaflow.org/metaflow/basics and use AWS Batch or similar for compute? If the number of models is larger/more dynamic, you could use a foreach?
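For reference, a static branch per those docs could look roughly like this (an untested sketch; the flow, step, and model names are placeholders, not from this thread):
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # fan out to the four model-training branches
        self.next(self.train_a, self.train_b, self.train_c, self.train_d)

    @step
    def train_a(self):
        self.model = "model_a"  # placeholder: train model A here
        self.next(self.join)

    @step
    def train_b(self):
        self.model = "model_b"  # placeholder: train model B here
        self.next(self.join)

    @step
    def train_c(self):
        self.model = "model_c"  # placeholder: train model C here
        self.next(self.join)

    @step
    def train_d(self):
        self.model = "model_d"  # placeholder: train model D here
        self.next(self.join)

    @step
    def join(self, inputs):
        # collect the four trained models for later blending
        self.models = [inp.model for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()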
t
I used branches, but it is getting complicated because there is this autoencode block. So the code currently looks something like this:
@step
def parquet_preprocess(self):
    self.next(self.forward_pq, self.autoencode)

@step
def forward_pq(self):
    # empty pass-through step, only there to route this branch into join1
    self.next(self.join1)

@step
def autoencode(self):
    ...
    self.next(self.join2)

@step
def csv_preprocess(self):
    self.next(self.forward_csv1, self.forward_csv2)

@step
def forward_csv1(self):
    # empty pass-through step, routes into join1
    self.next(self.join1)

@step
def forward_csv2(self):
    # empty pass-through step, routes into join2
    self.next(self.join2)
See, I'm creating a lot of empty steps just to parallelize it.
And I'm writing two joins that contain the same code.
h
Use a foreach and some dictionary where the key determines what code to execute (parameters could be put as a JSON blob into the values)?
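Something like this, roughly (a sketch only; MODEL_CONFIGS, the flow name, and the parameter values are made up for illustration):
from metaflow import FlowSpec, step

# hypothetical config: the key picks the model, the value carries its parameters
MODEL_CONFIGS = {
    "model_a": {"lr": 0.01},
    "model_b": {"lr": 0.001},
    "model_c": {"lr": 0.05},
    "model_d": {"lr": 0.1},
}

class TrainForeachFlow(FlowSpec):

    @step
    def start(self):
        self.model_names = list(MODEL_CONFIGS)
        # one task per model, all running the shared training code
        self.next(self.train, foreach="model_names")

    @step
    def train(self):
        self.model_name = self.input
        self.config = MODEL_CONFIGS[self.model_name]
        # shared training code goes here; only the model definition differs
        self.model = f"trained-{self.model_name}"
        self.next(self.join)

    @step
    def join(self, inputs):
        # a single join, no matter how many models were trained
        self.models = {inp.model_name: inp.model for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainForeachFlow()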
t
Can we have conditional self.next statements in the step which joins the foreach output?
Thanks, I got your point. 🙂
h
Not sure if I had a point, but I was trying to help 😀 You can of course also split your graph into different flows. So the CSV processing could run on a schedule and write its data somewhere (e.g. Parquet on S3), and the other flows can pick up the latest data.
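For that kind of cross-flow handoff, the Client API can fetch artifacts from the latest successful run of another flow, roughly like this (a sketch; the flow name and the models artifact are assumptions, adjust to your own flows):
from metaflow import FlowSpec, Flow, step

class BlendInferenceFlow(FlowSpec):

    @step
    def start(self):
        # pick up artifacts from the latest successful run of the training flow
        # (assumes that flow exposes a `models` artifact at its end step)
        train_run = Flow("TrainForeachFlow").latest_successful_run
        self.models = train_run.data.models
        self.next(self.end)

    @step
    def end(self):
        print(f"loaded {len(self.models)} models for blending")

if __name__ == "__main__":
    BlendInferenceFlow()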