# ask-metaflow
t
Hi, I'm trying to set up a training pipeline that saves four models; later the predictions of these models will be blended for final inference (I'm planning to write a separate InferenceFlow class). Please see the flow I'm trying to code. I wrote one with some dummy steps and redundant code, but it seems inefficient. I could combine them all in one step, but that won't isolate errors or benefit from parallelism. I'm a beginner in Metaflow. Please guide me on the best approach here.
s
Hi! Are the two merge and featureengineer steps executing the same code?
Also, are all the model training steps executing the same code too?
t
merge and featureengineer execute the same code. The model definition is different, but the training code is the same.
h
I may be wrong, but could you not simply branch as described here: https://docs.metaflow.org/metaflow/basics and use AWS Batch or similar for compute? If the number of models is larger/more dynamic, you could use a foreach?
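For reference, a static branch per those docs could look roughly like this (an untested sketch; the flow, step, and model names are placeholders, not from this thread):
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # fan out to the four model-training branches
        self.next(self.train_a, self.train_b, self.train_c, self.train_d)

    @step
    def train_a(self):
        self.model = "model_a"  # placeholder: train model A here
        self.next(self.join)

    @step
    def train_b(self):
        self.model = "model_b"  # placeholder: train model B here
        self.next(self.join)

    @step
    def train_c(self):
        self.model = "model_c"  # placeholder: train model C here
        self.next(self.join)

    @step
    def train_d(self):
        self.model = "model_d"  # placeholder: train model D here
        self.next(self.join)

    @step
    def join(self, inputs):
        # collect the four trained models for later blending
        self.models = [inp.model for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()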
t
I used branches, but it is getting complicated because there is this autoencode block. So the code currently looks something like this:
@step
def parquet_preprocess(self):
    self.next(self.forward_pq, self.autoencode)

@step
def forward_pq(self):
    # empty pass-through step, only there to route this branch into join1
    self.next(self.join1)

@step
def autoencode(self):
    ...
    self.next(self.join2)

@step
def csv_preprocess(self):
    self.next(self.forward_csv1, self.forward_csv2)

@step
def forward_csv1(self):
    # empty pass-through step, routes into join1
    self.next(self.join1)

@step
def forward_csv2(self):
    # empty pass-through step, routes into join2
    self.next(self.join2)
See, I'm creating a lot of empty steps just to parallelize it.
And I'm writing two joins that contain the same code.
h
Use a foreach and some dictionary where the key determines what code to execute (parameters could be put as a JSON blob into the values)?
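Something like this, roughly (a sketch only; MODEL_CONFIGS, the flow name, and the parameter values are made up for illustration):
from metaflow import FlowSpec, step

# hypothetical config: the key picks the model, the value carries its parameters
MODEL_CONFIGS = {
    "model_a": {"lr": 0.01},
    "model_b": {"lr": 0.001},
    "model_c": {"lr": 0.05},
    "model_d": {"lr": 0.1},
}

class TrainForeachFlow(FlowSpec):

    @step
    def start(self):
        self.model_names = list(MODEL_CONFIGS)
        # one task per model, all running the shared training code
        self.next(self.train, foreach="model_names")

    @step
    def train(self):
        self.model_name = self.input
        self.config = MODEL_CONFIGS[self.model_name]
        # shared training code goes here; only the model definition differs
        self.model = f"trained-{self.model_name}"
        self.next(self.join)

    @step
    def join(self, inputs):
        # a single join, no matter how many models were trained
        self.models = {inp.model_name: inp.model for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainForeachFlow()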
t
Can we have conditional self.next statements in the step which joins the foreach output?
Thanks, I got your point. 🙂
h
Not sure if I had a point, but I was trying to help 😀 You can of course also split your graph into different flows. So the CSV processing could run on a schedule and write its data somewhere (e.g. Parquet on S3), and the other flows can pick up the latest data.
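For that kind of cross-flow handoff, the Client API can fetch artifacts from the latest successful run of another flow, roughly like this (a sketch; the flow name and the models artifact are assumptions, adjust to your own flows):
from metaflow import FlowSpec, Flow, step

class BlendInferenceFlow(FlowSpec):

    @step
    def start(self):
        # pick up artifacts from the latest successful run of the training flow
        # (assumes that flow exposes a `models` artifact at its end step)
        train_run = Flow("TrainForeachFlow").latest_successful_run
        self.models = train_run.data.models
        self.next(self.end)

    @step
    def end(self):
        print(f"loaded {len(self.models)} models for blending")

if __name__ == "__main__":
    BlendInferenceFlow()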