# ask-metaflow
I’m looking for a way to implement retry functionality for model training, because we’re running 6+ hour training sessions on AWS spot instances. Suppose there are 10 epochs: I’d like to save the model after each epoch, so that if there’s a failure I can load the most recently saved model and resume training from there. I’ve attempted the following, but unfortunately I discovered that artifacts are not saved in the event of a failure:
import time

from metaflow import FlowSpec, step, retry


class ResumeTraining(FlowSpec):
    @step
    def start(self):
        self.finished_epoch = 0
        self.next(self.looping_training)

    @retry(times=1)
    @step
    def looping_training(self):
        print(f"finished_epoch: {self.finished_epoch}")
        # Resume after the last finished epoch (0 means start from epoch 1).
        start = self.finished_epoch + 1

        for epoch in range(start, 11):  # epochs 1..10
            if epoch == 5:
                # Simulated failure. This assignment is lost: artifacts are not
                # saved when the step fails, so the retry starts from 0 again.
                self.finished_epoch = epoch
                raise Exception('fake failure')

            time.sleep(1)

            self.model = f"model iteration {epoch}"
            print(self.model)
            self.finished_epoch = epoch
            print(f"in loop finished_epoch: {self.finished_epoch}")

        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ResumeTraining()
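
For reference, here is a minimal sketch of the behaviour I’m after, written as plain Python without Metaflow. It assumes progress is written to a checkpoint directory outside of the step’s artifacts, so it survives a failed attempt. The names `train`, `latest_finished_epoch`, `ckpt_dir` and `fail_at` are hypothetical and only for illustration, and the JSON file stands in for a real serialized model.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def latest_finished_epoch(ckpt_dir: Path) -> int:
    """Return the highest epoch that has a saved checkpoint, or 0 if none exist."""
    epochs = [int(p.stem.split("_")[1]) for p in ckpt_dir.glob("epoch_*.json")]
    return max(epochs, default=0)


def train(ckpt_dir: Path, total_epochs: int = 10,
          fail_at: Optional[int] = None) -> List[int]:
    """Run the remaining epochs, saving a checkpoint after each one.

    Returns the list of epochs executed in this attempt.
    """
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    start = latest_finished_epoch(ckpt_dir) + 1
    ran = []
    for epoch in range(start, total_epochs + 1):
        if epoch == fail_at:
            # Simulated interruption before this epoch completes.
            raise RuntimeError("fake failure")
        model = f"model iteration {epoch}"  # placeholder for real training
        # Write to a temporary name, then rename, so a crash mid-write
        # never leaves a partial file that looks like a valid checkpoint.
        tmp = ckpt_dir / f"epoch_{epoch}.json.tmp"
        tmp.write_text(json.dumps({"epoch": epoch, "model": model}))
        tmp.replace(ckpt_dir / f"epoch_{epoch}.json")
        ran.append(epoch)
    return ran


if __name__ == "__main__":
    ckpt_dir = Path(tempfile.mkdtemp())
    try:
        train(ckpt_dir, fail_at=5)   # first attempt fails at epoch 5
    except RuntimeError:
        pass
    print(train(ckpt_dir))           # second attempt resumes after epoch 4
```

In a retried step, the second attempt would call `train` with the same location and pick up after the last completed epoch. On spot instances a local directory would presumably disappear with the instance, so the checkpoint location would need to be durable storage (for example an S3 prefix). That part is an assumption and is not shown here.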