# ask-metaflow
I am building/maintaining a flow that is part of a core platform, in which we expect two types of transient errors: one caused by networking issues while building the conda environment, and another inside the logic of the user code. The user code is very expensive, so I don't want to retry if it fails, but if environment creation fails, it should be retried. For example:
```python
from metaflow import FlowSpec, conda, conda_base, project, step

# Placeholder values; the real dependency pins and Python version are
# defined elsewhere in our platform code.
PIP_DEPENDENCIES = {"pandas": "2.1.4"}
PYTHON_VERSION = "3.10"
MORE_PIP_DEPENDENCIES = {"scikit-learn": "1.4.0"}


def blackbox():
    # Expensive function that is not part of the core library
    import random
    import time

    time.sleep(10)

    # Occasionally fails with a transient error
    if random.random() < 0.01:
        raise ValueError("Gotcha")

    return "SUCCESS"


@project("someproject")
@conda_base(pip=PIP_DEPENDENCIES, python=PYTHON_VERSION)
class SomeFlow(FlowSpec):
    @step
    def start(self):
        self.seeds = range(100)
        self.next(self.heavy_step, foreach="seeds")

    @conda(libraries=MORE_PIP_DEPENDENCIES)
    @step
    def heavy_step(self):
        seed = self.input
        # Do some expensive blackbox work.
        # We don't want to retry if this fails due to a transient error.
        blackbox()

        self.next(self.join)

    @step
    def join(self, inputs):
        self.merge_artifacts(inputs)
        self.next(self.end)

    @step
    def end(self):
        print("ended")


if __name__ == "__main__":
    SomeFlow()
```
Is that possible with `@retry`? AFAIK it doesn't look like it. Would adding a retry option to the conda_base/conda decorators and passing it all the way down to batch_bootstrap fit with the overall design principles of Metaflow? If not, is there a better or easier way to do this?
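For context, the closest workaround I can think of is a rough sketch like the one below (the flow name and the result/error artifacts are just illustrative): keep `@retry` on the step so that attempts that die during environment bootstrap or networking are retried, and catch the user-code exception inside the step so it is recorded as data rather than failing the task, which means `@retry` never re-runs the expensive work.

```python
from metaflow import FlowSpec, retry, step


def blackbox():
    # Same expensive, occasionally failing user code as above.
    import random
    import time

    time.sleep(10)
    if random.random() < 0.01:
        raise ValueError("Gotcha")
    return "SUCCESS"


class RetryWorkaroundFlow(FlowSpec):
    # Sketch only: @retry re-runs attempts that fail before user code
    # executes (e.g. conda bootstrap / networking), while user-code
    # exceptions are caught and recorded so they never fail the task
    # and therefore never trigger a retry of the expensive work.

    @step
    def start(self):
        self.seeds = list(range(100))
        self.next(self.heavy_step, foreach="seeds")

    @retry(times=3)  # each retried attempt rebuilds the environment too
    @step
    def heavy_step(self):
        try:
            self.result = blackbox()
            self.error = None
        except Exception as exc:
            # Record the failure as data instead of raising, so the
            # task succeeds and @retry does not re-run it.
            self.result = None
            self.error = repr(exc)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.failures = [inp.error for inp in inputs if inp.error]
        self.next(self.end)

    @step
    def end(self):
        print(f"{len(self.failures)} seeds failed")


if __name__ == "__main__":
    RetryWorkaroundFlow()
```

The obvious downside is that genuine user-code failures no longer show up as failed tasks and every downstream step has to check the recorded error, which is why I'm asking whether retrying only the environment creation could be supported more directly.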