# dev-metaflow
m
Hi Outerbounds team @User, I know we brought up this PR earlier in the year, but we have finally given it the Inferentia Step Functions testing you asked for and we really need this released as a Metaflow feature now. Is there anything else we need to do to get this through? https://github.com/Netflix/metaflow/pull/997 @bulky-gpu-70936
a
@melodic-train-1526 thanks for the update. I will take a look and shepherd it to completion. Can you verify that the PR is rebased on top of the tip of master?
m
I have recently pulled the master branch into the fork, if that’s okay?
@ancient-application-36103 Having one issue actually that is somewhat related to the PR, but I don’t think it should affect its validity. Basically I am running some inference code with these basic steps:
1. Preprocess
2. Compile the model for the Inferentia instance type
3. Do our inference
4. Other steps
In step 2, we use something that was suggested by you or Ville before:
import tempfile

import torch


class InferentiaCompiledModel:
    @classmethod
    def save_model(cls, model):
        # Serialize the compiled TorchScript model to bytes via a temp file,
        # so it can be stored as a regular Metaflow artifact.
        with tempfile.NamedTemporaryFile() as f:
            torch.jit.save(model, f.name)
            return f.read()

    @classmethod
    def load_model(cls, blob):
        # Write the bytes back to a temp file and reload the TorchScript model.
        with tempfile.NamedTemporaryFile() as f:
            f.write(blob)
            f.flush()
            return torch.jit.load(f.name)
So at the end of step 2 above we can do
self.model_obj = InferentiaCompiledModel.save_model(compiled_model)
This can then be read back in step 3 (which is a parallel AWS Batch step):
model = InferentiaCompiledModel.load_model(self.model_obj)
This works very nicely when orchestrating our code locally. However, when using Step Functions, it fails on the load in step 3 with some failure-to-initialise-Neuron error coming from that loading line. The Docker images seem fine, the Inferentia-capable machine is running the step, and the image itself has Neuron enabled; as I say, everything is the same as when we orchestrate this locally. Is there anything you can think of that would be messing this up?
Note, there is a small ‘launching’ intermediate step sandwiched between the compile step (which runs on AWS Batch, but on only one machine) and the inference step (which runs on AWS Batch in parallel). This is the only difference between the two setups: with the local orchestrator, this step receives self.model_obj from AWS and passes it to the parallel step, whereas on Step Functions this is all handled by AWS. The flow graph is attached as reference if any of this is unclear. In my opinion this shouldn’t affect the PR, as the container with inf enabled fires up correctly according to the decorator.
stepfunctions_graph.png
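For reference, the shape of the flow is roughly the following. This is only a minimal sketch of the structure described above: the step names, the shard count, and the decorator arguments are placeholders rather than our real code (in the real flow the @batch steps request Inferentia-capable instances).

from metaflow import FlowSpec, step, batch

class InferenceFlow(FlowSpec):

    @step
    def start(self):
        # 1. Preprocess and decide how to split the inference work
        self.shards = list(range(4))
        self.next(self.compile_model)

    @batch  # a single Inferentia-capable instance in the real flow
    @step
    def compile_model(self):
        # 2. Compile the model and store it as an artifact, e.g.
        #    self.model_obj = InferentiaCompiledModel.save_model(compiled_model)
        self.next(self.launch)

    @step
    def launch(self):
        # The small intermediate 'launching' step that fans out the parallel work
        self.next(self.inference, foreach="shards")

    @batch  # parallel Inferentia-capable instances in the real flow
    @step
    def inference(self):
        # 3. Load the compiled model on this shard and run inference, e.g.
        #    model = InferentiaCompiledModel.load_model(self.model_obj)
        self.next(self.join)

    @step
    def join(self, inputs):
        # 4. Other steps: gather the per-shard results
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    InferenceFlow()

It is the load in the inference step above that fails under Step Functions but works with the local orchestrator.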
@dry-beach-38304 Would be good if you could re-add the okay-to-test label on this and also help to approve it.
s
@melodic-train-1526 can you rebase your PR on top of master to remove conflicts?
d
oops, sorry @melodic-train-1526 that I missed this earlier (a lot earlier). yes, please rebase and I’ll run the tests.
given that a lot of the conflicts seem to be due to formatting, you might have an easier time cherry-picking your changes back onto master and pushing that instead.
m
I think the issue is that I have an automatic black formatter, so I will need to turn that off temporarily.
d
we also use black (it’s a pre-commit hook actually)
so it should be generally similar
m
what line length?
d
default (so 88 I suppose)
iirc, we use everything default.
If you have pre-commit installed, it should run it for you.
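if it helps, here is a quick sanity check (just a sketch; it assumes you have the black package importable locally) that your local formatter and our pre-commit hook agree when everything is left at the defaults:

import black

# black.Mode() uses the defaults, including the 88-character line length,
# which is what the Metaflow pre-commit hook runs with.
src = "def f(a,b):\n    return a+b\n"
print(black.format_str(src, mode=black.Mode()))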
m
Hi both @dry-beach-38304 @ancient-application-36103 I have just remade the PR as I couldn’t get the rebase to work properly https://github.com/Netflix/metaflow/pull/1205/files
👍 1
d
launched all tests but looks simple enough 🙂
m
Great, thanks @dry-beach-38304. Let me know @ancient-application-36103 if there is anything else you need from me or if the issue above is concerning for you.
a
I should be all good to review. Give me a few days to set everything up.
m
Hey @ancient-application-36103 any update?
a
Hi Daniel, we should be able to merge the PR this week.
🙌 1
m
Hi Savin, any update?
a
I have approved the PR - it will be part of the next release train (likely today/tomorrow)
m
Perfect, thanks a lot