# dev-metaflow
m
Hi guys, we are trying a new Metaflow + AWS Batch use case where we use AWS's new Inferentia chips (https://aws.amazon.com/machine-learning/inferentia/) as the instance type on our queue. We have been working our way through some issues and realised that we needed to change the Batch job definition, since Inferentia ECS jobs need some additional linuxParameters which aren't currently available in Metaflow (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-inference.html). We think we have a working fork with all the requisite changes, where we add:
job_definition['linuxParameters']['devices'] = [
    {
        "containerPath": "/dev/neuron0",
        "hostPath": "/dev/neuron0",
        "permissions": [
            "read",
            "write"
        ]
    }
]
when we specify @batch(inferentia=True, ...). This works as far as Metaflow letting us run the flow and our preprocessing steps, but we are hitting a very uninformative error somewhere in the process of submitting the job to Batch. The error just tells us:
Usage: main.py batch step [OPTIONS] STEP_NAME CODE_PACKAGE_SHA
2022-03-15 17:01:59.282 [1776/infer_video/100870 (pid 15323)] CODE_PACKAGE_URL
2022-03-15 17:01:59.282 [1776/infer_video/100870 (pid 15323)] Try 'main.py batch step --help' for help.
2022-03-15 17:01:59.282 [1776/infer_video/100870 (pid 15323)]
2022-03-15 17:01:59.282 [1776/infer_video/100870 (pid 15323)] Error: Got unexpected extra argument (432000)
Has anyone tried to add new Batch parameters in a branch/fork of Metaflow? Has anyone seen this error, or could anyone advise? Thanks in advance.
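For reference, a minimal sketch of the device-mapping change as a helper function (the function name and the containerProperties nesting are assumptions about where batch_client.py builds the dict, not code from the fork). Per the ECS/Batch API schema, linuxParameters.devices is a list of device mappings, one entry per Neuron device:

```python
def add_neuron_devices(job_definition, num_devices=1):
    """Attach /dev/neuronN device mappings to a Batch job definition dict.

    Hypothetical helper: linuxParameters.devices must be a LIST of
    device mappings in the ECS schema, one per Neuron device.
    """
    container = job_definition.setdefault("containerProperties", {})
    linux_params = container.setdefault("linuxParameters", {})
    linux_params["devices"] = [
        {
            "containerPath": "/dev/neuron%d" % i,
            "hostPath": "/dev/neuron%d" % i,
            "permissions": ["read", "write"],
        }
        for i in range(num_devices)
    ]
    return job_definition

# Example: an inf1.6xlarge exposes multiple Neuron devices.
jd = add_neuron_devices({}, num_devices=2)
```

The resulting dict can then be passed to boto3's register_job_definition as usual.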
a
@melodic-train-1526 You would also have to add inferentia here: https://github.com/Netflix/metaflow/blob/master/metaflow/plugins/aws/batch/batch_cli.py#L153
You can check the CLI args getting invoked by uncommenting https://github.com/Netflix/metaflow/blob/master/metaflow/runtime.py#L1018
m
Thanks @square-wire-39606, I didn’t mention it in the original message, but we have already added it in the batch_cli file as well. I will, however, try that debugging step.
Hey @square-wire-39606, is there somewhere else I should be adding the inferentia parameter? Below is the output of the debugging command. I know it hits my batch_client code because if I set inferentia=1, for example, it throws a type error (in my batch_client.py code, I check that it is a boolean). I have added it in batch_cli.py, batch.py, batch_client.py, and the batch_decorator code, but I am still getting the above error, presumably because it sees --inferentia --run-time-limit 432000 and doesn’t properly recognise the --inferentia?? Can you advise?
running /home/ec2-user/pytorch_venv/bin/python3 main.py --quiet --metadata service --environment local --datastore s3 --event-logger nullSidecarLogger --monitor nullSidecarMonitor --datastore-root s3://metaflow-s3-development-euwe1/metaflow batch step infer_video 445e09c5381ee16b18570566ddfe471a9f338487 s3://metaflow-s3-development-euwe1/metaflow/Inferrer/data/44/445e09c5381ee16b18570566ddfe471a9f338487 --run-id 1841 --task-id 101118 --input-paths 1841/start/101113 --split-index 4 --retry-count 0 --max-user-code-retries 0 --namespace user:ec2-user --cpu 1 --gpu 0 --memory 7900 --image 619782547715.dkr.ecr.eu-west-1.amazonaws.com/ml_research/metaflow_inference_docker:inferentia --queue inf-metaflow-development-euwe1 --iam-role arn:aws:iam::619782547715:role/metaflow-batch_s3_task_role-development-euwe1 --inferentia --run-time-limit 432000
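For reference, a hedged sketch of the likely parsing issue (the command and option names below are illustrative, not Metaflow's actual definitions). Metaflow's batch_cli.py is built on click, and in click a boolean option must be declared with is_flag=True; otherwise --inferentia consumes the next token (--run-time-limit) as its value and 432000 is left dangling, which is exactly the "Got unexpected extra argument (432000)" error above:

```python
import click
from click.testing import CliRunner

# Correct declaration: is_flag=True means --inferentia consumes no value.
@click.command()
@click.option("--inferentia", is_flag=True, default=False)
@click.option("--run-time-limit", type=int, default=432000)
def step(inferentia, run_time_limit):
    click.echo("inferentia=%s limit=%d" % (inferentia, run_time_limit))

ok = CliRunner().invoke(step, ["--inferentia", "--run-time-limit", "432000"])

# Buggy declaration: without is_flag, click treats the NEXT token as the
# value of --inferentia, so "432000" becomes an unexpected extra argument.
@click.command()
@click.option("--inferentia")
def bad_step(inferentia):
    click.echo(inferentia)

bad = CliRunner().invoke(bad_step, ["--inferentia", "--run-time-limit", "432000"])
```

With is_flag=True the first invocation parses cleanly; the second reproduces the "unexpected extra argument" failure mode.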
Sorted now!
s
Awesome!
f
Hey @melodic-train-1526, I have a similar use case. Are you guys looking to open a PR with this feature implemented?
m
Hi @fresh-battery-60373, yes we can. We have it all working; there are only a few changes. I will aim to make the PR by the end of the week.
Hi @fresh-battery-60373, I have now made a PR (apologies, I was on holiday last week and didn’t get around to it before leaving): https://github.com/Netflix/metaflow/pull/997. @square-wire-39606, are you able to take a look when you have some time?
a
Thanks @melodic-train-1526! I am away today and tomorrow - is it okay if I take a look on Wednesday?
m
No probs, it’s no rush. Just drawing your attention to this when you have time. Enjoy your time off.
s
@melodic-train-1526 I have made some comments. Overall the PR looks good - let's get it to the finish line!
m
Great, I will aim to take a look tomorrow. Also, while I have your attention, is there any reason why you haven’t yet merged the throttling PR from jaklinger? The one that throttles calls to Batch’s DescribeJobs API.
a
Yep - I just made a comment on that PR too a few minutes back
We are planning on a major release around Kubernetes shortly. After that's out, I can shift focus on getting the throttling PR sorted and merged.
Basically, rather than throttling the calls, we can simply reduce the number of times these calls are invoked
m
That would be really nice, thanks