# ask-metaflow
b
Hi Outerbounds team, our team recently encountered what we believe is a unique use case that behaved a little differently than we expected (this might be the expected behavior on the Metaflow side, but we wanted to confirm). We have a custom `@subflow` decorator that works similarly to the way `metaflow.Runner` executes flows remotely, and some flows using this decorator are deployed to Step Functions and run via Batch. We found the issue by deploying a flow to Step Functions in which the subflow being executed uses the `@batch` decorator: kicking off that job from within the flow on Batch fails with an IAM role error. We validated that the same thing occurs if we call `metaflow.Runner` directly within the Step Functions-deployed flow, and that it can be remedied by specifying the `iam_role` for `@batch`. This tells us that the Batch role passed to the step function is not being passed down to the flow launched inside it, we think because Step Functions passes it as an argument to the `python flow.py run` command. We wanted to check whether this is the expected behavior or a use case that is new for `metaflow.Runner`. Thank you, and happy to jump on a call to describe it in more detail if needed.
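For context, here's a simplified sketch of the pattern (the flow names and ARNs are placeholders, not our actual code): the parent flow is deployed to Step Functions and runs on Batch, and one of its steps launches a child flow with `metaflow.Runner`, which is roughly what our `@subflow` decorator does under the hood. The nested submission only works for us once `iam_role` is pinned explicitly on the child's `@batch` step.
```python
# parent_flow.py -- deployed to Step Functions, runs on AWS Batch.
# child_flow.py (not shown) is an ordinary flow whose start step is decorated
# with @batch(..., iam_role="arn:aws:iam::1234567890:role/metaflowBatchRole");
# without that explicit iam_role, the nested Batch submission fails for us.
from metaflow import FlowSpec, Runner, batch, step


class ParentFlow(FlowSpec):

    @batch(cpu=1, memory=4096)
    @step
    def start(self):
        # Roughly what our custom @subflow decorator does: run another flow
        # from inside this Batch task and wait for it to finish.
        with Runner("child_flow.py").run() as running:
            print("child run finished:", running.run)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ParentFlow()
```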
s
Hi Kevin, `metaflow.Runner` assumes that you have a valid Metaflow config wherever you are invoking the Runner API.
Depending on how you have set up the infrastructure, it may or may not be desirable/possible for a Batch job to be able to submit other Batch jobs.
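For example, something roughly like this inside the step that invokes the Runner would work, since the subprocess the Runner launches inherits the environment. The key names and values below are just illustrative; double-check them against your actual config:
```python
import os

from metaflow import Runner

# Supply the config the nested run needs as METAFLOW_* environment variables
# instead of a config.json (values here are placeholders).
os.environ.update(
    {
        "METAFLOW_DEFAULT_DATASTORE": "s3",
        "METAFLOW_DATASTORE_SYSROOT_S3": "s3://metaflow-bucket/metaflow",
        "METAFLOW_DEFAULT_METADATA": "service",
        "METAFLOW_SERVICE_URL": "https://metadata.example.com",  # placeholder
        "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:us-east-1:1234567890:job-queue/metaflow-queue",
        "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::1234567890:role/metaflowBatchRole",
    }
)

# The child flow now picks up the same datastore, queue, and IAM role as the
# parent deployment without needing ~/.metaflowconfig/config.json.
with Runner("child_flow.py").run() as running:
    print("child run:", running.run)
```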
b
Totally understand, yeah, we debated it for some time as well but felt we proved it was suitable for our use case. Based on my initial digging, with Step Functions we pass these config values via CLI args, I believe? (when deploying the step function)
In this type of scenario, is there an optimal method to ensure that a `@batch` job launched from Step Functions, which itself executes another `@batch` job, would be able to share the same `iam_role` set further up the hierarchy?
I just see this call being made within the step function definition, where the `iam_role` is set as part of the `python flow.py step start` call. I believe that is why the first `@batch` call works but the role is not part of the environment that the `metaflow.Runner`-style call executes in, since there is no `~/.metaflowconfig/config.json` inside the container?
```
&& python mock_flow.py --with batch:cpu=1,gpu=0,memory=4096,image=1234567890.dkr.ecr.us-east-1.amazonaws.com/metaflow:latest,queue=arn:aws:batch:us-east-1:1234567890:job-queue/metaflow-queue,iam_role=arn:aws:iam::1234567890:role/metaflowBatchRole,use_tmpfs=False,tmpfs_tempdir=True,tmpfs_path=/metaflow_temp --quiet --metadata=service --environment=local --datastore=s3 --datastore-root=s3://metaflow-bucket/metaflow --event-logger=nullSidecarLogger --monitor=nullSidecarMonitor --no-pylint --with=step_functions_internal step start --run-id sfn-$METAFLOW_RUN_ID --task-id ${AWS_BATCH_JOB_ID} --retry-count $((AWS_BATCH_JOB_ATTEMPT-1)) --max-user-code-retries 0 --input-paths sfn-${METAFLOW_RUN_ID}/_parameters/${AWS_BATCH_JOB_ID}-params)
```
s
you would need to ensure that the role attached to the batch job has permissions to submit further batch jobs
and you would have to ensure that the config is also somehow available on the batch instance
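For the config part, one option is to bake a default config into the image, e.g. by generating `~/.metaflowconfig/config.json` at image build time. A rough sketch, using the same placeholder values as your CLI args above (adjust the keys to match your deployment):
```python
# write_default_config.py -- run during the image build (e.g. from the
# Dockerfile) so every Batch container starts with a usable Metaflow config.
import json
import os

default_config = {
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://metaflow-bucket/metaflow",
    "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:us-east-1:1234567890:job-queue/metaflow-queue",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::1234567890:role/metaflowBatchRole",
    # ...plus whatever other METAFLOW_* settings your deployment needs.
}

path = os.path.expanduser("~/.metaflowconfig/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(default_config, f, indent=2)
```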
b
Got it. Can you help me confirm that if we embed some default config in the Batch container image and new values are passed at run time like in the CLI args above, those will override the default `config.json` in the container image?
s
Unfortunately the easiest way would be to test it out
It should work, at least on Kubernetes deployments, since that's a pattern we use internally quite a bit.
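e.g. a tiny flow like this makes it easy to see which settings win: bake your default `config.json` into the image, run with the explicit overrides (like the `--with batch:...iam_role=...` args above), and compare which role the task reports. Names here are placeholders, and it assumes `boto3` is available in your image:
```python
# role_check_flow.py -- prints the IAM role the Batch task actually assumed,
# so you can compare a plain run against one with explicit overrides.
from metaflow import FlowSpec, batch, step


class RoleCheckFlow(FlowSpec):

    @batch(cpu=1, memory=4096)
    @step
    def start(self):
        import boto3  # assumed to be installed in the container image

        identity = boto3.client("sts").get_caller_identity()
        print("task is running as:", identity["Arn"])
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RoleCheckFlow()
```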