# dev-metaflow
b
Hello, we are starting to use Metaflow with the AWS CloudFormation script to run on Batch and Step Functions. I'm fumbling through initial work trying to create some examples for the team and ran into something that I likely misconfigured somewhere. Before I go adding permissions to the CloudFormation script, which is quite extensive and complete, I'd like to ask around and see if there is something more fundamental I can address. When running a simple FlowSpec with the @batch decorator, I get this error on the console:
ECS was unable to assume the role 'arn:aws:iam::<ACCT>:role/metaflow-test-dev-BatchExecutionRole-Nw7a9bPcvQUI' that was provided for this task...
```python
from metaflow import FlowSpec, pypi, step, batch

class UnFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.go)

    @batch(image="<ACCT>.dkr.ecr.us-east-1.amazonaws.com/ml/metaflow:base")
    @step
    def go(self):
        from unstructured.partition.pdf import partition_pdf

        elements = partition_pdf(filename="example.pdf")
        print("\n\n".join([str(el) for el in elements]))

        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    UnFlow()
```
Apologies if this is not the correct spot for this question, and thanks in advance for any support!
s
@brave-camera-54148 does this error happen when submitting the job? It might be easier if you are able to share the full terminal log.
b
Yes, this happens when I submit the run to AWS Batch:
```
ubuntu@ip-10-20-1-98:~/metaflow/app$ python unflow.py run
Metaflow 2.12.22 executing UnFlow for user:ubuntu
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2024-10-03 15:36:00.915 Workflow starting (run-id 52):
2024-10-03 15:36:01.856 [52/start/136 (pid 196460)] Task is starting.
2024-10-03 15:36:03.909 [52/start/136 (pid 196460)] Task finished successfully.
2024-10-03 15:36:04.126 [52/go/137 (pid 196499)] Task is starting.
2024-10-03 15:36:05.523 [52/go/137 (pid 196499)] [064e0bf8-df71-4488-8514-5d2984634b3f] Task is starting (status SUBMITTED)...
2024-10-03 15:36:09.614 [52/go/137 (pid 196499)] [064e0bf8-df71-4488-8514-5d2984634b3f] Task is starting (status RUNNABLE)...
2024-10-03 15:36:13.864 [52/go/137 (pid 196499)] [064e0bf8-df71-4488-8514-5d2984634b3f] Task is starting (status FAILED)...
2024-10-03 15:36:14.602 [52/go/137 (pid 196499)] AWS Batch error:
2024-10-03 15:36:15.067 [52/go/137 (pid 196499)] ECS was unable to assume the role 'arn:aws:iam::<ACCT>:role/metaflow-test-dev-BatchExecutionRole-Tabcd1234XPE' that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role. This could be a transient error. Use @retry to retry.
2024-10-03 15:36:15.067 [52/go/137 (pid 196499)] 
2024-10-03 15:36:15.205 [52/go/137 (pid 196499)] Task failed.
2024-10-03 15:36:15.264 Workflow failed.
2024-10-03 15:36:15.264 Terminating 0 active tasks...
2024-10-03 15:36:15.264 Flushing logs...
    Step failure:
    Step go (task-id 137) failed.
```
I was able to make some progress by adding this to the Batch execution role's trust relationship. However, I would expect this to be in the CloudFormation script, and it's unlikely that I'd be the first one to find this issue, so perhaps some other config that I set incorrectly might be causing this?
```json
{
    "Effect": "Allow",
    "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
}
```
And now the "Task is starting" phase took a while, so I'm assuming the image is being pulled into the ECS container. However, it's now failing with the traceback below. I'm using an image because one of the libraries doesn't have a wheel and can't use the @pypi or @conda decorators, so using an image felt like the only solution left?
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 365, in <module>
    cli(auto_envvar_prefix='MFCACHE')
  File "/opt/latest/lib/python3.11/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/latest/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/latest/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/latest/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 359, in cli
    Scheduler(store, max_actions).loop()
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 327, in loop
    self.cleanup_if_necessary()
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 291, in cleanup_if_necessary
    self.cleanup_workers()
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 299, in cleanup_workers
    self.cleanup_pool()
  File "/root/services/ui_backend_service/data/cache/client/cache_server.py", line 305, in cleanup_pool
    self.pool = multiprocessing.Pool(
  File "/usr/local/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.11/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.11/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/root/services/ui_backend_service/data/cache/client/cache_worker.py", line 29, in execute_action
    execute(tempdir, action_cls, request)
  File "/root/services/ui_backend_service/data/cache/client/cache_worker.py", line 51, in execute
    res = action_cls.execute(
  File "/root/services/ui_backend_service/data/cache/get_log_file_action.py", line 141, in execute
    with streamed_errors(stream_output):
  File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/root/services/ui_backend_service/data/cache/utils.py", line 130, in streamed_errors
    get_traceback_str()
  File "/root/services/ui_backend_service/data/cache/utils.py", line 124, in streamed_errors
    yield
  File "/root/services/ui_backend_service/data/cache/get_log_file_action.py", line 150, in execute
    total_lines = count_total_lines(local_paths)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/services/ui_backend_service/data/cache/get_log_file_action.py", line 293, in count_total_lines
    for path in paths:

TypeError: 'NoneType' object is not iterable
```
s
This stack trace is coming from the image for the Metaflow service ^ you don't need to use this image for your task.
Any image that has python and curl would work.
b
It's the unstructured dependency that I am trying to use in the flow, which is why I used that image. Or are you saying there is a way to just use a python/curl image and install the dependencies after?
s
you don't need to use the metaflow-service image as the base image
b
It's only named that... I'm using Ubuntu as the base image.
s
is the stack trace for a metaflow task on your console or in the ui?
b
ui
s
where exactly are you seeing this error?
b
sorry, edited
s
ah ok
b
I removed the import for unstructured and the image in the @batch decorator, did a simple print instead, and I'm getting the same error.
s
It seems that the UI is not configured properly. It should already be set correctly, but maybe there is a regression somewhere. However, you should see all the logs correctly in the terminal.
b
In the most recent run (not using the image I built), this is what the console shows:
```
Metaflow 2.12.22 executing UnFlow for user:ubuntu
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2024-10-03 17:26:10.620 Workflow starting (run-id 66):
2024-10-03 17:26:11.603 [66/start/181 (pid 237352)] Task is starting.
2024-10-03 17:26:13.458 [66/start/181 (pid 237352)] Task finished successfully.
2024-10-03 17:26:13.851 [66/go/182 (pid 237381)] Task is starting.
2024-10-03 17:26:15.653 [66/go/182 (pid 237381)] [0f0a100f-4919-4f84-ac0d-51591f0a3834] Task is starting (status SUBMITTED)...
2024-10-03 17:26:18.721 [66/go/182 (pid 237381)] [0f0a100f-4919-4f84-ac0d-51591f0a3834] Task is starting (status RUNNABLE)...
2024-10-03 17:26:20.889 [66/go/182 (pid 237381)] [0f0a100f-4919-4f84-ac0d-51591f0a3834] Task is starting (status STARTING)...
2024-10-03 17:26:27.262 [66/go/182 (pid 237381)] [0f0a100f-4919-4f84-ac0d-51591f0a3834] Task is starting (status RUNNING)...
2024-10-03 17:27:32.358 [66/go/182 (pid 237381)] AWS Batch error:
2024-10-03 17:27:32.358 [66/go/182 (pid 237381)] Essential container in task exited This could be a transient error. Use @retry to retry.
2024-10-03 17:27:32.358 [66/go/182 (pid 237381)] 
2024-10-03 17:27:32.654 [66/go/182 (pid 237381)] Task failed.
2024-10-03 17:27:32.746 Workflow failed.
2024-10-03 17:27:32.747 Terminating 0 active tasks...
2024-10-03 17:27:32.747 Flushing logs...
    Step failure:
    Step go (task-id 182) failed.
```
For this code:
```python
from metaflow import FlowSpec, pypi, step, batch

class UnFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.go)

    @batch()
    @step
    def go(self):
        print('hello world')
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    UnFlow()
```
...this is the first app we have tried to run since running the CloudFormation script. Just to be clear, this is likely something we did incorrectly during setup, as we have never had anything run successfully before.
s
Ah I see. If you log into the Batch console and search for the job (0f0a100f-4919-4f84-ac0d-51591f0a3834), do you see anything in the job logs?
b
Yes, it seems to be failing to download the code package from S3: `Failed to download code package from s3://metaflow-test-dev-metaflows3bucket-jm0gdjt6urim/winter/UnFlow/data/ed/ed7871c064920ecd553f9ca891b119841201bd87 after 6 tries. Exiting...`
s
It seems that the AWS Batch task role doesn't have the permissions to talk to S3.
Can you check which role is being used, and whether it's the same one that the CFN template minted?
b
OK, we ran into permission issues earlier... had to add ecs-tasks.
I added S3 permissions to test and it was successful. I have a good starting point now, thank you so much for your help! Do you think the Batch execution IAM role should have had the ecs-tasks trust relationship and S3 permissions from the CloudFormation script? I'd like to track that down when I get some time, to see where we goofed it up.
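For reference, the kind of statement I added to the job role. The bucket name is the one from the download error above; the exact actions and resources your deployment needs may differ, so treat this as illustrative:

```json
{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::metaflow-test-dev-metaflows3bucket-jm0gdjt6urim",
        "arn:aws:s3:::metaflow-test-dev-metaflows3bucket-jm0gdjt6urim/*"
    ]
}
```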
s
it should already have those
b
Thank you again, I was able to get everything working as expected! I'm not sure why there were permission issues, and will follow up on them later (something we likely did incorrectly during setup). I do have a question about Docker. Is it common for Metaflow images to create a user, or to use root? I am not entirely sure where the username is being specified, and I found that setting the local env var USERNAME allows execution even when logged in as root. I'm just trying to create a decent Dockerfile to use for our projects.
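In case it helps anyone else, a sketch of the workaround I landed on. This assumes Metaflow's usual username resolution, where the METAFLOW_USER environment variable, if set, takes precedence over USERNAME/USER:

```shell
# If the container runs as root with no USER/USERNAME set, Metaflow cannot
# infer a username to attribute runs to. Exporting one explicitly works:
export METAFLOW_USER=ubuntu
echo "metaflow will attribute runs to: ${METAFLOW_USER}"

# then run the flow as usual (commented out here):
# python unflow.py run
```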
s
We don't enforce any requirements on the Docker image.
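For example, a minimal base along these lines should be enough for Batch tasks, since the bootstrap only needs python and curl on the PATH. Package choices here are illustrative, not an official recommendation:

```dockerfile
FROM ubuntu:22.04

# Metaflow's Batch bootstrap only needs python and curl available on PATH.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/* \
    && ln -s /usr/bin/python3 /usr/local/bin/python

# Dependencies that lack wheels (the reason for a custom image in this
# thread) can be baked in here, e.g.:
# RUN pip3 install unstructured[pdf]
```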