# dev-metaflow
w
You should find them in CloudFormation - the output of running the stack should have the variables that you can copy and paste (it's a one-off activity!)
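If it helps, something like this should dump the stack outputs so you can grab the METAFLOW_* values - just a sketch, assuming boto3 credentials for that account and a stack called "metaflow" (yours may be named differently):

# sketch: list CloudFormation stack outputs to copy the Metaflow config values
# assumes credentials for the account where the stack was deployed and that the
# stack is called "metaflow" (the actual stack name may differ)
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="metaflow")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])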
h
thanks - will ask devops tomorrow …
Hope this is the right place to ask. So I think I configured AWS metaflow correctly and I try to run:
python batch2.py run --with batch
see the code below, but I get:
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
AWS Batch error: The @batch decorator requires --datastore=s3.

from metaflow import FlowSpec, step, batch

class AddTwoNumbersFlow(FlowSpec):

    @step
    def start(self):
        self.number1 = 20
        self.number2 = 30
        self.next(self.add)

    @batch(cpu=1, memory=500)
    @step
    def add(self):
        self.result = self.number1 + self.number2
        self.next(self.end)

    @step
    def end(self):
        print("The result is:", self.result)

if __name__ == '__main__':
    AddTwoNumbersFlow().main()
w
If you use Batch, your data needs to be accessible to a Batch instance, i.e., it can't be on your laptop - that's why you get the message. It means your Metaflow configuration is not correctly pointing to S3 for data snapshots
I'm sure @ancient-application-36103 can be more precise of course 😉
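if you want to double-check what Metaflow has actually resolved, something like this should do it - a sketch; the attribute names in metaflow.metaflow_config are from memory, so double-check them:

# sketch: print the datastore settings Metaflow has actually picked up
# (attribute names in metaflow.metaflow_config are from memory - verify them)
from metaflow.metaflow_config import DEFAULT_DATASTORE, DATASTORE_SYSROOT_S3

print("default datastore:", DEFAULT_DATASTORE)   # should be "s3" for @batch to work
print("s3 sysroot:", DATASTORE_SYSROOT_S3)       # should match METAFLOW_DATASTORE_SYSROOT_S3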
h
Ok thanks. I think I configured everything fine given the details provided by devops. There is a file config.json in .metaflowconfig and METAFLOW_DATASTORE_SYSROOT_S3, METAFLOW_DATATOOLS_S3ROOT, METAFLOW_DEFAULT_DATASTORE are configured: "METAFLOW_DATASTORE_SYSROOT_S3": "s3://metaflow-xxx/metaflow", "METAFLOW_DATATOOLS_S3ROOT": "s3://metaflow-xxx/data", "METAFLOW_DEFAULT_DATASTORE": "s3". Mind you, I cannot see the folders metaflow and data when I go to S3? My AWS profile ABC is pointing at (?) the one I usually use for AWS Sagemaker work ... maybe that's why? How do I force it to use the config.json in .metaflowconfig?
w
you can "force" both AWS profile and metaflow profile when you run stuff -> e.g. if you have nmore than one AWS profile you can do
AWS_PROFILE=metaflow_enabled_profile python file run
and same with Metaflow profiles
this line, for example, quickly checks the S3 setup ->
https://github.com/jacopotagliabue/recs-at-resonable-scale/blob/2b16a7701abfad74f9b2c7c3354f1ad27f792137/src/my_merlin_flow.py#L109
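if you don't want to dig through the repo, the idea is roughly this - a sketch that round-trips a tiny object through the data root from your config (replace the s3root with your actual METAFLOW_DATATOOLS_S3ROOT):

# sketch: round-trip a small object through the configured S3 root to verify access
# assumes the METAFLOW_DATATOOLS_S3ROOT from your config.json (replace "metaflow-xxx")
from metaflow import S3

with S3(s3root="s3://metaflow-xxx/data/sanity-check") as s3:
    s3.put("ping", "pong")        # write a tiny test object
    print(s3.get("ping").text)    # should print "pong" if credentials and config are right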
h
This really helped. Thanks! I got closer but now get: botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the RegisterJobDefinition operation: Cross-account pass role is not allowed. I think this is related, but it's closed/not resolved … https://github.com/Netflix/metaflow/issues/145
c
Hey Christian, this stuff can get pretty complex, so I'm with you feeling overwhelmed 🙂 Have you tried a HelloWorld flow that doesn't read anything external and runs empty start and end steps decorated with @batch? I am not sure why you'd be getting this Cross-account pass role error unless you are trying to access resources in another AWS account from the account Metaflow was deployed in. Is this the case?
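One quick way to check is to compare the account your local credentials resolve to with the account the Metaflow stack (and its bucket/roles) lives in - a sketch:

# sketch: see which AWS account/identity your local credentials actually resolve to;
# if this account differs from the one Metaflow was deployed in, you are in the
# cross-account case
import boto3

ident = boto3.client("sts").get_caller_identity()
print("account:", ident["Account"])
print("arn:", ident["Arn"])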
h
Thanks for the reply. I will try to liaise with the devops next week. I am using the credentials/profile/user I usually use to spin up Sagemaker jobs in AWS (selected in AWS). I also use the details provided by the devops, which are in ~/.metaflowconfig/config.json. I can see the below in the UI. But I get:
2023-05-19 15:18:28.993 [5/start/10 (pid 92971)] raise error_class(parsed_response, operation_name)
2023-05-19 15:18:28.993 [5/start/10 (pid 92971)] botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the RegisterJobDefinition operation: Cross-account pass role is not allowed.
2023-05-19 15:18:29.138 [5/start/10 (pid 92971)] Task failed.
2023-05-19 16:18:29.351 This failed task will not be retried.
Internal error: The end step was not successful by the end of flow.
Maybe I have to adapt my "sagemaker user"? I googled a lot but could not find anything useful … Hopefully the devops can help next week.
c
Sounds good. The solution will be different if you are trying to use resources in a Metaflow step that come from an AWS account that is different than the account Metaflow was deployed in, for example maybe your Sagemaker account or some account that stores your S3 buckets is different than where DevOps deployed Metaflow for you. This case is more complex, but we can help you tell DevOps specifically what is needed. The case where you don't need to access resources in another account and just want the flow to run is easier. I don't think you should need any AWS profile besides your Metaflow config to get started with a simple flow.
h
thanks, I really am only trying to run a hello world piece:

from metaflow import FlowSpec, step, batch, retry

class AddTwoNumbersFlow(FlowSpec):

    @step
    def start(self):
        self.number1 = 20
        self.number2 = 30
        self.next(self.add)

    @batch(cpu=1, memory=500)
    # @retry
    @step
    def add(self):
        self.result = self.number1 + self.number2
        self.next(self.end)

    @step
    def end(self):
        print("The result is:", self.result)

if __name__ == '__main__':
    AddTwoNumbersFlow().main()
c
What do you see when you run this hello world?
from metaflow import FlowSpec, step

class F(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    F()
h
I use: python batch2.py run --with batch. It seems to use my account. Maybe that's the issue:
Metaflow 2.8.3 executing AddTwoNumbersFlow for user:bla.bla
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
2023-05-20 15:16:30.733 Workflow starting (run-id 6):
2023-05-20 15:16:31.540 [6/start/12 (pid 99214)] Task is starting.
2023-05-20 14:16:32.479 [6/start/12 (pid 99214)] Traceback (most recent call last):
2023-05-20 14:16:32.533 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch_cli.py", line 285, in step
2023-05-20 14:16:32.533 [6/start/12 (pid 99214)] batch.launch_job(
2023-05-20 14:16:32.533 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch.py", line 332, in launch_job
2023-05-20 14:16:32.533 [6/start/12 (pid 99214)] job = self.create_job(
2023-05-20 14:16:32.533 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch.py", line 197, in create_job
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] self._client.job()
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch_client.py", line 382, in job_def
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] self.payload["jobDefinition"] = self._register_job_definition(
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch_client.py", line 361, in _register_job_definition
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] raise ex
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/metaflow/plugins/aws/batch/batch_client.py", line 350, in _register_job_definition
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] response = self._client.register_job_definition(**job_definition)
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] return self._make_api_call(operation_name, kwargs)
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] raise error_class(parsed_response, operation_name)
2023-05-20 14:16:32.534 [6/start/12 (pid 99214)] botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the RegisterJobDefinition operation: Cross-account pass role is not allowed.
2023-05-20 14:16:32.639 [6/start/12 (pid 99214)] Task failed.
2023-05-20 15:16:32.815 This failed task will not be retried.
Internal error: The end step was not successful by the end of flow.
c
does it run without batch?
python batch2.py run
oh, you will also need to comment out your @batch decorator if using the above flow you posted
h
thanks, your code above works. Using: python batch2.py run also produces: botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the RegisterJobDefinition operation: Cross-account pass role is not allowed. which is a bit bizarre as this is local, right?
ah yes, re "oh will also need to comment out your @batch decorator if using above flow you posted" - I was actually going to ask about that. I tried this before I tried to use AWS - yes, that worked/works fine locally … Thanks. So not sure what to do …
Can I not somehow force it to only use what is in ~/.metaflowconfig/config.json? I think the issue may be that it uses me as the user, not even the profile I usually use when I do Sagemaker stuff …
c
Yeah, that's what I'm wondering too. Not sure how to do some kind of "ignore any other AWS" config, but looking for some info on that
What happens if you remove the env vars first, with these commands:
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset AWS_PROFILE
and then run the flow --with batch again?
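You could also check which credential source boto3 ends up picking - a sketch (the .method values come from botocore and I'm going from memory):

# sketch: show which credential source botocore ends up using
# (.method is e.g. "env", "shared-credentials-file", "assume-role" - from memory)
import boto3

session = boto3.Session()
creds = session.get_credentials()
print("profile:", session.profile_name)
print("credential source:", creds.method if creds else None)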
h
same … I am still unsure why my user name is actually being used. At least it should use my AWS profile …
Interesting - if I use: AWS_PROFILE=xyz python batch2.py run --with batch, afaik forcing the AWS profile I usually use for Sagemaker, I get: Metaflow 2.8.3 executing AddTwoNumbersFlow for user:x.y Validating your flow... The graph looks good! Running pylint... Pylint is happy! S3 access denied: s3://metaflow-684414486554-s3-55ff636/metaflow/AddTwoNumbersFlow/12/_parameters/26/0.attempt.json
which profile/user does one have to use for the stuff specified in .metaflowconfig/config.json?
btw the profile xyz is the same AWS environment where Metaflow was installed/configured. So running: AWS_PROFILE=xyz python batch2.py run --with batch gets rid of: botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the RegisterJobDefinition operation: Cross-account pass role is not allowed. but I get: S3 access denied: s3://metaflow-684414486554-s3-55ff636/metaflow/AddTwoNumbersFlow/16/_parameters/31/0.attempt.json This indicates that the xyz profile cannot execute the Metaflow flow. How is the profile related to what is in .metaflowconfig/config.json? I do not understand. Do I need to use some profile to use Metaflow, or does .metaflowconfig/config.json contain the information and I should create another profile?
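For what it's worth, this is a quick test of whether a given profile can even list the datastore root - a sketch, with the bucket and prefix taken from the error above and the profile name as a placeholder:

# sketch: check whether a given AWS profile can list the Metaflow datastore root
# bucket/prefix taken from the "S3 access denied" error above; "xyz" is the profile
# being tested and just a placeholder
import boto3

s3 = boto3.Session(profile_name="xyz").client("s3")
resp = s3.list_objects_v2(Bucket="metaflow-684414486554-s3-55ff636",
                          Prefix="metaflow/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])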
ok I think I found the solution. I created a new profile: metaflow configure aws --profile my-profile and added: export METAFLOW_PROFILE=my-profile to my .zprofile. I now get an error which the devops have to fix: metaflow.plugins.aws.batch.batch_client.BatchJobException: No AWS Fargate task execution IAM role found. Please see https://docs.aws.amazon.com/batch/latest/userguide/execution-IAM-role.html and set the role as METAFLOW_ECS_FARGATE_EXECUTION_ROLE environment variable. 2023-05-20 15:27:20.274 [20/start/36 (pid 1566)] Task failed. 2023-05-20 16:27:20.447 This failed task will not be retried.
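Once devops provide the role ARN, one way to wire it in might be to add it to the profile's config file - a sketch, assuming METAFLOW_PROFILE=my-profile maps to ~/.metaflowconfig/config_my-profile.json and that the ARN below is a placeholder:

# sketch: add the Fargate execution role to the profile's Metaflow config file
# assumes METAFLOW_PROFILE=my-profile reads ~/.metaflowconfig/config_my-profile.json
# (naming from memory) and that the ARN below is a placeholder from devops
import json, os

path = os.path.expanduser("~/.metaflowconfig/config_my-profile.json")
with open(path) as f:
    cfg = json.load(f)
cfg["METAFLOW_ECS_FARGATE_EXECUTION_ROLE"] = "arn:aws:iam::<account-id>:role/<execution-role>"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)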
So I now make sure that I use the right AWS environment using: AWS_PROFILE=profile1 python batch2.py run --with batch but I get: Metaflow 2.8.3 executing AddTwoNumbersFlow for user:firstname.surname Validating your flow... The graph looks good! Running pylint... Pylint is happy! S3 access denied: s3://metaflow-xyz/metaflow/AddTwoNumbersFlow/30/_parameters/50/0.attempt.json How can the user use Metaflow? I don't think this is documented anywhere? BTW the odd thing is that the user is not the one usually used in the specified profile1?
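On the user question: as far as I can tell, the name in "executing ... for user:x" comes from the METAFLOW_USER / USER environment variables on the local machine, not from the AWS profile, so a mismatch there is probably expected - a quick check (a sketch):

# sketch: the user Metaflow displays is (as far as I recall) taken from environment
# variables on the local machine, not from the AWS profile
import getpass, os

print("METAFLOW_USER:", os.environ.get("METAFLOW_USER"))
print("USER:", os.environ.get("USER"))
print("getpass fallback:", getpass.getuser())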