# dev-metaflow
c
Good morning. Trying to set up sagemaker (from here: https://github.com/outerbounds/metaflow-tools/tree/master/aws/terraform/) in a private subnet to use with EMR Serverless. All the env vars are set, the auth key is set, yet execution of helloworld.py errors out after the pylint step. The link to the docs here: https://github.com/outerbounds/metaflow-tools/tree/master/aws/terraform/sagemaker-notebook is not available anymore. I will go over the SG, NAT and route tables to double-check the rules, but any hints are welcome (like maybe I should use the external URL instead of the internal one; and if so, what authentication is used in that case?). Thank you very much 🙏
✅ 1
a
That sounds like it can't reach the metadata service. The solution depends on your networking setup. For example, the most common setup using the OSS metaflow terraform modules is the metadata service running on ECS behind API gateway, with API gateway handling auth via API keys. If your setup matches this scenario, the API gateway endpoint is available on the "public" internet. In this case I'd check if
• you can curl the metadata service locally from your laptop and at least get an auth error (not a network timeout)
• auth works when you curl with the metadata service API key, i.e.
curl -v -H 'x-api-key: YOUR-API-KEY-GOES-HERE' https://your-metadata-service.execute-api.us-west-2.amazonaws.com/api/flows/
• do the same from the sagemaker machine
šŸ‘ 1
if it works from the laptop but not from sagemaker, then it's time to dig deeper into the sagemaker networking setup (missing internet gateway route? etc.)
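The reachability checks above can also be scripted. A minimal stdlib-only sketch; `check_service` is a hypothetical helper of mine, not a Metaflow API:

```python
# Distinguish "network problem" from "reachable but auth rejected",
# mirroring the curl checks above. Standard library only.
import urllib.error
import urllib.request

def check_service(url, api_key=None, timeout=10):
    req = urllib.request.Request(url)
    if api_key:
        req.add_header("x-api-key", api_key)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status          # 200 -> reachable and auth OK
    except urllib.error.HTTPError as e:
        return e.code                   # 403 -> reachable, but auth rejected
    except urllib.error.URLError:
        return None                     # timeout / no route: networking issue
```

Run it from both the laptop and the sagemaker machine: `None` points at routing (NAT / internet gateway), a 403 points at the API key or the gateway's resource policy.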
as for `METAFLOW_SERVICE_INTERNAL_URL` and `METAFLOW_SERVICE_URL`:
• metaflow uses `METAFLOW_SERVICE_URL` when running on the machine where you do `flow.py run` (so, laptop or sagemaker)
• `METAFLOW_SERVICE_INTERNAL_URL` is used instead of `METAFLOW_SERVICE_URL` by code inside tasks when they run on AWS Batch. If you don't specify `METAFLOW_SERVICE_INTERNAL_URL`, they just use `METAFLOW_SERVICE_URL`. It's primarily useful in more advanced custom deployments where the AWS Batch compute env runs on an isolated network and so cannot reach `METAFLOW_SERVICE_URL`.
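To make the fallback concrete, a toy sketch of the behavior described above (`resolve_service_url` is my own illustrative helper, not Metaflow's internals):

```python
import os

def resolve_service_url(running_on_batch: bool, env=os.environ) -> str:
    # Tasks on AWS Batch prefer METAFLOW_SERVICE_INTERNAL_URL and fall back
    # to METAFLOW_SERVICE_URL; everything else uses METAFLOW_SERVICE_URL.
    if running_on_batch and "METAFLOW_SERVICE_INTERNAL_URL" in env:
        return env["METAFLOW_SERVICE_INTERNAL_URL"]
    return env.get("METAFLOW_SERVICE_URL", "")
```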
c
Thank you very much for help. Do you have any examples of submitting AWS batch jobs with custom ECR docker image? I couldn't find anything in the docs.
s
sure - you can do `@batch(image='foo')`
c
Both the local machine and sagemaker connect fine, but generate these messages when run with or without /flows at the end:

curl -v -H 'x-api-key:fOZqZclHgu5acbHzWGf9TXgsE81bkPa97xZ8JeCb' https://hps3egfji4.execute-api.ap-southeast-1.amazonaws.com/api/
{"message":"Missing Authentication Token"}

curl -v -H 'x-api-key:fOZqZclHgu5acbHzWGf9TXgsE81bkPa97xZ8JeCb' https://hps3egfji4.execute-api.ap-southeast-1.amazonaws.com/api/flows
{"Message":"User: anonymous is not authorized to perform: execute-api:Invoke on resource: arn:aws:execute-api:ap-southeast-1:********4663:hps3egfji4/api/GET/flows with an explicit deny"}

When sagemaker is on the metaflow service's subnet, helloworld.py runs fine without @batch, with `--with batch` (default image), and with `@batch(image=<custom image>)`. So only using the external URL is an issue at the moment.
a
if you used terraform to deploy, did you set `access_list_cidr_blocks` to a non-default value there? One case where that error would happen is if your IP address doesn't match that setting
c
Yep. Exactly that - in the metaflow main.tf I set `access_list_cidr_blocks = [ data.terraform_remote_state.infra.outputs.vpc_cidr_block ]`, while keeping defaults in prods.tf and variables.tf 🙈
a
hmm I don't think that would work. Since the API gw address resolves to a public IP, `access_list_cidr_blocks` would have to contain public IP addresses too. Maybe we should clarify the description of that variable. vpc_cidr_block is almost certainly an internal VPC IP range
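A quick way to see the mismatch (plain-Python sketch; both addresses below are made-up examples, not values from this deployment):

```python
import ipaddress

# access_list_cidr_blocks set to the VPC range only matches requests whose
# *source* IP is inside the VPC; a laptop hits API Gateway from its public IP.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")    # example internal VPC range
laptop_ip = ipaddress.ip_address("203.0.113.7")   # example public IP (TEST-NET-3)

print(laptop_ip in vpc_cidr)  # False -> the resource policy denies the request
```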
c
Yep, that's what I meant - this is why it didn't work. All good now. Thank you very much 🙏
šŸ‘ 1
a
great!