# dev-metaflow
c
Good morning. Trying to set up sagemaker (from here: https://github.com/outerbounds/metaflow-tools/tree/master/aws/terraform/) in a private subnet to use with EMR Serverless. All the env vars are set, the auth key is set, yet execution of helloworld.py errors out after the pylint step. The link to the docs here: https://github.com/outerbounds/metaflow-tools/tree/master/aws/terraform/sagemaker-notebook is not available anymore. I will go over the SG, NAT and route tables to double-check the rules, but any hints are welcome (like maybe I should use the external URL instead of the internal one; and if so, what authentication is used in that case?). Thank you very much 🙏
✅ 1
a
That sounds like it can't reach the metadata service. The solution depends on your networking setup. For example, the most common setup using the OSS metaflow terraform modules is the metadata service running on ECS behind API gateway, with API gateway handling auth via API keys. If your setup matches this scenario, the API gateway endpoint is available on the "public" internet. In this case I'd check if
• you can curl the metadata service locally from your laptop and at least get an auth error (not a network timeout)
• auth works when you curl with the metadata service API key, i.e.
curl -v -H 'x-api-key: YOUR-API-KEY-GOES-HERE' https://your-metadata-service.execute-api.us-west-2.amazonaws.com/api/flows/
• do the same from the sagemaker machine
šŸ‘ 1
if it works from the laptop but not from sagemaker, then it's time to dig deeper into the sagemaker networking setup (missing internet gateway route? etc.)
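The reachability checks above can also be scripted. A minimal stdlib-only sketch; `check_service` is a hypothetical helper of mine, not a Metaflow API:

```python
# Distinguish "network problem" from "reachable but auth rejected",
# mirroring the curl checks above. Standard library only.
import urllib.error
import urllib.request

def check_service(url, api_key=None, timeout=10):
    req = urllib.request.Request(url)
    if api_key:
        req.add_header("x-api-key", api_key)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status          # 200 -> reachable and auth OK
    except urllib.error.HTTPError as e:
        return e.code                   # 403 -> reachable, but auth rejected
    except urllib.error.URLError:
        return None                     # timeout / no route: networking issue
```

Run it from both the laptop and the sagemaker machine: `None` points at routing (NAT / internet gateway), a 403 points at the API key or the gateway's resource policy.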
as for `METAFLOW_SERVICE_INTERNAL_URL` and `METAFLOW_SERVICE_URL`:
• metaflow uses `METAFLOW_SERVICE_URL` when running on the machine where you do `flow.py run` (so, laptop or sagemaker)
• `METAFLOW_SERVICE_INTERNAL_URL` is used instead of `METAFLOW_SERVICE_URL` by code inside tasks when they run on AWS Batch. If you don't specify `METAFLOW_SERVICE_INTERNAL_URL`, they just use `METAFLOW_SERVICE_URL`. It's primarily useful in more advanced custom deployments where the AWS Batch compute env runs on an isolated network and so cannot reach `METAFLOW_SERVICE_URL`.
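To make the fallback concrete, a toy sketch of the behavior described above (`resolve_service_url` is my own illustrative helper, not Metaflow's internals):

```python
import os

def resolve_service_url(running_on_batch: bool, env=os.environ) -> str:
    # Tasks on AWS Batch prefer METAFLOW_SERVICE_INTERNAL_URL and fall back
    # to METAFLOW_SERVICE_URL; everything else uses METAFLOW_SERVICE_URL.
    if running_on_batch and "METAFLOW_SERVICE_INTERNAL_URL" in env:
        return env["METAFLOW_SERVICE_INTERNAL_URL"]
    return env.get("METAFLOW_SERVICE_URL", "")
```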
c
Thank you very much for help. Do you have any examples of submitting AWS batch jobs with custom ECR docker image? I couldn't find anything in the docs.
s
sure - you can do `@batch(image='foo')`
c
Both the local machine and sagemaker connect fine, but generate these messages when run with or without /flows at the end:

curl -v -H 'x-api-key:fOZqZclHgu5acbHzWGf9TXgsE81bkPa97xZ8JeCb' https://hps3egfji4.execute-api.ap-southeast-1.amazonaws.com/api/
{"message":"Missing Authentication Token"}

curl -v -H 'x-api-key:fOZqZclHgu5acbHzWGf9TXgsE81bkPa97xZ8JeCb' https://hps3egfji4.execute-api.ap-southeast-1.amazonaws.com/api/flows
{"Message":"User: anonymous is not authorized to perform: execute-api:Invoke on resource: arn:aws:execute-api:ap-southeast-1:********4663:hps3egfji4/api/GET/flows with an explicit deny"}

When sagemaker is on the metaflow service's subnet, helloworld.py runs fine without @batch, with `--with batch` (default image), and with `@batch(image=<custom image>)`. So only using the external URL is an issue at the moment.
a
if you used terraform to deploy, did you set `access_list_cidr_blocks` to a non-default value there? One case where that error would happen is if your IP address doesn't match that setting
c
Yep. Exactly that - in the metaflow main.tf I set `access_list_cidr_blocks = [ data.terraform_remote_state.infra.outputs.vpc_cidr_block ]`, while keeping defaults in prods.tf and variables.tf 🙈
a
hmm I don't think that would work. Since the API gw address resolves to a public IP, `access_list_cidr_blocks` would have to contain public IP addresses too. Maybe we should clarify the description of that variable. vpc_cidr_block is almost certainly an internal VPC IP range
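A quick way to see the mismatch (plain-Python sketch; both addresses below are made-up examples, not values from this deployment):

```python
import ipaddress

# access_list_cidr_blocks set to the VPC range only matches requests whose
# *source* IP is inside the VPC; a laptop hits API Gateway from its public IP.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")    # example internal VPC range
laptop_ip = ipaddress.ip_address("203.0.113.7")   # example public IP (TEST-NET-3)

print(laptop_ip in vpc_cidr)  # False -> the resource policy denies the request
```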
c
Yep, that's what I meant - this is why it didn't work. All good now. Thank you very much 🙏
šŸ‘ 1
a
great!