# ask-metaflow
i
Struggling to get a new Metaflow cloud deployment working. I was getting an error about S3 PutObject that seemed to be due to the region not being set. I set that in my Metaflow config file, and am now getting the output below. I've tried both using AWS_PROFILE and using AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN in the config, without luck.
s
hi edward.. could you share a snippet of the code you have (removing sensitive info)?
i
It is the 05-hello-cloud tutorial
I guess a more direct question is: Where are the credentials used for the S3 metadata service supposed to be specified? I see some AWS credentials in my metaflowconfig, but they seem to be used for authenticating to kubernetes, not s3.
s
could you please share the metaflow config you have ?
i
```
(venv) esmith@oppie:/opt/barbenheimer/metaflow/metaflow/tutorials$ cat ~/.metaflowconfig/config_barbenheimer.json 
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "<s3://edflows3tesas58h/metaflow>",
    "METAFLOW_DATATOOLS_S3ROOT": "<s3://edflows3tesas58h/data>",
    "METAFLOW_ARGO_EVENTS_EVENT_BUS": "default",
    "METAFLOW_ARGO_EVENTS_INTERNAL_WEBHOOK_URL": "<http://argo-events-webhook-eventsource-svc.default:12000/metaflow-event>",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_KUBERNETES_NAMESPACE": "default",
    "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "default",
    "METAFLOW_SERVICE_INTERNAL_URL": "<http://edflownlbtesas58h-1936e4ca1afd9089.elb.us-gov-west-1.amazonaws.com/>",
    "METAFLOW_SERVICE_URL": "<https://az7mgjx6q4.execute-api.us-gov-west-1.amazonaws.com/api/>",
    "METAFLOW_SERVICE_AUTH_KEY": "xxxxxxxxxxxxxxxxxxxxxxxxxxx"

}
```
It seems to me that I have to set the AWS_PROFILE as well as AWS_DEFAULT_REGION when calling a metaflow script
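i.e. right now every invocation looks roughly like this (the profile name and flow filename are approximate, not the exact ones):
```bash
# region and profile passed explicitly for every run
AWS_PROFILE=barbenheimer AWS_DEFAULT_REGION=us-gov-west-1 \
  python hello-cloud.py run
```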
I would think it would be possible to put these items in the config somewhere? The AWS_PROFILE is there in the command to authenticate to EKS, but it isn't used for accessing the metadataservice. I was able to move the AWS_DEFAULT_REGION setting to ~/.aws/config to reduce my command line as shown below. However, it seems to just sit at this step:
s
it could indicate that the nodes on your k8s cluster are completely scaled down. It takes ~2-3 mins for a node to come up.
if you have access to your k8s cluster - could you check the status of the nodes?
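e.g.:
```bash
# freshly provisioned nodes show up as NotReady and flip to Ready after a couple of minutes
kubectl get nodes -o wide
```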
i
Ok, that is definitely what I saw on the AWS cluster, although it seems to have a long runtime. I'll check the local cluster pods and see if that was the issue / maybe just let it run longer
On the kubernetes side (local kubernetes cluster) you were right @broad-branch-21430 that it just needed more time. However, it has also now been trying to run the hello world for 5 minutes without success:
s
is the pod actually running on the k8s cluster?
just to rule things out - are there any taints on the nodes ( or groups ) that might prevent the pod from getting scheduled?
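something like:
```bash
# a taint without a matching toleration on the pod will keep it stuck in Pending
kubectl describe nodes | grep -i -A 2 taints
```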
i
Ah, you've spotted it
Although, I'm not sure what the workaround here is... I added more nodes, but it still can't succeed. Removing the @kubernetes decorator does allow it to run normally:
Do I have to indicate the resources of my nodes in order for the @kubernetes decorator to recognize them?
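i.e. should the step be requesting explicit resources so the scheduler knows it fits on an r5.large (2 vCPU / 16 GiB)? Something like this sketch (not the actual tutorial code):
```python
from metaflow import FlowSpec, kubernetes, step

class HelloCloudFlow(FlowSpec):
    # explicit requests (memory is in MB), sized to fit comfortably on an r5.large
    @kubernetes(cpu=1, memory=4096)
    @step
    def start(self):
        print("hello from the cluster")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    HelloCloudFlow()
```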
s
what instance types do you have in your node group
i
r5.large, I believe I've just kept the defaults from the terraform
s
gotcha.. was the pod able to schedule? or is it still saying unschedulable?
i
Let me give it a fresh try and see what it does without changes
I had turned off the retry, but here it is running with it again... in S3 the stderr files look like this:
```
[MFLOG|0|2024-08-07T20:41:18.785223Z|runtime|30e47531-edca-4e55-86e2-08fe2059d929]Sleeping 2 minutes before the next retry
[MFLOG|0|2024-08-07T20:44:51.516999Z|runtime|325f524a-6e4b-4263-8566-9d8fcee1f500]    Kubernetes error:
[MFLOG|0|2024-08-07T20:44:51.517259Z|runtime|1277899b-0776-4297-abd7-4c6afc543c9a]    Error (exit code 1). This could be a transient error. Use @retry to retry.
[MFLOG|0|2024-08-07T20:44:51.610089Z|runtime|f221e4fc-d5e8-4a4d-b8ea-f3b14198c265]
[MFLOG|0|2024-08-07T20:44:52.032605Z|runtime|e0b2b8cb-ba95-4477-b681-4e39d1b580e3]Task failed.
```
s
can you describe the pod? see what the status of the pod is showing? there is a k8s error with exit code 1..
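e.g. (pod name is whatever the run created):
```bash
kubectl get pods -n default
# look at the container's exit reason and the Events section at the bottom
kubectl describe pod <pod-name> -n default
```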
i
Seems to maybe go back to permissions:
This kind of error sometimes comes from a wrong or missing region. I am setting it in my terminal, but I'm not sure where it gets specified for things running within k8s
a
@incalculable-xylophone-18959 which region are you in?
i
us-gov-west-1
FWIW, I was able to move past the 400/Bad Request by specifying
METAFLOW_S3_ENDPOINT_URL=https://s3.us-gov-west-1.amazonaws.com
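I'm exporting it in the shell for now; I assume the same key could also live in the config file, i.e. something like this added to config_barbenheimer.json (haven't verified that path yet):
```json
{
    "METAFLOW_S3_ENDPOINT_URL": "https://s3.us-gov-west-1.amazonaws.com"
}
```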
I'm slowly working through this... the things I've had to tweak so far:
• The env var above
• KMS perms didn't allow the EC2 instance role to decrypt
• The EC2 role didn't have S3 access
I've gotten this far, but it seems still more permissions issues block me:
• EKS NodeGroup IAM Role needs kms:GenerateDataKey
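A sketch of the kind of statement the node group role seems to need (the bucket matches mine, but the account ID and key ARN are placeholders):
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MetaflowDatastoreS3",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws-us-gov:s3:::edflows3tesas58h",
                "arn:aws-us-gov:s3:::edflows3tesas58h/*"
            ]
        },
        {
            "Sid": "MetaflowDatastoreKMS",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws-us-gov:kms:us-gov-west-1:111122223333:key/<key-id>"
        }
    ]
}
```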
Success!
• I also modified some of the ARNs in the terraform to use the correct AWS partition for govcloud
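i.e. anywhere an ARN is assembled, something along these lines instead of a hard-coded arn:aws: prefix (resource names here are placeholders, not the actual ones in the terraform):
```hcl
# resolves to "aws", "aws-us-gov", or "aws-cn" depending on where you deploy
data "aws_partition" "current" {}

resource "aws_iam_role_policy_attachment" "node_s3" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
```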
a
amazing!
if you are open to documenting your deployment in aws gov cloud and publishing it, we would be quite happy to assist!
we made a similar effort for the AWS Batch / Step Functions deployment a while back, but haven't gotten to gov cloud - primarily due to the overhead of getting access to a gov account
i
Yes, I have a fork of the repo in github, and then I have a fork of that in our local gitlab... but once I've cleaned it all up, I'll try to publish the changes back.
Is there a summary anywhere of what the various roles are and their purpose? My rough idea as a newbie is:
• There is the role that you use to run metaflow... seemingly this needs permissions to S3 and KMS so that it can use the metadataservice
• There is the kubernetes cluster role... needs lots of stuff
• There is the EC2 NodeGroup role... needs S3 and KMS and probably more
• There is the KMS key policy, which needs to name 2+ of the roles above but currently only names 1
The other changes are:
• Support partitions other than `aws` everywhere an ARN is manually created in TF
• There seems to be a bug in how the S3 endpoint URL is calculated and it doesn't work for govcloud... specifying it as an env var is a workaround
General recommendations:
• Document all of the env vars and config possibilities
• Am I right to understand that nowhere in the metaflow config do you specify the k8s cluster? That is strictly done by env var? (my current setup is sketched below)
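(For context, on my side the cluster only seems to get picked up from the kubeconfig context, which I set up with something like the below; the cluster name is a placeholder.)
```bash
# write the EKS cluster into ~/.kube/config; the @kubernetes decorator then uses
# the active kubectl context rather than anything in the metaflow config
aws eks update-kubeconfig --region us-gov-west-1 --name metaflow-eks-cluster
kubectl config current-context
```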