# ask-metaflow
i
Struggling to get a new Metaflow cloud deployment working. I was getting an error about S3 PutObject that seemed to be due to the region not being set. I set that in my Metaflow config file, and am now getting the output below. I've tried both using AWS_PROFILE and using AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN in the config, without luck.
s
hi edward.. could you share a snippet of the code you have (removing sensitive info)?
i
It is the 05-hello-cloud tutorial
I guess a more direct question is: Where are the credentials used for the S3 metadata service supposed to be specified? I see some AWS credentials in my metaflowconfig, but they seem to be used for authenticating to kubernetes, not s3.
s
could you please share the metaflow config you have ?
i
```
(venv) esmith@oppie:/opt/barbenheimer/metaflow/metaflow/tutorials$ cat ~/.metaflowconfig/config_barbenheimer.json 
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "<s3://edflows3tesas58h/metaflow>",
    "METAFLOW_DATATOOLS_S3ROOT": "<s3://edflows3tesas58h/data>",
    "METAFLOW_ARGO_EVENTS_EVENT_BUS": "default",
    "METAFLOW_ARGO_EVENTS_INTERNAL_WEBHOOK_URL": "<http://argo-events-webhook-eventsource-svc.default:12000/metaflow-event>",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_KUBERNETES_NAMESPACE": "default",
    "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "default",
    "METAFLOW_SERVICE_INTERNAL_URL": "<http://edflownlbtesas58h-1936e4ca1afd9089.elb.us-gov-west-1.amazonaws.com/>",
    "METAFLOW_SERVICE_URL": "<https://az7mgjx6q4.execute-api.us-gov-west-1.amazonaws.com/api/>",
    "METAFLOW_SERVICE_AUTH_KEY": "xxxxxxxxxxxxxxxxxxxxxxxxxxx"

}
```
It seems to me that I have to set the AWS_PROFILE as well as AWS_DEFAULT_REGION when calling a metaflow script
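i.e. right now every invocation looks roughly like this (the profile name and flow filename are approximate, not the exact ones):
```bash
# region and profile passed explicitly for every run
AWS_PROFILE=barbenheimer AWS_DEFAULT_REGION=us-gov-west-1 \
  python hello-cloud.py run
```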
I would think it would be possible to put these items in the config somewhere? The AWS_PROFILE is there in the command to authenticate to EKS, but it isn't used for accessing the metadataservice. I was able to move the AWS_DEFAULT_REGION setting to ~/.aws/config to reduce my command line as shown below. However, it seems to just sit at this step:
s
it could indicate that the nodes on your k8s cluster are completely scaled down. It takes ~2-3 mins for a node to come up.
if you have access to your k8s cluster - could you check the status of the nodes?
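e.g.:
```bash
# freshly provisioned nodes show up as NotReady and flip to Ready after a couple of minutes
kubectl get nodes -o wide
```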
i
Ok, that is definitely what I saw on the AWS cluster, although it seems to have a long runtime. I'll check the local cluster pods and see if that was the issue / maybe just let it run longer
On the kubernetes side (local kubernetes cluster) you were right @broad-branch-21430 that it just needed more time. However, it has also now been trying to run the hello world for 5 minutes without success:
s
is the pod actually running on the k8s cluster?
just to rule things out - are there any taints on the nodes ( or groups ) that might prevent the pod from getting scheduled?
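something like:
```bash
# a taint without a matching toleration on the pod will keep it stuck in Pending
kubectl describe nodes | grep -i -A 2 taints
```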
i
Ah, you've spotted it
Although, I'm not sure what the workaround here is... I added more nodes, but it still can't succeed. Removing the @kubernetes decorator does allow it to run normally:
Do I have to indicate the resources of my nodes in order for the @kubernetes decorator to recognize them?
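i.e. should the step be requesting explicit resources so the scheduler knows it fits on an r5.large (2 vCPU / 16 GiB)? Something like this sketch (not the actual tutorial code):
```python
from metaflow import FlowSpec, kubernetes, step

class HelloCloudFlow(FlowSpec):
    # explicit requests (memory is in MB), sized to fit comfortably on an r5.large
    @kubernetes(cpu=1, memory=4096)
    @step
    def start(self):
        print("hello from the cluster")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    HelloCloudFlow()
```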
s
what instance types do you have in your node group
i
r5.large, I believe I've just kept the defaults from the terraform
s
gotcha.. was the pod able to schedule? or is it still saying unschedulable?
i
Let me give it a fresh try and see what it does without changes
I had turned off the retry, but here it is running with it again... in S3 the stderr files look like this:
```
[MFLOG|0|2024-08-07T20:41:18.785223Z|runtime|30e47531-edca-4e55-86e2-08fe2059d929]Sleeping 2 minutes before the next retry
[MFLOG|0|2024-08-07T20:44:51.516999Z|runtime|325f524a-6e4b-4263-8566-9d8fcee1f500]    Kubernetes error:
[MFLOG|0|2024-08-07T20:44:51.517259Z|runtime|1277899b-0776-4297-abd7-4c6afc543c9a]    Error (exit code 1). This could be a transient error. Use @retry to retry.
[MFLOG|0|2024-08-07T20:44:51.610089Z|runtime|f221e4fc-d5e8-4a4d-b8ea-f3b14198c265]
[MFLOG|0|2024-08-07T20:44:52.032605Z|runtime|e0b2b8cb-ba95-4477-b681-4e39d1b580e3]Task failed.
```
s
can you describe the pod? see what the status of the pod is showing? there is a k8s error with exit code 1..
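e.g. (pod name is whatever the run created):
```bash
kubectl get pods -n default
# look at the container's exit reason and the Events section at the bottom
kubectl describe pod <pod-name> -n default
```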
i
Seems to maybe go back to permissions:
This kind of error sometimes comes from a wrong or missing region. I am setting it in my terminal, but I'm not sure where it gets specified for things running within k8s
a
@incalculable-xylophone-18959 which region are you in?
i
us-gov-west-1
FWIW, I was able to move past the 400/Bad Request by specifying
METAFLOW_S3_ENDPOINT_URL=https://s3.us-gov-west-1.amazonaws.com
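I'm exporting it in the shell for now; I assume the same key could also live in the config file, i.e. something like this added to config_barbenheimer.json (haven't verified that path yet):
```json
{
    "METAFLOW_S3_ENDPOINT_URL": "https://s3.us-gov-west-1.amazonaws.com"
}
```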
I'm slowly working through this... the things I've had to tweak so far:
• The env var above
• KMS perms didn't allow the EC2 instance role to decrypt
• The EC2 role didn't have S3 access
I've gotten this far, but it seems still more permissions issues block me:
• EKS NodeGroup IAM Role needs kms:GenerateDataKey
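A sketch of the kind of statement the node group role seems to need (the bucket matches mine, but the account ID and key ARN are placeholders):
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MetaflowDatastoreS3",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws-us-gov:s3:::edflows3tesas58h",
                "arn:aws-us-gov:s3:::edflows3tesas58h/*"
            ]
        },
        {
            "Sid": "MetaflowDatastoreKMS",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws-us-gov:kms:us-gov-west-1:111122223333:key/<key-id>"
        }
    ]
}
```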
Success!
• I also modified some of the ARNs in the terraform to use the correct AWS partition for govcloud
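i.e. anywhere an ARN is assembled, something along these lines instead of a hard-coded arn:aws: prefix (resource names here are placeholders, not the actual ones in the terraform):
```hcl
# resolves to "aws", "aws-us-gov", or "aws-cn" depending on where you deploy
data "aws_partition" "current" {}

resource "aws_iam_role_policy_attachment" "node_s3" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
```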
a
amazing!
if you are open to documenting your deployment in aws gov cloud and publishing it, we would be quite happy to assist!
we made a similar effort for the AWS Batch / Step Functions deployment a while back, but haven't gotten to gov cloud - primarily due to the overhead of getting access to a gov account
i
Yes, I have a fork of the repo in github, and then I have a fork of that in our local gitlab... but once I've cleaned it all up, I'll try to publish the changes back.
Is there a summary anywhere of what the various roles are and their purpose? My rough idea as a newbie is:
• There is the role that you use to run metaflow... seemingly this needs permissions to S3 and KMS so that it can use the metadataservice
• There is the kubernetes cluster role... needs lots of stuff
• There is the EC2 NodeGroup role... needs S3 and KMS and probably more
• There is the KMS key policy, which needs to name 2+ of the roles above but currently only names 1
The other changes are:
• Support partitions other than `aws` everywhere an ARN is manually created in TF
• There seems to be a bug in how the S3 endpoint URL is calculated and it doesn't work for govcloud... specifying it as an env var is a workaround
General recommendations:
• Document all of the env vars and config possibilities
• Am I right to understand that nowhere in the metaflow config do you specify the k8s cluster? That is strictly done by env var? (my current setup is sketched below)
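(For context, on my side the cluster only seems to get picked up from the kubeconfig context, which I set up with something like the below; the cluster name is a placeholder.)
```bash
# write the EKS cluster into ~/.kube/config; the @kubernetes decorator then uses
# the active kubectl context rather than anything in the metaflow config
aws eks update-kubeconfig --region us-gov-west-1 --name metaflow-eks-cluster
kubectl config current-context
```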