# ask-metaflow
a
Hello team! Going down a bit of a rabbit hole, I think, running a Metaflow flow on Kubernetes. I'm getting the following error:
```
jobs.batch is forbidden: User "system:anonymous" cannot create resource "jobs" in API group "batch" in the namespace "xyz"
```
I'm able to connect to the Kubernetes cluster outside of Metaflow from my remote dev box's Linux terminal, and I'm not sure why it's declaring me as `system:anonymous`.
For context, the EKS cluster is in a separate account (account B) from the remote dev box (account A). We use the IAM role attached to the dev box in account A to assume an IAM role in account B. The IAM role in account B is added to the EKS cluster's aws-auth ConfigMap and is mapped to a service account with the appropriate Role and RoleBindings. After assuming the IAM role in account B, we attempt to run the Metaflow flow. The Metaflow config we have looks like this:
```json
{
  "METAFLOW_KUBERNETES_CONTAINER_IMAGE": "metaflow_batch_sample:main-6463f16454-183",
  "METAFLOW_KUBERNETES_CONTAINER_REGISTRY": "dummyaccountnum.dkr.ecr.us-west-2.amazonaws.com",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://dummyaccount-s3-amp/metaflow",
  "METAFLOW_DATATOOLS_S3ROOT": "s3://dummyaccount-s3-amp/data",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DEFAULT_METADATA": "service",
  "METAFLOW_ECS_FARGATE_EXECUTION_ROLE": "arn:aws:iam::dummyaccountnum:role/dummyaccount-ecs-execution-role-amp",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::dummyaccountnum:role/StudioIAMRole",
  "METAFLOW_SERVICE_INTERNAL_URL": "http://123-metadata-nlb-amp-c9275d2461784673.elb.us-west-2.amazonaws.com/",
  "METAFLOW_SERVICE_URL": "https://metaflow-service.xyz.cloudos.test.com",
  "METAFLOW_KUBERNETES_NAMESPACE": "xyz",
  "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "xyz-ksa",
  "METAFLOW_ARGO_EVENTS_EVENT_BUS": "jobs-eventbus",
  "METAFLOW_ARGO_EVENTS_EVENT_SOURCE": "argo-events-webhook",
  "METAFLOW_ARGO_EVENTS_SERVICE_ACCOUNT": "operate-workflow-sa",
  "METAFLOW_ARGO_EVENTS_EVENT": "metaflow-event",
  "METAFLOW_ARGO_EVENTS_WEBHOOK_URL": "https://argo-events-webhook.xyz.dev.test.com",
  "METAFLOW_DEFAULT_SECRETS_BACKEND_TYPE": "aws-secrets-manager",
  "TEAM_COST_ATTRIBUTION_TAG": "tenant-totqpe",
  "METAFLOW_SERVICE_AUTH_KEY": "dummy-auth"
}
```
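For reference, the assume-and-export step described above can be sketched as follows. This is a minimal sketch, not our exact setup: the role ARN, session name, and credential values are hypothetical placeholders, and the real STS call (shown in a comment) would go through boto3.

```python
import os

# Sketch: after assuming the account-B role, export its temporary credentials
# so that kubectl, the kubernetes client, and Metaflow all pick them up.
# The actual call (boto3 assumed available) would look roughly like:
#   sts = boto3.client("sts")
#   resp = sts.assume_role(
#       RoleArn="arn:aws:iam::<account-b>:role/<role>",  # hypothetical ARN
#       RoleSessionName="metaflow-dev",
#   )
#   creds = resp["Credentials"]

def creds_to_env(creds: dict) -> dict:
    """Map an STS Credentials dict to the env vars the AWS CLI/SDKs read."""
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }

# Placeholder values stand in for a real assume_role response:
env = creds_to_env({
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
})
os.environ.update(env)
```

Note that the `aws eks get-token` exec plugin in the kubeconfig also reads these same env vars, so the assumed role only takes effect for Kubernetes auth if the exported credentials are present in the shell that runs the flow.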
h
Hey @acoustic-van-30942, are you able to list namespaces or other resources using `kubectl` in the environment where you are running Metaflow? Do you have the assumed role's credentials in that environment?
a
Hello @hundreds-zebra-57629 - Yes and yes. Confirmed it's using the assumed role.
h
Interesting, the error you are getting likely means the Metaflow client running locally doesn't have proper credentials set to authenticate and make API requests against your EKS cluster. Can you try running this Python code in the same environment as Metaflow:
```python
from kubernetes import client, config

# Load Kubernetes configuration from the environment
config.load_kube_config()

# Create an API client
v1 = client.CoreV1Api()

# List all pods in all namespaces
pods = v1.list_pod_for_all_namespaces(watch=False)

# Print the name and namespace of each pod
for pod in pods.items:
    print(f"Pod Name: {pod.metadata.name}, Namespace: {pod.metadata.namespace}")
```
You don't need to share the output of the code, but let me know if it actually lists pods.
a
Getting this error unfortunately:
```
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '393043a4-cc27-4469-9bf1-f48c6f98c99c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'c42ebf89-2913-4d64-a29a-2e4106cd7dd7', 'X-Kubernetes-Pf-Prioritylevel-Uid': '2dfdc37d-aa44-4431-a64e-bb989ad35b96', 'Date': 'Sat, 25 Jan 2025 00:59:38 GMT', 'Content-Length': '262'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:anonymous\" cannot list resource \"pods\" in API group \"\" in the namespace \"cagepart\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
```
However, when I do
```shell
sagemaker-user@studio$ kubectl get pods --namespace cagepart
```
I get...
```
No resources found in cagepart namespace.
```
Some other examples:
```shell
sagemaker-user@studio$ kubectl get rayclusters
Error from server (Forbidden): rayclusters.ray.io is forbidden: User "cagepart-kuberay-developer" cannot list resource "rayclusters" in API group "ray.io" in the namespace "default"
sagemaker-user@studio$ kubectl get rayclusters --namespace cagepart
No resources found in cagepart namespace.
```
h
I believe this means there is something specific about how your kubeconfig file is authenticating to the cluster. Can you open your kubeconfig file and look through it? I am guessing it either assumes a role or runs some command to fetch a token...
a
Looks something like this:
```yaml
apiVersion: v1
clusters:
- cluster:
    server: <NETWORK_LOAD_BALANCER_URL>
    insecure-skip-tls-verify: true
  name: <EKS_CLUSTER_ARN>
contexts:
- context:
    cluster: <EKS_CLUSTER_ARN>
    user: <EKS_CLUSTER_ARN>
  name: <EKS_CLUSTER_ARN>
current-context: <EKS_CLUSTER_ARN>
kind: Config
preferences: {}
users:
- name: <EKS_CLUSTER_ARN>
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - us-west-2
      - eks
      - get-token
      - --cluster-name
      - <CLUSTER_NAME>
      - --output
      - json
      command: aws
      env: null
      interactiveMode: IfAvailable
      provideClusterInfo: false
```
h
Did you get this config by first running `aws eks update-kubeconfig --name <cluster-name>`?
a
It's first derived from `aws eks update-kubeconfig --name <cluster-name>`, but we have to update it manually because we go through a secure network load balancer instead of the kube API server directly, and we also set `insecure-skip-tls-verify: true`.
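That manual edit can be sketched as a small patch over the parsed kubeconfig. This is a rough sketch, not our actual tooling: the NLB URL and the input config are placeholders, and skipping TLS verification is a workaround for the cert mismatch (the NLB cert doesn't match the EKS API server cert), not a recommendation.

```python
def patch_cluster_server(kubeconfig: dict, nlb_url: str) -> dict:
    """Point every cluster entry at the internal NLB instead of the EKS endpoint."""
    for entry in kubeconfig["clusters"]:
        entry["cluster"]["server"] = nlb_url
        # The EKS CA bundle won't validate the NLB's cert, so drop it and
        # skip verification. (Safer: install the NLB's own CA instead.)
        entry["cluster"].pop("certificate-authority-data", None)
        entry["cluster"]["insecure-skip-tls-verify"] = True
    return kubeconfig

# Placeholder config standing in for the update-kubeconfig output:
cfg = {
    "clusters": [
        {"name": "demo", "cluster": {"server": "https://ABC.eks.amazonaws.com",
                                     "certificate-authority-data": "…"}}
    ]
}
patched = patch_cluster_server(cfg, "https://nlb.internal.example.com")
```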
Oh no, I figured out the issue. The Python `kubernetes` client version needs to match the version of the Kubernetes cluster and kubectl. That took me forever to figure out lol
h
Oh! That is an important discovery. Do you know what version it was on?
a
`29.0.0`, so I downgraded `kubernetes` to `29.0.0` and it finally works
Maybe it doesn't need to be the same version... But `32.0.0` doesn't work, so I downgraded it
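For what it's worth, the coupling can be sanity-checked before running a flow: the Python `kubernetes` client's major version tracks the Kubernetes minor release it targets (e.g. client 29.x targets Kubernetes 1.29). A minimal sketch of such a check, assuming one release of skew is tolerable (the exact supported skew is an assumption worth verifying against the client's compatibility matrix):

```python
def client_matches_server(client_version: str, server_minor: str) -> bool:
    """Rough compatibility check: compare the python client's major version
    against the cluster's minor version (EKS reports minors like '29+')."""
    client_major = int(client_version.split(".")[0])
    server = int(server_minor.rstrip("+"))
    return abs(client_major - server) <= 1  # assumed tolerable skew

print(client_matches_server("29.0.0", "29"))   # True
print(client_matches_server("32.0.0", "29+"))  # False
```

In practice the inputs would come from `kubernetes.__version__` and the `minor` field of the server's version endpoint.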
Now it's having trouble scheduling the pod. Perhaps I need to set tolerations? A node selector?
```
2025-01-27 22:28:35.033 [170/start/2669 (pid 2970)] Task finished successfully.
2025-01-27 22:28:35.499 [170/process/2670 (pid 2990)] Task is starting.
2025-01-27 22:28:38.049 [170/process/2670 (pid 2990)] [job t-5fdbf935-2vfgg] Task is starting (Job status is unknown)...
2025-01-27 22:33:32.342 1 task is running: process (1 running; 0 done).
2025-01-27 22:33:32.342 No tasks are waiting in the queue.
2025-01-27 22:33:32.342 end step has not started
```
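On the scheduling question: if the target node group carries taints or requires a label match, the job's pod spec needs matching tolerations and/or a nodeSelector. Metaflow's `@kubernetes` decorator exposes `node_selector`, and tolerations can be supplied via Metaflow configuration (e.g. `METAFLOW_KUBERNETES_TOLERATIONS`) in recent versions, though the exact knobs are worth checking against your Metaflow release. A sketch of the fields involved and the matching rule; the taint key and node label are hypothetical examples:

```python
# Sketch of the scheduling fields a Pending pod may be missing.
# The taint key/value and node label below are hypothetical.
pod_scheduling = {
    "nodeSelector": {"workload": "metaflow"},
    "tolerations": [
        {"key": "dedicated", "operator": "Equal",
         "value": "metaflow", "effect": "NoSchedule"},
    ],
}

def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified version of the Kubernetes taint-matching rule."""
    if toleration.get("operator") == "Exists":
        # Exists matches any value; an omitted key matches every taint.
        return toleration.get("key") in (None, taint["key"])
    return (
        toleration["key"] == taint["key"]
        and toleration["value"] == taint["value"]
        and toleration.get("effect") in (None, taint["effect"])
    )

# A node tainted dedicated=metaflow:NoSchedule accepts this pod:
taint = {"key": "dedicated", "value": "metaflow", "effect": "NoSchedule"}
print(tolerates(pod_scheduling["tolerations"][0], taint))  # True
```

If the pod stays Pending, `kubectl describe pod <pod-name>` usually names the unsatisfied taint or selector in its Events section.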
j
Hello, I can confirm that it stops working with kubernetes `32.0.0`. The last working version is `31.0.0`. This also seems like a related issue: https://github.com/kubernetes-client/python/issues/2334
c
@acoustic-van-30942 You just saved me hours of debugging I'm sure. Thank you
a
Haha! Happy to help alleviate the pain of debugging @cold-airport-8333