# ask-metaflow
few-salesmen-35936
Hi team, we have deployed Metaflow on GKE and have been operating it for over a year without issue, but since this weekend we suddenly can no longer start any workflow. Every run ends up "pending". https://localhost:2746/ shows individual workflows but displays "Nothing to show", and https://localhost:3000 doesn't even show any new workflows.
argo list -n argo
shows the triggered workflows, but STATUS is PENDING for all of them. Around the same time this started, we noticed our GKE cluster began showing
Grant critical permissions to Node service account to allow for non-degraded operations
coming from this update: https://cloud.google.com/kubernetes-engine/docs/release-notes-new-features#February_28_2025. As instructed, we added
roles/container.defaultNodeServiceAccount
to the node service account, but it didn't help. Do you know the root cause of this issue and how to fix it? Thanks!
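For reference, the grant was along these lines (hypothetical project ID and node service account name below; substitute your own):

# Hypothetical names; replace with your project and node service account
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-node-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/container.defaultNodeServiceAccount"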
hundreds-zebra-57629
Hey @few-salesmen-35936
argo list -n argo
shows triggered workflows but STATUS is PENDING for all.
Does this mean there are corresponding task pods that are also pending? Also did you recently upgrade your GKE cluster? (btw GCP may also automatically upgrade the clusters unless you explicitly disable auto upgrades). Finally, can you share the versions of metaflow-service, argo-workflows, GKE control-plane and worker nodes you are running?
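To check, something like this should work (assuming the default argo namespace and the standard workflow-controller deployment name):

# List pods in the argo namespace to see if any task pods exist and are Pending
kubectl get pods -n argo
# Tail the workflow controller logs for errors (assumes the default deployment name)
kubectl logs -n argo deploy/workflow-controller --tail=100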
few-salesmen-35936
Hey @hundreds-zebra-57629, I found the root cause. Somehow our cluster suddenly lost access to the argo namespace. Adding
# Create ClusterRole to allow Argo to list namespaces
resource "kubernetes_cluster_role" "argo_namespace_reader" {
  metadata {
    name = "argo-namespace-reader"
  }

  rule {
    api_groups = [""]
    resources  = ["namespaces"]
    verbs      = ["get", "list", "watch"]
  }
}

# Bind ClusterRole to Argo's ServiceAccount
resource "kubernetes_cluster_role_binding" "argo_namespace_access" {
  metadata {
    name = "argo-namespace-access"
  }

  subject {
    kind      = "ServiceAccount"
    name      = "argo"
    namespace = "argo"
  }

  role_ref {
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.argo_namespace_reader.metadata[0].name
    api_group = "rbac.authorization.k8s.io"
  }
}
to terraform/services/argo.tf solved the issue.
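In case anyone hits the same thing, you can verify the binding took effect with something like this (assuming the ServiceAccount is named argo in the argo namespace, as above):

# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list namespaces --as=system:serviceaccount:argo:argo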
I still don't understand how this happened, but we figured out that the Argo workflow controller's service account had lost namespace access. Just for the record, here are answers to your questions:
Does this mean there are corresponding task pods that are also pending?
Our individual Metaflow and Argo service pods were running, but we couldn't trigger any DAG, so no DAG pods were spawned.
Also did you recently upgrade your GKE cluster?
Not manually, but we didn't disable auto upgrade, so there must have been regular automatic updates.
can you share the versions of metaflow-service, argo-workflows, GKE control-plane and worker nodes you are running?
metaflow: 2.11.3
argo-workflows: v3.4.10
GKE control plane: 1.30.9-gke.1127000
worker nodes: 1.30.9-gke.1127000