few-salesmen-35936
03/07/2025, 1:55 PM
argo list -n argo
shows the triggered workflows, but STATUS is PENDING for all of them.
Around the same time this started, our GKE cluster began showing the recommendation "Grant critical permissions to Node service account to allow for non-degraded operations",
which comes from this update: https://cloud.google.com/kubernetes-engine/docs/release-notes-new-features#February_28_2025
As instructed, we added roles/container.defaultNodeServiceAccount
to the service account, but it didn't help.
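For context, we granted the role roughly like this (a sketch; PROJECT_ID and NODE_SA_EMAIL are placeholders for our actual project and node service account):
# Grant the recommended role to the node service account
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:NODE_SA_EMAIL" \
  --role="roles/container.defaultNodeServiceAccount"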
Do you know the root cause of this issue and how to fix this?
Thanks!
hundreds-zebra-57629
03/07/2025, 7:09 PM
Does this mean there are corresponding task pods that are also pending? Also, did you recently upgrade your GKE cluster? (btw, GCP may also automatically upgrade clusters unless you explicitly disable auto upgrades.) Finally, can you share the versions of metaflow-service, argo-workflows, GKE control plane, and worker nodes you are running?
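For anyone following along, these details can be gathered roughly as follows (a sketch; it assumes Argo runs in the argo namespace with the default workflow-controller deployment name):
# Are there pending task pods alongside the pending workflows?
kubectl get pods -n argo
# Client and control-plane (server) versions
kubectl version
# Worker node kubelet versions
kubectl get nodes -o wide
# Argo Workflows controller image/version
kubectl -n argo get deployment workflow-controller -o jsonpath='{.spec.template.spec.containers[0].image}'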
few-salesmen-35936
03/10/2025, 10:08 AM
# Create ClusterRole to allow Argo to list namespaces
resource "kubernetes_cluster_role" "argo_namespace_reader" {
  metadata {
    name = "argo-namespace-reader"
  }
  rule {
    api_groups = [""]
    resources  = ["namespaces"]
    verbs      = ["get", "list", "watch"]
  }
}

# Bind ClusterRole to Argo's ServiceAccount
resource "kubernetes_cluster_role_binding" "argo_namespace_access" {
  metadata {
    name = "argo-namespace-access"
  }
  subject {
    kind      = "ServiceAccount"
    name      = "argo"
    namespace = "argo"
  }
  role_ref {
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.argo_namespace_reader.metadata[0].name
    api_group = "rbac.authorization.k8s.io"
  }
}
Adding the above to terraform/services/argo.tf solved the issue.
I still don't understand how this happened, but we figured out that the Argo workflow controller's service account had lost namespace access.
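One quick way to confirm the fix (a sketch, using the argo service account and namespace from the Terraform above):
# Should print "yes" once the ClusterRole and binding are applied
kubectl auth can-i list namespaces --as=system:serviceaccount:argo:argo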
Just for the record, here are the answers to your questions.
Does this mean there are corresponding task pods that are also pending?
Our individual Metaflow and Argo service pods were running, but we couldn't trigger any DAG, so no DAG pods were spawned.
Also, did you recently upgrade your GKE cluster?
Not manually, but we didn't disable auto upgrade, so there must have been regular updates.
Can you share the versions of metaflow-service, argo-workflows, GKE control-plane and worker nodes you are running?
metaflow: 2.11.3, argo-workflow: v3.4.10, control plane: 1.30.9-gke.1127000, nodes: 1.30.9-gke.1127000
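For reference, those GKE versions can be pulled with something like the following (CLUSTER_NAME and REGION are placeholders):
# Control-plane and node-pool versions for the cluster
gcloud container clusters describe CLUSTER_NAME --region REGION \
  --format="value(currentMasterVersion,currentNodeVersion)"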