# ask-metaflow
few-salesmen-35936
Hi team, we have deployed Metaflow on GKE and have been operating it for over a year without issue, but since this weekend we suddenly can no longer start any workflow. Every run ends up "pending". https://localhost:2746/ shows individual workflows but displays "Nothing to show", and https://localhost:3000 doesn't even show any new workflows.
argo list -n argo
shows the triggered workflows, but STATUS is PENDING for all of them. Around the same time this started, we noticed our GKE cluster began showing
Grant critical permissions to Node service account to allow for non-degraded operations
coming from this update: https://cloud.google.com/kubernetes-engine/docs/release-notes-new-features#February_28_2025. As instructed, we added
roles/container.defaultNodeServiceAccount
to the node service account, but it didn't help. Do you know the root cause of this issue and how to fix it? Thanks!
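For reference, the grant was along these lines (hypothetical project ID and node service account name below; substitute your own):

# Hypothetical names; replace with your project and node service account
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-node-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/container.defaultNodeServiceAccount"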
hundreds-zebra-57629
Hey @few-salesmen-35936
argo list -n argo
shows triggered workflows but STATUS is PENDING for all.
Does this mean there are corresponding task pods that are also pending? Also did you recently upgrade your GKE cluster? (btw GCP may also automatically upgrade the clusters unless you explicitly disable auto upgrades). Finally, can you share the versions of metaflow-service, argo-workflows, GKE control-plane and worker nodes you are running?
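To check, something like this should work (assuming the default argo namespace and the standard workflow-controller deployment name):

# List pods in the argo namespace to see if any task pods exist and are Pending
kubectl get pods -n argo
# Tail the workflow controller logs for errors (assumes the default deployment name)
kubectl logs -n argo deploy/workflow-controller --tail=100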
few-salesmen-35936
Hey @hundreds-zebra-57629, I found the root cause. Somehow our cluster suddenly lost access to the argo namespace. Adding
# Create ClusterRole to allow Argo to list namespaces
resource "kubernetes_cluster_role" "argo_namespace_reader" {
  metadata {
    name = "argo-namespace-reader"
  }

  rule {
    api_groups = [""]
    resources  = ["namespaces"]
    verbs      = ["get", "list", "watch"]
  }
}

# Bind ClusterRole to Argo's ServiceAccount
resource "kubernetes_cluster_role_binding" "argo_namespace_access" {
  metadata {
    name = "argo-namespace-access"
  }

  subject {
    kind      = "ServiceAccount"
    name      = "argo"
    namespace = "argo"
  }

  role_ref {
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.argo_namespace_reader.metadata[0].name
    api_group = "rbac.authorization.k8s.io"
  }
}
to terraform/services/argo.tf solved the issue.
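In case anyone hits the same thing, you can verify the binding took effect with something like this (assuming the ServiceAccount is named argo in the argo namespace, as above):

# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list namespaces --as=system:serviceaccount:argo:argo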
I still don't understand how this happened, but we figured out that the Argo workflow controller's service account had lost namespace access. Just for the record, here are answers to your questions:
Does this mean there are corresponding task pods that are also pending?
Our individual Metaflow and Argo service pods were running, but we couldn't trigger any DAG, so no DAG pods were spawned.
Also did you recently upgrade your GKE cluster?
Not manually, but we didn't disable auto upgrade, so there must have been regular automatic updates.
can you share the versions of metaflow-service, argo-workflows, GKE control-plane and worker nodes you are running?
metaflow: 2.11.3
argo-workflows: v3.4.10
GKE control plane: 1.30.9-gke.1127000
worker nodes: 1.30.9-gke.1127000