# ask-metaflow
b
I have another question. When I try to run a flow on Kubernetes (GKE on GCP), I get this error:
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] kubernetes.launch_job(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes.py", line 154, in launch_job
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] self._job = self.create_jobset(**kwargs).execute()
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes_jobsets.py", line 927, in execute
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] raise KubernetesJobsetException(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] metaflow.plugins.kubernetes.kubernetes_jobsets.KubernetesJobsetException: Exception when calling CustomObjectsApi->create_namespaced_custom_object: (404)
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] Reason: Not Found
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response headers: HTTPHeaderDict({'Audit-Id': '0b7f7b27-fc2c-4bfb-a8c5-0275a7b25bf4', 'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'f7a37b4e-83b2-4bf9-b706-8c2f2c5abe7d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd8feb481-0d86-493f-a05e-072dbd75ac82', 'Date': 'Wed, 26 Mar 2025 09:53:24 GMT', 'Content-Length': '19'})
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response body: 404 page not found
Looking at the Kubernetes container log, there is this error:
psycopg2.errors.UndefinedTable: relation "public.flows_v3" does not exist
s
@bulky-afternoon-92433 can you assist here?
t
@bumpy-orange-22261 can you give some details on how the services are deployed in your setup? Did you use one of the available CloudFormation / Terraform templates? The error leads me to believe that database migrations have failed to apply; these are usually run as a script during metadata service startup. Checking the initial log output of the service might give a clue about what is going wrong.
b
@bulky-afternoon-92433, yes, I'm using the Terraform template from metaflow-tools. I think the database error is unrelated to the flow error, so we can ignore it. What I do see is that it fails to create the Kubernetes job.
I have an update. The error above comes from running the metaflow_ray example: https://github.com/outerbounds/metaflow-ray/blob/main/examples/tune_pytorch/flow_gpu.py. If I run a normal flow without Ray on k8s, it succeeds:
python metaflow-tutorials/00-helloworld/helloworld.py run --with kubernetes:cpu=4,memory=10000,namespace=default,image=python:latest
t
Oh ok, I might have a clue about what is going on here. I tried out the example myself and ran into a timeout with the GPU example. One cause can be that there are no GPU nodes available in the cluster, in which case the pod for the tune task ends up being unschedulable and eventually times out. Another cause can be that there are no permissions to create JobSets on Kubernetes, which the Ray decorator relies on for parallel jobs. You can verify this with a simple @parallel flow (the Ray decorator builds upon @parallel); a sketch of such a flow follows below.
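A minimal sketch of such a @parallel test flow could look like the following (step names and resource values here are illustrative, not necessarily the exact flow I used):

from metaflow import FlowSpec, current, kubernetes, parallel, step


class ParallelTestFlow(FlowSpec):

    @step
    def start(self):
        # num_parallel launches a control pod plus workers as a JobSet
        self.next(self.process, num_parallel=2)

    @kubernetes(cpu=1, memory=4000)
    @parallel
    @step
    def process(self):
        # Each pod reports its index; creating the JobSet is the part that
        # needs the CustomObjectsApi permissions from the 404 above
        print(f"node {current.parallel.node_index} of {current.parallel.num_nodes}")
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    ParallelTestFlow()

If the JobSet CRD is missing or the service account lacks permission to create JobSets, this flow should fail with the same 404 as above.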
b
@bulky-afternoon-92433, yes, I'm also facing another problem with GPU nodes, even though we have GPU quota. Do I need any extra configuration in Terraform to enable GPU nodes?
t
I believe the Terraform templates are meant to be a starting point and a minimum viable stack. The autoscaling setup for the GKE cluster doesn't seem to define any GPU resources, so you will have to edit the template to accommodate those. If it's for testing purposes, simply adding one GPU node to the cluster should be enough as well.
b
@bulky-afternoon-92433 Now I can see that the job successfully requests the GPU resource, but PyTorch still shows it is training on CPU. Do you have any idea about this? This is the check in the flow:
import torch

if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print('Training on GPU.')
else:
    device = torch.device('cpu')
    print('Training on CPU.')
b
What platform are you deploying the flow from, and what platform is the flow executing on? There is a known issue with cross-platform deployments regarding PyTorch: some of its dependencies are platform-specific (most importantly, the bundled CUDA libraries are only included for linux-64), and these fail to be included when doing a cross-platform resolve for the environment. An example of this is resolving on a Mac and running with Kubernetes on a linux-64 platform.
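A quick way to check which torch build actually ended up in the remote environment is to print its build metadata from inside the step, e.g. with a rough diagnostic like this:

import torch

# torch.version.cuda is None for CPU-only wheels; for CUDA builds it reports
# the CUDA toolkit version the wheel was compiled against.
print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)

# is_available() additionally needs a visible GPU and working CUDA libraries
# at runtime, so it can be False even for a CUDA-enabled build.
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))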
b
The test runs in a k8s cluster on GCP. I also tried creating another pod on the same k8s node and calling nvidia-smi, and I can see the GPU information.
I deploy the flow from a Mac (ARM).
@bulky-afternoon-92433, I have solved most of the issues. Now I get another issue when trying to run your example above (ParallelTestFlow()): the flow gets stuck in the process step.
2025-03-31 16:53:29.706 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Setting up task environment.
2025-03-31 16:53:41.107 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Downloading code package...
2025-03-31 16:53:42.576 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Code package downloaded.
2025-03-31 16:53:42.648 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Task is starting.
2025-03-31 16:53:46.320 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Processing...
I also see this error in the k8s cluster:
Failed to get host by name for MF_MASTER_ADDR after waiting for 600 seconds.
t
Sorry for the delay, was battling with some hardware issues over the weekend and recovering data šŸ˜“ I'll look at this later today, but from the error it would seem to be some connectivity/permission issue within the Kubernetes cluster.
b
@bulky-afternoon-92433 do you have any news? Could you guide me through a few steps to debug this issue?
t
Parallel: ah, yes. For the parallel issue, you could check what the MF_MASTER_ADDR env variable is set to on the pod. It is most likely the Kubernetes-internal name of the control node that is used to coordinate the parallel execution over the worker pods. If it is unreachable, that would lead me to believe that pod-to-pod communication is not working in the cluster (see the sketch below for a quick check).
PyTorch: as for torch being stuck on CPU, there could be two reasons:
• The torch version being installed is a CPU-only one. This would be controlled by the PyPI indices that you may have configured in your local pip config; run pip config list to check. As I recall, there are CPU-only and GPU-only pip indices available for PyTorch.
• There is also a known issue with cross-platform resolving for torch: torch ships with CUDA packages included, but these dependencies are limited to linux-64. The way pip resolves means we are not able to include platform-specific transitive dependencies when resolving cross-platform (e.g. macOS arm64 -> linux x86_64). This case usually ends up with the flow hitting an error about missing dependencies, though, so if the flow succeeds and torch simply uses the CPU, this is unlikely to be the cause.
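For the MF_MASTER_ADDR check, a rough sketch you could run inside one of the worker pods (plain standard-library Python; MF_MASTER_ADDR is the variable from the error above):

import os
import socket

# MF_MASTER_ADDR likely points at the control pod coordinating the parallel step
addr = os.environ.get("MF_MASTER_ADDR")
print("MF_MASTER_ADDR =", addr)

if addr:
    try:
        # If pod-to-pod networking and cluster DNS work, this resolves
        # to the control pod's IP.
        print("resolves to:", socket.gethostbyname(addr))
    except socket.gaierror as exc:
        print("DNS resolution failed:", exc)

If the lookup fails here too, the issue is most likely cluster networking or DNS rather than anything Metaflow-specific.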