# ask-metaflow
b
I have another question. When I try to run a flow on Kubernetes (GKE on GCP), I get this error:
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] kubernetes.launch_job(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes.py", line 154, in launch_job
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] self._job = self.create_jobset(**kwargs).execute()
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes_jobsets.py", line 927, in execute
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] raise KubernetesJobsetException(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] metaflow.plugins.kubernetes.kubernetes_jobsets.KubernetesJobsetException: Exception when calling CustomObjectsApi->create_namespaced_custom_object: (404)
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] Reason: Not Found
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response headers: HTTPHeaderDict({'Audit-Id': '0b7f7b27-fc2c-4bfb-a8c5-0275a7b25bf4', 'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'f7a37b4e-83b2-4bf9-b706-8c2f2c5abe7d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd8feb481-0d86-493f-a05e-072dbd75ac82', 'Date': 'Wed, 26 Mar 2025 09:53:24 GMT', 'Content-Length': '19'})
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response body: 404 page not found
Looking at the Kubernetes container log, there is this error:
psycopg2.errors.UndefinedTable: relation "public.flows_v3" does not exist
s
@bulky-afternoon-92433 can you assist here?
t
@bumpy-orange-22261 can you give some details on how the services are deployed in your setup? Did you use one of the available CloudFormation / Terraform templates? The error leads me to believe that database migrations have failed to apply; these are usually run as a script during metadata service startup. Checking the initial log output of the service might give a clue about what is going wrong.
b
@bulky-afternoon-92433, yes, I'm using the Terraform template from metaflow-tools. I think the database error is unrelated to the flow error, so we can ignore it. What I do see is that it fails to create the Kubernetes job.
I have an update. The error above comes from running the metaflow_ray example: https://github.com/outerbounds/metaflow-ray/blob/main/examples/tune_pytorch/flow_gpu.py. If I run a normal flow without Ray on k8s, it succeeds:
python metaflow-tutorials/00-helloworld/helloworld.py run --with kubernetes:cpu=4,memory=10000,namespace=default,image=python:latest
t
Oh ok, I might have a clue about what is going on here. I tried out the example myself and ran into a timeout with the GPU example. One cause can be that there are no GPU nodes available in the cluster, in which case the pod for the tune task ends up being unschedulable and eventually times out. Another cause can be that there are no permissions to create JobSets on Kubernetes, which the Ray decorator relies on for parallel jobs. You can verify this with a simple @parallel flow (the Ray decorator builds upon @parallel); a sketch of such a flow follows below.
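A minimal sketch of such a @parallel test flow could look like the following (step names and resource values here are illustrative, not necessarily the exact flow I used):

from metaflow import FlowSpec, current, kubernetes, parallel, step


class ParallelTestFlow(FlowSpec):

    @step
    def start(self):
        # num_parallel launches a control pod plus workers as a JobSet
        self.next(self.process, num_parallel=2)

    @kubernetes(cpu=1, memory=4000)
    @parallel
    @step
    def process(self):
        # Each pod reports its index; creating the JobSet is the part that
        # needs the CustomObjectsApi permissions from the 404 above
        print(f"node {current.parallel.node_index} of {current.parallel.num_nodes}")
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    ParallelTestFlow()

If the JobSet CRD is missing or the service account lacks permission to create JobSets, this flow should fail with the same 404 as above.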
b
@bulky-afternoon-92433, yes, I'm also facing another problem with GPU nodes, even though we have GPU quota. Do I need any extra configuration in Terraform to enable GPU nodes?
t
I believe the Terraform templates are meant to be a starting point and a minimum viable stack. The autoscaling setup for the GKE cluster doesn't seem to define any GPU resources, so you will have to edit the template to accommodate those. If it's for testing purposes, simply adding one GPU node to the cluster should be enough as well.
b
@bulky-afternoon-92433 Now I can see that the job successfully requests the GPU resource, but PyTorch still shows it is training on CPU. Do you have any idea about this? This is the check in the flow:
import torch

if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print('Training on GPU.')
else:
    device = torch.device('cpu')
    print('Training on CPU.')
b
What platform are you deploying the flow from, and what platform is the flow executing on? There is a known issue with cross-platform deployments regarding PyTorch: some of its dependencies are platform-specific (most importantly, the bundled CUDA libraries are only included for linux-64), and these fail to be included when doing a cross-platform resolve for the environment. An example of this is resolving on a Mac and running with Kubernetes on a linux-64 platform.
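A quick way to check which torch build actually ended up in the remote environment is to print its build metadata from inside the step, e.g. with a rough diagnostic like this:

import torch

# torch.version.cuda is None for CPU-only wheels; for CUDA builds it reports
# the CUDA toolkit version the wheel was compiled against.
print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)

# is_available() additionally needs a visible GPU and working CUDA libraries
# at runtime, so it can be False even for a CUDA-enabled build.
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))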
b
The test runs in a k8s cluster on GCP. I also tried creating another pod on the same k8s node and calling nvidia-smi, and I can see the GPU information.
I deploy the flow from a Mac (ARM).
@bulky-afternoon-92433, I have solved most of the issues. Now I get another issue when trying to run your example above (ParallelTestFlow()): the flow gets stuck in the process step.
2025-03-31 16:53:29.706 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Setting up task environment.
2025-03-31 16:53:41.107 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Downloading code package...
2025-03-31 16:53:42.576 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Code package downloaded.
2025-03-31 16:53:42.648 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Task is starting.
2025-03-31 16:53:46.320 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Processing...
I also see this error in the k8s cluster:
Failed to get host by name for MF_MASTER_ADDR after waiting for 600 seconds.
t
Sorry for the delay, was battling with some hardware issues over the weekend and recovering data šŸ˜“ I'll look at this later today, but from the error it would seem to be some connectivity/permission issue within the Kubernetes cluster.
b
@bulky-afternoon-92433 do you have any news? Could you guide me through a few steps to debug this issue?
t
Parallel: ah, yes. For the parallel issue, you could check what the MF_MASTER_ADDR env variable is set to on the pod. It is most likely the Kubernetes-internal name of the control node that is used to coordinate the parallel execution over the worker pods. If it is unreachable, that would lead me to believe that pod-to-pod communication is not working in the cluster (see the sketch below for a quick check).
PyTorch: as for torch being stuck on CPU, there could be two reasons:
• The torch version being installed is a CPU-only one. This would be controlled by the PyPI indices that you may have configured in your local pip config; run pip config list to check. As I recall, there are CPU-only and GPU-only pip indices available for PyTorch.
• There is also a known issue with cross-platform resolving for torch: torch ships with CUDA packages included, but these dependencies are limited to linux-64. The way pip resolves means we are not able to include platform-specific transitive dependencies when resolving cross-platform (e.g. macOS arm64 -> linux x86_64). This case usually ends up with the flow hitting an error about missing dependencies, though, so if the flow succeeds and torch simply uses the CPU, this is unlikely to be the cause.
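For the MF_MASTER_ADDR check, a rough sketch you could run inside one of the worker pods (plain standard-library Python; MF_MASTER_ADDR is the variable from the error above):

import os
import socket

# MF_MASTER_ADDR likely points at the control pod coordinating the parallel step
addr = os.environ.get("MF_MASTER_ADDR")
print("MF_MASTER_ADDR =", addr)

if addr:
    try:
        # If pod-to-pod networking and cluster DNS work, this resolves
        # to the control pod's IP.
        print("resolves to:", socket.gethostbyname(addr))
    except socket.gaierror as exc:
        print("DNS resolution failed:", exc)

If the lookup fails here too, the issue is most likely cluster networking or DNS rather than anything Metaflow-specific.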