bumpy-orange-22261
03/26/2025, 9:43 AM
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] kubernetes.launch_job(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes.py", line 154, in launch_job
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] self._job = self.create_jobset(**kwargs).execute()
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] File "/tmp/tmpnwvasbag/metaflow/plugins/kubernetes/kubernetes_jobsets.py", line 927, in execute
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] raise KubernetesJobsetException(
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] metaflow.plugins.kubernetes.kubernetes_jobsets.KubernetesJobsetException: Exception when calling CustomObjectsApi->create_namespaced_custom_object: (404)
2025-03-26 16:53:24.554 [3/tune/7 (pid 20009)] Reason: Not Found
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response headers: HTTPHeaderDict({'Audit-Id': '0b7f7b27-fc2c-4bfb-a8c5-0275a7b25bf4', 'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'f7a37b4e-83b2-4bf9-b706-8c2f2c5abe7d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd8feb481-0d86-493f-a05e-072dbd75ac82', 'Date': 'Wed, 26 Mar 2025 09:53:24 GMT', 'Content-Length': '19'})
2025-03-26 16:53:24.555 [3/tune/7 (pid 20009)] HTTP response body: 404 page not found
Looking at the k8s container log, there is an error:
psycopg2.errors.UndefinedTable: relation "public.flows_v3" does not exist
square-wire-39606
03/26/2025, 2:49 PM
thankful-ambulance-42457
03/26/2025, 3:20 PM
bumpy-orange-22261
03/26/2025, 6:16 PM
bumpy-orange-22261
03/27/2025, 3:21 AM
python metaflow-tutorials/00-helloworld/helloworld.py run --with kubernetes:cpu=4,memory=10000,namespace=default,image=python:latest
thankful-ambulance-42457
03/27/2025, 2:32 PM
It looks like the cluster is missing JobSets on Kubernetes, which the ray decorator relies on for parallel jobs. You can verify this with a simple @parallel flow (the ray decorator builds upon @parallel); see the sketch below.
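A minimal sketch of such a verification flow, assuming Metaflow's standard @parallel / num_parallel API (the file name parallel_test.py and the node count are placeholders, not from the thread):

from metaflow import FlowSpec, current, parallel, step

class ParallelTestFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into a gang-scheduled, multi-node step; on Kubernetes this
        # is submitted as a JobSet.
        self.next(self.process, num_parallel=2)

    @parallel
    @step
    def process(self):
        # Each node reports its index; if JobSets work, every node reaches this point.
        print(f"node {current.parallel.node_index} of {current.parallel.num_nodes}")
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    ParallelTestFlow()

Running it the same way as the hello-world flow above, e.g. python parallel_test.py run --with kubernetes, should either print each node's index or fail with the same 404 on JobSet creation if the JobSet CRD/controller is not installed in the cluster.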
bumpy-orange-22261
03/27/2025, 2:48 PM
thankful-ambulance-42457
03/27/2025, 3:05 PM
bumpy-orange-22261
03/28/2025, 2:19 AM
import torch

# Pick the training device: the first GPU if CUDA is available, otherwise the CPU.
if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print('Training on GPU.')
else:
    device = torch.device('cpu')
    print('Training on CPU.')
bulky-afternoon-92433
03/28/2025, 12:42 PM
bumpy-orange-22261
03/28/2025, 12:49 PM
bumpy-orange-22261
03/28/2025, 1:00 PM
bumpy-orange-22261
03/31/2025, 10:08 AM
ParallelTestFlow() flow stuck in the process step
2025-03-31 16:53:29.706 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Setting up task environment.
2025-03-31 16:53:41.107 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Downloading code package...
2025-03-31 16:53:42.576 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Code package downloaded.
2025-03-31 16:53:42.648 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Task is starting.
2025-03-31 16:53:46.320 [48/process/196 (pid 2364)] [pod js-f23a6f-control-0-0-bl4cv] Processing...
bumpy-orange-22261
03/31/2025, 10:09 AM
Failed to get host by name for MF_MASTER_ADDR after waiting for 600 seconds.
thankful-ambulance-42457
03/31/2025, 12:46 PM
bumpy-orange-22261
04/01/2025, 4:15 PM
thankful-ambulance-42457
04/02/2025, 12:00 PM
Check what the MF_MASTER_ADDR env variable is set to on the pod. It is most likely the K8s-internal name of the control node that is used to coordinate the parallel execution over the worker pods. It being unreachable would lead me to believe that pod-to-pod communication is not working in the cluster.
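A quick way to confirm this is to check, from inside one of the worker pods, whether that hostname is set and resolvable (a minimal sketch; MF_MASTER_ADDR is the variable named in the error above):

# Minimal sketch: check whether the control pod's hostname (MF_MASTER_ADDR)
# is set and resolvable from a worker pod. A failure here matches the
# "Failed to get host by name for MF_MASTER_ADDR" error above.
import os
import socket

addr = os.environ.get("MF_MASTER_ADDR")
print("MF_MASTER_ADDR =", addr)
try:
    print("resolves to", socket.gethostbyname(addr))
except (TypeError, socket.gaierror) as exc:
    print("resolution failed:", exc)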
---
PyTorch
As for torch being stuck on CPU, there could be two reasons:
• the torch version being installed is a CPU-only one. This would be controlled by the PyPI indices you have possibly configured in your local pip config (run pip config list to check). There are CPU-only and GPU pip indices available for PyTorch, as I recall; a quick check for this case is sketched after this list.
• there is also a known issue with cross-platform resolution for torch: torch ships with CUDA packages included, but these dependencies are limited to linux-64. The way pip resolves means we are not able to include platform-specific transitive dependencies when resolving cross-platform (e.g. macOS arm64 -> linux x86_64).
  ◦ This case usually ends up with the flow encountering an error about missing dependencies, though, so the flow succeeding with torch only using the CPU would mean this case is unlikely to be the cause.
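For the first case, a minimal check along these lines (a sketch; it only inspects the installed wheel) can be run in the same environment the flow uses:

# Minimal sketch: tell a CPU-only torch wheel apart from a CUDA-enabled one.
# CPU-only builds report torch.version.cuda as None (and the version string
# often carries a "+cpu" suffix), even when the node has a GPU.
import torch

print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available at runtime:", torch.cuda.is_available())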