I'm running into an issue with `@kubernetes` CUDA ...
# ask-metaflow
I'm running into an issue with `@kubernetes`, CUDA, and `pypi`. The code works on AWS Batch. I've deployed k8s and the NVIDIA device plugin for pods (https://github.com/NVIDIA/k8s-device-plugin/tree/main/deployments/helm/nvidia-device-plugin), and my pod can see the GPU using `nvidia-smi`. However, I'm getting the following error.
@pypi_base(
    python="3.11.7",
    packages={"torch": "2.2.1", "torchmetrics": ""},
    extra_indices=["https://download.pytorch.org/whl/cu118"],
)
@kubernetes(cpu=1, memory=6000, gpu=1, shared_memory=6000)
@environment(vars={"LD_LIBRARY_PATH": "/tmp", "NVIDIA_DRIVER_CAPABILITIES": "compute,utility"})
@step
OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory
I've also tried using the Docker image `pytorch/pytorch:2.2.1-cuda11.8-cudnn8-runtime`, with the same error.
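To narrow this down, here is a minimal diagnostic sketch that could run inside the step. It only checks whether the dynamic loader can open the library named in the error, and it reports the `LD_LIBRARY_PATH` the process actually sees. The helper name `probe_shared_library` is hypothetical and not part of Metaflow or PyTorch; the sketch assumes CPython on Linux.

```python
import ctypes
import os


def probe_shared_library(name: str) -> dict:
    """Try to load a shared library by name and report what the loader saw.

    Hypothetical helper for debugging; not part of Metaflow or PyTorch.
    """
    report = {
        "library": name,
        # The search path as seen by this process (empty string if unset).
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "loadable": False,
        "error": None,
    }
    try:
        # CDLL raises OSError when the loader cannot find or open the file.
        ctypes.CDLL(name)
        report["loadable"] = True
    except OSError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    # Library name taken from the error message above.
    print(probe_shared_library("libcudart.so.11.0"))
```

Running this with and without the `@environment` override of `LD_LIBRARY_PATH` might show whether the search path is involved, though that is only a guess at this point.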