# ask-metaflow
I'm having some trouble getting GPUs working on AWS Kubernetes. I've got the nvidia-device-plugin installed, and my jobs are scheduling on my GPU instance when using `@kubernetes(gpu=1)`. I'm getting stuck trying to install `tensorflow-gpu`, though. When running the following code:
```python
from metaflow import FlowSpec, step, kubernetes, retry, conda_base

libraries = {
    "tensorflow-gpu": "2.11.1",
    "cudatoolkit": "11.0.3"
}

@conda_base(libraries=libraries, python="3.10.9")
class Tensorflow(FlowSpec):
    ...
```
I get this error on the start step:

```
Step: start, Error: command '['/opt/conda/condabin/mamba', 'create', '--yes', '--no-default-packages', '--name', 'metaflow_Tensorflow_linux-64_615c884d84bb1f7ae93a2e331b5189159b559cc8', '--quiet', b'python==3.10.9', b'requests==>=2.21.0', b'boto3==>=1.14.0', b'tensorflow-gpu==2.11.1', b'cudatoolkit==11.0.3']' returned error (1): b'Could not solve for environment specs\nEncountered problems while solving:\n  - nothing provides __cuda needed by tensorflow-2.11.1-cuda112py310he87a039_0\n\nThe environment can\'t be solved, aborting the operation\n\n{\n    "success": false\n}\n', stderr=b''
```
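My guess is that the environment is being resolved on a machine with no GPU driver, so mamba never sets the `__cuda` virtual package that this `tensorflow-gpu` build requires. Here's a rough sketch of what I was planning to try next, assuming the standard `CONDA_OVERRIDE_CUDA` override gets picked up by the mamba call Metaflow makes (the CUDA version and flow filename below are just placeholders I made up):

```python
import os
import subprocess

# Untested sketch: tell the conda/mamba solver to assume a CUDA driver is
# present so it can satisfy the __cuda virtual package during the solve.
# "11.2" and "tensorflow_flow.py" are placeholders, not values from my setup.
env = dict(os.environ, CONDA_OVERRIDE_CUDA="11.2")
subprocess.run(
    ["python", "tensorflow_flow.py", "--environment=conda", "run"],
    env=env,
    check=True,
)
```

(Or equivalently, exporting `CONDA_OVERRIDE_CUDA` in the shell before running the flow.) Not sure if that's the right approach here, though.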
I was able to shell into the running docker container, install tensorflow-gpu manually, and it worked fine. Any thoughts?