Hey guys I am trying to run some training with pyt...
# ask-metaflow
i
Hey guys, I am trying to run some training with PyTorch. However, when I kick off the flow I get the error:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
Has anyone dealt with this sort of error before? If so, how did you manage to get past it?
c
Hey Ameen, how are you installing pytorch and cudnn in the image? More to the point, the issue is likely a mismatch between pytorch version and cuda-toolkit version.
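One way to see which versions are actually in play is to print them from inside the container. This is a minimal sketch, not a definitive diagnosis: `describe_torch_env` and `arch_listed` are hypothetical helper names, and the architecture check is an exact-match simplification (a build can sometimes still run on a device that is not listed verbatim).

```python
def arch_listed(capability, arch_list):
    """Return True if the device's compute capability is named in the arch list.

    Exact-match check only (an assumption made for simplicity): a False result
    is a hint to investigate, not proof that the GPU is unsupported.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_torch_env():
    """Collect the PyTorch, CUDA and cuDNN versions visible to this process."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,          # CUDA version torch was built against
        "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = capability
        info["arch_list"] = archs
        info["arch_listed"] = arch_listed(capability, archs)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Running this inside the image on the GPU instance shows whether the CUDA build of the wheel and the GPU on the box line up.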
i
Right now I am using a Docker image. The p2.xlarge instance I am using reports CUDA 11.4, so I installed PyTorch in my image using the command:
RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
There was no index URL for cu114, so this was the next best thing from what I found on the PyTorch forums.
c
When you run the container independent of Metaflow can you run your torch logic without this error?
i
I am using a Mac, so I did not think I would be able to run GPU-aided training using the image.
c
I see. It is kind of indirect to debug this through Metaflow. I'd suggest manually spinning up an EC2 instance of the same type, pulling your container image onto that box, running it, and seeing if you can run the same program you want to run inside a Metaflow task. It'll be faster to manually install things directly on the box, as this error is quite likely independent of Metaflow.
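A quick check to run inside the container on that box, before trying the full training script, could look like the sketch below. It assumes a small GPU convolution is enough to exercise cuDNN; the op that actually fails in your flow may be a different one, and `cudnn_smoke_test` is a hypothetical helper name.

```python
def cudnn_smoke_test():
    """Run one small convolution on the GPU and report what happened."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    if not torch.cuda.is_available():
        return "no CUDA device visible"

    try:
        # A 2D convolution on a CUDA tensor is typically dispatched to cuDNN.
        conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
        x = torch.randn(1, 3, 32, 32, device="cuda")
        y = conv(x)
        # Wait for the kernel to finish so asynchronous errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: output shape {tuple(y.shape)}"


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

If this fails with the same cuDNN error outside of Metaflow, the problem is in the image and instance combination rather than in the flow.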
i
Okay, will see if I can get this working and go from there
👍 1