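CUDA and cuDNN version problems in a TensorFlow runtime environment
阿新 • Published: 2019-02-02
Problem
A server running CentOS Linux release 7.3.1611 had TensorFlow 1.0, CUDA 8.0, and cuDNN v5.1 installed some time ago, and TensorFlow programs used to run on it without trouble. After a period of disuse a small problem appeared, so I looked into it and fixed it. Running a program now produces the following output: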
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.so.5 . LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/lib:
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
···
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:82:00.0
Total memory: 11.17 GiB
Free memory: 2.08GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:82:00.0)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:222] Check failed: s.ok() could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/python3/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
Aborted (core dumped)
The log says libcudnn.so.5 cannot be found, so I checked the /usr/local/cuda-8.0/lib64 directory.
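A minimal sketch of that directory check, assuming only that cuDNN libraries follow the usual libcudnn.so.N naming. The helper name list_cudnn_libs is made up for illustration; the directory is the first entry of LD_LIBRARY_PATH in the log above.

```python
import os
import re


def list_cudnn_libs(lib_dir):
    """Return (name, symlink_target_or_None) for each libcudnn.so* entry in lib_dir."""
    found = []
    for name in sorted(os.listdir(lib_dir)):
        # Matches libcudnn.so, libcudnn.so.6, libcudnn.so.6.0.21, ...
        if re.match(r"libcudnn\.so(\.\d+)*$", name):
            path = os.path.join(lib_dir, name)
            target = os.readlink(path) if os.path.islink(path) else None
            found.append((name, target))
    return found


if __name__ == "__main__":
    lib_dir = "/usr/local/cuda-8.0/lib64"  # taken from LD_LIBRARY_PATH in the log
    if os.path.isdir(lib_dir):
        for name, target in list_cudnn_libs(lib_dir):
            print(name, "->", target if target else "(regular file)")
```

An empty result, or a list with no libcudnn.so.5 entry, corresponds to the "Couldn't open CUDA library libcudnn.so.5" line in the log.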
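Sure enough, it is not there. cuDNN had been upgraded at some point, so the system now has versions 6 and 7 installed but not 5. What to do without version 5? No problem, I thought, just create a symbolic link: ln -s libcudnn.so.7 libcudnn.so.5
However,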
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
···
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:82:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:390] Loaded runtime CuDNN library: 7004 (compatibility version 7000) but source was compiled with 5105 (compatibility version 5100). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
Aborted (core dumped)
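It still fails. The message says the loaded library is cuDNN 7.0.4 (7004), while TensorFlow was compiled against cuDNN 5.1.5 (5105), so cuDNN 7 is incompatible and 5.1 is required. Does that mean the version is too new? Most of the blog posts I found describe the opposite case, a cuDNN version that is too old. I then checked the CUDA website for the cuDNN versions that go with CUDA 8.0.
Version 7 was evidently the wrong choice, so I simply pointed the link at version 6 instead (-f replaces the link created earlier): ln -sf libcudnn.so.6 libcudnn.so.5
After that the program ran successfully.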
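The numbers 7004 and 5105 in the error message pack the cuDNN version as major * 1000 + minor * 100 + patch, the encoding cuDNN's header uses for these releases. The sketch below decodes them and pulls both values out of a log line; the function names are made up for illustration, and the regular expression assumes the message wording shown above.

```python
import re


def decode_cudnn_version(code):
    """Split an integer such as 7004 into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def parse_mismatch(log_line):
    """Return (runtime_version, compiled_version) from the mismatch message, or None."""
    m = re.search(
        r"Loaded runtime CuDNN library: (\d+) .*?compiled with (\d+)", log_line
    )
    if m is None:
        return None
    return (
        decode_cudnn_version(int(m.group(1))),
        decode_cudnn_version(int(m.group(2))),
    )


if __name__ == "__main__":
    line = (
        "Loaded runtime CuDNN library: 7004 (compatibility version 7000) "
        "but source was compiled with 5105 (compatibility version 5100)."
    )
    print(parse_mismatch(line))
```

For the message in this post, the runtime library decodes to 7.0.4 and the compile-time version to 5.1.5.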
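Summary
There were two problems: the cuDNN library was missing, and then the cuDNN library was the wrong version.
The fix is simple, but when setting up a GPU computing environment it pays to watch how the OS version, the GPU's compute capability, the CUDA version, and the cuDNN version match up.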
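One way to check which cuDNN version is actually installed is to read the version macros from its header. This is only a sketch: the function name is made up, and the header path /usr/local/cuda-8.0/include/cudnn.h is an assumption about where cuDNN was unpacked on this machine, not something shown in the logs.

```python
import os
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch) from the CUDNN_* macros, or None if any is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(
            r"^\s*#\s*define\s+" + macro + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    header = "/usr/local/cuda-8.0/include/cudnn.h"  # assumed install location
    if os.path.isfile(header):
        with open(header) as f:
            print(cudnn_version_from_header(f.read()))
```

Comparing that result with the "compiled with" version in TensorFlow's error message shows directly whether the two match.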