
caffe:cudaSuccess (2 vs. 0) out of memory

Problem description:

I am trying to train a network with Caffe. My images are 512x640 and the batch size is 1; I am trying to implement FCN-8s.
I am currently running on an Amazon EC2 instance (g2.2xlarge) with 4 GB of GPU memory, but as soon as I run the solver it throws an error:

Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
Aborted (core dumped)
The error you get is indeed out of memory, but it's not RAM — it's GPU memory (note that the error comes from CUDA). Usually, when Caffe runs out of memory, the first thing to do is reduce the batch size (at the cost of gradient accuracy), but since you are already at batch size = 1... Are you sure the batch size is 1 for both the TRAIN and TEST phases?
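In Caffe the batch size is set per-phase in the data layers of the network prototxt, so it is easy to set it in one phase and forget the other. A sketch of what to check (layer names and LMDB paths here are placeholders, not from the original post):

```
layer {
  name: "data"
  type: "Data"
  include { phase: TRAIN }
  data_param {
    source: "train_lmdb"   # placeholder path
    batch_size: 1          # must be 1 here...
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  include { phase: TEST }
  data_param {
    source: "val_lmdb"     # placeholder path
    batch_size: 1          # ...and here as well
    backend: LMDB
  }
}
```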
I guessed so. And yes, both the train and test phases' batch size is 1. I think I will have to resize the training images to something smaller and try it out. But why does 4 GB of GPU memory turn out to be too little? It says the total number of bytes read was 537399810, which is much smaller than 4 GB. – Abhilash Panigrahi Nov 19 '15 at 8:11

@AbhilashPanigrahi is it possible some other processes are using the GPU at the same time? Try the command line nvidia-smi to see what's going on on your GPU. – Shai Nov 19 '15 at 8:18

I did. No other process is running apart from this (which automatically quits after a few seconds because of the error). – Abhilash Panigrahi Nov 19 '15 at 8:21

I just reduced the image and label size to about 256x320. It runs successfully. I saw it is using around 3.75 GB of GPU memory. Thanks for the help. – Abhilash Panigrahi Nov 19 '15 at 8:47
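The ~537 MB figure only counts the input data read from disk; most GPU memory in a fully convolutional net goes to intermediate activations and their gradients, which scale with the spatial size of the input. That is why halving each dimension (512x640 → 256x320) helps so much — it cuts activation memory roughly by four. A rough sketch of the arithmetic (the per-pixel float count is a made-up illustrative constant, not measured from FCN-8s):

```python
# Rough activation-memory comparison for the two input sizes in the post.
# floats_per_pixel is a hypothetical constant standing in for the total
# number of float32 activations per spatial location across all layers.

def activation_bytes(height, width, floats_per_pixel=2000):
    """Approximate activation memory for one image at the given size."""
    return height * width * floats_per_pixel * 4  # 4 bytes per float32

full = activation_bytes(512, 640)   # original input size
small = activation_bytes(256, 320)  # resized input

print(full / small)  # halving both dimensions quarters activation memory
```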
Of course, you can also run the computation on the CPU, but it is extremely slow. Below is my CPU run:

[Screenshot: Caffe training log, running in CPU mode]

From the timestamps in the figure you can see how slowly the CPU runs. Ouch...
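For reference, switching Caffe from GPU to CPU computation is a one-line change in the solver configuration:

```
# solver.prototxt (fragment)
solver_mode: CPU   # default is GPU
```

This avoids the CUDA out-of-memory error entirely, at the cost of the very slow training shown above.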