PyTorch(總)——PyTorch遇到令人迷人的BUG與記錄

阿新 • • 發佈：2019-01-01

這篇部落格就用來記錄在使用pytorch時遇到的BUG，雖然年紀大了，但是調出BUG還是令人興奮^_^!

BUG1：
在使用NLLLoss()啟用函式時，NLLLoss用來做n類分類的，一般最後一層網路為LogSoftmax，如果其他的則需要使用CrossEntropyLoss。其使用格式為：loss(m(input), target)，其中input為2DTensor大小為（minibatch，n），target為真實分類的標籤。
如果輸入的input型別為torch.cuda.FloatTensor，target型別為torch.cuda.IntTensor，則會出現如下錯誤：
TypeError: CudaClassNLLCriterion_updateOutput received an invalid combination of arguments - got (int, torch.cuda.FloatTensor, !torch.cuda.IntTensor!, torch.cuda.FloatTensor, bool, NoneType, torch.cuda.FloatTensor), but expected (int state, torch.cuda.FloatTensor input, torch.cuda.LongTensor target, torch.cuda.FloatTensor output, bool sizeAverage, [torch.cuda.FloatTensor weights or None], torch.cuda.FloatTensor total_weight)
因此需要保證target型別為torch.cuda.LongTensor，需要在資料讀取的迭代其中把target的型別轉換為int64位的：target = target.astype(np.int64)，這樣，輸出的target型別為torch.cuda.LongTensor。（或者在使用前使用Tensor.type(torch.LongTensor)

進行轉換）。

為了說明pytorch中numpy和toch的轉換關係，測試如下：

首先輸入int32的numpy陣列轉換為torch，得到的IntTensor型別

如果輸入的為int64的numpy，得到LongTensor型別：

這裡寫圖片描述

如果把int32的陣列轉換為LongTensor，則會出錯：

這裡寫圖片描述

如果把int64的陣列轉換為LongTensor，正常：

這裡寫圖片描述

PS: 2017/8/8（奇怪，在使用binary_cross_entropy進行分類時又要求型別為FloatTensor型別，簡直夠了）

BUG2：
同樣是NLLLoss()使用時的問題。網路傳播都正常，但是在計算loss時出現如下錯誤：
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/loop/pytorch-master/torch/lib/THC/generic/THCTensorMath.cu:15

斷點除錯發現數據型別出現如下變化：

bug2.1

我以為顯示卡除了問題，最後在pytoch#1204中發現一個人的標籤中出現-1，發生了類似的錯誤：
bug2.2

而我的標籤為1~10，最後把標籤定義為1~9，解決這個問題。^_^!

BUG3：
當使用torch.view()時出現 RuntimeError: input is not contiguous at /home/loop/pytorch-master/torch/lib/TH/generic/THTensor.c:231

這個是由於淺拷貝出現的問題。
如下：定義初始化一個Tensor值，並且對其進行維度交換，在進行Tensor.view()操作時出現以上錯誤。

BUG3_1

這是由於淺拷貝的原因，y只是複製了x的指標，x改變，y也要隨之改變，如下：

BUG3_2

可以使用tensor.contiguous()解決：

BUG3_3

BUG4：
使用Cross_entropy損失函式時出現 RuntimeError: multi-target not supported at …

仔細看其引數說明：

input has to be a 2D Tensor of size batch x n.

This criterion expects a class index (0 to nClasses-1) as the target for each value of a 1D tensor of size n

其標籤必須為0~n-1，而且必須為1維的，如果設定標籤為[nx1]的，則也會出現以上錯誤。

BUG4：
按照官網的方式編譯PyTorch原始碼時出現：undefined reference to ... @GLIBCXX_3.4.21 (未定義的引用問題) 我的是出現在編譯90%左右的broadcast_test附近出現的。問題估計是GCC的版本造成的，雖然GCC -v顯示的5.0，但是呼叫的庫不是，需要執行：

conda install libgcc

然後python setup.py clean重新生成即可解決問題

BUG5：
出現如下錯誤：

ValueError: Expected more than 1 value per channel when training, got input size [1, 5,1,1]

這個是在使用BatchNorm時不能把batchsize設定為1，一個樣本的話y = (x - mean(x)) / (std(x) + eps)的計算中，x==mean(x)導致輸出為0，注意這個情況是在feature map為1的情況時，才可能出現x==mean(x)。

NOTE1： 共享引數問題

在tensorflow中有variable_scope方法實現引數共享，也就是說對於2張圖片，第二張訓練時的權重引數與第一張圖片所使用的相同，詳見tf.variable_scope. 同樣，在PyTorch則不存在這樣的問題，因為PyTorch中使用的卷積（或者其他）層首先需要初始化，也就是需要建立一個例項，然後使用例項搭建網路，因此在多次使用這個例項時權重都是共享的。

NOTE2： torch.nn.Module.cuda作用

之前看教程中在定義完網路後會進行：

if gpu:
    net.cuda()

現在才發現這個的作用，官方文件上寫的是：Moves all model parameters and buffers to the GPU.

也就是在定義時並沒有把weight引數傳入gpu中，在呼叫網路進行計算時，如果傳入的資料為GPU資料，則會出現：tensors are on different GPUs 錯誤，因此使用torch.nn.Module.cuda可以把定義的網路引數傳入gpu中。

NOTE3： 對同一個網路連續進行兩次梯度求解（backward）

如果使用一個Variable資料傳入到網路，通過backward求解其梯度值，然後在使用另一個Variable傳入網路，再次求解梯度值，其最終結果會怎麼樣呢？正如你所想得樣，是兩次梯度之和。測試程式碼如下：

import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
def init_weigts(m):
    classname = m.__class__.__name__
    if classname.find('Linear') != -1:
        m.weight.data.fill_(0)
        m.bias.data.fill_(0)

net = nn.Sequential(nn.Linear(2, 2))
net.apply(init_weigts)

input = Variable(torch.FloatTensor(1, 2).fill_(1))
label = Variable(torch.FloatTensor(1, 2).fill_(1))
criterion = nn.MSELoss()
# compute first time network
net.zero_grad()
print('before backward')
print(net[0].bias.grad)
output = net(input)
loss = criterion(output, label)
loss.backward()
print('after backward1')
print(net[0].bias.grad)
# compute second time network
input2 = Variable(torch.FloatTensor(1, 2).fill_(1))
label2 = Variable(torch.FloatTensor(1, 2).fill_(1))
output2 = net(input2)
loss2 = criterion(output2, label2)
loss2.backward()
print('after2 backward1')
print(net[0].bias.grad)

定義一個一層的線性網路，並且其權重（weight）和偏置（bias）都初始化為0，在每次求解梯度後輸出梯度值，其結果如下：

3_1

可以發現，在進行梯度求解前，沒有梯度，在第一次計算後梯度為-1，第二次計算後為-2，如果在第一次求解後初始化梯度net.zero_grad()，則來嗯次都是-1，則連續多次求解梯度為多次梯度之和。

NOTE4： PyTorch自定義權重初始化

在上面的NOTE3中使用自定意的權重引數初始化，使用toch.nn.Module.apply()對定義的網路引數進行初始化，首先定義一個權重初始化的函式，如果傳入的類是所定義的網路，則對其權重進行in_place賦值。

如果對weight_init(m)中的classname輸出，可以發現有多個類：（因此需要判斷是否為所定義的網路）

Linear
Sequential

NOTE5： PyTorch權重的更新

關於網路傳遞中網路的定義、loss計算、backpropogate的計算，update weight在Neural Networks有簡單介紹，這裡測試下。只要定義一個優化器（optimizer），實現了常見的優化演算法（optimization algorithms），然後使用優化器和計算的梯度進行權重的更新。

在NOTE3中的程式碼後面增加如下（更新權重引數）：

print('before update parameters')
print(net[0].bias)
optimizer = optim.Adam(net.parameters(), 1)
optimizer.step()
print('after update parameters')
print(net[0].bias)

其執行結果為：

3_2

可見使用optimizer.step()實現了網路權重的更新。（而且可以選擇不同的更新方式，如：Adam、SGD等）

NOTE6： torch.autograd.backward()使用技巧

當計算多個梯度相加（相減）時，使用backward(torch.FloatTensor([-1]))可以簡單實現。

NOTE6： 監控記憶體使用，防止記憶體洩露(memory leak)

程式碼如下：

import gc
import resource

gc.collect()
max_mem_used = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("{:.2f} MB".format(max_mem_used / 1024))

PyTorch(總)——PyTorch遇到令人迷人的BUG與記錄

這篇部落格就用來記錄在使用pytorch時遇到的BUG，雖然年紀大了，但是調出BUG還是令人興奮^_^!

PyTorch(BUG匯總)—PyTorch遇到令人迷人的BUG與記錄

PyTorch(總)——PyTorch遇到令人迷人的BUG與記錄

PyTorch—torchvision.models匯入預訓練模型與殘差網路講解

【pytorch】載入模型出現的bug

[PyTorch]論文pytorch復現中遇到的BUG

【小白學PyTorch】4 構建模型三要素與權重初始化

【小白學PyTorch】5 torchvision預訓練模型與資料集全覽

[PyTorch 學習筆記] 5.2 Hook 函式與 CAM 演算法

[PyTorch 學習筆記] 7.1 模型儲存與載入

【小白學PyTorch】17 TFrec檔案的建立與讀取

【小白學PyTorch】19 TF2模型的儲存與載入

【小白學PyTorch】20 TF2的eager模式與求導

2、前端總線FSB和南橋與北橋

Hive常見的bug與解決辦法

PyTorch(八)——pyTorch-To-Caffe

OpenSuSE下MySQL遇到的BUG與修復

【PyTorch】PyTorch中使用指定的GPU

學習總結：CSS(一)塊級與行級元素特性、盒模型、層模型、BUG與BFC、浮動模型

PyTorch(五)——PyTorch原始碼修改之增加ConvLSTM層

Unity3D引擎崩潰、異常、警告、BUG與提示總結

PyTorch(總)——PyTorch遇到令人迷人的BUG與記錄

這篇部落格就用來記錄在使用pytorch時遇到的BUG，雖然年紀大了，但是調出BUG還是令人興奮^_^!

相關推薦