CS231n-2017 Assignment 3: RNN, LSTM, and Style Transfer
Part I: RNN
The steps to complete are laid out in the RNN_Captioning.ipynb notebook.
This exercise uses the COCO dataset released by Microsoft in 2014. The image-captioning portion of the dataset contains 80,000 training images and 40,000 validation images. Features for these images have already been extracted with a VGG-16 network and are stored in train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5; each image is represented by a 4096-dimensional vector. To reduce the size of the problem, PCA-processed features are also provided, stored in train2014_vgg16_fc7_pca.h5 and val2014_vgg16_fc7_pca.h5, with the dimensionality reduced from 4096 to 512.
An example image with its captions is shown in the notebook. <START> and <END> are the start and end tokens of a caption, <UNK> stands in for rare words missing from the vocabulary, and shorter captions are padded with the special <NULL> token so that all captions in a batch have the same length.
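In the notebook all of this is loaded through the assignment's data helper; a minimal sketch, assuming the standard cs231n utilities and a downloaded dataset:

from cs231n.coco_utils import load_coco_data

data = load_coco_data(pca_features=True)  # use the 512-dimensional PCA features
print(data['train_features'].shape)       # e.g. (num_images, 512)
print(data['train_captions'].shape)       # e.g. (num_captions, T), integer word indices
for token in ['<NULL>', '<START>', '<END>', '<UNK>']:
    print(token, data['word_to_idx'][token])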
1. Single-step forward pass of the RNN
The implementation is much like the layers of the previous assignment; the difference is that we now implement the logic of a recurrent layer: each time the network reads one caption word, it computes a new hidden state from that input and the current hidden state.
The rnn_step_forward() function in rnn_layers.py:
def rnn_step_forward(x, prev_h, Wx, Wh, b):
    next_h, cache = None, None
    # TODO: Implement a single forward step for the vanilla RNN.
    next_h = tanh(np.dot(x, Wx) + np.dot(prev_h, Wh) + b)
    cache = (next_h, Wx, Wh, x, prev_h)
    return next_h, cache
where tanh() is a helper computing the hyperbolic tangent; a numerically stable version that guards against overflow:
def tanh(x):
    tmp = x.copy()
    tmp[tmp > 10] = 10            # clip so that np.exp(2*x) cannot overflow
    tmp = np.exp(tmp * 2)
    return (tmp - 1) / (tmp + 1)
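For reference, np.tanh is itself overflow-safe; the clipped helper above just makes the guard explicit. A quick sanity check on ordinary inputs (illustrative values):

import numpy as np

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(tanh(x), np.tanh(x))  # the two agree away from the clipping region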
2. Single-step backward pass of the RNN
The derivative of the hyperbolic tangent is tanh'(x) = 1 - tanh²(x), so with next_h = tanh(a) the local gradient is simply 1 - next_h². Hence the rnn_step_backward() function in rnn_layers.py:
def rnn_step_backward(dnext_h, cache):
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    # TODO: Implement the backward pass for a single step of a vanilla RNN.
    next_h, Wx, Wh, x, prev_h = cache
    dtanh_h = dnext_h * (1 - next_h**2)   # backprop through tanh
    dx = dtanh_h.dot(Wx.T)
    dprev_h = dtanh_h.dot(Wh.T)
    dWx = x.T.dot(dtanh_h)
    dWh = prev_h.T.dot(dtanh_h)
    db = np.sum(dtanh_h, axis=0)
    return dx, dprev_h, dWx, dWh, db
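As the notebook does, the analytic gradients can be verified against numeric ones; a sketch using the assignment's eval_numerical_gradient_array helper and arbitrary small dimensions:

import numpy as np
from cs231n.gradient_check import eval_numerical_gradient_array

np.random.seed(0)
N, D, H = 4, 5, 6
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)

next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dnext_h = np.random.randn(*next_h.shape)
fx = lambda x: rnn_step_forward(x, prev_h, Wx, Wh, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dnext_h)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)
print(np.max(np.abs(dx - dx_num)))  # should be on the order of 1e-8 or smaller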
3. Forward pass of the RNN over a sequence
The network reads a minibatch of caption data x (N samples, each caption of length T) and uses the features of the corresponding images as the initial hidden state h0. The forward pass produces the hidden states h of every sample at every timestep and stores the variables needed for backpropagation.
The rnn_forward() function in rnn_layers.py:
def rnn_forward(x, h0, Wx, Wh, b):
    h, cache = None, None
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of input data.
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    prev_h = h0
    for iter_time in range(T):
        h[:, iter_time, :], _ = rnn_step_forward(x[:, iter_time, :], prev_h, Wx, Wh, b)
        prev_h = h[:, iter_time, :]
    cache = (h0, h, Wx, Wh, x)   # per-step caches are reconstructed in rnn_backward
    return h, cache
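A quick shape check with arbitrary small dimensions (illustrative only):

import numpy as np

N, T, D, H = 2, 3, 4, 5
x, h0 = np.random.randn(N, T, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
h, _ = rnn_forward(x, h0, Wx, Wh, b)
assert h.shape == (N, T, H)  # one H-dimensional hidden state per sample per timestep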
4. Backward pass of the RNN
Backpropagation through the whole sequence uses the stored variables. The rnn_backward() function in rnn_layers.py:
def rnn_backward(dh, cache):
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    # TODO: Implement the backward pass for a vanilla RNN running an entire sequence of data.
    N, T, H = dh.shape
    h0, h, Wx, Wh, x = cache
    dh0 = np.zeros_like(h0)   # carries dprev_h from step to step; equals dL/dh0 after the loop
    dx = np.zeros_like(x)
    dWx = np.zeros_like(Wx)
    dWh = np.zeros_like(Wh)
    db = np.zeros(H)
    h = np.concatenate((h0[:, np.newaxis, :], h), axis=1)   # prepend h0 so every step has a predecessor
    for iter_time in range(T):
        # upstream gradient = gradient from the loss at this step + gradient from the following step
        dnext_h = dh[:, -(iter_time+1), :] + dh0
        cache = (h[:, -(iter_time+1), :], Wx, Wh, x[:, -(iter_time+1), :], h[:, -(iter_time+2), :])
        dx_step, dh0, dWx_step, dWh_step, db_step = rnn_step_backward(dnext_h, cache)
        dx[:, -(iter_time+1), :] = dx_step
        dWx += dWx_step
        dWh += dWh_step
        db += db_step
    return dx, dh0, dWx, dWh, db
Note how the weight gradients are accumulated across timesteps: this accumulation is precisely how the RNN's parameter sharing shows up in the backward pass.
5. Word embeddings
Convert the word indices x of the image captions into vector representations, and update the word vectors during the backward pass.
The word_embedding_forward() function in rnn_layers.py:
def word_embedding_forward(x, W):
    out, cache = None, None
    # TODO: Implement the forward pass for word embeddings.
    out = W[x, :]
    cache = (x, W.shape)
    return out, cache
The word_embedding_backward() function in rnn_layers.py:
def word_embedding_backward(dout, cache):
    dW = None
    # TODO: Implement the backward pass for word embeddings.
    x, shp = cache
    dW = np.zeros(shp)
    np.add.at(dW, x, dout)   # accumulate even when a word index occurs more than once
    return dW
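np.add.at is the crucial detail here: a plain fancy-indexed assignment would silently drop gradient contributions from repeated words. A small illustration:

import numpy as np

dW = np.zeros((3, 2))
x = np.array([1, 1, 2])    # the word at index 1 occurs twice
dout = np.ones((3, 2))
dW[x] += dout              # buffered: the repeated index is counted only once
print(dW[1])               # [1. 1.]
dW = np.zeros((3, 2))
np.add.at(dW, x, dout)     # unbuffered: contributions accumulate correctly
print(dW[1])               # [2. 2.]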
6. The loss function
The loss() function in rnn.py:
def loss(self, features, captions):
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]
    # You'll need this
    mask = (captions_out != self._null)
    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    # Word embedding matrix
    W_embed = self.params['W_embed']
    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the forward and backward passes for the CaptioningRNN.
    h0, cache_affine = affine_forward(features, W_proj, b_proj)                  # (1)
    captions_in_vec, cache_embed = word_embedding_forward(captions_in, W_embed)  # (2)
    if self.cell_type == "rnn":
        h, cache_rnn = rnn_forward(captions_in_vec, h0, Wx, Wh, b)               # (3)
    elif self.cell_type == "lstm":
        h, cache_lstm = lstm_forward(captions_in_vec, h0, Wx, Wh, b)             # (3)
    scores, cache_score = temporal_affine_forward(h, W_vocab, b_vocab)           # (4)
    loss, dscores = temporal_softmax_loss(scores, captions_out, mask)            # (5)
    dh, dW_vocab, db_vocab = temporal_affine_backward(dscores, cache_score)      # (4)
    if self.cell_type == "rnn":
        dcaptions_in_vec, dh0, dWx, dWh, db = rnn_backward(dh, cache_rnn)        # (3)
    elif self.cell_type == "lstm":
        dcaptions_in_vec, dh0, dWx, dWh, db = lstm_backward(dh, cache_lstm)      # (3)
    dW_embed = word_embedding_backward(dcaptions_in_vec, cache_embed)            # (2)
    _, dW_proj, db_proj = affine_backward(dh0, cache_affine)                     # (1)
    grads = {"W_vocab": dW_vocab, "b_vocab": db_vocab,
             "Wx": dWx, "Wh": dWh, "b": db,
             "W_embed": dW_embed, "W_proj": dW_proj, "b_proj": db_proj}
    return loss, grads
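A tiny smoke test exercises the whole pipeline end to end; the vocabulary and dimensions below are made up for illustration:

import numpy as np
from cs231n.classifiers.rnn import CaptioningRNN

word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, 'cat': 3, 'dog': 4}
N, D, T = 2, 20, 5
model = CaptioningRNN(word_to_idx, input_dim=D, wordvec_dim=30,
                      hidden_dim=40, cell_type='rnn', dtype=np.float64)
features = np.random.randn(N, D)
captions = np.random.randint(len(word_to_idx), size=(N, T))
loss, grads = model.loss(features, captions)
print(loss, sorted(grads))  # a scalar loss plus one gradient per parameter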
7. Test-time sampling
The sample() function in rnn.py:
def sample(self, features, max_length=30):
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)
    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
    # TODO: Implement test-time sampling for the model.
    c = np.zeros(b.shape[0] // 4)       # initial LSTM cell state (unused by the vanilla RNN)
    h = features.dot(W_proj) + b_proj   # (1)
    captions[:, 0] = self._start
    for iter_time in range(1, max_length):
        prev_word = captions[:, iter_time-1]
        captions_in_vec, _ = word_embedding_forward(prev_word, W_embed)      # (2)
        if self.cell_type == "rnn":
            h, _ = rnn_step_forward(captions_in_vec, h, Wx, Wh, b)           # (3)
        else:
            h, c, _ = lstm_step_forward(captions_in_vec, h, c, Wx, Wh, b)    # (3)
        scores = np.dot(h, W_vocab) + b_vocab                                # (4)
        captions[:, iter_time] = np.argmax(scores, axis=1)
    return captions
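The argmax makes this greedy decoding: at each step the single most likely word is chosen. The sampled index matrix can be turned back into text with the assignment's decode_captions helper; a sketch, assuming model and data from the earlier snippets with matching feature dimensions:

from cs231n.coco_utils import decode_captions

sample_captions = model.sample(data['val_features'][:4], max_length=15)
for caption in decode_captions(sample_captions, data['idx_to_word']):
    print(caption)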
Part II: LSTM
The steps to complete are laid out in the LSTM_Captioning.ipynb notebook.
1. Single-step forward pass of the LSTM
The lstm_step_forward() function in rnn_layers.py:
def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    next_h, next_c, cache = None, None, None
    # TODO: Implement the forward pass for a single timestep of an LSTM.
    ifog = x.dot(Wx) + prev_h.dot(Wh) + b   # pre-activations of all four gates, shape (N, 4H)
    ifog = getIFOG(ifog, "T")               # apply the gate nonlinearities
    next_c = getIFOG(ifog, "f")*prev_c + getIFOG(ifog, "i")*getIFOG(ifog, "g")
    next_h = getIFOG(ifog, "o")*tanh(next_c)
    cache = (x, prev_h, prev_c, Wx, Wh, next_c, ifog)
    return next_h, next_c, cache
where getIFOG() is a helper that either applies the gate nonlinearities (when called with "T") or slices out one of the four gate blocks:
def getIFOG(ifog, which):
    H = ifog.shape[1] // 4
    indx = {char: i*H for i, char in enumerate("ifog")}
    if which == "t" or which == "T":
        # Transform in place: sigmoid for the i/f/o gates, tanh for the g gate.
        for char in indx:
            if char == "g":
                ifog[:, indx[char]:indx[char]+H] = tanh(ifog[:, indx[char]:indx[char]+H])
            else:
                ifog[:, indx[char]:indx[char]+H] = sigmoid(ifog[:, indx[char]:indx[char]+H])
        return ifog
    else:
        # Slice out the H-wide block of the requested gate.
        return ifog[:, indx[which]:indx[which]+H]
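getIFOG() relies on sigmoid(), which the assignment already ships near the top of rnn_layers.py; it is numerically stable because exp() is only ever evaluated on non-positive arguments:

def sigmoid(x):
    """A numerically stable version of the logistic sigmoid function."""
    pos_mask = (x >= 0)
    neg_mask = (x < 0)
    z = np.zeros_like(x)
    z[pos_mask] = np.exp(-x[pos_mask])   # e^{-x} for x >= 0
    z[neg_mask] = np.exp(x[neg_mask])    # e^{x}  for x < 0
    top = np.ones_like(x)
    top[neg_mask] = z[neg_mask]
    return top / (1 + z)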
2. Single-step backward pass of the LSTM
The lstm_step_backward() function in rnn_layers.py:
def lstm_step_backward(dnext_h, dnext_c, cache):
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    # TODO: Implement the backward pass for a single timestep of an LSTM.
    N, H = dnext_c.shape
    da = np.zeros((N, 4*H))               # gradient wrt the gate pre-activations
    x, prev_h, prev_c, Wx, Wh, next_c, ifog = cache
    tanhc_t = tanh(next_c)
    i = getIFOG(ifog, "i")
    f = getIFOG(ifog, "f")
    o = getIFOG(ifog, "o")
    g = getIFOG(ifog, "g")
    dh_c = dnext_h*o*(1 - tanhc_t**2)     # gradient flowing from next_h into next_c
    setIFOG(da, "i", (dnext_c + dh_c)*g*(1-i)*i)
    setIFOG(da, "f", (dnext_c + dh_c)*prev_c*(1-f)*f)
    setIFOG(da, "o", dnext_h*tanhc_t*(1-o)*o)
    setIFOG(da, "g", (dnext_c + dh_c)*i*(1-g**2))
    dx = da.dot(Wx.T)
    dprev_h = da.dot(Wh.T)
    dprev_c = (dnext_c + dh_c) * f
    dWx = x.T.dot(da)
    dWh = prev_h.T.dot(da)
    db = np.sum(da, axis=0)
    return dx, dprev_h, dprev_c, dWx, dWh, db
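The writes into the four gate slices go through setIFOG(), which is not shown above; a minimal sketch consistent with how it is called:

def setIFOG(ifog, which, val):
    # Write val into the H-wide column block of ifog that belongs to gate `which`.
    H = ifog.shape[1] // 4
    indx = {char: i*H for i, char in enumerate("ifog")}
    ifog[:, indx[which]:indx[which]+H] = val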
The implementation makes one thing clear: in an LSTM, the gradient fed back to the previous timestep includes dprev_c in addition to dprev_h. The dprev_h path is multiplied by the weight matrix Wh at every step, so over many timesteps it easily explodes or vanishes. The dprev_c path, in contrast, involves only an elementwise multiplication by the forget gate, which mitigates the problem.
3. Forward pass of the LSTM
The lstm_forward() function in rnn_layers.py:
def lstm_forward(x, h0, Wx, Wh, b):
    h, cache = None, None
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    # Completed following the rnn_forward pattern above; per the assignment,
    # the initial cell state is zero and is not returned.
    prev_h, prev_c, cache = h0, np.zeros((N, H)), []
    for iter_time in range(T):
        prev_h, prev_c, step_cache = lstm_step_forward(x[:, iter_time, :], prev_h, prev_c, Wx, Wh, b)
        h[:, iter_time, :] = prev_h
        cache.append(step_cache)
    return h, cache
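As with the vanilla RNN, a small shape check (arbitrary illustrative dimensions) confirms that the gate matrices are four times wider and that the cell state stays internal:

import numpy as np

N, T, D, H = 2, 3, 4, 5
x, h0 = np.random.randn(N, T, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, 4*H), np.random.randn(H, 4*H), np.random.randn(4*H)
h, _ = lstm_forward(x, h0, Wx, Wh, b)
assert h.shape == (N, T, H)  # hidden states only; the cell state is not returned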