Machine Learning - How RNN and LSTM Work
- Overview
An RNN is a recurrent neural network. It offers a different way of approaching deep learning: the output at each step depends not only on the input at that step, but also on the inputs and outputs before and after it. This comes up constantly in NLP applications; for example, each output word depends on the content of the whole sentence, not just on a single word. The LSTM is an upgraded version of the RNN: its core idea is the same, but it avoids some of the RNN's weaknesses through the mechanisms described below. Let's walk through the structures of the RNN and the LSTM step by step and analyze how each of them works.
- RNN Explained
To understand the RNN, we first need to look at its structure; then we can explain how it works.
In the figure above, the left-hand diagram shows the overall structure of an RNN, and the right-hand diagram shows the details inside a single RNN cell. From the left-hand diagram we can see that no matter how many times the RNN cell is unrolled, its weights are shared: there is only one copy of each weight matrix. The hidden state at each step (a&lt;t&gt; in the right-hand diagram) is computed from the previous hidden state a&lt;t-1&gt; and the current input x&lt;t&gt;, and is then passed on to the next time step.
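The forward-step code below refers to "the formula given above"; since the original figure is not reproduced here, the update equations are reconstructed from that code, using the same parameter names:

$$a^{\langle t \rangle} = \tanh\left(W_{ax}\, x^{\langle t \rangle} + W_{aa}\, a^{\langle t-1 \rangle} + b_a\right)$$

$$\hat{y}^{\langle t \rangle} = \mathrm{softmax}\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)$$

A single RNN cell implementing these two equations can be written as follows (the numpy import and the softmax helper are added here so the snippet runs on its own):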
import numpy as np

def softmax(x):
    # Numerically stable softmax over the first axis
    # (helper assumed to exist in the original code).
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / np.sum(e_x, axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described above.

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m)
    a_prev -- hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
        Wax -- weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
        ba -- bias, numpy array of shape (n_a, 1)
        by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # Compute the next hidden state using the formula given above
    a_next = np.tanh(Wax.dot(xt) + Waa.dot(a_prev) + ba)
    # Compute the output of the current cell using the formula given above
    yt_pred = softmax(Wya.dot(a_next) + by)

    # Store values needed for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)

    return a_next, yt_pred, cache

np.random.seed(1)
xt_tmp = np.random.randn(3, 10)
a_prev_tmp = np.random.randn(5, 10)
parameters_tmp = {}
parameters_tmp['Waa'] = np.random.randn(5, 5)
parameters_tmp['Wax'] = np.random.randn(5, 3)
parameters_tmp['Wya'] = np.random.randn(2, 5)
parameters_tmp['ba'] = np.random.randn(5, 1)
parameters_tmp['by'] = np.random.randn(2, 1)

a_next_tmp, yt_pred_tmp, cache_tmp = rnn_cell_forward(xt_tmp, a_prev_tmp, parameters_tmp)
print("a_next[4] = ", a_next_tmp[4])
print("a_next.shape = ", a_next_tmp.shape)
print("yt_pred[1] =", yt_pred_tmp[1])
print("yt_pred.shape = ", yt_pred_tmp.shape)
print(a_next_tmp[:, :])
print(a_next_tmp[:, 0])
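The cell above handles a single time step. As a minimal sketch of how the same shared weights are reused at every step, the cell can be chained over a whole sequence; the function rnn_forward below and the input layout (n_x, m, T_x) are illustrative assumptions, not part of the original code:

def rnn_forward(x, a0, parameters):
    # x -- inputs for every time step, shape (n_x, m, T_x)
    # a0 -- initial hidden state, shape (n_a, m)
    n_x, m, T_x = x.shape
    n_a = a0.shape[0]
    n_y = parameters["Wya"].shape[0]
    a = np.zeros((n_a, m, T_x))        # hidden state at every step
    y_pred = np.zeros((n_y, m, T_x))   # prediction at every step
    caches = []
    a_next = a0
    for t in range(T_x):
        # The same "parameters" dictionary is reused at every step: the weights are shared.
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt_pred
        caches.append(cache)
    return a, y_pred, caches

# Example: run the cell above over a sequence of 4 time steps
x_tmp = np.random.randn(3, 10, 4)
a0_tmp = np.random.randn(5, 10)
a_tmp, y_pred_seq_tmp, caches_tmp = rnn_forward(x_tmp, a0_tmp, parameters_tmp)
print("a.shape =", a_tmp.shape)              # (5, 10, 4)
print("y_pred.shape =", y_pred_seq_tmp.shape)  # (2, 10, 4)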
- LSTM Explained
Look carefully at the RNN structure above and ask yourself what its weaknesses are. If the RNN has to loop through many time steps, information can be lost: this is the gradient vanishing problem. Once the gradient vanishes, the network stops learning from earlier steps and effectively degenerates into a standard neural network, and the RNN loses its point. Moreover, the longer the sequence (that is, the more loop iterations), the more likely gradient vanishing becomes. At this point we need to improve the RNN so that the new structure can not only keep learning but also remember, holding on to the important things it has learned. That is exactly the step from the RNN to the LSTM (Long Short-Term Memory). To explain the LSTM network structure, let's again start from its structure diagram and then walk through it.
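A back-of-the-envelope way to see why longer sequences make vanishing gradients more likely (a toy calculation, not tied to any particular network): backpropagation through time multiplies roughly one factor per step, and if that factor is typically smaller than 1 the product shrinks exponentially with the sequence length.

# Toy illustration: a gradient scaled by ~0.9 at every time step
# all but disappears after a long enough sequence.
factor = 0.9
for T in (10, 50, 100):
    print(T, factor ** T)
# 10  -> ~0.349
# 50  -> ~0.00515
# 100 -> ~0.0000266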
The figure above shows the basic structure of an LSTM cell; some unimportant elements have been left out so that the key parts stand out. Compared with the RNN, we now have three gates and a memory cell C&lt;t&gt;, which is also called the internal hidden state. So what do these three gates do? The first is the forget gate, which helps the memory cell delete (or filter out) unimportant information. Its values lie in the interval [0, 1], are usually produced by a sigmoid function, and are multiplied element-wise with C: values near 0 erase the corresponding information, and values near 1 keep it. The second is the update gate. It works together with the candidate memory cell to produce new information: the two are multiplied element-wise, and the result is added element-wise to the memory cell that has already passed through the forget gate, which effectively adds what has been learned at the current time step to the memory cell. The third is the output gate, which, as the name suggests, filters the hidden state we output. This gate is also a sigmoid, determined jointly by the previous hidden state a&lt;t-1&gt; and the current input X&lt;t&gt;. Multiplying it element-wise with the tanh of the memory cell (after the forget and update gates have been applied) yields the hidden state of the current time step, a&lt;t&gt;, and at the same time we obtain the memory cell value for the current time step. From this we can also see that the output hidden state a&lt;t&gt; and the internal hidden state (the memory cell) have the same dimension. That covers the structure and function inside a single LSTM cell. To deepen the understanding, I will again demonstrate in code how to build an LSTM cell; the update equations and the code are shown below:
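The equations below are reconstructed from the forward-step code that follows, using its variable names (ft, it, ot are the forget, update, and output gates; cct is the candidate value c tilde; [a&lt;t-1&gt;, x&lt;t&gt;] denotes the concatenation of the previous hidden state and the current input):

$$f^{\langle t \rangle} = \sigma\left(W_f\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right)$$
$$i^{\langle t \rangle} = \sigma\left(W_i\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_i\right)$$
$$\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right)$$
$$c^{\langle t \rangle} = f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + i^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle}$$
$$o^{\langle t \rangle} = \sigma\left(W_o\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right)$$
$$a^{\langle t \rangle} = o^{\langle t \rangle} \odot \tanh\left(c^{\langle t \rangle}\right)$$
$$\hat{y}^{\langle t \rangle} = \mathrm{softmax}\left(W_y\, a^{\langle t \rangle} + b_y\right)$$

Note the additive form of the cell-state update: when the forget gate stays close to 1, information (and gradient) can flow across many time steps largely unchanged, which is what helps the LSTM avoid the vanishing-gradient problem described above.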
def sigmoid(x):
    # Logistic sigmoid (helper assumed available in the original code;
    # np and softmax are defined in the RNN snippet above).
    return 1 / (1 + np.exp(-x))

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implements a single forward step of the LSTM-cell as described above.

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m)
    a_prev -- hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
        Wf -- weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
        bf -- bias of the forget gate, numpy array of shape (n_a, 1)
        Wi -- weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
        bi -- bias of the update gate, numpy array of shape (n_a, 1)
        Wc -- weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
        bc -- bias of the first "tanh", numpy array of shape (n_a, 1)
        Wo -- weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
        bo -- bias of the output gate, numpy array of shape (n_a, 1)
        Wy -- weight matrix relating the hidden state to the output, numpy array of shape (n_y, n_a)
        by -- bias relating the hidden state to the output, numpy array of shape (n_y, 1)

    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains
             (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    Note: ft/it/ot stand for the forget/update/output gates, cct stands for
    the candidate value (c tilde), c stands for the cell state (memory).
    """
    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"]  # forget gate weight
    bf = parameters["bf"]
    Wi = parameters["Wi"]  # update gate weight
    bi = parameters["bi"]
    Wc = parameters["Wc"]  # candidate value weight
    bc = parameters["bc"]
    Wo = parameters["Wo"]  # output gate weight
    bo = parameters["bo"]
    Wy = parameters["Wy"]  # prediction weight
    by = parameters["by"]

    # Retrieve dimensions from the shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    # Concatenate a_prev and xt into a single (n_a + n_x, m) matrix
    concat = np.concatenate((a_prev, xt), axis=0)

    # Compute the gates, the candidate value, the new cell state and the new hidden state
    ft = sigmoid(Wf.dot(concat) + bf)    # forget gate
    it = sigmoid(Wi.dot(concat) + bi)    # update gate
    cct = np.tanh(Wc.dot(concat) + bc)   # candidate value
    c_next = c_prev * ft + cct * it      # cell state
    ot = sigmoid(Wo.dot(concat) + bo)    # output gate
    a_next = ot * np.tanh(c_next)        # hidden state

    # Compute the prediction of the LSTM cell
    yt_pred = softmax(Wy.dot(a_next) + by)

    # Store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

np.random.seed(1)
xt_tmp = np.random.randn(3, 10)
a_prev_tmp = np.random.randn(5, 10)
c_prev_tmp = np.random.randn(5, 10)
parameters_tmp = {}
parameters_tmp['Wf'] = np.random.randn(5, 5 + 3)
parameters_tmp['bf'] = np.random.randn(5, 1)
parameters_tmp['Wi'] = np.random.randn(5, 5 + 3)
parameters_tmp['bi'] = np.random.randn(5, 1)
parameters_tmp['Wo'] = np.random.randn(5, 5 + 3)
parameters_tmp['bo'] = np.random.randn(5, 1)
parameters_tmp['Wc'] = np.random.randn(5, 5 + 3)
parameters_tmp['bc'] = np.random.randn(5, 1)
parameters_tmp['Wy'] = np.random.randn(2, 5)
parameters_tmp['by'] = np.random.randn(2, 1)

a_next_tmp, c_next_tmp, yt_tmp, cache_tmp = lstm_cell_forward(xt_tmp, a_prev_tmp, c_prev_tmp, parameters_tmp)
print("a_next[4] = \n", a_next_tmp[4]) print("a_next.shape = ", c_next_tmp.shape) print("c_next[2] = \n", c_next_tmp[2]) print("c_next.shape = ", c_next_tmp.shape) print("yt[1] =", yt_tmp[1]) print("yt.shape = ", yt_tmp.shape) print("cache[1][3] =\n", cache_tmp[1][3]) print("len(cache) = ", len(cache_tmp))
- Summary
The two sections above introduced the structures of the RNN and the LSTM and analyzed the functions and data flow inside each of them, with Python code after each part showing how to build an RNN cell and an LSTM cell. You can think of the LSTM as an optimization of the RNN, and it is important to understand why that optimization is needed. More important still is understanding the new way of framing problems that the RNN brings: the clearest difference from the standard neural networks we have seen before is that in those networks, regressors, or classifiers, every output depends only on the input features, whereas in an RNN an output depends not only on the current input but also on the inputs that came before it. This is exactly the situation in sequence models, and applications such as language modeling and machine translation all build on the RNN idea.