
Assignment | 05-week1 - Character level language model

This series only adds my personal study notes to the homework sections of the original course; if there are any mistakes, corrections are welcome. - ZJ

Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!



Luckily you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find, and compiled them into this dataset. (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character-level language model to generate new names. Your algorithm will learn the different name patterns, and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs’ wrath!


By completing this assignment you will learn:

  • How to store text data for processing using an RNN
  • How to synthesize data, by sampling predictions at each time step and passing them to the next RNN-cell unit
  • How to build a character-level text generation recurrent neural network
  • Why clipping the gradients is important (it prevents exploding gradients)

We will begin by loading in some functions that we have provided for you in rnn_utils. Specifically, you have access to functions such as rnn_forward and rnn_backward which are equivalent to those you’ve implemented in the previous assignment.


import numpy as np
from utils import *
import random
'''The code in utils:'''

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def smooth(loss, cur_loss):
    # Exponentially weighted moving average of the loss (for smoother reporting during training)
    return loss * 0.999 + cur_loss * 0.001

def print_sample(sample_ix, ix_to_char):
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    txt = txt[0].upper() + txt[1:]  # capitalize first character 
    print ('%s' % (txt, ), end='')

def get_initial_loss(vocab_size, seq_length):
    # Cross-entropy loss if the model predicts a uniform distribution over the vocabulary,
    # e.g. -log(1/27) * 7 ≈ 23.07 for a 27-character vocabulary and a 7-character name
    return -np.log(1.0/vocab_size)*seq_length


def initialize_parameters(n_a, n_x, n_y):
    """
    Initialize parameters with small random values

    Returns:
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    """
    np.random.seed(1)
    Wax = np.random.randn(n_a, n_x)*0.01 # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01 # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01 # hidden to output
    b = np.zeros((n_a, 1)) # hidden bias
    by = np.zeros((n_y, 1)) # output bias

    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}

    return parameters

def rnn_step_forward(parameters, a_prev, x):

    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # hidden state
    p_t = softmax(np.dot(Wya, a_next) + by) # probabilities for next chars

    return a_next, p_t

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):

    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients

def update_parameters(parameters, gradients, lr):

    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    return parameters

def rnn_forward(X, Y, a0, parameters, vocab_size = 27):

    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}

    a[-1] = np.copy(a0)

    # initialize your loss to 0
    loss = 0

    for t in range(len(X)):

        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector. 
        x[t] = np.zeros((vocab_size,1)) 
        if (X[t] != None):
            x[t][X[t]] = 1

        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])

        # Update the loss by adding the cross-entropy term of this time-step (i.e. subtracting the log-probability of the correct character).
        loss -= np.log(y_hat[t][Y[t],0])

    cache = (y_hat, a, x)

    return loss, cache

def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}

    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']

    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])

    ### START CODE HERE ###
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    ### END CODE HERE ###

    return gradients, a

1 - Problem Statement

1.1 - Dataset and Preprocessing

Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size.


data = open('dinos.txt', 'r').read()
data = data.lower() # convert to lowercase
chars = list(set(data)) # convert to a set to remove duplicates, then back to a list
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
There are 19909 total characters and 27 unique characters in your data.

The characters are a-z (26 characters) plus the “\n” (or newline character), which in this assignment plays a role similar to the <EOS> (or “End of sentence”) token we discussed in lecture, only here it indicates the end of the dinosaur name rather than the end of a sentence. In the cell below, we create a python dictionary (i.e., a hash table) to map each character to an index from 0-26. We also create a second python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer. Below, char_to_ix and ix_to_char are the python dictionaries.


char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)
print(char_to_ix)
{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
{'\n': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}

1.2 - Overview of the model

Your model will have the following structure:

  • Initialize parameters
  • Run the optimization loop
    • Forward propagation to compute the loss function
    • Backward propagation to compute the gradients with respect to the loss function
    • Clip the gradients to avoid exploding gradients
    • Using the gradients, update your parameters with the gradient descent update rule (see the sketch after this list).
  • Return the learned parameters
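
To make this loop concrete, here is a minimal sketch of a single optimization step that chains together the utility functions listed above (rnn_forward, rnn_backward, update_parameters) and the clip function you will implement in section 2.1. The clipping threshold and learning rate below are assumed values for illustration only.

def optimize_step(X, Y, a_prev, parameters, learning_rate=0.01):
    # Forward propagate through the sequence to get the loss and cached activations
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    # Backpropagate through time to get the gradients
    gradients, a = rnn_backward(X, Y, parameters, cache)
    # Clip the gradients to an assumed range [-5, 5] to avoid exploding gradients
    gradients = clip(gradients, 5)
    # Gradient descent parameter update
    parameters = update_parameters(parameters, gradients, learning_rate)
    # Return the loss and the last hidden state (used to initialize the next step)
    return loss, a[len(X) - 1]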



Figure 1: Recurrent Neural Network, similar to what you had built in the previous notebook “Building a RNN - Step by Step”.

At each time-step, the RNN tries to predict what the next character is, given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \dots, x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \dots, y^{\langle T_x \rangle})$ is such that at every time-step $t$, we have $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.
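
For example, a single training pair (X, Y) can be built from one dinosaur name using the char_to_ix mapping from section 1.1. The sketch below is for illustration only (the name is hypothetical and this is not necessarily the assignment's exact preprocessing code); it is consistent with rnn_forward above, where X[0] = None becomes the zero input vector, and with "\n" acting as the end-of-name token.

name = "mosasaurus"                              # hypothetical example name
X = [None] + [char_to_ix[ch] for ch in name]     # None -> dummy first input
Y = X[1:] + [char_to_ix["\n"]]                   # Y is X shifted left by one step, ending with "\n"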

2 - Building blocks of the model

In this part, you will build two important blocks of the overall model:
- Gradient clipping: to avoid exploding gradients
- Sampling: a technique used to generate characters

You will then apply these two functions to build the model.

2.1 - Clipping the gradients in the optimization loop

In this section you will implement the clip function that you will call inside of your optimization loop. Recall that your overall loop structure usually consists of a forward pass, a cost computation, a backward pass, and a parameter update. Before updating the parameters, you will perform gradient clipping when needed to make sure that your gradients are not “exploding,” meaning taking on overly large values.


The loop consists of four parts: forward propagation, loss computation, backward propagation, and the parameter update. Gradient clipping is performed before the parameters are updated.

In the exercise below, you will implement a function clip that takes in a dictionary of gradients and returns a clipped version of gradients if needed. There are different ways to clip gradients; we will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie between some range [-N, N]. More generally, you will provide a maxValue (say 10). In this example, if any component of the gradient vector is greater than 10, it would be set to 10; and if any component of the gradient vector is less than -10, it would be set to -10. If it is between -10 and 10, it is left alone.


We use the simplest element-wise clipping procedure, keeping every value inside the interval [-N, N]: with a maxValue of 10, any value greater than 10 is set to 10, and likewise anything below -10 is set to -10.


Figure 2: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into slight “exploding gradient” problems.

Exercise: Implement the function below to return the clipped gradients of your dictionary gradients. Your function takes in a maximum threshold and returns the clipped versions of your gradients. You can check out this hint for examples of how to clip in numpy. You will need to use the argument out = ....

### GRADED FUNCTION: clip

def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.

    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue

    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''

    ### START CODE HERE ###
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for name,val in gradients.items():
        gradients[name] = np.clip(val, -maxValue, maxValue, out=gradients[name])
    ### END CODE HERE ###

    return gradients

So the gradient clipping used here simply imposes a maximum/minimum range, cutting off the values that are too large or too small.

np.random.seed(3)
dWax = np.random.randn(5,3)*10
dWaa = np.random.randn(5,5)*10
dWya = np.random.randn(2,5)*10
db = np.random.randn(5,1)*10
dby = np.random.randn(2,1)*10
gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}
gradients = clip(gradients, 10)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])
gradients["dWaa"][1][2] = 10.0
gradients["dWax"][3][1] = -10.0
gradients["dWya"][1][2] = 0.2971381536101662
gradients["db"][4] = [10.]
gradients["dby"][1] = [8.45833407]

**Expected output:**

**gradients["dWaa"][1][2]**: 10.0
**gradients["dWax"][3][1]**: -10.0
**gradients["dWya"][1][2]**: 0.29713815361
**gradients["db"][4]**: [ 10.]
**gradients["dby"][1]**: [ 8.45833407]

2.2 - Sampling

Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:


Figure 3: In this picture, we assume the model is already trained. We pass in $x^{\langle 1 \rangle} = \vec{0}$ at the first time step, and have the network then sample one character at a time.

Exercise: Implement the sample function below to sample characters. You need to carry out 4 steps:

  • Step 1: Pass the network the first “dummy” input $x^{\langle 1 \rangle} = \vec{0}$ (the vector of zeros). This is the default input before we’ve generated any characters. We also set $a^{\langle 0 \rangle} = \vec{0}$.

  • Step 2: Run one step of forward propagation to get $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$. Here are the equations:
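
These equations are consistent with the rnn_step_forward function shown in the utils code above (writing $z^{\langle t+1 \rangle}$ for the pre-softmax activation):

$$a^{\langle t+1 \rangle} = \tanh(W_{ax} x^{\langle t+1 \rangle} + W_{aa} a^{\langle t \rangle} + b)$$

$$z^{\langle t+1 \rangle} = W_{ya} a^{\langle t+1 \rangle} + b_y$$

$$\hat{y}^{\langle t+1 \rangle} = \mathrm{softmax}(z^{\langle t+1 \rangle})$$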