Theano Tutorial: Recurrent Neural Networks with Word Embeddings (Translation)

阿新 • Published: 2019-02-18

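I have been learning Git recently, so this is a good chance to put it into practice. After reading up on the principles of deep learning I had a general understanding, but working through the Theano code myself leaves a deeper impression, so I rewrote the code from deeplearning.net. The comments are a translation of the original text plus my own understanding. Anyone interested is welcome to join in. If you have questions, feel free to contact me. Email: [email protected] QQ: 3062984605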

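Overview

In this tutorial, you will learn about:

Word embeddings
Recurrent neural network architectures
Context windows

These are combined to perform Semantic Parsing / Slot-Filling (natural language understanding).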

Code - Citations - Contact

Code

Papers

If you use this tutorial, please cite the following papers:

Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding. Interspeech, 2013.
Gokhan Tur, Dilek Hakkani-Tur and Larry Heck. What is left to be understood in ATIS?

Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken language understanding. Interspeech, 2007.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

Thank you!

Contact

Please email Grégoire Mesnil (first-add-a-dot-last-add-at-gmail-add-a-dot-com) with any questions. We are glad to receive your feedback.

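Task

Slot-Filling (Spoken Language Understanding) assigns a label to each word of a given sentence. It is therefore a classification task.

Dataset

The dataset is a small one from DARPA: ATIS (Airline Travel Information System). Here is an example sentence with its labels in Inside Outside Beginning (IOB) representation.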

Input (words)	show	flights	from	Boston	to	New	York	today
Output (labels)	O	O	O	B-dept	O	B-arr	I-arr	B-date
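To make the IOB convention concrete, here is a minimal sketch that groups the labelled words of the example above into slots. The function name `iob_to_slots` is hypothetical and not part of the tutorial code; a stray `I-` tag with no opening `B-` tag is simply dropped, which is a simplifying assumption.

```python
def iob_to_slots(words, labels):
    """Group IOB-labelled words into (slot_name, phrase) pairs."""
    slots = []
    current_name, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new slot, closing any open one.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = label[2:], [word]
        elif label.startswith("I-") and current_name == label[2:]:
            # An I- tag extends the slot that is currently open.
            current_words.append(word)
        else:
            # O (or a stray I- tag) closes any open slot.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


words = ["show", "flights", "from", "Boston", "to", "New", "York", "today"]
labels = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]
print(iob_to_slots(words, labels))
# [('dept', 'Boston'), ('arr', 'New York'), ('date', 'today')]
```

The `B-arr` / `I-arr` pair shows why the scheme needs both tags: without the `I-` continuation, "New" and "York" would be read as two separate arrival cities.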

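The official ATIS split contains 4,978 sentences (56,590 words) in the training set and 893 sentences (9,198 words) in the test set; the average sentence length is 15. There are 128 classes (distinct slots), including the O label (NULL).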
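Following Microsoft Research, any word that occurs only once in the training set is marked as <UNK>, and the same token stands in for test-set words that were never seen in training. Following Ronan Collobert and colleagues, each digit is replaced with the string DIGIT, e.g. 1984 becomes DIGITDIGITDIGITDIGIT.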
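The two preprocessing rules just described can be sketched as follows. This is an illustration only: the helper names `normalize_digits`, `build_vocab` and `encode` are hypothetical, and the ATIS files shipped with the tutorial are already preprocessed.

```python
import re
from collections import Counter


def normalize_digits(word):
    # Replace every digit character with the string DIGIT.
    return re.sub(r"\d", "DIGIT", word)


def build_vocab(train_sentences):
    # Keep only words that occur more than once in the training set.
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > 1}


def encode(sentence, vocab):
    # Words outside the vocabulary (singletons or unseen) become <UNK>.
    return [w if w in vocab else "<UNK>" for w in sentence]


train = [["flights", "to", "boston"], ["flights", "to", "denver"]]
train = [[normalize_digits(w) for w in s] for s in train]
vocab = build_vocab(train)
print(encode(["flights", "to", "tacoma"], vocab))   # ['flights', 'to', '<UNK>']
print(normalize_digits("1984"))                      # DIGITDIGITDIGITDIGIT
```

Mapping training singletons to `<UNK>` gives the model some examples of the token during training, so that it has a learned embedding to fall back on for unseen test words.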
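We split the official training set into a training set and a validation set containing 80% and 20% of the training sentences respectively. Because the dataset is small, a performance difference must exceed 0.6% in F1 measure to be significant at the 95% level. For evaluation, experiments report three metrics: precision, recall, and F1 score.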

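Recurrent Neural Network Model

Raw input encoding

A token corresponds to a word. Each token in the ATIS vocabulary is associated with an index. Each sentence is an array of indexes (int32), and each set (train, valid, test) is a list of such arrays. A Python dictionary maps indexes back to words.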

>>> sentence
array([383, 189,  13, 193, 208, 307, 195, 502, 260, 539,
        7,  60,  72, 8, 350, 384], dtype=int32)
>>> [index2word[x] for x in sentence]
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
        'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock", 'pm']

The same approach is used for the labels:

>>> labels
array([126, 126, 126, 126, 126,  48,  50, 126,  78, 123,  81, 126,  15,
        14,  89,  89], dtype=int32)
>>> [index2label[x] for x in labels]
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'B-fromloc.state_name',
        'O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name',
        'O', 'B-arrive_time.time_relative', 'B-arrive_time.time',
        'I-arrive_time.time', 'I-arrive_time.time']
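For readers who want to see how such an encoding could be produced, here is a small sketch that builds the index dictionaries from tokenized sentences. It assumes indexes are assigned in order of first appearance, which is not necessarily how the ATIS dictionaries were built; the toy sentences are made up.

```python
import numpy as np

sentences = [["please", "find", "a", "flight"], ["find", "a", "flight"]]

# Assign each new word the next free index.
word2index = {}
for sent in sentences:
    for w in sent:
        word2index.setdefault(w, len(word2index))

# Reverse mapping, used to turn indexes back into words.
index2word = {i: w for w, i in word2index.items()}

# Each sentence becomes an int32 array of indexes.
encoded = [np.array([word2index[w] for w in s], dtype=np.int32) for s in sentences]
print(encoded[1])                               # [1 2 3]
print([index2word[i] for i in encoded[1]])      # ['find', 'a', 'flight']
```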

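Context window

Given a sentence (an array of indexes) and a window size (1, 3, 5, ...), each word in the sentence is converted into the context window of words surrounding it. The implementation is as follows: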

def contextwin(l, win):
    '''
    win :: int corresponding to the size of the window
    given a list of indexes composing a sentence

    l :: array containing the word indexes

    it will return a list of list of indexes corresponding
    to context windows surrounding each word in the sentence
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out

The index -1 is the PADDING index, inserted at the beginning and end of the sentence.
Here is an example:

>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
 [ 0, 1, 2],
 [ 1, 2, 3],
 [ 2, 3, 4],
 [ 3, 4,-1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]

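To summarize, the input is an array of indexes and the output is a matrix of indexes. Each row is the context window surrounding a given word.

Word embeddings

Once the sentence has been converted into context windows, i.e. a matrix of indexes, the next step is to associate these indexes with word embeddings. Using Theano, this gives: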

import theano, numpy
from theano import tensor as T

# nv :: size of our vocabulary
# de :: dimension of the embedding space
# cs :: context window size (7, to match the contextwin(sample, 7) example below)
nv, de, cs = 1000, 50, 7

embeddings = theano.shared(0.2 * numpy.random.uniform(-1.0, 1.0,
    (nv+1, de)).astype(theano.config.floatX)) # add one for PADDING at the end

idxs = T.imatrix() # as many columns as words in the context window and as many lines as words in the sentence
# index the shared variable defined above (there is no `self` outside a class)
x    = embeddings[idxs].reshape((idxs.shape[0], de*cs))

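The symbolic variable x corresponds to a matrix of shape (number of words in the sentence, embedding dimension × context window size).
Let's compile a Theano function to compute it: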

>>> sample
array([0, 1, 2, 3, 4], dtype=int32)
>>> csample = contextwin(sample, 7)
>>> csample
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]
>>> f = theano.function(inputs=[idxs], outputs=x)
>>> f(csample)
array([[-0.08088442,  0.08458307,  0.05064092, ...,  0.06876887,
        -0.06648078, -0.15192257],
       [-0.08088442,  0.08458307,  0.05064092, ...,  0.11192625,
         0.08745284,  0.04381778],
       [-0.08088442,  0.08458307,  0.05064092, ..., -0.00937143,
         0.10804889,  0.1247109 ],
       [ 0.11038255, -0.10563177, -0.18760249, ..., -0.00937143,
         0.10804889,  0.1247109 ],
       [ 0.18738101,  0.14727569, -0.069544  , ..., -0.00937143,
         0.10804889,  0.1247109 ]], dtype=float32)
>>> f(csample).shape
(5, 350)

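We now have a sequence (of length 5, the sentence length) of context-window word embeddings, which is well suited as input to a recurrent neural network.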
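The lookup-and-reshape step can be reproduced in plain NumPy, which makes the shapes easier to inspect. This is a sketch with random values, assuming a window size of 7 to match the example output shape of (5, 350); it also shows why the embedding matrix has one extra row: the PADDING index -1 selects the last row.

```python
import numpy as np

nv, de, cs = 1000, 50, 7            # vocabulary size, embedding dim, window size
rng = np.random.RandomState(0)
embeddings = 0.2 * rng.uniform(-1.0, 1.0, (nv + 1, de))   # last row = PADDING

idxs = np.array([[-1, -1, -1, 0, 1, 2, 3],
                 [-1, -1,  0, 1, 2, 3, 4]], dtype=np.int32)

# Integer-array indexing gives shape (n_words, cs, de); index -1 selects the last row.
looked_up = embeddings[idxs]
# Concatenate the cs embeddings of each window into one long row.
x = looked_up.reshape((idxs.shape[0], de * cs))

print(looked_up.shape)   # (2, 7, 50)
print(x.shape)           # (2, 350)
# The first de values of row 0 are the PADDING vector.
assert np.array_equal(x[0, :de], embeddings[-1])
```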

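Elman recurrent neural network

An Elman recurrent neural network (E-RNN) takes as input the current input (time t) and the previous hidden state (time t-1), and then iterates.
In the previous section, we constructed the input so that it has a temporal structure: in the matrix above, row 0 corresponds to time step t=0, row 1 to time step t=1, and so on.
The parameters of the E-RNN to be learned are:

the word embeddings (real-valued matrix)
the initial hidden state (real-valued vector)
two matrices for the linear projection of the input at time t and of the hidden state at time t-1
(optional) bias. Recommendation: don't use it.
a softmax classification layer on top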
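As a rough illustration of how these parameters interact, the forward pass can be written as an explicit loop in NumPy. This is a sketch under stated assumptions: the parameter names (`wx`, `wh`, `w`, `bh`, `b`, `h0`) follow the tutorial's class, the sigmoid nonlinearity matches its recurrence, and the toy sizes are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()


def elman_forward(x, wx, wh, w, bh, b, h0):
    """x: (n_words, n_in). Returns class probabilities of shape (n_words, nc)."""
    h_prev, probs = h0, []
    for x_t in x:                        # row t is the input at time t
        h_t = sigmoid(x_t @ wx + h_prev @ wh + bh)
        probs.append(softmax(h_t @ w + b))
        h_prev = h_t                     # carried over to time t+1
    return np.array(probs)


rng = np.random.RandomState(0)
n_words, n_in, nh, nc = 5, 12, 4, 3      # toy sizes, chosen arbitrarily
x = rng.uniform(-1, 1, (n_words, n_in))
p = elman_forward(x,
                  wx=0.2 * rng.uniform(-1, 1, (n_in, nh)),
                  wh=0.2 * rng.uniform(-1, 1, (nh, nh)),
                  w=0.2 * rng.uniform(-1, 1, (nh, nc)),
                  bh=np.zeros(nh), b=np.zeros(nc), h0=np.zeros(nh))
print(p.shape)            # (5, 3)
print(p.sum(axis=1))      # each row sums to 1
```

The only state shared between time steps is `h_prev`, which is what lets the label of a word depend on the words before it.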

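The hyperparameters of the network are:

dimension of the word embeddings
size of the vocabulary
number of hidden units
number of classes
random seed used to initialize the model

This gives the following code: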

class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                 (ne+1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                               (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                               dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))

        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w,
                       self.bh, self.b, self.h0]

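The following code builds the input from the embedding matrix:

        idxs = T.imatrix()
        x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
        y_sentence = T.ivector('y_sentence')  # labels

We then call the scan function to build the recurrence, which works like magic: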

        def recurrence(x_t, h_tm1):
            h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                                 + T.dot(h_tm1, self.wh) + self.bh)
            s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
            return [h_t, s_t]

        [h, s], _ = theano.scan(fn=recurrence,
                                sequences=x,
                                outputs_info=[self.h0, None],
                                n_steps=x.shape[0])

        p_y_given_x_sentence = s[:, 0, :]
        y_pred = T.argmax(p_y_given_x_sentence, axis=1)
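The slice `s[:, 0, :]` can look puzzling. As far as I understand it, `T.nnet.softmax` applied to a vector returns a one-row matrix, so stacking the per-step outputs yields a 3-D tensor with a singleton middle axis. The NumPy sketch below only mimics those shapes with random probabilities; it is not the Theano computation itself.

```python
import numpy as np

n_words, nc = 5, 3
# Stand-in for the stacked scan output: one (1, nc) row matrix per time step.
s = np.random.RandomState(0).dirichlet(np.ones(nc), size=(n_words, 1))
print(s.shape)            # (5, 1, 3)

p_y_given_x_sentence = s[:, 0, :]      # drop the singleton middle axis
print(p_y_given_x_sentence.shape)      # (5, 3)

y_pred = p_y_given_x_sentence.argmax(axis=1)   # one label index per word
print(y_pred.shape)       # (5,)
```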

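Theano then computes all the gradients automatically in order to maximize the log-likelihood, i.e. minimize the negative log-likelihood:

        # requires `from collections import OrderedDict` at module level
        lr = T.scalar('lr')

        sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                               [T.arange(x.shape[0]), y_sentence])
        sentence_gradients = T.grad(sentence_nll, self.params)
        sentence_updates = OrderedDict((p, p - lr*g)
                                       for p, g in
                                       zip(self.params, sentence_gradients))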
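To see what the loss expression computes, here is a NumPy version of the sentence-level negative log-likelihood together with the plain SGD rule. The function names are hypothetical, and the gradients, which Theano derives symbolically, are simply passed in here.

```python
import numpy as np


def sentence_nll(p_y_given_x, y):
    """Mean negative log-probability of the true label at each position."""
    n = p_y_given_x.shape[0]
    # Row i, column y[i] picks the probability assigned to the true label of word i.
    return -np.mean(np.log(p_y_given_x)[np.arange(n), y])


def sgd_step(params, grads, lr):
    # Same rule as the update dictionary: p <- p - lr * g for every parameter.
    return [p - lr * g for p, g in zip(params, grads)]


p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(sentence_nll(p, y))     # -(log 0.7 + log 0.8) / 2, about 0.29
```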

Next, compile the functions:

        self.classify = theano.function(inputs=[idxs], outputs=y_pred)
        self.sentence_train = theano.function(inputs=[idxs, y_sentence, lr],
                                              outputs=sentence_nll,
                                              updates=sentence_updates)

After each update, the word embeddings need to be normalized:

        self.normalize = theano.function(inputs=[],
                                         updates={self.emb:
                                                  self.emb /
                                                  T.sqrt((self.emb**2)
                                                  .sum(axis=1))
                                                  .dimshuffle(0, 'x')})
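The normalization divides every embedding row by its L2 norm. A NumPy equivalent, on a made-up toy matrix, could look like this:

```python
import numpy as np

rng = np.random.RandomState(0)
emb = 0.2 * rng.uniform(-1.0, 1.0, (6, 4))     # toy (ne+1, de) matrix

# Divide each row by its L2 norm; [:, None] plays the role of dimshuffle(0, 'x').
norms = np.sqrt((emb ** 2).sum(axis=1))
emb_normalized = emb / norms[:, None]

print(np.linalg.norm(emb_normalized, axis=1))   # all ones, up to rounding
```

Without the extra axis the division would try to broadcast a length-6 vector against the 4 columns, which is why the Theano code reshapes the norms into a column first.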

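And that's it!

Evaluation

With the functions defined above, you can compare the predicted labels with the true labels and compute the metrics. The GitHub repository provides a wrapper around the conlleval script. Computing these metrics is not trivial because of the Inside Outside Beginning (IOB) representation: a prediction is counted as correct only if the beginning, inside, and end of the slot are all predicted correctly. Note that the script is distributed with a txt extension, which has to be changed to pl before it is run.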
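The idea of scoring whole slots rather than individual words can be sketched in a few lines of Python 3. This is a simplified stand-in for conlleval, not a reimplementation: the function names are hypothetical, and an `I-` tag with no preceding `B-` tag is ignored here, whereas the real script has its own rules for such edge cases.

```python
def chunks(labels):
    """Return the set of (slot_name, start, end) spans in an IOB label sequence."""
    spans, name, start = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):   # sentinel closes the last span
        inside = label.startswith("I-") and name == label[2:]
        if name is not None and not inside:
            spans.add((name, start, i))
            name = None
        if label.startswith("B-"):
            name, start = label[2:], i
    return spans


def chunk_f1(true_labels, pred_labels):
    true_spans, pred_spans = chunks(true_labels), chunks(pred_labels)
    correct = len(true_spans & pred_spans)     # spans must match exactly
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


truth = ["O", "B-arr", "I-arr", "B-date"]
guess = ["O", "B-arr", "O", "B-date"]
print(chunk_f1(truth, guess))   # (0.5, 0.5, 0.5)
```

In this example three of the four word-level labels are right, yet the `arr` slot counts as wrong because its span was cut short, which is exactly why word-level accuracy overstates slot-filling quality.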

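Training

Updates

For stochastic gradient descent (SGD) updates, we treat a whole sentence as a mini-batch and perform one update per sentence. Pure SGD (as opposed to mini-batch) would instead perform one update per word.
After each iteration/update, the word embeddings are normalized so that they stay on the unit sphere.

Stopping criterion

Early stopping on a validation set is our regularization technique: training runs for a given number of epochs, the F1 score is computed on the validation set after each epoch, and the best model is kept.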
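The early-stopping procedure described above amounts to the following loop. This is a minimal sketch: `train_epoch` and `validation_f1` are hypothetical callables standing in for the real training and evaluation code, and the toy model and scores are made up.

```python
import copy


def train_with_early_stopping(model, train_epoch, validation_f1, n_epochs):
    """train_epoch and validation_f1 are hypothetical callables taking the model."""
    best_f1, best_model, best_epoch = float("-inf"), None, None
    for epoch in range(n_epochs):
        train_epoch(model)                 # one pass over the training set
        f1 = validation_f1(model)          # score on the held-out validation set
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
            best_model = copy.deepcopy(model)   # snapshot the best parameters
    return best_model, best_epoch, best_f1


# Toy stand-ins: the "model" is a dict counting epochs; the score peaks on the third epoch.
scores = [90.0, 93.5, 96.1, 95.8, 95.0]
model = {"epochs_seen": 0}


def train_epoch(m):
    m["epochs_seen"] += 1


def validation_f1(m):
    return scores[m["epochs_seen"] - 1]


best_model, best_epoch, best_f1 = train_with_early_stopping(
    model, train_epoch, validation_f1, n_epochs=len(scores))
print(best_epoch, best_f1, best_model["epochs_seen"])   # 2 96.1 3
```

Training still runs for all epochs; only the snapshot taken at the best validation score is returned.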

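Hyperparameter selection

Although there is interesting research and code on hyperparameter selection, here we use KISS random search.
The following ranges are suggested starting points:

learning rate: uniform([0.05, 0.01])
window size: random value from {3,…,19}
number of hidden units: random value from {100,200}
embedding dimension: random value from {50,100}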
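A random search over these ranges could be sketched as below. Several choices here are my assumptions rather than the tutorial's: the dictionary keys are hypothetical names, {100,200} and {50,100} are read as two-element sets, and only odd window sizes are drawn because contextwin asserts that the window size is odd.

```python
import random


def sample_hyperparameters(rng):
    return {
        "lr": rng.uniform(0.01, 0.05),              # learning rate
        "win": rng.choice(range(3, 20, 2)),         # odd sizes only: 3, 5, ..., 19
        "nhidden": rng.choice([100, 200]),          # hidden units
        "emb_dimension": rng.choice([50, 100]),     # embedding size
    }


rng = random.Random(0)       # fixed seed so that a search can be repeated
for _ in range(3):
    print(sample_hyperparameters(rng))
```

Each sampled configuration would then be trained with early stopping, keeping the one with the best validation F1.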

Running the code

After downloading the data with download.sh, run the code with:

python code/rnnslu.py

('NEW BEST: epoch', 25, 'valid F1', 96.84, 'best test F1', 93.79)
[learning] epoch 26 >> 100.00% completed in 28.76 (sec) <<
[learning] epoch 27 >> 100.00% completed in 28.76 (sec) <<
...
('BEST RESULT: epoch', 57, 'valid F1', 97.23, 'best test F1', 94.2, 'with the model', 'rnnslu')

Timing

Running the code from the GitHub repository on the ATIS dataset takes less than 40 seconds per epoch on an i7 CPU 950 @ 3.07GHz, using less than 200 MB of RAM.

[learning] epoch 0 >> 100.00% completed in 34.48 (sec) <<

After a few epochs, the F1 score reaches 94.48%.

NEW BEST: epoch 28 valid F1 96.61 best test F1 94.19
NEW BEST: epoch 29 valid F1 96.63 best test F1 94.42
[learning] epoch 30 >> 100.00% completed in 35.04 (sec) <<
[learning] epoch 31 >> 100.00% completed in 34.80 (sec) <<
[...]
NEW BEST: epoch 40 valid F1 97.25 best test F1 94.34
[learning] epoch 41 >> 100.00% completed in 35.18 (sec) <<
NEW BEST: epoch 42 valid F1 97.33 best test F1 94.48
[learning] epoch 43 >> 100.00% completed in 35.39 (sec) <<
[learning] epoch 44 >> 100.00% completed in 35.31 (sec) <<
[...]

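Word embedding nearest neighbours

We can check the k-nearest neighbours of the learned embeddings. L2 and cosine distance gave the same results, so we show the neighbours under cosine distance. The first row lists the query words; each column lists that word's nearest neighbours.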

atlanta	back	ap80	but	aircraft	business	a	august	actually	cheap
phoenix	live	ap57	if	plane	coach	people	september	provide	weekday
denver	lives	ap	up	service	first	do	january	prices	weekdays
tacoma	both	connections	a	airplane	fourth	but	june	stop	am
columbus	how	tomorrow	now	seating	thrift	numbers	december	number	early
seattle	me	before	amount	stand	tenth	abbreviation	november	flight	sfo
minneapolis	out	earliest	more	that	second	if	april	there	milwaukee
pittsburgh	other	connect	abbreviation	on	fifth	up	july	serving	jfk
ontario	plane	thrift	restrictions	turboprop	third	serve	jfk	thank	shortest
montreal	service	coach	mean	mean	twelfth	database	october	ticket	bwi
philadelphia	fare	today	interested	amount	sixth	passengers	may	are	lastest

As you can judge, the limited vocabulary size (about 500 words) yields mixed results. By human judgement, some neighbours are good and others less so.
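The nearest-neighbour check in this section can be sketched with NumPy as follows. The function name and the 2-d toy embeddings are made up for illustration; with the real model you would pass the learned embedding matrix and the index2word dictionary.

```python
import numpy as np


def nearest_neighbors(emb, index2word, query_idx, k=3):
    """Return the k words whose embeddings are closest to the query in cosine distance."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    similarity = unit @ unit[query_idx]          # cosine similarity to every word
    order = np.argsort(-similarity)              # most similar first
    return [index2word[i] for i in order if i != query_idx][:k]


# Toy 2-d embeddings, made up for illustration.
index2word = {0: "atlanta", 1: "phoenix", 2: "august", 3: "september"}
emb = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [0.1, 1.0],
                [0.2, 0.9]])
print(nearest_neighbors(emb, index2word, query_idx=0, k=1))   # ['phoenix']
```

For unit-length vectors, the squared L2 distance equals 2 - 2·cos(a, b), so both distances rank neighbours identically; this is consistent with the two measures giving the same results once the embeddings are normalized after every update.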