A Summary of Tuning CNN and LSTM Neural Network Models in PyTorch

阿新 • Published: 2017-09-03


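(Demo)

This is a short summary of my work over the past two months. The demo has been uploaded to GitHub and contains implementations of CNN, LSTM, BiLSTM, and GRU, combinations of CNN with LSTM and BiLSTM, and multi-layer, multi-channel CNN, LSTM, and BiLSTM models. This post sums up the problems I ran into recently, how I handled them, the strategies involved, and what experience I have (honestly, not much). I am still a beginner.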
Demo Site: https://github.com/bamtercelboo/cnn-lstm-bilstm-deepcnn-clstm-in-pytorch

(1) A Brief Introduction to PyTorch

PyTorch is a relatively new, Python-first deep learning framework that provides tensors and dynamic neural networks with strong GPU acceleration.

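(2) CNN and LSTM

For an explanation of convolutional neural networks (CNN), see (https://www.zybuluo.com/hanbingtao/note/485480)
For an explanation of long short-term memory networks (LSTM), see (https://zybuluo.com/hanbingtao/note/581764)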

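(3) Data Preprocessing

1. The corpus I currently use is fairly clean (sample below), but some preprocessing is still needed while loading it: letter case, digits, and stray characters such as "\n" and "\t". I currently use the third-party library torchtext to load and preprocess the data.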

You Should Pay Nine Bucks for This : Because you can hear about suffering Afghan refugees on the news and still be unaffected . ||| 2
Dramas like this make it human . ||| 4
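
Each sample line above holds a tokenized sentence and an integer label separated by "|||". As a minimal sketch of how such a line could be split (the helper name parse_line is hypothetical and is not taken from the demo):

```python
def parse_line(line, sep="|||"):
    """Split one corpus line into (text, label); the label follows the last separator."""
    text, _, label = line.rpartition(sep)
    return text.strip(), int(label)


sample = "Dramas like this make it human . ||| 4"
text, label = parse_line(sample)
print(text)   # the sentence without the separator and label
print(label)  # the label as an int
```

Using rpartition keeps the split on the last separator, so a sentence that happened to contain "|||" itself would still parse.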


2. Build the vocabulary with torchtext and lowercase the corpus:

import torchtext.data as data
# lower word
text_field = data.Field(lower=True)
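
To see why lowercasing matters for the vocabulary, here is a plain-Python sketch of vocabulary building. It is not torchtext's implementation; the function build_vocab and the special tokens are illustrative assumptions.

```python
from collections import Counter


def build_vocab(sentences, lower=True, specials=("<unk>", "<pad>")):
    """Count whitespace tokens and return (itos, stoi), most frequent first."""
    counts = Counter()
    for sentence in sentences:
        if lower:
            sentence = sentence.lower()
        counts.update(sentence.split())
    # sort by descending frequency, then alphabetically for a stable order
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    itos = list(specials) + ordered
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi


corpus = ["Dramas like this", "dramas Like THIS film"]
itos, stoi = build_vocab(corpus)
print(itos)  # ['<unk>', '<pad>', 'dramas', 'like', 'this', 'film']
```

Without lowercasing, "Dramas" and "dramas" would occupy two separate entries, which inflates the vocabulary and makes lookups in pretrained vectors miss more often.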


3. Handle digits and other special characters in the corpus:

import re
from torchtext import data

def clean_str(string):
    # replace every character outside the allowed set with a space
    string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
    # split common English contractions off the preceding word
    string = re.sub(r"'s", " 's", string)
    string = re.sub(r"'ve", " 've", string)
    string = re.sub(r"n't", " n't", string)
    string = re.sub(r"'re", " 're", string)
    string = re.sub(r"'d", " 'd", string)
    string = re.sub(r"'ll", " 'll", string)
    # put spaces around punctuation so that each mark becomes its own token
    # (only the pattern needs escaping, not the replacement)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # collapse runs of whitespace
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()

text_field.preprocessing = data.Pipeline(clean_str)
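
The cleaning function above keeps digits as they are. One common option, shown here only as a sketch and not as something the demo does, is to map every number to a single placeholder token so that rare numerals do not each take a vocabulary slot. The token "<num>" and the helper normalize_digits are assumptions for illustration.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder token


def normalize_digits(text):
    """Replace each run of digits (optionally with a decimal part) by NUM_TOKEN."""
    return re.sub(r"\d+(?:\.\d+)?", NUM_TOKEN, text)


print(normalize_digits("pay 9 bucks or 12.50 for 2 tickets"))
# pay <num> bucks or <num> for <num> tickets
```

If this were combined with a cleaner that strips characters such as "<" and ">", it would have to run after that cleaner, or use a purely alphabetic placeholder instead.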


4. Things to watch out for:

When loading the dataset, you can shuffle the examples with random:

import random

if shuffle:
    random.shuffle(examples_train)
    random.shuffle(examples_dev)
    random.shuffle(examples_test)
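
When comparing hyperparameter settings, it can help if the shuffle is reproducible. A minimal sketch, assuming a fixed seed is acceptable for your experiments (the helper shuffled_copy and the seed value are illustrative, not from the demo):

```python
import random


def shuffled_copy(examples, seed):
    """Return a shuffled copy of examples using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    copy = list(examples)
    rng.shuffle(copy)
    return copy


examples_train = ["s%d" % i for i in range(10)]
first = shuffled_copy(examples_train, seed=233)
second = shuffled_copy(examples_train, seed=233)
print(first == second)  # the same seed gives the same order
```

A private random.Random instance keeps the data order independent of any other code that draws from the global generator.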


When torchtext builds the iterators for the training, development, and test sets, you can choose whether the data is shuffled on each pass:

class Iterator(object):
    """Defines an iterator that loads batches of data from a Dataset.

    Attributes:
        dataset: The Dataset object to load Examples from.
        batch_size: Batch size.
        sort_key: A key to use for sorting examples in order to batch together
            examples with similar lengths and minimize padding. The sort_key
            provided to the Iterator constructor overrides the sort_key
            attribute of the Dataset, or defers to it if None.
        train: Whether the iterator represents a train set.
        repeat: Whether to repeat the iterator for multiple epochs.
        shuffle: Whether to shuffle examples between epochs.
        sort: Whether to sort examples according to self.sort_key.
            Note that repeat, shuffle, and sort default to train, train, and
            (not train).
        device: Device to create batches on. Use -1 for CPU and None for the
            currently active GPU device.
    """
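
The sort_key attribute in the docstring above exists so that examples of similar length end up in the same batch, which reduces padding. The effect can be illustrated in plain Python; this is a sketch of the idea with made-up helper names, not torchtext's batching code.

```python
def make_batches(examples, batch_size, sort_by_length):
    """Split tokenized examples into batches, optionally sorting by length first."""
    if sort_by_length:
        examples = sorted(examples, key=len)
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def padding_needed(batches):
    """Count the pad tokens needed to bring every example up to its batch maximum."""
    total = 0
    for batch in batches:
        longest = max(len(example) for example in batch)
        total += sum(longest - len(example) for example in batch)
    return total


examples = [["tok"] * n for n in (2, 9, 3, 8, 1, 10)]
unsorted_pad = padding_needed(make_batches(examples, 2, sort_by_length=False))
sorted_pad = padding_needed(make_batches(examples, 2, sort_by_length=True))
print(unsorted_pad, sorted_pad)  # 21 7
```

Sorting trades some randomness in batch composition for less wasted computation, which is presumably why the docstring says sort defaults to (not train).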


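(4) Word Embedding

1. Put simply, a word embedding maps each word in the corpus to its word vector. The most widely used way to train word vectors at present is probably word2vec (see http://www.cnblogs.com/bamtercelboo/p/7181899.html).

2. The vocabulary has already been built with torchtext above. There are two ways to obtain the word vectors: load pretrained vectors that were trained on an external corpus, or initialize the vectors randomly. Comparing the two, pretrained vectors naturally work much better. Vectors you train yourself will not necessarily work well, though, because your corpus may be too small. Publicly released vectors such as the Google News set are widely recognized in the field, but they are very large, so if your hardware (GPU) is not up to it, it is better not to try them.

3. Some places to download pretrained word vectors: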

word2vec-GoogleNews-vectors(https://github.com/mmihaltz/word2vec-GoogleNews-vectors)
glove-vectors (https://nlp.stanford.edu/projects/glove/)

4. Loading external word vectors

Load the vectors of the vocabulary words that can be found in the pretrained file:

# load word embedding
def load_my_vecs(path, vocab, freqs):
    word_vecs = {}
    with open(path, encoding="utf-8") as f:
        # skip the first line of the file (a header line in word2vec text files)
        lines = f.readlines()[1:]
        for line in lines:
            values = line.rstrip().split(" ")
            word = values[0]
            # word = word.lower()
            if word in vocab:  # keep only words that are in the vocabulary
                # values[0] is the word, the rest are the vector components
                word_vecs[word] = [float(val) for val in values[1:]]
    return word_vecs
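
The loader above always skips the first line, which suits word2vec text files that start with a "vocabulary_size dimension" header. GloVe text files have no such header, so the first word would be dropped. A hedged sketch of a reader that skips the header only when it is present (read_vectors is a hypothetical helper, and the check assumes vectors with more than one dimension):

```python
import io


def read_vectors(handle, vocab):
    """Parse 'word v1 v2 ...' lines, skipping a 'count dim' header if present."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            continue  # word2vec-style header: "<vocab_size> <dim>"
        if parts[0] in vocab:
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# an in-memory stand-in for a small word2vec text file
sample = io.StringIO("3 2\ngood 0.1 0.2\nbad -0.3 0.4\nmovie 0.5 0.6\n")
vocab = {"good", "movie", "plot"}
vecs = read_vectors(sample, vocab)
print(sorted(vecs))            # ['good', 'movie']
print(len(vecs) / len(vocab))  # share of the vocabulary that was found
```

The last line is a quick coverage figure: the lower it is, the more words fall to the OOV handling.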


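Next, handle the vocabulary words that cannot be found in the pretrained vectors, commonly called OOV (out of vocabulary) words. The more OOV words there are, the larger their effect on the results is likely to be, so how they are handled matters a great deal. Here are a few strategies to consider:
Average the vectors that were found: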

# solve unknown words by averaging the known word embeddings
def add_unknown_words_by_avg(word_vecs, vocab, k=100):
    # collect the vectors of the vocabulary words found in the pretrained file
    known_vecs = [word_vecs[word] for word in vocab if word in word_vecs]
    print(len(known_vecs))
    # per-dimension mean over the known vectors, rounded to 6 decimals
    avg = []
    for i in range(k):
        total = 0.0
        for vec in known_vecs:
            total = round(total + vec[i], 6)
        avg.append(round(total / len(known_vecs), 6))

    list_word2vec = []
    oov = 0
    iov = 0
    for word in vocab:
        if word not in word_vecs:
            # alternatives:
            # word_vecs[word] = np.random.uniform(-0.25, 0.25, k).tolist()
            # word_vecs[word] = [0.0] * k
            oov += 1
            word_vecs[word] = avg
        else:
            iov += 1
        list_word2vec.append(word_vecs[word])
    print("oov count", oov)
    print("iov count", iov)
    return list_word2vec


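Initialize randomly or use all zeros. With random initialization, all OOV words can share one random vector, or each OOV word can get its own; which works better is something to test on your own data.
As for the range of the random values, the papers I have read initialize in (-0.25, 0.25) or in (-0.1, 0.1). Test both yourself, since the effect probably differs across datasets and external word vectors. In my tests 0.25 did better than 0.1.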

import numpy as np

# solve unknown word by uniform(-0.25,0.25)
def add_unknown_words_by_uniform(word_vecs, vocab, k=100):
    list_word2vec = []
    oov = 0
    iov = 0
    # one shared random vector for all OOV words (alternative):
    # uniform = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
    for word in vocab:
        if word not in word_vecs:
            oov += 1
            # a separate random vector for each OOV word
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
            # word_vecs[word] = np.random.uniform(-0.1, 0.1, k).round(6).tolist()
            # word_vecs[word] = uniform
        else:
            iov += 1
        list_word2vec.append(word_vecs[word])
    print("oov count", oov)
    print("iov count", iov)
    return list_word2vec
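
The two random options mentioned above (one vector shared by all OOV words, or one draw per word) can be put behind a single switch. This is a standard-library sketch for illustration only; init_oov, its parameters, and the seed are assumptions and do not appear in the demo.

```python
import random


def uniform_vector(k, bound, rng):
    """Draw one k-dimensional vector uniformly from [-bound, bound]."""
    return [round(rng.uniform(-bound, bound), 6) for _ in range(k)]


def init_oov(oov_words, k=100, bound=0.25, shared=False, seed=0):
    """Assign vectors to OOV words: one shared vector, or one draw per word."""
    rng = random.Random(seed)
    if shared:
        vec = uniform_vector(k, bound, rng)
        return {word: list(vec) for word in oov_words}
    return {word: uniform_vector(k, bound, rng) for word in oov_words}


per_word = init_oov(["foo", "bar"], k=4)
shared = init_oov(["foo", "bar"], k=4, shared=True)
print(per_word)
print(shared)
```

Making bound a parameter keeps the comparison between 0.25 and 0.1 down to a single argument change.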


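Pay special attention to whether the OOV vectors fall within a reasonable range after processing. Always check this afterwards, either by hand or in code. If the resulting vectors contain values such as 15 or 30 or more, the cause is probably your processing method or a bug in your own code, and it has a large impact on the results.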
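
A check of this kind can be a few lines of code. The following is a minimal sketch; the helper name find_out_of_range and the default limit of 1.0 are assumptions, so pick a limit that matches the range of your own pretrained vectors.

```python
def find_out_of_range(word_vecs, limit=1.0):
    """Return {word: largest absolute component} for vectors exceeding limit."""
    offenders = {}
    for word, vec in word_vecs.items():
        peak = max(abs(x) for x in vec)
        if peak > limit:
            offenders[word] = peak
    return offenders


vectors = {"good": [0.12, -0.2], "oops": [15.3, 0.1], "bad": [-0.05, 0.24]}
print(find_out_of_range(vectors))  # {'oops': 15.3}
```

Running it right after the OOV step, before the vectors are copied into the embedding layer, makes such a bug visible early.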
