006 - Deep Learning and Simple NLP Applications
Auto-Encoder
If an original image is fed into a neural network and compressed into an intermediate state (the encoding step, Encoder), and the image decoded back from that intermediate state (the decoding step, Decoder) differs only slightly from the original input, then that intermediate state can serve as a representation of the original input.
AEs were originally meant to pre-train the weights W of a neural network, but that turned out not to work well; in practice, networks instead use mini-batch training to smooth the loss curve and "skip-layer" (shortcut) connections to ease optimization.
So today the most common use of an AE is dimensionality reduction.
A caveat, the "farmer's hypothesis": if a flock of chickens is fed every day at 10 o'clock, the smarter chickens will conclude that food appearing at 10 o'clock is a law of nature. The "law" such a chicken has learned is what machine learning calls a local optimum, in other words overfitting.
Likewise, humans cannot open a god's-eye view while making sense of the world; we cannot step outside three dimensions and understand the three-dimensional world completely.
AE implementation:
from keras.layers import Input, Dense
from keras.models import Model
from sklearn.cluster import KMeans


class ASCIIAutoencoder():
    """Character (ASCII)-level Autoencoder."""
    def __init__(self, sen_len=512, encoding_dim=32, epoch=50, val_ratio=0.3):
        """
        Init.
        :param sen_len: pad all sentences to the same length
        :param encoding_dim: dimensionality of the compressed code
        :param epoch: how many epochs to train
        :param kmeanmodel: a simple KMeans clustering model
        """
        self.sen_len = sen_len
        self.encoding_dim = encoding_dim
        self.autoencoder = None
        self.encoder = None
        self.kmeanmodel = KMeans(n_clusters=2)
        self.epoch = epoch

    def fit(self, x):
        """
        Build and train the model.
        :param x: input text
        """
        # Make every training example the same size and replace each character with its ASCII code
        x_train = self.preprocess(x, length=self.sen_len)
        # Reserve a placeholder for the input
        input_text = Input(shape=(self.sen_len,))
        # "encoded": each layer squeezes the representation into a smaller "compressed" form
        encoded = Dense(1024, activation='tanh')(input_text)
        encoded = Dense(512, activation='tanh')(encoded)
        encoded = Dense(128, activation='tanh')(encoded)
        encoded = Dense(self.encoding_dim, activation='tanh')(encoded)
        # "decoded": take the compressed code and reconstruct input_text from it
        decoded = Dense(128, activation='tanh')(encoded)
        decoded = Dense(512, activation='tanh')(decoded)
        decoded = Dense(1024, activation='tanh')(decoded)
        decoded = Dense(self.sen_len, activation='sigmoid')(decoded)
        # The whole big -> small -> big model is the autoencoder
        self.autoencoder = Model(input=input_text, output=decoded)
        # The first half (big -> small) is the encoder
        self.encoder = Model(input=input_text, output=encoded)
        # Likewise, we can build a standalone decoder (small -> big).
        # First reserve an input of the code's size
        encoded_input = Input(shape=(self.encoding_dim,))
        # The last four layers of the autoencoder are the decoding layers;
        # chain them together to obtain the decoder
        decoder_output = encoded_input
        for layer in self.autoencoder.layers[-4:]:
            decoder_output = layer(decoder_output)
        decoder = Model(input=encoded_input, output=decoder_output)
        # compile
        self.autoencoder.compile(optimizer='adam', loss='mse')
        # train
        self.autoencoder.fit(x_train, x_train,
                             nb_epoch=self.epoch,
                             batch_size=1000,
                             shuffle=True)
        # Finally, train a KMeans model on the encoded vectors --
        # a simple distance-based clusterer
        x_train = self.encoder.predict(x_train)
        self.kmeanmodel.fit(x_train)

    def predict(self, x):
        """
        Make predictions.
        :param x: input text
        :return: predictions
        """
        # Same first step: ASCII-fy the input and pad it to the same length
        x_test = self.preprocess(x, length=self.sen_len)
        # Compress the test set with the encoder
        x_test = self.encoder.predict(x_test)
        # Let KMeans assign the clusters
        preds = self.kmeanmodel.predict(x_test)
        return preds

    def preprocess(self, s_list, length=256):
        ...
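The preprocess method is left elided above. Below is a minimal usage sketch, assuming a hypothetical ASCII-based preprocess (pad/truncate to sen_len, map characters to ASCII codes scaled into 0-1); the subclass and its preprocess are illustrative assumptions, not the author's implementation.

import numpy as np

class ToyASCIIAutoencoder(ASCIIAutoencoder):
    # Hypothetical preprocess, for illustration only.
    def preprocess(self, s_list, length=256):
        out = np.zeros((len(s_list), length))
        for i, s in enumerate(s_list):
            codes = [ord(c) % 128 for c in s[:length]]
            out[i, :len(codes)] = np.array(codes) / 128.0
        return out

texts = ["GET /index.html HTTP/1.1", "hello world, how are you?"] * 100
ae = ToyASCIIAutoencoder(sen_len=512, encoding_dim=32, epoch=2)
ae.fit(texts)             # trains the autoencoder, then KMeans on the encoded vectors
print(ae.predict(texts))  # cluster id (0 or 1) per input string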
CNN4Text
How do we transfer CNNs to text processing?
1. Represent the text as an image
Or treat the sentence as a one-dimensional sequence:
The CNN assumption:
The RNN assumption:
Boundary handling:
Narrow vs Wide
Stride size:
The step size: how far the filter moves each time it slides.
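To make the narrow/wide and stride ideas concrete, here is a small sketch using the same Keras-1-era Convolution1D API as the code below; the input shape (10 steps, 8 features) is an arbitrary assumption, and the shapes in the comments follow from the usual convolution arithmetic.

from keras.models import Sequential
from keras.layers import Convolution1D

# Narrow ("valid") convolution: no padding, output length = steps - filter_length + 1
narrow = Sequential()
narrow.add(Convolution1D(nb_filter=4, filter_length=3, border_mode='valid',
                         subsample_length=1, input_shape=(10, 8)))
print(narrow.output_shape)   # (None, 8, 4)

# Wide ("same") convolution: zero-padding keeps the output length equal to the input length
wide = Sequential()
wide.add(Convolution1D(nb_filter=4, filter_length=3, border_mode='same',
                       subsample_length=1, input_shape=(10, 8)))
print(wide.output_shape)     # (None, 10, 4)

# Stride: subsample_length=2 moves the filter two steps at a time, halving the output length
strided = Sequential()
strided.add(Convolution1D(nb_filter=4, filter_length=3, border_mode='same',
                          subsample_length=2, input_shape=(10, 8)))
print(strided.output_shape)  # (None, 5, 4)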
# There are two ways to do CNN for text,
# and within each there are many possible variations;
# in practice the results are roughly similar.
# Here are the two most common approaches:
# 1. A 1-D vector [...] + a 1-D filter [...]
#    Some information is lost (the word vectors get averaged), but it is fast and not much worse.
# 2. Build a 2-D matrix via w2v (or similar) and treat it like an image.
#    This is the more "principled" solution, but it is slow... and expensive to run on AWS.

# 1. 1D CNN for Text
# An IMDB movie-review example
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Convolution1D, MaxPooling1D
from keras.layers import Convolution2D, MaxPooling2D
from keras.datasets import imdb

# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2

# Keras ships this dataset: load the IMDB data
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
# The dataset is already tokenized into word-index sequences, e.g.:
# [123, 2, 0, 45, 32, 1212, 344, 4, ... ]

# Simply pad them to the same length: too short -> pad with 0, too long -> truncate
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
# We could also use word2vec vectors here, which are naturally the same length --
# left as an exercise.

# Initialize a Sequential model (a linear stack of layers)
model = Sequential()

# Here comes the key part: an Embedding layer turns the input word indexes
# into tensor vectors; e.g. with dim 3:
# [[2],[123], ...] --> [[0.1, 0.4, 0.21], [0.2, 0.4, 0.13], ... ]
# Similar in spirit to word2vec output, except this Embedding carries no particular meaning;
# it just vectorizes the integer input.
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2))
# This step is convenient when you feed raw word indices (as with this IMDB dataset),
# but not needed if you bring your own word vectors -- you can skip it in that case.

# Now add a Conv layer
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# Followed by a MaxPooling layer
model.add(MaxPooling1D(pool_length=model.output_shape[1]))

# The pooled result is a bunch of small vectors;
# bluntly flatten them (concatenate them side by side)
model.add(Flatten())

# From here on it is a plain MLP
# (in Keras, an ordinary fully-connected layer is Dense)
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# Final layer
model.add(Dense(1))
model.add(Activation('sigmoid'))

# If you are dealing with a time series,
# these layers could also be replaced with LSTMs.

# compile
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# train
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
# Note: there is a flaw here -- you should not use the test set as validation data;
# it is done only to keep the example simple and runnable.


# 2. 2D CNN
# Here the input should be a list of M*N matrices
# (note: not the IMDB word-index data above).
# We need to reshape slightly: wrap each matrix in an extra 1-element axis,
# i.e. turn the input into a list of lists, each containing one M*N matrix.
x_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])
# cast the dtype to avoid NumPy type clashes
x_train = x_train.astype('float32')

# Then, as before:
model = Sequential()

# n_filter: how many filters in total
# n_conv:   the size of each filter
# (n_filter, n_conv, border_mode, x_axis, y_axis, n_pool are placeholders to fill in for your data)
model.add(Convolution2D(n_filter, n_conv, n_conv,
                        border_mode=border_mode,
                        input_shape=(1, x_axis, y_axis)))
model.add(Activation('relu'))
model.add(Convolution2D(n_filter, n_conv, n_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(n_pool, n_pool)))
model.add(Dropout(0.25))
model.add(Flatten())
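The 2D sketch above stops at Flatten() and leaves its hyper-parameters symbolic. One hedged way to finish it, continuing the script above (same imports and model object): the placeholder values here are chosen purely for illustration and would be defined before the Convolution2D calls, while the classifier head is appended after Flatten(), mirroring the binary-label setup of the 1D example.

# Placeholder values for the symbolic parameters -- pick ones that fit your data
n_filter, n_conv, n_pool = 32, 3, 2
border_mode = 'valid'
x_axis, y_axis = 64, 64   # e.g. words x embedding dims of the text "image"

# ... after the Convolution2D / MaxPooling2D / Flatten layers above, add a classifier head:
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])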
Case Study
Predicting financial-market movements from daily news
Data acquisition
/r/worldnews
DJIA (Dow Jones Industrial Average)
RedditNews (economic news headlines)
Combine
Date, Text, Label
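A minimal sketch of the Combine step, assuming two hypothetical CSV files (DJIA_table.csv with Date/Close columns and RedditNews.csv with Date/News columns; adjust names to whatever you actually download) and defining Label as 1 when the DJIA closed higher than the previous trading day:

import pandas as pd

# Hypothetical file names and columns -- adjust to the actual dataset.
djia = pd.read_csv('DJIA_table.csv', parse_dates=['Date'])   # Date, Close, ...
news = pd.read_csv('RedditNews.csv', parse_dates=['Date'])   # Date, News

# Label: did the index close higher than on the previous trading day?
djia = djia.sort_values('Date')
djia['Label'] = (djia['Close'].diff() > 0).astype(int)

# Text: concatenate all of a day's headlines into one string
daily_text = news.groupby('Date')['News'].apply(lambda h: ' '.join(h)).reset_index(name='Text')

# Combine into Date, Text, Label
combined = djia.merge(daily_text, on='Date')[['Date', 'Text', 'Label']]
print(combined.head())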
Pretraining options for the W2V model:
1.GoogleNews.bin (https://code.google.com/archive/p/word2vec/)
2.RedditComments (https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/)
3. Train directly, on the spot, on the news text in the dataset I provide
4. My pre-trained reddit w2v model
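A minimal sketch of option 1, loading a pretrained binary model, assuming the older gensim API used later in these notes and that the GoogleNews file has been downloaded from the link above; option 3 is exactly what the word-level example later in these notes does with Word2Vec(corpus, ...).

from gensim.models.word2vec import Word2Vec

# File name is an assumption -- whatever you unpacked from the GoogleNews link above.
pretrained = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                           binary=True)
print(pretrained['market'][:5])               # first 5 entries of the 300-dim vector for "market"
print(pretrained.most_similar('market', topn=3))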
ML
An ordinary (feed-forward) neural network:
RNN:
The purpose of an RNN is to let sequential relationships in the information be taken into account.
What is a sequential relationship?
It is the temporal ordering of information: what comes before and what comes after.
Compared with an ordinary neural network:
the state S is computed at every time step,
and the final output of the unit
is based on the last S.
Put simply, for t = 5 this is as if one unit were unrolled (stretched) into five copies.
In other words, S is what we call the memory (it has accumulated the information from t = 1 through 5).
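A minimal numpy sketch of this recurrence (the sizes, tanh state update and softmax readout are illustrative assumptions, not tied to any particular library): the state s_t is updated from the input x_t and the previous state s_{t-1}, and the output is read off the last state.

import numpy as np

hidden, feat, vocab, T = 4, 3, 5, 5           # sizes chosen arbitrarily for the sketch
U = np.random.randn(hidden, feat) * 0.1       # input -> state
W = np.random.randn(hidden, hidden) * 0.1     # state -> state (the recurrence)
V = np.random.randn(vocab, hidden) * 0.1      # state -> output

xs = [np.random.randn(feat) for _ in range(T)]   # x_1 .. x_5
s = np.zeros(hidden)                             # s_0

for x_t in xs:
    s = np.tanh(U.dot(x_t) + W.dot(s))           # s_t depends on x_t AND s_{t-1}: the "memory"

logits = V.dot(s)                                # output is based on the last S
y = np.exp(logits) / np.exp(logits).sum()        # softmax over the vocabulary
print(y)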
As the above shows, an RNN can carry memory along with it.
Consider a "generate the next word" example:
『這頓飯真好』 ——> 『吃』 (roughly: "this meal is really good to" → "eat")
Clearly, the first five characters are enough to guess what comes next.
However,
if I ask you, 『穿山甲說了什麼?』 ("what did the pangolin say?"),
can you answer?
Without the long-range context of the whole comic strip you cannot, which is exactly the long-term dependency problem a plain RNN struggles with.
(credit to 暴走漫畫)
LSTM
RNN
LSTM
The most important thing in an LSTM is the cell state: it runs straight along the whole timeline and acts as the thread of memory. It gets nudged by pointwise multiply (⊗) and pointwise add (⊕) operations, drawn in the diagrams like AND/XOR gates, to update the memory.
What controls how information is added or removed are these valves: the gates.
A gate simply outputs a value between 0 and 1:
1 means: keep everything from this pass;
0 means: this pass can be forgotten.
Let's walk through how information flows through an LSTM.
Step 1: the forget gate
It decides what information we should forget.
It compares the previous state h_{t-1} with the current input x_t
and outputs a value between 0 and 1 through the gate (much like an activation function):
1 means: remember it!
0 means: forget it!
Step 2: the input (memory) gate
It decides what should be remembered.
This gate is a bit more involved and works in two parts:
first, a sigmoid decides which information needs to be updated (forgetting the old values);
second, a tanh builds a candidate new cell state (the updated cell state).
Step 3: the update
Update the old cell state into the new cell state,
using the pointwise multiply/add ("AND"/"XOR"-style) operations to update our cell state:
Step 4: the output gate
The memory decides what value to output.
Our cell state has been updated,
so we use this thread of memory to decide our output:
(Here o_t plays a role similar to the output that the plain RNN produced in a single step.)
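Putting the four steps together, a minimal numpy sketch of one LSTM time step; the weight shapes, the concatenation of h_{t-1} with x_t, and the tiny sizes are assumptions made for the sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, feat = 4, 3
rng = np.random.RandomState(0)
# one weight matrix + bias per gate, each acting on [h_{t-1}, x_t]
Wf, Wi, Wc, Wo = (rng.randn(hidden, hidden + feat) * 0.1 for _ in range(4))
bf = bi = bc = bo = np.zeros(hidden)

h_prev = np.zeros(hidden)       # h_{t-1}
c_prev = np.zeros(hidden)       # C_{t-1}, the cell state / thread of memory
x_t = rng.randn(feat)

z = np.concatenate([h_prev, x_t])

f_t = sigmoid(Wf.dot(z) + bf)            # step 1: forget gate, a 0..1 value per entry
i_t = sigmoid(Wi.dot(z) + bi)            # step 2a: which entries to update
c_tilde = np.tanh(Wc.dot(z) + bc)        # step 2b: candidate new cell state
c_t = f_t * c_prev + i_t * c_tilde       # step 3: pointwise multiply/add update
o_t = sigmoid(Wo.dot(z) + bo)            # step 4: output gate
h_t = o_t * np.tanh(c_t)                 # the output, read off the updated memory

print(h_t, c_t)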
Case Study
The prototypical task: What's Next?
It can be applied at different granularities:
Level 1: what is the next letter?
Level 2: what is the next word?
Level 3: what is the next sentence?
Level N: what is the next image/note/...?
Text generation with an RNN
A small example to see how an LSTM works in practice.
Here we use a biography of Winston Churchill as the training corpus.
Step one, as always: import the libraries.
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
Using Theano backend. Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5105) /usr/local/lib/python3.5/dist-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5. warnings.warn(warn)
Next, we read in the text.
raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
Since we are working at the character level and there are only 26 letters, we can conveniently one-hot encode all the characters (plus, of course, some punctuation and other noise).
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
Here are all the chars:
chars
['\n', ' ', '!', '#', '$', '%', '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”', '\ufeff']
In total there are:
len(chars)
61
And the raw text contains this many characters in total:
len(raw_text)
276830
Our simple text-prediction task: given the preceding letters, what is the next letter?
For example: given "Winsto", predict "n"; given "Britai", predict "n".
Constructing the training and test data
We need to turn the raw text into x and y that we can train on:
x is the preceding letters; y is the next letter.
seq_length = 100
x = []
y = []
for i in range(0, len(raw_text) - seq_length):
    given = raw_text[i:i + seq_length]
    predict = raw_text[i + seq_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])
Let's look at what the prepared data looks like:
print(x[:3])
print(y[:3])
[[60, 45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44], [45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35], [47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35, 1]] [35, 1, 30]
At this point, the representation above is essentially a vocabulary of indices (a bag-of-words-style id mapping).
Next we do two things:
-
We already have a numeric representation of the input (indices); we need to reshape it into the array format the LSTM expects: [samples, time steps, features].
-
Second, for the output: as we saw with Word2Vec, predicting a one-hot vector gives better results than regressing a single exact numeric y.
n_patterns = len(x)
n_vocab = len(chars)
# reshape x into the form the LSTM expects
x = numpy.reshape(x, (n_patterns, seq_length, 1))
# simple normalization to 0-1
x = x / float(n_vocab)
# turn the output into one-hot
y = np_utils.to_categorical(y)

print(x[11])
print(y[11])
[[ 0.80327869] [ 0.55737705] [ 0.70491803] [ 0.50819672] [ 0.55737705] [ 0.7704918 ] [ 0.59016393] [ 0.93442623] [ 0.78688525] [ 0.01639344] [ 0.7704918 ] [ 0.55737705] [ 0.49180328] [ 0.67213115] [ 0.01639344] [ 0.78688525] [ 0.72131148] [ 0.67213115] [ 0.54098361] [ 0.62295082] [ 0.55737705] [ 0.7704918 ] [ 0.78688525] [ 0.01639344] [ 0.72131148] [ 0.57377049] [ 0.01639344] [ 0.57377049] [ 0.72131148] [ 0.7704918 ] [ 0.80327869] [ 0.81967213] [ 0.70491803] [ 0.55737705] [ 0.14754098] [ 0.01639344] [ 0.50819672] [ 0.8852459 ] [ 0.01639344] [ 0.7704918 ] [ 0.62295082] [ 0.52459016] [ 0.60655738] [ 0.49180328] [ 0.7704918 ] [ 0.54098361] [ 0.01639344] [ 0.60655738] [ 0.49180328] [ 0.7704918 ] [ 0.54098361] [ 0.62295082] [ 0.70491803] [ 0.59016393] [ 0.01639344] [ 0.54098361] [ 0.49180328] [ 0.83606557] [ 0.62295082] [ 0.78688525] [ 0. ] [ 0. ] [ 0.80327869] [ 0.60655738] [ 0.62295082] [ 0.78688525] [ 0.01639344] [ 0.55737705] [ 0.50819672] [ 0.72131148] [ 0.72131148] [ 0.6557377 ] [ 0.01639344] [ 0.62295082] [ 0.78688525] [ 0.01639344] [ 0.57377049] [ 0.72131148] [ 0.7704918 ] [ 0.01639344] [ 0.80327869] [ 0.60655738] [ 0.55737705] [ 0.01639344] [ 0.81967213] [ 0.78688525] [ 0.55737705] [ 0.01639344] [ 0.72131148] [ 0.57377049] [ 0.01639344] [ 0.49180328] [ 0.70491803] [ 0.8852459 ] [ 0.72131148] [ 0.70491803] [ 0.55737705] [ 0.01639344] [ 0.49180328] [ 0.70491803]] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Building the model
LSTM model construction
model = Sequential()
model.add(LSTM(256, input_shape=(x.shape[1], x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Train the model
model.fit(x, y, nb_epoch=50, batch_size=4096)
Epoch 1/50 276730/276730 [==============================] - 197s - loss: 3.1120 Epoch 2/50 276730/276730 [==============================] - 197s - loss: 3.0227 Epoch 3/50 276730/276730 [==============================] - 197s - loss: 2.9910 Epoch 4/50 276730/276730 [==============================] - 197s - loss: 2.9337 Epoch 5/50 276730/276730 [==============================] - 197s - loss: 2.8971 Epoch 6/50 276730/276730 [==============================] - 197s - loss: 2.8784 Epoch 7/50 276730/276730 [==============================] - 197s - loss: 2.8640 Epoch 8/50 276730/276730 [==============================] - 197s - loss: 2.8516 Epoch 9/50 276730/276730 [==============================] - 197s - loss: 2.8384 Epoch 10/50 276730/276730 [==============================] - 197s - loss: 2.8254 Epoch 11/50 276730/276730 [==============================] - 197s - loss: 2.8133 Epoch 12/50 276730/276730 [==============================] - 197s - loss: 2.8032 Epoch 13/50 276730/276730 [==============================] - 197s - loss: 2.7913 Epoch 14/50 276730/276730 [==============================] - 197s - loss: 2.7831 Epoch 15/50 276730/276730 [==============================] - 197s - loss: 2.7744 Epoch 16/50 276730/276730 [==============================] - 197s - loss: 2.7672 Epoch 17/50 276730/276730 [==============================] - 197s - loss: 2.7601 Epoch 18/50 276730/276730 [==============================] - 197s - loss: 2.7540 Epoch 19/50 276730/276730 [==============================] - 197s - loss: 2.7477 Epoch 20/50 276730/276730 [==============================] - 197s - loss: 2.7418 Epoch 21/50 276730/276730 [==============================] - 197s - loss: 2.7360 Epoch 22/50 276730/276730 [==============================] - 197s - loss: 2.7296 Epoch 23/50 276730/276730 [==============================] - 197s - loss: 2.7238 Epoch 24/50 276730/276730 [==============================] - 197s - loss: 2.7180 Epoch 25/50 276730/276730 [==============================] - 197s - loss: 2.7113 Epoch 26/50 276730/276730 [==============================] - 197s - loss: 2.7055 Epoch 27/50 276730/276730 [==============================] - 197s - loss: 2.7000 Epoch 28/50 276730/276730 [==============================] - 197s - loss: 2.6934 Epoch 29/50 276730/276730 [==============================] - 197s - loss: 2.6859 Epoch 30/50 276730/276730 [==============================] - 197s - loss: 2.6800 Epoch 31/50 276730/276730 [==============================] - 197s - loss: 2.6741 Epoch 32/50 276730/276730 [==============================] - 197s - loss: 2.6669 Epoch 33/50 276730/276730 [==============================] - 197s - loss: 2.6593 Epoch 34/50 276730/276730 [==============================] - 197s - loss: 2.6529 Epoch 35/50 276730/276730 [==============================] - 197s - loss: 2.6461 Epoch 36/50 276730/276730 [==============================] - 197s - loss: 2.6385 Epoch 37/50 276730/276730 [==============================] - 197s - loss: 2.6320 Epoch 38/50 276730/276730 [==============================] - 197s - loss: 2.6249 Epoch 39/50 276730/276730 [==============================] - 197s - loss: 2.6187 Epoch 40/50 276730/276730 [==============================] - 197s - loss: 2.6110 Epoch 41/50 276730/276730 [==============================] - 192s - loss: 2.6039 Epoch 42/50 276730/276730 [==============================] - 141s - loss: 2.5969 Epoch 43/50 276730/276730 [==============================] - 140s - loss: 2.5909 Epoch 44/50 276730/276730 [==============================] - 140s - loss: 2.5843 
Epoch 45/50 276730/276730 [==============================] - 140s - loss: 2.5763 Epoch 46/50 276730/276730 [==============================] - 140s - loss: 2.5697 Epoch 47/50 276730/276730 [==============================] - 141s - loss: 2.5635 Epoch 48/50 276730/276730 [==============================] - 140s - loss: 2.5575 Epoch 49/50 276730/276730 [==============================] - 140s - loss: 2.5496 Epoch 50/50 276730/276730 [==============================] - 140s - loss: 2.5451Out[11]:
<keras.callbacks.History at 0x7fb6121b6e48>
Let's write a few helper functions to see how well the trained LSTM does:
def predict_next(input_array):
    x = numpy.reshape(input_array, (1, seq_length, 1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    res = []
    for c in raw_input[(len(raw_input) - seq_length):]:
        res.append(char_to_int[c])
    return res

def y_to_char(y):
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c
Now wrap them into one generator function:
def generate_article(init, rounds=200):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string
init = 'His object in coming to New York was to engage officers for that service. He came at an opportune moment'
article = generate_article(init)
print(article)
his object in coming to new york was to engage officers for that service. he came at an opportune moment th the toote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soo
Text generation with an RNN
A small example to see how an LSTM works in practice.
This time, instead of the char level, we work at the word level.
Step one, as always: import the libraries.
import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec
Next, we read in the text.
raw_text = ''
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text += open("../input/" + file, errors='ignore').read() + '\n\n'
# raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sentensor.tokenize(raw_text)
corpus = []
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))

print(len(corpus))
print(corpus[:3])
91007 [['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'great', 'expectations', ',', 'by', 'charles', 'dickens', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'great', 'expectations', 'author', ':', 'charles', 'dickens', 'posting', 'date', ':', 'august', '20', ',', '2008', '[', 'ebook', '#', '1400', ']', 'release', 'date', ':', 'july', ',', '1998', 'last', 'updated', ':', 'september', '25', ',', '2016', 'language', ':', 'english', 'character', 'set', 'encoding', ':', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'great', 'expectations', '***', 'produced', 'by', 'an', 'anonymous', 'volunteer', 'great', 'expectations', '[', '1867', 'edition', ']', 'by', 'charles', 'dickens', '[', 'project', 'gutenberg', 'editor’s', 'note', ':', 'there', 'is', 'also', 'another', 'version', 'of', 'this', 'work', 'etext98/grexp10.txt', 'scanned', 'from', 'a', 'different', 'edition', ']', 'chapter', 'i', 'my', 'father’s', 'family', 'name', 'being', 'pirrip', ',', 'and', 'my', 'christian', 'name', 'philip', ',', 'my', 'infant', 'tongue', 'could', 'make', 'of', 'both', 'names', 'nothing', 'longer', 'or', 'more', 'explicit', 'than', 'pip', '.'], ['so', ',', 'i', 'called', 'myself', 'pip', ',', 'and', 'came', 'to', 'be', 'called', 'pip', '.']]
Now throw everything into w2v:
w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
Done.
w2v_model['office']
array([-0.01398709, 0.15975526, 0.03589381, -0.4449192 , 0.365403 , 0.13376504, 0.78731823, 0.01640314, -0.29723561, -0.21117583, 0.13451998, -0.65348488, 0.06038611, -0.02000343, 0.05698346, 0.68013376, 0.19010596, 0.56921762, 0.66904438, -0.08069923, -0.30662233, 0.26082459, -0.74816126, -0.41383636, -0.56303871, -0.10834043, -0.10635001, -0.7193433 , 0.29722607, -0.83104628, 1.11914253, -0.34119046, -0.39490014, -0.34709939, -0.00583572, 0.17824887, 0.43295503, 0.11827419, -0.28707108, -0.02838829, 0.02565269, 0.10328653, -0.19100265, -0.24102989, 0.23023468, 0.51493132, 0.34759828, 0.05510307, 0.20583512, -0.17160387, -0.10351282, 0.19884749, -0.03935663, -0.04055062, 0.38888735, -0.02003323, -0.16577065, -0.15858875, 0.45083243, -0.09268586, -0.91098118, 0.16775337, 0.3432925 , 0.2103184 , -0.42439541, 0.26097715, -0.10714807, 0.2415273 , 0.2352251 , -0.21662289, -0.13343927, 0.11787982, -0.31010333, 0.21146733, -0.11726214, -0.65574747, 0.04007725, -0.12032496, -0.03468512, 0.11063002, 0.33530036, -0.64098376, 0.34013858, -0.08341357, -0.54826909, 0.0723564 , -0.05169795, -0.19633259, 0.08620321, 0.05993884, -0.14693044, -0.40531522, -0.07695422, 0.2279872 , -0.12342903, -0.1919964 , -0.09589464, 0.4433476 , 0.38304719, 1.0319351 , 0.82628119, 0.3677327 , 0.07600326, 0.08538571, -0.44261214, -0.10997667, -0.03823839, 0.40593523, 0.32665277, -0.67680383, 0.32504487, 0.4009226 , 0.23463745, -0.21442334, 0.42727917, 0.19593567, -0.10731711, -0.01080817, -0.14738144, 0.15710345, -0.01099576, 0.35833639, 0.16394758, -0.10431164, -0.28202233, 0.24488974, 0.69327635, -0.29230621], dtype=float32)
Next, we process the training data the same way as before: flatten the source into one long stream x, so the LSTM can learn to predict the next word.
raw_input = [item for sublist in corpus for item in sublist]
len(raw_input)
2115170
raw_input[12]
'ebook'
text_stream = []
vocab = w2v_model.vocab
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
len(text_stream)
2058753
Here the text-prediction task is: given the preceding words, what is the next word?
For example: given "hello from the other", predict "side".
Constructing the training and test data
We need to turn the raw text into x and y that we can train on:
x is the preceding words; y is the next word.
seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):
    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model[word] for word in given]))
    y.append(w2v_model[predict])
Let's look at what the prepared data looks like:
print(x[10])
print(y[10])
[[-0.02218935 0.04861801 -0.03001036 ..., 0.07096259 0.16345282 -0.18007144] [ 0.1663752 0.67981642 0.36581406 ..., 1.03355932 0.94110376 -1.02763569] [-0.12611888 0.75773817 0.00454156 ..., 0.80544478 2.77890372 -1.00110698] ..., [ 0.34167829 -0.28152692 -0.12020591 ..., 0.19967555 1.65415502 -1.97690392] [-0.66742641 0.82389861 -1.22558379 ..., 0.12269551 0.30856156 0.29964617] [-0.17075984 0.0066567 -0.3894183 ..., 0.23729582 0.41993639 -0.12582727]] [ 0.18125793 -1.72401989 -0.13503326 -0.42429626 1.40763748 -2.16775346 2.26685596 -2.03301549 0.42729807 -0.84830129 0.56945151 0.87243706 3.01571465 -0.38155749 -0.99618471 1.1960727 1.93537641 0.81187075 -0.83017075 -3.18952608 0.48388934 -0.03766865 -1.68608069 -1.84907544 -0.95259917 0.49039507 -0.40943271 0.12804921 1.35876858 0.72395176 1.43591952 -0.41952157 0.38778016 -0.75301784 -2.5016799 -0.85931653 -1.39363682 0.42932403 1.77297652 0.41443667 -1.30974782 -0.08950856 -0.15183811 -1.59824061 -1.58920395 1.03765178 2.07559252 2.79692245 1.11855054 -0.25542653 -1.04980111 -0.86929852 -1.26279402 -1.14124119 -1.04608357 1.97869778 -2.23650813 -2.18115139 -0.26534671 0.39432198 -0.06398458 -1.02308178 1.43372631 -0.02581184 -0.96472031 -3.08931994 -0.67289352 1.06766248 -1.95796657 1.40857184 0.61604798 -0.50270212 -2.33530831 0.45953822 0.37867084 -0.56957626 -1.90680516 -0.57678169 0.50550407 -0.30320352 0.19682285 1.88185465 -1.40448165 -0.43952951 1.95433044 2.07346153 0.22390689 -0.95107335 -0.24579825 -0.21493609 0.66570002 -0.59126669 -1.4761591 0.86431485 0.36701021 0.12569368 1.65063572 2.048352 1.81440067 -1.36734581 2.41072559 1.30975604 -0.36556485 -0.89859813 1.28804696 -2.75488496 1.5667206 -1.75327337 0.60426879 1.77851915 -0.32698369 0.55594021 2.01069188 -0.52870172 -0.39022744 -1.1704396 1.28902853 -0.89315164 1.41299319 0.43392688 -2.52578211 -1.13480854 -1.05396986 -0.85470092 0.6618616 1.23047733 -0.28597715 -2.35096407]
print(len(x))
print(len(y))
print(len(x[12]))
print(len(x[12][0]))
print(len(y[12]))
2058743 2058743 10 128 128
x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1, 128))
Next we do two things:
-
We already have a numeric representation of the input (w2v vectors); we need to reshape it into the array format the LSTM expects: [samples, time steps, features].
-
Second, for the output we simply use the 128-dimensional vector directly.
Building the model
LSTM model construction
model = Sequential()
model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')
Train the model
model.fit(x, y, nb_epoch=50, batch_size=4096)
Epoch 1/50 2058743/2058743 [==============================] - 150s - loss: 0.6839 Epoch 2/50 2058743/2058743 [==============================] - 150s - loss: 0.6670 Epoch 3/50 2058743/2058743 [==============================] - 150s - loss: 0.6625 Epoch 4/50 2058743/2058743 [==============================] - 150s - loss: 0.6598 Epoch 5/50 2058743/2058743 [==============================] - 150s - loss: 0.6577 Epoch 6/50 2058743/2058743 [==============================] - 150s - loss: 0.6562 Epoch 7/50 2058743/2058743 [==============================] - 150s - loss: 0.6549 Epoch 8/50 2058743/2058743 [==============================] - 150s - loss: 0.6537 Epoch 9/50 2058743/2058743 [==============================] - 150s - loss: 0.6527 Epoch 10/50 2058743/2058743 [==============================] - 150s - loss: 0.6519 Epoch 11/50 2058743/2058743 [==============================] - 150s - loss: 0.6512 Epoch 12/50 2058743/2058743 [==============================] - 150s - loss: 0.6506 Epoch 13/50 2058743/2058743 [==============================] - 150s - loss: 0.6500 Epoch 14/50 2058743/2058743 [==============================] - 150s - loss: 0.6496 Epoch 15/50 2058743/2058743 [==============================] - 150s - loss: 0.6492 Epoch 16/50 2058743/2058743 [==============================] - 150s - loss: 0.6488 Epoch 17/50 2058743/2058743 [==============================] - 151s - loss: 0.6485 Epoch 18/50 2058743/2058743 [==============================] - 150s - loss: 0.6482 Epoch 19/50 2058743/2058743 [==============================] - 150s - loss: 0.6480 Epoch 20/50 2058743/2058743 [==============================] - 150s - loss: 0.6477 Epoch 21/50 2058743/2058743 [==============================] - 150s - loss: 0.6475 Epoch 22/50 2058743/2058743 [==============================] - 150s - loss: 0.6473 Epoch 23/50 2058743/2058743 [==============================] - 150s - loss: 0.6471 Epoch 24/50 2058743/2058743 [==============================] - 150s - loss: 0.6470 Epoch 25/50 2058743/2058743 [==============================] - 150s - loss: 0.6468 Epoch 26/50 2058743/2058743 [==============================] - 150s - loss: 0.6466 Epoch 27/50 2058743/2058743 [==============================] - 150s - loss: 0.6464 Epoch 28/50 2058743/2058743 [==============================] - 150s - loss: 0.6463 Epoch 29/50 2058743/2058743 [==============================] - 150s - loss: 0.6462 Epoch 30/50 2058743/2058743 [==============================] - 150s - loss: 0.6461 Epoch 31/50 2058743/2058743 [==============================] - 150s - loss: 0.6460 Epoch 32/50 2058743/2058743 [==============================] - 150s - loss: 0.6458 Epoch 33/50 2058743/2058743 [==============================] - 150s - loss: 0.6458 Epoch 34/50 2058743/2058743 [==============================] - 150s - loss: 0.6456 Epoch 35/50 2058743/2058743 [==============================] - 150s - loss: 0.6456 Epoch 36/50 2058743/2058743 [==============================] - 150s - loss: 0.6455 Epoch 37/50 2058743/2058743 [==============================] - 150s - loss: 0.6454 Epoch 38/50 2058743/2058743 [==============================] - 150s - loss: 0.6453 Epoch 39/50 2058743/2058743 [==============================] - 150s - loss: 0.6452 Epoch 40/50 2058743/2058743 [==============================] - 150s - loss: 0.6452 Epoch 41/50 2058743/2058743 [==============================] - 150s - loss: 0.6451 Epoch 42/50 2058743/2058743 [==============================] - 150s - loss: 0.6450 Epoch 43/50 2058743/2058743 [==============================] - 150s - loss: 
0.6450 Epoch 44/50 2058743/2058743 [==============================] - 150s - loss: 0.6449 Epoch 45/50 2058743/2058743 [==============================] - 150s - loss: 0.6448 Epoch 46/50 2058743/2058743 [==============================] - 150s - loss: 0.6447 Epoch 47/50 2058743/2058743 [==============================] - 150s - loss: 0.6447 Epoch 48/50 2058743/2058743 [==============================] - 150s - loss: 0.6446 Epoch 49/50 2058743/2058743 [==============================] - 150s - loss: 0.6446 Epoch 50/50 2058743/2058743 [==============================] - 150s - loss: 0.6445Out[130]:
<keras.callbacks.History at 0x7f6ed8816a58>
Let's write a few helper functions to see how well the trained LSTM does:
def predict_next(input_array):
    x = np.reshape(input_array, (-1, seq_length, 128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream) - seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    word = w2v_model.most_similar(positive=y, topn=1)
    return word
Now wrap them into one generator function:
def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string
init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
article = generate_article(init)
print(article)
language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say