AI：深度學習用於文字處理

阿新 • • 發佈：2020-03-06

同本文一起釋出的另外一篇文章中，提到了 BlueDot 公司，這個公司致力於利用人工智慧保護全球人民免受傳染病的侵害，在本次疫情還沒有引起強烈關注時，就提前一週發出預警，一週的時間，多麼寶貴！他們的 AI 預警系統，就用到了深度學習對文字的處理，這個系統抓取網路上大量的新聞、公開宣告等獲取到的數十萬的資訊，對自然語言進行處理，我們今天就聊聊深度學習如何對文字的簡單處理。文字，String 或 Text，就是字元的序列或單詞的序列，最常見的是單詞的處理（我們暫時不考慮中文，中文的理解和處理與英文相比要複雜得多）。計算機就是固化的數學，對文字的處理，在本質上來說就是固化的統計學，這樣採用統計學處理後的模型就可以解決許多簡單的問題了。下面我們開始吧。 ## 處理文字資料與之前一致，如果原始要訓練的資料不是向量，我們要進行向量化，文字的向量化，有幾種方式： - 按照單詞分割 - 按照字元分割 - 提取單詞的 n-gram 我喜歡吃火……，你猜我接下來會說的是什麼？1-gram 接下來說什麼都可以，這個詞與前文沒關係；2-gram 接下來可能說“把，柴，焰”等，組成詞語“火把、火柴、火焰”；3-gram 接下來可能說“鍋”，組成“吃火鍋”，這個概率更大一些。先簡單這麼理解，n-gram 就是與前 n-1 個詞有關。我們今天先來填之前挖下來的一個坑，當時說以後將介紹 one-hot，現在是時候了。 ## one-hot 編碼 ``` def one_hot(): samples = ['The cat sat on the mat', 'The dog ate my homework'] token_index = {} # 分割成單詞 for sample in samples: for word in sample.split(): if word not in token_index: token_index[word] = len(token_index) + 1 # {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10} print(token_index) max_length = 8 results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1)) for i, sample in enumerate(samples): for j, word in list(enumerate(sample.split()))[:max_length]: index = token_index.get(word) results[i, j, index] = 1. print(results) ``` ![結果](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMDIzNTY5LTM3ZDI3YzNmMDQyMzY2MjI?x-oss-process=image/format,png) 我們看到，這個資料是不好的，mat 和 homework 後面都分別跟了一個英文的句話 '.'，要炫技寫那種高階的正則表示式去匹配這個莫名其妙的符號嗎？當然不是了，沒錯，Keras 有內建的方法。 ``` def keras_one_hot(): samples = ['The cat sat on the mat.', 'The dog ate my homework.'] tokenizer = Tokenizer(num_words=1000) tokenizer.fit_on_texts(samples) sequences = tokenizer.texts_to_sequences(samples) print(sequences) one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary') print(one_hot_results) word_index = tokenizer.word_index print(word_index) print('Found %s unique tokens.' % len(word_index)) ``` ![結果](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMDIzNTY5LWM2ZTIzYzg3MzA4MTRiMWM?x-oss-process=image/format,png) 這裡的 num_words 和上面的 max_length 都是用來表示多少個最常用的單詞，控制好這個，可以大大的減少運算量訓練時間，甚至有點時候能更好的提高準確率，希望引起一定注意。我們還可以看到得到的編碼的向量，很大一部分都是 0，不夠緊湊，這會導致大量的記憶體佔用，不好不好，有什麼什麼其他辦法呢？答案是肯定的。 ## 詞嵌入也叫詞向量。詞嵌入通常是密集的，維度低的（256、512、1024）。那到底什麼叫詞嵌入呢？本文我們的主題是處理文字資訊，文字資訊就是有語義的，對於沒有語義的文字我們什麼也幹不了，但是我們之前的處理方法，其實就是概率上的統計，，是一種單純的計算，沒有理解的含義（或者說很少），但是考慮到真實情況，“非常好” 和 “非常棒” 的含義是相近的，它們與 “非常差” 的含義是相反的，因此我們希望轉換成向量的時候，前兩個向量距離小，與後一個向量距離大。因此看下面一張圖，是不是就很容易理解了呢： ![image](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMDIzNTY5LTE2MmI0Mzc3YTNlZDk5Y2Y?x-oss-process=image/format,png) 可能直接讓你去實現這個功能有點難，幸好 Keras 簡化了這個問題，Embedding 是內建的網路層，可以完成這個對映關係。現在理解這個概念後，我們再來看看 IMDB 問題（電影評論情感預測），程式碼就簡單了，差不都可以達到 75%的準確率： ``` def imdb_run(): max_features = 10000 maxlen = 20 (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features) x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen) model = Sequential() model.add(Embedding(10000, 8, input_length=maxlen)) model.add(Flatten()) model.add(Dense(1, activation='sigmoid')) model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) model.summary() history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2) ``` 我們的資料量有點少，怎麼辦呢？上一節我們在處理影象的時候，用到的方法是使用預訓練的網路，這裡我們採用類似的方法，採用預訓練的詞嵌入。最流行的兩種詞嵌入是 GloVe 和 Word2Vec，我們後面還是會在合適的時候分別介紹這兩個詞嵌入。今天我們採用 GloVe 的方法，具體做法我寫在了程式碼的註釋中。我們還是先看結果，程式碼還是放在最後： ![image](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMDIzNTY5LThiY2ZiOGYyMjEzNTlhNjU?x-oss-process=image/format,png) 很快就過擬合了，你可能覺得這個驗證精度接近 60%，考慮到訓練樣本只有 200 個，這個結果真的還挺不錯的，當然，你可能不信，那麼我再給出兩組對比圖，一組是沒有詞嵌入的： [外鏈圖片轉存失敗,源站可能有防盜鏈機制,建議將圖片儲存下來直接上傳(img-Uem4R2hO-1583414769394)(https://upload-images.jianshu.io/upload_images/2023569-23b0d32d9d3db11d?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)] 驗證精度明顯偏低，再給出 2000 個訓練集的資料： ![image](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMDIzNTY5LWIzNmY3NTM0OGViNTA0Zjg?x-oss-process=image/format,png) 這個精度就高了很多，追求這個高低不是我們的目的，我們的目的是說明詞嵌入是有效的，我們達到了這個目的，好了，接下來我們看看程式碼吧： ``` #!/usr/bin/env python3 import os import time import matplotlib.pyplot as plt import numpy as np from keras.layers import Embedding, Flatten, Dense from keras.models import Sequential from keras.preprocessing.sequence import pad_sequences from keras.preprocessing.text import Tokenizer def deal(): # http://mng.bz/0tIo imdb_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/aclImdb' train_dir = os.path.join(imdb_dir, 'train') labels = [] texts = [] # 讀出所有資料 for label_type in ['neg', 'pos']: dir_name = os.path.join(train_dir, label_type) for fname in os.listdir(dir_name): if fname[-4:] == '.txt': f = open(os.path.join(dir_name, fname)) texts.append(f.read()) f.close() if label_type == 'neg': labels.append(0) else: labels.append(1) # 對所有資料進行分詞 # 每個評論最多 100 個單詞 maxlen = 100 # 訓練樣本數量 training_samples = 200 # 驗證樣本數量 validation_samples = 10000 # 只取最常見 10000 個單詞 max_words = 10000 # 分詞，前文已經介紹過了 tokenizer = Tokenizer(num_words=max_words) tokenizer.fit_on_texts(texts) sequences = tokenizer.texts_to_sequences(texts) word_index = tokenizer.word_index print('Found %s unique tokens.' % len(word_index)) # 將整數列表轉換成張量 data = pad_sequences(sequences, maxlen=maxlen) labels = np.asarray(labels) print('Shape of data tensor:', data.shape) print('Shape of label tensor:', labels.shape) # 打亂資料 indices = np.arange(data.shape[0]) np.random.shuffle(indices) data = data[indices] labels = labels[indices] # 切割成訓練集和驗證集 x_train = data[:training_samples] y_train = labels[:training_samples] x_val = data[training_samples: training_samples + validation_samples] y_val = labels[training_samples: training_samples + validation_samples] # 下載詞嵌入資料，下載地址：https: // nlp.stanford.edu / projects / glove glove_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/glove.6B' embeddings_index = {} f = open(os.path.join(glove_dir, 'glove.6B.100d.txt')) # 構建單詞與其x向量表示的索引 for line in f: values = line.split() word = values[0] coefs = np.asarray(values[1:], dtype='float32') embeddings_index[word] = coefs f.close() print('Found %s word vectors.' % len(embeddings_index)) # 構建嵌入矩陣 embedding_dim = 100 embedding_matrix = np.zeros((max_words, embedding_dim)) for word, i in word_index.items(): if i < max_words: embedding_vector = embeddings_index.get(word) if embedding_vector is not None: embedding_matrix[i] = embedding_vector # 構建模型 model = Sequential() model.add(Embedding(max_words, embedding_dim, input_length=maxlen)) model.add(Flatten()) model.add(Dense(32, activation='relu')) model.add(Dense(1, activation='sigmoid')) model.summary() # 將 GloVe 載入到 Embedding 層，且將其設定為不可訓練 model.layers[0].set_weights([embedding_matrix]) model.layers[0].trainable = False # 訓練模型 model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val)) model.save_weights('pre_trained_glove_model.h5') # 畫圖 acc = history.history['acc'] val_acc = history.history['val_acc'] loss = history.history['loss'] val_loss = history.history['val_loss'] epochs = range(1, len(acc) + 1) plt.plot(epochs, acc, 'bo', label='Training acc') plt.plot(epochs, val_acc, 'b', label='Validation acc') plt.title('Training and validation accuracy') plt.legend() plt.show() plt.figure() plt.plot(epochs, loss, 'bo', label='Training loss') plt.plot(epochs, val_loss, 'b', label='Validation loss') plt.title('Training and validation loss') plt.legend() plt.show() if __name__ == "__main__": time_start = time.time() deal() time_end = time.time() print('Time Used: ', time_end - time_start) ``` > 本文首發自公眾號：[RAIS]([https://ai.renyuzhuo.cn/](https://ai.renyuzhu

AI：深度學習用於文字處理

AI：深度學習用於文字處理

深度學習用於文字和序列

人工智能AI專家分享：深度學習初學解惑

Yoshua Bengio首次中國演講：深度學習通往人類水平AI的挑戰

deeplearning.ai第二課第一週：深度學習實用技巧

AI專家分享：深度學習工程師的4個檔次

斯坦福AI實驗室又一力作：深度學習還能進一步擴充套件 | CVPR2016最佳學生論文詳解

“GANs 之父”Goodfellow親身傳授：深度學習未來的8大方向和入門AI必備的三大技能

【吳恩達deeplearning.ai】深度學習(9)：迴圈神經網路

15天倒計時：深度學習高端講座免費聽，最後200位贈教材名額！

【OCR技術系列之四】基於深度學習的文字識別（3755個漢字）

吳恩達：深度學習作業2相關

20180813視頻筆記深度學習基礎上篇（1）之必備基礎知識點深度學習基礎上篇（2）神經網絡模型視頻筆記：深度學習基礎上篇（3）神經網絡案例實戰和深度學習基礎下篇

【逐夢AI】深度學習與計算機視覺應用實戰課程（BAT工程師主講，無人汽車，機器人，神經網絡）

七牛雲李朝光：深度學習平臺助力億級別內容審核系統

【火爐煉AI】深度學習001-神經網路的基本單元-感知器

【火爐煉AI】深度學習002-構建並訓練單層神經網路模型

【火爐煉AI】深度學習003-構建並訓練深度神經網路模型

【火爐煉AI】深度學習004-Elman迴圈神經網路

TensorFlow系列專題（三）：深度學習簡介

AI：深度學習用於文字處理

相關推薦