
Notes on Training Chinese ELMo Word Vectors


1 What is ELMo?

Reference: "Classic Models and Recent Advances in Text Embeddings".
A great many word embedding methods have been proposed. The most widely used models are word2vec and GloVe, both unsupervised methods built on the distributional hypothesis (words that appear in similar contexts tend to have similar meanings).

Although some work augments these unsupervised approaches with supervised semantic or syntactic knowledge, purely unsupervised methods developed in very interesting ways over 2017-2018, most notably FastText (an extension of word2vec) and ELMo (state-of-the-art contextual word vectors).


In ELMo, each word is assigned a representation that is a function of the entire sentence in which it appears. The embeddings are computed from the internal states of a two-layer bidirectional language model (biLM), hence the name ELMo: Embeddings from Language Models.

The ELMo embeddings paper: Deep contextualized word representations (Peters et al., NAACL 2018).


Key characteristics of ELMo:

  • ELMo's input is characters rather than words, so it can exploit sub-word units to compute meaningful representations even for out-of-vocabulary words (such as the token "FastText").
  • ELMo is a concatenation of the activations of several biLM layers. Different layers of the language model encode different kinds of information about a word (for instance, in the bidirectional LSTM, part-of-speech tagging is captured better by the lower layers, while word-sense disambiguation is captured better by the upper layers). Concatenating all layers lets downstream tasks freely combine these text representations to improve performance; a small sketch of what this concatenation looks like follows below.
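
To make "concatenating the layer activations" concrete, here is a minimal numpy sketch; the 1024-dimensional layer size matches the standard ELMo configuration, while the activation values themselves are random placeholders rather than real biLM outputs:

import numpy as np

seq_len, dim = 7, 1024   # 7 tokens, 1024-dim ELMo layers

# Placeholder activations for the 3 biLM layers (token layer + 2 LSTM layers).
layers = [np.random.randn(seq_len, dim) for _ in range(3)]

# Concatenating the layers gives each token a 3 * 1024 = 3072-dim representation,
# leaving it to the downstream task to decide which layer(s) to rely on.
concatenated = np.concatenate(layers, axis=-1)
print(concatenated.shape)   # (7, 3072)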

2 Which projects are good for training ELMo?

Without further ado (the theory is left for you to read up on): since I was too lazy to reproduce everything from scratch, I went looking for shortcuts...

Projects with a training pipeline

It essentially comes down to two projects: allenai/bilm-tf and UKPLab/elmo-bilstm-cnn-crf.

  • ELMo originates from allenai/bilm-tf, which runs with Python 3.5 and TensorFlow 1.2, although at some points you also need to install their allennlp. The original repo comes with its own training module.
  • Building on that, UKPLab adapted a version, UKPLab/elmo-bilstm-cnn-crf, configured for Python 3 + TensorFlow 1.8 and applied to a BiLSTM-CNN-CRF task. Because the two versions require different TensorFlow versions, it is best to use their respective Docker images.

Pretrained models:

TensorFlow Hub also hosts English pretrained ELMo models (in two versions, v1 and v2) that can be used directly, and a number of projects build on them:

  • Project one: PrashantRanjan09/WordEmbeddings-Elmo-Fasttext-Word2Vec, which compares several word vectors: 0 - word2vec, 1 - Gensim FastText, 2 - FastText (FAIR), 3 - ELMo. It uses the pretrained model from Hub and has no training module of its own.
  • Project two: strongio/keras-elmo, the tutorial "Elmo Embeddings in Keras with TensorFlow Hub". Building on the Hub module, it trains a simple binary sentiment classifier in Keras. An excellent tutorial, but again it provides no training module and only calls the Hub model.

There are quite a few smaller projects as well, most of them small applications built on the tf-hub pretrained model (note: Hub requires TensorFlow 1.7 or above). Which raises the question: English has pretrained models, but what about Chinese?
I found searobbersduck/ELMo_Chin, whose author trained a model on a novel. Following the instructions I was able to train one as well, but the author only wrote up the training procedure, not how to use the trained model afterwards, so you need to cross-reference the projects above.


3 ELMo training workflow

3.1 ELMo training steps

According to AllenNLP, the procedure for computing ELMo is:
- Prepare input data and a vocabulary file.
- Train the biLM.
- Test (compute the perplexity of) the biLM on heldout data.
- Write out the weights from the trained biLM to a hdf5 file.(checkpoint -> hdf5)
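
A hedged sketch of these four steps using the helper scripts in allenai/bilm-tf; the script names and flags follow that repo's README as I recall them, and all paths are placeholders, so verify against the repo before running:

import subprocess

save_dir = "checkpoint"        # where the biLM checkpoints are written
vocab_file = "vocab.txt"       # must begin with <S>, </S>, <UNK>
train_prefix = "data/train/*"  # glob pattern of the training shards

# 1-2. Train the biLM on the prepared corpus.
subprocess.run(["python", "bin/train_elmo.py",
                "--train_prefix", train_prefix,
                "--vocab_file", vocab_file,
                "--save_dir", save_dir], check=True)

# 3. Compute perplexity on heldout data.
subprocess.run(["python", "bin/run_test.py",
                "--test_prefix", "data/heldout/*",
                "--vocab_file", vocab_file,
                "--save_dir", save_dir], check=True)

# 4. Dump the trained weights (checkpoint -> hdf5).
subprocess.run(["python", "bin/dump_weights.py",
                "--save_dir", save_dir,
                "--outfile", "elmo_weights.hdf5"], check=True)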

3.2 How to fine-tune ELMo to another domain?

First download the checkpoint files above. Then prepare the dataset as described in the section “Training a biLM on a new corpus”, with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script bin/restart.py to restart training with the existing checkpoint on the new dataset. For small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.

3.3 How ELMo is actually used

From allennlp's "Using pre-trained models" (the third option below is the one that vectorizes an entire passage/dataset in a single pass and saves the result): there are three ways to integrate ELMo representations into a downstream task, depending on your use case.

  • Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
  • Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.
  • Precompute the representations for your entire dataset and save to a file.
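
For the third option, the bilm package ships a helper that runs a whole dataset through the biLM once and caches the result in an hdf5 file. A minimal sketch, with placeholder paths; the helper name and signature follow allenai/bilm-tf's usage_cached.py example, so double-check them against the repo:

import h5py
from bilm import dump_bilm_embeddings

vocab_file = "vocab.txt"                 # vocabulary used for training
dataset_file = "dataset.txt"             # one tokenized sentence per line
options_file = "options.json"            # biLM options from training
weight_file = "elmo_weights.hdf5"        # biLM weights from training
embedding_file = "elmo_dataset_embeddings.hdf5"

# Run the whole dataset through the biLM once and cache the representations.
dump_bilm_embeddings(vocab_file, dataset_file, options_file, weight_file, embedding_file)

# The hdf5 file is keyed by the sentence index (as a string).
with h5py.File(embedding_file, "r") as fin:
    first_sentence = fin["0"][...]       # shape: (n_layers, n_tokens, dim)
print(first_sentence.shape)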

4 English pretrained models

To get the ball rolling, here is a summary of how to use the English pretrained models.

4.1 Top recommendation: Elmo Embeddings in Keras with TensorFlow Hub

The code comes from strongio/keras-elmo; only the key parts are shown:


# Import our dependencies
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
from keras import backend as K
import keras.layers as layers
from keras.models import Model
import numpy as np

# Initialize session
sess = tf.Session()
K.set_session(sess)

# Now instantiate the elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# Build our model

# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Fit!
# train_text / train_label and test_text / test_label are prepared earlier in the
# tutorial: IMDB review strings (shaped (n, 1)) and 0/1 sentiment labels.
model.fit(train_text, 
          train_label,
          validation_data=(test_text, test_label),
          epochs=5,
          batch_size=32)

>>> Train on 25000 samples, validate on 25000 samples
>>> Epoch 1/5
>>>  1248/25000 [>.............................] - ETA: 3:23:34 - loss: 0.6002 - acc: 0.6795
  

The Hub model is opened with hub.Module("https://tfhub.dev/google/elmo/1", trainable=True), and the ELMo word vectors are wired into the network with embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text).
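
Once trained, the model is used like any other Keras model. A small usage sketch; the example sentences are made up, and the input must be an object array of strings with shape (batch, 1) to match the Lambda/Hub layer above:

new_text = np.array([["this movie was a complete waste of time"],
                     ["a wonderful, heartfelt film"]], dtype=object)

probabilities = model.predict(new_text)   # sigmoid outputs in [0, 1]
print(probabilities)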

4.2 Official usage from allenai/bilm-tf

These are the three usage modes mentioned in Section 3: usage_cached.py, usage_character.py, and usage_token.py.

import tensorflow as tf
import os
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, \
    dump_token_embeddings

# Paths to the files produced by training (see Section 3): the vocabulary,
# the biLM options.json and the hdf5 weight file.
vocab_file = 'vocab.txt'
options_file = 'options.json'
weight_file = 'elmo_weights.hdf5'

# Dump the token embeddings to a file. Run this once for your dataset.
token_embedding_file = 'elmo_token_embeddings.hdf5'
dump_token_embeddings(
    vocab_file, options_file, weight_file, token_embedding_file
)
tf.reset_default_graph()

# Create a TokenBatcher to map tokenized sentences to token ids.
batcher = TokenBatcher(vocab_file)

# Input placeholders to the biLM.
context_token_ids = tf.placeholder('int32', shape=(None, None))
question_token_ids = tf.placeholder('int32', shape=(None, None))

# Build the biLM graph.
bilm = BidirectionalLanguageModel(
    options_file,
    weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file
)

# Get ops to compute the LM embeddings.
context_embeddings_op = bilm(context_token_ids)
question_embeddings_op = bilm(question_token_ids)

# Weight the biLM layers into a single ELMo representation; the question side
# reuses the same scalar weights as the context side.
elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
with tf.variable_scope('', reuse=True):
    elmo_question_input = weight_layers('input', question_embeddings_op, l2_coef=0.0)

# Tokenized input: lists of token lists (toy examples).
tokenized_context = [['Pretrained', 'biLMs', 'compute', 'representations', '.']]
tokenized_question = [['What', 'are', 'biLMs', 'useful', 'for', '?']]

# run
with tf.Session() as sess:
    # It is necessary to initialize variables once before running inference.
    sess.run(tf.global_variables_initializer())

    # Create batches of data.
    context_ids = batcher.batch_sentences(tokenized_context)
    question_ids = batcher.batch_sentences(tokenized_question)

    # Compute ELMo representations (here for the input only, for simplicity).
    elmo_context_input_, elmo_question_input_ = sess.run(
        [elmo_context_input['weighted_op'], elmo_question_input['weighted_op']],
        feed_dict={context_token_ids: context_ids,
                   question_token_ids: question_ids}
    )
  

4.3 UKPLab/elmo-bilstm-cnn-crf

From elmo-bilstm-cnn-crf/Keras_ELMo_Tutorial.ipynb. Like the top recommendation above, it trains a binary classifier in Keras. The training procedure includes a preprocessing step:

  • We read in the dataset (here the IMDB dataset)
  • Text is tokenized and truncated to a fixed length
  • Each text is fed as a sentence to the AllenNLP ElmoEmbedder to get a 1024-dimensional embedding for each word in the document
  • These embeddings are then fed to the neural network that we train

import keras
import os
import sys
from allennlp.commands.elmo import ElmoEmbedder
import numpy as np
import random
from keras.models import Sequential
from keras.layers import Dense, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, Activation, Dropout

# train_data / test_data are lists of dicts with 'tokens' (a token list) and
# 'label' (0/1), read from the IMDB dataset in an earlier cell of the notebook.
# max_tokens is the maximum document length used for padding below; the value
# here is only a placeholder (the notebook sets it in an earlier cell).
max_tokens = 100

# Lookup the ELMo embeddings for all documents (all sentences) in our dataset. Store those
# in a numpy matrix so that we must compute the ELMo embeddings only once.
def create_elmo_embeddings(elmo, documents, max_sentences = 1000):
    num_sentences = min(max_sentences, len(documents)) if max_sentences > 0 else len(documents)
    print("\n\n:: Lookup of "+str(num_sentences)+" ELMo representations. This takes a while ::")
    embeddings = []
    labels = []
    tokens = [document['tokens'] for document in documents]

    documentIdx = 0
    for elmo_embedding in elmo.embed_sentences(tokens):  
        document = documents[documentIdx]
        # Average the 3 layers returned from ELMo
        avg_elmo_embedding = np.average(elmo_embedding, axis=0)

        embeddings.append(avg_elmo_embedding)        
        labels.append(document['label'])

        # Some progress info
        documentIdx += 1
        percent = 100.0 * documentIdx / num_sentences
        line = '[{0}{1}]'.format('=' * int(percent / 2), ' ' * (50 - int(percent / 2)))
        status = '\r{0:3.0f}%{1} {2:3d}/{3:3d} sentences'
        sys.stdout.write(status.format(percent, line, documentIdx, num_sentences))

        if max_sentences > 0 and documentIdx >= max_sentences:
            break

    return embeddings, labels


elmo = ElmoEmbedder(cuda_device=1) #Set cuda_device to the ID of your GPU if you have one
train_x, train_y = create_elmo_embeddings(elmo, train_data, 1000)
test_x, test_y  = create_elmo_embeddings(elmo, test_data, 1000)

# :: Pad the x matrix to uniform length ::
def pad_x_matrix(x_matrix):
    for sentenceIdx in range(len(x_matrix)):
        sent = x_matrix[sentenceIdx]
        sentence_vec = np.array(sent, dtype=np.float32)
        padding_length = max_tokens - sentence_vec.shape[0]
        if padding_length > 0:
            x_matrix[sentenceIdx] = np.append(sent, np.zeros((padding_length, sentence_vec.shape[1])), axis=0)

    matrix = np.array(x_matrix, dtype=np.float32)
    return matrix

train_x = pad_x_matrix(train_x)
train_y = np.array(train_y)

test_x = pad_x_matrix(test_x)
test_y = np.array(test_y)

print("Shape Train X:", train_x.shape)
print("Shape Test Y:", test_x.shape)

# Simple model for sentence / document classification using CNN + global max pooling
model = Sequential()
model.add(Conv1D(filters=250, kernel_size=3, padding='same'))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=10, batch_size=32)
  

The ElmoEmbedder takes care of loading the pretrained biLM (the options and weight files).
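
A minimal usage sketch of ElmoEmbedder on a single tokenized sentence; called with no arguments it downloads the default English options/weights, and you can pass your own options_file / weight_file to use a different model:

from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()   # default English biLM; pass options_file/weight_file for your own
vectors = elmo.embed_sentence(["I", "ate", "an", "apple", "."])

# One 1024-dim vector per token and per biLM layer.
print(vectors.shape)    # (3, 5, 1024): 3 layers, 5 tokens, 1024 dimensions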

4.4 Using ELMo programmatically

A snippet from allennlp's "Using ELMo programmatically":

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = Elmo(options_file, weight_file, 2, dropout=0)

# use batch_to_ids to convert sentences to character ids
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)

# embeddings['elmo_representations'] is length two list of tensors.
# Each element contains one layer of ELMo representations with shape
# (2, 3, 1024).
#   2    - the batch size
#   3    - the sequence length of the batch
#   1024 - the length of each ELMo vector
  

If you are not training a pytorch model, and just want numpy arrays as output then use allennlp.commands.elmo.ElmoEmbedder.


5 Chinese training and practical experience

5.1 Related training projects

There are three sources for training Chinese ELMo.
(1) See searobbersduck/ELMo_Chin, although there seem to be a few problems along the way that I have not yet pinned down.
(2) The blog post "How to use ELMo word vectors for Chinese", which uses GloVe vectors as the initialization. The approach is as follows (a hedged sketch of the GloVe initialization is given right after this enumeration):

  • Read in the pretrained word vectors
  • Modify the bilm-tf code
    • the options section
    • add code that assigns initial values to the embedding weights
    • add code that saves the embedding weights
  • Start training to obtain the checkpoint and option files
  • Run the script to obtain the language model weight file
  • Save the embedding weights in hdf5 format
  • Run the script to convert the corpus into ELMo embeddings

(3) HIT-SCIR/ELMoForManyLangs, the multilingual ELMo models from HIT's entry in this year's CoNLL shared task, which include Traditional Chinese.
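
To illustrate the "assign initial values to the embedding weights" step, here is a hedged sketch that builds an initial embedding matrix from a GloVe-style text file, ordered by the bilm-tf vocabulary file. The file names are placeholders, and where exactly the matrix is injected depends on how you modify bilm-tf (the blog post edits the training code directly):

import numpy as np

def build_initial_embeddings(vocab_path, glove_path, dim):
    """Build an (n_tokens_vocab, dim) matrix whose rows follow the vocab file order."""
    # GloVe-style file: one line per word, "word v1 v2 ... v_dim".
    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    vocab = [line.strip() for line in open(vocab_path, encoding="utf-8")]
    matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in glove:   # <S>, </S>, <UNK> and OOV words keep the random init
            matrix[i] = glove[word]
    return matrix

# Hypothetical paths; the resulting matrix would be used as the initial value
# of the token embedding variable inside the modified bilm-tf code.
init_weights = build_initial_embeddings("vocab_seg_words_elmo.txt", "glove_chinese.txt", 300)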

The tutorial mainly requires preparing three things:

  • The vocabulary file vocab_seg_words_elmo.txt. It must begin with <S>, </S>, <UNK> (case sensitive), and the words should be sorted by frequency in descending order; the three special symbols can simply be added by hand. For example:
立足
酸甜
冷笑
吃飯
市民
熟
金剛
日月同輝
光
  
  • The word-segmented corpus vocab_seg_words.txt (the raw data after tokenization). For example:
有 德克士 吃 [ 色 ] , 心情 也 是 開朗 的
首選 都 是 德克士 [ 酷 ] [ 酷 ]
德克士 好樣 的 , 偶 也 發現 了 鮮萃 檸檬 飲 。
有 德克士 , 能 讓 你 真正 的 幸福 哦
以後 多 給 我們 推出 這麼 到位 的 搭配 , 德克士 我們 等 著
貼心 的 德克士 , 吃貨 們 分享 起來
又 學到 好 知識 了 , 感謝 德克士 [ 吃驚 ]
德克士 一直 久存 於心
  
  • The configuration file option.json (a hedged example is sketched right after this list).
    A few of its parameters deserve attention:
    • n_train_tokens: the total number of tokens in the training set
    • max_characters_per_token: the maximum string length of a single token
    • n_tokens_vocab: the size of the vocabulary vocab_seg_words_elmo.txt
    • n_characters: n_tokens_vocab + max_characters_per_token - 1 (the author is not certain about this)
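
A hedged example of option.json, modeled on the options that bilm-tf's bin/train_elmo.py writes out. The numeric values are placeholders for a small Chinese corpus, and the exact field set should be compared against the options file your own training run produces:

import json

options = {
    "bidirectional": True,
    "dropout": 0.1,
    "lstm": {"dim": 4096, "n_layers": 2, "projection_dim": 512,
             "cell_clip": 3, "proj_clip": 3, "use_skip_connections": True},
    "char_cnn": {"activation": "relu", "n_highway": 2,
                 "embedding": {"dim": 16},
                 "filters": [[1, 32], [2, 32], [3, 64], [4, 128],
                             [5, 256], [6, 512], [7, 1024]],
                 "max_characters_per_token": 50,   # longest token, in characters
                 "n_characters": 261},             # size of the character vocabulary
    "n_epochs": 10,
    "n_train_tokens": 8000000,    # total token count of the training corpus
    "batch_size": 128,
    "unroll_steps": 20,
    "n_tokens_vocab": 50000,      # number of entries in vocab_seg_words_elmo.txt
    "n_negative_samples_batch": 8192,
}

with open("option.json", "w") as f:
    json.dump(options, f, indent=2)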

As for how to use the trained model, I followed usage_token.py out of the three scripts from Section 4.2; the other two kept throwing errors for me.

5.2 Practical notes on ELMo

5.2.1 Note one

Some of these points come from Liu Yijia's answers on Zhihu. The HIT-SCIR ELMo models were trained on roughly 20M words of raw corpus, partly with their own training implementation, which is somewhat more memory-efficient and more stable to train than bilm-tf. They also shared the following observations:

  • For syntactic tasks, ELMo helps most on data with a high OOV rate; across languages, the OOV rate correlates most clearly with the gain ELMo brings;
  • ELMo does better when training data is scarce or close to zero-shot / one-shot settings;
  • With plenty of training data, e.g. the DuReader dataset, ELMo brings little benefit;
  • Some companies that tried it found the gains clear enough to put it into production, while others saw little effect; it depends on the specific task.

5.2.2 Note two

The blog post "My Love for NLP (5): Word vector techniques, from word2vec to ELMo" explains the biggest difference between ELMo and word2vec:
Contextual: The representation for each word depends on the entire context in which it is used.
(In other words, the word vector is not fixed but changes with the context, which sets ELMo apart from word2vec and GloVe.)

For example, take the polysemous word w = "蘋果" (apple):
Text sequence 1 = "我 買了 六斤 蘋果。" ("I bought six jin of apples.")
Text sequence 2 = "我 買了一個 蘋果 7。" ("I bought an Apple 7.")
The word "蘋果" appears in both sequences, but its meaning clearly differs between the two sentences: one belongs to the fruit domain, the other to the consumer-electronics domain. What if we could give "蘋果" vectors that capture these different domains separately? The answer is ELMo, as the sketch below illustrates.
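
A hedged sketch of checking this with a trained Chinese biLM through AllenNLP's ElmoEmbedder; the options/weights paths refer to a hypothetical model trained as in Section 5.1:

import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

# Hypothetical files produced by your own Chinese training run.
elmo = ElmoEmbedder(options_file="option.json", weight_file="elmo_weights.hdf5")

sent1 = ["我", "買了", "六斤", "蘋果", "。"]        # fruit sense
sent2 = ["我", "買了", "一個", "蘋果", "7", "。"]   # electronics sense

# embed_sentence returns an array of shape (3 layers, n_tokens, 1024).
vec1 = elmo.embed_sentence(sent1).mean(axis=0)[3]   # contextual vector of "蘋果" in sentence 1
vec2 = elmo.embed_sentence(sent2).mean(axis=0)[3]   # contextual vector of "蘋果" in sentence 2

cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cosine)   # below 1.0: the two contextual vectors of "蘋果" differ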
  

5.2.3 Note three

The blog post "NAACL 2018, a new embedding method: principles and experiments with Deep contextualized word representations (ELMo)" notes:

  • ELMo works very well: on the SQuAD dataset I gained roughly 3 percentage points of accuracy myself. Because the embeddings are context dependent, they resolve word-sense ambiguity to some extent.
  • ELMo is very slow. The dataset contains about 100k short documents of roughly 400 words each. With a batch size of 32, encoding with GloVe vectors and running 3 biLSTMs, 3 Linear layers and 3 softmax/logsoftmax layers (dropout, relu and the like being negligible), one epoch including backprop takes about 15 minutes on a 1080Ti (similar on a Titan XP). If ELMo is used for the encoding instead, the encoding alone takes close to an hour; using all layers, the dimensionality is so large and GPU memory usage so high that several cards are needed, and with the scheduling and data-transfer overhead between cards, one epoch takes 2+ hours (on 4 cards).

The efficiency workaround proposed in that post:
Although ELMo encodes the same word differently in different contexts, its output does not change when the context is identical (assuming we do not backpropagate into the LM parameters), so the encodings can be precomputed and cached. The paper also finds that different tasks are sensitive to different LM layers; SQuAD, for example, is only sensitive to the first and second layers, so only part of the ELMo encoding needs to be stored: keeping just the first two layers cuts the storage by one third, to about 320 GB. And if we know in advance how sensitive the dataset is to each layer (the $s_j$ mentioned above), we can use those scalar hyperparameters to collapse the three layer outputs via $\sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$ into a single 1024-dimensional vector, which brings the storage requirement down to about 160 GB. A numpy sketch of this collapsing step follows below.
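
A minimal numpy sketch of that caching idea: collapse the three cached ELMo layers into one vector per token with fixed scalar weights s_j (the weights and shapes here are illustrative, and the gamma scaling factor is omitted / folded into the downstream layers):

import numpy as np

n_tokens, dim = 400, 1024
cached_layers = np.random.randn(3, n_tokens, dim)   # stands in for cached biLM outputs

# Task-specific layer weights s_j, fixed in advance (already softmax-normalized).
s = np.array([0.1, 0.6, 0.3])

# ELMo_k = sum_j s_j * h_{k,j}: a weighted sum over the layer axis.
collapsed = np.tensordot(s, cached_layers, axes=1)   # shape: (n_tokens, 1024)
print(collapsed.shape)

# Storing `collapsed` instead of all three layers cuts the cache size to one third.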

5.2.4 Note four

Improving NLP tasks by transferring knowledge from large data
To conclude, both papers prove that NLP tasks can benefit from large data. By leveraging either parallel MT corpus or monolingual corpus, there are several killer features in contextual word representation:

  • Model is able to disambiguate the same word into different
    representation based on its context.
  • Thanks to the character-based convolution, representation of
    out-of-vocabulary tokens can be derived from morphological clues.

However, ELMo can be a cut above CoVe not only because of the performance improvement in tasks, but the type of training data. Because eventually, data is what matters the most in industry. Monolingual data do not require as much of the annotation thus can be collected more easily and efficiently.

Some benchmark comparisons were shown in the original post (figure not reproduced here).