ELMo詞向量用於中文

阿新 • • 發佈：2018-11-10

    <p>10.10更新：ELMo已經由哈工大組用PyTorch重寫了，並且提供了中文的預訓練好的language model，可以直接使用。</p>

ELMo於今年二月由AllenNLP提出，與word2vec或GloVe不同的是其動態詞向量的思想，其本質即通過訓練language model，對於一句話進入到language model獲得不同的詞向量。根據實驗可得，使用了Elmo詞向量之後，許多NLP任務都有了大幅的提高。

論文:Deep contextualized word representations

AllenNLP一共release了兩份ELMo的程式碼，一份是Pytorch版本的，另一份是Tensorflow版本的。Pytorch版本的只開放了使用預訓練好的詞向量的介面，但沒有給出自己訓練的介面，因此無法使用到中文語料中。Tensorflow版本有提供訓練的程式碼，因此本文記錄如何將ELMo用於中文語料中，但本文只記錄使用到的部分，而不會分析全部的程式碼。

需求:
使用預訓練好的詞向量作為句子表示直接傳入到RNN中(也就是不使用程式碼中預設的先過CNN)，在訓練完後，將模型儲存，在需要用的時候load進來，對於一個特定的句子，首先將其轉換成預訓練的詞向量，傳入language model之後最終得到ELMo詞向量。

準備工作:

將中文語料分詞
訓練好GloVe詞向量或者word2vec
下載bilm-tf程式碼
生成詞表 vocab_file （訓練的時候要用到）
optional:閱讀Readme
optional:通讀bilm-tf的程式碼，對程式碼結構有一定的認識

思路:

將預訓練的詞向量讀入

修改bilm-tf程式碼
1. option部分
2. 新增給embedding weight賦初值
3. 新增儲存embedding weight的程式碼
開始訓練，獲得checkpoint和option檔案
執行指令碼，獲得language model的weight檔案
將embedding weight儲存為hdf5檔案形式
執行指令碼，將語料轉化成ELMo embedding。

訓練GloVe或word2vec

可參見我以前的部落格或者網上的教程。
注意到，如果要用gensim匯入GloVe訓好的詞向量，需要在開頭新增num_word embedding_dim。如：

獲得vocab詞表檔案

注意到，詞表檔案的開頭必須要有<S> </S> <UNK>，且大小寫敏感。並且應當按照單詞的詞頻降序排列。可以通過手動新增這三個特殊符號。
如：

程式碼：

model=gensim.models.KeyedVectors.load_word2vec_format(
    fname='/home/zhlin/GloVe/vectors.txt',binary=False
)
words=model.vocab
with open('vocab.txt','w') as f:
    f.write('<S>')
    f.write('\n'）
    f.write('</S>')
    f.write('\n')
    f.write('<UNK>')
    f.write('\n')    # bilm-tf 要求vocab有這三個符號，並且在最前面
    for word in words:
        f.write(word)
        f.write('\n')

修改bilm-tf程式碼

注意到，在使用該程式碼之前，需要安裝好相應的環境。

如果使用的是conda作為預設的Python直譯器，強烈建議使用conda安裝，否則可能會出現一些莫名的錯誤。

1
2
3

conda install tensorflow-gpu=1.4
conda install h5py
python setup.py install #應在bilm-tf的資料夾下執行該指令

然後再執行測試程式碼，通過說明安裝成功。

修改train_elmo.py

bin資料夾下的train_elmo.py是程式的入口。
主要修改的地方：

load_vocab的第二個引數應該改為None
n_gpus CUDA_VISIBLE_DEVICES 根據自己需求改
n_train_tokens 可改可不改，影響的是輸出資訊。要檢視自己語料的行數，可以通過wc -l corpus.txt 檢視。
option的修改，將char_cnn部分都註釋掉，其他根據自己需求修改

如：

修改LanguageModel類

由於我需要傳入預訓練好的GloVe embedding，那麼還需要修改embedding部分，這部分在bilm資料夾下的training.py，進入到LanguageModel類中_build_word_embeddings函式中。注意到，由於前三個是<S> </S> <UNK>，而這三個字元在GloVe裡面是沒有的，因此這三個字元的embedding應當在訓練的時候逐漸學習到，而正因此 embedding_weights的trainable應當設為True

如:

修改train函式

新增程式碼，使得在train函式的最後儲存embedding檔案。

訓練並獲得weights檔案

訓練需要語料檔案corpus.txt，詞表檔案vocab.txt。

訓練

cd到bilm-tf資料夾下，執行

export CUDA_VISIBLE_DEVICES=4
nohup python -u bin/train_elmo.py \
--train_prefix='/home/zhlin/bilm-tf/corpus.txt' \
--vocab_file /home/zhlin/bilm-tf/glove_embedding_vocab8.10/vocab.txt \
--save_dir /home/zhlin/bilm-tf/try >bilm_out.txt 2>&1 &

根據實際情況設定不同的值和路徑。

執行情況：

PS:執行過程中可能會有warning:

‘list’ object has no attribute ‘name’
WARNING:tensorflow:Error encountered when serializing lstm_output_embeddings.
Type is unsupported, or the types of the items don’t match field type in CollectionDef.

應該不用擔心，還是能夠繼續執行的，後面也不受影響。

在等待了相當長的時間後，在save_dir資料夾內生成了幾個檔案，其中checkpoint和options是關鍵，checkpoint能夠進一步生成language model的weights檔案，而options記錄language model的引數。

獲得language model的weights

接下來執行bin/dump_weights.py將checkpoint轉換成hdf5檔案。

1
2
3

nohup python -u  /home/zhlin/bilm-tf/bin/dump_weights.py  \
--save_dir /home/zhlin/bilm-tf/try  \
--outfile /home/zhlin/bilm-tf/try/weights.hdf5 >outfile.txt 2>&1 &

其中save_dir是checkpoint和option檔案儲存的地址。

接下來等待程式執行：

最終獲得了想要的weights和option：

將語料轉化成ELMo embedding

由於我們有了vocab_file、與vocab_file一一對應的embedding h5py檔案、以及language model的weights.hdf5和options.json。
接下來參考usage_token.py將一句話轉化成ELMo embedding。

參考程式碼：

import tensorflow as tf
import os
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, \
    dump_token_embeddings

# Our small dataset.
raw_context = [
    '這 是 測試 .',
    '好的 .'
]
tokenized_context = [sentence.split() for sentence in raw_context]
tokenized_question = [
    ['這', '是', '什麼'],
]

vocab_file='/home/zhlin/bilm-tf/glove_embedding_vocab8.10/vocab.txt'
options_file='/home/zhlin/bilm-tf/try/options.json'
weight_file='/home/zhlin/bilm-tf/try/weights.hdf5'
token_embedding_file='/home/zhlin/bilm-tf/glove_embedding_vocab8.10/vocab_embedding.hdf5'

## Now we can do inference.
# Create a TokenBatcher to map text to token ids.
batcher = TokenBatcher(vocab_file)

# Input placeholders to the biLM.
context_token_ids = tf.placeholder('int32', shape=(None, None))
question_token_ids = tf.placeholder('int32', shape=(None, None))

# Build the biLM graph.
bilm = BidirectionalLanguageModel(
    options_file,
    weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file
)

# Get ops to compute the LM embeddings.
context_embeddings_op = bilm(context_token_ids)
question_embeddings_op = bilm(question_token_ids)

elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
with tf.variable_scope('', reuse=True):
    # the reuse=True scope reuses weights from the context for the question
    elmo_question_input = weight_layers(
        'input', question_embeddings_op, l2_coef=0.0
    )

elmo_context_output = weight_layers(
    'output', context_embeddings_op, l2_coef=0.0
)
with tf.variable_scope('', reuse=True):
    # the reuse=True scope reuses weights from the context for the question
    elmo_question_output = weight_layers(
        'output', question_embeddings_op, l2_coef=0.0
    )


with tf.Session() as sess:
    # It is necessary to initialize variables once before running inference.
    sess.run(tf.global_variables_initializer())

    # Create batches of data.
    context_ids = batcher.batch_sentences(tokenized_context)
    question_ids = batcher.batch_sentences(tokenized_question)

    # Compute ELMo representations (here for the input only, for simplicity).
    elmo_context_input_, elmo_question_input_ = sess.run(
        [elmo_context_input['weighted_op'], elmo_question_input['weighted_op']],
        feed_dict={context_token_ids: context_ids,
                   question_token_ids: question_ids}
    )

print(elmo_context_input_,elmo_context_input_)

可以修改程式碼以適應自己的需求。

Reference

https://github.com/allenai/bilm-tf

</div>

ELMo詞向量用於中文

訓練GloVe或word2vec

獲得vocab詞表檔案

修改bilm-tf程式碼

修改train_elmo.py

修改LanguageModel類

修改train函式

訓練並獲得weights檔案

訓練

獲得language model的weights

將語料轉化成ELMo embedding

Reference

ELMo詞向量用於中文

如何將ELMo詞向量用於中文

Elmo詞向量中文訓練過程雜記

word2vec詞向量處理中文語料

【python gensim使用】word2vec詞向量處理中文語料

word2vec 構建中文詞向量

簡單上手用於中文分詞的隱馬爾科夫模型

最簡單的方式獲取Elmo得到的詞向量

詞向量技術-從word2vec到Glove到ELMo

詞向量——ELMo

詞向量技術(從word2vec到ELMo)以及句嵌入技術

中文自然語言處理向量合集(字向量,拼音向量,詞向量,詞性向量,依存關係向量)

Ubuntu下GloVe中文詞向量模型訓練

word2vec訓練維基百科中文詞向量

word2vec詞向量訓練及中文文字相似度計算

【深度學習】120G+訓練好的word2vec模型（中文詞向量）

windows環境下使用wiki中文百科及gensim工具庫訓練詞向量

HMM（隱馬爾科夫）用於中文分詞

第一節——詞向量與ELmo(轉）

詞向量-LRWE模型

ELMo詞向量用於中文

訓練GloVe或word2vec

獲得vocab詞表檔案

修改bilm-tf程式碼

修改train_elmo.py

修改LanguageModel類

修改train函式

訓練並獲得weights檔案

訓練

獲得language model的weights

將語料轉化成ELMo embedding

Reference

相關推薦