The simplest way to get ELMo word vectors
阿新 · Published: 2018-11-10
Introduction
The goal of this post is to get ELMo word representations in the simplest possible way. I read a number of other write-ups, and in the end this is all I actually needed: I just want the word vectors it produces.
A quick introduction to ELMo: it comes from AllenNLP's NAACL 2018 Best Paper, "Deep contextualized word representations". Adding ELMo representations to an existing model improves performance on tasks such as NLI.
So, straight to how to get these ELMo vectors. There are TensorFlow, PyTorch, and Keras versions out there. This post uses the official snippet-style API: you don't have to wire ELMo into your own model, you get the tensor of word vectors directly. I only want its word vectors, and training the model myself would cost both time and hardware.
Environment
First, create a new conda environment:
conda create -n allennlp python=3.6
Then install allennlp (make sure gcc works on your machine; a C++ toolchain is needed at build time):
pip install allennlp
As long as your network connection holds, you're fine; it pulls in quite a lot, including the full PyTorch stack.
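A quick way to confirm the install worked, assuming nothing beyond the packages pulled in above:

# If these imports succeed, PyTorch and the ELMo entry point used later are available
import torch
from allennlp.commands.elmo import ElmoEmbedder
print("torch", torch.__version__)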
Next, download the pre-trained options and weights provided by allennlp from the following URLs:
- https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
- https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
Keeping local copies makes them easy to reuse; a minimal download sketch follows.
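If you'd rather script the download, here is a small sketch using only the standard library. The target directory /files simply matches the paths used in the Method snippet below, so adjust it to wherever you keep the files.

import urllib.request

base = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/"
for name in ("elmo_2x4096_512_2048cnn_2xhighway_options.json",
             "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"):
    # save each file where the Method snippet expects to find it
    urllib.request.urlretrieve(base + name, "/files/" + name)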
Method
Here is how to get word vectors from these two files:
from allennlp.commands.elmo import ElmoEmbedder

options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)

# batch_to_embeddings takes pre-tokenized sentences and returns
# the ELMo embeddings plus a padding mask
context_tokens = [['I', 'love', 'you', '.'],
                  ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)
print(elmo_embedding)
print(elmo_mask)
Result
Embedding:
tensor([[[[ 0.6923, -0.3261, 0.2283, ..., 0.1757, 0.2660, -0.1013],
[-0.7348, -0.0965, -0.1411, ..., -0.3411, 0.3681, 0.5445],
[ 0.3645, -0.1415, -0.0662, ..., 0.1163, 0.1783, -0.7290],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
[[-1.1051, -0.4092, -0.4365, ..., -0.6326, 0.4735, -0.2577],
[ 0.0899, -0.4828, -0.5596, ..., 0.4372, 0.3840, -0.7343],
[-0.5538, -0.1473, -0.2441, ..., 0.2551, 0.0873, 0.2774],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
[[-3.2634, -0.9448, -0.3199, ..., -1.2070, 0.6930, -0.2016],
[-0.3688, -0.7632, -0.0715, ..., 0.6294, 1.6869, -0.6655],
[-1.0870, -1.4243, -0.2445, ..., 0.0825, 0.5020, 0.2765],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]],
[[[ 0.5042, -0.6629, -0.0231, ..., -0.3084, -0.9741, -0.7230],
[ 0.1131, 0.1575, 0.1414, ..., 0.3718, -0.1432, -0.0248],
[ 0.6923, -0.3261, 0.2283, ..., 0.1757, 0.2660, -0.1013],
...,
[-0.7348, -0.0965, -0.1411, ..., -0.3411, 0.3681, 0.5445],
[ 0.3645, -0.1415, -0.0662, ..., 0.1163, 0.1783, -0.7290],
[-0.8872, -0.2004, -1.0601, ..., -0.2655, 0.2115, 0.1977]],
[[ 0.1221, -0.7032, 0.0169, ..., -0.3249, -0.4935, -0.4965],
[ 0.3399, -0.4682, 0.1888, ..., -0.0565, 0.1001, -0.0416],
[-0.8135, -0.8491, -0.3264, ..., -0.5674, 0.2638, 0.2006],
...,
[ 0.4460, -0.4475, -0.1583, ..., 0.4372, 0.3840, -0.7343],
[-0.1287, 0.0161, 0.0315, ..., 0.2551, 0.0873, 0.2774],
[-1.2373, -0.3373, 0.1098, ..., -0.0276, -0.0181, 0.0602]],
[[-0.0830, -1.5891, -0.2576, ..., -1.2944, 0.1082, 0.6745],
[-0.0724, -0.7200, 0.1463, ..., 0.6919, 0.9144, -0.1260],
[-2.3460, -1.1714, -0.7065, ..., -1.2885, 0.4679, 0.3800],
...,
[ 0.1246, -0.6929, 0.6330, ..., 0.6294, 1.6869, -0.6655],
[-0.5757, -1.0845, 0.5794, ..., 0.0825, 0.5020, 0.2765],
[-1.2392, -0.6155, -0.9032, ..., 0.0524, -0.0852, 0.0805]]]])
Mask:
tensor([[1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1]])
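If you only want to confirm the shapes rather than scan the full dump, printing .shape on the two tensors returned in the Method section is enough (this assumes that snippet has already run):

# Embedding is (batch, 3 layers, max_len, 1024); mask is (batch, max_len)
print(elmo_embedding.shape)  # torch.Size([2, 3, 8, 1024])
print(elmo_mask.shape)       # torch.Size([2, 8])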
Tips
- The output of this experiment is a word-embedding tensor of shape 2 × 3 × 8 × 1024; the 2, 3, and 8 are determined by the input batch and the model, as explained below.
- 2 is the batch size; 3 is the number of ELMo layers (the outputs of the two biLM layers plus the character-CNN layer); 8 is the length of the longest token list (shorter sentences are padded to it); 1024 is the output dimension of each layer. See the sketch after this list for collapsing the layer dimension into one vector per token.
- For the mask, 2 is the batch size and 8 is the length of the longest list; the first list has 4 tokens and the second has 8, so those positions are 1 and the padded positions are 0.
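If, like me, you only want a single 1024-dimensional vector per token, a simple option is to average the three layers and use the mask to zero out padding. This is just a sketch reusing elmo_embedding and elmo_mask from above; it is a plain average, not the learned weighted combination described in the ELMo paper.

# Collapse the layer dimension: (2, 3, 8, 1024) -> (2, 8, 1024)
per_token = elmo_embedding.mean(dim=1)

# Zero out padded positions; the mask broadcasts over the 1024-dim axis
per_token = per_token * elmo_mask.unsqueeze(-1).float()
print(per_token.shape)  # torch.Size([2, 8, 1024])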
References
https://cstsunfu.github.io/2018/06/ELMo/
https://blog.csdn.net/sinat_26917383/article/details/81913790