
The Simplest Way to Get Word Vectors from ELMo

Introduction

The goal of this post is to obtain the word representations produced by ELMo in the simplest way possible. I read a number of other write-ups, and in the end this is all that was actually useful to me, since all I wanted were the word vectors ELMo generates.
A quick introduction to ELMo: it comes from AllenNLP's NAACL 2018 Best Paper, "Deep contextualized word representations", which shows that adding ELMo to existing models improves results on tasks such as NLI.
So, straight to how to get these ELMo vectors. There are TensorFlow, PyTorch, and Keras versions around; this post uses the official AllenNLP snippet, which does not require wiring ELMo into a model of your own: you get the word-vector tensors directly. I only want the word vectors, and training the model yourself costs both time and machines.

Environment

First, create a new conda environment:

conda create -n allennlp python=3.6

Next, with the environment activated, install allennlp (make sure gcc works on your machine; a C++ toolchain is needed during the build):

pip install allennlp

As long as your connection holds you'll be fine; there is quite a lot to download, including the full PyTorch stack.
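To confirm the installation worked, here is a minimal sanity check; it only verifies that the packages import and prints the PyTorch version pulled in as a dependency:

# Both imports should succeed without errors.
import torch
from allennlp.commands.elmo import ElmoEmbedder

print(torch.__version__)  # PyTorch is installed as part of allennlp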
Then download the pretrained options and weights released by AllenNLP.
URL:

Keeping local copies makes them easy to reuse.

Method

Here is how to use those two files to get the word vectors:

from allennlp.commands.elmo import ElmoEmbedder

options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)  # loads the pretrained biLM from the two files

# batch_to_embeddings converts the tokenized sentences to character ids
# internally and returns the ELMo embeddings together with a mask
context_tokens = [['I', 'love', 'you', '.'], ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]  # pre-tokenized sentences
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)

print(elmo_embedding)
print(elmo_mask)
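As an aside, if you only need one sentence at a time, ElmoEmbedder also exposes embed_sentence, which returns a numpy array of shape (3, num_tokens, 1024); a minimal sketch, assuming the method is available in your allennlp version:

vectors = elmo.embed_sentence(['I', 'love', 'you', '.'])  # numpy array
print(vectors.shape)  # (3, 4, 1024): 3 layers, 4 tokens, 1024 dimensions

The batched call above is what produces the output shown below.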

Result

Embedding:
tensor([[[[ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-1.1051, -0.4092, -0.4365,  ..., -0.6326,  0.4735, -0.2577],
          [ 0.0899, -0.4828, -0.5596,  ...,  0.4372,  0.3840, -0.7343],
          [-0.5538, -0.1473, -0.2441,  ...,  0.2551,  0.0873,  0.2774],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-3.2634, -0.9448, -0.3199,  ..., -1.2070,  0.6930, -0.2016],
          [-0.3688, -0.7632, -0.0715,  ...,  0.6294,  1.6869, -0.6655],
          [-1.0870, -1.4243, -0.2445,  ...,  0.0825,  0.5020,  0.2765],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]],


        [[[ 0.5042, -0.6629, -0.0231,  ..., -0.3084, -0.9741, -0.7230],
          [ 0.1131,  0.1575,  0.1414,  ...,  0.3718, -0.1432, -0.0248],
          [ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          ...,
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          [-0.8872, -0.2004, -1.0601,  ..., -0.2655,  0.2115,  0.1977]],

         [[ 0.1221, -0.7032,  0.0169,  ..., -0.3249, -0.4935, -0.4965],
          [ 0.3399, -0.4682,  0.1888,  ..., -0.0565,  0.1001, -0.0416],
          [-0.8135, -0.8491, -0.3264,  ..., -0.5674,  0.2638,  0.2006],
          ...,
          [ 0.4460, -0.4475, -0.1583,  ...,  0.4372,  0.3840, -0.7343],
          [-0.1287,  0.0161,  0.0315,  ...,  0.2551,  0.0873,  0.2774],
          [-1.2373, -0.3373,  0.1098,  ..., -0.0276, -0.0181,  0.0602]],

         [[-0.0830, -1.5891, -0.2576,  ..., -1.2944,  0.1082,  0.6745],
          [-0.0724, -0.7200,  0.1463,  ...,  0.6919,  0.9144, -0.1260],
          [-2.3460, -1.1714, -0.7065,  ..., -1.2885,  0.4679,  0.3800],
          ...,
          [ 0.1246, -0.6929,  0.6330,  ...,  0.6294,  1.6869, -0.6655],
          [-0.5757, -1.0845,  0.5794,  ...,  0.0825,  0.5020,  0.2765],
          [-1.2392, -0.6155, -0.9032,  ...,  0.0524, -0.0852,  0.0805]]]])
Mask:
tensor([[1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])
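The shapes can be confirmed directly on the tensors returned by batch_to_embeddings:

print(elmo_embedding.shape)  # torch.Size([2, 3, 8, 1024])
print(elmo_mask.shape)       # torch.Size([2, 8])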

Tips

  • The output of this run is a word embedding of shape 2 * 3 * 8 * 1024; the 2, 3, and 8 are explained below.
  • 2 is the batch size; 3 is the number of layers (the character-CNN encoding plus the two biLM layer outputs); 8 is the length of the longest sentence in the batch (shorter sentences are zero-padded to align); 1024 is the output dimension of each layer.
  • For the mask, 2 is the batch size and 8 is the length of the longest sentence; the first sentence has 4 tokens and the second has 8, so the corresponding positions are 1 and the padded positions are 0. The sketch after this list shows one way to use the mask.
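Most downstream uses want a single vector per token or per sentence rather than the raw 2 * 3 * 8 * 1024 tensor. Below is a minimal sketch of one way to collapse it; it simply averages the three layers (the ELMo paper instead learns task-specific layer weights) and uses the mask to ignore padded positions:

token_vectors = elmo_embedding.mean(dim=1)     # (2, 8, 1024): one vector per token
mask = elmo_mask.unsqueeze(-1).float()         # (2, 8, 1)
# Zero out padding and average over the real tokens of each sentence.
sentence_vectors = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vectors.shape)                  # torch.Size([2, 1024]): one vector per sentence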
