The simplest way to get ELMo word vectors
阿新 · Published: 2018-11-10
Introduction
The goal of this post is to get ELMo word representations in the simplest possible way. I read a number of other write-ups, and in the end this is all I actually needed: I just want the word vectors it produces.
A quick introduction to ELMo: it comes from AllenNLP's NAACL 2018 Best Paper, "Deep contextualized word representations". Adding ELMo representations to an existing model improves performance on tasks such as NLI.
So, straight to how to get these ELMo vectors. There are TensorFlow, PyTorch, and Keras versions out there. This post uses the official snippet-style API: you don't have to wire ELMo into your own model, you get the tensor of word vectors directly. I only want its word vectors, and training the model myself would cost both time and hardware.
Environment
First, create a new conda environment:
conda create -n allennlp python=3.6
Then install allennlp (make sure gcc works on your machine; a C++ toolchain is needed at build time):
pip install allennlp
As long as your network connection holds, you're fine; it pulls in quite a lot, including the full PyTorch stack.
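A quick way to confirm the install worked, assuming nothing beyond the packages pulled in above:

# If these imports succeed, PyTorch and the ELMo entry point used later are available
import torch
from allennlp.commands.elmo import ElmoEmbedder
print("torch", torch.__version__)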
Next, download the pre-trained options and weights provided by allennlp from the following URLs:
- https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
- https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
Keeping local copies makes them easy to reuse; a minimal download sketch follows.
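If you'd rather script the download, here is a small sketch using only the standard library. The target directory /files simply matches the paths used in the Method snippet below, so adjust it to wherever you keep the files.

import urllib.request

base = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/"
for name in ("elmo_2x4096_512_2048cnn_2xhighway_options.json",
             "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"):
    # save each file where the Method snippet expects to find it
    urllib.request.urlretrieve(base + name, "/files/" + name)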
Method
Here is how to get word vectors from these two files:
from allennlp.commands.elmo import ElmoEmbedder

options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)

# batch_to_embeddings takes pre-tokenized sentences and returns
# the ELMo embeddings plus a padding mask
context_tokens = [['I', 'love', 'you', '.'],
                  ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)
print(elmo_embedding)
print(elmo_mask)
Result
Embedding:
tensor([[[[ 0.6923, -0.3261, 0.2283, ..., 0.1757, 0.2660, -0.1013],
[-0.7348, -0.0965, -0.1411, ..., -0.3411, 0.3681, 0.5445],
[ 0.3645, -0.1415, -0.0662, ..., 0.1163, 0.1783, -0.7290],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
[[-1.1051, -0.4092, -0.4365, ..., -0.6326, 0.4735, -0.2577],
[ 0.0899, -0.4828, -0.5596, ..., 0.4372, 0.3840, -0.7343],
[-0.5538, -0.1473, -0.2441, ..., 0.2551, 0.0873, 0.2774],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
[[-3.2634, -0.9448, -0.3199, ..., -1.2070, 0.6930, -0.2016],
[-0.3688, -0.7632, -0.0715, ..., 0.6294, 1.6869, -0.6655],
[-1.0870, -1.4243, -0.2445, ..., 0.0825, 0.5020, 0.2765],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]],
[[[ 0.5042, -0.6629, -0.0231, ..., -0.3084, -0.9741, -0.7230],
[ 0.1131, 0.1575, 0.1414, ..., 0.3718, -0.1432, -0.0248],
[ 0.6923, -0.3261, 0.2283, ..., 0.1757, 0.2660, -0.1013],
...,
[-0.7348, -0.0965, -0.1411, ..., -0.3411, 0.3681, 0.5445],
[ 0.3645, -0.1415, -0.0662, ..., 0.1163, 0.1783, -0.7290],
[-0.8872, -0.2004, -1.0601, ..., -0.2655, 0.2115, 0.1977]],
[[ 0.1221, -0.7032, 0.0169, ..., -0.3249, -0.4935, -0.4965],
[ 0.3399, -0.4682, 0.1888, ..., -0.0565, 0.1001, -0.0416],
[-0.8135, -0.8491, -0.3264, ..., -0.5674, 0.2638, 0.2006],
...,
[ 0.4460, -0.4475, -0.1583, ..., 0.4372, 0.3840, -0.7343],
[-0.1287, 0.0161, 0.0315, ..., 0.2551, 0.0873, 0.2774],
[-1.2373, -0.3373, 0.1098, ..., -0.0276, -0.0181, 0.0602]],
[[-0.0830, -1.5891, -0.2576, ..., -1.2944, 0.1082, 0.6745],
[-0.0724, -0.7200, 0.1463, ..., 0.6919, 0.9144, -0.1260],
[-2.3460, -1.1714, -0.7065, ..., -1.2885, 0.4679, 0.3800],
...,
[ 0.1246, -0.6929, 0.6330, ..., 0.6294, 1.6869, -0.6655],
[-0.5757, -1.0845, 0.5794, ..., 0.0825, 0.5020, 0.2765],
[-1.2392, -0.6155, -0.9032, ..., 0.0524, -0.0852, 0.0805]]]])
Mask:
tensor([[1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1]])
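If you only want to confirm the shapes rather than scan the full dump, printing .shape on the two tensors returned in the Method section is enough (this assumes that snippet has already run):

# Embedding is (batch, 3 layers, max_len, 1024); mask is (batch, max_len)
print(elmo_embedding.shape)  # torch.Size([2, 3, 8, 1024])
print(elmo_mask.shape)       # torch.Size([2, 8])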
Tips
- The output of this experiment is a word-embedding tensor of shape 2 × 3 × 8 × 1024; the 2, 3, and 8 are determined by the input batch and the model, as explained below.
- 2 is the batch size; 3 is the number of ELMo layers (the outputs of the two biLM layers plus the character-CNN layer); 8 is the length of the longest token list (shorter sentences are padded to it); 1024 is the output dimension of each layer. See the sketch after this list for collapsing the layer dimension into one vector per token.
- For the mask, 2 is the batch size and 8 is the length of the longest list; the first list has 4 tokens and the second has 8, so those positions are 1 and the padded positions are 0.
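If, like me, you only want a single 1024-dimensional vector per token, a simple option is to average the three layers and use the mask to zero out padding. This is just a sketch reusing elmo_embedding and elmo_mask from above; it is a plain average, not the learned weighted combination described in the ELMo paper.

# Collapse the layer dimension: (2, 3, 8, 1024) -> (2, 8, 1024)
per_token = elmo_embedding.mean(dim=1)

# Zero out padded positions; the mask broadcasts over the 1024-dim axis
per_token = per_token * elmo_mask.unsqueeze(-1).float()
print(per_token.shape)  # torch.Size([2, 8, 1024])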
References
https://cstsunfu.github.io/2018/06/ELMo/
https://blog.csdn.net/sinat_26917383/article/details/81913790