Building Chinese word vectors with word2vec

阿新 • Published: 2018-09-23


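Word vectors model the word, the basic unit of text, and their strong performance has made them popular among natural language processing researchers. With good word vectors, semantically similar words cluster together in the vector space, which makes downstream tasks such as text classification and text clustering easier. This article walks through building Chinese word vectors with word2vec.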

1. Chinese corpus

This article uses the Sogou news corpus from Sogou Labs. Data link: http://www.sogou.com/labs/resource/cs.php

The downloaded file is named news_sohusite_xml.full.tar.gz

2. Data preprocessing

2.1 Unpack and inspect the raw data

cd into the directory containing the downloaded file and unpack it:

tar -zvxf news_sohusite_xml.full.tar.gz

This yields the file news_sohusite_xml.dat. Open it with vim:

vim news_sohusite_xml.dat

The file looks like this:

[Screenshot: news_sohusite_xml.dat opened in vim]

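2.2 Extract the content

Extract the text between <content> and </content>. The following command converts the file from GBK to UTF-8 (-c drops characters that cannot be converted) and keeps only the lines containing <content>:

cat news_sohusite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>"  > corpus.txt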

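This produces a file named corpus.txt, which can be opened with vim:

vim corpus.txt

It looks like this:

[Screenshot: corpus.txt opened in vim, one <content>...</content> line per news article]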
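For readers without iconv or grep at hand, the same extraction can be sketched in Python. This is a minimal sketch rather than the author's method: the helper name extract_content_lines and its parameters are hypothetical, and errors="ignore" is assumed to play the role of iconv's -c flag by dropping bytes that are not valid GBK.

```python
def extract_content_lines(src_path, dst_path, tag="<content>"):
    """Copy every line containing `tag` from a GBK file to a UTF-8 file.

    Hypothetical helper; returns the number of lines written.
    """
    written = 0
    # errors="ignore" silently drops undecodable bytes, similar to `iconv -c`
    with open(src_path, encoding="gbk", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # stream line by line; the corpus is large
            if tag in line:
                dst.write(line)
                written += 1
    return written


# Intended use, with the file names from this article:
#   extract_content_lines("news_sohusite_xml.dat", "corpus.txt")
```

Reading line by line keeps memory use flat, which matters for a corpus of this size.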

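2.3 Word segmentation

Note that the file passed to word2vec must already be segmented into words. Segmentation can be done with jieba. Install it with:

pip install jieba

The following Python program segments the raw text: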

 1 #!/usr/bin/env python
 2 # coding=utf-8
 3 import jieba
 4 
 5 filePath = 'corpus.txt'
 6 fileSegWordDonePath = 'corpusSegDone.txt'
 7 # read the file line by line
 8 fileTrainRead = []
 9 # fileTestRead = []
10 with open(filePath, encoding='utf-8') as fileTrainRaw:
11     for line in fileTrainRaw:
12         fileTrainRead.append(line)
13 
14 
15 # helper that prints a list of Chinese strings on one line
16 def PrintListChinese(items):
17     for item in items:
18         print(item, end=' ')
19 # segment each line with jieba
20 fileTrainSeg = []
21 for i in range(len(fileTrainRead)):
22     fileTrainSeg.append([' '.join(list(jieba.cut(fileTrainRead[i][9:-11], cut_all=False)))])
23     if i % 100 == 0:
24         print(i)  # progress indicator
25 
26 # to test the segmentation result
27 # PrintListChinese(fileTrainSeg[10])
28 
29 # save the result as UTF-8 text, one article per line
30 with open(fileSegWordDonePath, 'w', encoding='utf-8') as fW:
31     for i in range(len(fileTrainSeg)):
32         fW.write(fileTrainSeg[i][0])
33         fW.write('\n')

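This produces a file named corpusSegDone.txt. Note that jieba is not applied to each input line from start to end, but only to the slice [9:-11] (see line 22: fileTrainRead[i][9:-11]). The opening <content> tag is 9 characters long, and the closing </content> tag plus the trailing newline is 11 characters, so the slice strips both tags from each line (one news article per line).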
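The slice arithmetic can be checked on a single line. The sketch below uses a made-up sample sentence, and strip_by_pattern is a hypothetical, more defensive alternative that is not part of the original script.

```python
import re

OPEN_TAG = "<content>"
CLOSE_TAG = "</content>"


def strip_by_slice(line):
    """The fixed-offset slice from the script: 9 and 11 are the tag lengths."""
    return line[9:-11]


def strip_by_pattern(line):
    """Hypothetical variant: match the tags explicitly instead of counting."""
    match = re.search(r"<content>(.*)</content>", line)
    return match.group(1) if match else ""


sample = "<content>今天天气很好</content>\n"  # made-up example line
# 9 characters for the opening tag, 10 + 1 for the closing tag and newline
assert len(OPEN_TAG) == 9 and len(CLOSE_TAG + "\n") == 11
print(strip_by_slice(sample))
print(strip_by_pattern(sample))
```

Both calls print the bare sentence. The fixed slice relies on every line ending in a newline: on a final line without one, it would also cut the last character of the article, which the pattern-based variant avoids.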

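As before, the segmented file can be opened with vim:

vim corpusSegDone.txt

The result looks like this:

[Screenshot: corpusSegDone.txt opened in vim, words separated by spaces]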
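The script above keeps the whole corpus in two in-memory lists. For a large corpus, a streaming variant may be easier on memory. The following is a sketch under assumptions: segment_file and its tokenize parameter are hypothetical names, and the tokenizer is passed in as a callable so the function itself does not depend on jieba.

```python
def segment_file(src_path, dst_path, tokenize, start=9, end=-11):
    """Segment a corpus line by line without holding it all in memory.

    `tokenize` is any callable that takes a string and returns an
    iterable of tokens (hypothetical parameter, e.g. a jieba wrapper).
    Returns the number of lines processed.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            body = line[start:end]  # drop <content> and </content> plus newline
            dst.write(" ".join(tokenize(body)) + "\n")
            count += 1
    return count


# Intended use with jieba (requires `pip install jieba`):
#   import jieba
#   segment_file("corpus.txt", "corpusSegDone.txt",
#                lambda text: jieba.cut(text, cut_all=False))
```

The output format is the same as in the original script: one article per line, tokens separated by single spaces.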

3. Building the word vectors

3.1 Install word2vec

pip install word2vec

3.2 Train the word vectors

Run the following program:

import word2vec
# train on the segmented corpus and write 300-dimensional vectors to a binary file
word2vec.word2vec('corpusSegDone.txt', 'corpusWord2Vec.bin', size=300, verbose=True)

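This builds the word vectors and stores them in corpusWord2Vec.bin. The size argument sets the dimensionality of the vectors. Because the output is a binary file, opening it in vim shows unreadable characters; the author knew of no workaround at the time of writing.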

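3.3 Inspecting and using the word vectors

3.3.1 View the word vectors

import word2vec
model = word2vec.load('corpusWord2Vec.bin')
print(model.vectors)

This gives the following result:

[Screenshot: printed model.vectors array]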

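3.3.2 View a word in the vocabulary

import word2vec
model = word2vec.load('corpusWord2Vec.bin')
index = 1000
print(model.vocab[index])

The result is:

[Screenshot: printed vocabulary entry]

So the word at index 1000 in the vocabulary is 确保 ("ensure").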

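3.3.3 Show words that are close in the vector space

With good word vectors, words with similar meanings are also close to each other in the vector space. One way to judge the quality of the vectors is therefore to list the nearest neighbours of a word and check whether they are semantically related. The code is:

import word2vec
model = word2vec.load('corpusWord2Vec.bin')
# cosine() returns a pair: neighbour indexes and their similarity scores
indexes = model.cosine(u'加拿大')
for index in indexes[0]:
    print(model.vocab[index])

The result is:

[Screenshot: words nearest to 加拿大 ("Canada")]

The query word can be changed to any other Chinese word, for example:

[Screenshot: nearest words for a different query]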
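To make "close in the vector space" concrete, the sketch below computes cosine similarity by hand and ranks neighbours on toy data. It is an illustration of the general idea, not the word2vec package's implementation: cosine_similarity and nearest_words are hypothetical helpers, and the 2-D vectors are made up rather than trained.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, vocab, vectors, n=3):
    """Return the n words most similar to `word`, excluding the word itself."""
    query = vectors[vocab.index(word)]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in zip(vocab, vectors)
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(other, score) for score, other in scored[:n]]


# Toy 2-D vectors, made up for illustration only (not trained embeddings)
toy_vocab = ["加拿大", "澳大利亚", "美国", "苹果"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.0, 1.0]]

for other, score in nearest_words("加拿大", toy_vocab, toy_vectors, n=2):
    print(other, round(score, 3))
```

Cosine similarity depends only on the angle between vectors, not on their length, so a frequent word with a large-norm vector does not automatically rank as everyone's neighbour.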

4. Plotting word vectors in two dimensions

Reduce the word vectors to two dimensions with PCA and plot them. The code is:

#!/usr/bin/env python
# coding=utf-8
import numpy as np
import matplotlib.font_manager  # makes matplotlib.font_manager.FontProperties available
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
import word2vec
# load the word2vec model
model = word2vec.load('corpusWord2Vec.bin')
rawWordVec = model.vectors

# reduce the word vectors to two dimensions
X_reduced = PCA(n_components=2).fit_transform(rawWordVec)

# pick some center words and find their similar words (Simplified Chinese, matching the corpus)
index1, metrics1 = model.cosine(u'中国')
index2, metrics2 = model.cosine(u'清华')
index3, metrics3 = model.cosine(u'牛顿')
index4, metrics4 = model.cosine(u'自动化')
index5, metrics5 = model.cosine(u'刘亦菲')

# look up the index of each center word
index01 = np.where(model.vocab == u'中国')
index02 = np.where(model.vocab == u'清华')
index03 = np.where(model.vocab == u'牛顿')
index04 = np.where(model.vocab == u'自动化')
index05 = np.where(model.vocab == u'刘亦菲')

index1 = np.append(index1, index01)
index2 = np.append(index2, index02)
index3 = np.append(index3, index03)
index4 = np.append(index4, index04)
index5 = np.append(index5, index05)

# plot the result; zhfont must point to a font with Chinese glyphs (path is system-specific)
zhfont = matplotlib.font_manager.FontProperties(fname='/usr/share/fonts/truetype/wqy/wqy-microhei.ttc')
fig = plt.figure()
ax = fig.add_subplot(111)

for i in index1:
    ax.text(X_reduced[i][0], X_reduced[i][1], model.vocab[i], fontproperties=zhfont, color='r')

for i in index2:
    ax.text(X_reduced[i][0], X_reduced[i][1], model.vocab[i], fontproperties=zhfont, color='b')

for i in index3:
    ax.text(X_reduced[i][0], X_reduced[i][1], model.vocab[i], fontproperties=zhfont, color='g')

for i in index4:
    ax.text(X_reduced[i][0], X_reduced[i][1], model.vocab[i], fontproperties=zhfont, color='k')

for i in index5:
    ax.text(X_reduced[i][0], X_reduced[i][1], model.vocab[i], fontproperties=zhfont, color='c')

ax.axis([0, 0.8, -0.5, 0.5])
plt.show()

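Displaying Chinese text needs special handling: matplotlib must be given a font that contains Chinese glyphs. See line 37 of the code, where zhfont is defined.

The figure below shows the result:

[Figure: 2-D PCA plot of the five center words and their nearest neighbours, one colour per group]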

Source: https://www.cnblogs.com/Newsteinwell/p/6034747.html
