NLP -- a doc2vec sentence-vector example with gensim
Doc2vec, also known as Paragraph Vector, was proposed by Quoc Le and Tomas Mikolov as an extension of the word2vec model. One of its advantages is that it does not require a fixed sentence length: training samples may be sentences of different lengths. Doc2vec is an unsupervised algorithm that produces sentence, paragraph, and document vectors, and the resulting vectors can be used for text classification and semantic analysis.
Below is a program that builds sentence vectors and inspects the results (suitable for small datasets):
# -*- coding: utf-8 -*-
import jieba
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Segment the Chinese text with jieba (one document per line)
def cut_files():
    with open('voicetext.txt', 'r', encoding='utf-8') as fr, \
         open('voiceCut.txt', 'w', encoding='utf-8') as fw:
        for line in fr:
            fw.write(' '.join(jieba.cut(line.strip())) + '\n')
# Read the segmented data and tag each document, keeping everything in x_train
# for later lookup. This holds the whole corpus in memory, so it only suits
# small datasets.
def get_datasest():
    with open('voiceCut.txt', 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
    print(len(docs))
    x_train = []
    for i, text in enumerate(docs):
        word_list = text.strip().split(' ')
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)
    return x_train
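For reference, each item returned by get_datasest() is a TaggedDocument, which simply pairs a token list with a list of tags (here a single integer index). An illustrative item (tokens borrowed from the test sentence used later) looks like:

# e.g. one training item:
# TaggedDocument(words=['我', '想', '看', '扶搖'], tags=[0])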
# Train the model
def train(x_train, size=200):
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4)
    # Extra training passes can be run explicitly if needed:
    # model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=70)
    model_dm.save('model/model_dm_doc2vec')  # the 'model' directory must already exist
    return model_dm
# Example: infer a vector for a new sentence and find similar training documents
def test():
    model_dm = Doc2Vec.load('model/model_dm_doc2vec')
    test_text = ['我', '想', '看', '扶搖']
    inferred_vector_dm = model_dm.infer_vector(test_text)
    # docvecs is renamed dv in gensim >= 4.0; the old name remains a deprecated alias
    sims = model_dm.docvecs.most_similar([inferred_vector_dm], topn=10)
    return sims
if __name__ == '__main__':
    cut_files()
    x_train = get_datasest()
    model_dm = train(x_train)
    sims = test()
    for count, sim in sims:
        sentence = x_train[count]  # count is the document tag, i.e. its index in x_train
        words = ' '.join(sentence.words)
        print(words, sim, len(sentence.words))
    # The word vectors inside the doc2vec model can also be queried directly
    print(model_dm.wv.most_similar('扶搖'))
Similarity results for 扶搖:
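The introduction notes that these vectors can be used for text classification. As a minimal sketch that is not part of the original program, the stored document vectors could be fed to an off-the-shelf classifier such as scikit-learn's LogisticRegression; the labels below are hypothetical placeholders.

# Hypothetical sketch: classify documents by their doc2vec vectors.
# `labels` is an invented placeholder; in practice it comes from your data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_sketch(model_dm, x_train, labels):
    # one feature vector per training document, looked up by its tag
    X = np.array([model_dm.dv[doc.tags[0]] for doc in x_train])  # model_dm.docvecs on gensim < 4.0
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    # classify a new, already segmented sentence
    new_vec = model_dm.infer_vector(['我', '想', '看', '扶搖'])
    return clf.predict([new_vec])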
If you need to train on a large corpus, you only need TaggedLineDocument(inp) during training, which streams the file instead of loading it into memory. A concrete example:
import multiprocessing
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument

inp = 'corpusSegDone.txt'
# TaggedLineDocument tags each line with its line number (suited to large corpora,
# but the original text cannot be read back from the model afterwards; it has to be
# looked up in the file by line number)
sents = TaggedLineDocument(inp)
model = Doc2Vec(sents, vector_size=200, window=8, alpha=0.015,
                workers=multiprocessing.cpu_count())
outp1 = 'docmodel'
model.save(outp1)

model = Doc2Vec.load(outp1)
sims = model.docvecs.most_similar(0)  # 0 is the tag of the first sentence/paragraph
# If you have two new sentences
doc_words1 = ['驗證', '失敗', '驗證碼', '未', '收到']
doc_words2 = ['今天', '獎勵', '有', '哪些', '呢']
# Convert them to vectors (the keyword is epochs in gensim >= 4.0, steps in older versions)
invec1 = model.infer_vector(doc_words1, alpha=0.1, min_alpha=0.0001, epochs=5)
invec2 = model.infer_vector(doc_words2, alpha=0.1, min_alpha=0.0001, epochs=5)
# Find the training documents most similar to sentence 1
sims = model.docvecs.most_similar([invec1])
print(sims)
# Similarity between two training documents (0 and 1086620 are document tags)
print(model.docvecs.similarity(0, 1086620))
# printed similarity: 0.9385169567251749
Printed output of sims:
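Two small additions that are often useful with TaggedLineDocument, not part of the original post: looking up the original text for a document tag (the tag is just the 0-based line number in corpusSegDone.txt), and comparing the two inferred vectors directly with cosine similarity. This is a hedged sketch reusing the file name and variables from the example above.

# Sketch: map a document tag back to its original line, and compare invec1/invec2 directly.
from itertools import islice
import numpy as np

def line_for_tag(path, tag):
    # TaggedLineDocument tags documents with their 0-based line number,
    # so the text can be recovered by reading that line from the corpus file
    with open(path, 'r', encoding='utf-8') as f:
        return next(islice(f, tag, tag + 1)).strip()

print(line_for_tag('corpusSegDone.txt', sims[0][0]))  # text of the most similar document

# cosine similarity between the two inferred sentence vectors
cos = np.dot(invec1, invec2) / (np.linalg.norm(invec1) * np.linalg.norm(invec2))
print(cos)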