Training word vectors with gensim word2vec in Python
By 阿新 • Published: 2018-12-30
Preparation
Once you have Anaconda installed, you can install gensim from the command prompt with:
conda install gensim
About gensim
gensim is a powerful natural language processing toolkit that bundles many common models. Here is an overview:
interfaces – Core gensim interfaces
utils – Various utility functions
matutils – Math utils
corpora.bleicorpus – Corpus in Blei’s LDA-C format
corpora.dictionary – Construct word<->id mappings
corpora.hashdictionary – Construct word<->id mappings
corpora.lowcorpus – Corpus in List-of-Words format
corpora.mmcorpus – Corpus in Matrix Market format
corpora.svmlightcorpus – Corpus in SVMlight format
corpora.wikicorpus – Corpus from a Wikipedia dump
corpora.textcorpus – Building corpora with dictionaries
corpora.ucicorpus – Corpus in UCI bag-of-words format
corpora.indexedcorpus – Random access to corpus documents
models.ldamodel – Latent Dirichlet Allocation
models.ldamulticore – parallelized Latent Dirichlet Allocation
models.ldamallet – Latent Dirichlet Allocation via Mallet
models.lsimodel – Latent Semantic Indexing
models.tfidfmodel – TF-IDF model
models.rpmodel – Random Projections
models.hdpmodel – Hierarchical Dirichlet Process
models.logentropy_model – LogEntropy model
models.lsi_dispatcher – Dispatcher for distributed LSI
models.lsi_worker – Worker for distributed LSI
models.lda_dispatcher – Dispatcher for distributed LDA
models.lda_worker – Worker for distributed LDA
models.word2vec – Deep learning with word2vec
models.doc2vec – Deep learning with paragraph2vec
models.dtmmodel – Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)
models.phrases – Phrase (collocation) detection
similarities.docsim – Document similarity queries
simserver – Document similarity server
As you can see, it covers:
- Basic corpus-processing tools
- LSI
- LDA
- HDP
- DTM
- DIM
- TF-IDF
- word2vec、paragraph2vec
We'll introduce the other models as we need them; today let's try out:
word2vec
#encoding=utf-8
from gensim.models import word2vec

sentences = word2vec.Text8Corpus(u'分詞後的爽膚水評論.txt')
model = word2vec.Word2Vec(sentences, size=50)
y2 = model.similarity(u"好", u"還行")
print(y2)
for word, score in model.most_similar(u"滋潤"):
    print(word, score)
The txt file contains 50,000 reviews that have already been word-segmented; training the model takes a single line:
model=word2vec.Word2Vec(sentences,min_count=5,size=50)
The first argument is the training corpus; the second, min_count, drops any word that appears fewer than that many times (default 5);
the third, size, is the dimensionality of the word vectors, i.e. the number of hidden-layer units (default 100).
model.similarity(u"好", u"還行")  # cosine similarity between two words
model.most_similar(u"滋潤")  # the 10 words closest to "滋潤" by cosine similarity
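Under the hood, what similarity returns is just the cosine of the angle between the two words' vectors. A self-contained numpy sketch with invented low-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; this is what
    # model.similarity(w1, w2) computes for the two words' vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "word vectors", for illustration only.
v_good = np.array([0.9, 0.1, 0.3, 0.2])  # stand-in for "好"
v_okay = np.array([0.8, 0.2, 0.4, 0.1])  # stand-in for "還行"

print(cosine_similarity(v_good, v_okay))  # close to 1.0 for similar words
```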
Output:
0.642981583608
保溼 0.995047152042
溫和 0.985100984573
高 0.978088200092
舒服 0.969187200069
補水 0.967649161816
清爽 0.960570812225
水水 0.958645284176
一般 0.928643763065
一款 0.911774456501
真的 0.90943980217
Not a bad result, even though the corpus is only 50,000 reviews.
Of course, you can also save and reload the model you worked so hard to train:
model.save('/model/word2vec_model')
new_model = word2vec.Word2Vec.load('/model/word2vec_model')
You can also retrieve the vector for an individual word:
model['computer']