Simple usage of the tf-idf model, word2vec and doc2vec in gensim
阿新 • Published 2019-02-09
Reposted from: https://blog.csdn.net/chuchus/article/details/77716545
1. Introduction
gensim is a Python NLP library. It includes the tf-idf model, word2vec, doc2vec, and more.
Official site: https://radimrehurek.com/gensim/
2.word2vec
2.1 Classes and methods
gensim.models.word2vec.Word2Vec(utils.SaveLoad)
A class for training, using, and evaluating word2vec models.
__init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, ...)
- sentences: a list of sentences, where each sentence is itself a list of words: [word1, word2, …, word_n].
- size: the dimensionality of the word vectors.
- window: the maximum distance between the current and predicted word within a sentence.
- alpha: the initial learning rate.
- seed: seed for the random number generator.
- min_count: ignore all words with total frequency lower than this.

save(self, *args, **kwargs)
Persist the model, e.g. model.save('/tmp/mymodel').

@classmethod load(cls, *args, **kwargs)
Deserialize a persisted model, e.g. new_model = gensim.models.Word2Vec.load('/tmp/mymodel').

model[word]
e.g. model['computer'] returns that word's vector, a NumPy array.

model.wv.similar_by_word(self, word, topn=10, ...)
Query a word's k nearest neighbors, ranked by cosine similarity.
2.2 Some examples
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
# yields [('queen', 0.71382287), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'
model.wv.similarity('woman', 'man')
# 0.73723527
3. doc2vec
In word2vec, the corpus vocabulary is typically on the order of hundreds of thousands of words, so a new sentence rarely contains out-of-vocabulary words.
In doc2vec, however, a new document is by definition unseen, so gensim provides gensim.models.doc2vec.Doc2Vec#infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5),
which, once the model is trained, infers a vector representation for a new document.
Common classes and methods
gensim.similarities.docsim.SparseMatrixSimilarity(interfaces.SimilarityABC)
A class that measures similarity by cosine similarity.
4. tf-idf model
import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
# First, create a small corpus of 9 documents and 12 features
# a list of list of tuples
# see: https://radimrehurek.com/gensim/tut1.html
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
[(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
[(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
[(0, 1.0), (4, 2.0), (7, 1.0)],
[(3, 1.0), (5, 1.0), (6, 1.0)],
[(9, 1.0)],
[(9, 1.0), (10, 1.0)],
[(9, 1.0), (10, 1.0), (11, 1.0)],
[(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)
vec = [(0, 1), (4, 1)]
print(tfidf[vec])
# index over the 9 tf-idf document vectors, each with 12 features
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
"""
[(0, 0.8075244024440723), (4, 0.5898341626740045)]
# Document number zero (the first document) has a similarity score of 0.466=46.6%, the second document has a similarity score of 19.1% etc.
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
"""