A quick guide to the tf-idf model, word2vec, and doc2vec in gensim



gensim is a Python NLP library. It includes a tf-idf model, word2vec, doc2vec, and more.


2.1 Classes and methods

  • gensim.models.word2vec.Word2Vec(utils.SaveLoad) 
    A class for training, using, and evaluating word2vec models.

    __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, ...) 
    sentences: a list whose elements are sentences. Each sentence is itself a list of the form [word1, word2, …, word_n]. 

    size: the dimensionality of the feature vectors. 
    window: the maximum distance between the current and predicted word within a sentence. 
    alpha: the initial learning rate. 
    seed: for the random number generator 
    min_count: ignore all words with total frequency lower than this.

  • save(self, *args, **kwargs) 
    Persists the model, e.g. model.save('/tmp/mymodel').
  • @classmethod load(cls, *args, **kwargs) 
    Deserializes a persisted model, e.g. new_model = gensim.models.Word2Vec.load('/tmp/mymodel').
  • model[word] 
    For example, model['computer'] returns the vector for that word as a NumPy array.
  • model.wv.similar_by_word(self, word, topn=10, …) 
    Finds a word's k nearest neighbors, ranked by cosine similarity.
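
To make the `sentences` format and the `min_count` parameter of the constructor more concrete, here is a minimal pure-Python sketch. It does not call gensim: the helper name `surviving_vocab` and the toy corpus are made up for illustration, and the code only mimics the documented rule that words with a total frequency below `min_count` are ignored.

```python
from collections import Counter

# Each sentence is a list of word tokens: [word1, word2, ..., word_n].
sentences = [
    ["human", "machine", "interface"],
    ["human", "computer", "interaction"],
    ["computer", "system", "interface"],
]


def surviving_vocab(sentences, min_count):
    """Return the words whose total corpus frequency is at least min_count.

    Hypothetical helper for illustration only; it is not part of gensim.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


vocab = surviving_vocab(sentences, min_count=2)
print(sorted(vocab))
```

With the default `min_count=5`, a corpus this small would leave no words at all, which is why toy experiments usually lower the threshold.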


, 'king'], negative=['man'])  # returns [('queen', 0.71382287), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())  # 'cereal'
model.wv.similarity('woman', 'man')  # 0.73723527
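
The neighbor lookup and similarity scores above both rest on cosine similarity. The following is a rough sketch of that idea in plain Python, not gensim's implementation: the three-dimensional vectors are invented by hand, and `cosine` and `nearest` are hypothetical helper names.

```python
import math

# Toy vectors written by hand; real word2vec vectors are learned from a corpus.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbors = nearest("king", topn=2)
print(neighbors)
```

In this toy data "queen" ranks above "apple" for the query "king" because its vector points in almost the same direction.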
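In word2vec, the corpus vocabulary usually contains well over a hundred thousand words, so the words in a new sentence are rarely out of vocabulary. 
In doc2vec, by contrast, a new document is always unseen. For this case gensim provides the method 
gensim.models.doc2vec.Doc2Vec#infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5) 
which, once the model has been trained, infers the vector representation of a new document.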


  • gensim.similarities.docsim.SparseMatrixSimilarity(interfaces.SimilarityABC) 
    A class that measures similarity using cosine similarity.

4. tf-idf model

import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora, models, similarities

# First, create a small corpus of 9 documents and 12 features
# a list of list of tuples
# see: https://radimrehurek.com/gensim/tut1.html
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
           [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
           [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
           [(0, 1.0), (4, 2.0), (7, 1.0)],
           [(3, 1.0), (5, 1.0), (6, 1.0)],
           [(9, 1.0)],
           [(9, 1.0), (10, 1.0)],
           [(9, 1.0), (10, 1.0), (11, 1.0)],
           [(8, 1.0), (10, 1.0), (11, 1.0)]]

tfidf = models.TfidfModel(corpus)

# Transform a new bag-of-words vector into tf-idf space
vec = [(0, 1), (4, 1)]
print(tfidf[vec])
# [(0, 0.8075244024440723), (4, 0.5898341626740045)]

# Index the tf-idf corpus as a sparse matrix, shape = 9 x 12
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
# Document number zero (the first document) has a similarity score of 0.466=46.6%, the second document has a similarity score of 19.1% etc.
# [(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
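
To see where these numbers come from, the sketch below recomputes them by hand in plain Python, without gensim. It assumes the weighting that gensim's TfidfModel is understood to use by default (raw term frequency, idf = log2(N / df), then L2 normalization). The helper `tfidf_weights` is a hypothetical name, and the code is an illustration of the arithmetic, not gensim's implementation.

```python
import math

# Same bag-of-words corpus as above: 9 documents, 12 features.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Document frequency: in how many documents does each term id occur?
doc_freq = {}
for doc in corpus:
    for term, _ in doc:
        doc_freq[term] = doc_freq.get(term, 0) + 1


def tfidf_weights(doc, doc_freq, n_docs):
    """Weight each term by tf * log2(N / df), then scale to unit length.

    Assumed weighting scheme; terms never seen in the corpus are dropped.
    """
    weighted = {term: tf * math.log2(n_docs / doc_freq[term])
                for term, tf in doc if term in doc_freq}
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {term: w / norm for term, w in weighted.items()} if norm else {}


query = tfidf_weights([(0, 1), (4, 1)], doc_freq, len(corpus))

# Both vectors have unit length, so their dot product is the cosine similarity.
sims = []
for doc in corpus:
    weights = tfidf_weights(doc, doc_freq, len(corpus))
    sims.append(sum(w * weights.get(term, 0.0) for term, w in query.items()))

print(query)
print(list(enumerate(sims)))
```

Under these assumptions the query weights and the nine similarity scores agree with the gensim output above to about four decimal places. The small differences come from gensim storing the index as 32-bit floats.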