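Implementing an LDA model easily with Python gensim.
gensim is a free Python library that efficiently and automatically extracts semantic topics from documents. Its algorithms include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections). They discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a training corpus. These algorithms are unsupervised and work on raw, unstructured text ("plain text").
Gensim: implementation language: Python. Implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning models word2vec and paragraph2vec.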
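gensim features
- Memory independence: the training corpus never has to reside in RAM in its entirety at any one time
- Efficient implementations of several popular vector space algorithms, including tf-idf, distributed LSA, distributed LDA, and RP; new algorithms are easy to add
- I/O wrappers and converters for popular data formats
- Similarity queries over documents in their semantic representation
- gensim was created because there was no simple, scalable software framework for topic modeling (the Java ones are complicated).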
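gensim design principles
- A simple interface with a low learning curve, convenient for prototyping
- Memory independence with respect to the size of the input corpus: the algorithms are stream-based and access one document at a time.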
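The "one document at a time" principle can be illustrated without gensim at all. The following is a minimal sketch, assuming a text file with one document per line; the class `StreamedCorpus` and the `token2id` mapping are hypothetical names made up for this illustration and are not part of gensim.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one sparse bag-of-words vector per line of a text file.

    Hypothetical helper for illustration only; it mimics a streamed corpus
    and is not a gensim class.
    """

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # only the current line is held in memory
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                yield sorted(counts.items())


# A tiny demo file standing in for a corpus too large to load at once.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("human computer interface\n")
    tmp.write("graph of trees and graph minors\n")
    demo_path = tmp.name

token2id = {"human": 0, "computer": 1, "interface": 2, "graph": 3, "trees": 4, "minors": 5}
streamed_vectors = list(StreamedCorpus(demo_path, token2id))
os.remove(demo_path)
print(streamed_vectors)
```

Because the object can be iterated repeatedly, an algorithm that needs several passes over the data can simply loop over it again instead of keeping every document in memory.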
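gensim core concepts
The gensim package revolves around three concepts: corpus, vector, and model.
- Corpus
In the vector space model (VSM), each document is represented by an array of features. For example, a single feature can be thought of as a question-answer pair:
[1]. How many times does the word "splonge" appear in the document? 0
[2]. How many sentences does the document contain? 2
[3]. How many fonts does the document use? 5
The questions can be represented by integer ids (for example 1, 2, 3), so the document above can be written as (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be regarded as a vector (here, a 3-dimensional one). For practical purposes, only questions whose answer is a single real number are allowed. The questions are the same for every document. Given two vectors (representing two documents), we therefore hope to draw conclusions such as "if the real numbers in the two vectors are similar, the original documents can be considered similar too". Of course, whether such a conclusion holds depends on how we choose the questions.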
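To make "similar numbers imply similar documents" concrete, here is a minimal sketch that compares dense answer vectors with cosine similarity. Cosine similarity is only one possible choice of measure and is an assumption of this example; `doc_b` and `doc_c` are made-up vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


doc_a = (0.0, 2.0, 5.0)  # the document from the text
doc_b = (0.0, 3.0, 6.0)  # hypothetical document with similar answers
doc_c = (9.0, 0.0, 0.0)  # hypothetical document with very different answers

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Here `doc_a` and `doc_b` score close to 1, while `doc_a` and `doc_c` share no non-zero feature and score 0.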
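Sparse vector
Typically, most answers are 0.0. To save space, we omit them from the document representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) has been dropped). Since the full set of questions is known in advance, every feature missing from a sparse representation can be taken to be 0.0.
What is special about gensim is that it does not prescribe any particular corpus format: a corpus can be anything that yields sparse vectors one after another when iterated over. For example, the collection ([(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]) is a corpus of two documents, each with two non-zero pairs.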
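The conversion between the dense and the sparse form can be sketched in a few lines. The helpers `to_sparse` and `to_dense` are hypothetical names for this illustration (not gensim functions), and question ids start at 1 as in the text.

```python
def to_sparse(dense):
    """Keep only the non-zero answers as (question_id, answer) pairs."""
    return [(qid, value) for qid, value in enumerate(dense, start=1) if value != 0.0]


def to_dense(sparse, num_questions):
    """Rebuild the full answer vector; every missing question is 0.0."""
    dense = [0.0] * num_questions
    for qid, value in sparse:
        dense[qid - 1] = value
    return dense


sparse_doc = to_sparse([0.0, 2.0, 5.0])
dense_doc = to_dense(sparse_doc, 3)
print(sparse_doc, dense_doc)

# Any iterable of sparse vectors can act as a corpus.
corpus = ([(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)])
nonzero_counts = [len(doc) for doc in corpus]
print(nonzero_counts)
```

The round trip only works because the number of questions is known in advance, which is exactly the condition stated above.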
Then install gensim: pip install gensim
Vector representation of documents: Corpora and Vector Spaces
from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

"""
# use StemmedCountVectorizer to get a stemmed corpus without stop words
Vectorizer = StemmedCountVectorizer
# Vectorizer = CountVectorizer
vectorizer = Vectorizer(stop_words='english')
vectorizer.fit_transform(documents)
texts = vectorizer.get_feature_names()
# print(texts)
"""
texts = [doc.lower().split() for doc in documents]
# print(texts)
dictionary = corpora.Dictionary(texts)  # build our own dictionary
# print(dictionary, dictionary.token2id)
# use the dictionary to turn each tokenized document into a vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
topics = [lda_model[c] for c in corpus_tfidf]  # not recommended for large numbers of queries (too slow); only suitable for small collections
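What the dictionary and bag-of-words step produces can be imitated in plain Python. This is a minimal sketch, not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the ids gensim assigns to tokens may differ from the ones shown here.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return the sorted sparse vector of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


sample_docs = ["The EPS user interface management system",
               "System and human system engineering testing of EPS"]
sample_texts = [doc.lower().split() for doc in sample_docs]
sample_token2id = build_token2id(sample_texts)
sample_corpus = [to_bow(tokens, sample_token2id) for tokens in sample_texts]
print(sample_token2id)
print(sample_corpus)
```

In the second document "system" occurs twice, so its pair carries a count of 2; every token that does not occur in a document is simply absent from that document's vector.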
Using the gensim Python extension package
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
# code is far away from bugs with the god animal protecting
from Colors import *
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime


def load_texts(dataset_type='train', groups=None):
    """
    load datasets to bytes list
    :return:train_dataset_bunch.data bytes list
    """
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # for small-data tests only, #1368
    elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'comp.windows.x', 'sci.space']  # for medium-sized data, #3414
    # NOTE: datasets.load_mlcomp was removed in scikit-learn 0.21, so this call needs an older release
    train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
                                               categories=groups)  # 13180
    return train_dataset_bunch.data


def preprocess_texts(texts, test_doc_id=1):
    """
    texts preprocessing
    :param texts: bytes list
    :return: list of stemmed token lists
    """
    texts = [t.decode(errors='ignore') for t in texts]  # bytes2str
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n',texts[test_doc_id])

    # split_texts = [t.lower().split() for t in texts]
    # print(REDH, 'split texts[%d]: #%d' % (test_doc_id, len(split_texts)), DEFAULT, '\n',split_texts[test_doc_id])

    # lower str & split str 2 word list with sep=... & delete None
    SEPS = r'[\s()-/,:.?!]\s*'
    texts = [re.split(SEPS, t.lower()) for t in texts]
    for t in texts:
        while '' in t:
            t.remove('')  # drop the empty strings that re.split leaves behind
    # print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d' % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n',texts[test_doc_id])

    # nltk.download() #then choose the corpus.stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))  # #127
    stopwords.update(['from', 'subject', 'writes'])  # #129
    word_usage = defaultdict(int)
    for t in texts:
        for w in t:
            word_usage[w] += 1
    COMMON_LINE = len(texts) / 10
    # iterate over the whole vocabulary, not just the last document left in t by the loop above
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]  # set(too_common_words)
    # print('too_common_words: #', len(too_common_words), '\n', too_common_words) #68
    # print('stopwords: #', len(stopwords), '\n', stopwords) # #147

    english_stemmer = nltk.SnowballStemmer('english')
    MIN_WORD_LEN = 3  # 4
    texts = [[english_stemmer.stem(w) for w in t if
              not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN] for t in
             texts]  # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] delete ^alphanum & stopwords & len<%d & stemmed: #' % (test_doc_id, MIN_WORD_LEN),
    # len(texts[test_doc_id]), DEFAULT, '\n', texts[test_doc_id])
    return texts


def build_corpus(texts):
    """
    build corpora
    :param texts: list of token lists
    :return: corpus DirectTextCorpus(corpora.TextCorpus)
    """

    class DirectTextCorpus(corpora.TextCorpus):
        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DirectTextCorpus(texts)
    return corpus


def build_id2word(corpus):
    """
    from corpus build id2word=dict
    :param corpus:
    :return:dict = corpus.dictionary
    """
    dict = corpus.dictionary  # gensim.corpora.dictionary.Dictionary
    # print(dict.id2token)
    # print("dict.id2token is not {} now")
    # print(dict.id2token)
    return dict


def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict.save(dictDir)  # counterpart of corpora.Dictionary.load in load_corpus_dict
    print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
    corpora.MmCorpus.serialize(corpusDir, corpus)
    print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
    # corpus.save(fname='./LDA/corpus.mm') # stores only the (tiny) iteration object


def load_ldamodel(modelDir='./lda.pkl'):
    model = models.LdaModel.load(fname=modelDir)
    print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
    return model


def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict = corpora.Dictionary.load(fname=dictDir)
    print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
    # dict = corpora.Dictionary.load_from_text('./id_word.txt')
    corpus = corpora.MmCorpus(corpusDir)  # corpora.mmcorpus.MmCorpus
    print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
    return dict, corpus


def build_doc_word_mat(corpus, model, num_topics):
    """
    build doc_word_mat in topic space
    :param corpus:
    :param model:
    :param num_topics: int
    :return:doc_word_mat np.array (len(topics) * num_topics)
    """
    topics = [model[c] for c in corpus]  # one (topic_id, weight) list per document
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
    return doc_word_mat


def compute_pairwise_dist(doc_word_mat):
    """
    compute pairwise dist
    :param doc_word_mat: np.array (len(topics) * num_topics)
    :return:pairwise_dist <class 'numpy.ndarray'>
    """
    pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
    # put a value larger than any real distance on the diagonal so a document is never its own nearest neighbour
    max_weight = pairwise_dist.max() + 1
    for i in range(len(pairwise_dist)):
        pairwise_dist[i, i] = max_weight
    return pairwise_dist


def closest_texts(corpus, model, num_topics, test_doc_id=1, topn=5, original_texts=None):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param test_doc_id:
    :param topn:
    :param original_texts: raw documents, indexed like corpus, printed next to each id (optional)
    """
    doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
    pairwise_dist = compute_pairwise_dist(doc_word_mat)
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
    closest_doc_ids = pairwise_dist[test_doc_id].argsort()
    # return closest_doc_ids[:topn]
    for closest_doc_id in closest_doc_ids[:topn]:
        text = original_texts[closest_doc_id] if original_texts is not None else ''
        print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', text)
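

def evaluate_model(model):
    """
    compute the model's perplexity on the test data
    :param model:
    :return:model.log_perplexity float
    """
    test_texts = load_texts(dataset_type='test', groups='small')
    test_texts = preprocess_texts(test_texts)
    # NOTE: build_corpus gives the test corpus its own dictionary, so its token ids
    # are not guaranteed to match the ids the model was trained on
    test_corpus = build_corpus(test_texts)
    return model.log_perplexity(test_corpus)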


def test_num_topics():
    dict, corpus = load_corpus_dict()
    print("#corpus_items:", len(corpus))
    for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
        start_time = datetime.datetime.now()
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
        end_time = datetime.datetime.now()
        print("total running time = ", end_time - start_time)
        print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics, evaluate_model(model),
              DEFAULT)


def test():
    texts = load_texts(dataset_type='train', groups='small')
    original_texts = texts
    test_doc_id = 1

    # texts = preprocess_texts(texts, test_doc_id=test_doc_id)
    # corpus = build_corpus(texts=texts) # corpus DirectTextCorpus(corpora.TextCorpus)
    # dict = build_id2word(corpus)
    # save_corpus_dict(dict, corpus)
    dict, corpus = load_corpus_dict()
    # print(len(corpus))

    num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ from run to run
    # model.save(fname='./lda.pkl')
    # model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, test_doc_id=1, topn=3)
    print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)


if __name__ == '__main__':
    test()
    # test_num_topics()
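
The nearest-document search in the script (topic matrix, pairwise distances, sort) can be followed on a toy example using only the standard library. This is a minimal sketch under stated assumptions: the topic distributions below are invented, and `to_dense_topics` and `closest_doc_ids` are hypothetical helpers rather than functions from the script or from gensim.

```python
import math


def to_dense_topics(sparse_topics, num_topics):
    """Expand a list of (topic_id, weight) pairs into a dense topic vector."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_topics:
        dense[topic_id] += weight
    return dense


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_doc_ids(doc_topics, query_id, topn=2):
    """Return the ids of the topn documents nearest to doc_topics[query_id]."""
    query = doc_topics[query_id]
    distances = []
    for doc_id, vector in enumerate(doc_topics):
        if doc_id == query_id:
            continue  # skip the query itself, the same effect as masking the diagonal
        distances.append((euclidean(query, vector), doc_id))
    return [doc_id for _, doc_id in sorted(distances)[:topn]]


# Invented per-document topic distributions over 3 topics.
sparse_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.8), (1, 0.2)],
    [(2, 1.0)],
    [(1, 0.5), (2, 0.5)],
]
doc_topics = [to_dense_topics(doc, 3) for doc in sparse_topics]
nearest = closest_doc_ids(doc_topics, query_id=0, topn=2)
print(nearest)
```

Document 1 has almost the same topic mix as document 0, so it comes first; document 3 shares a little of topic 1 and comes second.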