Text Recommendation with gensim Topic Models

阿新 • Published: 2018-12-22

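Recommending Chinese texts with the gensim package

I. gensim is short for "generate similar". Beginners are advised to install the package through the Anaconda distribution.

II. Key gensim modules for text recommendation

1. corpora: the corpus module (converts text documents into bag-of-words document vectors based on token counts, which can then be reweighted with TF-IDF)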

from gensim import corpora

import jieba

sentences = ["我喜歡吃土豆","土豆是個百搭的東西","我不喜歡今天霧霾的北京"]

words = []
for doc in sentences:
    words.append(list(jieba.cut(doc)))  # tokenize each sentence
print(words)

dictionary = corpora.Dictionary(words)
print(dictionary)
print(dictionary.token2id)

{'\xe5\x8c\x97\xe4\xba\xac': 12, '\xe6\x90\xad': 6, '\xe7\x9a\x84': 9, '\xe5\x96\x9c\xe6\xac\xa2': 1, '\xe4\xb8\x8d': 10, '\xe4\xb8\x9c\xe8\xa5\xbf': 4, '\xe5\x9c\x9f\xe8\xb1\x86': 2, '\xe9\x9c\xbe': 14, '\xe6\x98\xaf': 7, '\xe4\xb8\xaa': 5, '\xe9\x9b\xbe': 13, '\xe7\x99\xbe': 8, '\xe4\xbb\x8a\xe5\xa4\xa9': 11, '\xe6\x88\x91': 3, '\xe5\x90\x83': 0}

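# The output above comes from Python 2, which prints each token as escaped UTF-8 bytes. The token '\xe5\x8c\x97\xe4\xba\xac' (北京) has ID 12, '\xe4\xbb\x8a\xe5\xa4\xa9' (今天) has ID 11, and so on. There are 15 tokens in total, with IDs from 0 to 14.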

corpus = [dictionary.doc2bow(text) for text in words]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1)], [(2, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(1, 1), (3, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]]

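# This is a nested list. The first sublist [(0, 1), (1, 1), (2, 1), (3, 1)] is the document vector of the first string in sentences, the second sublist is the document vector of the second string, and so on. Within a sublist, a pair such as (0, 1) means that the token with ID 0 occurs once.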
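To make the (token ID, count) pairs concrete, here is a minimal pure-Python sketch of what a dictionary plus a bag-of-words conversion produce. The helper names build_token2id and to_bow are hypothetical, the documents are assumed to be tokenized already, and IDs are assigned in order of first appearance, which may differ from the IDs gensim assigns.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign each distinct token an integer ID, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical pre-tokenized documents (English stand-ins for jieba output)
docs = [["i", "like", "potato"], ["potato", "goes", "with", "potato"]]
token2id = build_token2id(docs)
bows = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bows)
```

Tokens that are missing from the mapping are silently skipped in to_bow, which is also how an unseen word in a new document would be handled in this sketch.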

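# Apply the TF-IDF transformation to corpus

from gensim import models

tfidf = models.TfidfModel(corpus)

corpus_tfidf = tfidf[corpus]

for doc in corpus_tfidf:
    print(doc)

# Output of printing corpus_tfidf. Each list below corresponds to one sublist of corpus; for example, (0, 0.8425587958192721) in the first list means that the token with ID 0 has a TF-IDF weight of 0.8425587958192721.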

[(0, 0.8425587958192721), (1, 0.3109633824035548), (2, 0.3109633824035548), (3, 0.3109633824035548)]
[(2, 0.16073253746956623), (4, 0.4355066251613605), (5, 0.4355066251613605), (6, 0.4355066251613605), (7, 0.4355066251613605), (8, 0.4355066251613605), (9, 0.16073253746956623)]
[(1, 0.1586956620869655), (3, 0.1586956620869655), (9, 0.1586956620869655), (10, 0.42998768831312806), (11, 0.42998768831312806), (12, 0.42998768831312806), (13, 0.42998768831312806), (14, 0.42998768831312806)]

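Example: apply the TF-IDF transformation to a new document vector, vector

vector = [(0, 1), (4, 1)]

vector_tfidf = tfidf[vector]

# vector_tfidf is a single document vector, so print it as a whole
print(vector_tfidf)

# Output of printing vector_tfidf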

[(0, 0.7071067811865475), (4, 0.7071067811865475)]

# tfidf[corpus] applies the TF-IDF transformation to the whole corpus; tfidf[vector] applies it to a single new document vector.
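The weights above can be reproduced by hand. The following is a sketch only, assuming the weighting that gensim's TfidfModel applies by default: raw count times log2(N / df), followed by scaling each vector to unit Euclidean length. The helper names fit_idf and apply_tfidf are hypothetical; the corpus is the bag-of-words output listed earlier.

```python
import math

def fit_idf(corpus):
    """Compute idf = log2(N / df) for every token ID seen in the corpus."""
    n_docs = len(corpus)
    df = {}
    for bow in corpus:
        for token_id, _count in bow:
            df[token_id] = df.get(token_id, 0) + 1
    return {token_id: math.log2(n_docs / d) for token_id, d in df.items()}

def apply_tfidf(bow, idf):
    """Weight each count by its idf, then scale the vector to unit length."""
    weighted = [(tid, count * idf[tid]) for tid, count in bow if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _tid, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

# Bag-of-words corpus copied from the output shown earlier in this post
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(2, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
    [(1, 1), (3, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
]
idf = fit_idf(corpus)
corpus_tfidf = [apply_tfidf(bow, idf) for bow in corpus]
vector_tfidf = apply_tfidf([(0, 1), (4, 1)], idf)
print(corpus_tfidf[0])
print(vector_tfidf)
```

Token 0 appears in only one of the three documents, so it gets the largest idf and dominates the first vector, while tokens shared by two documents are weighted down.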

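2. models: the model module (topic extraction and training)

With the TF-IDF vectors from above, we can train a model and extract a given number of topics from the training set.

Two models are available for topic extraction: LSI and LDA.

Taking LSI as an example: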

from gensim import models

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# lsi extracts the top 2 topics from corpus_tfidf

# Project every document in the corpus onto these two topics
# (the model was trained on TF-IDF vectors, so it is applied to corpus_tfidf)

corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi:
    print(doc)

[(0, -0.70861576320682107), (1, 0.1431958007198823)]
[(0, -0.42764142348481798), (1, -0.88527674470703799)]
[(0, -0.66124862582594512), (1, 0.4190711252114323)]
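As a rough illustration of where these numbers come from, the sketch below recovers each document's weight on the first topic with plain power iteration, using the fact that LSI is a truncated SVD of the term-document matrix. This is not how gensim computes it internally; the function names are hypothetical, and the sign of an SVD component is arbitrary, so the values match the first column above only up to sign.

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as {token_id: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tid, 0.0) for tid, w in a.items())

def top_topic_coordinates(docs, iterations=500):
    """Power iteration on the document Gram matrix G = A^T A.

    The leading eigenvector of G is the leading right singular vector of the
    term-document matrix A; multiplied by the leading singular value it gives
    each document's coordinate on topic 0.
    """
    n = len(docs)
    gram = [[dot(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    eigenvalue = 1.0
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    sigma = math.sqrt(eigenvalue)  # singular value = sqrt of the eigenvalue of G
    return [sigma * x for x in v]

# TF-IDF vectors copied from the output shown earlier in this post
docs = [
    dict([(0, 0.8425587958192721), (1, 0.3109633824035548),
          (2, 0.3109633824035548), (3, 0.3109633824035548)]),
    dict([(2, 0.16073253746956623), (4, 0.4355066251613605), (5, 0.4355066251613605),
          (6, 0.4355066251613605), (7, 0.4355066251613605), (8, 0.4355066251613605),
          (9, 0.16073253746956623)]),
    dict([(1, 0.1586956620869655), (3, 0.1586956620869655), (9, 0.1586956620869655),
          (10, 0.42998768831312806), (11, 0.42998768831312806), (12, 0.42998768831312806),
          (13, 0.42998768831312806), (14, 0.42998768831312806)]),
]
coords = top_topic_coordinates(docs)
print(coords)
```

Working on the small document-by-document Gram matrix keeps the example dependency-free; a real corpus would call for a proper truncated SVD routine.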

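Example: project a new document vector onto the two topics

vector = [(13, 1), (14, 1)]

vector_lsi = lsi[vector]

# The weight of vector (a single document vector) on each of the two topics.
# Note: lsi was trained on TF-IDF vectors, so passing tfidf[vector] instead would keep the input in the same space.

print(vector_lsi)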

[(0, 0.50670602027401368), (1, -0.3678056037187441)]

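3. similarities: the similarity module. It can (1) compute the similarity between a new text and every text in the training set after topic extraction, or

(2) compute the similarity between a new text and every text in the training set directly.

(1) Text similarity based on the topic model (LSI)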

from gensim import similarities

index = similarities.MatrixSimilarity(lsi[corpus_tfidf])  # build the index in LSI space

sims = index[lsi[tfidf[vector]]]  # similarity between vector and every document in corpus

for sim in sims:
    print(sim)

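(2) Similarity between a new text and every text in the training set, computed directly (without topic extraction)

index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))  # 15 features in this example

sims = index[tfidf[vector]]

for sim in sims:
    print(sim)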
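Both indexes return cosine similarities. As a minimal sketch of that measure, the code below compares sparse (token ID, weight) vectors directly and ranks the documents. The function names cosine and rank are hypothetical, and the example uses the raw bag-of-words corpus from earlier; the same functions would work on TF-IDF or LSI vectors.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(tid, 0.0) for tid, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, corpus_vectors):
    """Return (document index, similarity) pairs, most similar first."""
    sims = [cosine(query, doc) for doc in corpus_vectors]
    return sorted(enumerate(sims), key=lambda item: -item[1])

# Bag-of-words corpus copied from the output shown earlier in this post
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(2, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
    [(1, 1), (3, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
]
query = [(13, 1), (14, 1)]
ranked = rank(query, corpus)
print(ranked)
```

Only the third document shares tokens with the query, so it is ranked first and the other two get a similarity of zero.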

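III. Topic-based text recommendation

Main steps:

Some preparation is needed first:

Write the code in the Anaconda environment

Load the stop word list (stop_words)

Set the default encoding to UTF-8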

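1. Read the list of all text files (files_list)

2. Read the content of each text file (*.txt) and write it as a single line into one large file (full.txt)

[Note: this produces the training set]

3. Read the large file line by line, tokenize each line, remove stop words, digits and letters, and store the result in a nested list (texts) [Note: preprocessing of the training set]

4. Remove tokens that occur only once (frequency = 1)

[Note: preprocessing of the training set]

5. Build the dictionary (the set of all tokens, each with a unique ID), convert the nested list (texts) into a matrix of document vectors based on token counts (corpus), then convert that matrix into document vectors with TF-IDF weights

[Note: the training documents are converted into two vector forms so that they can be used in numerical computation (preprocessing)]

6. Topic extraction (extract a given number of topics from all texts, i.e. the training set)

[Two important models: (1) LSI (uses singular value decomposition, SVD); (2) LDA (a probabilistic generative model, typically fitted with variational inference or Gibbs sampling)]

7. Compute similarities based on the topics

[Two views: (1) a new text, or any text in the training set, has a weight on every topic; (2) given a new text, compute its similarity to the texts in the training set]

8. Given a new text, compute its similarity to every text in the training set, rank the results, and recommend the most similar texts to the reader.

[For example, recommend the 10 most similar texts]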
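Steps 3 and 4 can be sketched without any external library. This is a minimal sketch under assumptions: the documents are already segmented (stand-ins for jieba.cut output), the stop word set is made up for illustration, and the helper names clean_tokens and drop_singletons are hypothetical. Like the code example in this post, it also drops single-character tokens.

```python
import string
from collections import Counter

PRINTABLE = set(string.printable)

def clean_tokens(tokens, stop_words):
    """Step 3: drop stop words, single-character tokens and ASCII-only tokens
    (digits, letters, punctuation)."""
    return [t for t in tokens
            if t not in stop_words and len(t) > 1 and not (set(t) <= PRINTABLE)]

def drop_singletons(texts):
    """Step 4: remove tokens whose total frequency across all texts is 1."""
    frequency = Counter(token for text in texts for token in text)
    return [[t for t in text if frequency[t] > 1] for text in texts]

# Hypothetical, already-segmented documents
raw = [
    ["我", "喜欢", "土豆", "2016"],
    ["土豆", "是", "百搭", "的", "东西", "abc"],
    ["我", "不", "喜欢", "雾霾"],
]
stop_words = {"的", "是"}  # illustrative stop word set
texts = drop_singletons([clean_tokens(doc, stop_words) for doc in raw])
print(texts)
```

On a corpus this small, removing singletons leaves very few tokens; with a realistic training set the filter mainly removes rare noise words.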

Code example

# -*- coding: utf-8 -*-
"""
Created on Fri Jun 24 17:37:33 2016

@author: yongsheng
"""
from __future__ import division
import io
import os
import string
import logging
from collections import defaultdict

import jieba
from gensim import models, corpora, similarities

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
printable = set(string.printable)

# Load the stop word list (raw strings keep the Windows backslashes literal)
with io.open(r'C:\Users\yongsheng\Desktop\stop_word.txt', 'r', encoding='utf-8') as stop_file_read:
    stop_word = set(stop_file_read.read().split('\n'))

# Step 1: collect the paths of all .txt files in the 'files' directory
filelist = []
for file in os.listdir('files'):
    if file.endswith('.txt'):
        filelist.append(os.path.join('files', file))

# Step 2: write every file as one line "<filename>,<content>" into full.txt
# (the output file is opened once, so earlier documents are not overwritten)
full_path = r'C:\Users\yongsheng\Desktop\full.txt'
with io.open(full_path, 'w', encoding='utf-8') as f:
    for file in filelist:
        filename = os.path.basename(file)
        lines = []
        with io.open(file, 'r', encoding='utf-8') as fd:
            for line in fd:
                line = line.replace('\n', ' ')
                line = line.replace('-', '')
                line = line.replace(',', ' ')
                lines.append(line)
        f.write(filename + ',' + ' '.join(lines) + '\n')

# Step 3: tokenize each line; drop stop words, single characters and
# ASCII-only tokens (digits, letters)
texts = []
textname = []
with io.open(full_path, 'r', encoding='utf-8') as fl:
    for line in fl:
        _filename, _content = line.rstrip('\n').split(',', 1)
        seg_list = jieba.cut(_content, cut_all=False)
        seg_list_str = []
        for seg in seg_list:
            if seg not in stop_word and len(seg) > 1 and not (set(seg) <= printable):
                seg_list_str.append(seg)
        texts.append(seg_list_str)
        textname.append(_filename)

# Step 4: remove tokens that occur only once in the whole training set
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]

# Steps 5 to 7: dictionary, bag-of-words corpus, TF-IDF, LSI topics, similarity index
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=30)
lsi_corpus = lsi[corpus_tfidf]  # the model was trained on TF-IDF vectors
index = similarities.MatrixSimilarity(lsi_corpus)

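Suppose the query article is vec, a document vector in the same representation as the vectors passed to lsi when the index was built:
sims = index[lsi[vec]]
# Step 8: rank the training texts by similarity and keep the top 10
simsorts = sorted(enumerate(sims), key=lambda item: -item[1])
sims_filename = []
sims_filecontext = []
for item in simsorts[0:10]:
    sims_filename.append(textname[item[0]])
    sims_filecontext.append(texts[item[0]])
simsten = [' '.join(text) for text in sims_filecontext]
print(simsten)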

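This post draws on another blogger's article, with my own explanation of each step added. For lack of time, many parts are not explained clearly; I will fill them in gradually.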
