Applying topic models in Python: LDA, a quick first try

Notes:

1. Data source: Web of Science (WoS) bibliographic records

2. Python reads the data stored in an Excel file

3. The text in the TI (title) and AB (abstract) fields is processed by sentence splitting, word tokenization, stopword removal, and lemmatization

4. Several libraries provide LDA, e.g. sklearn and gensim (the latter is used in this post). Note that both actually ship variational inference rather than Gibbs sampling: sklearn's LatentDirichletAllocation uses variational EM (batch or online), and gensim's LdaModel uses online variational Bayes; Gibbs-sampling MCMC implementations are found elsewhere, e.g. MALLET, which gensim can wrap. A minimal sklearn sketch follows below.
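For comparison with the gensim pipeline used in the rest of this post, here is a minimal sklearn sketch of my own; docs is a placeholder list of strings, not the WoS data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["first abstract ...", "second abstract ..."]  # placeholder documents
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)                       # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                  # rows: documents, columns: topic weights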

Working with the data in Excel

#read the Excel data
from pprint import pprint
import xlrd  # note: xlrd 2.x dropped .xlsx support; use xlrd<2.0 or the pandas alternative below
path = r"D:\02-1python\2020.08.11-lda\data\2010-2011\usa\us1.xlsx"  # adjust the path
data = xlrd.open_workbook(path)
#first column (title) and second column (abstract)
sheet_1_by_index = data.sheet_by_index(0)
title = sheet_1_by_index.col_values(0)
abstract = sheet_1_by_index.col_values(1)
n_of_rows = sheet_1_by_index.nrows
doc_set = []  # empty list to hold the documents
for i in range(1, n_of_rows):  # read row by row, skipping the header row
    doc_set.append(title[i] + '. ' + abstract[i])
doc_set[0]
'The impact of supply chain integration on performance: A contingency and configuration approach.This study extends the developing body of literature on supply chain integration (SCI), which is the degree to which a manufacturer strategically collaborates with its supply chain partners and collaboratively manages intra- and inter-organizational processes, in order to achieve effective and efficient flows of products and services, information, money and decisions, to provide maximum value to the customer. The previous research is inconsistent in its findings about the relationship between SCI and performance. We attribute this inconsistency to incomplete definitions of SCI, in particular, the tendency to focus on customer and supplier integration only, excluding the important central link of internal integration. We study the relationship between three dimensions of SCI, operational and business performance, from both a contingency and a configuration perspective. In applying the contingency approach, hierarchical regression was used to determine the impact of individual SCI dimensions (customer, supplier and internal integration) and their interactions on performance. In the configuration approach, cluster analysis was used to develop patterns of SCI, which were analyzed in terms of SCI strength and balance. Analysis of variance was used to examine the relationship between SCI pattern and performance. The findings of both the contingency and configuration approach indicated that SCI was related to both operational and business performance. Furthermore, the results indicated that internal and customer integration were more strongly related to improving performance than supplier integration. (C) 2009 Elsevier B.V. All rights reserved.'
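Since newer xlrd releases (2.x) no longer read .xlsx files, an equivalent load via pandas is a common fallback. This is a sketch of my own, not part of the original code; it assumes the first row is a header and the first two columns are title and abstract.

import pandas as pd
df = pd.read_excel(path, engine='openpyxl')  # requires the openpyxl package
#first column = title, second column = abstract; the header row is consumed by pandas
doc_set = (df.iloc[:, 0] + '. ' + df.iloc[:, 1]).tolist()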

The list can be used directly from here on; if you also want to save a .txt copy locally, see below.

#save as .txt to the given path
file_path = 'D:/02-1python/2020.08.11-lda/data/2010-2011/china/2695.txt'
with open(file_path, 'a') as file_handle:  # the .txt file need not exist beforehand; it is created automatically
    file_handle.write(str(doc_set[0:]))  # write the data
    file_handle.write('\n')  # when writing in a loop, append a newline so records do not run together
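If you would rather store one document per line (easier to reload later), a minimal variant of my own, assuming UTF-8 encoding:

with open(file_path, 'w', encoding='utf-8') as file_handle:
    for doc in doc_set:
        file_handle.write(doc + '\n')  # one document per line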

Data preprocessing

import nltk
#sentence tokenization
from nltk.tokenize import sent_tokenize
#word tokenization
from nltk.tokenize import word_tokenize
#stopword removal
from nltk.corpus import stopwords
#lemmatization
from nltk.stem import WordNetLemmatizer
#stemming
from nltk.stem.porter import PorterStemmer
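# note (my addition): word_tokenize, stopwords, and WordNetLemmatizer each rely on
# NLTK data packages; on a fresh environment they must be downloaded once (skip if present)
nltk.download('punkt')      # tokenizer models used by word_tokenize/sent_tokenize
nltk.download('stopwords')  # the stopword lists
nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer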
english_stopwords = stopwords.words("english")
#custom list of English punctuation marks
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*', "''"]
texts = []
#extra custom stopwords: words I felt should be removed, e.g. years and stray tokens
english_stopwords2 = ['c', 'also', '2009', '2010', '2011', "'s"]
#create the lemmatizer and stemmer once instead of once per word
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
#process each document
for doc in doc_set:
    #tokenize
    text_list = nltk.word_tokenize(doc)
    #remove standard English stopwords
    text_list0 = [word for word in text_list if word not in english_stopwords]
    #remove the custom stopwords defined above
    text_list1 = [word for word in text_list0 if word not in english_stopwords2]
    #remove punctuation
    text_list2 = [word for word in text_list1 if word not in english_punctuations]
    #lemmatize
    text_list3 = [lemmatizer.lemmatize(word) for word in text_list2]
    #stem
    text_list4 = [stemmer.stem(word) for word in text_list3]
    #the fully processed result is collected in texts
    texts.append(text_list4)
#count the documents (in my case, the number of papers)
M = len(texts)
print('Number of documents: %d' % M)
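As a quick sanity check on the pipeline (the exact tokens depend on your data), the first few processed tokens can be printed:

print(texts[0][:10])  # first 10 tokens of the first processed document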

LDA modeling

#build the document-term matrix with the gensim library
import gensim
from gensim import corpora
#build a dictionary containing all the processed tokens
dictionary = corpora.Dictionary(texts)
#build the document-term matrix as bag-of-words counts; TF-IDF could be applied on top (not done here, but see the sketch below)
corpus = [dictionary.doc2bow(text) for text in texts]
print('\nDocument-term matrix:')
#pprint(corpus)
pprint(corpus[0:19])
#for c in corpus:
    #print(c)
#convert the corpus into a dense term-document matrix (note: dense, not sparse; transpose for documents x terms)
from gensim.matutils import corpus2dense
corpus_matrix = corpus2dense(corpus, len(dictionary))
corpus_matrix.T

The transposed matrix looks something like [0 1 3 0 2 2; ...].
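The comment above mentions TF-IDF as an optional refinement; here is a minimal sketch with gensim's TfidfModel, my own addition and not used in the rest of this post:

from gensim import models
tfidf = models.TfidfModel(corpus)  # learn TF-IDF weights from the bag-of-words corpus
corpus_tfidf = tfidf[corpus]       # re-weighted corpus, usable in place of corpus below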

#use gensim to create the LDA model class
Lda = gensim.models.ldamodel.LdaModel
#run and train the LDA model on the document-term matrix
num_topics = 10  # number of topics; adjust as needed
ldamodel = Lda(corpus, num_topics=num_topics, id2word=dictionary, passes=100)  # tunable hyperparameters: number of topics, number of passes
doc_topic = [doc_t for doc_t in ldamodel[corpus]]
print('Document-topic matrix:\n')
#pprint(doc_topic)
pprint(doc_topic[0:19])
#for doc_topic in ldamodel.get_document_topics(corpus):
    #print(doc_topic)
print('Topic-word distributions:\n')
for topic_id in range(num_topics):
    print('Topic', topic_id)
    pprint(ldamodel.show_topic(topic_id))
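To read the document-topic output more easily, each document's dominant topic can be pulled out; this small helper is my own, not from the original:

for i, topics in enumerate(doc_topic[:5]):
    top_id, top_p = max(topics, key=lambda t: t[1])  # (topic_id, probability) pair with the largest probability
    print('doc %d -> topic %d (p=%.3f)' % (i, top_id, top_p))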

Which number of topics fits best? Coherence scores or perplexity can be used to decide; a scan over several topic counts is sketched after the code below.

#coherence score
print('Coherence score:\n')
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel,texts=texts,dictionary=dictionary,coherence='c_v')
coherence_lda=coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
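gensim also exposes perplexity as a per-word likelihood bound via log_perplexity. A sketch of my own that scans several candidate topic counts and reports both metrics (the candidate list and variable names are arbitrary):

from gensim.models import CoherenceModel
for k in [5, 10, 15, 20]:  # candidate numbers of topics
    model_k = Lda(corpus, num_topics=k, id2word=dictionary, passes=100)
    cm = CoherenceModel(model=model_k, texts=texts, dictionary=dictionary, coherence='c_v')
    print('k=%d coherence=%.4f log perplexity=%.4f'
          % (k, cm.get_coherence(), model_k.log_perplexity(corpus)))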
