Applying a topic model in Python: LDA, a first try
阿新 · Published 2020-08-24
Notes:
1. Data source: bibliographic records from Web of Science (WoS).
2. Python reads the data stored in an Excel file.
3. The text in the TI (title) and AB (abstract) fields is analyzed via sentence segmentation, tokenization, stop-word removal, and lemmatization.
4. Libraries offering LDA include sklearn and gensim (the one used in this post). Both sklearn's LatentDirichletAllocation and gensim's LdaModel are based on (online) variational Bayes, EM-style inference; Gibbs-sampling (MCMC) implementations are found in other packages such as MALLET. A sketch of the sklearn route follows these notes.
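For comparison, here is a minimal sketch of the sklearn route (not used in this post). It assumes doc_set is the list of raw document strings built below, and the number of topics (10) is illustrative:

# sklearn alternative: bag-of-words counts + variational-Bayes LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
vectorizer = CountVectorizer(stop_words='english')  # document-term count matrix
X = vectorizer.fit_transform(doc_set)
lda = LatentDirichletAllocation(n_components=10, random_state=0)  # 10 topics, illustrative
doc_topic = lda.fit_transform(X)  # documents x topics distribution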
Working with the data in Excel
# read the Excel data
from pprint import pprint
import xlrd

path = r"D:\02-1python\2020.08.11-lda\data\2010-2011\usa\us1.xlsx"  # adjust the path
data = xlrd.open_workbook(path)
# first column: titles; second column: abstracts
sheet_1_by_index = data.sheet_by_index(0)
title = sheet_1_by_index.col_values(0)
abstract = sheet_1_by_index.col_values(1)
n_of_rows = sheet_1_by_index.nrows
doc_set = []  # empty list to hold the documents
for i in range(1, n_of_rows):  # read row by row, skipping the header row
    doc_set.append(title[i] + '. ' + abstract[i])
doc_set[0]
'The impact of supply chain integration on performance: A contingency and configuration approach.This study extends the developing body of literature on supply chain integration (SCI), which is the degree to which a manufacturer strategically collaborates with its supply chain partners and collaboratively manages intra- and inter-organizational processes, in order to achieve effective and efficient flows of products and services, information, money and decisions, to provide maximum value to the customer. The previous research is inconsistent in its findings about the relationship between SCI and performance. We attribute this inconsistency to incomplete definitions of SCI, in particular, the tendency to focus on customer and supplier integration only, excluding the important central link of internal integration. We study the relationship between three dimensions of SCI, operational and business performance, from both a contingency and a configuration perspective. In applying the contingency approach, hierarchical regression was used to determine the impact of individual SCI dimensions (customer, supplier and internal integration) and their interactions on performance. In the configuration approach, cluster analysis was used to develop patterns of SCI, which were analyzed in terms of SCI strength and balance. Analysis of variance was used to examine the relationship between SCI pattern and performance. The findings of both the contingency and configuration approach indicated that SCI was related to both operational and business performance. Furthermore, the results indicated that internal and customer integration were more strongly related to improving performance than supplier integration. (C) 2009 Elsevier B.V. All rights reserved.'
The documents can now be used directly from this list; if you also want to save a .txt copy locally, see below.
# save as txt to a given path
file_path = 'D:/02-1python/2020.08.11-lda/data/2010-2011/china/2695.txt'
with open(file_path, 'a') as file_handle:  # the .txt file need not exist beforehand; open() creates it
    file_handle.write(str(doc_set[0:]))  # write the whole list as one string
    file_handle.write('\n')  # newline; needed when writing inside a loop so each record starts on its own line
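As the last comment says, when the write happens inside a loop each record needs its own newline. A minimal sketch that writes one document per line to the same file_path:

with open(file_path, 'a') as file_handle:
    for doc in doc_set:
        file_handle.write(doc)   # one document ...
        file_handle.write('\n')  # ... per line, so records do not run together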
Data preprocessing
import nltk
# sentence segmentation
from nltk.tokenize import sent_tokenize
# tokenization
from nltk.tokenize import word_tokenize
# stop-word removal
from nltk.corpus import stopwords
# lemmatization
from nltk.stem import WordNetLemmatizer
# stemming
from nltk.stem.porter import PorterStemmer
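These NLTK components depend on data packages that have to be downloaded once per machine; a one-time setup step, assuming a standard NLTK installation:

import nltk
nltk.download('punkt')      # models behind sent_tokenize / word_tokenize
nltk.download('stopwords')  # the stop-word lists
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer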
english_stopwords = stopwords.words("english")
# custom list of punctuation tokens to remove
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*', "''"]
texts = []
# process each document
lemmatizer = WordNetLemmatizer()  # instantiate once, outside the loop
stemmer = PorterStemmer()
english_stopwords2 = ['c', 'also', '2009', '2010', '2011', "'s"]  # extra stop words I chose to drop (e.g. years); adjust as needed
for doc in doc_set:
    # tokenize
    text_list = nltk.word_tokenize(doc)
    # remove standard English stop words
    text_list0 = [word for word in text_list if word not in english_stopwords]
    # remove the custom stop words
    text_list1 = [word for word in text_list0 if word not in english_stopwords2]
    # remove punctuation
    text_list2 = [word for word in text_list1 if word not in english_punctuations]
    # lemmatize
    text_list3 = [lemmatizer.lemmatize(word) for word in text_list2]
    # stem
    text_list4 = [stemmer.stem(word) for word in text_list3]
    # collect the fully processed result in texts
    texts.append(text_list4)
# count the documents (here, the number of papers)
M = len(texts)
print('Number of documents: %d' % M)
The LDA part
# build the document-term structures with gensim
import gensim
from gensim import corpora

# build a dictionary that maps every processed token to an integer id
dictionary = corpora.Dictionary(texts)
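Optionally, very rare and very common tokens can be pruned from the dictionary before building the matrix; the thresholds below are illustrative, not from the original post:

# keep tokens that appear in at least 5 documents and in at most 50% of all documents
dictionary.filter_extremes(no_below=5, no_above=0.5)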
# build the document-term matrix; this gives bag-of-words counts (TF-IDF could be applied on top, but is not used here; see the sketch below)
corpus = [dictionary.doc2bow(text) for text in texts]
print('\nDocument-term matrix:')
# pprint(corpus)
pprint(corpus[0:19])
# for c in corpus:
#     print(c)
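If TF-IDF weighting were wanted on top of the bag-of-words counts, gensim provides it directly; a minimal sketch, not used in the rest of this post:

from gensim.models import TfidfModel
tfidf = TfidfModel(corpus)    # fit IDF weights on the bag-of-words corpus
corpus_tfidf = tfidf[corpus]  # re-weighted corpus in the same (id, weight) format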
# convert to a dense document-term matrix
from gensim.matutils import corpus2dense
corpus_matrix = corpus2dense(corpus, len(dictionary))
corpus_matrix.T
The result looks something like [0 1 3 0 2 2; ...].
# create the LDA model class via gensim
Lda = gensim.models.ldamodel.LdaModel

# run and train the LDA model on the document-term matrix
num_topics = 10  # number of topics; adjustable
ldamodel = Lda(corpus, num_topics=num_topics, id2word=dictionary, passes=100)  # hyperparameters: number of topics, number of passes over the corpus
doc_topic = [doc_t for doc_t in ldamodel[corpus]]
print('Document-topic matrix:\n')
# pprint(doc_topic)
pprint(doc_topic[0:19])
# for doc_topic in ldamodel.get_document_topics(corpus):
#     print(doc_topic)
print('Topic-word distributions:\n')
for topic_id in range(num_topics):
    print('Topic', topic_id)
    pprint(ldamodel.show_topic(topic_id))
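The per-document output above is a sparse list of (topic_id, probability) pairs. To get a dense documents x topics matrix analogous to corpus_matrix, the same corpus2dense helper can be reused; note that topics below gensim's default minimum probability are omitted, so rows may not sum exactly to 1:

from gensim.matutils import corpus2dense
doc_topic_matrix = corpus2dense(ldamodel[corpus], num_topics).T  # transpose: rows = documents, columns = topics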
So what number of topics fits best? Coherence scores or perplexity can be used to decide; a model-selection sketch follows the coherence code below.
# coherence score
print('Coherence score:\n')
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
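To compare candidate topic numbers, the same coherence computation can be run in a loop, and perplexity is available via the model's log_perplexity method. A sketch over an illustrative range of values:

# try 5, 10, 15 and 20 topics; the range is illustrative
for k in range(5, 21, 5):
    model_k = Lda(corpus, num_topics=k, id2word=dictionary, passes=100)
    cm = gensim.models.CoherenceModel(model=model_k, texts=texts, dictionary=dictionary, coherence='c_v')
    print(k, 'topics -> coherence:', cm.get_coherence(), '| log perplexity:', model_k.log_perplexity(corpus))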