Python [Minimal] Chinese LDA Model

阿新 • Published: 2018-12-15

Complete code

from gensim import corpora, models
import jieba.posseg as jp

# Texts to analyse
text1 = '美國教練坦言，沒輸給中國女排，是輸給了郎平'
text2 = '中國女排世界排名第一？真實水平如何，聽聽巴西和美國主教練的評價'
text3 = '為什麼越來越多的人買MPV，而放棄SUV？跑一趟長途就知道了'
text4 = '跑了長途才知道，SUV和轎車之間的差距'
texts = [text1, text2, text3, text4]

# Filter conditions
flags = ('n', 'nr', 'ns', 'nt', 'eng', 'v', 'd')  # part-of-speech tags to keep
stopwords = ('沒', '就', '知道', '是', '才', '聽聽', '坦言')  # stop words to drop

# Word segmentation: keep a word only if its tag is in flags and it is not a stop word
words_ls = []
for text in texts:
    words = [word.word for word in jp.cut(text) if word.flag in flags and word.word not in stopwords]
    words_ls.append(words)

# Build the dictionary (token -> integer id)
dictionary = corpora.Dictionary(words_ls)

# Using the dictionary, turn each word list into a sparse bag-of-words vector
# and collect the vectors in a list to form the corpus
corpus = [dictionary.doc2bow(words) for words in words_ls]

# LDA model; num_topics sets the number of topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

# Print all topics, showing 3 words per topic
for topic in lda.print_topics(num_words=3):
    print(topic)

Result:
Topic 1 keywords (cars): 0.077*"長途" + 0.076*"SUV" + 0.070*"跑"
Topic 2 keywords (sports): 0.072*"中國女排" + 0.068*"輸給" + 0.066*"美國"

Step-by-step walkthrough

words_ls

[['美國', '輸給', '中國女排', '輸給', '郎平'], ['中國女排', '真實', '水平', '巴西', '美國', '主教練', '評價'], ['越來越', '人', '買', 'MPV', '放棄', 'SUV', '跑', '長途'], ['跑', '長途', 'SUV', '轎車', '差距']]

dictionary

Dictionary(19 unique tokens: ['真實', '水平', '郎平', 'MPV', 'SUV', '美國', '巴西', '中國女排', '主教練', '評價', '放棄', '輸給', '越來越', '跑', '長途', '差距', '轎車', '買', '人'])

The dictionary.doc2bow function

['美國', '輸給', '中國女排', '輸給', '郎平']: ↓↓↓ (美國→0, 輸給→2, 中國女排→1, 郎平→3)
[0, 2, 1, 2, 3]: ↓↓↓ (id 2 occurs twice and every other id once, hence (2, 2))
[(0, 1), (1, 1), (2, 2), (3, 1)]: …
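The counting step above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not gensim's implementation: token2id is hard-coded to the mapping quoted above, and to_bow is a hypothetical helper name. Each pair in the result is (token id, number of occurrences in the document).

```python
from collections import Counter

# Token -> id mapping as quoted in the walkthrough above (assumed, hard-coded)
token2id = {'美國': 0, '中國女排': 1, '輸給': 2, '郎平': 3}


def to_bow(words, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in words."""
    counts = Counter(token2id[w] for w in words if w in token2id)
    return sorted(counts.items())


doc = ['美國', '輸給', '中國女排', '輸給', '郎平']

# Step 1: replace every word by its id
ids = [token2id[w] for w in doc]
print(ids)                    # [0, 2, 1, 2, 3]

# Step 2: count how often each id occurs
print(to_bow(doc, token2id))  # [(0, 1), (1, 1), (2, 2), (3, 1)]
```

Words missing from the mapping are skipped here, much as doc2bow ignores tokens that are not in the dictionary unless allow_update=True is passed.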

corpus

[[(0, 1), (1, 1), (2, 2), (3, 1)], [(0, 1), (1, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1)], [(10, 1), (15, 1), (16, 1), (17, 1), (18, 1)]]

lda

LdaModel(num_terms=19, num_topics=2, decay=0.5, chunksize=2000)
