Hillary Email Topic Extraction ----- An Application of the LDA Model
阿新 • Published: 2018-11-11
Code example:
1. Import libraries and load the data file
import numpy as np
import pandas as pd
import re
from gensim import corpora,models,similarities
from nltk.corpus import stopwords
df = pd.read_csv('H:/HillaryEmails.csv')
df = df[['Id','ExtractedBodyText']].dropna()
2. Text preprocessing
''' Text preprocessing '''
def clean_email_text(text):
    text = text.replace('\n', ' ')                       # remove newlines
    text = re.sub("-", " ", text)                        # replace '-' with a space
    text = re.sub(r"\d+/\d+/\d+", " ", text)             # remove dates, not useful for topic modeling
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)   # remove times, not meaningful
    text = re.sub(r"[\w]+@[\.\w]+", "", text)            # remove email addresses, not meaningful
    text = re.sub(r"/[a-zA-Z]*[:\//\]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text)  # remove URLs, not meaningful
    pure_text = ''
    # In case other special characters (digits etc.) remain, loop over the text and filter them out,
    # keeping only letters and spaces
    for letter in text:
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Finally drop the single-character fragments left behind after removing special characters,
    # so that only meaningful words remain
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text

docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))
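As a quick sanity check (my addition, not part of the original post; the sample string below is made up for illustration), clean_email_text strips dates, times, email addresses and stray characters, and drops single-letter leftovers:

# Made-up sample text purely to illustrate what clean_email_text removes
sample = "Meeting on 3/14/2012 at 10:30 - email foo@bar.com for details."
print(clean_email_text(sample))
# should print something like: Meeting on at email for details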
3. Build the model
'''
Build the model with gensim:
1. Import the stop word list from nltk.corpus and tokenize
2. Build the corpus
'''
doclist = docs.values
# Remove stop words -- remember to do this before building the corpus
words = stopwords.words('english')
texts = [[word for word in doc.lower().split() if word not in words] for doc in doclist]
# Build the corpus, here using the bag-of-words representation
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# print(corpus[0])
# [(0, 3), (1, 2), (2, 1), (3, 2), (4, 1), (5, 2), (6, 2), (7, 2), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1)]
# (0, 3) means word id 0 appears three times, and so on
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# lda.print_topic(10, topn=5)                            # the most frequent words in one topic
# print(lda.print_topics(num_topics=20, num_words=5))    # all topics and their most frequent words
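If you want the topic words as structured data rather than the formatted strings from print_topics, gensim's show_topics can return (word, probability) pairs. A minimal sketch (this snippet is my addition, not part of the original post):

# Inspect each learned topic as (word, probability) pairs instead of formatted strings
for topic_id, word_probs in lda.show_topics(num_topics=20, num_words=5, formatted=False):
    print(topic_id, [(word, round(prob, 3)) for word, prob in word_probs])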
4. Test
'''
Using the two methods
lda.get_document_topics(bow)
and
lda.get_term_topics(word_id)
we can classify a new document or word into one of the 20 topics.
Note, however, that the new text or word must go through exactly the same preprocessing and bag-of-words
conversion as the training data, i.e. it has to be turned into the numeric id representation of each word.
'''
text1= 'We have still have not shattered that highest and hardest glass ceiling. But some day, someone willTo Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership'
text1=clean_email_text(text1)
text1 = [word for word in text1.lower().split() if word not in words]
text1_bows = dictionary.doc2bow(text1)
print(lda.get_document_topics(text1_bows))
#[(0, 0.52221924), (2, 0.1793758), (9, 0.13047828), (15, 0.12792665)]
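The note above also mentions lda.get_term_topics(word_id), which the original code does not demonstrate. A minimal sketch (my addition; 'obama' is just an assumed example token, and the lookup is guarded because the word must already be in the training dictionary):

# Topic distribution for a single word; the word must exist in the training dictionary
word = 'obama'   # assumed example token, pick any word present in dictionary.token2id
if word in dictionary.token2id:
    word_id = dictionary.token2id[word]
    print(lda.get_term_topics(word_id, minimum_probability=1e-8))
    # returns [(topic_id, probability), ...] for topics where this word carries weight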
For an explanation of how LDA works, see: https://blog.csdn.net/v_july_v/article/details/41209515