Topic Pattern Recognition Based on Text Patterns

阿新 • Published: 2019-01-05

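The previous few posts introduced several classifiers. Classification relies on known categories, much like supervised learning in other settings. Sometimes, however, we do not know the topic categories in advance, which corresponds to unsupervised learning. If we can first identify the topics with machine learning and then add manual labels, we can build a powerful and practical topic library.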

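Next, let's look at how to implement this. The basic steps are as follows:

(1) Load the data: the input text to be classified, along with the stop words, the stemmer, and the tokenizer.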

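def load_data(input_file):
    # Read the input file and return its lines as a list of strings
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline, so a final line without one stays intact
            data.append(line.rstrip('\n'))
    return data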

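(2) Preprocess the data:

① Tokenize the lowercased text with a regular expression:
from nltk.tokenize import RegexpTokenizer
tokens = RegexpTokenizer(r'\w+').tokenize(input_text.lower())
② Get the English stop words (requires a one-time nltk.download('stopwords')):
from nltk.corpus import stopwords
stop_words_english = stopwords.words('english')
③ Using the list from step ②, remove the stop words:
tokens_stopwords = [x for x in tokens if x not in stop_words_english]
④ Define a stemmer:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
⑤ Apply the stemmer to each remaining token:
tokens_stemmed = [stemmer.stem(x) for x in tokens_stopwords]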
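The five sub-steps above can be collected into a single function. The following is a minimal sketch rather than the original author's code: the names `preprocess` and `DEMO_STOP_WORDS` and the `stem` parameter are hypothetical, the tiny stop word set stands in for `stopwords.words('english')`, and `re.findall` replaces `RegexpTokenizer` so that the sketch runs without NLTK installed.

```python
import re

# Tiny illustrative stop word set (an assumption for this sketch);
# the article uses NLTK's full list from stopwords.words('english').
DEMO_STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "he", "his", "by"}


def preprocess(input_text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word):
    """Tokenize, drop stop words, then stem each remaining token."""
    # Lowercase and keep runs of word characters, like RegexpTokenizer(r'\w+').
    tokens = re.findall(r"\w+", input_text.lower())
    # Remove stop words.
    tokens_stopwords = [x for x in tokens if x not in stop_words]
    # Apply the stemmer; the default leaves each token unchanged.
    return [stem(x) for x in tokens_stopwords]


tokens = preprocess("Those codes are generated by a specialized machine.")
print(tokens)
```

Passing `stem=SnowballStemmer('english').stem` and `stop_words=stopwords.words('english')` should reproduce the pipeline described in the sub-steps. Applying the function to every line of the loaded data would then give the list of token lists that the later steps refer to as `processed_tokens`.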
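(3) Build a dictionary from the preprocessed documents (processed_tokens is the list of stemmed token lists, one per input line):
from gensim import corpora, models
dict_tokens = corpora.Dictionary(processed_tokens)
(4) Build the document-term matrix that the learning algorithm works on:
corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]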
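To make the bag-of-words representation concrete, here is a minimal standard-library sketch of the idea behind a token dictionary and `doc2bow`. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical names, and the ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["need", "order", "need"], ["train", "need"]]
vocab = build_vocab(docs)
corpus_sketch = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(corpus_sketch)
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only word frequencies are passed on to the topic model.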
(5) Build the topic model with LDA, setting its parameters:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dict_tokens, passes=25)
(6) Once the topics have been identified, print the weighted words that characterize each one:
item = ldamodel.print_topics(num_topics=num_topics, num_words=num_words)
print(item)  # item holds the weighted words for the two topics
Topic 0 ==> 0.063*"need" + 0.062*"order" + 0.037*"encrypt" + 0.037*"modern"

Topic 1 ==> 0.052*"need" + 0.031*"train" + 0.031*"develop" + 0.031*"younger"
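Each topic is reported as a string of weighted words. As a small sketch, the hypothetical helper `parse_topic` below turns such a string into (word, weight) pairs so that the weights can be sorted or filtered in code. The input is the Topic 0 string shown above.

```python
import re


def parse_topic(topic_string):
    """Split a topic string such as 0.063*"need" + ... into (word, weight) pairs."""
    # Each term has the form <weight>*"<word>"; spacing around '+' is ignored.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic_0 = '0.063*"need" + 0.062*"order" + 0.037*"encrypt" + 0.037*"modern"'
terms = parse_topic(topic_0)
for word, weight in terms:
    # Show each weight as a percentage contribution to the topic.
    print(f"{word}: {weight:.1%}")
```

The words are stems rather than full words (for example "encrypt"), because stemming was applied before the model was trained.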

The test data in this example is:

data = ['He spent a lot of time studying cryptography. ',
        'You need to have a very good understanding of modern encryption systems in order to work there.',
        "If their team doesn't win this match, they will be out of the competition.",
        'Those codes are generated by a specialized machine. ',
        'The club needs to develop a policy of training and promoting younger talent. ',
        'His movement off the ball is really great. ',
        'In order to evade the defenders, he needs to move swiftly.',
        'We need to make sure only the authorized parties can read the message.']
