Identifying Topic Patterns in Text
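The previous posts introduced several classifiers. Classification corresponds to supervised learning: the categories are known in advance. Sometimes, though, we do not know the topic categories, and the task becomes an unsupervised one. If we can first identify topics with machine learning and then label them by hand, we can build a powerful, practical topic library.
Below we look at how to implement this. The basic steps are: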
(1) Load the data. This covers the input documents to be grouped, plus the tools used later for stop words, stemming, and tokenization.
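def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline; line[:-1] would cut a real
            # character from a last line that has no newline.
            data.append(line.rstrip('\n'))
    return data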
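As a quick check of the loader, the sketch below writes two sentences to a temporary file and reads them back. The sample text and the use of `tempfile` are assumptions for illustration; the post does not show its input file. The function is repeated so the block runs on its own.

```python
import os
import tempfile

def load_data(input_file):
    """Return the lines of input_file without their trailing newlines."""
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            data.append(line.rstrip('\n'))
    return data

# Hypothetical sample input: one document per line.
sample = (
    "He spent a lot of time studying cryptography.\n"
    "His movement off the ball is really great.\n"
)

# Write the sample to a temporary file, then close it before reading.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    docs = load_data(path)
finally:
    os.remove(path)  # clean up the temporary file

print(len(docs))
print(docs[0])
```

Each list element is one document, which is the shape the later steps expect.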
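(2) Preprocess the data:
① Tokenize the lowercased text with a regular expression:
from nltk.tokenize import RegexpTokenizer
tokens = RegexpTokenizer(r'\w+').tokenize(input_text.lower())
② Get the English stop word list (this needs a one-time nltk.download('stopwords')):
from nltk.corpus import stopwords
stop_words_english = stopwords.words('english')
③ Remove the stop words, using the list from step ②:
tokens_stopwords = [x for x in tokens if x not in stop_words_english]
④ Define a stemmer:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
⑤ Stem the remaining tokens:
tokens_stemmed = [stemmer.stem(x) for x in tokens_stopwords]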
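The sub-steps above can be wrapped in one function and applied to every document, which yields the `processed_tokens` list used in the following steps. This is a minimal sketch with standard-library stand-ins so it runs without NLTK: `re.findall` replaces `RegexpTokenizer` (for the pattern `\w+` the two should give the same tokens), `DEMO_STOP_WORDS` is a small hypothetical set rather than NLTK's English list, and the default `stem` does nothing. The function name `process` is also an assumption.

```python
import re

# Small hand-picked stop word set, a stand-in for NLTK's English list.
DEMO_STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'in', 'is', 'you', 'have', 'very'}

def process(input_text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word):
    """Tokenize, drop stop words, then stem each remaining token."""
    # \w+ keeps runs of letters, digits, and underscores; punctuation is dropped.
    tokens = re.findall(r'\w+', input_text.lower())
    tokens_stopwords = [x for x in tokens if x not in stop_words]
    return [stem(x) for x in tokens_stopwords]

data = [
    'You need to have a very good understanding of modern encryption systems.',
    'His movement off the ball is really great.',
]

# One token list per document.
processed_tokens = [process(text) for text in data]
print(processed_tokens[0])
```

With NLTK installed, passing `stop_words=stopwords.words('english')` and `stem=SnowballStemmer('english').stem` should reproduce the preprocessing described in the post.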
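(3) Build a dictionary from the preprocessed documents. Here processed_tokens is the list of stemmed token lists from step (2), one list per document:
from gensim import corpora
dict_tokens = corpora.Dictionary(processed_tokens)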
(4) Build the document-term matrix, the form the learning algorithm works on:
corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]
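To make the dictionary and the document-term matrix concrete, here is a pure-Python illustration of the bag-of-words idea: every distinct token gets an integer id, and each document becomes a sorted list of (token_id, count) pairs. This is not gensim's implementation. The helper names `build_vocab` and `to_bow` are hypothetical, and the ids gensim assigns may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocab(documents):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy input: token lists as they might look after preprocessing.
processed_tokens = [
    ['need', 'order', 'encrypt', 'need'],
    ['need', 'train', 'develop'],
]

token2id = build_vocab(processed_tokens)
corpus = [to_bow(doc, token2id) for doc in processed_tokens]

print(token2id)  # {'need': 0, 'order': 1, 'encrypt': 2, 'train': 3, 'develop': 4}
print(corpus)    # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

The pair (0, 2) in the first document records that "need" occurs twice there. Word order is discarded, which is the simplification a bag-of-words model makes.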
(5) Run LDA topic modeling with the chosen parameters:
from gensim import models
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dict_tokens, passes=25)
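(6) Once the topics are identified, print the weighted words that characterize each one:
item = ldamodel.print_topics(num_topics=num_topics, num_words=num_words)
print(item)  # item holds the word-weight description of each of the two topics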
Topic 0 ==> 0.063*"need" + 0.062*"order" + 0.037*"encrypt" + 0.037*"modern"
Topic 1 ==> 0.052*"need" + 0.031*"train" + 0.031*"develop" + 0.031*"younger"
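For the manual labeling stage, the topic strings are easier to handle as (word, weight) pairs. The sketch below parses strings in the format printed above. It assumes the topics arrive as (topic_id, string) tuples, which is what recent gensim versions return from `print_topics`; the `parse_topic` helper and its regular expression are illustrative, not part of gensim.

```python
import re

# Matches terms such as 0.063*"need" and captures the weight and the word.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return [(word, weight), ...] parsed from one topic string."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

# The two topics from the output above, in the assumed tuple shape.
topics = [
    (0, '0.063*"need" + 0.062*"order" + 0.037*"encrypt" + 0.037*"modern"'),
    (1, '0.052*"need" + 0.031*"train" + 0.031*"develop" + 0.031*"younger"'),
]

for topic_id, topic_string in topics:
    print(topic_id, parse_topic(topic_string))
```

Reading the top words this way, topic 0 (order, encrypt, modern) appears to gather the cryptography sentences and topic 1 (train, develop, younger) the sports ones. Assigning such names is the manual labeling step mentioned at the start.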
The test data used in this example is:
data = ['He spent a lot of time studying cryptography. ',
        'You need to have a very good understanding of modern encryption systems in order to work there.',
        "If their team doesn't win this match, they will be out of the competition.",
        'Those codes are generated by a specialized machine. ',
        'The club needs to develop a policy of training and promoting younger talent. ',
        'His movement off the ball is really great. ',
        'In order to evade the defenders, he needs to move swiftly.',
        'We need to make sure only the authorized parties can read the message.']