
Word segmentation and frequency statistics for Dream of the Red Chamber (紅樓夢) with the jieba library

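import jieba

# Frequent words that are not the names we want to rank; they are removed from the counts.
excludes = {"什麼","一個","我們","那裡","你們","如今","說道","知道","起來","姑娘","這裡","出來","他們","眾人","自己",
            "一面","只見","怎麼","兩個","沒有","不是","不知","這個","聽見","這樣","進來","咱們","告訴","就是",
            "東西","襲人","回來","只是","大家","只得","老爺","丫頭","這些","不敢","出去","所以","不過","的話","不好",
            "姐姐","探春","鴛鴦","一時","不能","過來","心裡","如此","今日","銀子","幾個","答應","二人","還有","只管",
            "這麼","說話","一回","那邊","這話","外頭","打發","自然","今兒","罷了","屋裡","那些","聽說","小丫頭","不用","如何"}

# Read the whole novel; the context manager closes the file afterwards.
with open("E:/下載/紅樓夢.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the full text of the novel into a list of words
counts = {}  # empty dict mapping word -> number of occurrences
for word in words:
    if len(word) == 1:
        # Single-character tokens are mostly particles or punctuation, so skip them.
        continue
    # Create the key with a count of 1 if it is new, otherwise add one to its count.
    counts[word] = counts.get(word, 0) + 1

for word in excludes:
    # Remove the noise words. pop() with a default avoids the KeyError that
    # del would raise when a listed word never appears in the segmented text.
    counts.pop(word, None)

items = list(counts.items())  # convert the dict into a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # reverse=True sorts in descending order
for word, count in items[:20]:  # slicing is safe even if fewer than 20 words remain
    print("{0:<10}{1:>5}".format(word, count))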
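The counting, filtering, and sorting steps above can also be expressed with `collections.Counter` from the standard library. The following is only a minimal sketch, not the original author's code: the helper name `top_words` and the tiny hand-written `sample` list are hypothetical, and `sample` merely stands in for the list that `jieba.lcut(txt)` would return, so the snippet runs without jieba or the novel file.

```python
from collections import Counter


def top_words(tokens, excludes=frozenset(), n=20):
    """Return the n most frequent multi-character tokens not in excludes.

    `tokens` is assumed to be an already segmented list of words,
    for example the result of jieba.lcut(txt).
    """
    counts = Counter(
        tok for tok in tokens
        if len(tok) > 1 and tok not in excludes  # skip single characters and noise words
    )
    # most_common() returns (word, count) pairs sorted by count, descending.
    return counts.most_common(n)


# Hypothetical hand-segmented sample standing in for real jieba output.
sample = ["寶玉", "說道", "黛玉", "了", "寶玉", "笑", "黛玉", "寶玉", "說道"]
for word, count in top_words(sample, excludes={"說道"}, n=2):
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering the excluded words while counting, rather than deleting them afterwards, means a word in the exclusion set that never occurs in the text cannot cause an error. To use this on the real text, pass `jieba.lcut(txt)` and the `excludes` set from the script above.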