Text Word-Frequency Statistics
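This example combines arrays, dictionaries, lists, and jieba (a third-party library). We will rank the ten characters who appear most often in Romance of the Three Kingdoms and count how many times each one appears.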
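Program 1:
# CalThreeKingdomsV1.py
import jieba

# Read the whole novel into a single string
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text with jieba; returns a list of words
counts = {}  # dictionary mapping each word to its number of occurrences
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())  # convert to a list of (word, count) tuples
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first
for word, count in items[:15]:  # print the 15 most frequent words
    print("{0:<10}{1:>5}".format(word, count))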
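Output:
As the output shows, this program only meets the basic requirement and falls short of the ideal result: the list contains words that are not character names. We therefore need to adjust it. The modified program is as follows:
Program 2: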
import jieba

# Frequent words that are not character names
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此","如何","商議","軍士","左右","引兵","軍馬","次日","大喜","天下","東吳","於是","今日","不散","魏兵","不敢","陛下","一人"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    elif word == "都督":
        rword = "周瑜"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1  # count the word under its unified name
for word in excludes:
    counts.pop(word, None)  # remove non-name words; no KeyError if a word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:  # print the top ten
    print("{0:<10}{1:>5}".format(word, count))
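Note that the entries in excludes were added step by step over repeated runs, until the output met our requirement. (In this example that means the top ten; adjust it to suit your own problem.)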
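Because the exclusion list and the name merging both grow as you inspect the output, it can help to keep them as data rather than as an elif chain. The following is only a minimal sketch of that idea, not the method used in Program 2: the names ALIASES, EXCLUDES, and count_names are hypothetical, the exclusion set is shortened, and the token list is hard-coded so the sketch runs without jieba (in practice the tokens would come from jieba.lcut).

```python
from collections import Counter

# Hypothetical alias table: each alternative name or segment maps to one unified name.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
    "都督": "周瑜",
}
# Shortened exclusion set for illustration; extend it while inspecting the output.
EXCLUDES = {"將軍", "卻說", "荊州"}

def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens, merging aliases and skipping excluded or one-character tokens."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue
        name = aliases.get(token, token)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] += 1
    return counts

# Assumed sample tokens standing in for the result of jieba.lcut(txt)
sample = ["孔明", "諸葛亮", "將軍", "玄德", "劉備", "曰", "孔明曰"]
result = count_names(sample)
print(result.most_common(2))
```

With this layout, adding a new non-name word or a new alias is a one-line change to the data, and the counting loop stays untouched.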
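This example covers word-frequency statistics for Romance of the Three Kingdoms, but the same approach works for other texts, such as a favorite literary work or novel. Either way, it is a fun thing to try!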
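To reuse the approach on another text, the counting and printing steps can be wrapped in small functions. This is a rough sketch under stated assumptions: top_words and print_table are hypothetical helper names, and the tokens here come from a plain English sentence split on whitespace so that the sketch has no dependencies. For Chinese text the token list would instead come from jieba.lcut.

```python
from collections import Counter

def top_words(tokens, n=10, min_len=2, excludes=()):
    """Return the n most frequent tokens that have at least min_len characters."""
    skip = set(excludes)
    counts = Counter(t for t in tokens if len(t) >= min_len and t not in skip)
    return counts.most_common(n)

def print_table(rows):
    """Print (word, count) pairs in the same column layout as the programs above."""
    for word, count in rows:
        print("{0:<10}{1:>5}".format(word, count))

# Assumed sample input; replace with jieba.lcut(text) for a Chinese novel
tokens = "the cat and the dog and the cat".split()
rows = top_words(tokens, n=2, excludes={"and"})
print_table(rows)
```

The parameters n, min_len, and excludes correspond to the values that were hard-coded in the two programs, so each can be tuned per text.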
Copyright notice:
All text and images in this article are the property of the author. Please credit the source when reposting.