Chinese and English Word Frequency Count
阿新 • Published: 2018-09-29
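Steps:
1. Prepare a UTF-8 encoded text file (file)
2. Read the file into a string (str)
3. Preprocess the text
4. Split the text and extract the words (list)
5. Count the words in a dictionary (set, dict)
6. Sort by frequency with list.sort(key=)
7. Exclude grammatical words that carry no meaning of their own, such as pronouns, articles, and conjunctions
8. Output the TOP(20)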
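The steps above can be sketched as a single function. This is a minimal sketch rather than the post's own code: the name `top_words` and the `STOP_WORDS` set (with its contents) are assumptions for illustration, and `collections.Counter` stands in for a hand-written counting dictionary.

```python
import string
from collections import Counter

# Hypothetical stop-word list for step 7; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "it", "he", "she", "they"}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words in text as (word, count) pairs."""
    # Step 3: lowercase, then map every ASCII punctuation mark to a space
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Step 4: split on whitespace
    words = text.split()
    # Steps 5 and 7: count the words, skipping stop words
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common returns pairs sorted by count, descending
    return counts.most_common(n)

sample = "The cat and the dog. The cat sat; the dog ran, and the cat slept."
print(top_words(sample, n=2))
```

Passing the text in as a string keeps the function independent of where the file comes from; reading the file (steps 1 and 2) stays with the caller.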
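English word frequency count

    import string

    # Read the whole file; the text is expected to be UTF-8 encoded
    with open('English.txt', 'r', encoding='utf-8') as fb:
        content = fb.read()

    # Clean the data
    content = content.lower()            # normalize to lowercase
    for ch in string.punctuation:        # replace every punctuation mark with a space
        content = content.replace(ch, ' ')
    wordList = content.split()           # split on whitespace into words

    # Count the occurrences of each word
    data = {}
    for word in wordList:
        data[word] = data.get(word, 0) + 1

    # Sort
    hist = []
    for key, value in data.items():
        hist.append([value, key])
    hist.sort(reverse=True)              # descending

    # Top 20 (slicing avoids an IndexError when there are fewer than 20 distinct words)
    for item in hist[:20]:
        print(item)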
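One detail of the sort above: reversing a list of `[count, word]` pairs also reverses the order of words with equal counts, so ties come out in reverse alphabetical order. Below is a small sketch of the `list.sort(key=)` form named in step 6, which keeps ties alphabetical. The function name `sort_by_frequency` and the sample dictionary are assumptions for illustration.

```python
def sort_by_frequency(counts):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    pairs = list(counts.items())
    # Negating the count sorts counts descending while words stay ascending
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs

data = {"pear": 2, "apple": 2, "fig": 5}
for word, count in sort_by_frequency(data)[:20]:
    print(word, count)
```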
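Chinese word frequency count

    # Read the whole file; the text is expected to be UTF-8 encoded
    with open('Chinese.txt', 'r', encoding='utf-8') as fb:
        content = fb.read()

    # Clean the data: replace each punctuation mark with a space
    bd = ',。?!;:‘’“”【】'
    for ch in bd:
        content = content.replace(ch, ' ')

    # Build the frequency dictionary (per character, since no word segmentation is done)
    wordDict = {}
    for word in content:
        if word.isspace():               # skip spaces and line breaks
            continue
        wordDict[word] = wordDict.get(word, 0) + 1
    wordList = list(wordDict.items())

    # Sort by count, descending
    wordList.sort(key=lambda x: x[1], reverse=True)

    # TOP 20 (slicing avoids an IndexError when there are fewer than 20 distinct characters)
    for item in wordList[:20]:
        print(item)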
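Listing punctuation marks by hand misses anything that is not in the list, such as digits, Latin letters, or other symbols. One hedged alternative is to keep only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and count them in a single pass with `collections.Counter`. Like the code above, this counts single characters rather than segmented words. The names `CJK_CHAR` and `top_chars` are assumptions for illustration.

```python
import re
from collections import Counter

# Matches one character in the basic CJK Unified Ideographs block
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def top_chars(text, n=20):
    """Return the n most frequent Chinese characters in text as (char, count) pairs."""
    # findall yields every matching character; everything else is ignored
    return Counter(CJK_CHAR.findall(text)).most_common(n)

print(top_chars("天天向上,好好學習!abc 123", n=3))
```

Counting in one pass also avoids calling `str.count` once per character, which rescans the whole text each time.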