Data Processing: Using jieba to Segment a Dataset and Count Word Frequencies
1. Count the frequencies of the words that appear in a txt file, then pick out the most frequent ones.
2. Code:
import re
from collections import Counter

import jieba


def cut_word(datapath):
    # Read the whole file, then strip whitespace, ASCII punctuation,
    # full-width Chinese punctuation, and digits before segmenting.
    with open(datapath, 'r', encoding='utf-8') as fp:
        string = fp.read()
    data = re.sub(
        r"[\s+\.\!\/_,$%^*(【】:\]\[\-:;+\"\']+|[+——!,。?、~@#¥%……&*()]+|[0-9]+",
        "", string)
    word_list = jieba.cut(data)  # lazy generator of tokens
    return word_list


def static_top_word(word_list, top=5):
    # Count token frequencies and sort by count, descending.
    result = Counter(word_list)
    sortlist = sorted(result.items(), key=lambda x: x[1], reverse=True)
    # Slicing avoids an IndexError when there are fewer than `top` tokens.
    return sortlist[:top]


def main():
    datapath = 'comment.txt'
    word_list = cut_word(datapath)
    result = static_top_word(word_list)
    print(result)


if __name__ == '__main__':
    main()
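Incidentally, collections.Counter already provides most_common(), which makes the manual sort-and-slice above unnecessary. A minimal sketch of the same top-N idea (the sample sentence is made up for illustration):

import jieba
from collections import Counter

# Hypothetical sample text; any Chinese string works here.
text = "這部電影的劇情很好,演員的演技也很好"
# Drop single-character tokens such as 的 and keep the rest.
tokens = [t for t in jieba.cut(text) if len(t) > 1]
# most_common(n) returns the n (token, count) pairs with the highest counts.
print(Counter(tokens).most_common(3))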
3. Filter special symbols with a regular expression: re.sub() replaces every matched character with the empty string, so punctuation and digits are stripped out before segmentation.
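For illustration, a short, self-contained demo of that replacement step (the input string is made up; the pattern is the one used above):

import re

# Hypothetical input mixing Chinese text, digits, and both ASCII
# and full-width punctuation.
raw = "今天天氣真好!!!評分:9.5分,推薦~~~"
pattern = r"[\s+\.\!\/_,$%^*(【】:\]\[\-:;+\"\']+|[+——!,。?、~@#¥%……&*()]+|[0-9]+"
# Every character matched by the pattern is replaced with the empty string.
cleaned = re.sub(pattern, "", raw)
print(cleaned)  # 今天天氣真好評分分推薦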