Python 3: word segmentation with jieba, splitting lines and counting word frequencies
阿新 • Published: 2019-02-01
The syntax differences between Python 3 and Python 2 are probably the most painful part.
A dict has no guaranteed ordering anyway, so the easiest approach is to convert the dict into a list of (key, value) pairs and then sort that list.
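The dict-to-list-then-sort idea can be sketched in a few lines; the sample word counts below are made up for illustration:

```python
# Hypothetical word counts standing in for real segmentation output
counts = {"python": 3, "jieba": 5, "csv": 1}

# dict.items() already yields (key, value) pairs, so we can sort
# them directly by the count (second element of each pair)
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('jieba', 5), ('python', 3), ('csv', 1)]
```

Sorting `items()` directly avoids zipping `keys()` and `values()` by hand.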
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba
import csv

def dict2list(dic: dict):
    # Convert a dict into a list of (key, value) tuples
    keys = dic.keys()
    vals = dic.values()
    lst = [(key, val) for key, val in zip(keys, vals)]
    return lst

# 'U' mode was removed in Python 3.11; newline='' is the
# recommended way to open CSV files for the csv module
csv_reader = csv.reader(open('/Users/dear_jinx/Desktop/zz.csv', newline=''))
words = []
for row in csv_reader:
    # seg_list = jieba.cut_for_search(row[4])
    seg_list = jieba.cut(row[4])
    for x in seg_list:
        words.append(x)

# Count occurrences of each token by hand
word = {}
for i in words:
    if i not in word:
        word[i] = 1
    else:
        word[i] += 1

# Avoid shadowing the built-in name `list`
sorted_words = sorted(dict2list(word), key=lambda x: x[1], reverse=False)
for x in sorted_words:
    print(x)
# for item in word.items():
#     print(item)
# print("/".join(words))
The approach above is too clumsy, and the segmentation result is noisy: punctuation marks end up in the counts.
Here we only append tokens whose length is at least 2,
and use Counter to do the tallying.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba
import csv
from collections import Counter

# 'U' mode was removed in Python 3.11; newline='' is the
# recommended way to open CSV files for the csv module
csv_reader = csv.reader(open('/Users/dear_jinx/Desktop/zz.csv', newline=''))
words = []
for row in csv_reader:
    # seg_list = jieba.cut_for_search(row[4])
    seg_list = jieba.cut(row[4])
    for x in seg_list:
        if len(x) >= 2:  # skip single characters and punctuation
            words.append(x)

c = Counter(words).most_common(20)
print(c)
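The filtering and Counter steps can be tried without a CSV file or jieba installed; the token list below is a made-up stand-in for what jieba.cut() would yield:

```python
from collections import Counter

# Hypothetical token stream standing in for jieba.cut() output
tokens = ["分詞", "統計", "分詞", "的", "詞頻", "分詞", "詞頻"]

# Keep only tokens of length >= 2, as in the script above,
# which drops single characters and most punctuation
filtered = [t for t in tokens if len(t) >= 2]

c = Counter(filtered)
print(c.most_common(2))  # [('分詞', 3), ('詞頻', 2)]
```

Counter accepts any iterable, so the generator from jieba.cut() could also be fed to it directly without building the intermediate list.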