042 Example 10: Text Word Frequency Statistics

阿新 • Published: 2020-12-13

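I. Analyzing the "Text Word Frequency Statistics" Problem
- 1.1 Problem Analysis
II. Worked Example: "Hamlet English Word Frequency Statistics"
III. Worked Example: "Character Appearances in Romance of the Three Kingdoms" (Part 1)
IV. Worked Example: "Character Appearances in Romance of the Three Kingdoms" (Part 2)
- 4.1 Character Appearances in Romance of the Three Kingdoms
V. Extending "Text Word Frequency Statistics" to Other Cases
- 5.1 Extending the Application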

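I. Analyzing the "Text Word Frequency Statistics" Problem

1.1 Problem Analysis

Text word frequency statistics

Requirement: given an article, which words appear in it, and which appear most often?
How should we go about it?

English text --> Chinese text

English text: Hamlet, analyzing word frequency

If you would like the Hamlet text, add me on WeChat: chenyoudea

Chinese text: Romance of the Three Kingdoms, analyzing characters

If you would like the Romance of the Three Kingdoms text, add me on WeChat: chenyoudea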

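II. Worked Example: "Hamlet English Word Frequency Statistics"

Denoise and normalize the text
Use a dictionary to represent word frequencies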

# CalHamletV1.py


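def getText():
    # Read the whole file and lowercase it so "The" and "the" count as one word.
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so it does not stick to words.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")
    return txt


hamletTxt = getText()
words = hamletTxt.split()
# Map each word to its number of occurrences.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
# Sort the (word, count) pairs by count, largest first, and print the top 10.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))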

the         948
and         855
to          650
of          581
you         494
a           468
my          447
i           443
in          373
hamlet      361

The results are sorted from most to least frequent
Observe how many times each word appears
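
As a possible variation on CalHamletV1.py, the standard library's collections.Counter can handle both the counting and the sorting. The sketch below runs on a short inline sample rather than hamlet.txt; the sample string and the helper name top_words are assumptions made for illustration, not part of the original program. Note also that string.punctuation includes the apostrophe, which the original character list leaves out, so counts on the real file may differ slightly.

```python
from collections import Counter
import string


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Normalize: lowercase, then map every ASCII punctuation mark to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # Counter tallies the words; most_common orders them by count, descending.
    return Counter(words).most_common(n)


# Hypothetical sample text, used here instead of hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

The dictionary with counts.get(word, 0) + 1 remains a fine choice for learning; Counter mainly saves the manual sort.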

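III. Worked Example: "Character Appearances in Romance of the Three Kingdoms" (Part 1)

Segment the Chinese text into words
Use a dictionary to represent word frequencies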

# CalThreeKingdomsV1.py

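import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# jieba.lcut segments the Chinese text and returns a list of words.
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip one-character tokens (punctuation and single characters).
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
# Sort by count, largest first, and print the top 15.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))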

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.030 seconds.
Prefix dict has been built succesfully.


曹操          953
孔明          836
將軍          772
卻說          656
玄德          585
關公          510
丞相          491
二人          469
不可          440
荊州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
張飛          358

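IV. Worked Example: "Character Appearances in Romance of the Three Kingdoms" (Part 2)

4.1 Character Appearances in Romance of the Three Kingdoms

Tie the word frequencies to characters so that the program addresses the actual question

Word frequency statistics --> character statistics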

# CalThreeKingdomsV2.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words from the V1 output that are not character names.
excludes = {"將軍", "卻說", "荊州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge different names for the same character into one key.
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# Drop the excluded words; pop avoids a KeyError if a word never occurred.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

曹操         1451
孔明         1383
劉備         1252
關羽          784
張飛          358
商議          344
如何          338
主公          331
軍士          317
呂布          300

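Segment the Chinese text into words
Use a dictionary to represent word frequencies
Extend the program to solve the problem
Refine further based on the results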
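
The elif chain in CalThreeKingdomsV2.py grows with every alias that gets added while refining the results. One possible alternative is to keep the aliases in a dictionary, sketched below on a hand-written token list so that it runs without jieba or threekingdoms.txt. The names ALIASES, EXCLUDES, and count_characters and the sample tokens are assumptions for illustration; the mappings and the excluded words are copied from CalThreeKingdomsV2.py.

```python
# Alias -> canonical name, copied from the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names.
EXCLUDES = {"將軍", "卻說", "荊州", "二人", "不可", "不能", "如此"}


def count_characters(words):
    """Count tokens, merging aliases and skipping excluded or one-character words."""
    counts = {}
    for word in words:
        if len(word) == 1 or word in EXCLUDES:
            continue
        name = ALIASES.get(word, word)  # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical tokens standing in for the output of jieba.lcut(txt).
tokens = ["卻說", "玄德", "與", "關公", "張飛", "見", "丞相", "玄德曰", "曹操"]
counts = count_characters(tokens)
for name, count in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print(name, count)
```

With this layout, each new round of refinement only adds an entry to ALIASES or EXCLUDES and leaves the loop unchanged.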

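Announcing the top 20 characters in Romance of the Three Kingdoms, ranked by number of appearances: Cao Cao (曹操), Kongming (孔明), Liu Bei (劉備), Guan Yu (關羽), Zhang Fei (張飛), Lü Bu (呂布), Zhao Yun (趙雲), Sun Quan (孫權), Sima Yi (司馬懿), Zhou Yu (周瑜), Yuan Shao (袁紹), Ma Chao (馬超), Wei Yan (魏延), Huang Zhong (黃忠), Jiang Wei (姜維), Ma Dai (馬岱), Pang De (龐德), Meng Huo (孟獲), Liu Biao (劉表), Xiahou Dun (夏侯惇)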

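V. Extending "Text Word Frequency Statistics" to Other Cases

5.1 Extending the Application

Dream of the Red Chamber, Journey to the West, Water Margin, …
Government work reports, research papers, news reports, …
And beyond that? Word clouds are still to come…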
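
To reuse the same counting logic on other texts, one option is to pass the tokenizer in as a parameter. This is only a sketch: word_frequencies and its parameters are hypothetical names, and the sample sentence stands in for a real document. For Chinese text the caller could pass jieba.lcut, as the Three Kingdoms programs do.

```python
def word_frequencies(text, tokenize=str.split, min_len=1, top_n=10):
    """Return the top_n (word, count) pairs for text, most frequent first."""
    counts = {}
    for word in tokenize(text):
        if len(word) < min_len:
            continue  # e.g. min_len=2 drops one-character tokens
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:top_n]


# Hypothetical sample; for Chinese text, pass tokenize=jieba.lcut and min_len=2.
report = "data model data test model data"
print(word_frequencies(report, top_n=2))
```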
