Encoding Problems in Python Text Word-Frequency Counting (MOOC, 嵩天)
阿新 • Published: 2018-11-05
1 Python code for text word-frequency counting
1.1 Hamlet word-frequency count (includes the original Hamlet text)
#CalHamletV1.py
def getText():
    # No encoding is given here, so open() uses the platform default
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
1.2 Character appearance count in Romance of the Three Kingdoms (《三國演義》), part 1 (includes the original text)
#CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)  # segment the Chinese text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
1.3 Character appearance count in Romance of the Three Kingdoms (《三國演義》), part 2 (includes the original text)
#CalThreeKingdomsV2.py
import jieba

excludes = {"將軍","卻說","荊州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # Merge different names for the same character into one key
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # drop non-name words; no error if a word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
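1.4 Text files for word-frequency counting
This resource holds the text files used in this post. The package contains the Chinese TXT of Romance of the Three Kingdoms and the English TXT of Hamlet.
Resource link: text files for word-frequency counting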
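2 Encoding problems in text word-frequency counting
2.1 The code lines that set the text encoding
Put the text files and the code in the same folder and run the code above. Errors like the following appear: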
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 14: illegal multibyte sequence
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3: invalid start byte
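As the messages indicate, the problem is the encoding of the TXT files, so the fix is to change the lines that set the encoding used to read them. The Hamlet line passes no encoding at all, so open() falls back to the platform default, which would explain the 'gbk' message on a Chinese-locale Windows system. The relevant lines in the Hamlet and ThreeKingdom scripts are: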
Hamlet
txt = open("hamlet.txt", "r").read()
ThreeKingdomV1
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
ThreeKingdomV2
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
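The mismatch behind these errors can be reproduced in isolation. The following is a minimal sketch, not part of the course code: the helper `try_read` and the temporary sample file are hypothetical names introduced only for illustration. It prints the encoding that `open()` uses when none is given, then reads a GBK-encoded file once with the wrong codec and once with the right one.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when no encoding= argument is given.
# On a Simplified Chinese Windows install this is typically a GBK code page.
print("default encoding:", locale.getpreferredencoding(False))


def try_read(path, encoding):
    """Return the decoded text, or the UnicodeDecodeError message on failure."""
    try:
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: {}".format(err)


# Hypothetical sample file: two Chinese characters stored as GBK bytes.
sample = "中文"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("gbk"))

wrong = try_read(demo_path, "utf-8")  # codec does not match the file
right = try_read(demo_path, "gbk")    # codec matches the file
os.remove(demo_path)

print(wrong)                                   # a UnicodeDecodeError message
print("read with gbk ok:", right == sample)
```

The same file decodes cleanly or fails depending only on the `encoding` argument, which is why the fix is a one-line change in each script.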
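2.2 Check the TXT encoding and change the code
To check a TXT file's encoding, open it and choose Save As; the dialog shows the file's current encoding, and the encoding can also be changed there. The key point is that the encoding of the text file and the encoding the code reads it with must be the same.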
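The check can also be done from Python by trial decoding. The sketch below is one possible approach under stated assumptions, not the method used in the course: `guess_encoding` and `read_text` are hypothetical helpers, and the candidate list (`utf-8-sig`, then `gbk`) is an assumption that fits the two encodings seen in the error messages. `utf-8-sig` reads UTF-8 with or without a byte order mark.

```python
import os
import tempfile


def guess_encoding(path, candidates=("utf-8-sig", "gbk")):
    """Return the first candidate codec that decodes the whole file, else None.

    This is trial decoding, not true detection: a file can decode without
    error under a codec it was not written in, so stricter codecs go first.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def read_text(path, candidates=("utf-8-sig", "gbk")):
    """Read a text file using the first candidate encoding that works."""
    encoding = guess_encoding(path, candidates)
    if encoding is None:
        raise ValueError("none of {} can decode {}".format(candidates, path))
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo on two hypothetical temporary files holding the same text.
results = {}
texts = {}
for codec in ("utf-8", "gbk"):
    fd, demo_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write("中文".encode(codec))
    results[codec] = guess_encoding(demo_path)
    texts[codec] = read_text(demo_path)
    os.remove(demo_path)

print(results)
```

Trying UTF-8 first matters because GBK accepts many byte sequences, so a UTF-8 file may decode under GBK without an error and come out as garbled text.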
The TXT files I uploaded are all UTF-8 encoded, so the code only needs to be changed to the following lines to run successfully.
Hamlet
txt = open("hamlet.txt", "r", encoding='utf-8').read()
ThreeKingdomV1
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
ThreeKingdomV2
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
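For reference, the corrected Hamlet count can be written more compactly. This is a sketch of an alternative, not the course's code: the function `top_words` is a hypothetical name, and it swaps in the standard library's `string.punctuation` and `collections.Counter` for the hand-written punctuation string and dictionary. Note that `string.punctuation` also contains the apostrophe, which the original string does not, so counts for contractions may differ.

```python
import os
import string
import tempfile
from collections import Counter


def top_words(path, encoding="utf-8", n=10):
    """Return the n most frequent words in a text file as (word, count) pairs."""
    # The encoding is always passed explicitly, so the result does not
    # depend on the platform default.
    with open(path, "r", encoding=encoding) as f:
        txt = f.read().lower()
    for ch in string.punctuation:
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return Counter(txt.split()).most_common(n)


# Demo on a hypothetical one-line file instead of the full hamlet.txt.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("To be, or not to be: that is the question.")
top_two = top_words(demo_path, encoding="utf-8", n=2)
os.remove(demo_path)

for word, count in top_two:
    print("{0:<10}{1:>5}".format(word, count))
```

With the real file, the call would be `top_words("hamlet.txt", encoding="utf-8")`, assuming the file is saved as UTF-8 as described above.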