[Reference] Letter-frequency analysis of Shakespeare's English texts
阿新 • Published: 2021-10-13
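According to the statistical regularities that cryptography uses to infer plaintext, the letters in an English passage occur with fairly predictable frequencies. With that in mind, I ran a basic letter-frequency analysis over several of Shakespeare's works.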
- Script
import sys
import csv

# Expect exactly one argument: the text file to analyse
if len(sys.argv) != 2:
    sys.exit(-2)
filename = sys.argv[1]

# Characters to count: ASCII letters and digits
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"

# Read the text whose character frequencies are to be counted
with open(filename) as f:
    strings = f.read()
# Total number of characters in the file (not only those in alphabet);
# named total so that the built-in len() is not shadowed
total = len(strings)

result = {}
for ch in alphabet:
    result[ch] = strings.count(ch)
# Sort by count, most frequent first
res = sorted(result.items(), key=lambda item: item[1], reverse=True)

print("Statistic file " + filename + " ...")
print("Result sheet will be saved to " + filename + ".analysis.csv\n")
for data in res:
    print("Char '" + data[0] + "' appeared " + str(data[1])
          + " times with percentage " + str(100 * data[1] / total) + "%")

# Print every counted character in descending order of frequency
print('\nRESULT')
for i in res:
    print(i[0], end="")

with open(filename + ".analysis.csv", "w", encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # Write the column names first
    writer.writerow(["char", "count"])
    # writerows writes all remaining rows in one call
    writer.writerows(res)
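The script above counts upper- and lower-case letters separately and divides by the total number of characters in the file, including spaces and punctuation. For comparison with the letter-frequency tables used in cryptanalysis, one would usually fold case and normalise over letters only. The following is a minimal sketch of that variant, not part of the original script; the function name `letter_frequencies` is hypothetical, and it assumes Python 3.9 or later.

```python
from collections import Counter
import string


def letter_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (letter, percentage) pairs, most frequent first.

    Case is folded and only the ASCII letters a-z are counted, so the
    percentages sum to 100 over letters alone.
    """
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        # No letters at all: avoid dividing by zero
        return []
    counts = Counter(letters)
    return [(ch, 100 * n / total) for ch, n in counts.most_common()]


if __name__ == "__main__":
    sample = "To be, or not to be, that is the question"
    # Show the three most frequent letters of the sample line
    for ch, pct in letter_frequencies(sample)[:3]:
        print(f"{ch}: {pct:.2f}%")
```

`Counter.most_common()` already returns the entries sorted by descending count, so no separate `sorted` call is needed here.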
- Sample data (using hamlet_TXT_FolgerShakespeare.txt.analysis.csv as an example)
- csv file
  This analysis file contains the character-count results for the whole text, i.e. the corresponding work under the /text folder.
- .txt.output.txt file
  This file has the following format:
  Statistic file hamlet_TXT_FolgerShakespeare.txt ... # character frequencies are counted from hamlet_TXT_FolgerShakespeare.txt
  Result sheet will be saved to hamlet_TXT_FolgerShakespeare.txt.analysis.csv # the frequency table is saved to hamlet_TXT_FolgerShakespeare.txt.analysis.csv
  Char 'e' appeared 14843 times (8.396026834704106%) # character 'e' appears 14843 times, a frequency of 8.396026834704106%
  Char 't' appeared 10981 times (6.211464708743905%) #...
  Char '8' appeared 0 times (0.0%)
  Sorting result: # characters sorted by number of occurrences, most frequent first
  etoasnhirldumywfcgpbTAIvEkHLONRMSWGPUBCDFKxYQqjzVZJ123450679X8
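The saved table can also be post-processed without rerunning the count. As a rough sketch, assuming the CSV has the `char` and `count` columns that the script writes, the rows for upper- and lower-case letters can be merged into a single case-insensitive ranking. The function name `fold_case_counts` is hypothetical and not part of the original post.

```python
import csv
import io
from collections import Counter


def fold_case_counts(csv_text: str) -> list[tuple[str, int]]:
    """Merge upper- and lower-case rows of a char,count table.

    Digit rows are skipped; the result is sorted by descending count.
    """
    totals = Counter()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        ch = row["char"]
        if ch.isalpha():
            totals[ch.lower()] += int(row["count"])
    return totals.most_common()


if __name__ == "__main__":
    # Tiny made-up table, only to show the call; these are not real counts
    demo = "char,count\ne,3\nE,2\nt,4\n1,9\n"
    print(fold_case_counts(demo))
```

To apply it to a real result, read the file's contents first, for example `open("hamlet_TXT_FolgerShakespeare.txt.analysis.csv", encoding="utf-8", newline="").read()`, and pass the resulting string to the function.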