Counting the Distribution of Letters, Words, and Phrases in Text Files with Python
This is the pair-programming assignment for MSRA's Advanced Software Design course.
This post discusses the concrete implementation approach and process, including performance analysis and unit testing.
For how to use the analysis tools, see these two posts:
For the complete code of the project, see the GitHub repository below:
First, let's look at the project requirements.
User requirement: How are the frequencies of the 26 English letters distributed in a novel? Which words appear most often in a given type of article? Which words does a particular author use most? What are the most common phrases in Harry Potter, and so on? We will write some programs to answer these questions and satisfy our curiosity.
Requirements: unit testing, regression testing, and performance testing of the program; use of and debugging in basic languages such as C/C++/C#.
Task requirements:
Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, showing each letter's percentage to two decimal places.
Step-1: Output the top N most frequently occurring English words in a single file.
Step-2: Support stop words: build a stop word file (stop list) and skip those words when counting.
Step-3: Output the frequency of word phrases in an English text file, sorted from high to low, showing each phrase's percentage to two decimal places.
Step-4: Unify all verb forms before counting.
Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, showing each letter's percentage to two decimal places.
The initial idea: after removing the various extraneous symbols, iterate over every letter of the text file, keep the counts in a dictionary, look up each letter's entry, and increment its value by one. The concrete implementation is as follows:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Enoch  time:2018/10/22 0031
import time
import re
import operator
from string import punctuation

start = time.clock()

'''function: Calculate the letter frequency of each line
   input:  line  : a list contains a string for a row
           counts: an empty dictionary
   ouput:  counts: a dictionary, keys are letters and values are frequencies
   data: 2018/10/22
'''
def ProcessLine(line, counts):
    # Keep only lowercase letters; everything else is removed
    line = re.sub('[^a-z]', '', line)
    for ch in line:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def main():
    file = open("../Gone With The Wind.txt", 'r')
    wordsCount = 0
    alphabetCounts = {}
    for line in file:
        alphabetCounts = ProcessLine(line.lower(), alphabetCounts)
    wordsCount = sum(alphabetCounts.values())
    alphabetCounts = sorted(alphabetCounts.items(), key=lambda k: k[0])
    alphabetCounts = sorted(alphabetCounts, key=lambda k: k[1], reverse=True)
    for letter, fre in alphabetCounts:
        print("|\t{:15}|{:<11.2%}|".format(letter, fre / wordsCount))
    file.close()

if __name__ == '__main__':
    main()
    end = time.clock()
    print(end - start)
In theory this code is correct, but to verify its correctness we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample, and check each of them. The unit test code is as follows:
from count import CountLetters
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
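As a sketch of how the Test.txt check could be made self-verifying, the line-level helper can be asserted directly with unittest (assuming ProcessLine is exposed in count.py alongside CountLetters; the sample string and expected counts are illustrative):
import unittest
from count import ProcessLine

class TestProcessLine(unittest.TestCase):
    def test_empty_line(self):
        # an empty line should add no counts at all
        self.assertEqual(ProcessLine("", {}), {})

    def test_fixed_line(self):
        # "aab" should yield two 'a's and one 'b'
        self.assertEqual(ProcessLine("aab", {}), {"a": 2, "b": 1})

if __name__ == '__main__':
    unittest.main()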
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run my_program.py arg1 arg2
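After the run, the per-module summary shown below can be printed with coverage's standard report command:
coverage report -m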
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountLetters 56 50 100%
---------------------------------------------
TOTAL 56 50 100%
As shown, with code coverage at 100%, the code runs correctly.
But how fast does the program run? To get a clearer picture of its execution speed and to improve performance, we profiled it with cProfile; the profiling results are shown below.
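For reference, cProfile can be invoked from the command line roughly like this (the script name is a placeholder, as in the coverage example; -s sorts by cumulative time and -o saves the raw stats for later visualization):
python -m cProfile -s cumulative my_program.py
python -m cProfile -o profile.out my_program.py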
They show that main, ProcessLine, and ReplacePunctuations are the three most time-consuming functions, with ProcessLine the most expensive, so we need to see which functions ProcessLine() calls and how long they take.
Finally, we used the graph tool graphviz to draw the detailed timing, shown below:
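One common way to produce such a call-graph image from saved cProfile stats is the gprof2dot script piped into Graphviz's dot (the file names are illustrative, not necessarily the ones we used):
gprof2dot -f pstats profile.out | dot -Tpng -o callgraph.png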
From the graph we can see that the text has over nine thousand lines: lower() and re.sub were each called 9,023 times, and the letter-by-letter counting call get() was called 1,765,982 times. Indexing the dictionary one letter at a time is far too slow. We needed a new approach and thought of regular expressions: iterate over the alphabet and match each letter against the text with a regex. This gave us the second version of the function:
###################################################################################
#Name:count_letters
#Inputs:file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountLetters(file_name, n, stopName, verbName):
    print("File name:" + os.path.abspath(file_name))
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    if (verbName != None):
        print("Verb tenses normalizing is not supported in this function!")
    else:
        pass
    totalNum = 0
    dicNum = {}
    t0 = time.clock()
    if (stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    with open(file_name) as f:
        txt = f.read().lower()
    for letter in letters:  # letters is the module-level sequence 'a'..'z' (e.g. string.ascii_lowercase)
        dicNum[letter] = len(re.findall(letter, txt))
        totalNum += dicNum[letter]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n', '')
            try:
                del dicNum[word]  # dicNum is a plain dict, so a missing key would raise KeyError
            except KeyError:
                pass
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.clock()
    display(dicNum[:n], 'character', totalNum, 9)
    print("Time Consuming:%4f" % (t1 - t0))
This function cut the running time from 1.14 s down to 0.2 s. Repeating the earlier unit tests and performance analysis (results not pasted again here) verified that, with 100% code coverage, the code still runs correctly, and showed that the regular expression now takes the largest share of the run time. So we kept looking for a better solution and found the string's built-in count method. Replacing the regular expression with it, i.e. changing the line above:
dicNum[letter] = len(re.findall(letter,txt))
with
dicNum[letter] = txt.count(letter) #here count is faster than re
successfully reduced the time to 5.83e-5 s, an improvement of several orders of magnitude. At this point the optimization has essentially hit its limit and cannot go further.
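A minimal sketch of how the two approaches can be compared with timeit (the sample text and repetition count are illustrative; absolute numbers will differ by machine):
import re
import timeit

txt = "the quick brown fox jumps over the lazy dog " * 10000

t_re = timeit.timeit(lambda: len(re.findall("e", txt)), number=100)
t_count = timeit.timeit(lambda: txt.count("e"), number=100)  # str.count is a single C-level scan
print(t_re, t_count)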
Note: later versions added many features, and the code shown here already includes them; to run only the original functionality, pass None for the later parameters.
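For example, based on the signature above, a letters-only run (no stop list, no verb normalization) might look like this; the value 26 for n is illustrative:
CountLetters("gone_with_the_wind.txt", 26, None, None)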
Step-1: Output the top N most frequently occurring English words in a single file.
First, understand the definition of a word:
Word: a string that starts with an English letter and consists of English letters and alphanumeric characters. Words are separated by delimiters and are case-insensitive. In the output, all words are shown in lowercase.
English letters: A-Z, a-z. Alphanumeric characters: A-Z, a-z, 0-9. Delimiters: spaces and non-alphanumeric characters. Example: good123 is a word, 123good is not a word. good, Good, and GOOD are the same word.
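This definition maps directly onto a small regular expression; a quick sketch checking it against the examples above (the same pattern is used later in CountWords):
import re

pattern = r"[a-z][a-z0-9]*"   # starts with a letter, then letters or digits
print(re.findall(pattern, "good123 123good Good GOOD".lower()))
# ['good123', 'good', 'good', 'good'] -- '123good' itself is not matched, only its trailing 'good'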
The initial idea: after removing the extraneous symbols, split each line into words on spaces, iterate over every word in the text, keep the counts in a dictionary, look up each word's entry, and increment its value by one. The concrete implementation is as follows:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Eron time:2018/10/22 0022
import time
import re
start = time.time()
from string import punctuation #Temporarily useless
'''function:Calculate the word frequency of each line
input: line : a list contains a string for a row
counts: an empty dictionary
ouput: counts: a dictionary , keys are words and values are frequencies
data:2018/10/22
'''
def ProcessLine(line,counts):
#Replace the punctuation mark with a space
#line=ReplacePunctuations(line)
line = re.sub('[^a-z0-9]', ' ', line)
words = line.split()
for word in words:
counts[word] = counts.get(word, 0) + 1
return counts
'''function:Replace the punctuation mark with a space
input: line : A list containing a row of original strings
ouput: line: a list whose punctuation is all replaced with spaces
data:2018/10/22
'''
def ReplacePunctuations(line):
for ch in line :
#Create our own symbol list
tags = [',','.','?','"','“','”','—']
if ch in tags:
line=line.replace(ch," ")
return line
'''function:Create a taboo "stopwords.txt"
input: line : A list contains all the words in the "Gone With The Wind.txt"
ouput: nono
data:2018/10/23
'''
def CreatStopWordsTxt(list):
file = open('stopwords.txt', 'w')
for str in list:
file.write(str+'\t')
file.close()
'''function:Remove any words that do not meet the requirements
input: dict : A dict whose keys are words and values are frequencies
ouput: dictProc : A removed undesirable words dict
data:2018/10/23
'''
def RemoveUndesirableWords(dict):
wordsCount = 0 # Number of words
wordsCount = sum(dict.values())
dictProc = dict.copy();
for temp in list(dict):
if temp[0].isdigit():
del dictProc[temp]
else:
dictProc[temp] = round(dictProc[temp] / wordsCount, 4)
return dictProc
def CountWords(fileName):
file = open(fileName,'r')
count = 10 #Show the top count words that appear most frequently
alphabetCountsOrg={} # Creates an empty dictionary used to calculate word frequency
for line in file:
alphabetCountsOrg = ProcessLine(line.lower(), alphabetCountsOrg) #Calculate the word frequency of each line
alphabetCounts = RemoveUndesirableWords(alphabetCountsOrg) #Remove any words that do not meet the requirements
pairs = list(alphabetCounts.items()) #Get the key-value pairs from the dictionary
items = [[x,y]for (y,x)in pairs] #key-value pairs in the list exchange locations, data pairs sort
items.sort(reverse=True)
#Notice we didn't order words of the same frequency
    for i in range(min(count, len(items))):  # guard against files with fewer than 'count' distinct words
print(items[i][1] + "\t" + str(items[i][0]))
file.close()
#CreatStopWordsTxt(alphabetCounts.keys())
if __name__ == '__main__':
CountWords("gone_with_the_wind.txt")
end = time.time()
print (end-start)
In theory this code is correct, but to verify its correctness we again use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
But how fast does the program run? To get a clearer picture of its execution speed and to improve performance, we profiled it with cProfile again; the profiling results are shown below.
They show that sub, split, and get are the three most time-consuming calls, with sub the most expensive, so we need to see which functions ProcessLine() calls and how long they take.
Finally, we used graphviz to draw the detailed timing, shown below:
From the graph we can see that the text has over nine thousand lines: lower() and re.sub were each called 9,023 times, and get() was called 1,765,982 times. Indexing the dictionary one word at a time is too slow. We needed a new approach and again turned to regular expressions: use re's findall to extract all words into a word list, then use collections.Counter to count the repeated elements in it. This gives the following code:
###################################################################################
#Name:count_words
#Inputs:file name,the first n words, stopfile name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountWords(file_name,n,stopName,verbName):
print("File name:" + sys.path[0] + "\\" + file_name)
if (stopName != None):
stopflag = True
else:
stopflag = False
if(verbName != None):
verbflag = True
else:
verbflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = r"[a-z][a-z0-9]*"
wordList = re.findall(pattern,txt)
totalNum = len(wordList)
tempc = Counter(wordList)
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
dicNum = dict(tempc.most_common(n))
if (verbflag == True):
totalNum = 0
verbDic = {}
verbDicNum = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
verbDic[key] = value.replace('\n','').split(',')
verbDicNum[key] = tempc[key]
for word in verbDic[key]:
verbDicNum[key] += tempc[word]
totalNum += verbDicNum[key]
verbDicNum = sorted(verbDicNum.items(), key=lambda k: k[0])
verbDicNum = sorted(verbDicNum, key=lambda k: k[1], reverse=True)
dicNum = sorted(dicNum.items(), key=lambda k:k[0])
dicNum = sorted(dicNum, key=lambda k:k[1], reverse=True)
t1 = time.clock()
if (verbflag == True):
display(verbDicNum[:n], 'words',totalNum,3)
else:
display(dicNum,'words',totalNum,3)
print("Time Consuming:%4f" % (t1 - t0))
After this change, unit tests and regression tests still have to be run (omitted here to avoid repetition). The run time dropped to 0.34 s, a very large improvement; at this point the optimization has essentially reached its bottleneck and cannot go further.
Step-2: Support stop words: we can build a stop word file (stop list) and skip these words when counting.
The stop list does not need the kind of performance tuning the earlier steps did, because this feature is built on top of the already-optimized functions, so only unit testing and regression testing are required. The implementation approach: since every word's occurrence count is already available, we just read the words in the stop word file and delete them from the counts. Because the counting uses a Counter, we only need to iterate over the stop words and delete each one from the Counter, giving the following code:
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
We found that the earlier CountLetters did not support the stop word feature, so we went back and modified that function as well. Since it does not use a Counter, supporting stop words means deleting the entry from the plain dictionary, which gives:
if (stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
        try:
            del dicNum[word]  # dicNum is a plain dict; a missing stop word would raise KeyError
        except KeyError:
            pass
After unit testing and regression testing, the results are correct.
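The reason the Counter-based CountWords can use a bare del while CountLetters needs try/except is that Counter overrides __delitem__ to ignore missing keys, whereas a plain dict raises KeyError; a quick sketch:
from collections import Counter

c = Counter({"the": 3})
del c["and"]       # no error: Counter silently ignores missing keys

d = {"the": 3}
try:
    del d["and"]   # a plain dict raises KeyError for a missing key
except KeyError:
    pass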
Step-3: Output the frequency of word phrases in an English text file, sorted from high to low, showing each phrase's percentage to two decimal places.
First, understand the definition of a phrase:
Phrase: two or more English words with only spaces between them. See the following examples:
hello world //this is a phrase
hello, world //this is not a phrase
This means one sentence can contain many phrases. For example:
I am not a good boy.
contains: I am, am not, not a, a good, good boy.
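A minimal sketch of the sliding-window idea for extracting such two-word phrases from a single sentence (the full implementation, which first splits the text into sentences, follows below):
words = "i am not a good boy".split()
n = 2   # phrase length
phrases = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
print(phrases)   # ['i am', 'am not', 'not a', 'a good', 'good boy']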
This defeats a plain regular expression, because a single pass cannot capture the overlapping matches. My teammate's idea was to first split the text into sentences and then extract the phrases from each sentence with a for loop, which led to the following code:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: albert time:2018/10/23 0023
import time
import re
import string
from collections import Counter
start = time.time()
from string import punctuation # Temporarily useless
def NumWordFrequency(fileContent,number):
fileContent = re.sub('\n|\t',' ',fileContent)
mPunctuation = r',|;|\?|\!|\.|\:|\“|\"|\”'
sentenceList = re.split(mPunctuation , fileContent)#Divide the text into sentences according to the punctuation marks
wordsCounts = {} # Creates an empty dictionary used to calculate word frequency
for oneSentence in sentenceList:
wordsCounts = ProcessLine(oneSentence.lower(), wordsCounts,number) # Calculate the specified length phrase frequency
return wordsCounts
'''function:Calculate the word frequency of each line
input: line : a list contains a string for a row
countsDict: an empty dictionary
ouput: counts: a dictionary , keys are words and values are frequencies
data:2018/10/22
'''
def ProcessLine(sentence, countsDict,number):
# Replace the punctuation mark with a space
# line=ReplacePunctuations(line)
sentence = re.sub('[^a-z0-9]', ' ', sentence)
words = sentence.split()
if len(words) >= number:
for i in range(len(words)-number+1):
countsDict[" ".join(words[i:i+number])] = countsDict.get(" ".join(words[i:i+number]), 0) + 1
else:
if sentence.strip()=='': #Judge if the sentence is empty
return countsDict
countsDict[sentence] = countsDict.get(sentence, 0) + 1
return countsDict
'''function:Replace the punctuation mark with a space
input: line : A list containing a row of original strings
ouput: line: a list whose punctuation is all replaced with spaces
data:2018/10/22
'''
def ReplacePunctuations(line):
for ch in line:
# Create our own symbol list
tags = [',', '.', '?', '"', '“', '”', '—']
if ch in tags:
line = line.replace(ch, " ")
return line
'''function:Create a taboo "stopwords.txt"
input: line : A list contains all the words in the "Gone With The Wind.txt"
ouput: nono
data:2018/10/23
'''
def CreatStopWordsTxt(list):
file = open('stopwords.txt', 'w')
for str in list:
file.write(str + '\t')
file.close()
'''function:Remove any words that do not meet the requirements
input: dict : A dict whose keys are words and values are frequencies
ouput: dict : A removed undesirable words dict
data:2018/10/23
'''
def RemoveUndesirableWords(dict):
'''
wordsCount = 0 # Number of words
wordsCount = sum(dict.values())
'''
listKey = list(dict)
for temp in listKey:
if temp[0].isdigit():
del dict[temp]
#else:
# dict[temp] = round(dict[temp] , 4)
return dict
'''function:Remove the words from the "stopwords.txt"
input: dict : A list transformed by a dict whose keys are words and values are frequencies
ouput: dictProc : A list after removing stopwords
data:2018/10/23
'''
def StopWordProcessing(dict):
fileTabu = open("stopwords1.txt", 'r')
stopWordlist = fileTabu.read()
fileTabu.close()
stopWordlist = re.sub('[^a-z0-9]', ' ', stopWordlist).split(' ')
dictProc = dict.copy()
for temp in dict.keys():
if temp.strip() in stopWordlist:
del dictProc[temp]
return dictProc
class WordFinder(object):
'''A compound structure of dictionary and set to store word mapping'''
def __init__(self):
self.mainTable = {}
for char in string.ascii_lowercase:
self.mainTable[char] = {}
self.specialTable = {}
#print(self.mainTable)
for headword, related in lemmas.items():
# Only 3 occurrences of uppercase in lemmas.txt, which include 'I'
# Trading precision for simplicity
headword = headword.lower()
try:
related = related.lower()
except AttributeError:
related = None
if related:
for word in related.split():
if word[0] != headword[0]:
self.specialTable[headword] = set(related.split())
break
else:
self.mainTable[headword[0]][headword] = set(related.split())
else:
self.mainTable[headword[0]][headword] = None
#print(self.specialTable)
#print(self.mainTable)
def find_headword(self, word):
"""Search the 'table' and return the original form of a word"""
word = word.lower()
alphaTable = self.mainTable[word[0]]
if word in alphaTable:
return word
for headword, related in alphaTable.items():
if related and (word in related):
return headword
for headword, related in self.specialTable.items():
if word == headword:
return word
if word in related:
return headword
# This should never happen after the removal of words not in valid_words
# in Book.__init__()
return None
# TODO
def find_related(self, headword):
pass
def VerbTableFrequency(fileContent):
global lemmas
global allVerbWords
lemmas = {}
allVerbWords = set()
with open('verbs.txt') as fileVerb:
# print(fileVerb.read())
for line in fileVerb:
# print(line)
line = re.sub(r'\n|\s|\,', ' ', line)
headWord = line.split('->')[0]
# print(headWord)
# print(headWord)
try:
related = line.split('->')[1]
# print(related)
except IndexError:
related = None
lemmas[headWord] = related
allVerbWords = set()
for headWord, related in lemmas.items():
allVerbWords.add(headWord)
# print(allVerbWords)
# print("\t")
if related:
allVerbWords.update(set(related.split()))
# allVerbWords.update(related)
tempList = re.split(r'\b([a-zA-Z-]+)\b',fileContent)
tempList = [item for item in tempList if (item in allVerbWords)]
finder = WordFinder()
tempList = [finder.find_headword(item) for item in tempList]
cnt = Counter()
for word in tempList:
cnt[word] += 1
#print(type(cnt))
return cnt
def main():
with open("Gone With The Wind.txt") as file :
content = file.read().lower()
outCounts = 10 # Show the top count words that appear most frequently
number = 1 #Phrase length
flag = 1
if flag == 1:
verbFreCount = VerbTableFrequency(content)
#print(type(cnt))
wordsCounts ={}
for word in sorted(verbFreCount, key=lambda x: verbFreCount[x], reverse=True):
wordsCounts[word] = verbFreCount[word]
print(type(wordsCounts))
freCountNum = sum(wordsCounts.values())
#print (freCountNum )
for word, fre in list(wordsCounts.items())[0:outCounts]:
print("|\t{:15}|{:<11.2f}|".format(word,fre / freCountNum))
print("--------------------------------")
else:
wordsCounts = NumWordFrequency(content,number)
wordsCounts = RemoveUndesirableWords(wordsCounts) # Remove any words that do not meet the requirements
wordsCounts = StopWordProcessing(wordsCounts) # Remove the words from the "stopwords.txt"
pairsList = list(wordsCounts.items()) # Get the key-value pairsList from the dictionary
items = [[x, y] for (y, x) in pairsList] # key-value pairsList in the list exchange locations, data pairsList sort
items.sort(reverse=True)
# Notice we didn't order words of the same frequency
        for i in range(min(outCounts, len(items))):  # guard against texts with fewer than outCounts phrases
print(items[i][1] + "\t" + str(items[i][0]))
if __name__ == '__main__':
main()
end = time.time()
print(end - start)
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountPhrase
CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)
CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountPhrase 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
from count import CountPhrase
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)
CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")
We found that the earlier CountPhrases did not support the stop word feature either, so we modified that function as well, using the same idea as in CountWords.
After unit testing and regression testing, the results are correct.
But how fast does the program run? To understand its execution speed, we profiled it with cProfile: the total run took 2.39 s, and we wanted to bring that cost down.
So we optimized it with what turned out to be a neat trick: treat the whole text as one giant sentence split by punctuation, match the phrase pattern with a regular expression once (this pass misses the overlapping phrases), then delete the first word and match again to pick up the missed ones, and keep doing this until n-1 words have been deleted. This gives the final version of the code:
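A toy sketch of this trick for two-word phrases, shown before the full function below (the sample text is illustrative; the pattern follows the same style as the code):
import re
from collections import Counter

txt = "i am not a good boy"
pattern = r"[a-z]+[0-9]*[\s|,][a-z]+[0-9]*"         # two adjacent words

found = re.findall(pattern, txt)                    # pass 1: 'i am', 'not a', 'good boy'
txt2 = re.sub(r"[a-z]+[0-9]*", "", txt, 1).strip()  # drop the first word
found += re.findall(pattern, txt2)                  # pass 2: 'am not', 'a good'
print(Counter(found))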
###################################################################################
#Name:count_phrases
#Inputs:file name, the first n phrases, stopfile name, verbfile name, phrase length
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountPhrases(file_name,n,stopName,verbName,k):
print("File name:" + sys.path[0] + "\\" + file_name)
totalNum = 0
if (stopName != None):
stopflag = True
else:
stopflag = False
if(verbName != None):
verbflag = True
else:
verbflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
txt = re.sub(r'[\s|\']+',' ',txt)
pword = r'(([a-z]+ )+[a-z]+)' # extract sentence
pattern = re.compile(pword)
sentence = pattern.findall(txt)
txt = ','.join([sentence[m][0] for m in range(len(sentence))])
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = "[a-z]+[0-9]*"
for i in range(k-1):
pattern += "[\s|,][a-z]+[0-9]*"
wordList = []
for i in range(k):
if( i == 0 ):
tempList = re.findall(pattern, txt)
else:
wordpattern = "[a-z]+[0-9]*"
txt = re.sub(wordpattern, '', txt, 1).strip()
tempList = re.findall(pattern, txt)
wordList += tempList
tempc = Counter(wordList)
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
dicNum = {}
if (verbflag == True):
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n', '').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if (',' not in phrase):
totalNum += 1
verbList = phrase.split(' ')
normPhrase = verbList[0]
for verb in verbList[1:]:
if verb in verbDic.keys():
verb = verbDic[verb]
normPhrase += ' ' + verb
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
else:
phrases = tempc.keys()
for phrase in phrases:
if (',' not in phrase):
dicNum[phrase] = tempc[phrase]
totalNum += tempc[phrase]
dicNum = sorted(dicNum.items(), key=lambda k: k[0])
dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
t1 = time.clock()
display(dicNum[:n], 'Phrases',totalNum,3)
print("Time Consuming:%4f" % (t1 - t0))
Running the unit test and regression test code above showed the results unchanged, and the time dropped to 1.8 s, which meets the goal of this round of optimization.
Step-4: Unify all verb forms before counting.
First, let's look at how the verb forms appear in verbs.txt:
abandon -> abandons,abandoning,abandoned
abase -> abases,abasing,abased
abate -> abates,abating,abated
abbreviate -> abbreviates,abbreviating,abbreviated
abdicate -> abdicates,abdicating,abdicated
abduct -> abducts,abducting,abducted
abet -> abets,abetting,abetted
abhor -> abhors,abhorring,abhorred
The left side is the verb's base form and the right side lists its other forms. Since all the words have already been counted, what we need to do now is read verbs.txt into a dictionary and use it to add together the counts of the different forms of the same verb. The code is as follows:
if (verbflag == True):
totalNum = 0
verbDic = {}
verbDicNum = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
verbDic[key] = value.replace('\n','').split(',')
verbDicNum[key] = tempc[key]
for word in verbDic[key]:
verbDicNum[key] += tempc[word]
totalNum += verbDicNum[key]
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords,CountPhrases
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopphrases.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
For words, our verification shows the results are correct, but we found that phrases did not yet support the verbs.txt feature, so we modified the phrase function as well. How to normalize the forms? We hit on a neat trick: use every verb form as a key whose value is the base form; then looking up any form maps it back to the base form. That gives the following code:
if (verbflag == True):
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n', '').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if (',' not in phrase):
totalNum += 1
verbList = phrase.split(' ')
normPhrase = verbList[0]
for verb in verbList[1:]:
if verb in verbDic.keys():
verb = verbDic[verb]
normPhrase += ' ' + verb
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
After this verification, we still did not know the code coverage, so we used the coverage tool again, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
CountLetters("Null.txt","Verbs.txt","Stopwords.txt")
CountLetters("Test.txt","Verbs.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Verbs.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
We found that the earlier CountLetters did not support the verbs.txt feature either, so we started modifying that function too, but then concluded that normalizing words before counting letter occurrences is meaningless, so we deleted that code.
Step-5: Count the occurrences of verb-preposition phrases.
First, let's look at the definition of a verb-preposition phrase:
VerbPhrase  := Verb + Spaces + Preposition
Spaces      := Space+
Space       := ' ' | '\t' | '\r' | '\n'
Preposition := <any one from the list of prepositions>
Verb        := <any one in any tense FROM THE DICTIONARY>
At first we did not realize that Step 5 is closely related to Step 4, so we wrote the code for this step from scratch, building an extremely long regular expression that joined all the verb forms with '|' in a for loop. It took 56 s in total, which is essentially unusable, so we discarded that approach outright without doing unit testing or performance analysis, since it clearly had to be rethought. Then we remembered that Step 4 already counts all the phrases: we can reuse those counts, as long as we normalize the verbs and check for prepositions. For normalization we used the same trick: each verb form is a key whose value is the base form, so any form can be mapped back to its original. This gives the final code:
###################################################################################
#Name:count_verb_pre
#Inputs:file name, the first n phrases, stopfile name, verbfile name, preposition file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountVerbPre(file_name,n,stopName,verbName,preName):
print("File name:" + sys.path[0] + "\\" + file_name)
dicNum = {}
totalNum = 0
if (stopName != None):
stopflag = True
else:
stopflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
txt = re.sub(r'[\s|\']+',' ',txt)
pword = r'(([a-z]+ )+[a-z]+)' # extract sentence
pattern = re.compile(pword)
sentence = pattern.findall(txt)
txt = ','.join([sentence[m][0] for m in range(len(sentence))])
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = "[a-z]+[0-9]*"
for i in range(1):
pattern += "[\s|,][a-z]+[0-9]*"
wordList = []
for i in range(2):
if( i == 0 ):
tempList = re.findall(pattern, txt)
else:
wordpattern = "[a-z]+[0-9]*"
txt = re.sub(wordpattern, '', txt, 1).strip()
tempList = re.findall(pattern, txt)
wordList += tempList
tempc = Counter(wordList)
with open(preName) as f:
preTxt = f.read()
preList = preTxt.split('\n')
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n','').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if(',' not in phrase):
totalNum += 1
verb, pre = phrase.split(' ')
if (verb in verbDic.keys() and pre in preList):
normPhrase = verbDic[verb] + ' ' + pre
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            try:
                del dicNum[word]  # dicNum is a plain dict; skip stop entries that are not present
            except KeyError:
                pass
dicNum = sorted(dicNum.items(), key=lambda k: k[0])
dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
t1 = time.clock()
display(dicNum[:n], 'VerbPre',totalNum, 3)
print("Time Consuming:%4f"%(t1-t0))