Counting the Distribution of Letters, Words, and Phrases in Text Files with Python
This is the pair-programming assignment for MSRA's Advanced Software Design course.
This post discusses the concrete implementation approach and process, including performance analysis and unit testing.
For how to use the analysis tools, see these two posts:
For the complete code of the project, see the GitHub repository below:
First, let's look at the project requirements.
User requirement: How are the frequencies of the 26 English letters distributed in a novel? Which words appear most often in a given type of article? Which words does a particular author use most? What are the most common phrases in Harry Potter, and so on? We will write some programs to answer these questions and satisfy our curiosity.
Requirements: unit testing, regression testing, and performance testing of the program; use of and debugging in basic languages such as C/C++/C#.
Task requirements:
Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, showing each letter's percentage to two decimal places.
Step-1: Output the top N most frequently occurring English words in a single file.
Step-2: Support stop words: build a stop word file (stop list) and skip those words when counting.
Step-3: Output the frequency of word phrases in an English text file, sorted from high to low, showing each phrase's percentage to two decimal places.
Step-4: Unify all verb forms before counting.
Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, showing each letter's percentage to two decimal places.
The initial idea: after removing the various extraneous symbols, iterate over every letter of the text file, keep the counts in a dictionary, look up each letter's entry, and increment its value by one. The concrete implementation is as follows:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Enoch  time:2018/10/22 0031
import time
import re
import operator
from string import punctuation

start = time.clock()

'''function: Calculate the letter frequency of each line
   input:  line  : a list contains a string for a row
           counts: an empty dictionary
   ouput:  counts: a dictionary, keys are letters and values are frequencies
   data: 2018/10/22
'''
def ProcessLine(line, counts):
    # Keep only lowercase letters; everything else is removed
    line = re.sub('[^a-z]', '', line)
    for ch in line:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def main():
    file = open("../Gone With The Wind.txt", 'r')
    wordsCount = 0
    alphabetCounts = {}
    for line in file:
        alphabetCounts = ProcessLine(line.lower(), alphabetCounts)
    wordsCount = sum(alphabetCounts.values())
    alphabetCounts = sorted(alphabetCounts.items(), key=lambda k: k[0])
    alphabetCounts = sorted(alphabetCounts, key=lambda k: k[1], reverse=True)
    for letter, fre in alphabetCounts:
        print("|\t{:15}|{:<11.2%}|".format(letter, fre / wordsCount))
    file.close()

if __name__ == '__main__':
    main()
    end = time.clock()
    print(end - start)
In theory this code is correct, but to verify its correctness we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample, and check each of them. The unit test code is as follows:
from count import CountLetters
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
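As a sketch of how the Test.txt check could be made self-verifying, the line-level helper can be asserted directly with unittest (assuming ProcessLine is exposed in count.py alongside CountLetters; the sample string and expected counts are illustrative):
import unittest
from count import ProcessLine

class TestProcessLine(unittest.TestCase):
    def test_empty_line(self):
        # an empty line should add no counts at all
        self.assertEqual(ProcessLine("", {}), {})

    def test_fixed_line(self):
        # "aab" should yield two 'a's and one 'b'
        self.assertEqual(ProcessLine("aab", {}), {"a": 2, "b": 1})

if __name__ == '__main__':
    unittest.main()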
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run my_program.py arg1 arg2
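After the run, the per-module summary shown below can be printed with coverage's standard report command:
coverage report -m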
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountLetters 56 50 100%
---------------------------------------------
TOTAL 56 50 100%
As shown, with code coverage at 100%, the code runs correctly.
But how fast does the program run? To get a clearer picture of its execution speed and to improve performance, we profiled it with cProfile; the profiling results are shown below.
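For reference, cProfile can be invoked from the command line roughly like this (the script name is a placeholder, as in the coverage example; -s sorts by cumulative time and -o saves the raw stats for later visualization):
python -m cProfile -s cumulative my_program.py
python -m cProfile -o profile.out my_program.py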
They show that main, ProcessLine, and ReplacePunctuations are the three most time-consuming functions, with ProcessLine the most expensive, so we need to see which functions ProcessLine() calls and how long they take.
Finally, we used the graph tool graphviz to draw the detailed timing, shown below:
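One common way to produce such a call-graph image from saved cProfile stats is the gprof2dot script piped into Graphviz's dot (the file names are illustrative, not necessarily the ones we used):
gprof2dot -f pstats profile.out | dot -Tpng -o callgraph.png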
From the graph we can see that the text has over nine thousand lines: lower() and re.sub were each called 9,023 times, and the letter-by-letter counting call get() was called 1,765,982 times. Indexing the dictionary one letter at a time is far too slow. We needed a new approach and thought of regular expressions: iterate over the alphabet and match each letter against the text with a regex. This gave us the second version of the function:
###################################################################################
#Name:count_letters
#Inputs:file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountLetters(file_name, n, stopName, verbName):
    print("File name:" + os.path.abspath(file_name))
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    if (verbName != None):
        print("Verb tenses normalizing is not supported in this function!")
    else:
        pass
    totalNum = 0
    dicNum = {}
    t0 = time.clock()
    if (stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    with open(file_name) as f:
        txt = f.read().lower()
    for letter in letters:  # letters is the module-level sequence 'a'..'z' (e.g. string.ascii_lowercase)
        dicNum[letter] = len(re.findall(letter, txt))
        totalNum += dicNum[letter]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n', '')
            try:
                del dicNum[word]  # dicNum is a plain dict, so a missing key would raise KeyError
            except KeyError:
                pass
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.clock()
    display(dicNum[:n], 'character', totalNum, 9)
    print("Time Consuming:%4f" % (t1 - t0))
This function cut the running time from 1.14 s down to 0.2 s. Repeating the earlier unit tests and performance analysis (results not pasted again here) verified that, with 100% code coverage, the code still runs correctly, and showed that the regular expression now takes the largest share of the run time. So we kept looking for a better solution and found the string's built-in count method. Replacing the regular expression with it, i.e. changing the line above:
dicNum[letter] = len(re.findall(letter,txt))
with
dicNum[letter] = txt.count(letter) #here count is faster than re
successfully reduced the time to 5.83e-5 s, an improvement of several orders of magnitude. At this point the optimization has essentially hit its limit and cannot go further.
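A minimal sketch of how the two approaches can be compared with timeit (the sample text and repetition count are illustrative; absolute numbers will differ by machine):
import re
import timeit

txt = "the quick brown fox jumps over the lazy dog " * 10000

t_re = timeit.timeit(lambda: len(re.findall("e", txt)), number=100)
t_count = timeit.timeit(lambda: txt.count("e"), number=100)  # str.count is a single C-level scan
print(t_re, t_count)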
Note: later versions added many features, and the code shown here already includes them; to run only the original functionality, pass None for the later parameters.
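For example, based on the signature above, a letters-only run (no stop list, no verb normalization) might look like this; the value 26 for n is illustrative:
CountLetters("gone_with_the_wind.txt", 26, None, None)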
Step-1: Output the top N most frequently occurring English words in a single file.
First, understand the definition of a word:
Word: a string that starts with an English letter and consists of English letters and alphanumeric characters. Words are separated by delimiters and are case-insensitive. In the output, all words are shown in lowercase.
English letters: A-Z, a-z. Alphanumeric characters: A-Z, a-z, 0-9. Delimiters: spaces and non-alphanumeric characters. Example: good123 is a word, 123good is not a word. good, Good, and GOOD are the same word.
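This definition maps directly onto a small regular expression; a quick sketch checking it against the examples above (the same pattern is used later in CountWords):
import re

pattern = r"[a-z][a-z0-9]*"   # starts with a letter, then letters or digits
print(re.findall(pattern, "good123 123good Good GOOD".lower()))
# ['good123', 'good', 'good', 'good'] -- '123good' itself is not matched, only its trailing 'good'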
The initial idea: after removing the extraneous symbols, split each line into words on spaces, iterate over every word in the text, keep the counts in a dictionary, look up each word's entry, and increment its value by one. The concrete implementation is as follows:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Eron time:2018/10/22 0022
import time
import re
start = time.time()
from string import punctuation #Temporarily useless
'''function:Calculate the word frequency of each line
input: line : a list contains a string for a row
counts: an empty dictionary
ouput: counts: a dictionary , keys are words and values are frequencies
data:2018/10/22
'''
def ProcessLine(line,counts):
#Replace the punctuation mark with a space
#line=ReplacePunctuations(line)
line = re.sub('[^a-z0-9]', ' ', line)
words = line.split()
for word in words:
counts[word] = counts.get(word, 0) + 1
return counts
'''function:Replace the punctuation mark with a space
input: line : A list containing a row of original strings
ouput: line: a list whose punctuation is all replaced with spaces
data:2018/10/22
'''
def ReplacePunctuations(line):
for ch in line :
#Create our own symbol list
tags = [',','.','?','"','“','”','—']
if ch in tags:
line=line.replace(ch," ")
return line
'''function:Create a taboo "stopwords.txt"
input: line : A list contains all the words in the "Gone With The Wind.txt"
ouput: nono
data:2018/10/23
'''
def CreatStopWordsTxt(list):
file = open('stopwords.txt', 'w')
for str in list:
file.write(str+'\t')
file.close()
'''function:Remove any words that do not meet the requirements
input: dict : A dict whose keys are words and values are frequencies
ouput: dictProc : A removed undesirable words dict
data:2018/10/23
'''
def RemoveUndesirableWords(dict):
wordsCount = 0 # Number of words
wordsCount = sum(dict.values())
dictProc = dict.copy();
for temp in list(dict):
if temp[0].isdigit():
del dictProc[temp]
else:
dictProc[temp] = round(dictProc[temp] / wordsCount, 4)
return dictProc
def CountWords(fileName):
file = open(fileName,'r')
count = 10 #Show the top count words that appear most frequently
alphabetCountsOrg={} # Creates an empty dictionary used to calculate word frequency
for line in file:
alphabetCountsOrg = ProcessLine(line.lower(), alphabetCountsOrg) #Calculate the word frequency of each line
alphabetCounts = RemoveUndesirableWords(alphabetCountsOrg) #Remove any words that do not meet the requirements
pairs = list(alphabetCounts.items()) #Get the key-value pairs from the dictionary
items = [[x,y]for (y,x)in pairs] #key-value pairs in the list exchange locations, data pairs sort
items.sort(reverse=True)
#Notice we didn't order words of the same frequency
    for i in range(min(count, len(items))):  # guard against files with fewer than 'count' distinct words
print(items[i][1] + "\t" + str(items[i][0]))
file.close()
#CreatStopWordsTxt(alphabetCounts.keys())
if __name__ == '__main__':
CountWords("gone_with_the_wind.txt")
end = time.time()
print (end-start)
In theory this code is correct, but to verify its correctness we again use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
But how fast does the program run? To get a clearer picture of its execution speed and to improve performance, we profiled it with cProfile again; the profiling results are shown below.
They show that sub, split, and get are the three most time-consuming calls, with sub the most expensive, so we need to see which functions ProcessLine() calls and how long they take.
Finally, we used graphviz to draw the detailed timing, shown below:
From the graph we can see that the text has over nine thousand lines: lower() and re.sub were each called 9,023 times, and get() was called 1,765,982 times. Indexing the dictionary one word at a time is too slow. We needed a new approach and again turned to regular expressions: use re's findall to extract all words into a word list, then use collections.Counter to count the repeated elements in it. This gives the following code:
###################################################################################
#Name:count_words
#Inputs:file name,the first n words, stopfile name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountWords(file_name,n,stopName,verbName):
print("File name:" + sys.path[0] + "\\" + file_name)
if (stopName != None):
stopflag = True
else:
stopflag = False
if(verbName != None):
verbflag = True
else:
verbflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = r"[a-z][a-z0-9]*"
wordList = re.findall(pattern,txt)
totalNum = len(wordList)
tempc = Counter(wordList)
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
dicNum = dict(tempc.most_common(n))
if (verbflag == True):
totalNum = 0
verbDic = {}
verbDicNum = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
verbDic[key] = value.replace('\n','').split(',')
verbDicNum[key] = tempc[key]
for word in verbDic[key]:
verbDicNum[key] += tempc[word]
totalNum += verbDicNum[key]
verbDicNum = sorted(verbDicNum.items(), key=lambda k: k[0])
verbDicNum = sorted(verbDicNum, key=lambda k: k[1], reverse=True)
dicNum = sorted(dicNum.items(), key=lambda k:k[0])
dicNum = sorted(dicNum, key=lambda k:k[1], reverse=True)
t1 = time.clock()
if (verbflag == True):
display(verbDicNum[:n], 'words',totalNum,3)
else:
display(dicNum,'words',totalNum,3)
print("Time Consuming:%4f" % (t1 - t0))
After this change, unit tests and regression tests still have to be run (omitted here to avoid repetition). The run time dropped to 0.34 s, a very large improvement; at this point the optimization has essentially reached its bottleneck and cannot go further.
Step-2: Support stop words: we can build a stop word file (stop list) and skip these words when counting.
The stop list does not need the kind of performance tuning the earlier steps did, because this feature is built on top of the already-optimized functions, so only unit testing and regression testing are required. The implementation approach: since every word's occurrence count is already available, we just read the words in the stop word file and delete them from the counts. Because the counting uses a Counter, we only need to iterate over the stop words and delete each one from the Counter, giving the following code:
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
We found that the earlier CountLetters did not support the stop word feature, so we went back and modified that function as well. Since it does not use a Counter, supporting stop words means deleting the entry from the plain dictionary, which gives:
if (stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
        try:
            del dicNum[word]  # dicNum is a plain dict; a missing stop word would raise KeyError
        except KeyError:
            pass
After unit testing and regression testing, the results are correct.
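The reason the Counter-based CountWords can use a bare del while CountLetters needs try/except is that Counter overrides __delitem__ to ignore missing keys, whereas a plain dict raises KeyError; a quick sketch:
from collections import Counter

c = Counter({"the": 3})
del c["and"]       # no error: Counter silently ignores missing keys

d = {"the": 3}
try:
    del d["and"]   # a plain dict raises KeyError for a missing key
except KeyError:
    pass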
Step-3: Output the frequency of word phrases in an English text file, sorted from high to low, showing each phrase's percentage to two decimal places.
First, understand the definition of a phrase:
Phrase: two or more English words with only spaces between them. See the following examples:
hello world //this is a phrase
hello, world //this is not a phrase
This means one sentence can contain many phrases. For example:
I am not a good boy.
contains: I am, am not, not a, a good, good boy.
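A minimal sketch of the sliding-window idea for extracting such two-word phrases from a single sentence (the full implementation, which first splits the text into sentences, follows below):
words = "i am not a good boy".split()
n = 2   # phrase length
phrases = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
print(phrases)   # ['i am', 'am not', 'not a', 'a good', 'good boy']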
This defeats a plain regular expression, because a single pass cannot capture the overlapping matches. My teammate's idea was to first split the text into sentences and then extract the phrases from each sentence with a for loop, which led to the following code:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: albert time:2018/10/23 0023
import time
import re
import string
from collections import Counter
start = time.time()
from string import punctuation # Temporarily useless
def NumWordFrequency(fileContent,number):
fileContent = re.sub('\n|\t',' ',fileContent)
mPunctuation = r',|;|\?|\!|\.|\:|\“|\"|\”'
sentenceList = re.split(mPunctuation , fileContent)#Divide the text into sentences according to the punctuation marks
wordsCounts = {} # Creates an empty dictionary used to calculate word frequency
for oneSentence in sentenceList:
wordsCounts = ProcessLine(oneSentence.lower(), wordsCounts,number) # Calculate the specified length phrase frequency
return wordsCounts
'''function:Calculate the word frequency of each line
input: line : a list contains a string for a row
countsDict: an empty dictionary
ouput: counts: a dictionary , keys are words and values are frequencies
data:2018/10/22
'''
def ProcessLine(sentence, countsDict,number):
# Replace the punctuation mark with a space
# line=ReplacePunctuations(line)
sentence = re.sub('[^a-z0-9]', ' ', sentence)
words = sentence.split()
if len(words) >= number:
for i in range(len(words)-number+1):
countsDict[" ".join(words[i:i+number])] = countsDict.get(" ".join(words[i:i+number]), 0) + 1
else:
if sentence.strip()=='': #Judge if the sentence is empty
return countsDict
countsDict[sentence] = countsDict.get(sentence, 0) + 1
return countsDict
'''function:Replace the punctuation mark with a space
input: line : A list containing a row of original strings
ouput: line: a list whose punctuation is all replaced with spaces
data:2018/10/22
'''
def ReplacePunctuations(line):
for ch in line:
# Create our own symbol list
tags = [',', '.', '?', '"', '“', '”', '—']
if ch in tags:
line = line.replace(ch, " ")
return line
'''function:Create a taboo "stopwords.txt"
input: line : A list contains all the words in the "Gone With The Wind.txt"
ouput: nono
data:2018/10/23
'''
def CreatStopWordsTxt(list):
file = open('stopwords.txt', 'w')
for str in list:
file.write(str + '\t')
file.close()
'''function:Remove any words that do not meet the requirements
input: dict : A dict whose keys are words and values are frequencies
ouput: dict : A removed undesirable words dict
data:2018/10/23
'''
def RemoveUndesirableWords(dict):
'''
wordsCount = 0 # Number of words
wordsCount = sum(dict.values())
'''
listKey = list(dict)
for temp in listKey:
if temp[0].isdigit():
del dict[temp]
#else:
# dict[temp] = round(dict[temp] , 4)
return dict
'''function:Remove the words from the "stopwords.txt"
input: dict : A list transformed by a dict whose keys are words and values are frequencies
ouput: dictProc : A list after removing stopwords
data:2018/10/23
'''
def StopWordProcessing(dict):
fileTabu = open("stopwords1.txt", 'r')
stopWordlist = fileTabu.read()
fileTabu.close()
stopWordlist = re.sub('[^a-z0-9]', ' ', stopWordlist).split(' ')
dictProc = dict.copy()
for temp in dict.keys():
if temp.strip() in stopWordlist:
del dictProc[temp]
return dictProc
class WordFinder(object):
'''A compound structure of dictionary and set to store word mapping'''
def __init__(self):
self.mainTable = {}
for char in string.ascii_lowercase:
self.mainTable[char] = {}
self.specialTable = {}
#print(self.mainTable)
for headword, related in lemmas.items():
# Only 3 occurrences of uppercase in lemmas.txt, which include 'I'
# Trading precision for simplicity
headword = headword.lower()
try:
related = related.lower()
except AttributeError:
related = None
if related:
for word in related.split():
if word[0] != headword[0]:
self.specialTable[headword] = set(related.split())
break
else:
self.mainTable[headword[0]][headword] = set(related.split())
else:
self.mainTable[headword[0]][headword] = None
#print(self.specialTable)
#print(self.mainTable)
def find_headword(self, word):
"""Search the 'table' and return the original form of a word"""
word = word.lower()
alphaTable = self.mainTable[word[0]]
if word in alphaTable:
return word
for headword, related in alphaTable.items():
if related and (word in related):
return headword
for headword, related in self.specialTable.items():
if word == headword:
return word
if word in related:
return headword
# This should never happen after the removal of words not in valid_words
# in Book.__init__()
return None
# TODO
def find_related(self, headword):
pass
def VerbTableFrequency(fileContent):
global lemmas
global allVerbWords
lemmas = {}
allVerbWords = set()
with open('verbs.txt') as fileVerb:
# print(fileVerb.read())
for line in fileVerb:
# print(line)
line = re.sub(r'\n|\s|\,', ' ', line)
headWord = line.split('->')[0]
# print(headWord)
# print(headWord)
try:
related = line.split('->')[1]
# print(related)
except IndexError:
related = None
lemmas[headWord] = related
allVerbWords = set()
for headWord, related in lemmas.items():
allVerbWords.add(headWord)
# print(allVerbWords)
# print("\t")
if related:
allVerbWords.update(set(related.split()))
# allVerbWords.update(related)
tempList = re.split(r'\b([a-zA-Z-]+)\b',fileContent)
tempList = [item for item in tempList if (item in allVerbWords)]
finder = WordFinder()
tempList = [finder.find_headword(item) for item in tempList]
cnt = Counter()
for word in tempList:
cnt[word] += 1
#print(type(cnt))
return cnt
def main():
with open("Gone With The Wind.txt") as file :
content = file.read().lower()
outCounts = 10 # Show the top count words that appear most frequently
number = 1 #Phrase length
flag = 1
if flag == 1:
verbFreCount = VerbTableFrequency(content)
#print(type(cnt))
wordsCounts ={}
for word in sorted(verbFreCount, key=lambda x: verbFreCount[x], reverse=True):
wordsCounts[word] = verbFreCount[word]
print(type(wordsCounts))
freCountNum = sum(wordsCounts.values())
#print (freCountNum )
for word, fre in list(wordsCounts.items())[0:outCounts]:
print("|\t{:15}|{:<11.2f}|".format(word,fre / freCountNum))
print("--------------------------------")
else:
wordsCounts = NumWordFrequency(content,number)
wordsCounts = RemoveUndesirableWords(wordsCounts) # Remove any words that do not meet the requirements
wordsCounts = StopWordProcessing(wordsCounts) # Remove the words from the "stopwords.txt"
pairsList = list(wordsCounts.items()) # Get the key-value pairsList from the dictionary
items = [[x, y] for (y, x) in pairsList] # key-value pairsList in the list exchange locations, data pairsList sort
items.sort(reverse=True)
# Notice we didn't order words of the same frequency
        for i in range(min(outCounts, len(items))):  # guard against texts with fewer than outCounts phrases
print(items[i][1] + "\t" + str(items[i][0]))
if __name__ == '__main__':
main()
end = time.time()
print(end - start)
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountPhrase
CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)
CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
Our verification shows the results are correct. With correctness established, we still did not know the code coverage, so we analyzed the code with the coverage tool, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountPhrase 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
from count import CountPhrase
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)
CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")
We found that the earlier CountPhrases did not support the stop word feature either, so we modified that function as well, using the same idea as in CountWords.
After unit testing and regression testing, the results are correct.
But how fast does the program run? To understand its execution speed, we profiled it with cProfile: the total run took 2.39 s, and we wanted to bring that cost down.
So we optimized it with what turned out to be a neat trick: treat the whole text as one giant sentence split by punctuation, match the phrase pattern with a regular expression once (this pass misses the overlapping phrases), then delete the first word and match again to pick up the missed ones, and keep doing this until n-1 words have been deleted. This gives the final version of the code:
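A toy sketch of this trick for two-word phrases, shown before the full function below (the sample text is illustrative; the pattern follows the same style as the code):
import re
from collections import Counter

txt = "i am not a good boy"
pattern = r"[a-z]+[0-9]*[\s|,][a-z]+[0-9]*"         # two adjacent words

found = re.findall(pattern, txt)                    # pass 1: 'i am', 'not a', 'good boy'
txt2 = re.sub(r"[a-z]+[0-9]*", "", txt, 1).strip()  # drop the first word
found += re.findall(pattern, txt2)                  # pass 2: 'am not', 'a good'
print(Counter(found))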
###################################################################################
#Name:count_phrases
#Inputs:file name, the first n phrases, stopfile name, verbfile name, phrase length
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountPhrases(file_name,n,stopName,verbName,k):
print("File name:" + sys.path[0] + "\\" + file_name)
totalNum = 0
if (stopName != None):
stopflag = True
else:
stopflag = False
if(verbName != None):
verbflag = True
else:
verbflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
txt = re.sub(r'[\s|\']+',' ',txt)
pword = r'(([a-z]+ )+[a-z]+)' # extract sentence
pattern = re.compile(pword)
sentence = pattern.findall(txt)
txt = ','.join([sentence[m][0] for m in range(len(sentence))])
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = "[a-z]+[0-9]*"
for i in range(k-1):
pattern += "[\s|,][a-z]+[0-9]*"
wordList = []
for i in range(k):
if( i == 0 ):
tempList = re.findall(pattern, txt)
else:
wordpattern = "[a-z]+[0-9]*"
txt = re.sub(wordpattern, '', txt, 1).strip()
tempList = re.findall(pattern, txt)
wordList += tempList
tempc = Counter(wordList)
if (stopflag == True):
for word in stoplist:
word = word.replace('\n','')
del tempc[word]
dicNum = {}
if (verbflag == True):
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n', '').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if (',' not in phrase):
totalNum += 1
verbList = phrase.split(' ')
normPhrase = verbList[0]
for verb in verbList[1:]:
if verb in verbDic.keys():
verb = verbDic[verb]
normPhrase += ' ' + verb
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
else:
phrases = tempc.keys()
for phrase in phrases:
if (',' not in phrase):
dicNum[phrase] = tempc[phrase]
totalNum += tempc[phrase]
dicNum = sorted(dicNum.items(), key=lambda k: k[0])
dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
t1 = time.clock()
display(dicNum[:n], 'Phrases',totalNum,3)
print("Time Consuming:%4f" % (t1 - t0))
Running the unit test and regression test code above showed the results unchanged, and the time dropped to 1.8 s, which meets the goal of this round of optimization.
Step-4: Unify all verb forms before counting.
First, let's look at how the verb forms appear in verbs.txt:
abandon -> abandons,abandoning,abandoned
abase -> abases,abasing,abased
abate -> abates,abating,abated
abbreviate -> abbreviates,abbreviating,abbreviated
abdicate -> abdicates,abdicating,abdicated
abduct -> abducts,abducting,abducted
abet -> abets,abetting,abetted
abhor -> abhors,abhorring,abhorred
The left side is the verb's base form and the right side lists its other forms. Since all the words have already been counted, what we need to do now is read verbs.txt into a dictionary and use it to add together the counts of the different forms of the same verb. The code is as follows:
if (verbflag == True):
totalNum = 0
verbDic = {}
verbDicNum = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
verbDic[key] = value.replace('\n','').split(',')
verbDicNum[key] = tempc[key]
for word in verbDic[key]:
verbDicNum[key] += tempc[word]
totalNum += verbDicNum[key]
As before, we use three text files for unit testing: an empty file, a small-sample file, and a file with a larger sample. The unit test code is as follows:
from count import CountWords,CountPhrases
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopphrases.txt")
Here:
- Null.txt is an empty text file
- gone_with_the_wind.txt is the text of Gone with the Wind
- Test.txt is a text file whose content we fixed ourselves, so the correctness of the counts can be checked against it
For words, our verification shows the results are correct, but we found that phrases did not yet support the verbs.txt feature, so we modified the phrase function as well. How to normalize the forms? We hit on a neat trick: use every verb form as a key whose value is the base form; then looking up any form maps it back to the base form. That gives the following code:
if (verbflag == True):
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n', '').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if (',' not in phrase):
totalNum += 1
verbList = phrase.split(' ')
normPhrase = verbList[0]
for verb in verbList[1:]:
if verb in verbDic.keys():
verb = verbDic[verb]
normPhrase += ' ' + verb
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
After this verification, we still did not know the code coverage, so we used the coverage tool again, measuring coverage with the following command line:
coverage run test.py
The results are as follows:
Name Stmts Exec Cover
---------------------------------------------
CountWords 78 92 100%
---------------------------------------------
TOTAL 78 92 100%
As shown, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is also needed; we wrote the following regression test code:
from count import CountLetters
from count import CountWords
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")
CountWords("Null.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")
CountLetters("Null.txt","Verbs.txt","Stopwords.txt")
CountLetters("Test.txt","Verbs.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Verbs.txt","Stopwords.txt")
CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")
We found that the earlier CountLetters did not support the verbs.txt feature either, so we started modifying that function too, but then concluded that normalizing words before counting letter occurrences is meaningless, so we deleted that code.
Step-5: Count the occurrences of verb-preposition phrases.
First, let's look at the definition of a verb-preposition phrase:
VerbPhrase  := Verb + Spaces + Preposition
Spaces      := Space+
Space       := ' ' | '\t' | '\r' | '\n'
Preposition := <any one from the list of prepositions>
Verb        := <any one in any tense FROM THE DICTIONARY>
At first we did not realize that Step 5 is closely related to Step 4, so we wrote the code for this step from scratch, building an extremely long regular expression that joined all the verb forms with '|' in a for loop. It took 56 s in total, which is essentially unusable, so we discarded that approach outright without doing unit testing or performance analysis, since it clearly had to be rethought. Then we remembered that Step 4 already counts all the phrases: we can reuse those counts, as long as we normalize the verbs and check for prepositions. For normalization we used the same trick: each verb form is a key whose value is the base form, so any form can be mapped back to its original. This gives the final code:
###################################################################################
#Name:count_verb_pre
#Inputs:file name, the first n phrases, stopfile name, verbfile name, preposition file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountVerbPre(file_name,n,stopName,verbName,preName):
print("File name:" + sys.path[0] + "\\" + file_name)
dicNum = {}
totalNum = 0
if (stopName != None):
stopflag = True
else:
stopflag = False
t0 = time.clock()
with open(file_name) as f:
txt = f.read()
txt = txt.lower()
txt = re.sub(r'[\s|\']+',' ',txt)
pword = r'(([a-z]+ )+[a-z]+)' # extract sentence
pattern = re.compile(pword)
sentence = pattern.findall(txt)
txt = ','.join([sentence[m][0] for m in range(len(sentence))])
if(stopflag == True):
with open(stopName) as f:
stoplist = f.readlines()
pattern = "[a-z]+[0-9]*"
for i in range(1):
pattern += "[\s|,][a-z]+[0-9]*"
wordList = []
for i in range(2):
if( i == 0 ):
tempList = re.findall(pattern, txt)
else:
wordpattern = "[a-z]+[0-9]*"
txt = re.sub(wordpattern, '', txt, 1).strip()
tempList = re.findall(pattern, txt)
wordList += tempList
tempc = Counter(wordList)
with open(preName) as f:
preTxt = f.read()
preList = preTxt.split('\n')
verbDic = {}
with open(verbName) as f:
for line in f.readlines():
key,value = line.split(' -> ')
for tverb in value.replace('\n','').split(','):
verbDic[tverb] = key
verbDic[key] = key
for phrase in tempc.keys():
if(',' not in phrase):
totalNum += 1
verb, pre = phrase.split(' ')
if (verb in verbDic.keys() and pre in preList):
normPhrase = verbDic[verb] + ' ' + pre
if (normPhrase in dicNum.keys()):
dicNum[normPhrase] += tempc[phrase]
else:
dicNum[normPhrase] = tempc[phrase]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            try:
                del dicNum[word]  # dicNum is a plain dict; skip stop entries that are not present
            except KeyError:
                pass
dicNum = sorted(dicNum.items(), key=lambda k: k[0])
dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
t1 = time.clock()
display(dicNum[:n], 'VerbPre',totalNum, 3)
print("Time Consuming:%4f"%(t1-t0))