Word-Frequency Analysis of English Papers in Python
阿新 • Published: 2018-12-24
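When reading an English paper, the most frequent words are often technical terms that you may not know. It therefore helps to analyze the paper's word frequencies first: if you learn the meanings of the high-frequency words before reading, the paper becomes easier to follow. The rest of this post shows how to use Python to output how often each word occurs in an English paper.
First, the paper has to be converted from PDF to txt format. Converting directly usually produces garbled text, so it is better to convert to Word format first and then copy the content into a plain-text txt file.
The code below is commented in detail: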
# Word-frequency analysis for a paper
# You should convert the file to text format before running this script
__author__ = 'Chen Hong'


# Read the text file and save all of its words in a list
def readtxt(filename):
    wordsL = []  # use this list to save the words
    with open(filename, 'r') as fr:
        for line in fr:
            # split each line on whitespace and append its words
            wordsL.extend(line.strip().split())
    return wordsL


# Count the frequency of every word and store it in a dictionary,
# then sort the entries by count from largest to smallest
def count(wordsL):
    wordsD = {}
    for x in wordsL:
        # skip the words that we do not need
        if Judge(x):
            continue
        # the first occurrence starts at 1; every later one adds 1
        if x not in wordsD:
            wordsD[x] = 1
        else:
            wordsD[x] += 1
    # sort the (word, count) pairs by count from largest to smallest
    wordsInorder = sorted(wordsD.items(), key=lambda x: x[1], reverse=True)
    return wordsInorder


# Judge whether a word should be skipped, such as punctuation or a single letter
# You can modify this function to skip more words, such as numbers
def Judge(word):
    punctList = [' ', '\t', '\n', ',', '.', ':', '?']  # standalone punctuation
    letterList = ['a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't']  # single letters
    return word in punctList or word in letterList


# Read the input file and write the sorted result to the output file
filename = 'F:\\python\\Paper1.txt'
wordsL = readtxt(filename)
words = count(wordsL)
with open('F:\\python\\Words In Order_1.txt', 'w') as fw:
    for item in words:
        fw.write(item[0] + ' ' + str(item[1]) + '\n')
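Because the script above only splits on whitespace, it counts "Model", "model" and "model," as three different words. The following is a minimal sketch of one possible variation, not part of the original script, that lowercases the text and strips punctuation before counting. It assumes that ignoring case is acceptable for your paper; the function name `word_frequencies`, the `min_length` parameter and the sample sentence are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, min_length=2):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Lowercase the text, then keep runs of letters; an apostrophe or
    # hyphen is kept only when it sits between letters (e.g. "over-fits")
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    # Drop very short tokens such as stray single letters (assumed threshold)
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common()


# Hypothetical sample text, used only to demonstrate the function
sample = "The model fits. The MODEL, however, over-fits a small data set."
top = word_frequencies(sample)
for word, n in top[:3]:
    print(word, n)
```

To apply this to a real paper, read the whole txt file into one string (for example with `open(filename).read()`) and pass it to `word_frequencies`. `Counter.most_common()` does the same descending sort as the `sorted(...)` call in the original script.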