
Python RAKE Keyword Extraction

RAKE stands for Rapid Automatic Keyword Extraction.

The algorithm comes from Chapter 1 of M. W. Berry and J. Kogan (Eds.), Text Mining: Applications and Theory, John Wiley and Sons, Ltd.

The page https://www.airpair.com/nlp/keyword-extraction-tutorial gives a detailed introduction to keyword extraction.

https://github.com/aneesha/RAKE is the Python source code of RAKE.

https://github.com/zelandiya/RAKE-tutorial is a modified version by "a_medelyan".

I added the following comments to its rake_tutorial.py, based on the tutorial page above.

import rake
import operator

# EXAMPLE ONE - SIMPLE
stoppath = "SmartStoplist.txt"
'''
# 1. initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath, 5, 3, 4)  # arguments after the stopword path: (1) each word has at least 5 characters, (2) each phrase has at most 3 words, (3) each keyword appears in the text at least 4 times


# 2. run on RAKE on a given text
sample_file = open("data/docs/fao_test/w2167e.txt", 'r')
text = sample_file.read()

keywords = rake_object.run(text) # this command can output all the keywords and their scores

# 3. print results
print "Keywords:", keywords

print "----------"           '''
# EXAMPLE TWO - BEHIND THE SCENES (from https://github.com/aneesha/RAKE/rake.py)

# initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath)

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility " \
       "of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. " \
       "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating"\
       " sets of solutions for all types of systems are given. These criteria and the corresponding algorithms " \
       "for constructing a minimal supporting set of solutions can be used in solving all the considered types of " \
       "systems and systems of mixed types."



# Split text into sentences
sentenceList = rake.split_sentences(text)  # sentences are split on punctuation marks such as commas and periods (see the rough sketch below)

for sentence in sentenceList:
    print "Sentence:", sentence

# generate candidate keywords
stopwordpattern = rake.build_stop_word_regex(stoppath)  # regex matching any stopword; stopwords (and punctuation) act as phrase delimiters
phraseList = rake.generate_candidate_keywords(sentenceList, stopwordpattern)   # the phrases are the candidate keywords: the runs of words left between stopwords and punctuation
# this method does not work for phrases in which these boundaries are part of the actual phrase (e.g. ".Net" or "Dr. Who");
# improvements can be made here (a rough sketch of one possible fix follows the "Phrases" print below).
# Read more at https://www.airpair.com/nlp/keyword-extraction-tutorial#4Lc4GeP5t5cYe7OR.99
print "Phrases:", phraseList

# calculate individual word scores
wordscores = rake.calculate_word_scores(phraseList)
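# (Added note) rake.calculate_word_scores implements the word score from the RAKE paper:
# score(w) = deg(w) / freq(w), where freq(w) counts occurrences of w in the candidate phrases
# and deg(w) additionally credits w for every other word it co-occurs with in those phrases,
# so words appearing in longer phrases score higher. A rough re-implementation sketch
# (ignoring the library's minimum-word-length filtering), for illustration only:
def rough_word_scores(phrases):
    freq = {}
    degree = {}
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] = freq.get(word, 0) + 1
            degree[word] = degree.get(word, 0) + len(words) - 1
    # degree also includes the word's own frequency, as in the original algorithm
    return dict((w, (degree[w] + freq[w]) / float(freq[w])) for w in freq)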

# generate candidate keyword scores
keywordcandidates = rake.generate_candidate_keyword_scores(phraseList, wordscores)  # each candidate phrase's score is the sum of its member words' scores
# One issue here is that the candidates are not normalized in any way.
# As a result we may get keywords that look nearly identical: "small scale production" and "small scale producers", or "skim milk powder" and "skimmed milk powder".
# Ideally, a keyword extraction algorithm should first apply stemming or some other form of normalization,
# so applying stemming before keyword extraction could be another improvement (a sketch follows below).
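# (Added sketch) A minimal normalization pass, assuming the external NLTK package is
# available (it is not used by the original tutorial): stem every word of a candidate so
# that near-duplicates such as "skim milk powder" and "skimmed milk powder" collapse to
# the same key, keeping the best-scoring surface form for each normalized key.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def normalize_phrase(phrase):
    return " ".join(stemmer.stem(word) for word in phrase.split())

normalized_best = {}
for candidate, score in keywordcandidates.items():
    key = normalize_phrase(candidate)
    if score > normalized_best.get(key, ("", -1.0))[1]:
        normalized_best[key] = (candidate, score)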



for candidate in keywordcandidates.keys():
    print "Candidate: ", candidate, ", score: ", keywordcandidates.get(candidate)



# sort candidates by score to determine top-scoring keywords
sortedKeywords = sorted(keywordcandidates.iteritems(), key=operator.itemgetter(1), reverse=True)
totalKeywords = len(sortedKeywords)

# for example, you could just take the top third as the final keywords
for keyword in sortedKeywords[0:(totalKeywords / 3)]: # note that the final keywords are taken as the top third of the sorted candidates
    print "Keyword: ", keyword[0], ", score: ", keyword[1]

print rake_object.run(text) # this command outputs all the keywords and their scores.