English Keyword Extraction with Three Methods: TF-IDF, TextRank, and WordCount (Python Implementation)

阿新 • Published: 2020-09-23

Source code: https://github.com/Cpaulyz/BigDataAnalysis/tree/master/Assignment2

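Data Preprocessing

Before keywords can be extracted, the raw files go through a series of preprocessing steps:

Extract the PDFs to TXT files
Split the text into sentences
Tokenize (stemming, lemmatization)
Filter out digits, special characters, etc., and normalize case

Extracting the PDFs

Text is extracted from the PDFs with Apache PDFBox.

The dependency is as follows: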

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.13</version>
</dependency>

The logic of the extraction utility class utils/PDFParser is as follows:

try {
    // Read the PDF folder and collect the paths of the PDF files in a list
    File dir = new File("src\\main\\resources\\ACL2020");
    ArrayList<String> targets = new ArrayList<String>();
    for(File file:dir.listFiles()){
        if(file.getAbsolutePath().endsWith(".pdf")){
            targets.add(file.getAbsolutePath());
        }
    }
    // readPdf is the extraction method
    for(String path:targets){
        readPdf(path);
    }
} catch (Exception e) {
    e.printStackTrace();
}

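At this point, the text has been extracted from the PDF files and saved to .txt files for the later steps, as illustrated in the figure below.

Sentence Splitting

Sentences are split with the nltk library in Python: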

from nltk.tokenize import sent_tokenize
sens = sent_tokenize(text)  # text: the content of one extracted .txt file

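The result looks roughly as follows; the sentence boundaries are fairly accurate.

Tokenization (Stemming, Lemmatization)

nltk provides a lemmatization tool, WordNetLemmatizer, with the following API: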

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
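print(wnl.lemmatize('ate', 'v'))      # 'v': treat the word as a verb
print(wnl.lemmatize('fancier', 'n'))  # 'n': treat the word as a noun

# Output: eat, fancier (with the noun tag, 'fancier' is left unchanged)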

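However, lemmatization requires the part of speech of each word. Fortunately, nltk can also tag the parts of speech in a sentence; the mapping from its tags to WordNet constants is wrapped in a function as follows:

from nltk.corpus import wordnet

# Map a Penn Treebank POS tag to the corresponding WordNet POS constant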
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

Combined, the call looks like this:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize(sentence)  # tokenize
tagged_sent = pos_tag(tokens)  # tag each token with its part of speech

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    # Fall back to NOUN when the tag has no WordNet equivalent
    wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize

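The result is shown in the figure.

The lemmatized output looks reasonable, but some problems remain:

Special symbols such as ; : . , are not removed
Digits are not removed
Stop words such as a, the, and of are not removed

Filtering

Problems 1 and 2 are easy to handle with a regular expression.

Problem 3 is handled with the English stop-word list provided by nltk, together with the working assumption that strings shorter than 4 characters are invalid.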

import re
from nltk.corpus import stopwords

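invalid_word = set(stopwords.words('english'))  # set for fast membership tests

# Preprocessing filter: a word is discarded when this returns False
def is_valid(word):
    if re.match(r"[()\-:;,.0-9]+", word):  # starts with punctuation or a digit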
        return False
    elif len(word) < 4 or word in invalid_word:
        return False
    else:
        return True
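
As a quick check of what this filter keeps and drops, here is a minimal sketch. To stay runnable without downloading NLTK data, it uses a tiny hand-written stop-word set, STOP_WORDS, which is an assumption standing in for stopwords.words('english'); the regex and the length rule are the same as in is_valid above.

```python
import re

# Assumption: a tiny stand-in for nltk's stopwords.words('english'),
# used only so that this sketch runs without downloading NLTK data.
STOP_WORDS = {"with", "from", "this", "that", "have"}

def is_valid(word):
    # re.match anchors at the start only: a token is rejected when it
    # *begins* with punctuation or a digit, not when one appears later.
    if re.match(r"[()\-:;,.0-9]+", word):
        return False
    # Drop short tokens and stop words.
    return len(word) >= 4 and word not in STOP_WORDS

tokens = ["model", "the", "2020", "with", "word2vec", "(bert)", "graph"]
kept = [t for t in tokens if is_valid(t)]
print(kept)
```

Note that a token such as word2vec passes the filter because the digit is not at the start; if such tokens should also be dropped, re.search would be one way to check the whole string.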

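Method 1: TF-IDF

The structured workflow for extracting keywords with TF-IDF is as follows:

1.1 Sentence Splitting and Tokenization

Same as in data preprocessing; not repeated here.

1.2 Building the Corpus

Computing IDF requires a corpus, so we build one from all the articles and store it in all_dic = {}.

all_dic is a map with the structure (String article name, Map<word, count> word counts).

An example: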

{
	'A Generative Model for Joint Natural Language Understanding and Generation.txt': 
		{'natural': 13, 
		'language': 24, 
		'understanding': 4,
		'andnatural': 1, 
		'generation': 9, 
		'twofundamental': 1,
		...
		},
	...
}
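
One way such a structure could be built is sketched below. The function name build_corpus and its input format (article name mapped to a list of already lemmatized and filtered tokens) are assumptions made for illustration; the repository may organize this step differently.

```python
from collections import Counter

def build_corpus(tokenized_articles):
    """Map each article name to a {word: count} dictionary.

    tokenized_articles: {article_name: [token, token, ...]}, where the
    tokens are assumed to be already lemmatized, lower-cased and filtered.
    """
    return {name: dict(Counter(tokens))
            for name, tokens in tokenized_articles.items()}

# Hypothetical toy input, not taken from the ACL 2020 data.
all_dic = build_corpus({
    "a.txt": ["language", "model", "language"],
    "b.txt": ["graph", "model"],
})
print(all_dic)
```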

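1.3 Computing TF-IDF

(1) TF

Term frequency (TF) is the number of times a given word occurs in a document. The count is usually normalized (typically divided by the total number of words in the article) so that it is not biased toward long documents: the same word is likely to have a higher raw count in a long document than in a short one, regardless of how important it is.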

TF = article_dict[word] / article_word_counts

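(2) IDF

Inverse document frequency (IDF) captures the following idea: the fewer documents contain a term t, the larger its IDF and the better the term discriminates between documents. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient.

            contain_count = 1  # number of documents containing the word; starts at 1 to apply the +1 smoothing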
            for article1 in all_dic.keys():
                if word in all_dic[article1].keys():
                    contain_count += 1
            IDF = log(article_nums / contain_count)
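
Written as formulas, the two quantities computed by the snippets above, and their product, look as follows. The notation is introduced here for illustration only: n_{t,d} is the count of word t in article d, N is the number of articles, and df(t) is the number of articles containing t. The 1 in the denominator corresponds to initializing contain_count to 1.

```latex
\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{IDF}(t) = \log \frac{N}{1 + \mathrm{df}(t)}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```

One consequence of this smoothing is that a word occurring in every article has IDF = log(N / (N + 1)) < 0, so its TF-IDF value is slightly negative.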

(3) TF-IDF

The core implementation is as follows:

from math import log

def TFIDF():
    article_nums = len(all_dic)
    for article in all_dic.keys():
        article_dict: dict = all_dic[article]
        article_word_counts = sum(article_dict.values())  # total number of words in the article
        local_dict = {}
        for word in article_dict:
            TF = article_dict[word] / article_word_counts
            contain_count = 1  # number of documents containing the word; starts at 1 to apply the +1 smoothing
            for article1 in all_dic.keys():
                if word in all_dic[article1]:
                    contain_count += 1
            IDF = log(article_nums / contain_count)
            local_dict[word] = TF * IDF
        all_dic[article] = local_dict  # replace the word counts with TF-IDF values
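
The function above rescans every article for every word. A possible variant, sketched here under the same smoothing, precomputes the document frequency once and returns a new dictionary instead of overwriting the counts. The name tfidf, the helper doc_freq, and the toy corpus are all hypothetical and not part of the original code.

```python
from collections import Counter
from math import log

def tfidf(corpus):
    """corpus: {article: {word: count}} -> {article: {word: tf-idf}}."""
    n_articles = len(corpus)
    # Document frequency: number of articles that contain each word.
    doc_freq = Counter(word for counts in corpus.values() for word in counts)
    scores = {}
    for article, counts in corpus.items():
        total = sum(counts.values())
        scores[article] = {
            # Same smoothing as above: the denominator is 1 + doc_freq.
            word: (count / total) * log(n_articles / (1 + doc_freq[word]))
            for word, count in counts.items()
        }
    return scores

# Hypothetical toy corpus of three "articles".
toy = {
    "a": {"graph": 2, "model": 2},
    "b": {"model": 1, "dialogue": 3},
    "c": {"model": 4},
}
scores = tfidf(toy)
print(scores["a"])
```

In the toy corpus, model occurs in all three articles, so its score is negative, while graph, which occurs in only one article, gets a positive score.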

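1.4 Output

It is worth noting that TF-IDF is a corpus-based keyword algorithm. We use all ACL 2020 articles as the corpus, so the resulting TF-IDF values are keyword weights relative to each individual article.

Consequently, this method yields keywords for each article rather than keywords for the corpus as a whole.

Here we take the word with the highest TF-IDF value in each article, together with its weight, and write them to method1_dict.txt. The weight is the TF-IDF value, and the entries are ordered alphabetically by article title.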

unlabelled 0.03366690429509488
database 0.025963153344621098
triplet 0.06007324859328521
anaphor 0.054325239855360946
sparse 0.05140787295501171
dialog 0.02857688733696682
evaluator 0.047046849916043215
article 0.03181976626426247
dialogue 0.05009864522556742
false 0.05046963249913187
explanation 0.06756267918534663
keyphrases 0.07257334117762049
switch 0.02057258339292402
response 0.03487928535131968
hcvae 0.01490817643452481
response 0.01691069785427619
fragment 0.036740214670107636
concept 0.10144398960055125
node 0.026861943279698357
type 0.021568639909022032
hierarchy 0.04174740425673965
legal 0.09062083506033958
confidence 0.03208193690887942
question 0.018326715354972434
follow-up 0.0768915254934173
graph 0.030139792811985255
quarel 0.03142980753777034
instruction 0.04310656492734328
summary 0.023522349291620226
mutual 0.021794659657633334
malicious 0.03361252033133951
nucleus 0.03062106234461863
supervision 0.02716542294214428
relation 0.026017607441275774
calibrator 0.053113533081036744
centrality 0.06527959271708282
question 0.015813880735872966
slot 0.04442739804723785
graph 0.017963145985978687
taxonomy 0.05263359765861765
question 0.01694100733341999
transformer 0.019573842786351815
response 0.027652528223249546
topic 0.04541019920353925
paraphrase 0.024098507886884227
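
Selecting one top word per article, ordered by title, can be sketched as follows. The function name top_keyword_per_article and the demo scores are hypothetical; writing the rows to method1_dict.txt is left out.

```python
def top_keyword_per_article(tfidf_scores):
    """Return [(article, word, weight)] sorted by article title."""
    rows = []
    for article in sorted(tfidf_scores):
        word_scores = tfidf_scores[article]
        if not word_scores:
            continue  # skip articles with no valid words
        best = max(word_scores, key=word_scores.get)
        rows.append((article, best, word_scores[best]))
    return rows

# Hypothetical scores; real values would come from the TF-IDF step.
demo = {
    "B paper.txt": {"graph": 0.03, "node": 0.01},
    "A paper.txt": {"dialogue": 0.05, "response": 0.02},
}
rows = top_keyword_per_article(demo)
for article, word, weight in rows:
    print(word, weight)
```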

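Method 2: TextRank

The structured workflow for extracting keywords with TextRank is as follows:

2.1 Sentence Splitting

Same as the sentence splitting in the preprocessing section; not repeated here.

2.2 Building the Relation Matrix

Build an n×n relation matrix M, where n is the number of distinct words (identical words are counted once) and M_ij means there is a relation of weight M_ij from j to i.

Relations are defined as follows:

Given a window size win, within each sentence, after stop words, punctuation, and invalid words are removed, every word is related to the words within distance win of it.

For convenience, the relation matrix is represented here as a Map of (String word, Array relative_words), meaning there is a relation word→relative_words. An example follows (source: http://www.hankcs.com/nlp/textrank-algorithm-to-extract-the-keywords-java-implementation.html):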

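Segmented sentence (tokens glossed in English from the Chinese original) = [programmer, English, program, development, maintenance, professional, personnel, programmer, divided, program, design, personnel, program, coding, personnel, boundary, especially, China, software, personnel, divided, programmer, senior, programmer, system, analyst, project, manager]

Then two windows of size 5 are set up, and each word casts its vote for the words within a distance of 5 before and after it:

{development=[professional, programmer, maintenance, English, program, personnel],
software=[programmer, divided, boundary, senior, China, especially, personnel],
programmer=[development, software, analyst, maintenance, system, project, manager, divided, English, program, professional, design, senior, personnel, China],
analyst=[programmer, system, project, manager, senior],
maintenance=[professional, development, programmer, divided, English, program, personnel],
system=[programmer, analyst, project, manager, divided, senior],
project=[programmer, analyst, system, manager, senior],
manager=[programmer, analyst, system, project],
divided=[professional, software, design, programmer, maintenance, system, senior, program, China, especially, personnel],
English=[professional, development, programmer, maintenance, program],
program=[professional, development, design, programmer, coding, maintenance, boundary, divided, English, especially, personnel],
especially=[software, coding, divided, boundary, program, China, personnel],
professional=[development, programmer, maintenance, divided, English, program, personnel],
design=[programmer, coding, divided, program, personnel],
coding=[design, boundary, program, China, especially, personnel],
boundary=[software, coding, program, China, especially, personnel],
senior=[programmer, software, analyst, system, project, divided, personnel],
China=[programmer, software, coding, divided, boundary, especially, personnel],
personnel=[development, programmer, software, maintenance, divided, program, especially, professional, design, coding, boundary, senior, China]}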

Part of the implementation is as follows:

def add_to_dict(word_list, windows=5):
    valid_word_list = []  # filter first
    for word in word_list:
        word = str(word).lower()
        if is_valid(word):
            valid_word_list.append(word)
    # Build the relations window by window
    if len(valid_word_list) < windows:
        win = valid_word_list
        build_words_from_windows(win)
    else:
        index = 0
        while index + windows <= len(valid_word_list):
            win = valid_word_list[index:index + windows]
            index += 1
            build_words_from_windows(win)

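words = {}  # word -> list of related words (module-level adjacency map)

# Add the relations within one small window to words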
def build_words_from_windows(win):
    for word in win:
        if word not in words.keys():
            words[word] = []
        for other in win:
            if other == word or other in words[word]:
                continue
            else:
                words[word].append(other)
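
The same sliding-window idea can be sketched more compactly with sets, which avoids the repeated membership checks on lists. The name build_graph and the assumption that the tokens have already been filtered are illustrative and not part of the original code.

```python
from collections import defaultdict

def build_graph(sentences, window=5):
    """sentences: lists of already-filtered tokens -> {word: set(neighbours)}."""
    graph = defaultdict(set)
    for tokens in sentences:
        # A sentence shorter than the window forms one single window.
        last_start = max(len(tokens) - window, 0)
        for start in range(last_start + 1):
            win = tokens[start:start + window]
            for word in win:
                graph[word].update(w for w in win if w != word)
    return graph

g = build_graph([["a", "b", "c", "d"]], window=3)
print(sorted(g["a"]), sorted(g["b"]))
```

With a window of 3, the four tokens produce the two windows [a, b, c] and [b, c, d], so a is linked to b and c, while b is linked to a, c, and d.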

2.3 Iteration

The TextRank update formula is similar to that of PageRank.
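
For reference, the unweighted form of the update can be written as follows. The symbols are introduced here for illustration: S(V_i) is the score of word V_i, d is the damping factor, In(V_i) is the set of words that point to V_i, and Out(V_j) is the set of words that V_j points to. Because the co-occurrence relation is symmetric, both sets are simply the neighbours of a word in this setting.

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{S(V_j)}{\lvert \mathrm{Out}(V_j) \rvert}
```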

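Iteration stops when either of the following two conditions holds:

max_diff < the given threshold, meaning the scores have converged
the number of iterations reaches max_iter, the given upper limit

The implementation is as follows: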

def text_rank(d=0.85, max_iter=100):
    min_diff = 0.05  # convergence threshold
    words_weight = {}  # {str: float}
    for word in words.keys():
        words_weight[word] = 1 / len(words.keys())
    for i in range(max_iter):
        n_words_weight = {}  # {str: float}
        max_diff = 0
        for word in words.keys():
            n_words_weight[word] = 1 - d
            for other in words[word]:
                if other == word or len(words[other]) == 0:
                    continue
                n_words_weight[word] += d * words_weight[other] / len(words[other])
            # Use the absolute change so that decreases also count toward convergence
            max_diff = max(abs(n_words_weight[word] - words_weight[word]), max_diff)
        words_weight = n_words_weight
        print('iter', i, 'max diff is', max_diff)
        if max_diff < min_diff:
            print('break with iter', i)
            break
    return words_weight
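
To see the iteration at work on a small input, here is a self-contained sketch that takes the graph as a parameter instead of reading the global words. The tolerance tol, the toy graph, and the assumption that the graph is symmetric (every neighbour is itself a key with at least one neighbour) are illustrative choices, not the settings used for the results in this article.

```python
def text_rank(graph, d=0.85, max_iter=100, tol=1e-6):
    """graph: {word: iterable of neighbours} -> {word: score}.

    Assumes a symmetric graph: every neighbour is also a key of graph.
    """
    if not graph:
        return {}
    scores = {w: 1 / len(graph) for w in graph}
    for _ in range(max_iter):
        new_scores = {}
        for word, neighbours in graph.items():
            rank = 1 - d
            for other in neighbours:
                rank += d * scores[other] / len(graph[other])
            new_scores[word] = rank
        # Largest absolute change of any score in this round.
        max_diff = max(abs(new_scores[w] - scores[w]) for w in graph)
        scores = new_scores
        if max_diff < tol:
            break
    return scores

# Hypothetical star-shaped graph: "model" co-occurs with every other word.
toy_graph = {
    "model": ["graph", "data", "task"],
    "graph": ["model"],
    "data": ["model"],
    "task": ["model"],
}
scores = text_rank(toy_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])
```

Because the constant term is 1 - d rather than (1 - d) / n, the converged scores sum to roughly n, the number of words, rather than to 1. This is why TextRank weights computed this way can be much larger than 1.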

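2.4 Output

The top 30 keywords are selected; the output is as follows. In this method the weight is the value computed by TextRank, and the results are saved in method2_dict.txt.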

 model 176.5304347133946
 question 85.40181168045564
 response 62.507994652932325
 data 60.65722815422958
 method 59.467011421798766
 result 58.625521805302576
 show 58.328949197586205
 graph 57.56085447050974
 answer 56.016412290514324
 generate 53.04744866326927
 example 52.68958963476476
 training 52.109756756305856
 also 51.35655567676399
 input 50.69980375572206
 word 50.52677865990237
 train 49.34118286080509
 representation 48.497427796293245
 sentence 48.21207111035171
 dataset 48.07840701700186
 work 47.57844139247928
 system 47.03771276235998
 propose 46.88347913956473
 task 46.518530285062205
 performance 45.70988317875179
 base 45.675096486932375
 different 44.92213315873288
 score 43.76950706001539
 test 42.996530025663326
 give 42.40794849944198
 information 42.39192128940212

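Method 3: WordCount

The last method is naive word counting. The idea is simple: count word frequencies and treat words that occur more often as more likely to be keywords. The structured workflow is as follows:

3.1 Tokenization and Sentence Splitting

Same as in the preprocessing section; not repeated here.

3.2 Counting Word Frequencies

A Map is used to represent (word, count) pairs.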

dic = {}

def add_to_dict(word_list):
    for word in word_list:
        word = str(word).lower()
        if is_valid(word):
            if word in dic.keys():
                dic[word] += 1
            else:
                dic[word] = 1
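
The same counting can also be sketched with collections.Counter from the standard library, whose most_common method returns the highest counts directly. The function name count_words and the simplified default filter (a length check only) are assumptions made so the sketch runs on its own; in practice the is_valid filter from the preprocessing section would be passed in.

```python
from collections import Counter

def count_words(sentences, is_valid=lambda w: len(w) >= 4):
    """sentences: lists of tokens -> Counter of valid, lower-cased words.

    The default is_valid is a simplified placeholder (length check only),
    not the full filter from the preprocessing section.
    """
    counts = Counter()
    for tokens in sentences:
        lowered = (str(t).lower() for t in tokens)
        counts.update(w for w in lowered if is_valid(w))
    return counts

# Hypothetical toy input.
counts = count_words([["Model", "graph", "the"], ["model", "data", "model"]])
top = counts.most_common(1)
print(top)
```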

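3.3 Output

The top 30 keywords are selected; the output is as follows. In this method the weight is the word count, and the results are saved in method3_dict.txt.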

model 1742
question 813
response 579
graph 515
data 490
method 464
show 456
result 447
answer 445
representation 408
generate 398
example 394
training 393
word 387
dataset 377
sentence 368
input 365
propose 360
train 351
test 349
system 345
also 342
task 330
performance 327
score 325
different 315
work 312
document 304
base 294
information 293