Introduction to NLP (3): Lemmatization

阿新 • Published: 2018-12-19

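Lemmatization is an important part of text preprocessing and is closely related to stemming. Put simply, lemmatization strips a word's affixes to recover its base form (the lemma), and the result is normally a word found in the dictionary. Stemming differs in that its output is not guaranteed to be a dictionary word. For example, lemmatizing "cars" gives "car", and lemmatizing "ate" gives "eat". In Python, the nltk module provides a robust lemmatizer built on WordNet, as the following example shows: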

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

The output is:

car
men
run
eat
sad
fancy
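To make the contrast with stemming from the introduction concrete, here is a small pure-Python sketch. It is a toy illustration only: `toy_stem`, `toy_lemmatize`, the suffix list, and the mini dictionary `TOY_LEMMAS` are all hypothetical names made up for this example, and neither function reproduces what NLTK's stemmers or WordNetLemmatizer actually do.

```python
# Toy illustration only: not NLTK's algorithms.
# Hypothetical suffix list for a naive stemmer.
SUFFIXES = ("ies", "es", "s", "ing", "ed")

def toy_stem(word):
    """Chop off the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"cars": "car", "ate": "eat", "ponies": "pony", "running": "run"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["cars", "ate", "ponies", "running"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

In this sketch the naive stemmer turns "ponies" into "pon" and "running" into "runn", neither of which is a dictionary word, while the dictionary lookup returns "pony" and "run". It also leaves "ate" untouched, because an irregular form cannot be recovered by stripping suffixes.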

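In the NLTK example above, wnl.lemmatize() performs the lemmatization. Its first argument is the word and its second is the word's part of speech ('n' for noun, 'v' for verb, 'a' for adjective, and so on; it defaults to 'n' when omitted). It returns the lemma of the input word. Lemmatization itself is usually straightforward, but in practice it is important to pass the correct part of speech, otherwise the result can be poor, as the following code shows: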

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

The output is:

ate
fancier

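So how do we obtain a word's part of speech? In NLP this is done with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns the part of speech of each word in a sentence, as in the following Python code: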

import nltk

# Requires NLTK's tokenizer and POS tagger data (installable via nltk.download())
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

The output is:

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

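The tags above come from the Penn Treebank tag set. Their first letter indicates the broad word class: J for adjectives, V for verbs, N for nouns, and R for adverbs.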
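As a quick reference for the tags that appear in the output above, the following sketch pairs each one with its Penn Treebank description. The dictionary name `PENN_TAGS` is made up for this example and covers only the eight tags seen in that output, not the full tag set. The tagged sentence is copied in as a literal so the snippet runs without NLTK.

```python
# Descriptions of the Penn Treebank tags seen in the tagged sentence.
# This is a partial, hand-written lookup, not the complete tag set.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "PRP": "personal pronoun",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}

# Output of nltk.pos_tag() for the example sentence, pasted as a literal.
tagged_sent = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
               ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
               ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
               ('lazy', 'JJ'), ('dog', 'NN')]

for word, tag in tagged_sent:
    # Fall back to a placeholder for any tag missing from the partial lookup.
    print(f"{word:<9}{tag:<5}{PENN_TAGS.get(tag, 'not in this partial lookup')}")
```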

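Now that we can obtain each word's part of speech within a sentence, we can combine POS tagging with the lemmatizer to lemmatize properly. Example Python code: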

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Map a Penn Treebank tag to the corresponding WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)  # tokenize the sentence
tagged_sent = pos_tag(tokens)     # POS-tag each token

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN  # fall back to noun
    lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize

print(lemmas_sent)

The output is:

['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
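The tag-mapping step can also be written as a dictionary lookup on the first letter of the tag. The sketch below is one possible variant, not the code used above: `PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical names, and the plain strings 'a', 'v', 'n', 'r' stand in for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV (which have these values in NLTK) so that the snippet runs without NLTK installed.

```python
# Single-letter POS codes, used here in place of NLTK's wordnet.ADJ,
# wordnet.VERB, wordnet.NOUN and wordnet.ADV constants.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code, falling back to noun."""
    return PREFIX_TO_WORDNET.get(tag[:1], default)

# A few (word, tag) pairs in the format returned by a POS tagger.
tagged = [("The", "DT"), ("brown", "JJ"), ("fox", "NN"),
          ("is", "VBZ"), ("jumping", "VBG"), ("over", "IN")]

print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Building the noun fallback into the helper keeps the calling loop free of the `or wordnet.NOUN` expression. Tags with no WordNet counterpart, such as DT or IN, are simply treated as nouns, which leaves those words unchanged in most cases.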

The output is the sentence with each word lemmatized. That's all for this post; feedback and discussion are welcome.

