Natural Language Processing with spaCy (Part 1)

阿新 • Published: 2019-02-19

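Introduction

Natural language processing (NLP) is an important research area within artificial intelligence. It plays a central role in many intelligent applications, for example: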

automated chat bots,
article summarizers,
multi-lingual translation,
opinion identification from data.

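Every industry that uses NLP to make sense of unstructured text data demands not only accuracy but also speed in obtaining results.

NLP is a very broad field. Its tasks include:

text classification,
entity detection,
machine translation,
question answering,
concept identification.

This article introduces an advanced NLP library: spaCy.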

Contents

About spaCy and installation
spaCy pipeline and properties
1. Tokenization
2. POS Tagging
3. Entity Detection
4. Dependency Parsing
5. Noun Phrases
Comparison with NLTK and CoreNLP

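1. About spaCy and installation

1.1 About spaCy

spaCy is written in Cython, which makes it a very fast library. It provides a concise API for accessing its methods and properties, which are governed by trained machine learning (and deep learning) models.

1.2 Installation

Install spaCy:

pip install spacy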

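Download the data and model:

python -m spacy download en_core_web_sm

spaCy is now ready to use.

2. spaCy pipeline and properties

To use spaCy and access its properties, first create a pipeline by loading a model. spaCy provides several models, each containing language information: a vocabulary, pretrained word vectors, syntax, and entities.

Below we load the default English model, en_core_web_sm:

import spacy
nlp = spacy.load("en_core_web_sm")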

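The nlp object is used to create documents and to access linguistic annotations and other NLP properties. We create a document by loading a text file; the text used here is a set of hotel reviews downloaded from the TripAdvisor website.

# filename is the path to the downloaded reviews file
with open(filename) as f:
    document = nlp(f.read())

document is now a spaCy Doc object with a number of attributes, which can be listed with dir(document).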

dir(document)
>> [..., 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']

A document carries a large amount of information, including tokens, each token's reference index, part-of-speech tags, entities, vectors, sentiment, vocabulary, etc. A few of these properties are introduced below.

2.1 Tokenization

"this is a sentence."
-> (tokenization)
>> ['this', 'is', 'a', 'sentence', '.']
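
To make the split above concrete, here is a minimal sketch that approximates it with a regular expression from Python's standard library. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; simple_tokenize and TOKEN_PATTERN are hypothetical names introduced only for illustration.

```python
import re

# Rough approximation only: a token is either a run of word characters
# or a single non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("this is a sentence.")
print(tokens)  # ['this', 'is', 'a', 'sentence', '.']
```

Unlike this sketch, spaCy splits a contraction such as "don't" into "do" and "n't", which is one reason to rely on the library's tokenizer for real text.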

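spaCy first tokenizes the text and then groups the tokens into sentences. The document can be indexed and iterated over like a sequence.

# first token of the doc
document[0]
>> Nice

# fifth token from the end of the doc
document[len(document)-5]
>> boston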

# List of sentences of our doc 
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]

2.2 Part-of-Speech Tagging

Part-of-speech tagging assigns each word a grammatical category such as verb or noun. These tags can be used as text features in information filtering, statistical models, and rule-based parsing.

# get all tags
all_tags = {w.pos: w.pos_ for w in document}
>> {83: 'ADJ', 91: 'NOUN', 84: 'ADP', 89: 'DET', 99: 'VERB', 94: 'PRON', 96: 'PUNCT', 85: 'ADV', 88: 'CCONJ', 95: 'PROPN', 102: 'SPACE', 93: 'PART', 98: 'SYM', 92: 'NUM', 100: 'X', 90: 'INTJ'}

# all tags of the first sentence of our document
for word in list(document.sents)[0]:
    print(word, word.tag_)
>> Nice JJ
place NN
Better JJR
than IN
some DT
reviews NNS
give VBP
it PRP
credit NN
for IN
. .

The code below builds a text-cleaning step that removes noisy tokens.

# define some parameters
# note: spaCy's tag for proper nouns is "PROPN", so "PROP" matches no token as written
noisy_pos_tags = ["PROP"]
min_token_length = 2

# check whether a token is noise
def isNoise(token):
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.is_stop:
        is_noise = True
    elif len(token.text_with_ws) <= min_token_length:
        is_noise = True
    return is_noise

# optionally lowercase a string and strip surrounding whitespace
def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()

# top unigrams used in the reviews
from collections import Counter
cleaned_list = [cleanup(word.text) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [('hotel', 683), ('room', 652), ('great', 300),  ('sheraton', 285), ('location', 271)]
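
The same filtering logic can be tried without loading a model. The sketch below is an illustration under stated assumptions: the (text, pos, is_stop) triples are hand-written stand-ins for spaCy tokens, and it filters on "PROPN", spaCy's tag for proper nouns. The names NOISY_POS, MIN_LENGTH and is_noise are hypothetical.

```python
from collections import Counter

# Hypothetical (text, pos, is_stop) triples standing in for spaCy tokens.
tokens = [
    ("The", "DET", True),
    ("hotel", "NOUN", False),
    ("was", "AUX", True),
    ("great", "ADJ", False),
    (".", "PUNCT", False),
    ("Great", "ADJ", False),
    ("hotel", "NOUN", False),
    ("in", "ADP", True),
    ("Boston", "PROPN", False),
    ("hotel", "NOUN", False),
]

NOISY_POS = {"PROPN"}  # assumption: treat proper nouns as noise
MIN_LENGTH = 2         # tokens of this length or shorter are dropped


def is_noise(text, pos, is_stop):
    """Apply three checks: POS tag, stop word, token length."""
    return pos in NOISY_POS or is_stop or len(text) <= MIN_LENGTH


# Keep the non-noise tokens, lowercased, and count them.
cleaned = [text.lower() for text, pos, is_stop in tokens
           if not is_noise(text, pos, is_stop)]
top = Counter(cleaned).most_common(2)
print(top)  # [('hotel', 3), ('great', 2)]
```

Working on plain tuples like this can be a convenient way to unit-test the cleaning rules before running them over a large corpus.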

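2.3 Entity Detection

spaCy includes a fast entity recognition model that identifies entity phrases in a document. Entities come in several types, such as person, location, organization, date, and number, and are accessible through the document's ents property.

The code below finds all named entities in the current document.

labels = set(w.label_ for w in document.ents)
for label in labels:
    entities = [cleanup(e.text, lower=False) for e in document.ents if label == e.label_]
    entities = list(set(entities))
    print(label, entities)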
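
The loop above rescans document.ents once per label. One possible alternative is to group the entities in a single pass. The sketch below shows the idea on hand-written (entity text, entity label) pairs, so the entity assignments are assumptions for illustration and not model output.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs standing in for the entities of a document.
entities = [
    ("Boston", "GPE"),
    ("Sheraton", "ORG"),
    ("Prudential Center", "FAC"),
    ("Boston ", "GPE"),
    ("two nights", "DATE"),
]

# Group the unique entity strings under their label in one pass.
by_label = defaultdict(set)
for text, label in entities:
    by_label[label].add(text.strip())

for label in sorted(by_label):
    print(label, sorted(by_label[label]))
```

Using a set per label removes duplicates as they arrive, which replaces the separate list(set(...)) step.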

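2.4 Dependency Parsing

One of spaCy's most powerful features is its fast and accurate dependency parser, available through a simple API. The parser also drives sentence boundary detection and phrase chunking. The relations are accessible through properties such as .children, .root, and .ancestors.

# extract all review sentences that contain the term "hotel"
hotel = [sent for sent in document.sents if 'hotel' in sent.text.lower()]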

# create dependency tree
sentence = hotel[2] 
for word in sentence:
    print(word, ': ', str(list(word.children)))
>> A :  []  
cab :  [A, from] 
from :  [airport, to]
the :  [] 
airport :  [the] 
to :  [hotel] 
the :  [] 
hotel :  [the] 
can :  []
be :  [cab, can, cheaper, .] 
cheaper :  [than]
than :  [shuttles] 
the :  []
shuttles :  [the, depending] 
depending :  [time] 
what :  [] 
time :  [what, of] 
of :  [day]
the :  [] 
day :  [the, go] 
you :  []
go :  [you]
. :  []
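
The children lists printed above follow from a simpler fact: every token points to exactly one head. The sketch below rebuilds children lists from head indices for a hand-written toy parse (an assumption for illustration, not spaCy output). In spaCy the same information is available as token.head and token.children.

```python
from collections import defaultdict

# Hand-written toy parse (an assumption, not spaCy output): each token
# stores the index of its head, and the root points to itself.
words = ["The", "hotel", "was", "great", "."]
heads = [1, 2, 2, 2, 2]

# Invert the head pointers to get each token's children.
children = defaultdict(list)
for index, head in enumerate(heads):
    if head != index:  # skip the root's self-reference
        children[head].append(index)

for index, word in enumerate(words):
    print(word, ":", [words[c] for c in children[index]])
```

Here "was" is the root, so it collects "hotel", "great" and "." as children, while "hotel" collects only "The".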

The code below parses the dependency trees of all sentences that contain "hotel" to see which adjectives are used to describe it. A custom function walks the tree and extracts the children with the requested part-of-speech tag.

# check all adjectives used with a word
def pos_words(document, token, pos_tag):
    sentences = [sent for sent in document.sents if token in sent.text]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.text:
                pwrds.extend([child.text.strip() for child in word.children
                              if child.pos_ == pos_tag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', "ADJ")
>> [('other', 20), ('great', 10), ('good', 7), ('better', 6), ('nice', 6), ('different', 5), ('many', 5), ('best', 4), ('my', 4), ('wonderful', 3)]

2.5 Noun Phrases

Dependency trees can also be used to generate noun phrases.

# generate noun phrases
doc = nlp('I love data science on analytics vidhya')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_, chunk.root.head.text)
>> I nsubj love
   data science dobj love
   analytics pobj on

3. Comparison with NLTK and CoreNLP

[Figure: comparison of spaCy with NLTK and CoreNLP]
