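This article gives a brief introduction to natural language processing (NLP) in Python using the NLTK library. NLTK is Python's natural language toolkit and one of the most commonly used Python libraries in the NLP field.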

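What is NLP?

Put simply, natural language processing (NLP) is about building applications or services that can understand human language.

Practical examples of NLP include speech recognition, speech translation, understanding complete sentences, understanding synonyms of matching words, and generating grammatically correct, complete sentences and paragraphs.

This is not everything NLP can do.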

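NLP in practice

Search engines: such as Google and Yahoo. Google's search engine can tell that you are a technical person, so it shows technology-related results.

Social feeds: such as the Facebook News Feed. If the News Feed algorithm knows that you are interested in natural language processing, it shows related ads and posts.

Voice assistants: such as Apple's Siri.

Spam filtering: such as Google's spam filter. Unlike ordinary spam filtering, it judges whether a message is spam by looking at the deeper meaning of the message content.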

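NLP libraries

Here are some open-source NLP libraries:

Natural Language Toolkit (NLTK);
Apache OpenNLP;
Stanford NLP suite;
GATE NLP library

Of these, the Natural Language Toolkit (NLTK) is the most popular. It is written in Python and backed by a very strong community.

NLTK is also easy to get started with; it is arguably the easiest NLP library to learn.

This NLP tutorial uses the Python NLTK library.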

Installing NLTK

If you are using Windows, Linux, or Mac, you can install NLTK with pip:

pip install nltk

Open a Python terminal and import NLTK to check that it is installed correctly:

import nltk

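If the import runs without errors, NLTK has been installed successfully. After installing NLTK for the first time, you also need to install the NLTK data packages by running the following code: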

import nltk

nltk.download()

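This opens the NLTK downloader window, where you can choose which packages to install.

You can install all of the packages; the individual packages are small, so this should not cause any problems.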
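If you would rather skip the downloader window, the packages can also be requested by name. The sketch below is one possible way to do that; `ensure_nltk_data` and its `download` parameter are hypothetical helpers introduced for illustration, and the only real NLTK call assumed is `nltk.download(name, quiet=True)`.

```python
from typing import Callable, Dict, Iterable, Optional


def ensure_nltk_data(
    packages: Iterable[str],
    download: Optional[Callable[..., bool]] = None,
) -> Dict[str, bool]:
    """Download each named NLTK data package and report success per package.

    Hypothetical helper for this sketch; by default it delegates to the
    real nltk.download() function.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tried without NLTK

        download = nltk.download
    # quiet=True suppresses the progress output of nltk.download()
    return {name: bool(download(name, quiet=True)) for name in packages}


# Typical use (needs network access):
#   ensure_nltk_data(["punkt", "stopwords"])
```

"punkt" holds the tokenizer models and "stopwords" the stop word lists used later in this tutorial. Newer NLTK releases may ask for "punkt_tab" instead of "punkt".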

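Tokenizing text with plain Python

First, we will fetch the content of a web page and then analyze the text to find out what the page is about.

We will use the urllib module to fetch the page: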

import urllib.request

# Fetch the page; read() returns the raw response body as bytes
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)
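Note that `response.read()` returns bytes rather than a string, which is why the printed output starts with `b'`. BeautifulSoup accepts bytes directly, but if you want a string yourself, a minimal sketch could look like this. `decode_body` is a hypothetical helper and the UTF-8 fallback is an assumption, not something the page guarantees.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Turn the bytes returned by response.read() into a str.

    Hypothetical helper; UTF-8 is an assumed fallback when no charset is given.
    """
    # errors="replace" avoids a crash on the occasional undecodable byte
    return raw.decode(charset or "utf-8", errors="replace")


# With a real response, the declared charset could be looked up like this:
#   charset = response.headers.get_content_charset()
#   html_text = decode_body(response.read(), charset)
sample = b"<h1>PHP</h1>"
print(decode_body(sample))
```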

As the printed output shows, the result contains many HTML tags that need to be cleaned up.
We can use the BeautifulSoup module to clean the text:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
# The "html5lib" parser requires the html5lib package to be installed
soup = BeautifulSoup(html, "html5lib")
# separator=" " keeps text from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)
print(text)

We now have clean text from the fetched page.
The next step is to convert the text into tokens, like this:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
# separator=" " keeps text from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)
# Split on whitespace to get a list of tokens
tokens = text.split()
print(tokens)
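Splitting on whitespace is quick, but punctuation stays attached to the words, so "PHP," and "PHP" count as different tokens. The sketch below shows one simple way around this with a regular expression. The sample sentence and the pattern are assumptions chosen for illustration, not part of the original tutorial.

```python
import re

text = "PHP is popular. PHP, like Python, is a scripting language!"

# Plain split() keeps punctuation glued to words: "popular." and "PHP,"
whitespace_tokens = text.split()
print(whitespace_tokens)

# Alternative: lowercase, then keep runs of letters, digits and apostrophes
tokens = re.findall(r"[a-z0-9']+", text.lower())
print(tokens)
```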

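Counting word frequency

The text is now ready, so we can use NLTK to compute the frequency distribution of the tokens.

This is done by passing the tokens to NLTK's FreqDist class: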

from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
# separator=" " keeps text from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)
tokens = text.split()
# Count how often each token occurs
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(f"{key}:{val}")
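Printing every item makes it hard to spot the most frequent tokens. NLTK's FreqDist builds on Python's `collections.Counter`, so calling `most_common` on the `freq` object above should work the same way as in this standard-library sketch. The token list here is made-up sample data.

```python
from collections import Counter

# Made-up sample tokens standing in for the tokens from the web page
tokens = ["PHP", "is", "a", "language", "PHP", "is", "popular", "PHP"]

freq = Counter(tokens)

# most_common(n) returns the n most frequent tokens, highest count first
for token, count in freq.most_common(2):
    print(f"{token}:{count}")
```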

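If you look through the output, you will find that the most common token is PHP.
You can call the plot function to draw a chart of the frequency distribution: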

# Plot the 20 most common tokens; requires the matplotlib library
freq.plot(20, cumulative=False)

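The chart includes words such as of, a, and an. These are stop words.

In general, stop words should be removed so that they do not distort the analysis.

Handling stop words

NLTK ships with stop word lists for many languages. To get the English stop words: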

from nltk.corpus import stopwords

stopwords.words('english')

Now modify the code to remove these uninformative tokens before plotting:

# Keep only the tokens that are not English stop words
clean_tokens = []
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)
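One detail worth knowing: the entries in NLTK's English stop word list are lowercase, so the filter above lets capitalized forms such as "The" through. The following sketch compares the two behaviors. `STOP_WORDS` is a tiny made-up stand-in for `stopwords.words('english')`, and the tokens are sample data.

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {"the", "is", "a", "of", "and"}

tokens = ["The", "history", "of", "PHP", "is", "a", "long", "story"]

# Case-sensitive filter: "The" slips through
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Compare the lowercased token instead, so "The" is removed as well
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Using a set instead of a list also makes each membership check faster, which helps on longer texts.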

The final code should look like this:

from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
# separator=" " keeps text from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)
tokens = text.split()
# Drop English stop words before counting
clean_tokens = []
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(f"{key}:{val}")

Now plot the word frequencies again. The result is better than before because the stop words have been removed:

freq.plot(20, cumulative=False)

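Tokenizing text with NLTK

Earlier we used the split method to break the text into tokens. Now we will use NLTK to tokenize the text.

Text cannot be processed until it has been tokenized, so tokenization is a very important step. Tokenizing means splitting a large unit into smaller parts.

You can tokenize paragraphs into sentences and sentences into individual words. NLTK provides a sentence tokenizer and a word tokenizer for these two tasks.

Suppose we have this text: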

Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.

Use the sentence tokenizer to split the text into sentences:

from nltk.tokenize import sent_tokenize

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output is:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

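At this point you might think this is too simple to need NLTK's tokenizer: a regular expression could split the sentences just as well, since every sentence ends with punctuation followed by a space.

Now look at the following text: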

Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.

If this text is split on punctuation, "Hello Mr." is treated as a sentence of its own. With NLTK:

from nltk.tokenize import sent_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output is:

['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

This is the correct split.
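For comparison, here is a sketch of the naive approach using only Python's `re` module. The pattern is an assumption about what a simple punctuation-based splitter might look like; it breaks the text after "Mr." and so produces four pieces instead of three.

```python
import re

mytext = ("Hello Mr. Adam, how are you? I hope everything is going well. "
          "Today is a good day, see you dude.")

# Split after ., ? or ! whenever whitespace follows
naive_sentences = re.split(r"(?<=[.?!])\s+", mytext)
print(naive_sentences)
```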

Next, try the word tokenizer:

from nltk.tokenize import word_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
