Machine Learning: Text Feature Extraction with the Bag-of-Words Model

阿新 • Published: 2018-09-06


Suppose we have a piece of text: "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends." How can we extract features from it?

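A simple approach is the bag-of-words model. Selected words from the text are put into a "bag", the frequency of every word in the bag is counted (ignoring grammar and word order), and the term frequencies are represented as a vector.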
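As a rough illustration of the idea, independent of any library, the counting step can be sketched with the standard library. The tokenizer used here (lowercase, then runs of ASCII letters) is a simplifying assumption, and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring grammar and word order."""
    # Assumption: a "word" is any run of ASCII letters, after lowercasing.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

text1 = ("I have a cat, his name is Huzihu. Huzihu is really cute "
         "and friendly. We are good friends.")
bag = bag_of_words(text1)

# Fix an ordering of the words so the counts can be written as a vector.
vocabulary = sorted(bag)
vector = [bag[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Unlike CountVectorizer, this sketch keeps one-letter words such as "i" and "a".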

Term frequencies can be counted with CountVectorizer from scikit-learn:

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer()
words = CV.fit_transform([text1])  # note: the input must be a list of documents, not a bare string
print(words)

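CountVectorizer first builds a vocabulary from the text: a dictionary whose keys are the words in the text and whose values are their column indices. It then uses this vocabulary to convert the text into a term-frequency matrix, which is printed in sparse form: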

  (0, 3)        1
  (0, 4)        1
  (0, 0)        1
  (0, 11)       1
  (0, 2)        1
  (0, 10)       1
  (0, 7)        2
  (0, 8)        2
  (0, 9)        1
  (0, 6)        1
  (0, 1)        1
  (0, 5)        1

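Each line has the form (document index, word index) count. For example, (0, 7)  2 means that the word at index 7, "huzihu", occurs 2 times in document 0. The index assigned to each word can be looked up in CV.vocabulary_.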

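Note: CountVectorizer converts all text to lowercase and then tokenizes it. Tokenization is the process of splitting a sentence into tokens, that is, meaningful sequences of characters. Tokens are mostly words, but they can also be shorter units such as punctuation marks and affixes. With its default settings, CountVectorizer uses a regular expression to extract sequences of two or more word characters, treating whitespace and punctuation as separators. (Source: http://lib.csdn.net/article/machinelearning/42813)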
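The effect of lowercasing and tokenizing can be imitated with the standard library. The pattern below is the default `token_pattern` documented for CountVectorizer; applying it with `re.findall` in a hypothetical helper `simple_tokenize` is only an approximation of what the class does internally.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def simple_tokenize(text):
    # Lowercase first, then pull out every match of the pattern.
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = simple_tokenize("I have a cat, his name is Huzihu.")
print(tokens)
```

The one-letter words "I" and "a" are dropped because the pattern requires at least two characters, and the punctuation never becomes a token.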

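Text features are usually extracted for document classification, which requires knowing how similar documents are to each other. One way to compare them is to compute the Euclidean distance between their feature vectors.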
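For two term-frequency vectors of the same length n, written here as x and y (symbols introduced only for illustration), the Euclidean distance is:

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
```

A smaller value means the two documents have more similar word counts.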

Let's add two more texts and see how similar the three are to each other.

Text 2: "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

Text 3: "We all need to make plans for the future, otherwise we will regret when we're old."

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."

corpus = [text1, text2, text3]  # put the three documents into one corpus

from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer()
words = CV.fit_transform(corpus)
words_frequency = words.todense()  # convert the sparse result to a dense matrix
print(CV.get_feature_names_out().tolist())  # use get_feature_names() on scikit-learn < 1.0
print(words_frequency)

This prints the feature names, followed by the matrix made up of each text's term-frequency vector:

['all', 'and', 'are', 'cat', 'cousin', 'cute', 'dog', 'eating', 'for', 'friendly', 'friends', 'future', 'good', 'has', 'have', 'he', 'his', 'huzihu', 'is', 'likes', 'make', 'my', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 're', 'really', 'regret', 'sleeping', 'the', 'to', 'we', 'when', 'will']
[[0 1 1 ..., 1 0 0]
 [0 1 0 ..., 0 0 0]
 [1 0 0 ..., 3 1 1]]

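In the first column of the matrix, the first two values are 0 and the last is 1: "all" does not occur in the first two texts and occurs once in the third.

Next, we can use euclidean_distances from sklearn to compute the distances between the feature vectors of the three texts.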

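import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
for i, j in ([0, 1], [0, 2], [1, 2]):
    # euclidean_distances expects 2-D arrays, so turn each row into a 1 x n_features array
    row_i = np.asarray(words_frequency[i]).reshape(1, -1)
    row_j = np.asarray(words_frequency[j]).reshape(1, -1)
    dist = euclidean_distances(row_i, row_j)
    print("Euclidean distance between the feature vectors of text {} and text {}: {}".format(i + 1, j + 1, dist))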

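Output:

Euclidean distance between the feature vectors of text 1 and text 2: [[ 5.19615242]]
Euclidean distance between the feature vectors of text 1 and text 3: [[ 6.08276253]]
Euclidean distance between the feature vectors of text 2 and text 3: [[ 6.164414]]

As you can see, text 1 and text 2 are the most similar.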
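These distances can be cross-checked without scikit-learn. The sketch below assumes the same tokenization rule (lowercased words of two or more characters) and uses the hypothetical helpers `count_terms` and `euclidean`. The squared distances come out as 27, 37 and 38, whose square roots match the three values printed above.

```python
import math
import re
from collections import Counter

corpus = [
    "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends.",
    "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others.",
    "We all need to make plans for the future, otherwise we will regret when we're old.",
]

def count_terms(text):
    # Lowercased words of two or more characters, counted per document.
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def euclidean(a, b):
    # Sum squared count differences over every word seen in either document.
    words = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))

bags = [count_terms(text) for text in corpus]
distances = {
    (i + 1, j + 1): euclidean(bags[i], bags[j])
    for i, j in [(0, 1), (0, 2), (1, 2)]
}
for pair, dist in distances.items():
    print(pair, round(dist, 8))
```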

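Now consider which words should go into the bag. Some words carry little useful information, for example the, be, you, he. These are called stop words. Because a text contains a very large number of distinct words, and every word in the bag is one dimension, we want to reduce the dimensionality as much as possible and remove this noise so that computation and model fitting work better.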
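Stop-word removal itself is just a filter over the tokens. A minimal sketch, assuming a tiny hand-picked stop list (`STOP_WORDS` and `remove_stop_words` are hypothetical names; real lists are much longer):

```python
import re

# Assumption: a tiny hand-picked stop list for illustration only.
STOP_WORDS = {"a", "and", "are", "have", "his", "i", "is", "we"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, split into runs of letters, then drop the stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

text1 = ("I have a cat, his name is Huzihu. Huzihu is really cute "
         "and friendly. We are good friends.")
kept = remove_stop_words(text1)
print(kept)
```

Of the 18 tokens in the sentence, only the 9 content-bearing ones remain, so the bag needs fewer dimensions.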

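To remove these stop words, pass stop_words="english" when creating the CountVectorizer instance.

Alternatively, you can install NLTK (Natural Language Toolkit) and use the stop-word list it provides.

Let's try NLTK (install it first with pip install nltk, then download the stop-word list once with nltk.download("stopwords")):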

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."

corpus = [text1, text2, text3]

from nltk.corpus import stopwords
noise = stopwords.words("english")  # NLTK's English stop-word list

from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer(stop_words=noise)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # use get_feature_names() on scikit-learn < 1.0
print(words_frequency)

Output:

['cat', 'cousin', 'cute', 'dog', 'eating', 'friendly', 'friends', 'future', 'good', 'huzihu', 'likes', 'make', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 'really', 'regret', 'sleeping']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]

As you can see, the bag now contains fewer words. Checking words_frequency.shape shows that the dimension of the feature vectors has dropped from 37 to 21.

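Another case to consider: friendly and friends have related meanings and can be treated as one word. Because they were counted as two separate features, text classification may be biased. The remedy is stemming: reduce each word to its stem and put the stems into the bag.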
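To see why stemming merges such words, here is a deliberately crude suffix stripper. It is an assumption-laden toy with only two rules, not the Snowball algorithm, and `crude_stem` is a hypothetical name.

```python
# Assumption: two hand-picked suffixes; real stemmers apply many more rules.
SUFFIXES = ("ly", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {word: crude_stem(word) for word in ["friendly", "friends", "cats", "is"]}
print(stems)
```

Both "friendly" and "friends" collapse to "friend", so they would share a single feature in the bag.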

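Below we use NLTK's SnowballStemmer to extract stems. Note that the words must first be extracted from the text with a regular expression (that is, tokenized) before they can be stemmed, so we pass our own function as the tokenizer parameter of CountVectorizer: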

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."

corpus = [text1, text2, text3]

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer

def stemming(tokens):
    stemmer = SnowballStemmer("english")
    stemmed = [stemmer.stem(each) for each in tokens]
    return stemmed

def tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')  # regular expression that defines a token
    tokens = tokenizer.tokenize(text)
    stems = stemming(tokens)
    return stems

from nltk.corpus import stopwords
noise = stopwords.words("english")

from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)

words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # use get_feature_names() on scikit-learn < 1.0
print(words_frequency)

Output:

['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]

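As you can see, friendly and friends both became friend after stemming. In addition, others was stemmed to other, which is a stop word and was therefore removed, so the feature vectors now have 19 dimensions.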

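Word forms also need attention, for example singular and plural ("foot" and "feet"), past tense and present participle ("understood" and "understanding"), or active and passive ("eat" and "eaten"). Such words should all be treated as the same feature. The remedy is lemmatization. It is not demonstrated here, but NLTK's WordNetLemmatizer can be used for it (from nltk.stem.wordnet import WordNetLemmatizer).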
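The key difference from stemming is that irregular forms need a dictionary rather than suffix rules. A toy sketch of that idea, assuming a hand-written lookup table that covers only the examples above (`LEMMA_TABLE` and `lookup_lemma` are hypothetical and stand in for a real lexical database such as WordNet):

```python
# Assumption: a hand-written table covering only the examples from the text.
LEMMA_TABLE = {
    "feet": "foot",
    "understood": "understand",
    "understanding": "understand",
    "eaten": "eat",
}

def lookup_lemma(word):
    # Unknown words are returned unchanged (lowercased).
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

words = ["foot", "feet", "understood", "understanding", "eat", "eaten"]
lemmas = [lookup_lemma(w) for w in words]
print(lemmas)
```

Each pair maps to a single lemma, so each pair would count as one feature.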

For the difference between stemming and lemmatization, see https://www.neilx.com/blog/?p=1425.

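Finally, consider document classification: if a word appears in every document, it tells us little about which class a document belongs to. Words that occur in many documents and words that occur in few should therefore be treated differently, because they are not equally important. The remedy is to weight words with TF-IDF (term frequency, inverse document frequency). TF-IDF scales each term's frequency by how rare the term is across the corpus: terms that appear in many documents get a low weight, and terms that appear in few documents get a high weight. This is available in sklearn as TfidfVectorizer.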
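One common formulation, which to the best of our knowledge is the one TfidfVectorizer uses with its default settings (smoothed idf), is shown here. The symbols are notation introduced for illustration: n is the number of documents, df(t) is the number of documents containing term t, and tf(t, d) is the count of t in document d.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With the default settings, each document vector is then scaled to unit Euclidean length, so document length does not dominate the comparison.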

Below we replace CountVectorizer with TfidfVectorizer (keeping the stemming and stop-word removal used earlier) and recompute the similarity between the three texts:

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."

corpus = [text1, text2, text3]

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer

def stemming(tokens):
    stemmer = SnowballStemmer("english")
    stemmed = [stemmer.stem(each) for each in tokens]
    return stemmed

def tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')  # regular expression that defines a token
    tokens = tokenizer.tokenize(text)
    stems = stemming(tokens)
    return stems

from nltk.corpus import stopwords
noise = stopwords.words("english")

from sklearn.feature_extraction.text import TfidfVectorizer
CV = TfidfVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)

words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # use get_feature_names() on scikit-learn < 1.0
print(words_frequency)

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
for i, j in ([0, 1], [0, 2], [1, 2]):
    # euclidean_distances expects 2-D arrays, so turn each row into a 1 x n_features array
    row_i = np.asarray(words_frequency[i]).reshape(1, -1)
    row_j = np.asarray(words_frequency[j]).reshape(1, -1)
    dist = euclidean_distances(row_i, row_j)
    print("Euclidean distance between the feature vectors of text {} and text {}: {}".format(i + 1, j + 1, dist))

Output:

['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[ 0.30300252  0.          0.23044123 ...,  0.30300252  0.          0.        ]
 [ 0.          0.40301621  0.30650422 ...,  0.          0.          0.40301621]
 [ 0.          0.          0.         ...,  0.          0.37796447  0.        ]]
Euclidean distance between the feature vectors of text 1 and text 2: [[ 1.25547312]]
Euclidean distance between the feature vectors of text 1 and text 3: [[ 1.41421356]]
Euclidean distance between the feature vectors of text 2 and text 3: [[ 1.41421356]]

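As you can see, the feature values are no longer raw counts but weighted values. Even with texts this short, the effect of these refinements is visible: text 3 shares no stem with either of the other texts and is now equally far (√2 ≈ 1.414) from both, while text 1 and text 2 remain the closest pair.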
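The weighted values can be reproduced by hand. The sketch below assumes the smoothed-idf formula ln((1 + n) / (1 + df)) + 1 followed by scaling each vector to unit length, and it starts from token lists written out by hand to match the feature names printed above; `tfidf` and `euclidean` are hypothetical helpers.

```python
import math
from collections import Counter

# Assumption: stemmed tokens left after stop-word removal, typed in by hand.
docs = [
    ["cat", "name", "huzihu", "huzihu", "realli", "cute", "friend", "good", "friend"],
    ["cousin", "cute", "dog", "like", "sleep", "eat", "friend"],
    ["need", "make", "plan", "futur", "otherwis", "regret", "old"],
]

n_docs = len(docs)
# Number of documents in which each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Raw count times smoothed idf, then scale the vector to unit length.
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def euclidean(a, b):
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

vectors = [tfidf(doc) for doc in docs]
print(round(vectors[0]["cat"], 8))
print(round(euclidean(vectors[0], vectors[1]), 8))
print(round(euclidean(vectors[0], vectors[2]), 8))
```

Under these assumptions the weight of "cat" in text 1 and the pairwise distances should agree with the printed output up to rounding.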

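Drawbacks of the bag-of-words model:

1. It cannot capture relationships between words. For example, "Humans like cats." and "Cats like humans" have the same feature vector.

2. It cannot capture negation. For example, "I will not eat noodles today." and "I will eat noodles today." have opposite meanings, yet their feature vectors are very similar. This problem can be mitigated with n-grams (for example, by passing the ngram_range parameter when creating a CountVectorizer instance in sklearn).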
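Both drawbacks, and the way n-grams help, can be seen in a small standard-library sketch. `ngram_bag` is a hypothetical helper with a simplified tokenizer (lowercased runs of letters). In CountVectorizer the comparable setting would be something like ngram_range=(1, 2).

```python
import re
from collections import Counter

def ngram_bag(text, n):
    # Count every run of n consecutive tokens, joined by a space.
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

a, b = "Humans like cats.", "Cats like humans"
same_unigrams = ngram_bag(a, 1) == ngram_bag(b, 1)  # word order is lost
same_bigrams = ngram_bag(a, 2) == ngram_bag(b, 2)   # bigrams keep local order
print(same_unigrams, same_bigrams)

negated = ngram_bag("I will not eat noodles today.", 2)
plain = ngram_bag("I will eat noodles today.", 2)
only_in_negated = sorted(set(negated) - set(plain))
print(only_in_negated)
```

With single words the two cat sentences are indistinguishable, while bigrams separate them. In the noodle example, the bigrams "will not" and "not eat" give the negated sentence features of its own.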
