Machine Learning: Text Feature Extraction with the Bag-of-Words Model
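Suppose we have a piece of text: "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends." How do we extract features from it?
A simple approach is the bag-of-words model. We put a chosen set of words from the text into a "bag", count how often each word in the bag occurs (ignoring grammar and word order), and represent these term frequencies as a vector.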
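from sklearn.feature_extraction.text import CountVectorizer

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

CV = CountVectorizer()
# Note: fit_transform expects a collection of documents, so wrap the string in a list
words = CV.fit_transform([text1])
print(words)  # sparse matrix, printed as "(document index, feature index) count"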
(0, 3) 1
(0, 4) 1
(0, 0) 1
(0, 11) 1
(0, 2) 1
(0, 10) 1
(0, 7) 2
(0, 8) 2
(0, 9) 1
(0, 6) 1
(0, 1) 1
(0, 5) 1
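Each line has the form (document index, feature index) count, and both indices start at 0. For example, (0, 7) 2 means that the term with feature index 7, "Huzihu", appears twice. The mapping from each term to its feature index can be checked with CV.vocabulary_.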
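To make the counting step concrete, the same tally can be reproduced by hand with the standard library. This is only a rough sketch: it assumes CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters, and the variable names are illustrative.

```python
import re
from collections import Counter

text1 = ("I have a cat, his name is Huzihu. Huzihu is really cute and friendly. "
         "We are good friends.")

# Assumption: mimic CountVectorizer's defaults, i.e. lowercase the text and
# keep only tokens of two or more word characters (so "I" and "a" are dropped).
tokens = re.findall(r"\b\w\w+\b", text1.lower())
counts = Counter(tokens)

# The vocabulary is the sorted list of distinct tokens; the feature vector
# holds one count per vocabulary entry, in that order.
vocabulary = sorted(counts)
vector = [counts[term] for term in vocabulary]

for term, count in zip(vocabulary, vector):
    print(term, count)
```

Word order plays no role here: shuffling the tokens would give exactly the same vector.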
Text 2: "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
Text 3: "We all need to make plans for the future, otherwise we will regret when we're old."
from sklearn.feature_extraction.text import CountVectorizer

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]  # put the three documents into one corpus

CV = CountVectorizer()
words = CV.fit_transform(corpus)
words_frequency = words.todense()  # convert the sparse result to a dense matrix
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(CV.get_feature_names_out().tolist())
print(words_frequency)
['all', 'and', 'are', 'cat', 'cousin', 'cute', 'dog', 'eating', 'for', 'friendly', 'friends', 'future', 'good', 'has', 'have', 'he', 'his', 'huzihu', 'is', 'likes', 'make', 'my', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 're', 'really', 'regret', 'sleeping', 'the', 'to', 'we', 'when', 'will']
[[0 1 1 ..., 1 0 0]
 [0 1 0 ..., 0 0 0]
 [1 0 0 ..., 3 1 1]]
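import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Compare the documents pairwise by the Euclidean distance between their count vectors
X = np.asarray(words_frequency)  # plain 2-D array; recent scikit-learn versions reject np.matrix
for i, j in ([0, 1], [0, 2], [1, 2]):
    dist = euclidean_distances(X[[i]], X[[j]])  # X[[i]] keeps the row two-dimensional
    print("The Euclidean distance between the feature vectors of text {} and text {} is: {}".format(i + 1, j + 1, dist))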
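The Euclidean distance between the feature vectors of text 1 and text 2 is: [[ 5.19615242]]
The Euclidean distance between the feature vectors of text 1 and text 3 is: [[ 6.08276253]]
The Euclidean distance between the feature vectors of text 2 and text 3 is: [[ 6.164414]]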
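For reference, the Euclidean distance is the square root of the summed squared differences between corresponding vector entries. A minimal sketch with two made-up count vectors (the vectors and the helper name are hypothetical, not taken from the corpus above):

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical count vectors over a shared four-term vocabulary.
doc_a = [1, 0, 2, 1]
doc_b = [0, 1, 2, 0]

print(euclidean_distance(doc_a, doc_b))
```

The smaller the distance, the more similar the two documents' word counts are.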
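Now consider which words should go into the bag. Some words carry little useful information, for example the, be, you, he. These are called stop words. A text contains a very large number of distinct words, and every word in the bag is one dimension, so we want to reduce the dimensionality as far as possible and remove this noise to make computation and model fitting easier.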
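One option is scikit-learn's bundled English stop-word list, which CountVectorizer applies when stop_words="english" is passed. A minimal sketch on text 2 (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."

# stop_words="english" selects scikit-learn's built-in English stop-word list.
cv = CountVectorizer(stop_words="english")
cv.fit([text2])

# vocabulary_ maps each remaining term to its column index.
kept_terms = sorted(cv.vocabulary_)
print(kept_terms)
```

Words such as "he" and "is" no longer appear in the vocabulary, so they no longer add dimensions.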
You can also install NLTK (Natural Language Toolkit) and use the stop-word list it provides.
Let's try it with NLTK (install it first: pip install nltk):
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")  # one-time download of NLTK's stop-word corpus

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]

noise = stopwords.words("english")  # NLTK's English stop words
CV = CountVectorizer(stop_words=noise)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(CV.get_feature_names_out().tolist())
print(words_frequency)
['cat', 'cousin', 'cute', 'dog', 'eating', 'friendly', 'friends', 'future', 'good', 'huzihu', 'likes', 'make', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 'really', 'regret', 'sleeping']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]

# Stem every token before counting, so that inflected forms share one feature
stemmer = SnowballStemmer("english")
tokenizer = RegexpTokenizer(r"\w+")  # regular-expression rule: runs of word characters

def stemming(tokens):
    # reduce each token to its stem
    return [stemmer.stem(each) for each in tokens]

def tokenize(text):
    # split the text into tokens, then stem them
    return stemming(tokenizer.tokenize(text))

noise = stopwords.words("english")  # needs the NLTK "stopwords" corpus: nltk.download("stopwords")
CV = CountVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(CV.get_feature_names_out().tolist())
print(words_frequency)
['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]
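In addition, pay attention to inflected word forms, for example singular and plural ("foot" and "feet"), past tense and -ing form ("understood" and "understanding"), or active and passive forms ("eat" and "eaten"). Each such group of words should be treated as a single feature. The solution is lemmatization. It is not demonstrated here, but NLTK's WordNetLemmatizer can be used for it (from nltk.stem.wordnet import WordNetLemmatizer).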
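Finally, think about document classification. If a word appears in every document, it tells us little about which class a document belongs to. Words that occur in many documents and words that occur in few should therefore be treated differently, because they are not equally informative. The remedy is to weight the terms with TF-IDF (term frequency-inverse document frequency). TF-IDF scales a term's frequency in a document by how rare the term is across the corpus: terms that appear in many documents get a low weight, and terms that appear in few documents get a high weight. In sklearn this is implemented by TfidfVectorizer.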
import numpy as np
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]

stemmer = SnowballStemmer("english")
tokenizer = RegexpTokenizer(r"\w+")  # regular-expression rule: runs of word characters

def stemming(tokens):
    # reduce each token to its stem
    return [stemmer.stem(each) for each in tokens]

def tokenize(text):
    # split the text into tokens, then stem them
    return stemming(tokenizer.tokenize(text))

noise = stopwords.words("english")  # needs the NLTK "stopwords" corpus: nltk.download("stopwords")
CV = TfidfVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(CV.get_feature_names_out().tolist())
print(words_frequency)

X = np.asarray(words_frequency)  # plain 2-D array; recent scikit-learn versions reject np.matrix
for i, j in ([0, 1], [0, 2], [1, 2]):
    dist = euclidean_distances(X[[i]], X[[j]])  # X[[i]] keeps the row two-dimensional
    print("The Euclidean distance between the feature vectors of text {} and text {} is: {}".format(i + 1, j + 1, dist))
['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[ 0.30300252  0.          0.23044123 ...,  0.30300252  0.          0.        ]
 [ 0.          0.40301621  0.30650422 ...,  0.          0.          0.40301621]
 [ 0.          0.          0.         ...,  0.          0.37796447  0.        ]]
The Euclidean distance between the feature vectors of text 1 and text 2 is: [[ 1.25547312]]
The Euclidean distance between the feature vectors of text 1 and text 3 is: [[ 1.41421356]]
The Euclidean distance between the feature vectors of text 2 and text 3 is: [[ 1.41421356]]
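The weighting can also be worked through by hand. The sketch below assumes TfidfVectorizer's default settings, namely the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector. The toy corpus and the helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and stemmed.
docs = [
    ["cat", "cute", "friend"],
    ["dog", "cute", "friend"],
    ["plan", "futur"],
]

n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})
# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed default: smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)
    weights = [counts[term] * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]  # L2 normalization to unit length

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

vectors = [tfidf_vector(doc) for doc in docs]

print(idf("cute"), idf("cat"))            # the rarer term gets the larger weight
print(distance(vectors[0], vectors[1]))   # documents sharing terms
print(distance(vectors[0], vectors[2]))   # documents sharing no terms
```

Because every vector has length 1, two documents with no term in common are always at distance sqrt(2), about 1.41421356, which is the value printed above for text 3 against both text 1 and text 2.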
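Drawbacks of the bag-of-words model:
1. It cannot reflect relationships between words. For example, "Humans like cats." and "Cats like humans" have the same feature vector.
2. It cannot capture negation. For example, "I will not eat noodles today." and "I will eat noodles today." have opposite meanings, yet their feature vectors are very similar. This problem can be mitigated with n-grams, for example by passing the ngram_range parameter when creating a CountVectorizer instance in sklearn.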
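Both drawbacks, and the effect of n-grams, can be checked on the example sentences. The sketch below lists the features whose counts differ between two documents, first with single words only and then with single words plus bigrams (ngram_range=(1, 2)). The helper differing_features is a hypothetical name introduced for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

order_docs = ["Humans like cats.", "Cats like humans."]
negation_docs = ["I will not eat noodles today.", "I will eat noodles today."]

def differing_features(docs, ngram_range):
    """Return the features whose counts differ between the two documents."""
    cv = CountVectorizer(ngram_range=ngram_range)
    counts = cv.fit_transform(docs).toarray()
    # vocabulary_ maps feature name -> column index; sort the names by index.
    names = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
    return [name for k, name in enumerate(names) if counts[0][k] != counts[1][k]]

for label, docs in (("word order", order_docs), ("negation", negation_docs)):
    print(label)
    print("  unigrams only:     ", differing_features(docs, (1, 1)))
    print("  unigrams + bigrams:", differing_features(docs, (1, 2)))
```

With single words, the word-order pair has no differing feature at all and the negation pair differs only in "not". Adding bigrams introduces features such as "not eat" and "humans like" that separate the sentences, at the cost of a larger vocabulary.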