Text Classification with scikit-learn
阿新 • Published: 2019-01-04
1. Data Source
The data used here is already labeled; a detailed description is given in SMS Spam Collection v. 1, and it can be downloaded from GitHub (the data is in chapter 4). Each row contains two columns separated by a comma: the first column is the class (label) and the second is the text.
2. Data Preparation

sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])

sms.head
Out[5]: 
<bound method DataFrame.head of    label                                               text
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...
5   spam  FreeMsg Hey there darling it's been 3 week's n...
6    ham  Even my brother is not like to speak with me. ...
7    ham  As per your request 'Melle Melle (Oru Minnamin...
8   spam  WINNER!! As a valued network customer you have...
9   spam  Had your mobile 11 months or more? U R entitle...
10   ham  I'm gonna be home soon and i don't want to tal...

(Note: sms.head is missing its parentheses, so the interpreter displays the bound method rather than calling it; sms.head() would print just the first five rows.)
There are 5574 rows in total. 500 rows are drawn at random as the test set, and the rest are used as the training set; the function below was defined for this. Running it revealed a small problem: it does not actually select 500 rows, but a few less. The reason is that the generated random numbers can repeat. Here n is the number of rows to draw and size is the number of rows in the whole dataset.
def randomSequence(n, size):
    result = [0 for i in range(size)]
    for i in range(n):
        x = random.randrange(0, size-1, 1)
        result[x] = 1
    return result
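The duplicates can be avoided with random.sample, which draws indices without replacement, so exactly n positions are marked. A minimal sketch of such a fix (a suggested variant, not code from the original post):

import random

def randomSequence(n, size):
    result = [0] * size
    # random.sample returns n distinct indices in [0, size),
    # so no position is drawn twice and exactly n rows are marked
    for x in random.sample(range(size), n):
        result[x] = 1
    return result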
3. Feature Extraction

Before the classification algorithm can be called, the text must be converted into features. scikit-learn's CountVectorizer and TfidfTransformer classes handle this feature extraction. The test set reuses the vocabulary built from the training set.
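To make the shared-vocabulary step concrete, here is a minimal sketch (the toy sentences are invented for illustration and are not from the dataset):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_texts = ['free prize waiting for you', 'see you at lunch']
test_texts = ['claim your free prize now']

# Learn the vocabulary and term counts from the training texts
vectorizer = CountVectorizer(stop_words='english')
counts_train = vectorizer.fit_transform(train_texts)

# Fit IDF weights on the training counts only
transformer = TfidfTransformer()
tfidf_train = transformer.fit_transform(counts_train)

# Vectorize the test texts with the training vocabulary, then apply
# the already-fitted IDF weights
test_vectorizer = CountVectorizer(vocabulary=vectorizer.vocabulary_)
tfidf_test = transformer.transform(test_vectorizer.transform(test_texts))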
4. Complete Code
The program uses the naive Bayes algorithm; other algorithms can be swapped in.

# -*- coding: utf-8 -*-
import random
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Generate a random 0/1 sequence for splitting training and test data
def randomSequence(n, size):
    result = [0 for i in range(size)]
    for i in range(n):
        x = random.randrange(0, size-1, 1)
        result[x] = 1
    return result

if __name__ == '__main__':
    # Read the data
    filename = 'data/sms_spam.csv'
    sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])

    # Split into training and test sets
    size = len(sms)
    sequence = randomSequence(500, size)
    sms_train_mask = [sequence[i]==0 for i in range(size)]
    sms_train = sms[sms_train_mask]
    sms_test_mask = [sequence[i]==1 for i in range(size)]
    sms_test = sms[sms_test_mask]

    # Convert the text into TF-IDF vectors
    train_labels = sms_train['label'].values
    train_features = sms_train['text'].values
    count_v1 = CountVectorizer(stop_words='english', max_df=0.5, decode_error='ignore')
    counts_train = count_v1.fit_transform(train_features)
    #print(count_v1.get_feature_names())
    #repr(counts_train.shape)
    tfidftransformer = TfidfTransformer()
    tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)

    test_labels = sms_test['label'].values
    test_features = sms_test['text'].values
    # The test set reuses the vocabulary built from the training set
    count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_, stop_words='english',
                               max_df=0.5, decode_error='ignore')
    counts_test = count_v2.fit_transform(test_features)
    # Apply the IDF weights learned from the training data instead of
    # refitting the transformer on the test counts
    tfidf_test = tfidftransformer.transform(counts_test)

    # Train
    clf = MultinomialNB(alpha=0.01)
    clf.fit(tfidf_train, train_labels)

    # Predict
    predict_result = clf.predict(tfidf_test)
    #print(predict_result)

    # Accuracy
    correct = [test_labels[i]==predict_result[i] for i in range(len(predict_result))]
    r = len(predict_result)
    t = correct.count(True)
    f = correct.count(False)
    print(r, t, f, t/float(r))
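Swapping in a different classifier only touches a few lines. A hypothetical variation (not part of the original script) using LinearSVC, assuming the tfidf_train, train_labels, and tfidf_test variables from the program above:

from sklearn.svm import LinearSVC

# Replaces MultinomialNB(alpha=0.01); LinearSVC is a common choice
# for high-dimensional sparse TF-IDF features
clf = LinearSVC()
clf.fit(tfidf_train, train_labels)
predict_result = clf.predict(tfidf_test)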
Run Results
runfile('E:/MyProject/_python/ScikitLearn/NaiveBayes.py', wdir='E:/MyProject/_python/ScikitLearn')
(476, 468, 8, 0.9831932773109243)
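Only 476 of the intended 500 rows ended up in the test set, which matches the duplicate-index behavior of randomSequence described in section 2. Of those, 468 were classified correctly, for an accuracy of about 98.3%.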