Spam Classification with sklearn
By 阿新 · Published 2021-01-12
Tags: Machine Learning, spam classification, sklearn
1. Data Loading
# 1. Read the dataset
import os
import numpy as np  # needed for np.array below

dataset_x = []
dataset_y = []
for filename in os.listdir("spam"):
    with open("spam/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    content = ""
    for c in file_content:
        if len(content) != 0:
            content += " "
        content += c
    dataset_x.append(content)
    dataset_y.append(1)
for filename in os.listdir("ham"):
    with open("ham/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    content = ""
    for c in file_content:
        if len(content) != 0:
            content += " "
        content += c
    dataset_x.append(content)
    dataset_y.append(0)
dataset_x = np.array(dataset_x)
dataset_y = np.array(dataset_y)
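The two near-identical loops above can be folded into one helper. A minimal sketch (the `load_dir` name and signature are my own, not from the original post; it assumes the same one-file-per-email layout):

```python
import os

def load_dir(path, label):
    """Read every file under `path` and pair its text with `label`."""
    texts, labels = [], []
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), mode="r", encoding="utf-8") as f:
            # Join the lines with single spaces, exactly as the loops above do.
            texts.append(" ".join(f.readlines()))
        labels.append(label)
    return texts, labels

# Usage: dataset_x, dataset_y = load_dir("spam", 1)
#        then extend both lists with load_dir("ham", 0)
```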
2. Shuffle
Since the data will be split into training and test sets below, it needs to be shuffled first.
# 2. Shuffle the dataset
import numpy as np
np.random.seed(116)
np.random.shuffle(dataset_x)
# Re-seed so the labels get the same permutation as the texts
np.random.seed(116)
np.random.shuffle(dataset_y)
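Re-seeding before each shuffle works because the same seed produces the same permutation, keeping each text aligned with its label. An equivalent and less fragile approach is to shuffle a single index array and apply it to both (a sketch with toy data, not from the original post):

```python
import numpy as np

x = np.array(["a", "b", "c", "d"])
y = np.array([0, 1, 2, 3])  # y[i] is the label of x[i]

# One permutation, applied to both arrays, keeps every (x, y) pair intact.
rng = np.random.default_rng(116)
idx = rng.permutation(len(x))
x_shuffled, y_shuffled = x[idx], y[idx]
```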
3. 向量化表示
前面自己簡單實現了bag of words
,這裡就使用tf-idf
來實現向量化。
# 3. Use sklearn's tf-idf to get the vector representation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode', norm='l2')
dataset_x_vec = vectorizer.fit_transform(dataset_x)
4. Dataset Split
To compute accuracy and other metrics properly later, the dataset is split here.
# 4. Split 0.8/0.2 into training and test sets
from sklearn.model_selection import train_test_split
train_x_vec, test_x_vec, train_y, test_y = train_test_split(
    dataset_x_vec, dataset_y, test_size=0.2)
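`train_test_split` shuffles again internally; passing `random_state` makes the split reproducible across runs. A small illustration with dummy arrays (not part of the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# test_size=0.2 reserves 2 of the 10 samples for testing;
# random_state fixes the shuffle so reruns give the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```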
5. Building the Model
We use naive Bayes from sklearn directly, here assuming the features follow a multinomial distribution, i.e. MultinomialNB.
There are also Bernoulli and Gaussian variants, corresponding to BernoulliNB and GaussianNB respectively.
# 5. Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_x_vec, train_y)
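The vectorizer and classifier can also be chained with sklearn's `Pipeline`, which keeps fit and predict as single calls on raw text. A sketch with a made-up toy corpus (the texts and labels are invented for illustration, not from the original data):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win free money now", "free prize claim now",
               "meeting at noon", "project schedule update"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The pipeline vectorizes and classifies in one object, so raw
# strings go in and predicted labels come out.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
pred = model.predict(["claim your free money", "schedule a meeting"])
```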
6. Checking the Results
It is worth checking the dimensions of the training matrix first (train_x_vec.shape).
Finally, let's see how the model performs:
from sklearn.metrics import classification_report
pred = clf.predict(test_x_vec)
print(classification_report(test_y, pred))
The complete code:
# 1. Read the dataset
import os
import numpy as np

dataset_x = []
dataset_y = []
for filename in os.listdir("spam"):
    with open("spam/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    content = ""
    for c in file_content:
        if len(content) != 0:
            content += " "
        content += c
    dataset_x.append(content)
    dataset_y.append(1)
for filename in os.listdir("ham"):
    with open("ham/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    content = ""
    for c in file_content:
        if len(content) != 0:
            content += " "
        content += c
    dataset_x.append(content)
    dataset_y.append(0)
dataset_x = np.array(dataset_x)
dataset_y = np.array(dataset_y)

# 2. Shuffle the dataset (re-seed so texts and labels get the same permutation)
np.random.seed(116)
np.random.shuffle(dataset_x)
np.random.seed(116)
np.random.shuffle(dataset_y)

# 3. Use sklearn's tf-idf to get the vector representation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode', norm='l2')
dataset_x_vec = vectorizer.fit_transform(dataset_x)

# 4. Split 0.8/0.2 into training and test sets
from sklearn.model_selection import train_test_split
train_x_vec, test_x_vec, train_y, test_y = train_test_split(
    dataset_x_vec, dataset_y, test_size=0.2)

# 5. Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_x_vec, train_y)
pred = clf.predict(test_x_vec)

# 6. Evaluate
from sklearn.metrics import classification_report
print(classification_report(test_y, pred))