
Spam Classification with sklearn


1. Data Loading

# 1. Read the dataset
import os

import numpy as np

dataset_x = []
dataset_y = []

# Spam mails: label 1
for filename in os.listdir("spam"):
    with open("spam/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    dataset_x.append(" ".join(file_content))
    dataset_y.append(1)

# Ham mails: label 0
for filename in os.listdir("ham"):
    with open("ham/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    dataset_x.append(" ".join(file_content))
    dataset_y.append(0)

dataset_x = np.array(dataset_x)
dataset_y = np.array(dataset_y)
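The joining loop above keeps each line's trailing newline inside `content`. The same idea can be written as a small helper that also collapses all whitespace; a sketch (`load_dir` and the temporary-directory demo are illustrative, not part of the original, which reads from spam/ and ham/):

```python
import os
import tempfile

def load_dir(path, label):
    """Read every file under `path`, collapse it to one line, pair it with `label`."""
    texts, labels = [], []
    for filename in sorted(os.listdir(path)):
        with open(os.path.join(path, filename), encoding="utf-8") as f:
            # split() + join collapses newlines and runs of spaces into single spaces
            texts.append(" ".join(f.read().split()))
        labels.append(label)
    return texts, labels

# Demo on a throwaway directory standing in for spam/.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w", encoding="utf-8") as f:
        f.write("win a\nfree prize\n")
    texts, labels = load_dir(d, 1)

print(texts, labels)   # -> ['win a free prize'] [1]
```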

2. shuffle

Because we will later split the data into training and test sets, we first need to shuffle the order.

# 2. Shuffle the dataset
import numpy as np

# Reseeding with the same value reproduces the identical permutation,
# so dataset_x and dataset_y stay aligned.
np.random.seed(116)
np.random.shuffle(dataset_x)
np.random.seed(116)
np.random.shuffle(dataset_y)
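The double-seed trick works because resetting the global generator makes both `shuffle` calls apply the identical permutation. An alternative is to draw one index permutation and apply it to both arrays; a sketch on made-up toy arrays, using NumPy's newer `default_rng` API:

```python
import numpy as np

x = np.array(["mail0", "mail1", "mail2", "mail3"])
y = np.array([0, 1, 0, 1])   # mailN carries label N % 2

# One shared permutation instead of reseeding the global generator twice.
rng = np.random.default_rng(116)
perm = rng.permutation(len(x))
x_shuf, y_shuf = x[perm], y[perm]

# Every shuffled mail still sits next to its own label.
assert all(int(m[-1]) % 2 == lab for m, lab in zip(x_shuf, y_shuf))
```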

3. Vectorization

Previously I implemented a simple bag-of-words model by hand; here we use tf-idf for the vectorization instead.

# 3. Use sklearn's tf-idf to obtain the vector representation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode', norm='l2')
dataset_x_vec = vectorizer.fit_transform(dataset_x)
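As a quick sanity check of what `TfidfVectorizer` produces, here is a sketch on a made-up three-document corpus. `min_df` is left at its default of 1 because the toy corpus is tiny (the article uses 2), and stop-word removal is skipped:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "free prize click now",
    "meeting notes attached",
    "click the free link",
]
# ngram_range=(1, 2) indexes both single words and adjacent word pairs.
vec = TfidfVectorizer(ngram_range=(1, 2), norm="l2")
mat = vec.fit_transform(corpus)   # sparse matrix: 3 rows, one per document

# norm="l2" scales every document row to unit Euclidean length.
row_norm = float(np.sqrt(mat[0].multiply(mat[0]).sum()))
print(mat.shape, row_norm)
```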

4. Dataset Split

To compute accuracy and other metrics more rigorously later, we split the dataset here.

# 4. Split 80/20 into training and test sets
from sklearn.model_selection import train_test_split
train_x_vec, test_x_vec, train_y, test_y = train_test_split(
    dataset_x_vec, dataset_y, test_size=0.2)
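Note that without `random_state` this split changes on every run. A sketch of a reproducible, class-balanced split on made-up data; `stratify=y` keeps the spam/ham ratio identical in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y = np.array([1, 0] * 5)           # perfectly balanced labels

# random_state fixes the shuffle; stratify=y preserves the class ratio.
tr_x, te_x, tr_y, te_y = train_test_split(
    X, y, test_size=0.2, random_state=116, stratify=y)

assert len(te_y) == 2 and te_y.sum() == 1   # one sample from each class
```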

5. Building the Model

We use sklearn's naive Bayes directly, assuming here that the features follow a multinomial distribution, i.e. MultinomialNB.
There are also binary Bernoulli and Gaussian variants, corresponding to BernoulliNB and GaussianNB respectively.

# 5. Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_x_vec, train_y)
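On count-like features, `MultinomialNB` models each class by its term frequencies. A self-contained sketch on a made-up term-count matrix (the three-word vocabulary is hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix; columns stand for the hypothetical
# vocabulary ["free", "prize", "meeting"].
X = np.array([[3, 2, 0],   # spam-like: lots of "free"/"prize"
              [2, 1, 0],
              [0, 0, 4],   # ham-like: only "meeting"
              [0, 1, 3]])
y = np.array([1, 1, 0, 0])

clf = MultinomialNB().fit(X, y)          # default alpha=1.0 Laplace smoothing
pred = clf.predict(np.array([[2, 2, 0],  # "free free prize prize"
                             [0, 0, 5]]))  # "meeting" x5
print(pred)   # -> [1 0]
```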

6. Results

Let's first look at the dimensions of the training vectors:
(figure: shape of the training feature matrix train_x_vec)
Finally, let's see how it performs:

from sklearn.metrics import classification_report
pred = clf.predict(test_x_vec)
print(classification_report(test_y, pred))

The output looks like this:
(figure: classification report for the test set)
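`classification_report` can also return the printed table's numbers programmatically via `output_dict=True`; a sketch on made-up labels:

```python
from sklearn.metrics import classification_report

true = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 0, 1]

# output_dict=True exposes the table as nested dicts, keyed by each
# label as a string, plus "accuracy" and the macro/weighted averages.
report = classification_report(true, pred, output_dict=True)
print(report["1"]["precision"], report["accuracy"])   # -> 1.0 0.8
```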


The complete code is as follows:

# 1. Read the dataset
import os

import numpy as np

dataset_x = []
dataset_y = []

# Spam mails: label 1
for filename in os.listdir("spam"):
    with open("spam/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    dataset_x.append(" ".join(file_content))
    dataset_y.append(1)

# Ham mails: label 0
for filename in os.listdir("ham"):
    with open("ham/" + filename, mode="r", encoding="utf-8") as f:
        file_content = f.readlines()
    dataset_x.append(" ".join(file_content))
    dataset_y.append(0)

dataset_x = np.array(dataset_x)
dataset_y = np.array(dataset_y)

# 2. Shuffle the dataset
import numpy as np

# Reseeding with the same value reproduces the identical permutation,
# so dataset_x and dataset_y stay aligned.
np.random.seed(116)
np.random.shuffle(dataset_x)
np.random.seed(116)
np.random.shuffle(dataset_y)


# 3. Use sklearn's tf-idf to obtain the vector representation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode', norm='l2')
dataset_x_vec = vectorizer.fit_transform(dataset_x)

# 4. Split 80/20 into training and test sets
from sklearn.model_selection import train_test_split
train_x_vec, test_x_vec, train_y, test_y = train_test_split(
    dataset_x_vec, dataset_y, test_size=0.2)

# 5. Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_x_vec, train_y)
pred = clf.predict(test_x_vec)

# 6. Evaluation
from sklearn.metrics import classification_report
print(classification_report(test_y, pred))