《深度學習Python實踐》, Chapter 22: A Text Classification Example
By 阿新, published 2019-01-23
The code is as follows:
1) Algorithm comparison
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt
categories=['alt.atheism',
'rec.sport.hockey',
'sci.crypt',
'comp.sys.ibm.pc.hardware',
'sci.med',
'comp.sys.mac.hardware',
'sci.space',
'comp.windows.x',
'soc.religion.christian',
'misc.forsale',
'talk.politics.guns',
'rec.autos',
'talk.politics.mideast',
'rec.motorcycles',
'talk.politics.misc',
'rec.sport.baseball',
'talk.religion.misc']
# Load the training data
train_path='/home/duan/下載/20news-bydate/20news-bydate-train'
dataset_train=load_files(container_path=train_path,categories=categories)
# Load the evaluation (test) data
test_path='/home/duan/下載/20news-bydate/20news-bydate-test'
dataset_test=load_files(container_path=test_path,categories=categories)
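# Optional sanity check: load_files returns a Bunch, so the size of each split
# and the category names actually found on disk can be inspected directly.
print(len(dataset_train.data), len(dataset_test.data))
print(dataset_train.target_names)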
# Data preparation and understanding
# Compute word counts (term frequencies)
count_vect=CountVectorizer(stop_words='english',decode_error='ignore')
X_train_counts=count_vect.fit_transform(dataset_train.data)
# Inspect the data dimensions
# The word-count matrix has the following shape:
print(X_train_counts.shape)
# Compute TF-IDF features
tf_transformer=TfidfVectorizer(stop_words='english',decode_error='ignore')
X_train_counts_tf=tf_transformer.fit_transform(dataset_train.data)
print(X_train_counts_tf.shape)
# The two approaches above extract text features in different ways, and the dimensions of the resulting matrices were checked.
# Next, the TF-IDF features are used to train the classification models.
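# Quick illustration on a toy corpus of how the two extractors differ:
# CountVectorizer stores raw term counts, while TfidfVectorizer downweights
# terms that occur in many documents. (Toy documents for illustration only.)
toy_docs = ['the cat sat on the mat', 'the dog sat on the log']
print(CountVectorizer().fit_transform(toy_docs).toarray())
print(TfidfVectorizer().fit_transform(toy_docs).toarray().round(2))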
# Evaluate algorithms
# Set up the evaluation baseline
num_folds=10
seed=7
scoring='accuracy'
# Linear algorithm: LR
# Nonlinear algorithms: CART, SVM, MNB, KNN
models={}
models['LR']=LogisticRegression()
models['SVM']=SVC()
models['CART']=DecisionTreeClassifier()
models['MNB']=MultinomialNB()
models['KNN']=KNeighborsClassifier()
# Compare the algorithms
results=[]
for key in models:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(models[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))
Output:
(7838, 77172)
(7838, 77172)
KNN: 0.824575 (0.012700)
LR: 0.920900 (0.008155)
CART: 0.703240 (0.013782)
MNB: 0.896786 (0.009055)
SVM: 0.062772 (0.004306)
Compare the algorithms with a box plot:
# Box plot comparing the algorithms' 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
plt.show()
Output (box plot):
The chart shows that the Naive Bayes classifier's scores are tightly concentrated (low dispersion), while logistic regression shows a larger skew. The dispersion of an algorithm's results reflects how well the algorithm suits the data, so logistic regression and the Naive Bayes classifier are studied further through parameter tuning.
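The spread visible in the box plot can also be summarized numerically from the results list collected above; a minimal sketch, assuming numpy is available and relying on models and results being filled in the same loop order:
import numpy as np
# Median and interquartile range of each algorithm's cross-validation scores
for key, cv_result in zip(models, results):
    iqr = np.percentile(cv_result, 75) - np.percentile(cv_result, 25)
    print('%s: median=%.4f IQR=%.4f' % (key, np.median(cv_result), iqr))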
2) Algorithm tuning
The analysis above shows that LR and MNB are worth optimizing further. Their parameters are tuned below to further improve accuracy.
(1) Tuning logistic regression
The hyperparameter of logistic regression is C, which constrains the objective function: the smaller C is, the stronger the regularization. To tune C, evaluate a fixed set of candidate values in each round; if the best value lands on the boundary of the grid, repeat the step with an extended grid until the optimal value is found.
# Algorithm tuning
# Tune LR
param_grid={}
param_grid['C']=[0.1,5,13,15]
model=LogisticRegression()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
Output:
Best: 0.9393978055626435 using {'C': 15}
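Note that the best value found, C=15, lies on the upper boundary of the candidate grid, so by the strategy described above the grid can be extended and the search repeated; a minimal sketch with illustrative candidate values (results not shown here):
# Extend the grid upward because the previous optimum sat on the boundary (C=15)
param_grid = {'C': [15, 20, 30, 50, 100]}
model = LogisticRegression()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))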
(2) Tuning Naive Bayes
Naive Bayes has an alpha parameter, a smoothing parameter with a default value of 1.0.
This parameter can be tuned to improve the accuracy of the algorithm.
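The role of alpha can be seen from the smoothed word-probability estimate used by MultinomialNB, P(w|c) = (N_wc + alpha) / (N_c + alpha * n_features); a toy calculation with made-up counts shows how strongly it affects words unseen in a class:
# Toy counts for illustration: a word never seen in the class, 1000 total words
# in the class, and a vocabulary of 50000 features
count_wc, total_c, n_features = 0, 1000, 50000
for alpha in (0.001, 0.01, 0.1, 1.0):
    p = (count_wc + alpha) / (total_c + alpha * n_features)
    print('alpha=%s -> P(unseen word | class) = %.2e' % (alpha, p))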
# Algorithm tuning
# Tune MNB
param_grid={}
param_grid['alpha']=[0.001,0.01,0.1,1.5]
model=MultinomialNB()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))
Output:
Best: 0.934804797142128 using {'alpha': 0.01}
0.929829 (0.008380) with {'alpha': 0.001}
0.934805 (0.008096) with {'alpha': 0.01}
0.928043 (0.008024) with {'alpha': 0.1}
0.889640 (0.010375) with {'alpha': 1.5}
The optimal parameter for MNB is alpha=0.01 (best: 0.934804797142128 using {'alpha': 0.01}).
The optimal parameter for LR is C=15 (best: 0.9393978055626435 using {'C': 15}).
Tuning shows that LR with C=15 achieves the best accuracy. Next, ensemble algorithms are examined.
3) Ensemble algorithms
Random Forest (RF)
AdaBoost (AB)
ensembles={}
ensembles['RF']=RandomForestClassifier()
ensembles['AB']=AdaBoostClassifier()
# Compare the ensemble algorithms
results=[]
for key in ensembles:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(ensembles[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))
Output:
RF: 0.773795 (0.017244)
AB: 0.620055 (0.017638)
Box plot:
# Box plot comparing the ensemble algorithms' 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()
The box plot shows that the random forest scores are fairly evenly distributed and that it suits the data relatively well, so it is the more promising candidate for further optimization.
4) Ensemble algorithm tuning
# Ensemble algorithm tuning
# Tune RF
param_grid={}
param_grid['n_estimators']=[10,100,150,200]
model=RandomForestClassifier()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))
Output:
Best: 0.888236795100791 using {'n_estimators': 200}
0.779025 (0.007910) with {'n_estimators': 10}
0.882496 (0.012405) with {'n_estimators': 100}
0.887982 (0.010867) with {'n_estimators': 150}
0.888237 (0.009727) with {'n_estimators': 200}
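As with C for logistic regression, the best value found (n_estimators=200) lies on the upper edge of the grid, so the same extend-and-repeat strategy could be applied; a minimal sketch with illustrative values (results not shown, and note that the random forest's accuracy is still well below the tuned LR's 0.939):
# Extend the n_estimators grid because the previous optimum sat on the boundary (200)
param_grid = {'n_estimators': [200, 300, 500]}
model = RandomForestClassifier()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))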
5) Finalize the model
# Finalize the model: train logistic regression with the tuned parameter C=15
model=LogisticRegression(C=15)
model.fit(X=X_train_counts_tf,y=dataset_train.target)
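# Transform the test documents with the TF-IDF vectorizer that was fitted on the training data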
X_test_counts=tf_transformer.transform(dataset_test.data)
predictions=model.predict(X_test_counts)
print(accuracy_score(dataset_test.target,predictions))
print(classification_report(dataset_test.target,predictions))
Output:
0.8844163312248419
             precision    recall  f1-score   support

          0       0.85      0.79      0.82       319
          1       0.78      0.84      0.81       392
          2       0.86      0.88      0.87       385
          3       0.91      0.89      0.90       395
          4       0.81      0.90      0.86       390
          5       0.91      0.91      0.91       396
          6       0.97      0.95      0.96       398
          7       0.94      0.97      0.96       397
          8       0.97      0.94      0.96       396
          9       0.92      0.89      0.91       396
         10       0.93      0.95      0.94       394
         11       0.86      0.93      0.89       398
         12       0.91      0.77      0.84       310
         13       0.70      0.62      0.65       251

avg / total       0.89      0.88      0.88      5217
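Because the test data must be transformed with the same vectorizer that was fitted on the training data, the preprocessing step and the tuned classifier can also be bundled into a single estimator; a minimal sketch using sklearn's Pipeline, with joblib persistence and an illustrative file name:
from sklearn.pipeline import Pipeline
import joblib

# Bundle the TF-IDF step and the tuned LR model into one estimator
final_model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', decode_error='ignore')),
    ('lr', LogisticRegression(C=15)),
])
final_model.fit(dataset_train.data, dataset_train.target)
print(accuracy_score(dataset_test.target, final_model.predict(dataset_test.data)))

# Persist the whole pipeline so it can be reloaded later without retraining
joblib.dump(final_model, 'text_clf_lr.joblib')  # illustrative file name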