《深度學習Python實踐》, Chapter 22: A Text Classification Example
By 阿新, published 2019-01-23
The code is as follows:
1) Algorithm comparison
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt
categories=['alt.atheism',
'rec.sport.hockey',
'sci.crypt',
'comp.sys.ibm.pc.hardware',
'sci.med',
'comp.sys.mac.hardware',
'sci.space',
'comp.windows.x',
'soc.religion.christian',
'misc.forsale',
'talk.politics.guns',
'rec.autos',
'talk.politics.mideast',
'rec.motorcycles',
'talk.politics.misc',
'rec.sport.baseball',
'talk.religion.misc']
# Load the training data
train_path='/home/duan/下載/20news-bydate/20news-bydate-train'
dataset_train=load_files(container_path=train_path,categories=categories)
# Load the evaluation (test) data
test_path='/home/duan/下載/20news-bydate/20news-bydate-test'
dataset_test=load_files(container_path=test_path,categories=categories)
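# Optional sanity check: load_files returns a Bunch, so the size of each split
# and the category names actually found on disk can be inspected directly.
print(len(dataset_train.data), len(dataset_test.data))
print(dataset_train.target_names)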
# Data preparation and understanding
# Compute word counts (term frequencies)
count_vect=CountVectorizer(stop_words='english',decode_error='ignore')
X_train_counts=count_vect.fit_transform(dataset_train.data)
# Inspect the data dimensions
# The word-count matrix has the following shape:
print(X_train_counts.shape)
# Compute TF-IDF features
tf_transformer=TfidfVectorizer(stop_words='english',decode_error='ignore')
X_train_counts_tf=tf_transformer.fit_transform(dataset_train.data)
print(X_train_counts_tf.shape)
# The two approaches above extract text features in different ways, and the dimensions of the resulting matrices were checked.
# Next, the TF-IDF features are used to train the classification models.
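# Quick illustration on a toy corpus of how the two extractors differ:
# CountVectorizer stores raw term counts, while TfidfVectorizer downweights
# terms that occur in many documents. (Toy documents for illustration only.)
toy_docs = ['the cat sat on the mat', 'the dog sat on the log']
print(CountVectorizer().fit_transform(toy_docs).toarray())
print(TfidfVectorizer().fit_transform(toy_docs).toarray().round(2))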
# Evaluate algorithms
# Set up the evaluation baseline
num_folds=10
seed=7
scoring='accuracy'
# Linear algorithm: LR
# Nonlinear algorithms: CART, SVM, MNB, KNN
models={}
models['LR']=LogisticRegression()
models['SVM']=SVC()
models['CART']=DecisionTreeClassifier()
models['MNB']=MultinomialNB()
models['KNN']=KNeighborsClassifier()
# Compare the algorithms
results=[]
for key in models:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(models[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))
Output:
(7838, 77172)
(7838, 77172)
KNN: 0.824575 (0.012700)
LR: 0.920900 (0.008155)
CART: 0.703240 (0.013782)
MNB: 0.896786 (0.009055)
SVM: 0.062772 (0.004306)
Compare the algorithms with a box plot:
# Box plot comparing the algorithms' 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
plt.show()
Output (box plot):
The chart shows that the Naive Bayes classifier's scores are tightly concentrated (low dispersion), while logistic regression shows a larger skew. The dispersion of an algorithm's results reflects how well the algorithm suits the data, so logistic regression and the Naive Bayes classifier are studied further through parameter tuning.
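The spread visible in the box plot can also be summarized numerically from the results list collected above; a minimal sketch, assuming numpy is available and relying on models and results being filled in the same loop order:
import numpy as np
# Median and interquartile range of each algorithm's cross-validation scores
for key, cv_result in zip(models, results):
    iqr = np.percentile(cv_result, 75) - np.percentile(cv_result, 25)
    print('%s: median=%.4f IQR=%.4f' % (key, np.median(cv_result), iqr))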
2) Algorithm tuning
The analysis above shows that LR and MNB are worth optimizing further. Their parameters are tuned below to further improve accuracy.
(1) Tuning logistic regression
The hyperparameter of logistic regression is C, which constrains the objective function: the smaller C is, the stronger the regularization. To tune C, evaluate a fixed set of candidate values in each round; if the best value lands on the boundary of the grid, repeat the step with an extended grid until the optimal value is found.
# Algorithm tuning
# Tune LR
param_grid={}
param_grid['C']=[0.1,5,13,15]
model=LogisticRegression()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
Output:
Best: 0.9393978055626435 using {'C': 15}
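Note that the best value found, C=15, lies on the upper boundary of the candidate grid, so by the strategy described above the grid can be extended and the search repeated; a minimal sketch with illustrative candidate values (results not shown here):
# Extend the grid upward because the previous optimum sat on the boundary (C=15)
param_grid = {'C': [15, 20, 30, 50, 100]}
model = LogisticRegression()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))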
(2) Tuning Naive Bayes
Naive Bayes has an alpha parameter, a smoothing parameter with a default value of 1.0.
This parameter can be tuned to improve the accuracy of the algorithm.
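The role of alpha can be seen from the smoothed word-probability estimate used by MultinomialNB, P(w|c) = (N_wc + alpha) / (N_c + alpha * n_features); a toy calculation with made-up counts shows how strongly it affects words unseen in a class:
# Toy counts for illustration: a word never seen in the class, 1000 total words
# in the class, and a vocabulary of 50000 features
count_wc, total_c, n_features = 0, 1000, 50000
for alpha in (0.001, 0.01, 0.1, 1.0):
    p = (count_wc + alpha) / (total_c + alpha * n_features)
    print('alpha=%s -> P(unseen word | class) = %.2e' % (alpha, p))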
# Algorithm tuning
# Tune MNB
param_grid={}
param_grid['alpha']=[0.001,0.01,0.1,1.5]
model=MultinomialNB()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))
Output:
Best: 0.934804797142128 using {'alpha': 0.01}
0.929829 (0.008380) with {'alpha': 0.001}
0.934805 (0.008096) with {'alpha': 0.01}
0.928043 (0.008024) with {'alpha': 0.1}
0.889640 (0.010375) with {'alpha': 1.5}
The optimal parameter for MNB is alpha=0.01 (best: 0.934804797142128 using {'alpha': 0.01}).
The optimal parameter for LR is C=15 (best: 0.9393978055626435 using {'C': 15}).
Tuning shows that LR with C=15 achieves the best accuracy. Next, ensemble algorithms are examined.
3) Ensemble algorithms
Random Forest (RF)
AdaBoost (AB)
ensembles={}
ensembles['RF']=RandomForestClassifier()
ensembles['AB']=AdaBoostClassifier()
# Compare the ensemble algorithms
results=[]
for key in ensembles:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(ensembles[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))
Output:
RF: 0.773795 (0.017244)
AB: 0.620055 (0.017638)
Box plot:
# Box plot comparing the ensemble algorithms' 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()
The box plot shows that the random forest scores are fairly evenly distributed and that it suits the data relatively well, so it is the more promising candidate for further optimization.
4) Ensemble algorithm tuning
# Ensemble algorithm tuning
# Tune RF
param_grid={}
param_grid['n_estimators']=[10,100,150,200]
model=RandomForestClassifier()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))
Output:
Best: 0.888236795100791 using {'n_estimators': 200}
0.779025 (0.007910) with {'n_estimators': 10}
0.882496 (0.012405) with {'n_estimators': 100}
0.887982 (0.010867) with {'n_estimators': 150}
0.888237 (0.009727) with {'n_estimators': 200}
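As with C for logistic regression, the best value found (n_estimators=200) lies on the upper edge of the grid, so the same extend-and-repeat strategy could be applied; a minimal sketch with illustrative values (results not shown, and note that the random forest's accuracy is still well below the tuned LR's 0.939):
# Extend the n_estimators grid because the previous optimum sat on the boundary (200)
param_grid = {'n_estimators': [200, 300, 500]}
model = RandomForestClassifier()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))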
5) Finalize the model
# Finalize the model: train logistic regression with the tuned parameter C=15
model=LogisticRegression(C=15)
model.fit(X=X_train_counts_tf,y=dataset_train.target)
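# Transform the test documents with the TF-IDF vectorizer that was fitted on the training data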
X_test_counts=tf_transformer.transform(dataset_test.data)
predictions=model.predict(X_test_counts)
print(accuracy_score(dataset_test.target,predictions))
print(classification_report(dataset_test.target,predictions))
Output:
0.8844163312248419
             precision    recall  f1-score   support

          0       0.85      0.79      0.82       319
          1       0.78      0.84      0.81       392
          2       0.86      0.88      0.87       385
          3       0.91      0.89      0.90       395
          4       0.81      0.90      0.86       390
          5       0.91      0.91      0.91       396
          6       0.97      0.95      0.96       398
          7       0.94      0.97      0.96       397
          8       0.97      0.94      0.96       396
          9       0.92      0.89      0.91       396
         10       0.93      0.95      0.94       394
         11       0.86      0.93      0.89       398
         12       0.91      0.77      0.84       310
         13       0.70      0.62      0.65       251

avg / total       0.89      0.88      0.88      5217
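Because the test data must be transformed with the same vectorizer that was fitted on the training data, the preprocessing step and the tuned classifier can also be bundled into a single estimator; a minimal sketch using sklearn's Pipeline, with joblib persistence and an illustrative file name:
from sklearn.pipeline import Pipeline
import joblib

# Bundle the TF-IDF step and the tuned LR model into one estimator
final_model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', decode_error='ignore')),
    ('lr', LogisticRegression(C=15)),
])
final_model.fit(dataset_train.data, dataset_train.target)
print(accuracy_score(dataset_test.target, final_model.predict(dataset_test.data)))

# Persist the whole pipeline so it can be reloaded later without retraining
joblib.dump(final_model, 'text_clf_lr.joblib')  # illustrative file name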