k-Nearest Neighbors 8: Cross-Validation and Grid Search for Model Tuning

阿新 • Published: 2021-09-13

1 What is cross-validation

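Cross-validation: the training data you have is split further into a training part and a validation part. In the example from the original figure, the data is divided into 4 parts and one part serves as the validation set. The test is then repeated 4 times, each time with a different part held out for validation. This yields 4 model results, and their average is taken as the final result. This is called 4-fold cross-validation.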
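The 4-fold procedure described above can be sketched by hand with scikit-learn's KFold. This is only an illustration: the iris dataset, the KNN model with n_neighbors=5, and the random seeds are assumptions chosen to match the case study later in this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Hold out a test set first; cross-validation only touches the training data.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

# Split the training data into 4 folds; each fold is the validation set once.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(x_train):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(x_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(x_train[val_idx], y_train[val_idx]))

# The final cross-validation result is the average over the 4 folds.
mean_score = sum(fold_scores) / len(fold_scores)
print("Validation accuracy per fold:", fold_scores)
print("Mean validation accuracy:", mean_score)
```

Each sample in the training data is used for validation exactly once, and the test set is never seen during this loop.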

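1.1 Analysis

To make the result obtained from training more reliable, the data is handled as follows:

Training set: split further into a training set and a validation set
Test set: remains the test set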

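1.2 Why cross-validation is needed

Purpose of cross-validation: to make the evaluation of the model more accurate and trustworthy.
Note: cross-validation does not by itself improve the model's accuracy. It only gives a more reliable estimate of that accuracy.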
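To illustrate why the averaged estimate is more trustworthy, here is a small sketch that compares several single hold-out splits against 4-fold cross-validation using cross_val_score. The dataset, the n_neighbors=5 setting, and the seeds 0 to 4 are assumptions made for the example; the model itself is identical in every run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# A single hold-out split: the score depends on which samples
# happen to land in the validation part.
single_split_scores = []
for seed in range(5):
    x_tr, x_val, y_tr, y_val = train_test_split(
        iris.data, iris.target, test_size=0.25, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr)
    single_split_scores.append(model.score(x_val, y_val))

# 4-fold cross-validation: every sample is validated on once,
# and the reported figure is the average over the folds.
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), iris.data, iris.target, cv=4
)

print("Single-split scores:", single_split_scores)
print("Cross-validation scores:", cv_scores)
print("Cross-validation mean:", cv_scores.mean())
```

The model is no better after cross-validation than before; what changes is how much a single lucky or unlucky split can distort the reported accuracy.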

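2 What is grid search

Hyperparameters:
- In sklearn, parameters that must be specified manually before training (for example, the k of k-nearest neighbors) are called hyperparameters.
Grid search passes candidate values for these hyperparameters in as a dictionary, evaluates every combination, and selects the best one.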
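The idea can be sketched without GridSearchCV by enumerating the grid with ParameterGrid and scoring each combination with cross-validation. This is a rough illustration, not how the library does it internally; the second hyperparameter, weights, is an assumption added here only to show that a grid can span more than one parameter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Candidate values per hyperparameter ("weights" is added for illustration).
param_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

best_score, best_params = -1.0, None
# ParameterGrid yields every combination: 3 x 2 = 6 candidates here.
for params in ParameterGrid(param_grid):
    scores = cross_val_score(
        KNeighborsClassifier(**params), iris.data, iris.target, cv=3
    )
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

print("Best parameters:", best_params)
print("Best mean validation accuracy:", best_score)
```

The cost grows with the product of the candidate list lengths, which is why grids are usually kept small.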

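3 Cross-validation and grid search (model selection and tuning) API:

sklearn.model_selection.GridSearchCV(estimator, param_grid, cv=None) performs an exhaustive search over the specified parameter values for an estimator.

Parameters:

estimator: the estimator (model) to tune
param_grid: the hyperparameters to search, as a dict, e.g. {"n_neighbors": [1, 3, 5]}
cv: the number of folds for cross-validation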

Training

fit: takes the training data and runs the search
score: returns the accuracy on the given data

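Result analysis:

best_score_: the best mean validation score obtained in cross-validation
best_estimator_: the estimator fitted with the best parameters
cv_results_: the results for every candidate, with validation accuracy per fold (training accuracy is included only when return_train_score=True)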
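As a quick sketch of how these attributes can be read after fitting, the snippet below runs a small search on the iris data and prints a compact summary. The variable name search is an assumption for this example; best_params_ is an additional GridSearchCV attribute not listed above, and return_train_score=True is set explicitly so that the training scores appear in cv_results_.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=3,
    return_train_score=True,  # needed for the *_train_score keys
)
search.fit(iris.data, iris.target)

print("best_params_:", search.best_params_)
print("best_score_:", search.best_score_)
print("best_estimator_:", search.best_estimator_)

# cv_results_ is a dict of arrays with one entry per candidate.
results = search.cv_results_
for params, val_acc, train_acc in zip(
    results["params"], results["mean_test_score"], results["mean_train_score"]
):
    print(params, "validation:", round(val_acc, 4), "train:", round(train_acc, 4))
```

Despite the name, the mean_test_score key holds the validation score from cross-validation, not the score on a separately held-out test set.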

4 Iris case study: adding tuning of K

Build the estimator with GridSearchCV

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the data
iris = load_iris()
# 2. Basic data handling
# x_train, x_test, y_train, y_test are the training features, test features,
# training targets and test targets
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

# 3. Feature engineering: preprocessing
# 3.1 Standardization (fit on the training set only, then apply to the test set)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 4. Model training: KNN
# 4.1 Instantiate an estimator
estimator = KNeighborsClassifier(n_neighbors=5)
# 4.2 Model tuning: cross-validation and grid search
# (the n_neighbors values in param_dict override the value set above)
param_dict = {"n_neighbors": [1, 3, 5]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
# 4.3 Train the model with the estimator
estimator.fit(x_train, y_train)
# 5. Model evaluation
# 5.1 Output the predictions
y_pre = estimator.predict(x_test)
print("Predictions:\n", y_pre)
print("Comparing true values and predictions:\n", y_pre == y_test)
# 5.2 Compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

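Then inspect the finally selected result and the cross-validation results

print("Best result in cross-validation:\n", estimator.best_score_)
print("Best parameter model:\n", estimator.best_estimator_)
print("Accuracy results after each cross-validation:\n", estimator.cv_results_)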

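Final result

Note: the output below lists 38 test samples and 112 training samples across the three folds, so it appears to come from a run with a 25% test split on an older scikit-learn version rather than the test_size=0.2 used in the code above. Exact numbers will differ when the code is rerun.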

Comparing predictions and true values:
 [ True  True  True  True  True  True  True False  True  True  True  True
  True  True  True  True  True  True False  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
Accuracy computed directly:
 0.947368421053
Best result in cross-validation:
 0.973214285714
Best parameter model:
 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Accuracy results after each cross-validation:
 {'mean_fit_time': array([ 0.00114751,  0.00027037,  0.00024462]),
  'std_fit_time': array([  1.13901511e-03,   1.25300249e-05,   1.11011951e-05]),
  'mean_score_time': array([ 0.00085751,  0.00048693,  0.00045625]),
  'std_score_time': array([  3.52785082e-04,   2.87650037e-05,   5.29673344e-06]),
  'param_n_neighbors': masked_array(data = [1 3 5],
             mask = [False False False],
       fill_value = ?),
  'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}],
  'split0_test_score': array([ 0.97368421,  0.97368421,  0.97368421]),
  'split1_test_score': array([ 0.97297297,  0.97297297,  0.97297297]),
  'split2_test_score': array([ 0.94594595,  0.89189189,  0.97297297]),
  'mean_test_score': array([ 0.96428571,  0.94642857,  0.97321429]),
  'std_test_score': array([ 0.01288472,  0.03830641,  0.00033675]),
  'rank_test_score': array([2, 3, 1], dtype=int32),
  'split0_train_score': array([ 1.        ,  0.95945946,  0.97297297]),
  'split1_train_score': array([ 1.        ,  0.96      ,  0.97333333]),
  'split2_train_score': array([ 1.  ,  0.96,  0.96]),
  'mean_train_score': array([ 1.        ,  0.95981982,  0.96876877]),
  'std_train_score': array([ 0.        ,  0.00025481,  0.0062022 ])}