sklearn GridSearchCV: Grid Search for Parameter Tuning
阿新 • Published: 2018-12-15
Grid Search
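GridSearchCV is a parameter-tuning method. When a model performs poorly, you can use it to adjust the parameters: it loops over every parameter combination, tries each one, and returns the combination with the best score. Take the support vector machine parameters C and gamma as an example. When we do not know which values work best, we can let grid search choose them. We set the candidate values for both C and gamma to [0.001, 0.01, 0.1, 1, 10, 100]. Every value of one parameter is paired with every value of the other, so the loop walks through the cells of a grid, which is why it is called grid search.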
| | C=0.001 | C=0.01 | C=0.1 | C=1 | C=10 | C=100 |
|---|---|---|---|---|---|---|
| gamma=0.001 | SVC(gamma=0.001, C=0.001) | … | … | … | … | … |
| gamma=0.01 | SVC(gamma=0.01, C=0.001) | … | … | … | … | … |
| … | … | … | … | … | … | … |
| … | … | … | … | … | … | … |
| gamma=10 | SVC(gamma=10, C=0.001) | … | … | … | … | … |
| gamma=100 | SVC(gamma=100, C=0.001) | … | … | … | … | … |
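The grid in the table can be enumerated directly. The following is a minimal sketch using only the standard library; the names `values` and `grid` are chosen here for illustration and do not come from sklearn.

```python
from itertools import product

# Candidate values shared by both parameters, as in the table above
values = [0.001, 0.01, 0.1, 1, 10, 100]

# One dict per cell of the grid: every gamma paired with every C
grid = [{"gamma": gamma, "C": C} for gamma, C in product(values, values)]

print(len(grid))  # 36 combinations (6 x 6)
print(grid[0])    # {'gamma': 0.001, 'C': 0.001}
```

Each dict in `grid` has the right shape to be passed to `SVC(**params)`.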
Let's walk through the tuning in code:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Training set size: %d  Test set size: %d" % (len(X_train), len(X_test)))

# Start the grid search
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # Train one SVC for each parameter combination
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # Score the model on the test set
        score = svm.score(X_test, y_test)
        # Keep the combination with the highest score so far
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

print("best_score:{:.2f}".format(best_score))
print("best_parameters:{}".format(best_parameters))

Output:
Training set size: 112  Test set size: 38
best_score:0.97
best_parameters:{'gamma': 0.001, 'C': 100}
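The problem: after the original data is split into a training set and a test set, the test set plays two roles. It is used to tune the parameters and also to evaluate the model, so the reported score tends to look better than the real performance. (We chose the parameters by scoring on the test set, whereas the goal is to apply the trained model to data it has never seen.)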
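The fix: split the data into three parts: a training set (to fit the model), a validation set (to tune the parameters), and a test set (to evaluate the final model).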
The code is as follows:
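# First split off the test set, then split the rest into training and validation sets.
# No random_state is set here, so the splits (and the scores) vary between runs.
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)
print("Training set size: %d  Validation set size: %d  Test set size: %d"
      % (len(X_train), len(X_val), len(X_test)))

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # Tune on the validation set, not the test set
        score = svm.score(X_val, y_val)
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

# Retrain on training + validation data with the best parameters,
# then evaluate once on the untouched test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("best_score:{:.2f}".format(best_score))
print("best_parameters:{}".format(best_parameters))
print("test_score:{:.2f}".format(test_score))

Output:
Training set size: 84  Validation set size: 28  Test set size: 38
best_score:1.00
best_parameters:{'gamma': 0.001, 'C': 100}
test_score:0.95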
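A further improvement: the result above depends on one particular train/validation split, so the chosen parameters can overfit to that split. To guard against this, we use cross-validation.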
Grid Search with Cross Validation (GridSearchCV)
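Before looping over the whole grid, it can help to see what cross-validation returns for a single parameter setting. The sketch below is only illustrative: the values gamma=0.1 and C=1 are picked arbitrarily, and for brevity it cross-validates on the full iris data rather than on a train/validation subset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

# One fixed parameter setting, chosen only for illustration
svm = SVC(gamma=0.1, C=1)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring
scores = cross_val_score(svm, iris.data, iris.target, cv=5)

print("per-fold accuracy:", scores)
print("mean accuracy: {:.2f}".format(scores.mean()))
```

The mean of the per-fold scores is the number that the grid search compares across parameter combinations.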
from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        # 5-fold cross-validation on the training + validation data
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        # Use the mean accuracy over the folds as the score
        score = scores.mean()
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

# Retrain with the best parameters, then evaluate once on the test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("best_score:{:.2f}".format(best_score))
print("best_parameters:{}".format(best_parameters))
print("test_score:{:.2f}".format(test_score))

Output:
best_score:0.97
best_parameters:{'gamma': 0.1, 'C': 1}
test_score:0.95
To make tuning more convenient, sklearn provides the GridSearchCV class, which wraps the fit and score steps shown above.
from sklearn.model_selection import GridSearchCV

# Candidate values for each parameter to search (given as lists)
param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
# The estimator; set any fixed parameters that are not being searched here
estimator = SVC()
grid_search = GridSearchCV(estimator, param_grid, cv=5)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=10)
# Cross-validates every combination on the training data
grid_search.fit(X_train, y_train)
print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))
print("Best parameters:{}".format(grid_search.best_params_))
print("Test set score:{:.2f}".format(grid_search.score(X_test, y_test)))

Output:
Best cross-validation score:0.98
Best parameters:{'gamma': 0.1, 'C': 10}
Test set score:0.97
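Beyond the best score and parameters, a fitted GridSearchCV also keeps the per-combination results. The following sketch repeats the setup above so that it runs on its own, then reads the `cv_results_` and `best_estimator_` attributes. The variable names `results`, `ranked`, and `best_model` are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=10)

param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is a model retrained
# on all of X_train using best_params_
best_model = grid_search.best_estimator_
print(best_model)

# cv_results_ holds one entry per parameter combination
results = grid_search.cv_results_
ranked = sorted(zip(results["mean_test_score"], results["params"]),
                key=lambda pair: pair[0], reverse=True)

# Show the three combinations with the highest mean cross-validation score
for mean_score, params in ranked[:3]:
    print("{:.3f} {}".format(mean_score, params))
```

Looking at the runners-up is a quick way to check whether the best combination sits at the edge of the grid, which may suggest widening the search range.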
Summary
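GridSearchCV finds the best parameter combination within the ranges we give it. The more parameters and candidate values param_grid contains, the more combinations there are and the longer the search takes, so GridSearchCV is best suited to small datasets.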
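The cost can be estimated before running the search. As a rough sketch, the number of model fits is the number of parameter combinations times the number of folds; `ParameterGrid` from sklearn is used here only to count the combinations, and the names `n_combinations` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = 5

# Number of distinct parameter combinations in the grid
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per fold
n_fits = n_combinations * cv

print(n_combinations)  # 36
print(n_fits)          # 180
```

With the default refit=True, GridSearchCV performs one additional fit on the whole training set at the end. Adding a third parameter with six candidate values would multiply the count by six.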