Machine Learning: Choosing sklearn Algorithm Parameters with Grid Search
Choosing parameters for many machine learning algorithms is tedious, and tuning them by hand is time-consuming. Fortunately, sklearn provides a grid-search facility for parameter selection. It works much like brute force: you specify candidate values for each parameter, and GridSearchCV tries every combination and reports the one that performs best.
We will reuse the Titanic survivor prediction dataset from the previous section, again with a random forest, to see how GridSearchCV is used.
First, set the parameters to tune and their candidate values:
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
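This grid defines 1 × 3 × 1 × 3 × 2 × 3 = 54 candidate combinations; with 3-fold cross-validation that means 162 model fits in total. If you want to check the size of a grid before launching the search, sklearn's ParameterGrid enumerates the same combinations (a minimal sketch, reusing the param_grid defined above):
from sklearn.model_selection import ParameterGrid
# Each element of ParameterGrid is one parameter combination (a dict);
# GridSearchCV will try every one of them
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)        # 54 combinations
print(n_candidates * 3)    # 162 fits with cv=3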
Next, initialize the algorithm we want to use, then run the grid search to find the best parameters:
# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
# Inspect the best parameter combination
print(grid_search.best_params_)
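Besides best_params_, the fitted GridSearchCV object also exposes the mean cross-validated score of the winning combination and the full per-candidate results. A short sketch of how to inspect them (assuming pandas is imported as pd, as in the full script below):
# Mean cross-validated accuracy of the best parameter combination
print(grid_search.best_score_)
# cv_results_ holds per-candidate scores and timings; a DataFrame makes
# it easy to rank the 54 candidates by mean test score
results = pd.DataFrame(grid_search.cv_results_)
print(results.sort_values("mean_test_score", ascending=False)[["params", "mean_test_score"]].head())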
Finally, train the model with the parameters found by the grid search:
# Train the model with the best parameters found by the grid search
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)
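Note that with GridSearchCV's default refit=True, best_estimator_ has already been refit on the full training set, so the fit call above is redundant (though harmless). You can even predict through the search object directly, which delegates to the refitted best estimator:
# Equivalent to best_forest.predict(X_test): GridSearchCV delegates
# predict() to the refitted best_estimator_
pred_test = grid_search.predict(X_test)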
The complete code:
# -*- coding: utf-8 -*-
# @Time    : 2018/12/14 9:59 AM
# @Author  : yangchen
# @FileName: gridsearch.py
# @Software: PyCharm
# @Blog    : https://blog.csdn.net/opp003/article

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('processed_titanic.csv', header=0)

# Split features and target
X = df.drop(["survived"], axis=1)
y = df["survived"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)

# Build the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}

# Initialize the model
forest = RandomForestClassifier()

# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Inspect the best parameter combination
print(grid_search.best_params_)

# Train the model with the best parameters found by the grid search
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)

# Predict
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Train accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))

# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("precision: {0:.2f}, recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, F1))

# Feature importances
features = list(X_test.columns)
importances = best_forest.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)

# Plot the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
plt.xticks(range(num_features), [features[i] for i in indices], rotation='45')
plt.xlim([-1, num_features])
plt.show()

# Print each feature's importance
for i in indices:
    print("{0} - {1:.3f}".format(features[i], importances[i]))
The output:
{'bootstrap': True, 'max_depth': 20, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 5}
Train accuracy: 0.86, test accuracy: 0.76
precision: 0.86, recall: 0.79, F1: 0.82
sex - 0.428
age - 0.294
fare - 0.204
sibsp - 0.036
embarked - 0.030
parch - 0.008
pclass - 0.000
Compared with the previous section, the results improve slightly. Grid search certainly makes tuning more convenient, but choosing sensible candidate values still requires some hands-on tuning experience from the modeler.