Model Tuning: Cross-Validation and Hyperparameter Search (Review 17)

阿新 • Published: 2019-02-19

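Before evaluating a model on the test set, we usually want to use the data already at hand to tune the model as far as possible, and even to get a rough estimate of the test result. The usual approach is to split the available data by sampling: one part is used to fit the model parameters (the training set), and another part is used to tune the model configuration and feature selection and to estimate the still-unknown test performance (the validation set).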
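As a rough illustration of this kind of split, the sketch below carves the iris dataset into training, validation, and test portions with `train_test_split`. The dataset, the 60/20/20 proportions, and the variable names are assumptions chosen for the example, not something the original post prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set; it is not touched during tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Split the remaining data into a training set and a validation set
# (0.25 of the remaining 80% equals 20% of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0, stratify=y_dev)

print(len(X_train), len(X_val), len(X_test))
```

The model parameters would be fitted on `X_train`, configurations compared on `X_val`, and `X_test` kept for the final evaluation only.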

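Cross-validation ensures that every sample is used for both training and validation, which makes the measured performance of the tuned model as trustworthy as possible. The figure below illustrates ten-fold cross-validation.
[Figure: schematic of ten-fold cross-validation]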
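Ten-fold cross-validation can be tried out with `KFold` and `cross_val_score`. This is a minimal sketch that assumes the iris dataset and a linear `SVC` purely for illustration; the shuffling and the `random_state` value are also choices made for this example.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Ten folds: each sample is used for validation exactly once
# and for training in the other nine rounds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(svm.SVC(kernel='linear', C=1), X, y, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy: %0.3f" % scores.mean())
```

The mean over the ten folds is the usual single-number summary; the spread across folds gives a feel for how stable the estimate is.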

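A model's hyperparameters are the configuration settings chosen for an experiment rather than learned from the data. Grid search tunes combinations of hyperparameters, and because each combination is evaluated independently, the search can run in parallel. Since the hyperparameter space is unbounded, a search can only find a "better" configuration, never a guaranteed optimum. In practice, grid search performs a brute-force sweep over a finite grid of hyperparameter combinations. Each combination is plugged into the learner to produce a new model, and to compare these models fairly, each one is evaluated with cross-validation on the same set of training and validation splits.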
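To make the brute-force idea concrete, here is a hand-written sketch of the loop that a grid search performs: enumerate every combination, cross-validate each one on the same folds, and keep the best. It uses `ParameterGrid` and `cross_val_score` from scikit-learn; the iris data, the grid values, and `cv=5` are assumptions for the example.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = load_iris(return_X_y=True)
param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}

best_score, best_params = -1.0, None
# ParameterGrid yields one dict per combination (2 kernels x 2 C values).
for params in ParameterGrid(param_grid):
    model = svm.SVC(**params)
    # An integer cv gives unshuffled stratified folds for a classifier,
    # so every candidate is scored on identical splits.
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, "%0.3f" % best_score)
```

`GridSearchCV` wraps this same pattern and adds parallel execution, result bookkeeping, and a final refit on all the data.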

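from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

# Load the iris dataset as a feature matrix X and a label vector y.
iris = load_iris()
X, y = iris.data, iris.target

svc = svm.SVC()
# Candidate values: 2 kernels x 2 values of C = 4 combinations.
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
# verbose=10 prints progress for every fit.
grid_search = GridSearchCV(svc, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)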
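Beyond `best_estimator_`, a fitted `GridSearchCV` also exposes the winning parameters, their cross-validated score, and a per-combination summary. The sketch below repeats the same search on iris and prints those attributes; `cv=5` is set explicitly here as an assumption so the fold count does not depend on the installed scikit-learn version.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
grid_search = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=5)
grid_search.fit(X, y)

print("best params:", grid_search.best_params_)
print("best CV score: %0.3f" % grid_search.best_score_)

# cv_results_ holds one entry per hyperparameter combination.
results = grid_search.cv_results_
for params, mean in zip(results['params'], results['mean_test_score']):
    print("%0.3f  %r" % (mean, params))
```

Looking at the whole table, rather than only the winner, shows whether the best combination is clearly ahead or just one of several near-ties.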


from __future__ import print_function
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

categories = ['alt.atheism','talk.religion.misc']
print("Loading 20 newsgroups dataset for categories:", categories)
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents,%d categories" % (len(data.filenames),len(data.target_names)))

# Define a pipeline combining a text feature extractor with a simple classifier
pipeline = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', SGDClassifier())])

parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    #'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__max_iter': (10, 50, 80),  # called n_iter before scikit-learn 0.19
}
# find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10)  # use all CPU cores in parallel
grid_search.fit(data.data, data.target)

print("Best score: %0.3f" % grid_search.best_score_)
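The keys in `parameters` follow the pipeline convention `<step name>__<parameter name>`. As a small sketch that needs no dataset download, the snippet below rebuilds the same three-step pipeline, lists the names it accepts via `get_params()`, and sets two of them by hand, which is roughly what the grid search does for each candidate. The variable name `tunable` is just an illustrative choice.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])

# Nested parameters are addressed as "<step>__<parameter>".
tunable = sorted(name for name in pipeline.get_params() if '__' in name)
print(tunable)

# Setting values by these names reaches into the individual steps.
pipeline.set_params(vect__ngram_range=(1, 2), clf__penalty='elasticnet')
print(pipeline.named_steps['vect'].ngram_range)
print(pipeline.named_steps['clf'].penalty)
```

Any name printed in `tunable` can be used as a key in the grid; a misspelled key makes the search fail with an invalid-parameter error instead of being silently ignored.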

