Sklearn Pipeline Cross-Validation and Hyperparameter Grid Search: A Hands-On Starter Case (Big-Data ML Sample-Set Case Studies)
Published: 2018-12-23
Copyright notice: This technical column is the author's (Qin Kaixin) summary and distillation of day-to-day work, drawing its cases from real commercial environments and including tuning advice for commercial applications and cluster capacity planning. Please keep following this blog series. QQ email: [email protected]; feel free to reach out for any technical exchange.
1 Basic Data Exploration
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
X = pd.read_csv('C:\\ML\\MLData\\iris.data', header=None)  # iris.data ships without a header row
X.columns = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm', 'class']
X.head()
X.sample(n=10)
X.shape
(150, 5)
X.dtypes
sepal_length_cm float64
sepal_width_cm float64
petal_length_cm float64
petal_width_cm float64
class object
dtype: object
X.describe()
2 Exploratory Analysis with Visualization
- Box plots to inspect outliers

X.plot(kind="box", subplots=True, layout=(1,4), figsize=(12,5))
plt.show()
- Histograms

X.hist(figsize=(12,5), xlabelsize=1, ylabelsize=1)
plt.show()
- Density plots

X.plot(kind="density", subplots=True, layout=(1,4), figsize=(12,5))
plt.show()
- Correlation heatmap

col_name = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']  # numeric columns only; corr() cannot use the 'class' column
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
cax = ax.matshow(X[col_name].corr(), vmin=-1, vmax=1, interpolation="none")
fig.colorbar(cax)
ticks = np.arange(0, 4, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_name)
ax.set_yticklabels(col_name)
plt.show()
3 Splitting the Data
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
all_inputs = X[['sepal_length_cm', 'sepal_width_cm',
                'petal_length_cm', 'petal_width_cm']].values
all_classes = X['class'].values
(training_inputs,
testing_inputs,
training_classes,
testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75, random_state=1)
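On a small balanced dataset like iris, passing `stratify` keeps the class proportions identical in both splits, which makes the later cross-validation scores less noisy. A minimal sketch, using `sklearn.datasets.load_iris` as a self-contained stand-in for the local CSV above:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris stands in for the local iris.data file so the sketch runs anywhere
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    train_size=0.75,
    stratify=iris.target,   # preserve the 50/50/50 class balance in both splits
    random_state=1)

print(Counter(y_tr))  # near-equal counts per class in the training split
```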
4 Evaluating Multiple Classifiers Together
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
models = []
models.append(("AB",AdaBoostClassifier()))
models.append(("GBM",GradientBoostingClassifier()))
models.append(("RF",RandomForestClassifier()))
models.append(("ET",ExtraTreesClassifier()))
models.append(("SVC",SVC()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("LR",LogisticRegression()))
models.append(("GNB",GaussianNB()))
models.append(("LDA",LinearDiscriminantAnalysis()))
names = []
results = []
for name, model in models:
    result = cross_val_score(model, training_inputs, training_classes, scoring="accuracy", cv=5)
    names.append(name)
    results.append(result)
    print("{} Mean:{:.4f}(Std{:.4f})".format(name, result.mean(), result.std()))
AB Mean:0.9097(Std0.0290)
GBM Mean:0.9370(Std0.0361)
RF Mean:0.9461(Std0.0442)
ET Mean:0.9370(Std0.0361)
SVC Mean:0.9640(Std0.0340)
KNN Mean:0.9374(Std0.0454)
LR Mean:0.9379(Std0.0353)
GNB Mean:0.9556(Std0.0391)
LDA Mean:0.9735(Std0.0360)
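The per-fold scores collected in `results` are easiest to compare as side-by-side box plots rather than just mean and standard deviation. A minimal sketch that recomputes scores for two of the models on `load_iris` (standing in for the CSV data, since the arrays above are not reproduced here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the CSV-loaded data above
names, results = [], []
for name, model in [("KNN", KNeighborsClassifier()),
                    ("LDA", LinearDiscriminantAnalysis())]:
    results.append(cross_val_score(model, X, y, scoring="accuracy", cv=5))
    names.append(name)

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(results)            # one box per model's five fold scores
ax.set_xticks([1, 2])
ax.set_xticklabels(names)
ax.set_ylabel("accuracy")
fig.savefig("cv_comparison.png")
```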
5 Pipeline Cross-Validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = []
pipeline.append(("ScalerET", Pipeline([("Scaler",StandardScaler()),
("ET",ExtraTreesClassifier())])))
pipeline.append(("ScalerGBM", Pipeline([("Scaler",StandardScaler()),
("GBM",GradientBoostingClassifier())])))
pipeline.append(("ScalerRF", Pipeline([("Scaler",StandardScaler()),
("RF",RandomForestClassifier())])))
names = []
results = []
for name, model in pipeline:
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle=True is required when random_state is set
    result = cross_val_score(model, training_inputs, training_classes, cv=kfold, scoring="accuracy")
    results.append(result)
    names.append(name)
    print("{}: Mean:{:.4f} (Std:{:.4f})".format(
        name, result.mean(), result.std()))
ScalerET: Mean:0.9372 (Std:0.0358)
ScalerGBM: Mean:0.9462 (Std:0.0332)
ScalerRF: Mean:0.9553 (Std:0.0275)
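The pipelines above also compose with the grid search of the next section: a step's hyperparameters are addressed by prefixing the step name plus a double underscore (`SVC__C`), so scaling happens inside every CV fold with no leakage. A minimal sketch, again using `load_iris` in place of the CSV data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the CSV-loaded data above
pipe = Pipeline([("Scaler", StandardScaler()), ("SVC", SVC())])

# the "SVC__" prefix routes each parameter to the SVC step of the pipeline
param_grid = {"SVC__C": [0.1, 1.0], "SVC__kernel": ["linear", "rbf"]}
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=kfold)
grid.fit(X, y)
print(grid.best_params_)
```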
6 Hyperparameter Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
"C":[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0],
"kernel":['linear', 'poly', 'rbf', 'sigmoid']
}
model = SVC()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle=True is required when random_state is set
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring="accuracy", cv=kfold)
grid_result = grid.fit(training_inputs,training_classes)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Best: 0.972973 using {'C': 0.9, 'kernel': 'linear'}
0.954955 (0.027681) with: {'C': 0.1, 'kernel': 'linear'}
0.927928 (0.021620) with: {'C': 0.1, 'kernel': 'poly'}
0.945946 (0.016821) with: {'C': 0.1, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.963964 (0.017933) with: {'C': 0.3, 'kernel': 'linear'}
0.954955 (0.028629) with: {'C': 0.3, 'kernel': 'poly'}
0.954955 (0.027681) with: {'C': 0.3, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 0.3, 'kernel': 'sigmoid'}
0.963964 (0.017933) with: {'C': 0.5, 'kernel': 'linear'}
0.954955 (0.028629) with: {'C': 0.5, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 0.5, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 0.5, 'kernel': 'sigmoid'}
0.963964 (0.017933) with: {'C': 0.7, 'kernel': 'linear'}
0.963964 (0.033773) with: {'C': 0.7, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 0.7, 'kernel': 'rbf'}
0.342342 (0.045336) with: {'C': 0.7, 'kernel': 'sigmoid'}
0.972973 (0.021914) with: {'C': 0.9, 'kernel': 'linear'}
0.963964 (0.033773) with: {'C': 0.9, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 0.9, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 0.9, 'kernel': 'sigmoid'}
0.972973 (0.021914) with: {'C': 1.0, 'kernel': 'linear'}
0.963964 (0.033773) with: {'C': 1.0, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 1.0, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 1.0, 'kernel': 'sigmoid'}
0.972973 (0.021914) with: {'C': 1.3, 'kernel': 'linear'}
0.963964 (0.033773) with: {'C': 1.3, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 1.3, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 1.3, 'kernel': 'sigmoid'}
0.972973 (0.021914) with: {'C': 1.5, 'kernel': 'linear'}
0.963964 (0.033773) with: {'C': 1.5, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 1.5, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 1.5, 'kernel': 'sigmoid'}
0.972973 (0.021914) with: {'C': 1.7, 'kernel': 'linear'}
0.954955 (0.028629) with: {'C': 1.7, 'kernel': 'poly'}
0.963964 (0.017933) with: {'C': 1.7, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 1.7, 'kernel': 'sigmoid'}
0.963964 (0.017933) with: {'C': 2.0, 'kernel': 'linear'}
0.954955 (0.028629) with: {'C': 2.0, 'kernel': 'poly'}
0.954955 (0.027681) with: {'C': 2.0, 'kernel': 'rbf'}
0.351351 (0.049646) with: {'C': 2.0, 'kernel': 'sigmoid'}
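The grid scores above are cross-validation scores on the training split only; the final sanity check is to score the refit best model on the held-out test data. A minimal sketch (with `load_iris` and a small grid standing in for the split and grid defined above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the CSV-loaded data above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(SVC(), {"C": [0.9, 1.0], "kernel": ["linear", "rbf"]},
                    scoring="accuracy", cv=kfold)
grid.fit(X_tr, y_tr)

# best_estimator_ is already refit on the full training split by GridSearchCV
test_acc = grid.best_estimator_.score(X_te, y_te)
print("held-out accuracy: {:.4f}".format(test_acc))
```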
Summary
There is no flashy technology in this article; its value lies in bringing several scenarios together: evaluating multiple classifiers at once, pipeline cross-validation, and hyperparameter grid search.
Qin Kaixin, in Shenzhen