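Sklearn Pipeline Cross-Validation and Hyperparameter Grid Search: A Basic Hands-On Case (Big Data ML Sample-Set Case Studies)

阿新 • Published: 2018-12-23

Copyright notice: This technical column series is a summary and distillation of the day-to-day work of the author, Qin Kaixin (秦凱新). It draws cases from real commercial environments to summarize and share, and offers tuning advice for commercial applications along with cluster capacity planning. Please keep following this blog. QQ email: [email protected]. Feel free to get in touch for any technical discussion.

1 Basic data exploration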

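import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris data file (adjust the path to your environment).
# Note: iris.data has no header row, so read_csv treats the first record as
# the header and 149 rows remain; pass header=None to keep all 150 records.
X = pd.read_csv('C:\\ML\\MLData\\iris.data')
X.columns = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm', 'class']

X.head()        # first five rows
X.sample(n=10)  # ten random rows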

X.shape
(149, 5)

X.dtypes
sepal_length_cm    float64
sepal_width_cm     float64
petal_length_cm    float64
petal_width_cm     float64
class               object
dtype: object

X.describe()
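The shape (149, 5) reported above is one row short of the 150 records in the Iris data set: the file has no header row, so read_csv used the first record as column names. A minimal sketch of the difference, using a small in-memory sample rather than the real file (the three rows and the names `sample`, `lossy` and `full` are illustrative assumptions):

```python
import io
import pandas as pd

# Three records in the same layout as iris.data (no header row); illustrative only
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
cols = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm', 'class']

# Default behaviour: the first record becomes the header and is lost as data
lossy = pd.read_csv(io.StringIO(sample))
lossy.columns = cols

# header=None keeps every record; names= assigns the column labels directly
full = pd.read_csv(io.StringIO(sample), header=None, names=cols)

print(lossy.shape)  # (2, 5)
print(full.shape)   # (3, 5)
```

The results later in this article were produced with the 149-row frame, so loading all 150 rows would shift the reported scores slightly.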

2 Exploratory data visualization

Box plots to check for outliers

  X.plot(kind="box",subplots=True,layout=(1,4),figsize=(12,5))
  plt.show()

Histograms

  X.hist(figsize=(12,5),xlabelsize=1,ylabelsize=1)
  plt.show()

Density plots

  X.plot(kind="density",subplots=True,layout=(1,4),figsize=(12,5))
  plt.show()

Correlation heatmap

  # Correlate only the four numeric feature columns (the 'class' column is text)
  col_name = X.columns[:4]
  fig = plt.figure(figsize=(10,10))
  ax = fig.add_subplot(111)
  cax = ax.matshow(X[col_name].corr(),vmin=-1,vmax=1,interpolation="none")
  fig.colorbar(cax)
  ticks = np.arange(0,4,1)
  ax.set_xticks(ticks)
  ax.set_yticks(ticks)
  ax.set_xticklabels(col_name)
  ax.set_yticklabels(col_name)
  plt.show()

3 Splitting the data into training and test sets

    from sklearn.model_selection import KFold
    from sklearn.model_selection import train_test_split
    
    # iris_data is the DataFrame loaded in section 1 (named X there)
    iris_data = X
    
    all_inputs = iris_data[['sepal_length_cm', 'sepal_width_cm',
                             'petal_length_cm', 'petal_width_cm']].values
    
    all_classes = iris_data['class'].values
    
    # 75% of the rows for training, the rest for testing; random_state fixes the shuffle
    (training_inputs,
     testing_inputs,
     training_classes,
     testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75, random_state=1)
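train_test_split shuffles the rows at random, so on a small data set the class proportions in the two parts can drift. One option is the `stratify` argument, which keeps the class ratios roughly equal in both parts. A sketch, assuming scikit-learn's bundled copy of Iris (`load_iris`, all 150 rows) instead of the CSV file used above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data: 150 rows, 3 classes of 50 rows each
X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all preserves the class ratios in the training and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, train_size=0.75, random_state=1, stratify=y_all)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))  # per-class counts in each part
```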

4 Evaluating multiple classifiers side by side

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score
    
    # Candidate classifiers, all with default hyperparameters
    models = []
    models.append(("AB",AdaBoostClassifier()))
    models.append(("GBM",GradientBoostingClassifier()))
    models.append(("RF",RandomForestClassifier()))
    models.append(("ET",ExtraTreesClassifier()))
    models.append(("SVC",SVC()))
    models.append(("KNN",KNeighborsClassifier()))
    models.append(("LR",LogisticRegression()))
    models.append(("GNB",GaussianNB()))
    models.append(("LDA",LinearDiscriminantAnalysis()))
    
    names = []
    results = []
    
    # 5-fold cross-validated accuracy on the training set for each model
    for name,model in models:
        result = cross_val_score(model,training_inputs,training_classes,scoring="accuracy",cv=5)
        names.append(name)
        results.append(result)
        print("{}  Mean:{:.4f}(Std{:.4f})".format(name,result.mean(),result.std()))

Output:

    AB  Mean:0.9097(Std0.0290)
    GBM  Mean:0.9370(Std0.0361)
    RF  Mean:0.9461(Std0.0442)
    ET  Mean:0.9370(Std0.0361)
    SVC  Mean:0.9640(Std0.0340)
    KNN  Mean:0.9374(Std0.0454)
    LR  Mean:0.9379(Std0.0353)
    GNB  Mean:0.9556(Std0.0391)
    LDA  Mean:0.9735(Std0.0360)
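The loop above collects `names` and `results` but only prints them. One way to compare the models at a glance is to put the scores into a DataFrame and sort by mean accuracy. A sketch with three of the models, assuming scikit-learn's bundled Iris data so that it runs on its own (the numbers will not match the output above exactly; `rows` and `summary` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

models = [("KNN", KNeighborsClassifier()),
          ("GNB", GaussianNB()),
          ("LDA", LinearDiscriminantAnalysis())]

# One row per model: mean and standard deviation of the 5 fold accuracies
rows = []
for name, model in models:
    scores = cross_val_score(model, X_all, y_all, scoring="accuracy", cv=5)
    rows.append({"model": name, "mean": scores.mean(), "std": scores.std()})

# Best model first
summary = (pd.DataFrame(rows)
             .sort_values("mean", ascending=False)
             .reset_index(drop=True))
print(summary)
```

With an integer `cv` and a classifier, cross_val_score uses stratified folds, so each fold keeps roughly the same class balance.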

5 Pipeline cross-validation

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    # Each entry chains a StandardScaler with a tree-ensemble classifier
    pipeline = []
    pipeline.append(("ScalerET", Pipeline([("Scaler",StandardScaler()),
                                         ("ET",ExtraTreesClassifier())])))
    pipeline.append(("ScalerGBM", Pipeline([("Scaler",StandardScaler()),
                                           ("GBM",GradientBoostingClassifier())])))
    pipeline.append(("ScalerRF", Pipeline([("Scaler",StandardScaler()),
                                         ("RF",RandomForestClassifier())])))
    
    names = []
    results = []
    for name,model in pipeline:
        # random_state only applies when shuffle=True; recent scikit-learn
        # versions raise an error if it is set without shuffling, so it is omitted
        kfold = KFold(n_splits=5)
        result = cross_val_score(model, training_inputs,training_classes, cv=kfold, scoring="accuracy")
        results.append(result)
        names.append(name)
        # The scores are accuracies, so they are labelled as such
        print("{}:  Accuracy Mean:{:.4f} (Accuracy Std:{:.4f})".format(
            name,result.mean(),result.std()))

Output:

    ScalerET:   Accuracy Mean:0.9372 (Accuracy Std:0.0358)
    ScalerGBM:  Accuracy Mean:0.9462 (Accuracy Std:0.0332)
    ScalerRF:   Accuracy Mean:0.9553 (Accuracy Std:0.0275)
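Wrapping the scaler and the classifier in a Pipeline means that, inside each cross-validation fold, the scaler is fitted on the training part only and then applied to the validation part, so no statistics leak from the held-out rows. A sketch with an SVC, which is sensitive to feature scale (an illustrative choice, not one of the three pipelines above), again assuming scikit-learn's bundled Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)

# The bundled data is ordered by class, so the folds are shuffled here;
# shuffle=True is also what makes random_state take effect
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# make_pipeline names the steps automatically ("standardscaler", "svc")
scaled_svc = make_pipeline(StandardScaler(), SVC())

scores = cross_val_score(scaled_svc, X_all, y_all, cv=kfold, scoring="accuracy")
print("Accuracy Mean:{:.4f} (Accuracy Std:{:.4f})".format(scores.mean(), scores.std()))
```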

6 Hyperparameter grid search with cross-validation

    from sklearn.model_selection import GridSearchCV
    # 10 values of C x 4 kernels = 40 candidate settings
    param_grid = {
        "C":[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0],
        "kernel":['linear', 'poly', 'rbf', 'sigmoid']
    }
    model = SVC()
    # random_state only applies when shuffle=True, so it is omitted here
    kfold = KFold(n_splits=5)
    
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring="accuracy", cv=kfold)
    grid_result = grid.fit(training_inputs,training_classes)
    
    # Best setting first, then the mean and std of the fold accuracies for every setting
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Output:

    Best: 0.972973 using {'C': 0.9, 'kernel': 'linear'}
    
    0.954955 (0.027681) with: {'C': 0.1, 'kernel': 'linear'}
    0.927928 (0.021620) with: {'C': 0.1, 'kernel': 'poly'}
    0.945946 (0.016821) with: {'C': 0.1, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 0.1, 'kernel': 'sigmoid'}
    0.963964 (0.017933) with: {'C': 0.3, 'kernel': 'linear'}
    0.954955 (0.028629) with: {'C': 0.3, 'kernel': 'poly'}
    0.954955 (0.027681) with: {'C': 0.3, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 0.3, 'kernel': 'sigmoid'}
    0.963964 (0.017933) with: {'C': 0.5, 'kernel': 'linear'}
    0.954955 (0.028629) with: {'C': 0.5, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 0.5, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 0.5, 'kernel': 'sigmoid'}
    0.963964 (0.017933) with: {'C': 0.7, 'kernel': 'linear'}
    0.963964 (0.033773) with: {'C': 0.7, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 0.7, 'kernel': 'rbf'}
    0.342342 (0.045336) with: {'C': 0.7, 'kernel': 'sigmoid'}
    0.972973 (0.021914) with: {'C': 0.9, 'kernel': 'linear'}
    0.963964 (0.033773) with: {'C': 0.9, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 0.9, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 0.9, 'kernel': 'sigmoid'}
    0.972973 (0.021914) with: {'C': 1.0, 'kernel': 'linear'}
    0.963964 (0.033773) with: {'C': 1.0, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 1.0, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 1.0, 'kernel': 'sigmoid'}
    0.972973 (0.021914) with: {'C': 1.3, 'kernel': 'linear'}
    0.963964 (0.033773) with: {'C': 1.3, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 1.3, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 1.3, 'kernel': 'sigmoid'}
    0.972973 (0.021914) with: {'C': 1.5, 'kernel': 'linear'}
    0.963964 (0.033773) with: {'C': 1.5, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 1.5, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 1.5, 'kernel': 'sigmoid'}
    0.972973 (0.021914) with: {'C': 1.7, 'kernel': 'linear'}
    0.954955 (0.028629) with: {'C': 1.7, 'kernel': 'poly'}
    0.963964 (0.017933) with: {'C': 1.7, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 1.7, 'kernel': 'sigmoid'}
    0.963964 (0.017933) with: {'C': 2.0, 'kernel': 'linear'}
    0.954955 (0.028629) with: {'C': 2.0, 'kernel': 'poly'}
    0.954955 (0.027681) with: {'C': 2.0, 'kernel': 'rbf'}
    0.351351 (0.049646) with: {'C': 2.0, 'kernel': 'sigmoid'}
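Several linear-kernel settings (C from 0.9 to 1.7) show the same mean score of 0.972973 in the output above. When candidates tie, GridSearchCV reports the first one in grid order, which would explain why C=0.9 is listed as best. The search itself only uses the training set, so a natural follow-up is to score the refitted best model on the held-out test set. A sketch, assuming scikit-learn's bundled Iris data and a smaller grid than the one above to keep it short:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, train_size=0.75, random_state=1)

# Reduced grid for illustration: 4 values of C x 2 kernels
param_grid = {"C": [0.1, 0.5, 1.0, 2.0], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid=param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, the best setting is retrained on the whole
# training set, and grid.score evaluates that model with the chosen scorer
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```

The test-set score is the one to quote as an estimate of performance on new data, since the cross-validated best score was itself used to pick the setting.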

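Summary

This article uses no fancy techniques. Its aim is to bring several scenarios together in one place: evaluating multiple classifiers side by side, cross-validating pipelines, and grid-searching hyperparameters with cross-validation.

Qin Kaixin (秦凱新), Shenzhen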
