Python Machine Learning: A Tutorial on yellowbrick, a Data Exploration and Visualization Library

阿新 • Published: 2019-08-20

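Background

When learning sklearn, the algorithms are not the only hurdle; you also have to learn matplotlib for visualization. For my practical work, visualization matters even more, yet matplotlib's usability and aesthetics leave a lot to be desired. I tried plotly and seaborn in turn and finally settled on Bokeh, because it integrates seamlessly with Flask and makes building data dashboards much easier.

A while ago I saw that this library (yellowbrick) makes data exploration fairly convenient, and today I found time to study it. I started from the English documentation, then discovered that someone is already working on a Chinese translation. It reads a bit like Google Translate, but in the spirit of borrowing what exists and saving some effort, I half copied and half learned from it, and still found a few places that did not match the documentation.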

# http://www.scikit-yb.org/zh/latest/tutorial.html

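Model Selection Tutorial

In this tutorial, we will look at the scores of a variety of Scikit-Learn models and compare them using Yellowbrick's visual diagnostic tools in order to select the best model for our data.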

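The Model Selection Triple

Discussions of machine learning frequently center on model selection. Whether it is logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are often quick to express their preference. The reason is mostly historical. Although modern third-party machine learning libraries have made deploying many kinds of models nearly trivial, traditionally the application and tuning of even one of these algorithms required many years of study. As a result, practitioners tend to have strong preferences for particular (and most likely more familiar) models over others.

However, model selection is more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

selecting and/or engineering the smallest and most predictive feature set,
choosing a set of algorithms from a model family, and
tuning the algorithm hyperparameters to optimize performance.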
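
The three steps above can be sketched as a single search over a Scikit-Learn pipeline. This is only an illustration under assumptions: the synthetic data from `make_classification`, the choice of `SelectKBest` for feature selection, and the small parameter grids are all invented here and are not part of the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),     # 1. feature set
    ("estimator", LogisticRegression(max_iter=1000)),  # 2. algorithm
])

# 3. hyperparameters, searched jointly with the other two choices
param_grid = [
    {
        "select__k": [3, 5],
        "estimator": [LogisticRegression(max_iter=1000)],
        "estimator__C": [0.1, 1.0],
    },
    {
        "select__k": [3, 5],
        "estimator": [RandomForestClassifier(random_state=0)],
        "estimator__n_estimators": [10, 50],
    },
]

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Each entry in `param_grid` is one (features, algorithm, hyperparameters) combination family, so the search space here has eight candidate triples.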

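The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. In their paper, which concerns the development of next-generation database systems built for predictive modeling, the authors cogently argue that such systems are badly needed because machine learning is highly experimental in practice. "Model selection," they explain, "is iterative and exploratory because the space of [model selection triples] is usually infinite, and it is generally impossible for analysts to know a priori which [combination] will yield satisfactory accuracy and/or insights."

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on quality models more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls.

The Yellowbrick library is a visual diagnostics platform for machine learning that lets data scientists steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.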

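About the Data

This tutorial uses a modified version of the mushroom dataset from the UCI Machine Learning Repository. Our objective is to predict whether a mushroom is poisonous or edible based on its characteristics.

The data include descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species was identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this last class was combined with the poisonous one).

Our file, "agaricus-lepiota.txt", contains information for 3 nominally valued attributes and a target value for 8124 mushroom instances (4208 edible, 3916 poisonous).

Let's load the data with Pandas.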

import os
import pandas as pd
mushrooms = 'data/shrooms.csv'  # the dataset
dataset   = pd.read_csv(mushrooms)
# dataset.columns = names
dataset.head()

	id	class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	...	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat	Unnamed: 24
0	1	p	x	s	n	t	p	f	c	n	...	w	w	p	w	o	p	k	s	u	NaN
1	2	e	x	s	y	t	a	f	c	b	...	w	w	p	w	o	p	n	n	g	NaN
2	3	e	b	s	w	t	l	f	c	b	...	w	w	p	w	o	p	n	n	m	NaN
3	4	p	x	y	w	t	p	f	c	n	...	w	w	p	w	o	p	k	s	u	NaN
4	5	e	x	s	g	f	n	f	w	b	...	w	w	p	w	o	e	n	a	g	NaN

5 rows × 25 columns

features = ['cap-shape', 'cap-surface', 'cap-color']
target   = ['class']
X = dataset[features]
y = dataset[target]

dataset.shape # two mushrooms fewer than in the official documentation

(8122, 25)

dataset.groupby('class').count() # each class is one mushroom short

	id	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat	Unnamed: 24
class
e	4207	4207	4207	4207	4207	4207	4207	4207	4207	4207	...	4207	4207	4207	4207	4207	4207	4207	4207	4207	0
p	3915	3915	3915	3915	3915	3915	3915	3915	3915	3915	...	3915	3915	3915	3915	3915	3915	3915	3915	3915	0

2 rows × 24 columns
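
The all-NaN column named `Unnamed: 24` in the output above is the kind of column pandas creates when every line of a CSV ends with a trailing delimiter. Whether that is what happened in `shrooms.csv` is an assumption; the three-row excerpt below is made up purely to show the effect and one way to drop such a column before counting classes.

```python
import io

import pandas as pd

# Hypothetical excerpt: note the trailing comma on every line
raw = "id,class,cap-shape,\n1,p,x,\n2,e,x,\n3,e,b,\n"
toy = pd.read_csv(io.StringIO(raw))
print(toy.columns.tolist())  # the empty last field gets an auto-generated name

# Drop columns that contain nothing but NaN
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.shape)

# Per-class row counts without repeating the count for every column
class_counts = cleaned["class"].value_counts()
print(class_counts)
```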

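Feature Extraction

Our data, including the target, is categorical. To use machine learning, we need to convert these values into numeric ones. To do that, we use Scikit-Learn transformers to turn the input dataset into something a model can be fit on. Luckily, Scikit-Learn provides a transformer that converts categorical labels into integers: sklearn.preprocessing.LabelEncoder. Unfortunately, it can only transform a single vector at a time, so we have to adapt it in order to apply it to multiple columns.
One question: isn't the mushroom class itself just a single vector?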
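
A small sketch may help with the question above. The target column is indeed one vector, so a single LabelEncoder handles it; the limitation only bites for the features, which span several columns. The four-row frame below is a made-up excerpt, not the real file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical four-row excerpt shaped like the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
    "cap-surface": ["s", "s", "s", "y"],
})

# The target is a single 1-D vector, so one encoder is enough
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(toy["class"])
print(target_encoder.classes_)  # classes are sorted: 'e' maps to 0, 'p' to 1
print(y_encoded)

# The features span several columns, so each column gets its own encoder
X_encoded = toy[["cap-shape", "cap-surface"]].apply(
    lambda column: LabelEncoder().fit_transform(column)
)
print(X_encoded)
```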

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns or all columns if None.
    """

    def __init__(self, columns=None):
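        # Accept the documented default of None (the original list comprehension raised TypeError)
        self.columns  = None if columns is None else list(columns)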
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])

        return output
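
As a point of comparison, recent Scikit-Learn releases ship a multi-column encoder, OrdinalEncoder. The sketch below assumes scikit-learn 0.20 or later and uses a made-up four-row frame; it checks that the built-in encoder yields the same integer codes as fitting one LabelEncoder per column, which is the approach the custom class above takes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical excerpt of the three feature columns
toy = pd.DataFrame({
    "cap-shape": ["x", "x", "b", "x"],
    "cap-surface": ["s", "s", "s", "y"],
    "cap-color": ["n", "y", "w", "w"],
})

# Built-in multi-column encoder; returns a float array
ordinal_codes = OrdinalEncoder().fit_transform(toy)

# One LabelEncoder per column, mirroring the custom transformer
label_codes = toy.apply(lambda column: LabelEncoder().fit_transform(column))

print(ordinal_codes)
print((ordinal_codes == label_codes.to_numpy()).all())
```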

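Modeling and Evaluation

Common Metrics for Evaluating Classifiers

Precision is the number of correct positive results divided by the number of all positive results (e.g., how many of the mushrooms we predicted to be edible actually were?).

Recall is the number of correct positive results divided by the number of positive results that should have been returned (e.g., how many of the poisonous mushrooms did we correctly predict to be poisonous?).

The F1 score is a measure of a test's accuracy. It considers both the precision and the recall of the test to compute the score. The F1 score can be interpreted as a weighted average (more precisely, the harmonic mean) of precision and recall, reaching its best value at 1 and its worst at 0.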
precision = true positives / (true positives + false positives)

recall = true positives / (false negatives + true positives)

F1 score = 2 * ((precision * recall) / (precision + recall))
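
A quick worked example of the three formulas, using made-up counts (30 true positives, 10 false positives, 20 false negatives) chosen only for illustration:

```python
# Hypothetical confusion counts for the positive class
true_positives = 30
false_positives = 10
false_negatives = 20

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (false_negatives + true_positives)
f1 = 2 * ((precision * recall) / (precision + recall))

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# F1 is a harmonic mean, so it sits below the arithmetic mean
# whenever precision and recall differ
print(f1 < (precision + recall) / 2)
```
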
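Now we are ready to make some predictions!

Let's build a way to evaluate multiple estimators, first using traditional numeric scores (which we will later compare with some visual diagnostics from the Yellowbrick library).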

from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
def model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder(categories='auto')),  # categories='auto' added here, otherwise a warning is raised
         ('estimator', estimator)
    ])

    # Fit the whole pipeline (encoders plus estimator) on the data
    model.fit(X, y)

    expected  = y
    predicted = model.predict(X)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return (f1_score(expected, predicted))
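
Note that the function above computes F1 on the same rows the model was fit on, so the scores it returns are training scores. A hedged alternative is to score on held-out folds with `cross_val_score`. The sketch below is an assumption-laden illustration: the random categorical frame and its labelling rule are invented, and it relies on OneHotEncoder accepting string columns directly (scikit-learn 0.20 or later) instead of the custom label-encoding step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical categorical features with the same column names as the tutorial
X_toy = pd.DataFrame({
    "cap-shape": rng.choice(list("bcfx"), size=200),
    "cap-surface": rng.choice(list("fsy"), size=200),
    "cap-color": rng.choice(list("ngwy"), size=200),
})
# Invented labelling rule, used only so the example has something to learn
y_toy = X_toy["cap-color"].isin(["n", "w"]).astype(int).to_numpy()

model = Pipeline([
    ("one_hot_encoder", OneHotEncoder(handle_unknown="ignore")),
    ("estimator", LogisticRegression()),
])

# One F1 score per held-out fold
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1")
print(scores)
print(scores.mean())
```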

from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier

model_selection(X, y, LinearSVC())

0.6582119537920643

import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")  # suppress warnings

model_selection(X, y, NuSVC())

0.6878837238441299

model_selection(X, y, SVC())

0.6625145971195017

model_selection(X, y, SGDClassifier())

0.5738408700629649

model_selection(X, y, KNeighborsClassifier())

0.6856846473029046

model_selection(X, y, LogisticRegressionCV())

0.6582119537920643

model_selection(X, y, LogisticRegression())

0.6578749058025622

model_selection(X, y, BaggingClassifier())

0.6873901878632248

model_selection(X, y, ExtraTreesClassifier())

0.6872294372294372

model_selection(X, y, RandomForestClassifier())

0.6992081007399714

Preliminary Model Evaluation

Based on the F1 scores above, which model performs best?
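
One way to answer is to collect the scores printed above and sort them. The numbers below are copied from the runs in this post; they are training-set F1 scores on three features, so the ranking is only as meaningful as that setup.

```python
# F1 scores as reported above, keyed by estimator class name
f1_scores = {
    "LinearSVC": 0.6582119537920643,
    "NuSVC": 0.6878837238441299,
    "SVC": 0.6625145971195017,
    "SGDClassifier": 0.5738408700629649,
    "KNeighborsClassifier": 0.6856846473029046,
    "LogisticRegressionCV": 0.6582119537920643,
    "LogisticRegression": 0.6578749058025622,
    "BaggingClassifier": 0.6873901878632248,
    "ExtraTreesClassifier": 0.6872294372294372,
    "RandomForestClassifier": 0.6992081007399714,
}

# Rank from best to worst
ranking = sorted(f1_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<24} {score:.4f}")

best_name, best_score = ranking[0]
```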

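Visual Model Evaluation

Now let's refactor our model evaluation function to use Yellowbrick's ClassificationReport class, a model visualizer that displays precision, recall, and F1 scores. This visual analysis tool combines numeric scores with a color-coded heatmap to support easy interpretation and detection, particularly of the nuances of Type I and Type II errors, which are very relevant (a matter of life and death!) to our use case.

A Type I error (or "false positive") is detecting an effect that is not present (e.g., deciding a mushroom is poisonous when it is in fact edible).

A Type II error (or "false negative") is failing to detect an effect that is present (e.g., believing a mushroom is edible when it is in fact poisonous).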
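
To make the two error types concrete, the sketch below counts them on a handful of made-up predictions. It assumes "poisonous" is the positive class (label 1), which matches the examples in the two definitions above.

```python
# Hypothetical labels: 1 = poisonous (positive class), 0 = edible
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Type I: an edible mushroom flagged as poisonous (false positive)
type_1_errors = sum(1 for actual, guess in pairs if actual == 0 and guess == 1)

# Type II: a poisonous mushroom passed as edible (false negative)
type_2_errors = sum(1 for actual, guess in pairs if actual == 1 and guess == 0)

print("Type I (false positives):", type_1_errors)
print("Type II (false negatives):", type_2_errors)
```

For this use case the Type II count is the dangerous one: each of those is a poisonous mushroom that ends up on the plate.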

from sklearn.pipeline import Pipeline
from yellowbrick.classifier import ClassificationReport


def visual_model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder()),
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])
    visualizer.fit(X, y)
    visualizer.score(X, y)
    visualizer.poof()  # Yellowbrick 1.0+ deprecates poof() in favor of show()

visual_model_selection(X, y, LinearSVC())

# visualizations for the other classifiers omitted
visual_model_selection(X, y, RandomForestClassifier())
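
For readers without a plotting backend, the per-class precision, recall, and F1 that the visualizer color-codes can also be printed as text with Scikit-Learn's `classification_report`. The labels below are made up for illustration; with the real data you would pass the encoded `y` and the pipeline's predictions instead.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = edible, 1 = poisonous
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
class_names = ["edible", "poisonous"]

# Plain-text table of precision, recall, F1, and support per class
print(classification_report(y_true, y_pred, target_names=class_names))

# The same numbers as a nested dict, convenient for further processing
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
print(report["poisonous"]["recall"])
```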

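Reflection

Now, which model looks best? Why?
Which model is most likely to save your life?
How does the visual model evaluation experience differ from numeric model evaluation?

Precision, Recall, and the combined evaluation metric F1-Measure
http://www.makaidong.com/%E5%8D%9A%E5%AE%A2%E5%9B%AD%E7%83%AD%E6%96%87/437.shtml
The f1-score takes both precision and recall into account.
Visualization is simply more intuitive, that's all. I'm out~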

About the Author

yeayee on Zhihu, 5 years of Python experience, skilled in Flask+MongoDB+SKlearn+Bo
