Machine Learning -- Feature Engineering 1 -- Standardization

阿新 • Published: 2018-11-16

sklearn.preprocessing

https://scikit-learn.org/stable/modules/preprocessing.html

Let's walk through data preprocessing with sklearn:

Install: pip install -U scikit-learn

Location of the sklearn source:

C:\Users\chen\AppData\Local\Programs\Python\Python37\Lib\site-packages\sklearn\preprocessing

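Standardization: many scikit-learn estimators expect the data to be standardized first, so that each feature looks roughly like:

Gaussian with zero mean and unit variance

1. Use StandardScaler() to standardize data to zero mean and unit variance: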

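from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
# fit() learns the per-feature mean and standard deviation
scaler = preprocessing.StandardScaler().fit(X_train)
scaler.transform(X_train)  # standardized copy of X_train
scaler.mean_               # per-feature mean
scaler.scale_              # per-feature standard deviation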
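To check what the scaler actually did, here is a small sketch that prints the learned statistics and confirms that every transformed column has mean 0 and standard deviation 1. The extra sample `X_new` is a made-up row, not part of the original example; it only shows that new data is scaled with the statistics learned from `X_train`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

scaler = StandardScaler().fit(X_train)   # learn per-column mean and std
X_scaled = scaler.transform(X_train)

print(scaler.mean_)            # per-feature mean of X_train
print(scaler.scale_)           # per-feature standard deviation of X_train
print(X_scaled.mean(axis=0))   # approximately 0 for every column
print(X_scaled.std(axis=0))    # approximately 1 for every column

# Hypothetical new sample, scaled with the statistics learned from X_train.
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```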

Source file: data.py

Constructor: controls whether to copy the data, whether to center it, and whether to scale it to unit variance

Parameters
    ----------
    copy : boolean, optional, default True
        If False, try to avoid a copy and do inplace scaling instead.
        This is not guaranteed to always work inplace; e.g. if the data is
        not a NumPy array or scipy.sparse CSR matrix, a copy may still be
        returned.

    with_mean : boolean, True by default
        If True, center the data before scaling.
        This does not work (and will raise an exception) when attempted on
        sparse matrices, because centering them entails building a dense
        matrix which in common use cases is likely to be too large to fit in
        memory.

    with_std : boolean, True by default
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).

def __init__(self, copy=True, with_mean=True, with_std=True):
        self.with_mean = with_mean
        self.with_std = with_std
        self.copy = copy
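A quick sketch of what the two switches do, assuming the same toy matrix as above: `with_std=False` only subtracts the mean, `with_mean=False` only divides by the standard deviation, and with the default `copy=True` the input array is left untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

centered = StandardScaler(with_std=False).fit_transform(X)   # subtract mean only
rescaled = StandardScaler(with_mean=False).fit_transform(X)  # divide by std only

print(centered)
print(rescaled)
print(X)  # unchanged, because copy=True is the default
```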

fit computes the mean and variance:

def fit(self, X, y=None):
        """Compute the mean and std to be used for later scaling.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape [n_samples, n_features]
            The data used to compute the mean and standard deviation
            used for later scaling along the features axis.

        y
            Ignored
        """

        # Reset internal state before fitting
        self._reset()
        return self.partial_fit(X, y)
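Since fit is just a reset followed by partial_fit, feeding the data in batches through partial_fit should give the same statistics as one call to fit. The sketch below checks this on randomly generated data (the data and the batch count are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical data

# One pass over the whole array.
full = StandardScaler().fit(X)

# Same data fed in four batches; statistics are accumulated across calls.
incremental = StandardScaler()
for batch in np.array_split(X, 4):
    incremental.partial_fit(batch)

same_mean = np.allclose(full.mean_, incremental.mean_)
same_scale = np.allclose(full.scale_, incremental.scale_)
print(same_mean, same_scale)
```

Calling fit again on the same scaler discards the earlier statistics (that is the `_reset()` call), while partial_fit keeps accumulating.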

transform: centers and scales the data using the fitted mean and standard deviation

def transform(self, X, y='deprecated', copy=None):
        """Perform standardization by centering and scaling

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data used to scale along the features axis.
        y : (ignored)
            .. deprecated:: 0.19
               This parameter will be removed in 0.21.
        copy : bool, optional (default: None)
            Copy the input X or not.
        """
        if not isinstance(y, string_types) or y != 'deprecated':
            warnings.warn("The parameter y on transform() is "
                          "deprecated since 0.19 and will be removed in 0.21",
                          DeprecationWarning)

        check_is_fitted(self, 'scale_')

        copy = copy if copy is not None else self.copy
        X = check_array(X, accept_sparse='csr', copy=copy, warn_on_dtype=True,
                        estimator=self, dtype=FLOAT_DTYPES,
                        force_all_finite='allow-nan')

        if sparse.issparse(X):
            if self.with_mean:
                raise ValueError(
                    "Cannot center sparse matrices: pass `with_mean=False` "
                    "instead. See docstring for motivation and alternatives.")
            if self.scale_ is not None:
                inplace_column_scale(X, 1 / self.scale_)
        else:
            if self.with_mean:
                X -= self.mean_
            if self.with_std:
                X /= self.scale_
        return X
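The sparse branch above is easy to trigger. In this sketch (the matrix values are made up), the default scaler refuses to center a sparse matrix and raises a ValueError, while `with_mean=False` scales the columns and keeps the result sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X_dense = np.array([[1., 0., 2.],
                    [0., 0., 3.],
                    [4., 5., 0.]])
X_sparse = sparse.csr_matrix(X_dense)

try:
    StandardScaler().fit_transform(X_sparse)   # with_mean=True by default
    centering_failed = False
except ValueError as exc:
    centering_failed = True
    print("ValueError:", exc)

# Scaling without centering works and preserves sparsity.
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_scaled))
print(X_scaled.toarray())
```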

2. Normalization: scale features to a given range with `MinMaxScaler` / `MaxAbsScaler`

MinMaxScaler scales each feature to [0, 1]

Example:

X_train = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
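A runnable version of the example with the result printed. For instance, the first column 1, 2, 0 has minimum 0 and maximum 2, so it becomes 0.5, 1, 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)

print(min_max_scaler.data_min_)  # per-feature minimum seen during fit
print(min_max_scaler.data_max_)  # per-feature maximum seen during fit
print(X_train_minmax)            # every column now spans [0, 1]
```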

Normalization formula:

The constructor lets you specify the target range
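For reference, the scikit-learn documentation describes the MinMaxScaler transformation as follows, where $X_{\min}$ and $X_{\max}$ are the per-feature (column) minimum and maximum of the training data, and $(\min, \max)$ is `feature_range`:

```latex
X_{\text{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}},
\qquad
X_{\text{scaled}} = X_{\text{std}} \cdot (\max - \min) + \min
```

With the default `feature_range=(0, 1)` the second step changes nothing, so $X_{\text{scaled}} = X_{\text{std}}$.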

 Parameters
    ----------
    feature_range : tuple (min, max), default=(0, 1)
        Desired range of transformed data.

    copy : boolean, optional, default True
        Set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array).

def __init__(self, feature_range=(0, 1), copy=True):
        self.feature_range = feature_range
        self.copy = copy
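As a small illustration of `feature_range`, the sketch below scales the same toy matrix to [-1, 1] instead of the default [0, 1]. The range (-1, 1) is just an example choice.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)

print(X_scaled)               # every column now spans [-1, 1]
print(X_scaled.min(axis=0))   # -1 for every column
print(X_scaled.max(axis=0))   # 1 for every column
```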

fit computes the per-feature maximum and minimum

def fit(self, X, y=None):
        """Compute the minimum and maximum to be used for later scaling.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data used to compute the per-feature minimum and maximum
            used for later scaling along the features axis.
        """

        # Reset internal state before fitting
        self._reset()
        return self.partial_fit(X, y)

transform rescales the data into the specified range

def transform(self, X):
        """Scaling features of X according to feature_range.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            Input data that will be transformed.
        """
        check_is_fitted(self, 'scale_')

        X = check_array(X, copy=self.copy, dtype=FLOAT_DTYPES,
                        force_all_finite="allow-nan")

        X *= self.scale_
        X += self.min_
        return X
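The two lines `X *= self.scale_` and `X += self.min_` are the whole transformation. The sketch below rebuilds `scale_` and `min_` by hand with NumPy and compares them with the fitted scaler. The variable names `lo`, `hi`, `scale` and `offset` are mine, not sklearn's.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

lo, hi = 0.0, 1.0                       # the target feature_range
data_min = X.min(axis=0)
data_max = X.max(axis=0)

scale = (hi - lo) / (data_max - data_min)   # what sklearn stores as scale_
offset = lo - data_min * scale              # what sklearn stores as min_
manual = X * scale + offset                 # same two steps as transform()

scaler = MinMaxScaler(feature_range=(lo, hi)).fit(X)
print(np.allclose(scaler.scale_, scale))
print(np.allclose(scaler.min_, offset))
print(np.allclose(scaler.transform(X), manual))
```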

MaxAbsScaler scales each feature to [-1, 1]

Example:

X_train = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
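A runnable version with the result printed. Each column is divided by its largest absolute value (2, 1 and 2 here), so signs and zeros are preserved.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)

print(max_abs_scaler.scale_)  # per-feature maximum absolute value
print(X_train_maxabs)         # every value now lies in [-1, 1]
```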

fit computes the per-feature maximum absolute value:

def fit(self, X, y=None):
        """Compute the maximum absolute value to be used for later scaling.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape [n_samples, n_features]
            The data used to compute the per-feature minimum and maximum
            used for later scaling along the features axis.
        """

        # Reset internal state before fitting
        self._reset()
        return self.partial_fit(X, y)

transform divides the data by the maximum absolute value, scaling it to [-1, 1]

def transform(self, X):
        """Scale the data

        Parameters
        ----------
        X : {array-like, sparse matrix}
            The data that should be scaled.
        """
        check_is_fitted(self, 'scale_')
        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                        estimator=self, dtype=FLOAT_DTYPES,
                        force_all_finite='allow-nan')

        if sparse.issparse(X):
            inplace_column_scale(X, 1.0 / self.scale_)
        else:
            X /= self.scale_
        return X
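Because MaxAbsScaler only divides and never shifts, it also works on sparse input without destroying sparsity. A minimal sketch with a made-up CSR matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X_sparse = sparse.csr_matrix(np.array([[0., -4., 0.],
                                       [2., 0., 0.],
                                       [0., 1., -8.]]))

X_scaled = MaxAbsScaler().fit_transform(X_sparse)

print(sparse.issparse(X_scaled))       # still a sparse matrix
print(X_scaled.nnz == X_sparse.nnz)    # zeros stay zeros
print(X_scaled.toarray())              # all values within [-1, 1]
```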

3. Handling sparse data

4. Handling outliers

RobustScaler handles outliers by working with quantiles rather than the mean and variance:

    Scale features using statistics that are robust to outliers.

    This Scaler removes the median and scales the data according to
    the quantile range (defaults to IQR: Interquartile Range).
    The IQR is the range between the 1st quartile (25th quantile)
    and the 3rd quartile (75th quantile).
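In other words, each feature becomes (x - median) / IQR. The sketch below does this by hand with NumPy on a made-up column containing one extreme value and compares the result with RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature; the last value is an outlier.
X = np.array([[1.], [2.], [3.], [4.], [1000.]])

median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q75 - q25)      # remove median, divide by IQR

scaler = RobustScaler().fit(X)
print(scaler.center_)                    # the median
print(scaler.scale_)                     # the interquartile range
print(np.allclose(scaler.transform(X), manual))
```

The median (3) and the IQR (4 - 2 = 2) are not affected by how large the outlier is, which is the point of this scaler.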

Constructor:

Parameters
    ----------
    with_centering : boolean, True by default
        If True, center the data before scaling.
        This will cause ``transform`` to raise an exception when attempted on
        sparse matrices, because centering them entails building a dense
        matrix which in common use cases is likely to be too large to fit in
        memory.

    with_scaling : boolean, True by default
        If True, scale the data to interquartile range.

    quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
        Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR
        Quantile range used to calculate ``scale_``.

        .. versionadded:: 0.18

    copy : boolean, optional, default is True
        If False, try to avoid a copy and do inplace scaling instead.
        This is not guaranteed to always work inplace; e.g. if the data is
        not a NumPy array or scipy.sparse CSR matrix, a copy may still be
        returned.
    def __init__(self, with_centering=True, with_scaling=True,
                 quantile_range=(25.0, 75.0), copy=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling
        self.quantile_range = quantile_range
        self.copy = copy
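To see why this matters, here is a rough comparison on the same kind of made-up column with one outlier. StandardScaler squeezes the ordinary values together because the outlier inflates the mean and standard deviation, while RobustScaler keeps them spread out. The wider `quantile_range=(10.0, 90.0)` is an arbitrary choice to show how the parameter changes `scale_`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature; the last value is an outlier.
X = np.array([[1.], [2.], [3.], [4.], [1000.]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# Spread of the four ordinary values after scaling.
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()
print(standard_spread, robust_spread)

# A wider quantile range lets the outlier influence scale_ again.
default_scaler = RobustScaler().fit(X)
wide_scaler = RobustScaler(quantile_range=(10.0, 90.0)).fit(X)
print(default_scaler.scale_, wide_scaler.scale_)
```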
