What Is Feature Engineering in Machine Learning

阿新 • Published: 2021-06-11

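1. What is feature engineering?
2. Data preprocessing
3. Feature selection
4. Dimensionality reduction
  - 4.1 Principal component analysis (PCA)
  - 4.2 Linear discriminant analysis (LDA)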

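1. What is feature engineering?

A saying circulates widely in the industry: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. So what is feature engineering? As the name suggests, it is essentially an engineering activity whose goal is to extract as much useful feature information as possible from raw data for algorithms and models to use. Summarizing common practice, feature engineering is considered to cover several aspects.

Feature processing is the core part of feature engineering. sklearn provides fairly complete feature processing methods, covering data preprocessing, feature selection, and dimensionality reduction. Most people first meet sklearn through its rich and convenient library of algorithms and models, but the feature processing library described here is also very powerful!

This article uses the Iris dataset bundled with sklearn to illustrate feature processing. The Iris dataset was compiled by Fisher in 1936 and contains four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all of which are positive floating-point values in centimeters. The target value is the species of iris: Iris Setosa, Iris Versicolour, or Iris Virginica. The code to import the Iris dataset is as follows: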

from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
# Feature matrix
iris.data
# Target vector
iris.target

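2. Data preprocessing

Feature extraction yields unprocessed features, which may have the following problems:

Not on the same scale: the features have different units or ranges and cannot be compared directly. Nondimensionalization solves this problem.

Information redundancy: for some quantitative features, the useful information is only an interval division, as with exam scores. If we only care about "pass" versus "fail", the quantitative score needs to be converted to "1" and "0" to indicate pass and fail. Binarization solves this problem.

Qualitative features cannot be used directly: some machine learning algorithms and models accept only quantitative features as input, so qualitative features must be converted into quantitative ones. The simplest way is to assign a quantitative value to each qualitative value, but this approach is too arbitrary and adds tuning work. Qualitative features are usually converted by dummy encoding instead: if a feature has N qualitative values, it is expanded into N features. When the original feature takes the i-th qualitative value, the i-th expanded feature is set to 1 and the other expanded features are set to 0. Compared with assigning values directly, dummy encoding adds no tuning work. For linear models, dummy-encoded features can achieve a nonlinear effect.

Missing values: missing values need to be filled in.

Low information utilization: different machine learning algorithms and models use different information in the data. As mentioned above, in linear models, dummy encoding of qualitative features can achieve a nonlinear effect. Similarly, polynomial or other transformations of quantitative variables can achieve a nonlinear effect.

We use the preprocessing library in sklearn to preprocess the data and solve the problems above.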

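2.1 Nondimensionalization

Nondimensionalization converts data on different scales to a common scale. The common methods are standardization and interval scaling. Standardization assumes that the feature values follow a normal distribution; after standardization they follow a standard normal distribution. Interval scaling uses the boundary values to scale the range of a feature to a specific interval, such as [0, 1].

2.1.1 Standardization

Standardization requires computing the mean and standard deviation of a feature, and is expressed as: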

\(x^{\prime}=\frac{x-\bar{X}}{S}\)

The code to standardize the data using the StandardScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import StandardScaler
# Standardization; the return value is the standardized data
StandardScaler().fit_transform(iris.data)
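
As a quick check of the formula, the following sketch compares StandardScaler with the same computation done by hand in NumPy. The variable names (X, X_std, X_manual) are illustrative and not part of the original article; the hand computation assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize with sklearn
X_std = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean and divide by the
# column standard deviation (population form, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))
# Each standardized column has mean 0 and standard deviation 1 (up to rounding)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```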

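2.1.2 Interval scaling

There are many approaches to interval scaling. A common one scales by the two extreme values, the minimum and the maximum. The formula is: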

\(x^{\prime}=\frac{x-\mathrm{Min}}{\mathrm{Max}-\mathrm{Min}}\)

The code for interval scaling using the MinMaxScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import MinMaxScaler
# Interval scaling; the return value is the data scaled to the [0, 1] interval
MinMaxScaler().fit_transform(iris.data)
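
The min-max formula can be reproduced by hand in the same way. This is a minimal sketch; the names X_scaled and X_manual are illustrative, and the minimum and maximum are taken per column, as MinMaxScaler does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Scale every feature (column) to the [0, 1] interval with sklearn
X_scaled = MinMaxScaler().fit_transform(X)

# Same computation by hand, using the per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```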

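2.1.3 The difference between standardization and normalization

In short, standardization processes data by the columns of the feature matrix: it uses the z-score method to convert the feature values of the samples to the same scale. Normalization processes data by the rows of the feature matrix. Its purpose is to give sample vectors a uniform standard when a dot product or another kernel function is used to compute similarity, that is, to convert each sample into a "unit vector". The normalization formula for the l2 norm is: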

\(x^{\prime}=\frac{x}{\sqrt{\sum_{j=1}^{m} x[j]^{2}}}\)

The code to normalize the data using the Normalizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Normalizer
# Normalization; the return value is the normalized data
Normalizer().fit_transform(iris.data)
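
To make the row-wise behaviour concrete, the sketch below checks that every normalized sample has length 1 and that the dot product of two normalized samples equals the cosine similarity of the original samples. The choice of the first two rows is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# Normalize each sample (row) to unit l2 length
X_norm = Normalizer(norm="l2").fit_transform(X)

# Every row now has length 1
row_lengths = np.linalg.norm(X_norm, axis=1)
print(np.allclose(row_lengths, 1.0))

# The dot product of two normalized rows is the cosine similarity
# of the corresponding original rows
cos_from_norm = X_norm[0] @ X_norm[1]
cos_manual = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(np.isclose(cos_from_norm, cos_manual))
```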

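2.2 Binarizing quantitative features

The core of binarizing a quantitative feature is setting a threshold. Values greater than the threshold become 1, and values less than or equal to the threshold become 0. The formula is: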

\(x^{\prime}=\left\{\begin{array}{l}1, x>\text { threshold } \\ 0, x \leq \text { threshold }\end{array}\right.\)

The code to binarize the data using the Binarizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Binarizer
# Binarization with the threshold set to 3; the return value is the binarized data
Binarizer(threshold=3).fit_transform(iris.data)
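
The pass/fail example from the start of this section can be sketched as follows. The scores and the threshold of 60 are hypothetical values chosen for illustration; they do not come from the Iris data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical exam scores, one sample per row
scores = np.array([[35.0], [60.0], [61.0], [90.0]])

# Values strictly greater than the threshold become 1, all others become 0
passed = Binarizer(threshold=60.0).fit_transform(scores)
print(passed.ravel())
```

Because the comparison is strict, a score of exactly 60 maps to 0 here; to count 60 as a pass, the threshold would have to be set just below 60.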

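2.3 Dummy encoding of qualitative features

Since the features of the Iris dataset are all quantitative, its target values are used to demonstrate dummy encoding (which is not actually needed in practice). The code to dummy-encode the data using the OneHotEncoder class of the preprocessing library is as follows:

from sklearn.preprocessing import OneHotEncoder
# Dummy encoding of the Iris target values; the return value is the dummy-encoded data
OneHotEncoder().fit_transform(iris.target.reshape((-1, 1)))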
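
The encoder returns a sparse matrix by default, which hides the structure of the result. The following sketch converts it to a dense array to show that each of the three target values gets its own column; it assumes a scikit-learn version that exposes the categories_ attribute.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(iris.target.reshape(-1, 1)).toarray()

print(encoder.categories_)          # the distinct target values found during fit
print(encoded.shape)                # one column per distinct value
print(iris.target[0], encoded[0])   # a target value and its encoded row
```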

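2.4 Missing value imputation

Since the Iris dataset has no missing values, a new sample is added to the dataset with all four features set to NaN, indicating that the data is missing. The code to impute the missing data using the SimpleImputer class of the impute library (the successor to the Imputer class of the preprocessing library, which was removed in scikit-learn 0.22) is as follows:

from numpy import vstack, array, nan
from sklearn.impute import SimpleImputer
# Missing value imputation; the return value is the data with the missing values filled in
# The parameter missing_values is the representation of a missing value; the default is NaN
# The parameter strategy is the filling method; the default is "mean"
SimpleImputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))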
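
To see what mean imputation actually fills in, the sketch below checks that the all-NaN sample is replaced by the column means of the observed data. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer; the names X_missing and X_filled are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

iris = load_iris()

# Prepend one sample whose four features are all missing
X_missing = np.vstack((np.full(4, np.nan), iris.data))

# Fill missing entries with the mean of the observed values in each column
X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)

print(X_filled[0])               # the imputed sample
print(iris.data.mean(axis=0))    # column means of the original data
```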

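2.5 Data transformation

Common data transformations are polynomial, exponential, and logarithmic. The degree-2 polynomial transformation of four features is: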

\(\left(x_{1}^{\prime}, x_{2}^{\prime}, x_{3}^{\prime}, x_{4}^{\prime}, x_{5}^{\prime}, x_{6}^{\prime}, x_{7}^{\prime}, x_{8}^{\prime}, x_{9}^{\prime}, x_{10}^{\prime}, x_{11}^{\prime}, x_{12}^{\prime}, x_{13}^{\prime}, x_{14}^{\prime}, x_{15}^{\prime}\right)\)
\(=\left(1, x_{1}, x_{2}, x_{3}, x_{4}, x_{1}^{2}, x_{1} * x_{2}, x_{1} * x_{3}, x_{1} * x_{4}, x_{2}^{2}, x_{2} * x_{3}, x_{2} * x_{4}, x_{3}^{2}, x_{3} * x_{4}, x_{4}^{2}\right)\)

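The code for the polynomial transformation of the data using the PolynomialFeatures class of the preprocessing library is as follows:

from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation
# The parameter degree is the degree of the polynomial; the default is 2
PolynomialFeatures().fit_transform(iris.data)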
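
The 15 output features can be inspected through the powers_ attribute of the fitted transformer, which lists the exponent of every input feature in every output feature. The sketch below rebuilds each output column from those exponents as a consistency check; X_manual is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import PolynomialFeatures

X = load_iris().data

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(X_poly.shape)   # 4 input features expand to 15 output features

# powers_[i, j] is the exponent of input feature j in output feature i
print(poly.powers_)

# Rebuild every output column from its exponents to confirm the mapping
X_manual = np.column_stack([np.prod(X ** p, axis=1) for p in poly.powers_])
print(np.allclose(X_poly, X_manual))
```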

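Data transformations based on single-argument functions can all be handled in a uniform way. The code to apply a logarithmic transformation to the data using the FunctionTransformer class of the preprocessing library is as follows:

from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
# Custom transformation; here the transformation function is the logarithm log(1 + x)
# The first parameter is a single-argument function
FunctionTransformer(log1p).fit_transform(iris.data)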

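3. Feature selection

After data preprocessing, we need to select meaningful features to feed into the machine learning algorithms and models for training. Generally, features are selected from two perspectives:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to zero, the samples barely differ on this feature, so it is of no use for telling samples apart.
Correlation between the feature and the target: this one is more obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, all the other methods introduced in this article are based on correlation.

By the form of feature selection, the methods fall into three types:

Filter: scores each feature by divergence or correlation, then selects features by setting a threshold or the number of features to keep.
Wrapper: selects or excludes several features at a time according to an objective function (usually a prediction score).
Embedded: first trains some machine learning algorithm or model to obtain a weight coefficient for each feature, then selects features by coefficient from largest to smallest. This is similar to the filter method, except that the quality of a feature is determined through training.

We use the feature_selection library in sklearn for feature selection.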

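3.1 Filter

3.1.1 Variance threshold

The variance threshold method first computes the variance of each feature, then selects the features whose variance is greater than a threshold. The code to select features using the VarianceThreshold class of the feature_selection library is as follows:

from sklearn.feature_selection import VarianceThreshold
# Variance threshold method; the return value is the data after feature selection
# The parameter threshold is the variance threshold
VarianceThreshold(threshold=3).fit_transform(iris.data)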
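
To see which features a given threshold keeps, the sketch below prints the variance of each Iris feature next to the selector's decision. It assumes the population variance (ddof=0) for the hand computation; the loop and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data

selector = VarianceThreshold(threshold=3)
X_sel = selector.fit_transform(X)

# Per-feature variances computed by hand (population variance, ddof=0)
variances = X.var(axis=0)

# get_support() returns a boolean mask of the features that were kept
for name, var, kept in zip(iris.feature_names, variances, selector.get_support()):
    print(f"{name}: variance={var:.3f}, kept={kept}")
```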

3.1.2 Correlation coefficient

The correlation coefficient method first computes the correlation coefficient of each feature with the target value, together with the P value of that coefficient. The code to select features using the SelectKBest class of the feature_selection library combined with the correlation coefficient is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# Select the K best features; the return value is the data after feature selection
# The first parameter is the function that scores the features. It takes the feature matrix and the target vector and returns a pair (scores, P values) whose i-th entries are the score and P value of the i-th feature. Here the score is defined as the Pearson correlation coefficient
# The parameter k is the number of features to select
SelectKBest(lambda X, Y: tuple(array([tuple(pearsonr(x, Y)) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)

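3.1.3 Chi-squared test

The classic chi-squared test examines the dependence between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable takes N values and the dependent variable takes M values. Consider the difference between the observed frequency A and the expected frequency E of samples for which the independent variable equals i and the dependent variable equals j, and construct the statistic: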

\(\chi^{2}=\sum \frac{(A-E)^{2}}{E}\)

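It is easy to see that this statistic measures how strongly the independent variable is related to the dependent variable (see the Wikipedia article on the chi-squared test). The code to select features using the SelectKBest class of the feature_selection library combined with the chi-squared test is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Select the K best features; the return value is the data after feature selection
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)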
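
The statistic itself is easy to compute by hand on a small contingency table. The sketch below uses a hypothetical 2x2 table of counts (rows are the values of a qualitative feature, columns are the classes) and compares the result with scipy.stats.chi2_contingency, called without the continuity correction so that it matches the plain formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies A: rows = feature values, columns = classes
observed = np.array([[30, 10],
                     [20, 40]])

# Expected frequencies E under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum of (A - E)^2 / E over all cells
stat = ((observed - expected) ** 2 / expected).sum()

# Reference value from SciPy (correction=False disables the Yates correction)
ref_stat, p_value, dof, ref_expected = chi2_contingency(observed, correction=False)
print(stat, ref_stat, p_value)
```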

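3.1.4 Mutual information

Classic mutual information is also used to evaluate the dependence between a qualitative independent variable and a qualitative dependent variable. Mutual information is computed as: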

\(I(X ; Y)=\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}\)

To handle quantitative data, the maximal information coefficient (MIC) was proposed. The code to select features using the SelectKBest class of the feature_selection library combined with the maximal information coefficient is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from minepy import MINE

# MINE is not designed in a functional style, so mic() wraps it as a function that returns a pair; the second item of the pair is a fixed P value of 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)
# Select the K best features; the return value is the data after feature selection
SelectKBest(lambda X, Y: tuple(array([mic(x, Y) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)
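
minepy is a third-party package and may not be installed. As an alternative that stays within sklearn, the sketch below swaps in mutual_info_classif, which estimates the mutual information between each feature and a discrete target. This is a different estimator from MIC, not the method used above; the partial(...) wrapper and random_state=0 are assumptions made here for reproducibility.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()

# Fix the random seed of the estimator so that repeated runs give the same scores
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func, k=2)
X_sel = selector.fit_transform(iris.data, iris.target)

# One estimated mutual information score per feature
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.3f}")
print(X_sel.shape)
```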

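3.2 Wrapper

3.2.1 Recursive feature elimination

Recursive feature elimination uses a base model to run several rounds of training. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the new feature set. The code to select features using the RFE class of the feature_selection library is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Recursive feature elimination; the return value is the data after feature selection
# The parameter estimator is the base model
# The parameter n_features_to_select is the number of features to select
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)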
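
The fitted RFE object records which features survived and the order in which the others were dropped. The following sketch prints both; raising max_iter is an assumption made here only to avoid convergence warnings and is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# max_iter=1000 is an illustrative choice to let the solver converge
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(iris.data, iris.target)

# support_ marks the selected features; ranking_ is 1 for selected features,
# and larger ranks belong to features that were eliminated earlier
for name, kept, rank in zip(iris.feature_names, rfe.support_, rfe.ranking_):
    print(f"{name}: kept={kept}, rank={rank}")
```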

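3.3 Embedded

3.3.1 Penalty-based feature selection

Using a base model with a penalty term not only filters out features but also reduces dimensionality. The code to select features using the SelectFromModel class of the feature_selection library combined with an L1-penalized logistic regression model is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Logistic regression with an L1 penalty as the base model for feature selection
# The liblinear solver is named explicitly because the current default solver, lbfgs, does not support the L1 penalty
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)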

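In fact, the L1 penalty reduces dimensionality by keeping only one of several features that are equally correlated with the target value, so a feature that was not selected is not necessarily unimportant. The L2 penalty can therefore be combined with it as a refinement. The procedure is as follows: if a feature has a nonzero weight under L1, the features whose L2 weights differ little from its L2 weight and whose L1 weights are 0 are collected into a group of similar features, and the L1 weight is split equally among the features in that group. This requires building a new logistic regression model: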

from sklearn.linear_model import LogisticRegression

class LR(LogisticRegression):
    def __init__(self, threshold=0.01, dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='liblinear', max_iter=100,
                 verbose=0, warm_start=False, n_jobs=1):

        # Threshold on the difference between L2 weight coefficients
        self.threshold = threshold
        # The model itself is the L1-penalized logistic regression
        super().__init__(penalty='l1', dual=dual, tol=tol, C=C,
                         fit_intercept=fit_intercept, intercept_scaling=intercept_scaling,
                         class_weight=class_weight, random_state=random_state,
                         solver=solver, max_iter=max_iter, verbose=verbose,
                         warm_start=warm_start, n_jobs=n_jobs)
        # Create an L2-penalized logistic regression with the same parameters
        self.l2 = LogisticRegression(penalty='l2', dual=dual, tol=tol, C=C,
                                     fit_intercept=fit_intercept, intercept_scaling=intercept_scaling,
                                     class_weight=class_weight, random_state=random_state,
                                     solver=solver, max_iter=max_iter, verbose=verbose,
                                     warm_start=warm_start, n_jobs=n_jobs)

    def fit(self, X, y, sample_weight=None):
        # Train the L1 logistic regression
        super().fit(X, y, sample_weight=sample_weight)
        self.coef_old_ = self.coef_.copy()
        # Train the L2 logistic regression
        self.l2.fit(X, y, sample_weight=sample_weight)

        # The number of rows of the coefficient matrix equals the number of target classes
        cntOfRow, cntOfCol = self.coef_.shape
        for i in range(cntOfRow):
            for j in range(cntOfCol):
                coef = self.coef_[i][j]
                # The L1 weight coefficient of feature j is nonzero
                if coef != 0:
                    idx = [j]
                    # The corresponding weight coefficient in the L2 logistic regression
                    coef1 = self.l2.coef_[i][j]
                    for k in range(cntOfCol):
                        coef2 = self.l2.coef_[i][k]
                        # Feature k joins the group if its L2 weight differs from that of feature j
                        # by less than the threshold and its L1 weight is 0
                        if abs(coef1-coef2) < self.threshold and j != k and self.coef_[i][k] == 0:
                            idx.append(k)
                    # Split the L1 weight equally among the features in the group
                    mean = coef / len(idx)
                    self.coef_[i][idx] = mean
        return self

The code to select features using the SelectFromModel class of the feature_selection library combined with the logistic regression model that carries both L1 and L2 penalties is as follows:

from sklearn.feature_selection import SelectFromModel

# Logistic regression with L1 and L2 penalties as the base model for feature selection
# The parameter threshold is the threshold on the difference between weight coefficients
SelectFromModel(LR(threshold=0.5, C=0.1)).fit_transform(iris.data, iris.target)

3.3.2 Tree-based feature selection

Among tree models, GBDT can also serve as the base model for feature selection. The code to select features using the SelectFromModel class of the feature_selection library combined with a GBDT model is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
# GBDT as the base model for feature selection
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
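
For tree models, SelectFromModel relies on the feature_importances_ of the fitted estimator and, when no threshold is given, keeps the features whose importance is at least the mean importance. The sketch below makes that visible; random_state=0 is an assumption added here so that repeated runs agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

selector = SelectFromModel(GradientBoostingClassifier(random_state=0))
X_sel = selector.fit_transform(iris.data, iris.target)

# Importances of the fitted GBDT; they sum to 1
importances = selector.estimator_.feature_importances_

# With the default settings, the cut-off is the mean importance
for name, imp, kept in zip(iris.feature_names, importances, selector.get_support()):
    print(f"{name}: importance={imp:.3f}, kept={kept}")
print("threshold:", selector.threshold_)
```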

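4. Dimensionality reduction

Once feature selection is complete, the model can be trained directly, but the feature matrix may be so large that computation is heavy and training takes a long time. Reducing the dimensionality of the feature matrix is therefore also necessary. Besides the L1-penalty-based model mentioned above, the common dimensionality reduction methods are principal component analysis (PCA) and linear discriminant analysis (LDA); LDA is itself also a classification model. PCA and LDA have much in common: in essence, both map the original samples into a lower-dimensional sample space. Their mapping objectives differ, however: PCA makes the mapped samples have the largest variance, whereas LDA makes the mapped samples have the best classification performance. PCA is therefore an unsupervised dimensionality reduction method, while LDA is a supervised one.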

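4.1 Principal component analysis (PCA)

The code to reduce dimensionality using the PCA class of the decomposition library is as follows:

from sklearn.decomposition import PCA
# Principal component analysis; the return value is the data after dimensionality reduction
# The parameter n_components is the number of principal components
PCA(n_components=2).fit_transform(iris.data)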
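
How much information two components retain can be read from explained_variance_ratio_. The following is a minimal sketch; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)

# Share of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# The principal directions are orthonormal: components_ @ components_.T is the identity
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))
```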

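4.2 Linear discriminant analysis (LDA)

The code to reduce dimensionality using the LinearDiscriminantAnalysis class of the discriminant_analysis library (which replaced the LDA class of the former lda library) is as follows:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Linear discriminant analysis; the return value is the data after dimensionality reduction
# The parameter n_components is the number of dimensions after reduction
LDA(n_components=2).fit_transform(iris.data, iris.target)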
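
Because LDA is supervised, it needs the target vector, and the number of components it can produce is bounded by the number of classes. The sketch below shows both points and also uses the fitted object as a classifier; the accuracy printed is measured on the training data, so it is only a rough indication.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()

lda = LinearDiscriminantAnalysis(n_components=2)
# Unlike PCA, fitting requires the target vector
X_lda = lda.fit_transform(iris.data, iris.target)
print(X_lda.shape)

# LDA yields at most min(n_features, n_classes - 1) components
n_features = iris.data.shape[1]
n_classes = len(set(iris.target))
max_components = min(n_features, n_classes - 1)
print(max_components)

# The same fitted object is also a classifier (accuracy on the training data)
print(lda.score(iris.data, iris.target))
```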

Original article: https://medium.com/ml-research-lab/chapter-6-how-to-learn-feature-engineering-49f4246f0d41