Column Transformer with Mixed Types -- of sklearn

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is median-imputed (SimpleImputer(strategy='median')) and then standard-scaled, while the categorical data is one-hot encoded; handle_unknown='ignore' makes OneHotEncoder encode categories unseen during training as all zeros instead of raising an error.
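
As a quick illustration of what handle_unknown='ignore' does (a toy sketch, not part of the original example), a category unseen during fit is encoded as an all-zeros row rather than raising an error:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['S'], ['C'], ['Q']])            # known categories, sorted: ['C', 'Q', 'S']
print(enc.transform([['S'], ['X']]).toarray())
# [[0. 0. 1.]    'S' -> one-hot against the sorted categories
#  [0. 0. 0.]]   unseen 'X' -> all zeros instead of an error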

In addition, we show two different ways to dispatch the columns to the particular pre-processor: by column names and by column data types.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline using Pipeline, together with a simple classification model.

Use ColumnTransformer by selecting column by names

(1) Build a transformation sub-pipeline for the numeric data, numeric_transformer, consisting of a SimpleImputer for missing values and a StandardScaler; it is later plugged into the ColumnTransformer.

(2) Build a transformer for the categorical data, categorical_transformer.

Then integrate (1) and (2) into a ColumnTransformer, dispatching by column names, to form the column transformer named preprocessor.

Finally, combine preprocessor and a LogisticRegression model into the final pipeline.

We will train our classifier with the following features:

Numeric Features:

  • age: float;

  • fare: float.

Categorical Features:

  • embarked: categories encoded as strings {'C', 'S', 'Q'};

  • sex: categories encoded as strings {'female', 'male'};

  • pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could either be treated as a categorical or numeric feature.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.random.seed(0)

# Load the Titanic dataset; X is a DataFrame, y the survival target.
X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

HTML representation of Pipeline

When the Pipeline is printed in a Jupyter notebook, an HTML representation of the estimator is displayed as follows:

from sklearn import set_config

set_config(display='diagram')
clf

StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

Numeric features often come in different units, so their magnitudes mean different things: some span a wide range of values, others a narrow one.

Without standardization, features with a large range dominate the model, while the influence of features with a small range can be effectively discarded.

Scaling all features into a similar range balances the influence each feature has.

Standardize features by removing the mean and scaling to unit variance

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]

Use ColumnTransformer by selecting column by data types

If all columns of a given data type share the same processing steps, a column selector can pick out the columns of that type and route them to the matching transformer or preprocessing sub-pipeline.

When dealing with a cleaned dataset, the preprocessing can be automatic by using the data types of the column to decide whether to treat a column as a numerical or categorical feature. sklearn.compose.make_column_selector gives this possibility. First, let’s only select a subset of columns to simplify our example.

subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
X_train, X_test = X_train[subset_feature], X_test[subset_feature]

Then, we introspect the information regarding each column data type.

X_train.info()

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   float64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(3)
memory usage: 35.0 KB

We can observe that the embarked and sex columns were tagged as category columns when loading the data with fetch_openml. Therefore, we can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numeric_transformer.

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="category")),
    ('cat', categorical_transformer, selector(dtype_include="category"))
])
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Using the prediction pipeline in a grid search

Grid search can also be performed on the different preprocessing steps defined in the ColumnTransformer object, together with the classifier’s hyperparameters as part of the Pipeline. We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using GridSearchCV.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search
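
The double-underscore syntax walks down the nesting: preprocessor__num__imputer__strategy addresses the strategy parameter of the imputer step inside the num transformer of the preprocessor step. The full set of addressable names can be listed from the pipeline itself (a quick check, not from the original example):

# Every key of get_params() is a valid grid-search parameter name.
print(sorted(k for k in clf.get_params() if k.endswith('strategy')))
# e.g. ['preprocessor__num__imputer__strategy']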

Then inspect the parameters corresponding to the best result.

Calling ‘fit’ triggers the cross-validated search for the best hyper-parameters combination:

grid_search.fit(X_train, y_train)

print(f"Best params:")
print(grid_search.best_params_)

Out:

Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'mean'}

The internal cross-validation score obtained with those parameters is:

print(f"Internal CV score: {grid_search.best_score_:.3f}")

Out:

Internal CV score: 0.784

We can also introspect the top five grid search results as a pandas dataframe:

import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[["mean_test_score", "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C"
            ]].head(5)
   mean_test_score  std_test_score param_preprocessor__num__imputer__strategy param_classifier__C
0         0.784167        0.035824                                        mean                 0.1
2         0.780366        0.032722                                        mean                   1
1         0.780348        0.037245                                      median                 0.1
4         0.779414        0.033105                                        mean                  10
6         0.779414        0.033105                                        mean                 100


The best hyper-parameters have been used to re-fit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.

print(("best logistic regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

Out:

best logistic regression from grid search: 0.794