Column Transformer with Mixed Types (scikit-learn)
Source: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
ColumnTransformer applies different preprocessing and feature-extraction pipelines to different subsets of features. This tool is very convenient for heterogeneous datasets, e.g. scaling numeric features while one-hot encoding categorical ones.
This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.
For numeric data, first impute missing values, using the median as the fill value, then apply standard scaling. For categorical data, first fill missing values with a 'missing' category, then encode.
In this example, the numeric data is standard-scaled after median-imputation, while the categorical data is one-hot encoded after imputing missing values with a new category ('missing').
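Note that the code reproduced later in this article one-hot encodes the categorical columns directly; if you also want the constant 'missing' imputation described above, the categorical transformer can itself be a small sub-pipeline. A minimal sketch consistent with the scikit-learn API (not part of the example's code as shown below):

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Fill gaps with the literal string 'missing', then one-hot encode;
# handle_unknown='ignore' maps unseen categories to all-zero rows.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])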
Columns can be dispatched to specific transformers in different ways: by column name or by column data type.
In addition, we show two different ways to dispatch the columns to the particular pre-processor: by column names and by column data types.
Finally, the preprocessing pipeline is integrated into a complete prediction pipeline, connected to a simple classification model.
Finally, the preprocessing pipeline is integrated in a full prediction pipeline using Pipeline, together with a simple classification model.
Use ColumnTransformer by selecting columns by name
(1) For the numeric data, build a transformation pipeline numeric_transformer consisting of the imputer SimpleImputer and the scaler StandardScaler; this sub-pipeline is later integrated into the ColumnTransformer.
(2) For the categorical data, create a transformer categorical_transformer.
Then integrate (1) and (2) into a ColumnTransformer, dispatching by column names, to form the column transformer named preprocessor.
Finally, combine preprocessor and a LogisticRegression model into the final pipeline.
We will train our classifier with the following features:
Numeric features:
- age: float
- fare: float
Categorical features:
- embarked: categories encoded as strings {'C', 'S', 'Q'}
- sex: categories encoded as strings {'female', 'male'}
- pclass: ordinal integers {1, 2, 3}
We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could either be treated as a categorical or numeric feature.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the Titanic dataset as pandas DataFrames, as in the original example.
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
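A note on handle_unknown='ignore': if a category appears in the test split that was never seen during fit, the encoder outputs all zeros for that row instead of raising an error, so the fitted pipeline stays usable on new data.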
HTML representation of Pipeline (viewing the pipeline)
When the Pipeline is printed out in a jupyter notebook an HTML representation of the estimator is displayed as follows:

from sklearn import set_config
set_config(display='diagram')
clf
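The rendered diagram makes the structure built above easy to verify at a glance: the ColumnTransformer with its 'num' and 'cat' branches nested inside the two-step Pipeline.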
StandardScaler
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
Numeric features often come in different units, so their values carry different meanings and span very different ranges, some large, some small. Without standardization, features with a large range dominate the model, while the influence of features with a small range may be effectively discarded. Standardizing all features to a similar range balances the influence of each feature.
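A quick illustration of this point (the toy array is my own, not from the scikit-learn docs): two features whose spreads differ by four orders of magnitude end up with identical unit spread after scaling.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])
print(X.std(axis=0))                                   # raw spreads: ~816.5 vs ~0.08
print(StandardScaler().fit_transform(X).std(axis=0))   # both exactly 1.0 after scaling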
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
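A minimal sketch of that sparse usage (the toy matrix is made up for demonstration): with centering disabled, zeros stay zeros and only a per-feature division by the standard deviation is applied.

from scipy import sparse
from sklearn.preprocessing import StandardScaler

X = sparse.csr_matrix([[0., 1.],
                       [2., 0.],
                       [0., 3.]])
scaler = StandardScaler(with_mean=False)  # no centering: sparsity preserved
Xs = scaler.fit_transform(X)
print(scaler.scale_)     # per-feature standard deviations used as divisors
print(Xs.toarray())      # zeros untouched, nonzeros rescaled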
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
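The last call checks out by hand against z = (x - u) / s: each column has mean u = 0.5 and standard deviation s = 0.5, so z = (2 - 0.5) / 0.5 = 3.0.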
Use ColumnTransformer by selecting columns by data type
When all columns of a given data type share the same processing steps, a column selector can pick out the columns of that type and route them to the matching transformer or preprocessing sub-pipeline.
When dealing with a cleaned dataset, the preprocessing can be automatic by using the data types of the column to decide whether to treat a column as a numerical or categorical feature. sklearn.compose.make_column_selector gives this possibility. First, let's only select a subset of columns to simplify our example.

subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
X_train, X_test = X_train[subset_feature], X_test[subset_feature]
Then, we introspect the information regarding each column data type.
X_train.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   float64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(3)
memory usage: 35.0 KB

We can observe that the embarked and sex columns were tagged as category columns when loading the data with fetch_openml. Therefore, we can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numerical_transformer.
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="category")),
    ('cat', categorical_transformer, selector(dtype_include="category"))
])
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
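Besides dtype_include and dtype_exclude, make_column_selector can also match columns by a regular expression on their names via its pattern argument; for instance, selector(pattern='age|fare') would pick the two numeric columns of this subset by name.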
Using the prediction pipeline in a grid search
Use grid search to tune the parameters of the pipeline's steps, including the numeric imputation strategy and the classifier's hyperparameters.
Grid search can also be performed on the different preprocessing steps defined in the ColumnTransformer object, together with the classifier's hyperparameters as part of the Pipeline. We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using GridSearchCV.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search
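The double-underscore names address nested parameters: preprocessor__num__imputer__strategy walks from the 'preprocessor' step to its 'num' transformer to its 'imputer' step's strategy parameter. If in doubt, the valid names can be listed with get_params() (a quick introspection snippet, not part of the original example):

# Print every tunable parameter path that mentions the imputer.
for name in sorted(clf.get_params().keys()):
    if 'imputer' in name:
        print(name)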
Inspect the parameters that produced the best result.
Calling ‘fit’ triggers the cross-validated search for the best hyper-parameters combination:
grid_search.fit(X_train, y_train)
print("Best params:")
print(grid_search.best_params_)
Out:
Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'mean'}

The internal cross-validation score obtained with those parameters is:
print(f"Internal CV score: {grid_search.best_score_:.3f}")
Out:
Internal CV score: 0.784
Inspect the top five scoring parameter combinations from the search.
We can also introspect the top grid search results as a pandas dataframe:
import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[["mean_test_score", "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C"]].head(5)
   mean_test_score  std_test_score param_preprocessor__num__imputer__strategy param_classifier__C
0         0.784167        0.035824                                        mean                 0.1
2         0.780366        0.032722                                        mean                   1
1         0.780348        0.037245                                      median                 0.1
4         0.779414        0.033105                                        mean                  10
6         0.779414        0.033105                                        mean                 100
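Note that C=10 and C=100 produce identical scores, which suggests that at those values the regularization penalty has become too weak to change the fitted model on this small feature set.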
The best hyper-parameters have been used to re-fit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.
print(("best logistic regression from grid search: %.3f" % grid_search.score(X_test, y_test)))
Out:
best logistic regression from grid search: 0.794
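Because GridSearchCV uses refit=True by default, the grid_search object already wraps the re-fitted best pipeline, so it can be used for prediction directly. For example (a usage sketch, not part of the original example):

# Predict with the best pipeline found by the search.
predictions = grid_search.best_estimator_.predict(X_test)
print(predictions[:5])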