
One-Hot Encoding


One-hot encoding: the standard approach for categorical features

Categorical feature: for example, the color of flowers: yellow, red, green.


One-hot encoding: a coding scheme that uses as many bits as there are states (category values); exactly one bit is 1 and all the others are 0.
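To make the definition concrete, here is a minimal sketch that encodes the flower colors by hand, without any library. The helper name `one_hot` and the choice to sort the category values are assumptions made for this illustration only.

```python
# Minimal sketch of one-hot encoding without any library.
colors = ["yellow", "red", "green", "red"]

# Fix an order for the distinct category values (sorted for reproducibility).
categories = sorted(set(colors))

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(c, categories) for c in colors]

for color, bits in zip(colors, encoded):
    print(color, bits)
```

With three distinct colors, every encoded row has three positions and exactly one of them is 1.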


One-hot encoding
Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this:

one_hot_encoded_data = pd.get_dummies(data)
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
    Convert categorical variable into dummy/indicator variables
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
    prefix : string, list of strings, or dict of strings, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix.`
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy columns should be sparse or not.  Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    
        .. versionadded:: 0.18.0
    
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
        .. versionadded:: 0.23.0
    
    Returns
    -------
    dummies : DataFrame or SparseDataFrame
    
    Examples
    --------
    >>> import pandas as pd
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
    
    See Also
    --------
    Series.str.get_dummies
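As a small illustration of the parameters documented above, the sketch below encodes a single column of a toy DataFrame (the column names and values are made up for this example). It passes `dtype=int` explicitly, because the default dtype of the dummy columns has differed between pandas versions.

```python
import pandas as pd

# Toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["yellow", "red", "green", "red"],
    "size": ["S", "M", "S", "L"],
    "price": [1.5, 2.0, 0.5, 2.5],
})

# Encode only "color"; "size" and "price" are passed through unchanged.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded.columns.tolist())

# drop_first=True keeps k-1 dummies: the first level ("green") is dropped.
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

When `columns` is left as None, every column of `object` or `category` dtype is encoded, which is what the watermelon example later relies on.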

align:

final_train_predictors, final_test_predictors = one_hot_encoded_training_data_predictors.align(one_hot_encoded_test_data_predictors, join='left', axis=1, fill_value=0)
# axis=1: align on columns
# join='left': keep exactly the columns from our training data
# fill_value=0: fill positions that have no value after alignment with 0 (the default fill is NaN)
help(one_hot_encoded_X.align)
Help on method align in module pandas.core.frame:

align(self, other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None) method of pandas.core.frame.DataFrame instance
    Align two objects on their axes with the
    specified join method for each axis Index
    
    Parameters
    ----------
    other : DataFrame or Series
    join : {'outer', 'inner', 'left', 'right'}, default 'outer'
    axis : allowed axis of the other object, default None
        Align on index (0), columns (1), or both (None)
    level : int or level name, default None
        Broadcast across a level, matching Index values on the
        passed MultiIndex level
    copy : boolean, default True
        Always returns new objects. If copy=False and no reindexing is
        required then original objects are returned.
    fill_value : scalar, default np.NaN
        Value to use for missing values. Defaults to NaN, but can be any
        "compatible" value
    method : str, default None
    limit : int, default None
    fill_axis : {0 or 'index', 1 or 'columns'}, default 0
        Filling axis, method and limit
    broadcast_axis : {0 or 'index', 1 or 'columns'}, default None
        Broadcast values along this axis, if aligning two objects of
        different dimensions
    
    Returns
    -------
    (left, right) : (DataFrame, type of other)
        Aligned objects
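The effect of `join` and `fill_value` is easier to see on a tiny example. The two frames below are hypothetical stand-ins for encoded training and test data; the sketch only uses the `join`, `axis` and `fill_value` parameters listed above.

```python
import pandas as pd

# Hypothetical encoded frames: the test frame lacks "c_b" and has an extra "c_z".
train = pd.DataFrame({"x": [0.1, 0.2], "c_a": [1, 0], "c_b": [0, 1]})
test = pd.DataFrame({"c_a": [1], "c_z": [1], "x": [0.3]})

# join='left' keeps exactly the columns of `train`, in the order used by `train`.
left, right = train.align(test, join="left", axis=1, fill_value=0)
print(right)

# join='outer' (the default) keeps the union of columns; gaps become NaN
# because no fill_value is given.
outer_left, outer_right = train.align(test, join="outer", axis=1)
print(outer_left)
```

With `join='left'`, the test-only column `c_z` is dropped and the missing column `c_b` is created and filled with 0, so both results share one column layout.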

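Example: watermelon dataset 3.0

Column names: 編號 (ID), 色澤 (color), 根蒂 (stem), 敲聲 (knock sound), 紋理 (texture), 臍部 (navel), 觸感 (touch), 密度 (density), 含糖率 (sugar content), 好瓜 (good melon: 是 = yes, 否 = no).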

# -*- coding: utf-8 -*-
import pandas as pd
watermelon_data= pd.read_csv(r'G:\kaggle\watermelon_3.csv')
watermelon_data
編號 色澤 根蒂 敲聲 紋理 臍部 觸感 密度 含糖率 好瓜
0 1 青綠 蜷縮 濁響 清晰 凹陷 硬滑 0.697 0.460
1 2 烏黑 蜷縮 沉悶 清晰 凹陷 硬滑 0.774 0.376
2 3 烏黑 蜷縮 濁響 清晰 凹陷 硬滑 0.634 0.264
3 4 青綠 蜷縮 沉悶 清晰 凹陷 硬滑 0.608 0.318
4 5 淺白 蜷縮 濁響 清晰 凹陷 硬滑 0.556 0.215
5 6 青綠 稍蜷 濁響 清晰 稍凹 軟粘 0.403 0.237
6 7 烏黑 稍蜷 濁響 稍糊 稍凹 軟粘 0.481 0.149
7 8 烏黑 稍蜷 濁響 清晰 稍凹 硬滑 0.437 0.211
8 9 烏黑 稍蜷 沉悶 稍糊 稍凹 硬滑 0.666 0.091
9 10 青綠 硬挺 清脆 清晰 平坦 軟粘 0.243 0.267
10 11 淺白 硬挺 清脆 模糊 平坦 硬滑 0.245 0.057
11 12 淺白 蜷縮 濁響 模糊 平坦 軟粘 0.343 0.099
12 13 青綠 稍蜷 濁響 稍糊 凹陷 硬滑 0.639 0.161
13 14 淺白 稍蜷 沉悶 稍糊 凹陷 硬滑 0.657 0.198
14 15 烏黑 稍蜷 濁響 清晰 稍凹 軟粘 0.360 0.370
15 16 淺白 蜷縮 濁響 模糊 平坦 硬滑 0.593 0.042
16 17 青綠 蜷縮 沉悶 稍糊 稍凹 硬滑 0.719 0.103

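Problem encountered: the Chinese text was garbled when the file was read.
Solution: open the CSV file in Notepad, choose Save As, and select UTF-8 in the encoding option that appears.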
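An alternative to re-saving the file is to tell pandas which encoding the file uses through the `encoding` parameter of `read_csv`. The sketch below writes a small temporary file in GBK and reads it back; the choice of GBK and the file name are assumptions for illustration, since the original encoding of the watermelon CSV is not stated.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV in a legacy encoding (GBK is an assumed example).
csv_text = "編號,色澤\n1,青綠\n2,烏黑\n"
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="gbk") as f:
    f.write(csv_text)

# Passing the matching encoding lets pandas decode the text correctly.
df = pd.read_csv(path, encoding="gbk")
print(df)
```

If the encoding passed to `read_csv` does not match the file, the read typically fails or produces garbled text, which matches the symptom described above.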

watermelon_data.dtypes
編號       int64
色澤      object
根蒂      object
敲聲      object
紋理      object
臍部      object
觸感      object
密度     float64
含糖率    float64
好瓜      object
dtype: object
# number of distinct values in 色澤
watermelon_data['色澤'].nunique()
3
# check whether the values of 色澤 are non-numeric
watermelon_data['色澤'].dtype=="object"
True
watermelon_data['含糖率'].dtype=="float"
True
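The per-column checks above can be done for all columns at once. The following is a sketch on a small hypothetical subset of the data (three columns, four rows); it lists the non-numeric columns and how many distinct values each one has, which is also the number of dummy columns that one-hot encoding will create for it.

```python
import pandas as pd

# Hypothetical subset of the watermelon data, for illustration only.
df = pd.DataFrame({
    "色澤": ["青綠", "烏黑", "淺白", "烏黑"],
    "觸感": ["硬滑", "硬滑", "軟粘", "硬滑"],
    "密度": [0.697, 0.774, 0.556, 0.634],
})

# Columns that are not numeric are the candidates for one-hot encoding.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column.
cardinality = {col: df[col].nunique() for col in categorical_cols}
print(cardinality)
```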
#target
y=watermelon_data['好瓜']
#X
X=watermelon_data.drop(['編號','好瓜'], axis=1)
X

#----------or------------
#features=...
#X= watermelon_data[features]
色澤 根蒂 敲聲 紋理 臍部 觸感 密度 含糖率
0 青綠 蜷縮 濁響 清晰 凹陷 硬滑 0.697 0.460
1 烏黑 蜷縮 沉悶 清晰 凹陷 硬滑 0.774 0.376
2 烏黑 蜷縮 濁響 清晰 凹陷 硬滑 0.634 0.264
3 青綠 蜷縮 沉悶 清晰 凹陷 硬滑 0.608 0.318
4 淺白 蜷縮 濁響 清晰 凹陷 硬滑 0.556 0.215
5 青綠 稍蜷 濁響 清晰 稍凹 軟粘 0.403 0.237
6 烏黑 稍蜷 濁響 稍糊 稍凹 軟粘 0.481 0.149
7 烏黑 稍蜷 濁響 清晰 稍凹 硬滑 0.437 0.211
8 烏黑 稍蜷 沉悶 稍糊 稍凹 硬滑 0.666 0.091
9 青綠 硬挺 清脆 清晰 平坦 軟粘 0.243 0.267
10 淺白 硬挺 清脆 模糊 平坦 硬滑 0.245 0.057
11 淺白 蜷縮 濁響 模糊 平坦 軟粘 0.343 0.099
12 青綠 稍蜷 濁響 稍糊 凹陷 硬滑 0.639 0.161
13 淺白 稍蜷 沉悶 稍糊 凹陷 硬滑 0.657 0.198
14 烏黑 稍蜷 濁響 清晰 稍凹 軟粘 0.360 0.370
15 淺白 蜷縮 濁響 模糊 平坦 硬滑 0.593 0.042
16 青綠 蜷縮 沉悶 稍糊 稍凹 硬滑 0.719 0.103
# handle categorical features: one-hot encoding
one_hot_encoded_X= pd.get_dummies(X)
one_hot_encoded_X
密度 含糖率 色澤_烏黑 色澤_淺白 色澤_青綠 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷縮 敲聲_沉悶 敲聲_濁響 敲聲_清脆 紋理_模糊 紋理_清晰 紋理_稍糊 臍部_凹陷 臍部_平坦 臍部_稍凹 觸感_硬滑 觸感_軟粘
0 0.697 0.460 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.634 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.556 0.215 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
5 0.403 0.237 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
6 0.481 0.149 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1
7 0.437 0.211 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0
8 0.666 0.091 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0
9 0.243 0.267 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1
10 0.245 0.057 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0
11 0.343 0.099 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1
12 0.639 0.161 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
13 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
14 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
15 0.593 0.042 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0
16 0.719 0.103 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0
from sklearn.tree import DecisionTreeClassifier
#model
model= DecisionTreeClassifier()
model
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
#fit
clf= model.fit(one_hot_encoded_X, y)
clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
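# predict [青綠, 蜷縮, 沉悶, 清晰, 凹陷, 硬滑, 0.608, 0.300]
# The values must follow the column order of one_hot_encoded_X: 密度 and 含糖率 first, then the dummy columns.
clf.predict([[0.608,0.300, 0,0,1, 0,0,1, 1,0,0, 0,1,0, 1,0,0, 1,0]])
array(['\xe6\x98\xaf'], dtype=object)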
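Writing a 19-element vector by hand is error-prone, because every value has to sit at the position of its column in one_hot_encoded_X. One possible way to avoid this is to build the sample as a one-row DataFrame, encode it, and reindex it to the training columns. In the sketch below, `train_columns` is a shortened, hypothetical stand-in for `one_hot_encoded_X.columns`, and only two categorical features are used to keep it brief.

```python
import pandas as pd

# Shortened stand-in for one_hot_encoded_X.columns (assumption for brevity).
train_columns = ["密度", "含糖率", "色澤_烏黑", "色澤_淺白", "色澤_青綠",
                 "觸感_硬滑", "觸感_軟粘"]

# The sample is written with readable feature names instead of 0/1 positions.
sample = pd.DataFrame([{"色澤": "青綠", "觸感": "硬滑", "密度": 0.608, "含糖率": 0.300}])

# Encode the sample, then force the training column order;
# dummy columns that the sample does not produce are filled with 0.
encoded = pd.get_dummies(sample, dtype=int)
row = encoded.reindex(columns=train_columns, fill_value=0)
print(row.iloc[0].tolist())
```

The resulting `row` could then be passed to `clf.predict(row)` once `train_columns` is the full column list of the training data.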

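How can the Chinese labels be displayed? Escaped output such as '\xe5\x90\xa6' is the Python 2 repr of a UTF-8 byte string (here 否); decoding an element with .decode('utf-8') and printing it typically shows the character.

There may be several files (a test dataset, or other data to predict on). scikit-learn is sensitive to the order of columns, so if the training set and the test set are not aligned, the results will be meaningless.
This can happen if a categorical feature has a different number of values in the training data than in the test data.
How can we make sure the test data and the training data are encoded in the same way?
Suppose:
test data: watermelon_3_test.csv (CSV file encoding: UTF-8)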

import pandas as pd
watermelon_test_data= pd.read_csv(r'G:\kaggle\watermelon_3_test.csv')
watermelon_test_data= watermelon_test_data.drop(['編號'], axis=1)
watermelon_test_data
# Note: 紋理 has only 2 values in the test file but 3 in the training data, so the one-hot encodings will not match
色澤 根蒂 敲聲 紋理 臍部 觸感 密度 含糖率
0 烏黑 蜷縮 濁響 清晰 凹陷 硬滑 0.697 0.460
1 烏黑 蜷縮 沉悶 清晰 凹陷 硬滑 0.774 0.376
2 烏黑 蜷縮 濁響 清晰 凹陷 硬滑 0.611 0.264
3 青綠 蜷縮 沉悶 清晰 凹陷 硬滑 0.608 0.318
4 青綠 稍蜷 濁響 稍糊 稍凹 硬滑 0.639 0.172
5 淺白 稍蜷 沉悶 稍糊 凹陷 硬滑 0.657 0.198
6 烏黑 稍蜷 濁響 清晰 稍凹 軟粘 0.360 0.370
one_hot_encoded_watermelon_test_data= pd.get_dummies(watermelon_test_data)
one_hot_encoded_watermelon_test_data
密度 含糖率 色澤_烏黑 色澤_淺白 色澤_青綠 根蒂_稍蜷 根蒂_蜷縮 敲聲_沉悶 敲聲_濁響 紋理_清晰 紋理_稍糊 臍部_凹陷 臍部_稍凹 觸感_硬滑 觸感_軟粘
0 0.697 0.460 1 0 0 0 1 0 1 1 0 1 0 1 0
1 0.774 0.376 1 0 0 0 1 1 0 1 0 1 0 1 0
2 0.611 0.264 1 0 0 0 1 0 1 1 0 1 0 1 0
3 0.608 0.318 0 0 1 0 1 1 0 1 0 1 0 1 0
4 0.639 0.172 0 0 1 1 0 0 1 0 1 0 1 1 0
5 0.657 0.198 0 1 0 1 0 1 0 0 1 1 0 1 0
6 0.360 0.370 1 0 0 1 0 0 1 1 0 0 1 0 1
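The mismatch can be listed explicitly by comparing the two sets of column names. The sketch below copies the column names from the two encoded tables shown in this example into plain lists (the variable names are chosen for this illustration).

```python
# Column names of one_hot_encoded_X (training data), copied from the table above.
train_cols = ["密度", "含糖率", "色澤_烏黑", "色澤_淺白", "色澤_青綠",
              "根蒂_硬挺", "根蒂_稍蜷", "根蒂_蜷縮",
              "敲聲_沉悶", "敲聲_濁響", "敲聲_清脆",
              "紋理_模糊", "紋理_清晰", "紋理_稍糊",
              "臍部_凹陷", "臍部_平坦", "臍部_稍凹",
              "觸感_硬滑", "觸感_軟粘"]

# Column names of one_hot_encoded_watermelon_test_data.
test_cols = ["密度", "含糖率", "色澤_烏黑", "色澤_淺白", "色澤_青綠",
             "根蒂_稍蜷", "根蒂_蜷縮", "敲聲_沉悶", "敲聲_濁響",
             "紋理_清晰", "紋理_稍糊", "臍部_凹陷", "臍部_稍凹",
             "觸感_硬滑", "觸感_軟粘"]

# Columns the model was trained on that the test encoding lacks, and vice versa.
missing_in_test = [c for c in train_cols if c not in test_cols]
extra_in_test = [c for c in test_cols if c not in train_cols]

print(missing_in_test)
print(extra_in_test)
```

With DataFrames at hand, the same comparison can be written against `one_hot_encoded_X.columns` and `one_hot_encoded_watermelon_test_data.columns` directly.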

How do we keep one_hot_encoded_watermelon_test_data (from the test CSV) aligned with the training set one_hot_encoded_X?

final_train, final_test = one_hot_encoded_X.align(one_hot_encoded_watermelon_test_data, join='left', axis=1, fill_value=0)
# axis=1: align on columns
# join='left': keep exactly the columns from our training data
# fill_value=0: fill positions that have no value after alignment with 0 (the default fill is NaN)
final_train
密度 含糖率 色澤_烏黑 色澤_淺白 色澤_青綠 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷縮 敲聲_沉悶 敲聲_濁響 敲聲_清脆 紋理_模糊 紋理_清晰 紋理_稍糊 臍部_凹陷 臍部_平坦 臍部_稍凹 觸感_硬滑 觸感_軟粘
0 0.697 0.460 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.634 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.556 0.215 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
5 0.403 0.237 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
6 0.481 0.149 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1
7 0.437 0.211 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0
8 0.666 0.091 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0
9 0.243 0.267 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1
10 0.245 0.057 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0
11 0.343 0.099 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1
12 0.639 0.161 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
13 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
14 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
15 0.593 0.042 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0
16 0.719 0.103 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0
final_test
密度 含糖率 色澤_烏黑 色澤_淺白 色澤_青綠 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷縮 敲聲_沉悶 敲聲_濁響 敲聲_清脆 紋理_模糊 紋理_清晰 紋理_稍糊 臍部_凹陷 臍部_平坦 臍部_稍凹 觸感_硬滑 觸感_軟粘
0 0.697 0.460 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.611 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.639 0.172 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0
5 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
6 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
clf.predict(final_test)
array(['\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf',
       '\xe5\x90\xa6', '\xe5\x90\xa6', '\xe5\x90\xa6'], dtype=object)
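A different way to get matching columns, shown here only as a sketch and not as the method used above, is to fix the category list from the training data with `pd.CategoricalDtype` before encoding. `get_dummies` then emits one column per declared category, even for values that do not occur in the test data. The two single-column frames are hypothetical.

```python
import pandas as pd

# Hypothetical data: the test frame never contains 模糊.
train = pd.DataFrame({"紋理": ["清晰", "稍糊", "模糊"]})
test = pd.DataFrame({"紋理": ["清晰", "稍糊"]})

# Learn the category list from the training data only.
texture_dtype = pd.CategoricalDtype(categories=sorted(train["紋理"].unique()))

# Encoding a categorical column produces a dummy for every declared category.
train_enc = pd.get_dummies(train.astype({"紋理": texture_dtype}), dtype=int)
test_enc = pd.get_dummies(test.astype({"紋理": texture_dtype}), dtype=int)

print(test_enc.columns.tolist())
```

Values in the test data that are not among the declared categories become missing after the cast and are encoded as all zeros, which is comparable to what `join='left'` does when it drops test-only columns.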