One-Hot Encoding
阿新 • Published: 2019-01-13
One-hot encoding: the standard approach for categorical features.
Categorical feature: e.g., the color of flowers: yellow, red, green.
One-hot encoding: an encoding scheme that uses as many bits as there are states (category values), with exactly one bit set to 1 and all the others 0.
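As a minimal sketch of that definition in plain Python (the helper name is made up for illustration):

```python
categories = ['yellow', 'red', 'green']

def one_hot(value, categories):
    """Bit vector with as many bits as categories and a single 1."""
    return [1 if c == value else 0 for c in categories]

print(one_hot('red', categories))    # [0, 1, 0]
print(one_hot('green', categories))  # [0, 0, 1]
```

Every encoded vector sums to 1, since exactly one bit is set.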
Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this:
one_hot_encoded_data = pd.get_dummies(data)
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
    Convert categorical variable into dummy/indicator variables

    Parameters
    ----------
    data : array-like, Series, or DataFrame
    prefix : string, list of strings, or dict of strings, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy columns should be sparse or not. Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing
        the first level.

        .. versionadded:: 0.18.0

    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.

        .. versionadded:: 0.23.0

    Returns
    -------
    dummies : DataFrame or SparseDataFrame

    Examples
    --------
    >>> import pandas as pd
    >>> s = pd.Series(list('abca'))
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0

    >>> s1 = ['a', 'b', np.nan]
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0

    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1

    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1

    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0

    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0

    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

    See Also
    --------
    Series.str.get_dummies
align:
final_train_predictors, final_test_predictors = one_hot_encoded_training_data_predictors.align(one_hot_encoded_test_data_predictors, join='left', axis=1, fill_value=0)
#axis=1: align on columns
#join='left': keep exactly the columns from our training data
#fill_value=0: fill 0 wherever alignment leaves a hole (the default fill is NaN)
#align
help(one_hot_encoded_X.align)
Help on method align in module pandas.core.frame:
align(self, other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None) method of pandas.core.frame.DataFrame instance
Align two objects on their axes with the specified join method for each axis Index
Parameters
----------
other : DataFrame or Series
join : {'outer', 'inner', 'left', 'right'}, default 'outer'
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the
passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is
required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any
"compatible" value
method : str, default None
limit : int, default None
fill_axis : {0 or 'index', 1 or 'columns'}, default 0
Filling axis, method and limit
broadcast_axis : {0 or 'index', 1 or 'columns'}, default None
Broadcast values along this axis, if aligning two objects of
different dimensions
Returns
-------
(left, right) : (DataFrame, type of other)
Aligned objects
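A toy sketch of what align with join='left' does on the column axis (the frames and column names here are made up):

```python
import pandas as pd

train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
test = pd.DataFrame({'a': [7], 'b': [8], 'd': [9]})  # 'c' missing, extra 'd'

# join='left' keeps exactly train's columns: test's extra 'd' is dropped,
# and fill_value=0 fills the 'c' column that test lacks (default is NaN).
left, right = train.align(test, join='left', axis=1, fill_value=0)
print(list(right.columns))  # ['a', 'b', 'c']
print(right['c'].tolist())  # [0]
```

The training frame comes back unchanged; only the test frame is reshaped to match it.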
Example: watermelon dataset 3.0
# -*- coding: utf-8 -*-
import pandas as pd
watermelon_data= pd.read_csv(r'G:\kaggle\watermelon_3.csv')
watermelon_data
  | 編號 | 色澤 | 根蒂 | 敲聲 | 紋理 | 臍部 | 觸感 | 密度 | 含糖率 | 好瓜 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 青綠 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 | 是 |
1 | 2 | 烏黑 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 | 是 |
2 | 3 | 烏黑 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 | 是 |
3 | 4 | 青綠 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 | 是 |
4 | 5 | 淺白 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 | 是 |
5 | 6 | 青綠 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 軟粘 | 0.403 | 0.237 | 是 |
6 | 7 | 烏黑 | 稍蜷 | 濁響 | 稍糊 | 稍凹 | 軟粘 | 0.481 | 0.149 | 是 |
7 | 8 | 烏黑 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 | 是 |
8 | 9 | 烏黑 | 稍蜷 | 沉悶 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 | 否 |
9 | 10 | 青綠 | 硬挺 | 清脆 | 清晰 | 平坦 | 軟粘 | 0.243 | 0.267 | 否 |
10 | 11 | 淺白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 0.245 | 0.057 | 否 |
11 | 12 | 淺白 | 蜷縮 | 濁響 | 模糊 | 平坦 | 軟粘 | 0.343 | 0.099 | 否 |
12 | 13 | 青綠 | 稍蜷 | 濁響 | 稍糊 | 凹陷 | 硬滑 | 0.639 | 0.161 | 否 |
13 | 14 | 淺白 | 稍蜷 | 沉悶 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 | 否 |
14 | 15 | 烏黑 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 軟粘 | 0.360 | 0.370 | 否 |
15 | 16 | 淺白 | 蜷縮 | 濁響 | 模糊 | 平坦 | 硬滑 | 0.593 | 0.042 | 否 |
16 | 17 | 青綠 | 蜷縮 | 沉悶 | 稍糊 | 稍凹 | 硬滑 | 0.719 | 0.103 | 否 |
Problem encountered: the Chinese text read from the file was garbled.
Fix: open the csv file in Notepad, choose Save As, and in the encoding option that appears select UTF-8.
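Alternatively, the encoding can be handled in code: pd.read_csv accepts an `encoding` argument, so a file saved as GBK (common for Chinese-locale Windows tools) can often be read directly without re-saving it. A sketch, with the demo file written here purely for illustration:

```python
import pandas as pd

# simulate a GBK-encoded CSV such as one exported on Chinese Windows
with open('watermelon_gbk_demo.csv', 'w', encoding='gbk') as f:
    f.write('編號,色澤\n1,青綠\n2,烏黑\n')

# passing encoding='gbk' decodes the Chinese correctly
df = pd.read_csv('watermelon_gbk_demo.csv', encoding='gbk')
print(df['色澤'].tolist())  # ['青綠', '烏黑']
```

If the true encoding of a file is unknown, trying 'utf-8' first and then 'gbk' covers the two most common cases.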
watermelon_data.dtypes
編號 int64
色澤 object
根蒂 object
敲聲 object
紋理 object
臍部 object
觸感 object
密度 float64
含糖率 float64
好瓜 object
dtype: object
# how many distinct values does 色澤 (color) take?
watermelon_data['色澤'].nunique()
3
# check whether the values of 色澤 are non-numeric (object dtype)
watermelon_data['色澤'].dtype=="object"
True
watermelon_data['含糖率'].dtype=="float"
True
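Rather than checking columns one at a time, select_dtypes can list every non-numeric (object-dtype) column in one call; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'色澤': ['青綠', '烏黑'],
                   '密度': [0.697, 0.774]})

# object-dtype columns are exactly the ones pd.get_dummies would encode
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(categorical_cols)  # ['色澤']
```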
#target
y=watermelon_data['好瓜']
#X
X=watermelon_data.drop(['編號','好瓜'], axis=1)
X
#---------- alternatively ------------
#features=...
#X= watermelon_data[features]
  | 色澤 | 根蒂 | 敲聲 | 紋理 | 臍部 | 觸感 | 密度 | 含糖率 |
---|---|---|---|---|---|---|---|---|
0 | 青綠 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 |
1 | 烏黑 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 |
2 | 烏黑 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 |
3 | 青綠 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 |
4 | 淺白 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 |
5 | 青綠 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 軟粘 | 0.403 | 0.237 |
6 | 烏黑 | 稍蜷 | 濁響 | 稍糊 | 稍凹 | 軟粘 | 0.481 | 0.149 |
7 | 烏黑 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 |
8 | 烏黑 | 稍蜷 | 沉悶 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 |
9 | 青綠 | 硬挺 | 清脆 | 清晰 | 平坦 | 軟粘 | 0.243 | 0.267 |
10 | 淺白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 0.245 | 0.057 |
11 | 淺白 | 蜷縮 | 濁響 | 模糊 | 平坦 | 軟粘 | 0.343 | 0.099 |
12 | 青綠 | 稍蜷 | 濁響 | 稍糊 | 凹陷 | 硬滑 | 0.639 | 0.161 |
13 | 淺白 | 稍蜷 | 沉悶 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 |
14 | 烏黑 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 軟粘 | 0.360 | 0.370 |
15 | 淺白 | 蜷縮 | 濁響 | 模糊 | 平坦 | 硬滑 | 0.593 | 0.042 |
16 | 青綠 | 蜷縮 | 沉悶 | 稍糊 | 稍凹 | 硬滑 | 0.719 | 0.103 |
# handle the categorical features: one-hot encoding
one_hot_encoded_X= pd.get_dummies(X)
one_hot_encoded_X
  | 密度 | 含糖率 | 色澤_烏黑 | 色澤_淺白 | 色澤_青綠 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷縮 | 敲聲_沉悶 | 敲聲_濁響 | 敲聲_清脆 | 紋理_模糊 | 紋理_清晰 | 紋理_稍糊 | 臍部_凹陷 | 臍部_平坦 | 臍部_稍凹 | 觸感_硬滑 | 觸感_軟粘 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.634 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.556 | 0.215 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
5 | 0.403 | 0.237 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
6 | 0.481 | 0.149 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
7 | 0.437 | 0.211 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
8 | 0.666 | 0.091 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
9 | 0.243 | 0.267 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
10 | 0.245 | 0.057 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
11 | 0.343 | 0.099 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
12 | 0.639 | 0.161 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
13 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
14 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0.593 | 0.042 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
16 | 0.719 | 0.103 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
from sklearn.tree import DecisionTreeClassifier
#model
model= DecisionTreeClassifier()
model
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
#fit
clf= model.fit(one_hot_encoded_X, y)
clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
# predict for [青綠, 蜷縮, 沉悶, 清晰, 凹陷, 硬滑, 0.608, 0.300]
# feature order must match one_hot_encoded_X.columns: 密度 and 含糖率 first, then the dummy columns
clf.predict([[0.608, 0.300, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]])
array(['\xe5\x90\xa6'], dtype=object)
How can we see the Chinese characters? (The escapes above are how Python 2 displays the UTF-8 bytes of the label.) Separately: if there are multiple files (a test dataset, other data to predict on), note that sklearn is sensitive to column order, so if the training and test sets are not aligned, the results will be meaningless.
This can happen if a categorical feature has a different number of values in the training data vs. the test data.
How do we make sure the test data is encoded the same way as the training data?
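On seeing the Chinese: the byte escapes are just Python 2's repr of UTF-8 bytes, and decoding them recovers the label text (under Python 3 the array would display 是/否 directly):

```python
# the bytes Python 2 printed for the predicted label
raw = b'\xe5\x90\xa6'
print(raw.decode('utf-8'))  # 否
```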
Suppose:
test data: watermelon_3_test.csv (csv file encoding: UTF-8)
import pandas as pd
watermelon_test_data= pd.read_csv(r'G:\kaggle\watermelon_3_test.csv')
watermelon_test_data= watermelon_test_data.drop(['編號'], axis=1)
watermelon_test_data
# note: 紋理 has only 2 values in the test file but 3 in the training data, so the one-hot encodings will not match
  | 色澤 | 根蒂 | 敲聲 | 紋理 | 臍部 | 觸感 | 密度 | 含糖率 |
---|---|---|---|---|---|---|---|---|
0 | 烏黑 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 |
1 | 烏黑 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 |
2 | 烏黑 | 蜷縮 | 濁響 | 清晰 | 凹陷 | 硬滑 | 0.611 | 0.264 |
3 | 青綠 | 蜷縮 | 沉悶 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 |
4 | 青綠 | 稍蜷 | 濁響 | 稍糊 | 稍凹 | 硬滑 | 0.639 | 0.172 |
5 | 淺白 | 稍蜷 | 沉悶 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 |
6 | 烏黑 | 稍蜷 | 濁響 | 清晰 | 稍凹 | 軟粘 | 0.360 | 0.370 |
one_hot_encoded_watermelon_test_data= pd.get_dummies(watermelon_test_data)
one_hot_encoded_watermelon_test_data
  | 密度 | 含糖率 | 色澤_烏黑 | 色澤_淺白 | 色澤_青綠 | 根蒂_稍蜷 | 根蒂_蜷縮 | 敲聲_沉悶 | 敲聲_濁響 | 紋理_清晰 | 紋理_稍糊 | 臍部_凹陷 | 臍部_稍凹 | 觸感_硬滑 | 觸感_軟粘 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
2 | 0.611 | 0.264 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4 | 0.639 | 0.172 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
5 | 0.657 | 0.198 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
6 | 0.360 | 0.370 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
How do we keep the test set's encoding (one_hot_encoded_watermelon_test_data) aligned with the training set's (one_hot_encoded_X)?
final_train, final_test = one_hot_encoded_X.align(one_hot_encoded_watermelon_test_data, join='left', axis=1, fill_value=0)
#axis=1: align on columns
#join='left': keep exactly the columns from our training data
#fill_value=0: fill 0 wherever alignment leaves a hole (the default fill is NaN)
final_train
  | 密度 | 含糖率 | 色澤_烏黑 | 色澤_淺白 | 色澤_青綠 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷縮 | 敲聲_沉悶 | 敲聲_濁響 | 敲聲_清脆 | 紋理_模糊 | 紋理_清晰 | 紋理_稍糊 | 臍部_凹陷 | 臍部_平坦 | 臍部_稍凹 | 觸感_硬滑 | 觸感_軟粘 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.634 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.556 | 0.215 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
5 | 0.403 | 0.237 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
6 | 0.481 | 0.149 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
7 | 0.437 | 0.211 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
8 | 0.666 | 0.091 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
9 | 0.243 | 0.267 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
10 | 0.245 | 0.057 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
11 | 0.343 | 0.099 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
12 | 0.639 | 0.161 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
13 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
14 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0.593 | 0.042 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
16 | 0.719 | 0.103 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
final_test
  | 密度 | 含糖率 | 色澤_烏黑 | 色澤_淺白 | 色澤_青綠 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷縮 | 敲聲_沉悶 | 敲聲_濁響 | 敲聲_清脆 | 紋理_模糊 | 紋理_清晰 | 紋理_稍糊 | 臍部_凹陷 | 臍部_平坦 | 臍部_稍凹 | 觸感_硬滑 | 觸感_軟粘 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.611 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.639 | 0.172 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
5 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
6 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
clf.predict(final_test)
array(['\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf',
'\xe5\x90\xa6', '\xe5\x90\xa6', '\xe5\x90\xa6'], dtype=object)
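As an aside, the same left-alignment can also be expressed with reindex, which reorders the test columns to match the training columns and fills absent dummy columns with 0. A toy sketch with made-up frames:

```python
import pandas as pd

train = pd.DataFrame({'密度': [0.6], '色澤_青綠': [1], '色澤_烏黑': [0]})
test = pd.DataFrame({'色澤_青綠': [0], '密度': [0.7]})  # missing 色澤_烏黑

# reindex to the training columns; dummies the test set lacks become 0
test_fixed = test.reindex(columns=train.columns, fill_value=0)
print(list(test_fixed.columns))  # ['密度', '色澤_青綠', '色澤_烏黑']
print(test_fixed['色澤_烏黑'].tolist())  # [0]
```

Either way, the crucial property is the same: the test matrix ends up with exactly the training columns, in exactly the training order.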