Handling Categorical Features in Data Mining

阿新 • Published: 2021-01-12

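Categorical features, often called discrete features, usually have the object dtype. Most machine learning models can only handle numeric data, so categorical data has to be converted into numeric features.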

There are two kinds of categorical features. We need to understand what each one means and convert it accordingly.

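Ordinal: categories with a natural order, so the values can be sorted in ascending or descending order. For example, a grade feature may take the values A, B, C, and D, and ranking them by performance gives A > B > C > D.
Nominal: ordinary categories with no inherent order, so the values cannot be sorted. For example, a blood type feature may take the values A, B, O, and AB, but you cannot conclude that A > B > O > AB.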

Ordinal and nominal data are converted to numbers in different ways.

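Ordinal data can be encoded with LabelEncoder. For example, the grades A, B, C, and D are mapped to 0, 1, 2, and 3, which preserves their natural order. Note that LabelEncoder numbers the labels in sorted order, so the natural order survives only when sorting the labels happens to match it; otherwise an explicit mapping is needed.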
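The behaviour described above can be checked with a small sketch. The sample values and variable names here are illustrative assumptions, and OrdinalEncoder with an explicit category list is shown as one possible alternative when alphabetical order does not match the natural order; it is not something the original post prescribes.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data (assumed for illustration): letter grades.
grades = ["B", "A", "D", "C", "A"]

# LabelEncoder sorts the distinct labels and numbers them from 0,
# so A, B, C, D become 0, 1, 2, 3.
label_encoder = LabelEncoder()
grade_codes = label_encoder.fit_transform(grades)
print(list(label_encoder.classes_))
print(grade_codes.tolist())

# Pitfall: sorted order is alphabetical, not semantic.
# "high" < "low" < "medium" alphabetically, so the natural order is lost.
levels = ["low", "medium", "high"]
level_codes = LabelEncoder().fit_transform(levels)
print(level_codes.tolist())

# One alternative: state the order explicitly with OrdinalEncoder.
# It expects a 2D input (one column per feature) and returns floats.
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordinal_encoder.fit_transform([[v] for v in levels])
print(ordered_codes.ravel().tolist())
```

In scikit-learn, LabelEncoder is documented as a tool for target labels, so for input features an explicit mapping or OrdinalEncoder is usually the safer choice.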

Nominal data can be encoded with OneHotEncoder, or with pandas as follows:

Use the pandas get_dummies() function to return a new DataFrame containing one column for each dummy variable.
Use the concat() function to add these dummy columns back to the original DataFrame.
Then drop the original column entirely using the drop() method.
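The three steps above can be sketched as follows. The DataFrame, its column names, and its values are made up for illustration; only get_dummies(), concat(), and drop() come from the text.

```python
import pandas as pd

# Toy data (assumed for illustration): one nominal column, one numeric column.
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "AB", "A"],
    "age": [31, 45, 27, 52, 38],
})

# Step 1: build one dummy column per category.
dummies = pd.get_dummies(df["blood_type"], prefix="blood_type")

# Step 2: attach the dummy columns to the original DataFrame.
df_encoded = pd.concat([df, dummies], axis=1)

# Step 3: drop the original categorical column.
df_encoded = df_encoded.drop(columns=["blood_type"])

print(df_encoded.columns.tolist())
```

Depending on the pandas version, the dummy columns hold booleans or small integers; calling .astype(int) on them gives plain 0/1 values either way.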

If you are dealing with an ordinal feature, you map its values to 1, 2, 3, 4 or 3, 2, 1 or whatever, if they are not already mapped. An ordinal feature is one whose values can be arranged in some order that makes logical sense. For example, you have a feature “Size” with alphanumeric values, let’s say “small, medium, big”; indeed “big” is bigger than “small”, so you can compare those values and it will make sense. You map “small, medium, big” to 1, 2, 3, for example. Example in Titanic: Pclass is an ordinal feature, since Pclass=1 is better than Pclass=3. Note that in this case the Pclass feature is already mapped to 1, 2, 3, so you don’t have to do anything with it. You would have to map it if Pclass contained alphanumeric values like “high_class, medium_class, low_class”.
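A minimal sketch of the “Size” mapping described above, using a plain dictionary with pandas. The sample rows and the name of the new column are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed for illustration).
df = pd.DataFrame({"Size": ["small", "big", "medium", "small"]})

# Explicit mapping that encodes the order small < medium < big.
size_order = {"small": 1, "medium": 2, "big": 3}
df["Size_encoded"] = df["Size"].map(size_order)

print(df["Size_encoded"].tolist())
```

Values missing from the dictionary come out as NaN, which makes unexpected categories easy to spot.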

If you are dealing with a categorical feature, you look at how many categories (possible values of that particular feature) you have. If you have only 2 categories, you map them to 0 and 1 or to -1 and 1, and that’s it. If you have more than 2 categories, you create dummy variables. Example in Titanic: Sex is a categorical variable with 2 categories, ‘male’ and ‘female’; you map them to 0 and 1, for example, and that’s it. Note that it’s not ordinal, because male is neither better nor worse than female; you can’t logically compare them. Now, Embarked is a categorical feature too, but it has 3 categories instead of just 2. You make dummy variables out of this feature, and you make just 2, not 3, because the 3rd one is redundant. Well, this feature is redundant by itself, but anyway.
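The Sex and Embarked handling described above might look like this. The rows are toy data rather than the real Titanic file, and drop_first=True is used here as one way to keep only 2 of the 3 dummy columns.

```python
import pandas as pd

# Toy rows (assumed for illustration), using Titanic-style column names.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Two categories: a simple 0/1 mapping is enough.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Three categories: create dummies and drop the first one (here "C"),
# because it is implied when the other two columns are both 0.
embarked = pd.get_dummies(df["Embarked"], prefix="Embarked", drop_first=True)
df = pd.concat([df.drop(columns=["Embarked"]), embarked], axis=1)

print(df.columns.tolist())
```

A row with Embarked_Q and Embarked_S both equal to 0 therefore stands for the dropped category.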

Edit: following further discussion, there are cases when turning ordinal features into dummies may improve your score a bit. It’s hard to tell beforehand, so it can be useful to make 2 sets of features, one with the ordinal data and the other with the ordinal data one-hot encoded, compare the results on various models, and pick the one that works best in your specific case.
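One way to set up such a comparison is sketched below. Everything in it is an assumption made for illustration: the data is synthetic, the target is randomly generated, and logistic regression with 5-fold cross-validation stands in for whatever models and metric you actually use. The scores it prints say nothing about real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed): an ordinal Pclass-like feature and a binary target.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"Pclass": rng.integers(1, 4, size=n)})
positive_rate = df["Pclass"].map({1: 0.6, 2: 0.5, 3: 0.25}).to_numpy()
y = (rng.random(n) < positive_rate).astype(int)

# Feature set 1: keep the ordinal codes as a single numeric column.
X_ordinal = df[["Pclass"]]

# Feature set 2: the same feature turned into dummy columns.
X_onehot = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True).astype(int)

# Score both feature sets with the same model and the same folds.
model = LogisticRegression()
scores = {
    "ordinal": cross_val_score(model, X_ordinal, y, cv=5).mean(),
    "one-hot": cross_val_score(model, X_onehot, y, cv=5).mean(),
}
print(scores)
```

In practice you would repeat this for each model you are considering and keep whichever encoding scores better on your own validation setup.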
