Handling Categorical Features in Data Mining
阿新 • Published: 2021-01-12
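Categorical features are often also called discrete features, and their data type is usually object. Machine learning models can generally only handle numeric data, so categorical data has to be converted into numeric features.
There are two kinds of categorical features, and we need to understand what each one means in order to convert it appropriately.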
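- Ordinal: categories of this type have a natural order, so the values can be sorted in ascending or descending order. For example, a school grade feature might take the four values A, B, C, D, and ranking them by performance gives A > B > C > D.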
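- Nominal: this is the ordinary categorical type, whose values cannot be ordered. For example, a blood type feature might take the values A, B, O, AB, but we cannot conclude that A > B > O > AB.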
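The distinction can be made explicit in pandas. The following is a minimal sketch, assuming pandas is available; the variable names and sample values are made up for illustration. An ordered `Categorical` supports comparisons and sorting by rank, while an unordered one does not declare any order.

```python
import pandas as pd

# Ordinal: declare the order explicitly, from worst (D) to best (A).
grades = pd.Series(pd.Categorical(
    ["B", "A", "D", "C"],
    categories=["D", "C", "B", "A"],
    ordered=True,
))

# Nominal: no order is declared between blood types.
blood = pd.Series(pd.Categorical(["A", "O", "AB", "B"]))

best_grade = grades.max()                      # meaningful: categories are ordered
sorted_grades = grades.sort_values().tolist()  # sorted by rank, not alphabetically
above_c = (grades > "C").tolist()              # element-wise rank comparison

print(best_grade, sorted_grades, above_c)
print(grades.cat.ordered, blood.cat.ordered)
```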
Ordinal and nominal data are converted to numbers in different ways.
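Ordinal data can be encoded with LabelEncoder. For example, the four grades A, B, C, D are mapped to 0, 1, 2, 3 (scikit-learn's LabelEncoder numbers the classes from 0 in sorted order), so the natural ordering of the data is preserved. This works here only because the alphabetical order of the labels happens to match the grade order; for labels such as small, medium, big, an explicit mapping is needed.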
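As a rough illustration of the point above, the sketch below encodes a grade column with scikit-learn's `LabelEncoder` and a size column with an explicit dictionary. The DataFrame, its column names, and the `size_order` mapping are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "grade": ["B", "A", "D", "C"],
    "size": ["medium", "small", "big", "small"],
})

# LabelEncoder assigns 0..n-1 following the sorted order of the labels,
# which for A, B, C, D happens to match the grade order.
grade_codes = LabelEncoder().fit_transform(df["grade"])

# For sizes, the sorted order is big < medium < small, which is not the real order.
size_codes_alphabetical = LabelEncoder().fit_transform(df["size"])

# An explicit mapping (hypothetical) keeps the intended order instead.
size_order = {"small": 1, "medium": 2, "big": 3}
df["size_encoded"] = df["size"].map(size_order)

print(list(grade_codes), list(size_codes_alphabetical))
print(df)
```

Whether higher numbers should mean better or worse grades is a modelling choice; what matters is that the codes follow the real order.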
Nominal data can be encoded with OneHotEncoder. In pandas, the equivalent steps are:
- Use pandas’ get_dummies() method to return a new DataFrame containing a new column for each dummy variable
- Use the concat() method to add these dummy columns back to the original DataFrame
- Then drop the original columns entirely using the drop method
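The three steps above can be sketched as follows. The DataFrame and its `blood_type` and `age` columns are made-up sample data, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "AB"],
    "age": [31, 45, 27, 52],
})

# Step 1: one new column per category of the nominal feature.
dummies = pd.get_dummies(df["blood_type"], prefix="blood_type")

# Step 2: attach the dummy columns to the original DataFrame.
df = pd.concat([df, dummies], axis=1)

# Step 3: drop the original text column.
df = df.drop(columns=["blood_type"])

print(df)
```

Calling `pd.get_dummies(df, columns=["blood_type"])` on the original frame performs all three steps in one call.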
- If you are dealing with an ordinal feature, you map its values to 1, 2, 3, 4 or 3, 2, 1 or whatever, if not already mapped. An ordinal feature is one whose values can be arranged in some order that makes logical sense. For example, you have a feature “Size” with alphanumeric values, let’s say “small, medium, big”; indeed “big” is bigger than “small”, so you can compare those values and it will make sense. You map “small, medium, big” to 1, 2, 3 for example. Example in Titanic: Pclass is an ordinal feature: Pclass=1 is better than Pclass=3. Note that in this case the Pclass feature is already mapped to 1, 2, 3, so you don’t have to do anything with it. You would have to map it if Pclass contained alphanumeric values like “high_class, medium_class, low_class”.
- If you are dealing with a categorical feature, you look at how many categories (possible values of that particular feature) you have. If you have only 2 categories, you map them to 0 and 1 or to -1 and 1, and that’s it. If you have more than 2 categories, you create dummy variables. Example in Titanic: Sex is a categorical variable with 2 categories, ‘male’ and ‘female’; you map them for example to 0 and 1, and that’s it. Note that it’s not ordinal because male is neither better nor worse than female; you can’t logically compare them. Now, Embarked is a categorical feature too, but it has 3 categories instead of just 2. You make dummy variables out of this feature. And make just 2, not 3; the 3rd one is redundant. Well, this feature is redundant by itself, but anyway.
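A minimal sketch of the two cases just described, using made-up rows shaped like the Titanic `Sex` and `Embarked` columns. The choice of 0 for male and 1 for female is arbitrary, and `drop_first=True` is one way to keep only 2 of the 3 dummy columns.

```python
import pandas as pd

# Toy rows in the shape of the Titanic columns; the values are invented.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Two categories: a plain 0/1 mapping is enough.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Three categories: dummy variables, dropping the first (C) because it is
# implied when both remaining columns are 0.
embarked = pd.get_dummies(df["Embarked"], prefix="Embarked", drop_first=True)
df = pd.concat([df.drop(columns=["Embarked"]), embarked], axis=1)

print(df)
```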
Edit: following further discussion, there are cases when turning ordinal features into dummies may improve your score a bit. It’s hard to tell beforehand, so it should be useful to make 2 sets of features, one including ordinal data and the other with ordinal-to-one-hot data, compare the results on various models and pick the one that worked out best in your specific case.
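One way to set up such a comparison is sketched below. The data is a tiny invented sample, the `mean_cv_accuracy` helper is hypothetical, and logistic regression with 3-fold cross-validation is only an assumption standing in for whatever models and validation scheme you actually use. Scores on a toy sample like this say nothing about real data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented sample in the shape of the Titanic Pclass/Survived columns.
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 1, 3],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})
y = df["Survived"]

# Feature set 1: Pclass kept as a single ordinal column.
X_ordinal = df[["Pclass"]]

# Feature set 2: the same feature expanded into one-hot columns.
X_onehot = pd.get_dummies(df["Pclass"], prefix="Pclass").astype(int)

def mean_cv_accuracy(X, y):
    """Hypothetical helper: mean 3-fold cross-validated accuracy."""
    model = LogisticRegression()
    return cross_val_score(model, X, y, cv=3).mean()

scores = {
    "ordinal": mean_cv_accuracy(X_ordinal, y),
    "one_hot": mean_cv_accuracy(X_onehot, y),
}
print(scores)
```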