How Tree-based Models Handle Categorical Variables
阿新 • Published: 2018-12-10
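Categorical variables fall into two kinds: ordered variables and non-ordered variables. For ordered variables, using sklearn.preprocessing.LabelEncoder directly is the best approach. For non-ordered variables, the main question is whether to use one-hot encoding. This Quora answer (author: Clem Wang) gives some other approaches: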
One can try a few other approaches:
- look at how the response variable responds to the categorical values and try to group them.
- Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the hot-encoded values.
- Try to combine the categorical feature with some other features.
- Build N xgboost classifiers, one for each category.
This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know that were there.
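The second suggestion in the quoted answer (train a submodel on the one-hot-encoded category and replace the feature with a probability score) could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name `category_to_probability`, the toy `city` column, and the target rates are all made up here, and the use of out-of-fold predictions via `cross_val_predict` is a design choice added to limit target leakage, not something the answer prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def category_to_probability(categories, y, n_splits=5):
    """Hypothetical helper: turn one categorical column into P(y = 1).

    A logistic regression is fit on the one-hot-encoded column, and each row
    gets the probability predicted by a model that did not see that row
    (out-of-fold prediction).
    """
    X_cat = np.asarray(categories, dtype=object).reshape(-1, 1)
    submodel = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(max_iter=1000),
    )
    proba = cross_val_predict(
        submodel, X_cat, y, cv=n_splits, method="predict_proba"
    )
    return proba[:, 1]  # column 1 is the probability of class 1


# Toy data (assumed for illustration): three cities with different positive rates.
rng = np.random.default_rng(0)
city = rng.choice(["A", "B", "C"], size=300)
rate = {"A": 0.9, "B": 0.5, "C": 0.1}
y = (rng.random(300) < np.array([rate[c] for c in city])).astype(int)

# This numeric score can replace the raw `city` column in the tree model's input.
score = category_to_probability(city, y)
```

The resulting `score` is a single numeric column, so the tree-based model can split on it directly instead of on one column per category.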
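This blog post gives an overall conclusion about using one-hot encoding in xgboost:
In summary, there are roughly two conclusions:
- 1. Categorical variables whose categories are ordered, such as age, can be treated as numeric variables. For categorical variables whose categories are not ordered, one-hot encoding is recommended. However, one-hot encoding increases memory usage and training time.
- 2. One-hot encoding is recommended when the categorical variable takes a fairly small number of values (tqchen gives the range [10, 100]).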
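As a minimal sketch of the two treatments described above, the snippet below encodes an ordered column as integers and an unordered column with one-hot encoding. The column names `size` and `color` and the mapping `size_order` are hypothetical, chosen only for illustration. One detail worth keeping in mind: `LabelEncoder` assigns integers by sorted label order, so when the alphabetical order does not match the real order of the categories, an explicit mapping may be needed, as assumed here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordered (hypothetical column)
    "color": ["red", "blue", "green", "red"],       # unordered (hypothetical column)
})

# LabelEncoder sorts the labels, so here large=0, medium=1, small=2,
# which does not follow the small < medium < large ordering.
alphabetical = LabelEncoder().fit_transform(df["size"])

# An explicit mapping keeps the intended order for the ordered variable.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

# One-hot encoding for the unordered variable: one column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Final feature table that a tree-based model could be trained on.
encoded = pd.concat([df[["size_code"]], one_hot], axis=1)
print(encoded)
```

The one-hot part adds one column per category, which is where the extra memory and training time mentioned above come from.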
Other related material
comment: re sklearn -- integer encoding vs 1-hot