How Tree-based Models Handle Categorical Variables

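Categorical variables fall into two kinds: ordered variables and non-ordered variables. For ordered variables, using sklearn.preprocessing.LabelEncoder directly is the best approach. For non-ordered variables, the main question is whether to use one-hot encoding. This quora answer (author: Clem Wang) gives some other approaches: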
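
As a minimal sketch of encoding an ordered variable as integers, the example below uses plain Python and a hypothetical `size` feature with levels `low < medium < high` (the feature and the helper name `encode_ordered` are assumptions for illustration). Note that `LabelEncoder` assigns integers according to the sorted order of the labels, so when the alphabetical order differs from the real order, an explicit mapping keeps the ranking intact.

```python
def encode_ordered(values, ordered_levels):
    """Map each value to its rank in ordered_levels (0 = lowest)."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

# Hypothetical ordered feature: low < medium < high
levels = ["low", "medium", "high"]
sizes = ["medium", "low", "high", "low"]

# Explicit mapping that respects the real order
explicit = encode_ordered(sizes, levels)

# Encoding by sorted label order instead:
# the sorted labels are ["high", "low", "medium"], so "high" becomes 0.
alphabetical = encode_ordered(sizes, sorted(set(sizes)))

print(explicit)      # [1, 0, 2, 0]
print(alphabetical)  # [2, 1, 0, 1]
```

With the alphabetical codes a tree split such as `size < 1` would isolate "high" rather than "low", which is why the explicit order matters for ordered variables.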

One can try a few other approaches:

  • look at how the response variable responds to the categorical values and try to group them.
  • Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the hot-encoded values.
  • Try to combine the categorical feature with some other features.
  • Build N xgboost classifiers, one for each category.

This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know that were there.
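
The first suggestion in the quoted answer, looking at how the response varies across category values and grouping them, can be sketched as follows. This is only an illustration in plain Python: the `city` values, the binary labels, the helper name `response_rate_by_category`, and the 0.5 cut-off are all made up, and in practice the rates should be computed on training data only to avoid leaking the target.

```python
from collections import defaultdict

def response_rate_by_category(categories, responses):
    """Return the mean response for each category value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, responses):
        totals[cat] += y
        counts[cat] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Hypothetical categorical feature and binary response
city = ["A", "B", "A", "C", "B", "C", "A", "D"]
label = [1, 0, 1, 0, 0, 1, 0, 1]

rates = response_rate_by_category(city, label)

# Group categories by whether their response rate reaches an assumed cut-off of 0.5
groups = {cat: ("high" if rate >= 0.5 else "low") for cat, rate in rates.items()}

# Replace the raw category with its group, giving the tree a low-cardinality feature
city_grouped = [groups[c] for c in city]

print(rates)
print(city_grouped)
```

The per-category rate itself can also be used directly as a numeric feature instead of the coarse group label.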

This blog post gives an overall conclusion about using one-hot encoding in xgboost:

In summary, there are roughly two conclusions:

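  • 1. For categorical variables whose categories are ordered, such as age, treating them as numeric variables is fine. For categorical variables whose categories are not ordered, one-hot is recommended. However, one-hot increases both memory usage and training time.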
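  • 2. When a categorical variable takes a small number of distinct values (tqchen gives the range [10, 100]), one-hot is the recommended choice.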
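
To make the memory trade-off in the first point concrete, here is a small plain-Python sketch of one-hot encoding (the `one_hot` helper and the `color` data are hypothetical). Each of the k distinct categories becomes its own 0/1 column, so a single integer-coded column turns into k columns.

```python
def one_hot(values):
    """One-hot encode a list of category values.

    Returns the sorted category list and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical non-ordered feature
color = ["red", "green", "blue", "green"]
categories, encoded = one_hot(color)

print(categories)  # ['blue', 'green', 'red']
for row in encoded:
    print(row)

# One column per category: the width grows with the number of distinct values
print(len(encoded[0]))  # 3
```

A feature with thousands of distinct values would produce thousands of mostly zero columns, which is where the extra memory and training time come from.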

Other related material

comment:re sklearn -- integer encoding vs 1-hot
