XGBoost Feature Selection
阿新 • Published: 2021-10-19
1. A mind map of feature selection
2. The XGBoost feature-selection algorithm
(1) Background of the XGBoost algorithm
In 2016, Tianqi Chen formally introduced the algorithm in the paper "XGBoost: A Scalable Tree Boosting System". Its basic idea is the same as GBDT's, but with several optimizations: a second-order approximation makes the loss function more precise; regularization terms keep the trees from overfitting; and block storage enables parallel computation. XGBoost is efficient, flexible, and lightweight, and is widely applied in fields such as data mining and recommender systems.
(2) Algorithm principle
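The optimizations listed in the background follow from the objective XGBoost minimizes at each boosting round t. A brief sketch, following the notation of the original paper:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

Here $f_t$ is the tree added at round $t$, $T$ is its number of leaves, and $w$ its leaf weights. A second-order Taylor expansion of the loss around $\hat{y}_i^{(t-1)}$ gives

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction. The second-order term is what makes the approximation more precise than GBDT's first-order fit, and $\Omega$ is the regularizer that penalizes overly complex trees. For feature selection, the trained model's per-feature importance scores (how often, and how profitably, each feature is used for splits across all trees) are ranked, and the top-ranked features are kept.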
(3) Algorithm implementation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
import xgboost as xgb
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import matplotlib as mpl
# Uncomment to render Chinese labels in matplotlib:
# mpl.rcParams['font.sans-serif'] = ['FangSong']
# mpl.rcParams['axes.unicode_minus'] = False

# Load the filtered descriptor data.
fpath = r".\processData\filter.csv"
Dataset = pd.read_csv(fpath)
x = Dataset.loc[:, "nAcid":"Zagreb"]   # feature columns
y1 = Dataset.loc[:, "IC50_nM"]
y2 = Dataset.loc[:, "pIC50"]           # regression target used below

# Map column indices to feature names for later lookup.
names = list(x.columns)
key = list(range(len(names)))
names_dict = dict(zip(key, names))
names_dicts = pd.DataFrame([names_dict])

x_train, x_test, y_train, y_test = train_test_split(x, y2, test_size=0.33, random_state=7)

# max_depth: maximum depth of each tree
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.12, n_estimators=90,
                         min_child_weight=6, objective="reg:gamma")
model.fit(x_train, y_train)

# Rank features by importance and save the top-30 names and scores.
feature_important = model.feature_importances_
rank_idx = np.argsort(feature_important)[::-1]
rank_idx30 = rank_idx[:30]
rank_names30 = names_dicts.loc[:, rank_idx30]
label = rank_names30.values[0, :]
path1 = r"Xgboost排名前30的特徵.csv"   # top-30 feature names
pd.DataFrame(label).to_csv(path1, index=False)
x_score = np.sort(feature_important)[::-1]
path = r"Xgboost排名前30的得分.csv"    # top-30 importance scores
pd.DataFrame(x_score[:30]).to_csv(path, index=False)

# Grid search over the XGBoost hyperparameters.
gsCv = GridSearchCV(model,
                    {'max_depth': list(range(3, 10, 1)),
                     'learning_rate': [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1,
                                       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2],
                     'min_child_weight': list(range(2, 8, 2)),
                     'n_estimators': list(range(10, 101, 10))})
gsCv.fit(x_train, y_train)
print(gsCv.best_params_)
cv_results = pd.DataFrame(gsCv.cv_results_)
path = r"paramRank.csv"
cv_results.to_csv(path, index=False)

# Visualization: importance of every feature.
plt.figure()
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.xlabel("Feature")
plt.ylabel("Feature Score")
plt.title("Feature Importance")
plt.savefig("Xgboost")

# Visualization: top-30 features as a horizontal bar chart.
plt.figure()
plt.barh(label[::-1], x_score[:30][::-1], 0.6, align='center')
plt.grid(ls=':', color='gray', alpha=0.4)
plt.title("Xgboost Feature Importance")
# Uncomment to add data labels next to each bar
# (the original loop referenced rf_score, a leftover name; x_score is meant):
# for a, b in enumerate(x_score[:30][::-1]):
#     plt.text(b + 0.1, a - 0.6 / 2, '%s' % b, ha='center', va='bottom')
plt.savefig("前30名特徵")
plt.show()
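Once the top-30 names are saved, the reduced feature matrix can be fed back into the regressor. A minimal sketch, assuming the script above has already run; `x_train30`, `x_test30`, and `model30` are illustrative names, not part of the original script:

# Select the top-30 columns by name and retrain on the reduced matrix.
# `label` holds the top-30 feature names computed above.
x_train30 = x_train.loc[:, label]
x_test30 = x_test.loc[:, label]

model30 = xgb.XGBRegressor(max_depth=6, learning_rate=0.12, n_estimators=90,
                           min_child_weight=6, objective="reg:gamma")
model30.fit(x_train30, y_train)
print("R^2 on the held-out split:", model30.score(x_test30, y_test))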
Note: the script cannot run without the data file, so adapt the paths and column names to your own data. The grid search at the end is then used to find the optimal hyperparameters.
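Because GridSearchCV refits on the full training split by default (refit=True), the tuned model is available directly after the search. A short sketch of how the importance ranking could be rerun with it; `best_model` and `tuned_rank` are illustrative names, not part of the original script:

# The estimator refit with the best parameter combination.
best_model = gsCv.best_estimator_

# Re-rank feature importances under the tuned parameters.
tuned_rank = np.argsort(best_model.feature_importances_)[::-1][:30]
print([names[i] for i in tuned_rank])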