XGBoost Feature Selection

阿新 • Published: 2021-10-19

Feature Selection with XGBoost

1. Mind map of feature selection

2. The XGBoost feature selection algorithm

(1) Background of the XGBoost algorithm

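In 2016, Tianqi Chen formally presented the algorithm in the paper "XGBoost: A Scalable Tree Boosting System". XGBoost shares its basic idea with GBDT but adds several optimizations: it uses second-order derivatives to approximate the loss function more accurately, adds regularization terms to keep the trees from overfitting, and stores data in blocks so that computation can run in parallel. XGBoost is efficient, flexible and portable, and is widely used in data mining, recommender systems and other fields.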
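As a rough sketch of the first two optimizations, the formulas below follow the notation of the XGBoost paper. The symbols are not defined in this post and are introduced here only for illustration: g_i and h_i are the first and second derivatives of the loss l, T is the number of leaves, w is the vector of leaf weights, and γ and λ are regularization constants.

```latex
% Objective at boosting round t after a second-order Taylor expansion of the loss l
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

% First and second derivatives of the loss with respect to the previous prediction
g_i = \partial_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right),
\qquad
h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right)

% Gain of splitting a node into a left (L) and a right (R) child,
% where G and H are the sums of g_i and h_i over the samples in each child
\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

The split gain is what links the algorithm to feature selection: a feature that is chosen for splits with large gain contributes more to reducing the objective, and gain-type importance scores summarize this per feature.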

(2) How the algorithm works

(3) Implementation

from sklearn.model_selection import train_test_split
from sklearn import metrics
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
import pandas as pd, numpy as np
import matplotlib as mpl

# mpl.rcParams['font.sans-serif']=['FangSong']
# mpl.rcParams['axes.unicode_minus']=False

fpath = r".\processData\filter.csv"
Dataset = pd.read_csv(fpath)

x = Dataset.loc[:, "nAcid":"Zagreb"]
y1 = Dataset.loc[:, "IC50_nM"]
y2 = Dataset.loc[:, "pIC50"]

# Descriptor names, in the same order as the columns of x
names = np.array(x.columns)

x_train, x_test, y_train, y_test = train_test_split(x, y2, test_size=0.33, random_state=7)

# max_depth: maximum depth of each tree
# objective="reg:gamma" requires strictly positive target values
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.12, n_estimators=90, min_child_weight=6, objective="reg:gamma")
model.fit(x_train, y_train)

# Rank the features by importance, highest first, and keep the top 30
feature_important = model.feature_importances_
rank_idx = np.argsort(feature_important)[::-1]
rank_idx30 = rank_idx[:30]

# Names of the top-30 features
label = names[rank_idx30]
path1 = r"Xgboost排名前30的特徵.csv"  # file name means "top-30 XGBoost features"
pd.DataFrame(label).to_csv(path1, index=False)

# Importance scores in descending order, matching the order of label
x_score = np.sort(feature_important)[::-1]
path = r"Xgboost排名前30的得分.csv"  # file name means "top-30 XGBoost scores"
pd.DataFrame(x_score[:30]).to_csv(path, index=False)
# Grid search over the XGBoost hyperparameters
gsCv = GridSearchCV(model,
                {'max_depth':list(range(3, 10, 1)),
                 'learning_rate':[0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2],
                 'min_child_weight':list(range(2, 8, 2)),
                 'n_estimators':list(range(10, 101, 10))})

gsCv.fit(x_train, y_train)
print(gsCv.best_params_)
cv_results = pd.DataFrame(gsCv.cv_results_)
path = r"paramRank.csv"
cv_results.to_csv(path, index=False)

# Visualization: importance score of every feature
plt.figure()
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.xlabel("Feature")
plt.ylabel("Feature Score")
plt.title("Feature Importance")
plt.savefig("Xgboost")

# Visualization: the 30 most important features
plt.figure()
plt.barh(label[::-1], x_score[:30][::-1], 0.6, align='center')
plt.grid(ls=':', color='gray', alpha=0.4)
plt.title("Xgboost Feature Importance")
# Optional: add data labels to the bars
# for a, b in enumerate(x_score[:30][::-1]):
#     plt.text(b+0.1, a-0.6/2, '%s' % b, ha='center', va='bottom')

plt.savefig("前30名特徵")  # file name means "top-30 features"
plt.show()
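The ranking step in the script comes down to sorting (name, score) pairs and keeping the first k. Here is a minimal standard-library sketch of that idea. The helper `top_k_features` is hypothetical, and of the names used only `nAcid` and `Zagreb` appear in the original script; the other names and all scores are invented for illustration.

```python
def top_k_features(names, scores, k=30):
    """Return the k (name, score) pairs with the highest scores, best first."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    # Sort by score in descending order, then keep the first k pairs
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical descriptor names and importance scores (not real results)
names = ["nAcid", "ALogP", "nAtom", "Zagreb"]
scores = [0.10, 0.45, 0.05, 0.40]

top2 = top_k_features(names, scores, k=2)
print(top2)
```

With a fitted model, `names` would be the column names of `x` and `scores` would be `model.feature_importances_`.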

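Note: this code cannot run without the dataset and has to be adapted to your own data. The grid search in the second half of the script is used to find the best hyperparameters.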
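Before starting the grid search it can help to estimate how much work it implies. The sketch below copies the parameter grid from the script and counts the combinations. It assumes `cv` is left at its default, which is 5 folds in recent scikit-learn releases; the names `n_folds`, `n_candidates` and `n_fits` are introduced here for illustration.

```python
from math import prod

# Parameter grid copied from the GridSearchCV call in the script
param_grid = {
    "max_depth": list(range(3, 10, 1)),
    "learning_rate": [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12,
                      0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2],
    "min_child_weight": list(range(2, 8, 2)),
    "n_estimators": list(range(10, 101, 10)),
}

# Assumption: GridSearchCV uses its default of 5 cross-validation folds
n_folds = 5

# Number of parameter combinations is the product of the list lengths
n_candidates = prod(len(values) for values in param_grid.values())
# Each combination is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * n_folds

print(f"{n_candidates} parameter combinations, {n_fits} model fits")
# prints "3570 parameter combinations, 17850 model fits"
```

If that many fits is too slow, one option is to narrow the grid; another is to pass `n_jobs=-1` to `GridSearchCV` so the fits run in parallel.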
