XGBoost - Algorithm Theory and Examples

阿新 • Published: 2022-04-02

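XGBoost, short for eXtreme Gradient Boosting, is a machine learning method based on the boosting algorithm.
On top of GBDT, XGBoost introduces:

CART regression trees
A regularization term
A second-order Taylor expansion of the loss (using second derivatives)
The Blocks data structure (to speed up computation)
With these additions it achieves better results than GBDT.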

I. Theory

The theory behind XGBoost is explained clearly on the official site; see: https://xgboost.readthedocs.io/en/stable/tutorials/model.html

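Model function

The XGBoost objective can be written as:

where L(θ) is the loss term and Ω(θ) is the regularization term. The regularization term reflects the complexity of the model.
L(θ) is commonly measured with the squared error:

For logistic regression, the logistic loss is used: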
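For reference, the objective and the two losses named above can be sketched in the notation of the official tutorial linked earlier. Here $\hat{y}_i$ is assumed to denote the model's prediction for sample $i$ (the raw score in the logistic case) and $n$ the number of samples:

```latex
\[
\text{obj}(\theta) = L(\theta) + \Omega(\theta)
\]
% Squared error
\[
L(\theta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]
% Logistic loss
\[
L(\theta) = \sum_{i=1}^{n} \left[ y_i \ln\!\left(1 + e^{-\hat{y}_i}\right)
          + (1 - y_i) \ln\!\left(1 + e^{\hat{y}_i}\right) \right]
\]
```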

The figure below illustrates how these two terms relate to how well the model fits the data:

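Regression trees

The XGBoost tree ensemble model consists of a set of classification and regression trees (CART). Below is a simple example of a CART that classifies whether someone will like a hypothetical computer game X.

We classify the members of a family into different leaves and assign each of them the score of the corresponding leaf. A CART is slightly different from a decision tree, where a leaf contains only a decision value. In a CART, each leaf carries a real-valued score, which gives a richer interpretation that goes beyond classification. It also allows a principled, unified approach to optimization, as shown later in this tutorial.
Usually a single tree is not strong enough to be used in practice. What is actually used is an ensemble model, which sums the predictions of multiple trees.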
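A minimal sketch of how an ensemble sums the leaf scores of several trees. The two hand-written trees, the feature names (`age`, `is_male`, `uses_computer_daily`) and the leaf scores are illustrative assumptions, loosely modelled on the game X example in the official tutorial; a trained model would learn both the splits and the scores.

```python
def tree_1(person):
    """First CART: splits on age, then on gender; returns a leaf score."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


def tree_2(person):
    """Second CART: splits on daily computer use; returns a leaf score."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_score(person, trees):
    """Ensemble prediction: the sum of the leaf scores of all trees."""
    return sum(tree(person) for tree in trees)


# Two hypothetical family members
boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

trees = [tree_1, tree_2]
boy_score = ensemble_score(boy, trees)
grandpa_score = ensemble_score(grandpa, trees)
print(boy_score, grandpa_score)
```

A higher total score means a stronger predicted preference for the game; each tree contributes one additive term.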

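Mathematically, the model can be written as

where K is the number of trees, each f_k is a function in the function space F, and F is the set of all possible CARTs. The objective function to optimize is given by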
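A sketch of the two expressions referred to here, following the official tutorial. The per-sample loss $l$ and the symbol $\Omega(f_k)$ for the complexity of tree $f_k$ are assumed notation, chosen to match the Ω used earlier in this section:

```latex
\[
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}
\]
\[
\text{obj}(\theta) = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)
\]
```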

XGBoost

The steps for solving the problem with XGBoost are given below; this part follows:
https://zhuanlan.zhihu.com/p/162001079

Step 1: Redefine a tree
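In the usual formulation (as in the official tutorial), a tree with $T$ leaves is described by a structure function $q$ that maps a sample to a leaf index and a vector $w$ of leaf scores. The symbols $q$, $w$, $T$ and the input dimension $d$ are assumed notation:

```latex
\[
f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q : \mathbb{R}^{d} \to \{1, 2, \dots, T\}
\]
```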

Step 2: Define the complexity of the tree
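A common choice, and the one used in the official tutorial, penalizes the number of leaves $T$ and the squared leaf scores $w_j$. The hyperparameters $\gamma$ and $\lambda$ are assumed notation for the two penalty strengths:

```latex
\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
\]
```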

Step 3: Group the samples by leaf node

Merge the coefficients of the first-order terms and of the second-order terms.
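The grouping can be sketched as follows, assuming the standard setup: at round $t$ the loss is expanded to second order around the previous prediction $\hat{y}_i^{(t-1)}$, with $g_i$ and $h_i$ the first and second derivatives of $l$ with respect to that prediction, the tree written as $f_t(x) = w_{q(x)}$, the complexity as $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$, and $I_j = \{ i : q(x_i) = j \}$ the set of samples falling into leaf $j$. Constant terms are dropped.

```latex
\[
\begin{aligned}
\text{Obj}^{(t)}
  &\approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^{2} \right] + \Omega(f_t) \\
  &= \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^{2} \right]
     + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} \\
  &= \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^{2} \right] + \gamma T,
  \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i
\end{aligned}
\]
```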

Step 4: Solve for the optimal objective value

The optimization goal is to make Obj as small as possible.
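For a fixed tree structure, the grouped objective is a separate quadratic in each leaf score, so setting its derivative to zero gives w_j* = -G_j / (H_j + λ) and Obj* = -1/2 Σ_j G_j² / (H_j + λ) + γT, where G_j and H_j are the sums of first and second derivatives in leaf j. The sketch below evaluates these two formulas with NumPy. The function name, the toy data and the choice of squared error (for which g = 2(ŷ - y) and h = 2) are assumptions made for illustration.

```python
import numpy as np


def leaf_weights_and_objective(g, h, leaf_index, n_leaves, lam=1.0, gamma=0.0):
    """Optimal leaf scores and objective value for a fixed tree structure."""
    # G_j and H_j: sums of first and second derivatives per leaf
    G = np.bincount(leaf_index, weights=g, minlength=n_leaves)
    H = np.bincount(leaf_index, weights=h, minlength=n_leaves)
    w = -G / (H + lam)                                        # w_j* = -G_j / (H_j + lambda)
    obj = -0.5 * np.sum(G**2 / (H + lam)) + gamma * n_leaves  # Obj*
    return w, obj


# Toy data: four samples, current prediction 0, squared-error loss
y = np.array([1.0, 2.0, 3.0, 10.0])
y_hat = np.zeros_like(y)
g = 2.0 * (y_hat - y)             # first derivative of (y - y_hat)^2
h = np.full_like(y, 2.0)          # second derivative of (y - y_hat)^2
leaf_index = np.array([0, 0, 0, 1])  # first three samples in leaf 0, last one in leaf 1

w, obj = leaf_weights_and_objective(g, h, leaf_index, n_leaves=2, lam=1.0, gamma=0.0)
print(w, obj)
```

With λ > 0 the leaf scores are shrunk towards zero compared with the plain leaf means, which is the effect of the regularization term.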

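Split point selection

When training a tree, the split points can be chosen with one of the following algorithms:

Greedy algorithm (exact)
Approximate algorithm
Weighted quantile sketch
Sparsity-aware split finding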
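A minimal sketch of the first of these, the exact greedy search, on a single feature. It scores every candidate threshold with the usual gain formula, Gain = 1/2 [G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ, where the L and R subscripts denote the gradient sums of the left and right child. The function name and the toy data are assumptions; the real library additionally handles missing values, multiple features, minimum child weight and more.

```python
import numpy as np


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature: try every gap between sorted values."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    parent_score = G**2 / (H + lam)

    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        # Move sample i from the right child to the left child
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:
            continue  # no threshold can separate identical values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - parent_score) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[i] + x[i + 1])
    return best_threshold, best_gain


# Toy data with an obvious jump in the target between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
g = 2.0 * (0.0 - y)            # squared-error gradient at prediction 0
h = np.full_like(y, 2.0)       # squared-error second derivative

threshold, gain = best_split(x, g, h, lam=1.0, gamma=0.0)
print(threshold, gain)
```

A split is only kept when its gain is positive, so a larger γ prunes more aggressively.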

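II. Examples

Regression with XGBoost:

As an example we use the TPS 2022 Mar task (Tabular Playground Series, March 2022), where road congestion is predicted from time:

Raw data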

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error

train_orig = pd.read_csv('train.csv', index_col='row_id', parse_dates=['time'])
train_orig.head()

Feature engineering

%%time
# One-hot encode the combination of road location and direction

# Feature engineering
# Combine x, y and direction into a single categorical feature with 65 unique values
# which can be one-hot encoded
def place_dir(df):
    return df.apply(lambda row: f"{row.x}-{row.y}-{row.direction}", axis=1).values.reshape([-1, 1])

for df in [train_orig]:
    df['place_dir'] = place_dir(df)
    
# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False instead
ohe = OneHotEncoder(drop='first', sparse_output=False)
ohe.fit(train_orig[['place_dir']])

def engineer(df):
    """Return a new dataframe with the engineered features"""
    
    new_df = pd.DataFrame(ohe.transform(df[['place_dir']]),
                          columns=ohe.categories_[0][1:],
                          index=df.index)
    new_df['saturday'] = df.time.dt.weekday == 5
    new_df['sunday'] = df.time.dt.weekday == 6
    new_df['daytime'] = df.time.dt.hour * 60 + df.time.dt.minute
    new_df['dayofyear'] = df.time.dt.dayofyear # to model the trend
    return new_df


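train = engineer(train_orig)

# Record the feature names before adding the target column,
# so that 'congestion' is not used as an input feature
features = list(train.columns)
train['congestion'] = train_orig.congestion

print(features)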

The processed data:

Data split

# Split into training and validation sets
# Use all Monday-Thursday afternoons (weekday <= 3) from August onwards for validation
val_idx = ((train_orig.time.dt.month >= 8) & 
           (train_orig.time.dt.weekday <= 3) &
           (train_orig.time.dt.hour >= 12)).values
train_idx = ~val_idx

X_tr, X_va = train.loc[train_idx, features], train.loc[val_idx, features]
y_tr, y_va = train.loc[train_idx, 'congestion'], train.loc[val_idx, 'congestion']
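pandas numbers weekdays from Monday = 0, so the condition `weekday <= 3` keeps Monday through Thursday. The sketch below checks this on hypothetical timestamps (one reading at 13:00 on each day of August 1991, chosen only for illustration), using the same three conditions as the mask above.

```python
import pandas as pd

# Hypothetical timestamps: one reading at 13:00 on each day of August 1991
times = pd.Series(pd.date_range("1991-08-01 13:00", "1991-08-31 13:00", freq="D"))

# Same conditions as the validation mask: month, weekday and hour
val_mask = (times.dt.month >= 8) & (times.dt.weekday <= 3) & (times.dt.hour >= 12)

# Which weekday names end up in the validation set?
selected_days = sorted(times[val_mask].dt.day_name().unique())
print(selected_days)
```

Holding out a contiguous block of late dates, rather than a random sample, keeps the validation set closer to the situation of predicting future congestion.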

AdaBoost

%%time

from sklearn.ensemble import AdaBoostRegressor

regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X_tr, y_tr)
y_pred = regr.predict(X_va)
mean_absolute_error(y_va, y_pred)

GBDT

%%time

from sklearn.ensemble import GradientBoostingRegressor

regr = GradientBoostingRegressor(random_state=0, n_estimators=100)
regr.fit(X_tr, y_tr)
y_pred = regr.predict(X_va)
mean_absolute_error(y_va, y_pred)

XGBoost

%%time

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=200)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_va)
score_xgb = mean_absolute_error(y_va, y_pred)

Results (mean absolute error on the validation set, lower is better):

Algorithm	MAE
AdaBoost	13.654161859220649
GBDT	7.55620711428925
XGBoost	5.880979271309312

XGBoost parameters:

See the official site for details; this blog post is also a useful reference: https://www.cnblogs.com/TimVerion/p/11436001.html
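As a starting point, a few of the most commonly adjusted parameters of the scikit-learn style `XGBRegressor` are sketched below as a plain dictionary. The values are illustrative assumptions, not tuned recommendations for the congestion data; the comments relate `gamma` and `reg_lambda` to the γ and λ of the regularization term in the theory section.

```python
# Illustrative values only, not tuned recommendations
params = {
    "n_estimators": 200,      # number of boosted trees (K in the theory section)
    "learning_rate": 0.1,     # shrinkage applied to each new tree's contribution
    "max_depth": 6,           # maximum depth of each tree
    "min_child_weight": 1,    # minimum sum of second derivatives required in a leaf
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of columns sampled for each tree
    "gamma": 0.0,             # minimum gain required to make a further split
    "reg_lambda": 1.0,        # L2 penalty on leaf scores
    "reg_alpha": 0.0,         # L1 penalty on leaf scores
}
```

The dictionary can be unpacked into the constructor, for example `xgb.XGBRegressor(**params)`.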

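Feature mining with XGBoost

Another common use of XGBoost is feature mining. The example below explores which features matter for house-price prediction:
The cleaned data:

Before feature importance can be inspected, a model has to be trained first; here it is also used to make a prediction: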

%%time

xgbr = xgb.XGBRegressor(verbosity=0, n_estimators=100)
xgbr.fit(X_train, y_train)
y_pred = xgbr.predict(X_test)

After that, XGBoost's built-in method can plot the importance of each feature in the model:

xgb.plot_importance(xgbr)
