Practical Lessons from Predicting Clicks on Ads at Facebook (GBDT + LR): Model in Practice
All the code for this post has been uploaded to GitHub; follows and stars are welcome!
1. Ways GBDT Constructs Combined Features
Depending on the model GBDT is combined with, there are two ways to use it for feature construction:
- GBDT + LR
As in the original paper: GBDT constructs the combined features, which are then one-hot encoded (the practice code in this post takes this route);
- GBDT + FFM, or GBDT + another tree model
Here the GBDT-derived combined features are not one-hot encoded; the leaf-node indices are used directly. If the GBDT features feed another tree model, the index information can be used as-is; if they feed an FFM, the indices still carry the information, but they must be reorganized into FFM's input format (a minimal conversion sketch follows this list).
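As a minimal sketch of the FFM route (the helper name, field layout, and data below are my own assumptions, not from the original post): each tree is treated as one field, each leaf as one binary feature, and every (label, leaf-index row) pair becomes one libffm-style line.

import numpy as np

def leaves_to_ffm(leaf_indices, labels, num_leaves):
    # leaf_indices: (n_samples, n_trees) matrix of leaf indices produced by the GBDT
    # num_leaves: (maximum) leaf count per tree, so each tree gets its own feature-index range
    lines = []
    for label, row in zip(labels, leaf_indices):
        tokens = [str(int(label))]
        for field, leaf in enumerate(row):
            feature_index = field * num_leaves + int(leaf)
            tokens.append("{0}:{1}:1".format(field, feature_index))
        lines.append(" ".join(tokens))
    return lines

# Example: 2 samples, 3 trees, at most 4 leaves per tree
print("\n".join(leaves_to_ffm(np.array([[0, 3, 1], [2, 0, 3]]), [1, 0], num_leaves=4)))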
2. Implementing the GBDT Feature Combination
There are two main ways to extract the combined features from a GBDT:
- Set pred_leaf=True to obtain each sample's leaf_index in every tree; see the XGBoost documentation for the API. The parameter belongs to predict: after a lightly tuned training run on the raw features, calling new_feature = bst.predict(d_test, pred_leaf=True) on the training and test data returns an (nsample, ntrees) matrix, i.e. each sample's leaf index in every tree (a minimal sketch using XGBoost's native interface follows the example code below).
- Use apply() (note that when chaining into LR, append [:, :, 0] to drop a dimension). The reference code used apply(), which puzzled me at first because I could not find this method in the XGBoost native API; checking the scikit-learn GBDT API, it does provide apply() for obtaining leaf indices. XGBoost has both its native interface and a scikit-learn interface, hence the difference in code. Note that apply() returns one more dimension, n_classes, than XGBoost's pred_leaf output; that is why GBDT + LR code using apply() typically appends [:, :, 0] to strip the n_classes axis, as follows:
Example code:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Assumed setup (not in the original snippet): grd is the GBDT used to construct
# features, grd_enc one-hot encodes its leaf indices, grd_lm is the downstream LR;
# X_train / y_train fit the GBDT, X_train_lr / y_train_lr fit the LR,
# and X_test / y_test are held out for evaluation.
grd = GradientBoostingClassifier(n_estimators=10)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()

# Train the GBDT on X_train; this model is later used to construct features
grd.fit(X_train, y_train)
# Fit the one-hot encoder on the GBDT leaf indices
grd_enc.fit(grd.apply(X_train)[:, :, 0])
# Build features with the trained GBDT, one-hot encode them,
# and feed them as new features into the LR model
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
# Predict on X_test with the trained LR model
y_pred_grd_lm = grd_lm.predict_proba(grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
# Compute the ROC curve from the predictions
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
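For the first approach (pred_leaf=True with XGBoost's native interface), a minimal sketch could look like the following; the synthetic data and parameter values are placeholders of my own, not from the original post:

import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, only to demonstrate the API
X = np.random.rand(500, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}
bst = xgb.train(params, dtrain, num_boost_round=30)

# pred_leaf=True returns an (nsample, ntrees) matrix of leaf indices,
# which can then be one-hot encoded and fed into LR
leaf_indices = bst.predict(dtrain, pred_leaf=True)
print(leaf_indices.shape)  # (500, 30)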
3. Code in Practice
3.1 Introduction
For CTR prediction, we test how well the GBDT + LR approach works.
3.2 Datasets
Two datasets are listed here. The first is the better fit, since it is a genuine CTR dataset; the second is passable and was used earlier in my DeepFM post. In principle the first dataset would be preferable, but its archive is over 4 GB, which is rather large,
so I use the second one. If you are interested, try the experiment with the first dataset; sharing your results would be very welcome.
3.3 Kaggle CTR competition
This uses the dataset from the 2014 Kaggle Criteo Display Advertising Challenge. The first-place solution followed the Facebook paper, using GBDT for feature transformation followed by FFM.
Competition: https://www.kaggle.com/c/criteo-display-ad-challenge/data
Dataset download: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/
First-place solution:
https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10555
Slides: https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
3.4 Kaggle competition: Porto Seguro Safe Driver Prediction
A prediction task on Kaggle:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
The dataset and a Jupyter notebook walkthrough have been uploaded to my GitHub.
The GBDT here is implemented with the LightGBM tree-ensemble framework; XGBoost or sklearn's built-in GBDT would work as well.
Code:
import gc
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# 1. Load the data
path = "./data/"
train_file = "train.csv"
test_file = "test.csv"
trainDf = pd.read_csv(path + train_file)
# testDf = pd.read_csv(path + train_file, nrows=1000, skiprows=range(1, 10000))
pos_trainDf = trainDf[trainDf['target'] == 1]
neg_trainDf = trainDf[trainDf['target'] == 0].sample(n=20000, random_state=2018)
trainDf = pd.concat([pos_trainDf, neg_trainDf], axis=0).sample(frac=1.0, random_state=2018)
del pos_trainDf
del neg_trainDf
gc.collect()
print(trainDf.shape, trainDf['target'].mean())
trainDf, testDf, _, _ = train_test_split(trainDf, trainDf['target'], test_size=0.25, random_state=2018)
print(trainDf['target'].mean(), trainDf.shape)
print(testDf['target'].mean(), testDf.shape)
"""
A total of 59 columns, including id and target:
17 bin features, 14 cat features, 26 continuous features.
The three feature lists below were generated with the following code:
columns = trainDf.columns.tolist()
bin_feats = []
cat_feats = []
con_feats = []
for col in columns:
    if 'bin' in col:
        bin_feats.append(col)
        continue
    if 'cat' in col:
        cat_feats.append(col)
        continue
    if 'id' != col and 'target' != col:
        con_feats.append(col)
print(len(bin_feats), bin_feats)
print(len(cat_feats), cat_feats)
print(len(con_feats), con_feats)
"""
bin_feats = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin',
'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_calc_15_bin',
'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']
cat_feats = ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat',
'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
'ps_car_10_cat', 'ps_car_11_cat']
con_feats = ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11',
'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04',
'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
'ps_calc_12', 'ps_calc_13', 'ps_calc_14']
# 2. Feature processing
trainDf = trainDf.fillna(0)
testDf = testDf.fillna(0)
train_sz = trainDf.shape[0]
combineDf = pd.concat([trainDf, testDf], axis=0)
del trainDf
del testDf
gc.collect()
# 2.1 Min-max scale all continuous features
from sklearn.preprocessing import MinMaxScaler
for col in con_feats:
    scaler = MinMaxScaler()
    combineDf[col] = scaler.fit_transform(np.array(combineDf[col].values.tolist()).reshape(-1, 1))
# 2.2 One-hot encode the binary and categorical features
for col in bin_feats + cat_feats:
    onehotret = pd.get_dummies(combineDf[col], prefix=col)
    combineDf = pd.concat([combineDf, onehotret], axis=1)
# 3. Train the models
label = 'target'
onehot_feats = [col for col in combineDf.columns if col not in ['id', 'target'] + con_feats + cat_feats + bin_feats]
train = combineDf[:train_sz]
test = combineDf[train_sz:]
print("Train.shape: {0}, Test.shape: {0}".format(train.shape, test.shape))
del combineDf
# 3.1 LR model (baseline)
lr_feats = con_feats + onehot_feats
lr = LogisticRegression(penalty='l2', C=1)
lr.fit(train[lr_feats], train[label].values)
def do_model_metric(y_true, y_pred, y_pred_prob):
    print("Predict 1 percent: {0}".format(np.mean(y_pred)))
    print("Label 1 percent: {0}".format(np.mean(y_true)))
    from sklearn.metrics import roc_auc_score, accuracy_score
    print("AUC: {0:.3}".format(roc_auc_score(y_true=y_true, y_score=y_pred_prob[:, 1])))
    print("Accuracy: {0}".format(accuracy_score(y_true=y_true, y_pred=y_pred)))
print("Train............")
do_model_metric(y_true=train[label], y_pred=lr.predict(train[lr_feats]), y_pred_prob=lr.predict_proba(train[lr_feats]))
print("\n\n")
print("Test.............")
do_model_metric(y_true=test[label], y_pred=lr.predict(test[lr_feats]), y_pred_prob=lr.predict_proba(test[lr_feats]))
# 3.2 GBDT
lgb_feats = con_feats + cat_feats + bin_feats
categorical_feature_list = cat_feats + bin_feats
import lightgbm as lgb
lgb_params = {
'objective': 'binary',
'boosting_type': 'gbdt',
'metric': 'auc',
'learning_rate': 0.01,
'num_leaves': 5,
'max_depth': 4,
'min_data_in_leaf': 100,
'bagging_fraction': 0.8,
'feature_fraction': 0.8,
'bagging_freq': 10,
'lambda_l1': 0.2,
'lambda_l2': 0.2,
'scale_pos_weight': 1,
}
lgbtrain = lgb.Dataset(train[lgb_feats].values, label=train[label].values,
feature_name=lgb_feats,
categorical_feature=categorical_feature_list
)
lgbvalid = lgb.Dataset(test[lgb_feats].values, label=test[label].values,
feature_name=lgb_feats,
categorical_feature=categorical_feature_list
)
evals_results = {}
print('train')
lgb_model = lgb.train(lgb_params,
lgbtrain,
valid_sets=lgbvalid,
evals_result=evals_results,
num_boost_round=1000,
early_stopping_rounds=60,
verbose_eval=50,
categorical_feature=categorical_feature_list,
)
# 3.3 LR + GBDT
train_sz = train.shape[0]
combineDf = pd.concat([train, test], axis=0, ignore_index=True)
# Feature transformation: get the leaf-node index of every sample in every tree
gbdt_feats_vals = lgb_model.predict(combineDf[lgb_feats], pred_leaf=True)
gbdt_columns = ["gbdt_leaf_indices_" + str(i) for i in range(0, gbdt_feats_vals.shape[1])]
combineDf = pd.concat(
[combineDf, pd.DataFrame(data=gbdt_feats_vals, index=range(0, gbdt_feats_vals.shape[0]), columns=gbdt_columns)],
axis=1)
# One-hot encode the GBDT leaf-index features
origin_columns = combineDf.columns
for col in gbdt_columns:
    combineDf = pd.concat([combineDf, pd.get_dummies(combineDf[col], prefix=col)], axis=1)
gbdt_onehot_feats = [col for col in combineDf.columns if col not in origin_columns]
# Restore the train / test splits
train = combineDf[:train_sz]
test = combineDf[train_sz:]
del combineDf
gc.collect()
lr_gbdt_feats = lr_feats + gbdt_onehot_feats
lr_gbdt_model = LogisticRegression(penalty='l2', C=1)
lr_gbdt_model.fit(train[lr_gbdt_feats], train[label])
print("Train................")
do_model_metric(y_true=train[label], y_pred=lr_gbdt_model.predict(train[lr_gbdt_feats]),
y_pred_prob=lr_gbdt_model.predict_proba(train[lr_gbdt_feats]))
print("Test..................")
do_model_metric(y_true=test[label], y_pred=lr_gbdt_model.predict(test[lr_gbdt_feats]),
y_pred_prob=lr_gbdt_model.predict_proba(test[lr_gbdt_feats]))
3.5 Generating GBDT features with apply
Code:
# coding: utf-8
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost.sklearn import XGBClassifier
import numpy as np


class XgboostFeature():
    ## XGBoost parameters can be passed in.
    ## The most commonly set one is the number of new features, i.e. the number of trees (default 30).
    def __init__(self, n_estimators=30, learning_rate=0.3, max_depth=3, min_child_weight=1, gamma=0.3,
                 subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                 scale_pos_weight=1, reg_alpha=1e-05, reg_lambda=1, seed=27):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.objective = objective
        self.nthread = nthread
        self.scale_pos_weight = scale_pos_weight
        self.reg_alpha = reg_alpha
        self.reg_lambda = reg_lambda
        self.seed = seed
        print('Xgboost Feature start, new_feature number:', n_estimators)

    def mergeToOne(self, X, X2):
        ## Column-wise concatenation of the original features and the GBDT leaf-index features
        return np.hstack([np.asarray(X), np.asarray(X2)])

    ## Split training: one part of the training set builds the GBDT,
    ## the other part is merged with the new features to form the new training set
    def fit_model_split(self, X_train, y_train, X_test, y_test):
        ## X_train_1 builds the GBDT; X_train_2 is combined with the new features
        X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size=0.6, random_state=0)
        clf = XGBClassifier(
            learning_rate=self.learning_rate,
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_child_weight=self.min_child_weight,
            gamma=self.gamma,
            subsample=self.subsample,
            colsample_bytree=self.colsample_bytree,
            objective=self.objective,
            nthread=self.nthread,
            scale_pos_weight=self.scale_pos_weight,
            reg_alpha=self.reg_alpha,
            reg_lambda=self.reg_lambda,
            seed=self.seed)
        clf.fit(X_train_1, y_train_1)
        y_pre = clf.predict(X_train_2)
        y_pro = clf.predict_proba(X_train_2)[:, 1]
        print("AUC Score : %f" % metrics.roc_auc_score(y_train_2, y_pro))
        print("Accuracy : %.4g" % metrics.accuracy_score(y_train_2, y_pre))
        new_feature = clf.apply(X_train_2)
        X_train_new2 = self.mergeToOne(X_train_2, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("The new training set is 40% smaller than the original one")
        return X_train_new2, y_train_2, X_test_new, y_test

    ## Train on the full training set
    def fit_model(self, X_train, y_train, X_test, y_test):
        clf = XGBClassifier(
            learning_rate=self.learning_rate,
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_child_weight=self.min_child_weight,
            gamma=self.gamma,
            subsample=self.subsample,
            colsample_bytree=self.colsample_bytree,
            objective=self.objective,
            nthread=self.nthread,
            scale_pos_weight=self.scale_pos_weight,
            reg_alpha=self.reg_alpha,
            reg_lambda=self.reg_lambda,
            seed=self.seed)
        clf.fit(X_train, y_train)
        y_pre = clf.predict(X_test)
        y_pro = clf.predict_proba(X_test)[:, 1]
        print("AUC Score : %f" % metrics.roc_auc_score(y_test, y_pro))
        print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pre))
        new_feature = clf.apply(X_train)
        X_train_new = self.mergeToOne(X_train, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("Training set sample count remains the same")
        return X_train_new, y_train, X_test_new, y_test
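A hypothetical usage sketch for the class above (synthetic data, purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data only to show how the class is called
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

xgb_feature = XgboostFeature(n_estimators=30)
# Train on the full training set...
X_train_new, y_train_new, X_test_new, y_test_new = xgb_feature.fit_model(X_train, y_train, X_test, y_test)
# ...or split the training set so that the GBDT and the downstream model see disjoint data:
# X_train_new, y_train_new, X_test_new, y_test_new = xgb_feature.fit_model_split(X_train, y_train, X_test, y_test)
print(X_train_new.shape, X_test_new.shape)  # original 20 columns plus 30 leaf-index columns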
4. Templates
4.1 GBDT + LR template
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression