Practical Lessons from Predicting Clicks on Ads at Facebook (GBDT + LR): Model in Practice
All the code for this post has been uploaded to GitHub; follows and stars are welcome!
1. Ways GBDT Constructs Combined Features
Depending on the model GBDT is combined with, there are two ways to use it for feature construction:
- GBDT + LR
As in the original paper: GBDT constructs the combined features, which are then one-hot encoded (the practice code in this post takes this route);
- GBDT + FFM, or GBDT + another tree model
Here the GBDT-derived combined features are not one-hot encoded; the leaf-node indices are used directly. If the GBDT features feed another tree model, the index information can be used as-is; if they feed an FFM, the indices still carry the information, but they must be reorganized into FFM's input format (a minimal conversion sketch follows this list).
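As a minimal sketch of the FFM route (the helper name, field layout, and data below are my own assumptions, not from the original post): each tree is treated as one field, each leaf as one binary feature, and every (label, leaf-index row) pair becomes one libffm-style line.

import numpy as np

def leaves_to_ffm(leaf_indices, labels, num_leaves):
    # leaf_indices: (n_samples, n_trees) matrix of leaf indices produced by the GBDT
    # num_leaves: (maximum) leaf count per tree, so each tree gets its own feature-index range
    lines = []
    for label, row in zip(labels, leaf_indices):
        tokens = [str(int(label))]
        for field, leaf in enumerate(row):
            feature_index = field * num_leaves + int(leaf)
            tokens.append("{0}:{1}:1".format(field, feature_index))
        lines.append(" ".join(tokens))
    return lines

# Example: 2 samples, 3 trees, at most 4 leaves per tree
print("\n".join(leaves_to_ffm(np.array([[0, 3, 1], [2, 0, 3]]), [1, 0], num_leaves=4)))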
2. Implementing the GBDT Feature Combination
There are two main ways to extract the combined features from a GBDT:
- Set pred_leaf=True to obtain each sample's leaf_index in every tree; see the XGBoost documentation for the API. The parameter belongs to predict: after a lightly tuned training run on the raw features, calling new_feature = bst.predict(d_test, pred_leaf=True) on the training and test data returns an (nsample, ntrees) matrix, i.e. each sample's leaf index in every tree (a minimal sketch using XGBoost's native interface follows the example code below).
- Use apply() (note that when chaining into LR, append [:, :, 0] to drop a dimension). The reference code used apply(), which puzzled me at first because I could not find this method in the XGBoost native API; checking the scikit-learn GBDT API, it does provide apply() for obtaining leaf indices. XGBoost has both its native interface and a scikit-learn interface, hence the difference in code. Note that apply() returns one more dimension, n_classes, than XGBoost's pred_leaf output; that is why GBDT + LR code using apply() typically appends [:, :, 0] to strip the n_classes axis, as follows:
Example code:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Assumed setup (not in the original snippet): grd is the GBDT used to construct
# features, grd_enc one-hot encodes its leaf indices, grd_lm is the downstream LR;
# X_train / y_train fit the GBDT, X_train_lr / y_train_lr fit the LR,
# and X_test / y_test are held out for evaluation.
grd = GradientBoostingClassifier(n_estimators=10)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()

# Train the GBDT on X_train; this model is later used to construct features
grd.fit(X_train, y_train)
# Fit the one-hot encoder on the GBDT leaf indices
grd_enc.fit(grd.apply(X_train)[:, :, 0])
# Build features with the trained GBDT, one-hot encode them,
# and feed them as new features into the LR model
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
# Predict on X_test with the trained LR model
y_pred_grd_lm = grd_lm.predict_proba(grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
# Compute the ROC curve from the predictions
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
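For the first approach (pred_leaf=True with XGBoost's native interface), a minimal sketch could look like the following; the synthetic data and parameter values are placeholders of my own, not from the original post:

import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, only to demonstrate the API
X = np.random.rand(500, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}
bst = xgb.train(params, dtrain, num_boost_round=30)

# pred_leaf=True returns an (nsample, ntrees) matrix of leaf indices,
# which can then be one-hot encoded and fed into LR
leaf_indices = bst.predict(dtrain, pred_leaf=True)
print(leaf_indices.shape)  # (500, 30)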
3. Code in Practice
3.1 Introduction
For CTR prediction, we test how well the GBDT + LR approach works.
3.2 Datasets
Two datasets are listed here. The first is the better fit, since it is a genuine CTR dataset; the second is passable and was used earlier in my DeepFM post. In principle the first dataset would be preferable, but its archive is over 4 GB, which is rather large,
so I use the second one. If you are interested, try the experiment with the first dataset; sharing your results would be very welcome.
3.3 Kaggle CTR competition
This uses the dataset from the 2014 Kaggle Criteo Display Advertising Challenge. The first-place solution followed the Facebook paper, using GBDT for feature transformation followed by FFM.
Competition: https://www.kaggle.com/c/criteo-display-ad-challenge/data
Dataset download: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/
First-place solution:
https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10555
Slides: https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
3.4 Kaggle competition: Porto Seguro Safe Driver Prediction
A prediction task on Kaggle:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
The dataset and a Jupyter notebook walkthrough have been uploaded to my GitHub.
The GBDT here is implemented with the LightGBM tree-ensemble framework; XGBoost or sklearn's built-in GBDT would work as well.
Code:
import gc
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# 1. Load the data
path = "./data/"
train_file = "train.csv"
test_file = "test.csv"
trainDf = pd.read_csv(path + train_file)
# testDf = pd.read_csv(path + train_file, nrows=1000, skiprows=range(1, 10000))
pos_trainDf = trainDf[trainDf['target'] == 1]
neg_trainDf = trainDf[trainDf['target'] == 0].sample(n=20000, random_state=2018)
trainDf = pd.concat([pos_trainDf, neg_trainDf], axis=0).sample(frac=1.0, random_state=2018)
del pos_trainDf
del neg_trainDf
gc.collect()
print(trainDf.shape, trainDf['target'].mean())
trainDf, testDf, _, _ = train_test_split(trainDf, trainDf['target'], test_size=0.25, random_state=2018)
print(trainDf['target'].mean(), trainDf.shape)
print(testDf['target'].mean(), testDf.shape)
"""
A total of 59 columns, including id and target:
17 bin features, 14 cat features, 26 continuous features.
The three feature lists below were generated with the following code:
columns = trainDf.columns.tolist()
bin_feats = []
cat_feats = []
con_feats = []
for col in columns:
    if 'bin' in col:
        bin_feats.append(col)
        continue
    if 'cat' in col:
        cat_feats.append(col)
        continue
    if 'id' != col and 'target' != col:
        con_feats.append(col)
print(len(bin_feats), bin_feats)
print(len(cat_feats), cat_feats)
print(len(con_feats), con_feats)
"""
bin_feats = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin',
'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_calc_15_bin',
'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']
cat_feats = ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat',
'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
'ps_car_10_cat', 'ps_car_11_cat']
con_feats = ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11',
'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04',
'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
'ps_calc_12', 'ps_calc_13', 'ps_calc_14']
# 2. Feature processing
trainDf = trainDf.fillna(0)
testDf = testDf.fillna(0)
train_sz = trainDf.shape[0]
combineDf = pd.concat([trainDf, testDf], axis=0)
del trainDf
del testDf
gc.collect()
# 2.1 Min-max scale all continuous features
from sklearn.preprocessing import MinMaxScaler
for col in con_feats:
    scaler = MinMaxScaler()
    combineDf[col] = scaler.fit_transform(np.array(combineDf[col].values.tolist()).reshape(-1, 1))
# 2.2 One-hot encode the binary and categorical features
for col in bin_feats + cat_feats:
    onehotret = pd.get_dummies(combineDf[col], prefix=col)
    combineDf = pd.concat([combineDf, onehotret], axis=1)
# 3. Train the models
label = 'target'
onehot_feats = [col for col in combineDf.columns if col not in ['id', 'target'] + con_feats + cat_feats + bin_feats]
train = combineDf[:train_sz]
test = combineDf[train_sz:]
print("Train.shape: {0}, Test.shape: {0}".format(train.shape, test.shape))
del combineDf
# 3.1 LR model (baseline)
lr_feats = con_feats + onehot_feats
lr = LogisticRegression(penalty='l2', C=1)
lr.fit(train[lr_feats], train[label].values)
def do_model_metric(y_true, y_pred, y_pred_prob):
    print("Predict 1 percent: {0}".format(np.mean(y_pred)))
    print("Label 1 percent: {0}".format(np.mean(y_true)))
    from sklearn.metrics import roc_auc_score, accuracy_score
    print("AUC: {0:.3}".format(roc_auc_score(y_true=y_true, y_score=y_pred_prob[:, 1])))
    print("Accuracy: {0}".format(accuracy_score(y_true=y_true, y_pred=y_pred)))
print("Train............")
do_model_metric(y_true=train[label], y_pred=lr.predict(train[lr_feats]), y_pred_prob=lr.predict_proba(train[lr_feats]))
print("\n\n")
print("Test.............")
do_model_metric(y_true=test[label], y_pred=lr.predict(test[lr_feats]), y_pred_prob=lr.predict_proba(test[lr_feats]))
# 3.2 GBDT
lgb_feats = con_feats + cat_feats + bin_feats
categorical_feature_list = cat_feats + bin_feats
import lightgbm as lgb
lgb_params = {
'objective': 'binary',
'boosting_type': 'gbdt',
'metric': 'auc',
'learning_rate': 0.01,
'num_leaves': 5,
'max_depth': 4,
'min_data_in_leaf': 100,
'bagging_fraction': 0.8,
'feature_fraction': 0.8,
'bagging_freq': 10,
'lambda_l1': 0.2,
'lambda_l2': 0.2,
'scale_pos_weight': 1,
}
lgbtrain = lgb.Dataset(train[lgb_feats].values, label=train[label].values,
feature_name=lgb_feats,
categorical_feature=categorical_feature_list
)
lgbvalid = lgb.Dataset(test[lgb_feats].values, label=test[label].values,
feature_name=lgb_feats,
categorical_feature=categorical_feature_list
)
evals_results = {}
print('train')
lgb_model = lgb.train(lgb_params,
lgbtrain,
valid_sets=lgbvalid,
evals_result=evals_results,
num_boost_round=1000,
early_stopping_rounds=60,
verbose_eval=50,
categorical_feature=categorical_feature_list,
)
# 3.3 LR + GBDT
train_sz = train.shape[0]
combineDf = pd.concat([train, test], axis=0, ignore_index=True)
# Feature transformation: get the leaf-node index of every sample in every tree
gbdt_feats_vals = lgb_model.predict(combineDf[lgb_feats], pred_leaf=True)
gbdt_columns = ["gbdt_leaf_indices_" + str(i) for i in range(0, gbdt_feats_vals.shape[1])]
combineDf = pd.concat(
[combineDf, pd.DataFrame(data=gbdt_feats_vals, index=range(0, gbdt_feats_vals.shape[0]), columns=gbdt_columns)],
axis=1)
# One-hot encode the GBDT leaf-index features
origin_columns = combineDf.columns
for col in gbdt_columns:
    combineDf = pd.concat([combineDf, pd.get_dummies(combineDf[col], prefix=col)], axis=1)
gbdt_onehot_feats = [col for col in combineDf.columns if col not in origin_columns]
# Restore the train / test splits
train = combineDf[:train_sz]
test = combineDf[train_sz:]
del combineDf
gc.collect()
lr_gbdt_feats = lr_feats + gbdt_onehot_feats
lr_gbdt_model = LogisticRegression(penalty='l2', C=1)
lr_gbdt_model.fit(train[lr_gbdt_feats], train[label])
print("Train................")
do_model_metric(y_true=train[label], y_pred=lr_gbdt_model.predict(train[lr_gbdt_feats]),
y_pred_prob=lr_gbdt_model.predict_proba(train[lr_gbdt_feats]))
print("Test..................")
do_model_metric(y_true=test[label], y_pred=lr_gbdt_model.predict(test[lr_gbdt_feats]),
y_pred_prob=lr_gbdt_model.predict_proba(test[lr_gbdt_feats]))
3.5 Generating GBDT features with apply
Code:
# coding: utf-8
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost.sklearn import XGBClassifier
import numpy as np


class XgboostFeature():
    ## XGBoost parameters can be passed in.
    ## The most commonly set one is the number of new features, i.e. the number of trees (default 30).
    def __init__(self, n_estimators=30, learning_rate=0.3, max_depth=3, min_child_weight=1, gamma=0.3,
                 subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                 scale_pos_weight=1, reg_alpha=1e-05, reg_lambda=1, seed=27):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.objective = objective
        self.nthread = nthread
        self.scale_pos_weight = scale_pos_weight
        self.reg_alpha = reg_alpha
        self.reg_lambda = reg_lambda
        self.seed = seed
        print('Xgboost Feature start, new_feature number:', n_estimators)

    def mergeToOne(self, X, X2):
        ## Column-wise concatenation of the original features and the GBDT leaf-index features
        return np.hstack([np.asarray(X), np.asarray(X2)])

    ## Split training: one part of the training set builds the GBDT,
    ## the other part is merged with the new features to form the new training set
    def fit_model_split(self, X_train, y_train, X_test, y_test):
        ## X_train_1 builds the GBDT; X_train_2 is combined with the new features
        X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size=0.6, random_state=0)
        clf = XGBClassifier(
            learning_rate=self.learning_rate,
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_child_weight=self.min_child_weight,
            gamma=self.gamma,
            subsample=self.subsample,
            colsample_bytree=self.colsample_bytree,
            objective=self.objective,
            nthread=self.nthread,
            scale_pos_weight=self.scale_pos_weight,
            reg_alpha=self.reg_alpha,
            reg_lambda=self.reg_lambda,
            seed=self.seed)
        clf.fit(X_train_1, y_train_1)
        y_pre = clf.predict(X_train_2)
        y_pro = clf.predict_proba(X_train_2)[:, 1]
        print("AUC Score : %f" % metrics.roc_auc_score(y_train_2, y_pro))
        print("Accuracy : %.4g" % metrics.accuracy_score(y_train_2, y_pre))
        new_feature = clf.apply(X_train_2)
        X_train_new2 = self.mergeToOne(X_train_2, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("The new training set is 40% smaller than the original one")
        return X_train_new2, y_train_2, X_test_new, y_test

    ## Train on the full training set
    def fit_model(self, X_train, y_train, X_test, y_test):
        clf = XGBClassifier(
            learning_rate=self.learning_rate,
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_child_weight=self.min_child_weight,
            gamma=self.gamma,
            subsample=self.subsample,
            colsample_bytree=self.colsample_bytree,
            objective=self.objective,
            nthread=self.nthread,
            scale_pos_weight=self.scale_pos_weight,
            reg_alpha=self.reg_alpha,
            reg_lambda=self.reg_lambda,
            seed=self.seed)
        clf.fit(X_train, y_train)
        y_pre = clf.predict(X_test)
        y_pro = clf.predict_proba(X_test)[:, 1]
        print("AUC Score : %f" % metrics.roc_auc_score(y_test, y_pro))
        print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pre))
        new_feature = clf.apply(X_train)
        X_train_new = self.mergeToOne(X_train, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("Training set sample count remains the same")
        return X_train_new, y_train, X_test_new, y_test
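A hypothetical usage sketch for the class above (synthetic data, purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data only to show how the class is called
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

xgb_feature = XgboostFeature(n_estimators=30)
# Train on the full training set...
X_train_new, y_train_new, X_test_new, y_test_new = xgb_feature.fit_model(X_train, y_train, X_test, y_test)
# ...or split the training set so that the GBDT and the downstream model see disjoint data:
# X_train_new, y_train_new, X_test_new, y_test_new = xgb_feature.fit_model_split(X_train, y_train, X_test, y_test)
print(X_train_new.shape, X_test_new.shape)  # original 20 columns plus 30 leaf-index columns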
4. Templates
4.1 GBDT + LR template
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression