Modeling CTR Prediction with Machine Learning (Part 1)
阿新 • Published 2019-01-22
Dataset description:
train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
After decompression, train.csv is 5.6 GB and the sample count is very large. As a rough rule, a 200 MB CSV with 20-30 columns read into a pandas DataFrame occupies about 1 GB of memory, so a dataset this large generally cannot be held in memory on a single machine (see the chunked-reading sketch after this list).
test - Test set. 1 day of ads for testing your model predictions.
After decompression, test.csv is 673 MB, which is not particularly large.
sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
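The full train.csv will not fit in memory, but pandas can stream it in chunks. A minimal sketch of chunked reading with per-chunk down-sampling (the chunk size and the 10% keep-rate are illustrative choices, not from the original script):

import pandas as pd

chunks = []
for chunk in pd.read_csv('train.csv', chunksize=1000000):
    # Keep every click and a small share of non-clicks per chunk,
    # so the concatenated result fits comfortably in memory.
    clicks = chunk[chunk.click == 1]
    non_clicks = chunk[chunk.click == 0].sample(frac=0.1, random_state=0)
    chunks.append(pd.concat([clicks, non_clicks]))
down_sampled = pd.concat(chunks, ignore_index=True)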
Filter the features and down-sample to shrink the dataset:
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 12:51:31 2017
@author: JR.Lu
"""
import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv', nrows=10000000)
test_df = pd.read_csv('test.csv')

# Down-sampling: keep all clicks and an equal number of non-clicks.
# (The counts in the comments below add up to 20,000,000, i.e. they
# come from a run that read 20M rows.)
temp_0 = train_df.click == 0
data_0 = train_df[temp_0]  # 16546986/20000000: non-clicks are about 0.8273493 of the rows
temp_1 = train_df.click == 1
data_1 = train_df[temp_1]  # 3453014 clicks
data_0_ed = data_0[0:len(data_1)]
data_downsampled = pd.concat([data_1, data_0_ed])
# Select features by each column's effect on the label, inspected here
# with a per-value click-through rate, e.g.:
# train_df.groupby(train_df['device_model'])['click'].mean()
columns_select_test = ['id', 'device_type', 'C1', 'C15', 'C16', 'banner_pos', 'site_category']
columns_select = ['click', 'device_type', 'C1', 'C15', 'C16', 'banner_pos', 'site_category']
data_downsampled_1 = data_downsampled[columns_select]
test_small = test_df[columns_select_test]
# Shuffle the rows
sampler = np.random.permutation(len(data_downsampled_1))
data_downsampled_1 = data_downsampled_1.take(sampler)
data_downsampled_1.to_csv('train_small.csv', index=False)
test_small.to_csv('test_small.csv', index=False)
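One caveat the script leaves implicit: training on a roughly 1:1 down-sampled set inflates predicted click probabilities relative to the true base rate. A standard correction, sketched here (not part of the original post), rescales a prediction p when negatives were kept with probability w:

def calibrate(p, w):
    # Map a probability predicted on negative-down-sampled data back to
    # the original distribution; w is the fraction of negatives kept.
    return p / (p + (1.0 - p) / w)

# Here w is about 3453014/16546986, the keep-rate that produced the 1:1 set.
w = 3453014.0 / 16546986.0
print(calibrate(0.5, w))  # ~0.17: a 0.5 on the balanced set is rare overall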
Next, test the models on these simple features, using grid search to pick hyperparameters:
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 20:36:46 2017
@author: JR.Lu
"""
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
def logloss(act, pred):
    '''
    The competition uses logloss as its evaluation metric.
    '''
    epsilon = 1e-15
    pred = np.maximum(epsilon, pred)
    pred = np.minimum(1 - epsilon, pred)
    ll = np.sum(act * np.log(pred) + np.subtract(1, act) * np.log(np.subtract(1, pred)))
    ll = ll * -1.0 / len(act)
    return ll
# Report the evaluation metrics
def print_metrics(true_values, predicted_values):
    print("logloss: ", logloss(true_values, predicted_values))
    print("Accuracy: ", metrics.accuracy_score(true_values, predicted_values))
    print("AUC: ", metrics.roc_auc_score(true_values, predicted_values))
    print("Confusion Matrix: ", metrics.confusion_matrix(true_values, predicted_values))
    print(metrics.classification_report(true_values, predicted_values))
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    '''
    Given an estimator, a title, and X/y, plot the model's learning curve.
    '''
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
# Read the down-sampled small dataset
train_df = pd.read_csv('train_small.csv', nrows=100000)
test_df = pd.read_csv('test_small.csv')
feature_columns = ['device_type', 'C1', 'C15', 'C16', 'banner_pos',
                   'site_category']
train_x = train_df[feature_columns]
test_x = test_df[feature_columns]
# Concatenate train and test so both share the same dummy columns
x = pd.concat([train_x, test_x])
# One-hot encoding; prefix each dummy with its source column so that
# identical values in different columns do not collide
temp = x
for each in feature_columns:
    temp_dummies = pd.get_dummies(x[each], prefix=each)
    temp = pd.concat([temp, temp_dummies], axis=1)
x_dummies = temp.drop(feature_columns, axis=1)
X_train = x_dummies[0:len(train_x)]
Y_train = train_df['click']
x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.33)
# Modeling
# Hyperparameter selection via GridSearchCV
"""
LR has only a few parameters worth tuning. GridSearchCV's param_grid maps
each parameter name to a list of candidate values (strings such as solver
names are allowed too).
LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
                   intercept_scaling=1, class_weight=None, random_state=None,
                   solver='liblinear', max_iter=100, multi_class='ovr',
                   verbose=0, warm_start=False, n_jobs=1)
solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag'}, default: 'liblinear'
"""
#param_LR = {'C': [0.1, 1, 2]}
#
#gsearch_LR = GridSearchCV(estimator=LogisticRegression(penalty='l1', solver='liblinear'),
#                          param_grid=param_LR, cv=3)
#gsearch_LR.fit(x_train, y_train)
#gsearch_LR.cv_results_, gsearch_LR.best_params_, gsearch_LR.best_score_
title = 'LRlearning{penalty=l1,solver=liblinear,cv=3}'
plot_learning_curve(LogisticRegression(penalty='l1', solver='liblinear', C=1),
                    title=title, cv=10, X=x_train, y=y_train)
# GBDT model
#param_GBDT = {'learning_rate': [0.1, 0.5],
#              'n_estimators': [100, 200, 300, 400],
#              'max_depth': [3, 4]}
#
#gsearch_GBDT = GridSearchCV(estimator=GradientBoostingClassifier(),
#                            param_grid=param_GBDT, cv=10)
#gsearch_GBDT.fit(x_train, y_train)
##gsearch_GBDT.cv_results_
#gsearch_GBDT.best_params_
#gsearch_GBDT.best_score_
# Best parameters: 'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 3
title = 'GBDTlearning{n_estimators: 200, learning_rate: 0.1, max_depth: 3}'
plot_learning_curve(estimator=GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3),
                    title=title, cv=2, X=x_train, y=y_train)
# Slightly better than LR
# Random forest model
#param_rf = {'n_estimators': [100, 200, 300],
#            'max_depth': [2, 3, 4]}
#
#gsearch_rf = GridSearchCV(estimator=RandomForestClassifier(),
#                          param_grid=param_rf, cv=3)
#
#gsearch_rf.fit(x_train, y_train)
##gsearch_rf.cv_results_
#gsearch_rf.best_params_
#gsearch_rf.best_score_
# Best parameters: {'n_estimators': 200, 'max_depth': 4}
title = 'RFlearning{n_estimators: 200, max_depth: 4}'
plot_learning_curve(estimator=RandomForestClassifier(n_estimators=200, max_depth=4),
                    title=title, cv=2, X=x_train, y=y_train)
# Predict
lr_model = LogisticRegression(penalty='l1', solver='liblinear', C=1)
gbdt_model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
rf_model = RandomForestClassifier(n_estimators=200, max_depth=4)
lr_model.fit(x_train, y_train)
gbdt_model.fit(x_train, y_train)
rf_model.fit(x_train, y_train)
# predict() yields hard 0/1 labels, which makes the logloss and AUC below
# pessimistic; see the note after the results
lr_predict = lr_model.predict(x_test)
gbdt_predict = gbdt_model.predict(x_test)
rf_predict = rf_model.predict(x_test)
print("LR model performance: -------")
print_metrics(y_test, lr_predict)
print("GBDT model performance: -------")
print_metrics(y_test, gbdt_predict)
print("RF model performance: -------")
print_metrics(y_test, rf_predict)
The results are roughly as follows:
LR model performance: -------
logloss: 14.8549419892
Accuracy: 0.569909090909
AUC: 0.570339428461
Confusion Matrix: [[11141 5293]
[ 8900 7666]]
precision recall f1-score support
0 0.56 0.68 0.61 16434
1 0.59 0.46 0.52 16566
avg / total 0.57 0.57 0.56 33000
GBDT model performance: -------
logloss: 14.7952832304
Accuracy: 0.571636363636
AUC: 0.572068547036
Confusion Matrix: [[11177 5257]
[ 8879 7687]]
precision recall f1-score support
0 0.56 0.68 0.61 16434
1 0.59 0.46 0.52 16566
avg / total 0.58 0.57 0.57 33000
RF model performance: -------
logloss: 15.4713065032
Accuracy: 0.552060606061
AUC: 0.553565705536
Confusion Matrix: [[15281 1153]
[13629 2937]]
precision recall f1-score support
0 0.53 0.93 0.67 16434
1 0.72 0.18 0.28 16566
avg / total 0.62 0.55 0.48 33000
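A note on these numbers: a logloss near 15 looks broken, but it follows from feeding hard 0/1 labels from predict() into logloss (and roc_auc_score). A wrong hard prediction clipped at epsilon = 1e-15 costs -log(1e-15) ≈ 34.5, and an error rate of about 0.43 gives 0.43 × 34.5 ≈ 14.8, which matches the output above. The competition scores predicted probabilities, which give far smaller values; a minimal sketch reusing the objects from the script:

# Column 1 of predict_proba is P(click = 1)
lr_proba = lr_model.predict_proba(x_test)[:, 1]
print("logloss on probabilities: ", logloss(y_test, lr_proba))
print("AUC on probabilities: ", metrics.roc_auc_score(y_test, lr_proba))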
And a plot of the results: