Jane Street Market Prediction Rank10 (1st rerun): lessons learned, code part (2/3)

阿新 • Published: 2021-04-03


XGBoost model

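This part covers the XGBoost model used in the competition. The hyperparameters are taken from a reference notebook, with L1 and L2 regularization added on top. The regularization improved the score both offline (local validation) and online (leaderboard), so it was one of the changes that gained points.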
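For readers who want to reproduce an offline score, the following is a minimal sketch of the competition's utility metric, assuming that is what the offline score refers to (the post does not say how it was computed). The function name `utility_score` and the toy inputs are illustrative, not from the original notebook.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Competition utility: total P&L scaled by a t-statistic clipped to [0, 6]."""
    date = np.asarray(date)
    pnl = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    # p_i: P&L summed over all trades of day i
    days, day_idx = np.unique(date, return_inverse=True)
    p = np.bincount(day_idx, weights=pnl, minlength=len(days))
    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:
        return 0.0  # no trades taken, or every daily P&L is zero
    # Annualised ratio of total P&L to its daily dispersion
    t = p.sum() / denom * np.sqrt(250.0 / len(days))
    return float(min(max(t, 0.0), 6.0) * p.sum())

# Toy usage with made-up numbers: two days, four trades
score = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.1, -0.2, 0.3, 0.1],
    action=[1, 0, 1, 1],
)
print(score)
```

Because the metric multiplies by total P&L, skipping a losing trade (action 0) matters as much as taking a winning one.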

Loading dependencies

import os
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import cufflinks as cf
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import *
from tqdm.notebook import tqdm

pd.set_option('display.max_columns', 500)
warnings.filterwarnings("ignore")
print("XGBoost version:", xgb.__version__)

Reading the data

Change path to the directory where the data is stored.

# path = './'
path = '../input/jane-street-market-prediction/'

train = pd.read_csv(path+'train.csv')
features = pd.read_csv(path+'features.csv')
example_test = pd.read_csv(path+'example_test.csv')
sample_prediction_df = pd.read_csv(path+'example_sample_submission.csv')
print ("Data is loaded!")

Compressing the dataset

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            # print("******************************")
            # print("Column: ",col)
            # print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
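            # Integer dtypes do not support NA, so missing values are filled
            # with a sentinel one below the column minimum
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                # Assign back instead of using inplace=True on a column selection
                props[col] = props[col].fillna(mn - 1)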
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)    
            
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            
            # Print new column type
            # print("dtype after: ",props[col].dtype)
            # print("******************************")
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist

train, _ = reduce_mem_usage(train)
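To make the effect of the function above concrete, here is a minimal sketch of the same downcasting idea on a toy frame. The column names `small_int` and `real_val` are made up for illustration and do not come from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (hypothetical columns)
demo = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64) % 100,  # values 0..99 fit in uint8
    "real_val": np.linspace(0.0, 1.0, 1000),             # float64, downcast to float32
})

before = demo.memory_usage(index=False).sum()
demo["small_int"] = demo["small_int"].astype(np.uint8)
demo["real_val"] = demo["real_val"].astype(np.float32)
after = demo.memory_usage(index=False).sum()

print(f"{before} bytes -> {after} bytes ({100 * after / before:.1f}% of the original)")
```

The saving on float columns comes at the cost of precision: float32 keeps roughly 7 significant decimal digits, which is usually acceptable for tree models but worth keeping in mind.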

Setting up the training set and labels

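Note that no feature engineering is done here. A small amount of feature engineering did improve the offline score, but it gave no improvement online, so it was left out of the final model. The days that a model flagged as outliers are removed here, which was worth about 200 points online; the specific algorithm is described in the linked reference.
The label is built from resp_3 rather than resp; the reasons are given in part one of this series.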

# Drop the days flagged as outliers
exclude = {2, 5, 19, 26, 29, 36, 37, 43, 63, 77, 87, 173, 262, 264, 268, 270, 276, 294, 347, 499}
train = train[~train.date.isin(exclude)].copy()

# Note: this rebinds `features` from the features.csv frame to a list of column names
features = [c for c in train.columns if 'feature' in c]

# Fill missing values with the column mean (feature_0 is skipped)
f_mean = train[features[1:]].mean()
train[features[1:]] = train[features[1:]].fillna(f_mean)

# Keep only rows with positive weight
train = train[train.weight > 0].copy()

# Binary labels: 1 if the corresponding return is positive, else 0
train['action'] = (train['resp'].values > 0).astype('int')
train['action1'] = (train['resp_1'].values > 0).astype('int')
train['action2'] = (train['resp_2'].values > 0).astype('int')
train['action3'] = (train['resp_3'].values > 0).astype('int')
train['action4'] = (train['resp_4'].values > 0).astype('int')

X = train.loc[:, train.columns.str.contains('feature')]
y = train.loc[:, 'action3'].astype('int').values
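The offline scores mentioned above require some validation split, which the post does not describe. As one possible setup (an assumption, not the author's method), the sketch below holds out the most recent days so that validation rows always come after training rows in time. The helper `split_by_date` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_date(df, n_valid_days):
    """Hold out the last n_valid_days distinct dates as the validation set."""
    dates = np.sort(df["date"].unique())
    cutoff = dates[-n_valid_days]
    train_part = df[df["date"] < cutoff]
    valid_part = df[df["date"] >= cutoff]
    return train_part, valid_part

# Toy data with made-up returns
toy = pd.DataFrame({
    "date": [0, 0, 1, 2, 2, 3, 4, 4],
    "resp_3": [0.1, -0.2, 0.3, -0.1, 0.2, 0.05, -0.3, 0.4],
})
# Same labelling rule as in the post: 1 if resp_3 is positive
toy["action3"] = (toy["resp_3"] > 0).astype("int")

tr, va = split_by_date(toy, n_valid_days=2)
print(len(tr), "training rows,", len(va), "validation rows")
```

Splitting by date rather than at random avoids rows from the same day ending up on both sides of the split.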

XGBoost model and training

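The hyperparameters added here are the L1 and L2 regularization terms (reg_alpha and reg_lambda). A value of 10 was the best setting found offline, and it also gave the best score online.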

clf2 = xgb.XGBClassifier(
      n_estimators=400,
      max_depth=11,
      learning_rate=0.05,
      subsample=0.90,
      colsample_bytree=0.7,
      missing=-999,
      random_state=2020,
      tree_method='gpu_hist',  # THE MAGICAL PARAMETER
      reg_alpha=10,
      reg_lambda=10,
)
  
clf2.fit(X, y)
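To give some intuition for what reg_alpha and reg_lambda do, the sketch below computes the value of a single leaf from the sum of gradients G and the sum of hessians H of the rows in it: the L1 term soft-thresholds G and the L2 term enlarges the denominator. This is a simplified illustration that ignores other settings such as max_delta_step; the function name `leaf_weight` and the numbers are made up.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf value under L1 (reg_alpha) and L2 (reg_lambda) regularization."""
    # L1: shrink the gradient sum toward zero by reg_alpha (soft thresholding)
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2: a larger reg_lambda scales the leaf value down
    return -shrunk / (hess_sum + reg_lambda)

# Hypothetical leaf with G = -30 and H = 20
unregularized = leaf_weight(-30.0, 20.0, reg_alpha=0.0, reg_lambda=0.0)
regularized = leaf_weight(-30.0, 20.0, reg_alpha=10.0, reg_lambda=10.0)
weak_signal = leaf_weight(5.0, 20.0, reg_alpha=10.0, reg_lambda=10.0)
print(unregularized, regularized)
```

With both terms set to 10, leaves with little gradient evidence contribute nothing and the rest are damped, which fits the post's observation that regularization helps on this noisy data.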

Generating the predictions

import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

tofill = f_mean.values.reshape((1,-1))
for (test_df, sample_prediction_df) in iter_test:
    
    
    if test_df['weight'].values[0] == 0:
        sample_prediction_df.action = 0
    else:
        X_test = test_df.loc[:, features].values
        if np.isnan(X_test.sum()):
            X_test[0,1:] = np.where(np.isnan(X_test[0,1:]), tofill, X_test[0,1:])
        y_preds = int((clf2.predict_proba(X_test)[0][1])>0.5)
        sample_prediction_df.action = y_preds
    env.predict(sample_prediction_df)
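The missing-value handling inside the loop above can be tried in isolation. The sketch below applies the same np.where pattern to a single toy row; the values in `f_mean` and `X_test` are made up, and the first column stands in for feature_0, which the loop leaves untouched.

```python
import numpy as np

f_mean = np.array([0.5, -1.0, 2.0])               # stand-in for the training means of features[1:]
X_test = np.array([[1.0, np.nan, 0.3, np.nan]])   # one row: feature_0 followed by three features

row = X_test[0, 1:]                                # every column except feature_0
# Keep observed values, substitute the training mean where the value is NaN
X_test[0, 1:] = np.where(np.isnan(row), f_mean, row)
print(X_test)
```

Using the training-set means at inference time keeps the fill values fixed, so a test row is treated the same way regardless of what else arrives in the stream.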

Conclusion

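The XGBoost model presented here scores 79xx online.
Removing the outlier days raised the score somewhat, and adding L1 and L2 regularization improved it both offline and online. Note that the model here is trained for 400 rounds. In practice, the model with the best offline score started overfitting after 100 rounds (it could no longer extract useful information), so the XGBoost model in the final submission was trained for only 100 rounds.
In addition, changing the random seed showed that the XGBoost score does not fluctuate much, which suggests that the model really did learn useful information from the data rather than reaching its score by chance.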
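The choice of 100 rounds over 400 is an instance of picking the round where the validation score peaks. The post does not say how the cutoff was found, so the following is only a sketch of a patience-based rule applied to a recorded validation curve; `best_round` and the scores are made up for illustration.

```python
def best_round(valid_scores, patience):
    """Return (best_index, stop_index) for a higher-is-better validation curve."""
    best_idx = 0
    for i, score in enumerate(valid_scores):
        if score > valid_scores[best_idx]:
            best_idx = i                      # new best score, reset the wait
        elif i - best_idx >= patience:
            return best_idx, i                # no improvement for `patience` checks
    return best_idx, len(valid_scores) - 1    # never triggered: ran to the end

# Made-up validation scores, one per evaluation checkpoint
scores = [0.50, 0.52, 0.53, 0.54, 0.535, 0.53, 0.52]
best, stopped = best_round(scores, patience=2)
print("best checkpoint:", best, "stopped at:", stopped)
```

A rule like this trades a little extra training time (the patience window) for not having to guess the number of rounds in advance.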
