Jane Street Market Prediction Rank 10 (1st rerun): Lessons Learned, Code (Part 2/3)

Competition link: Jane Street Market Prediction

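XGBoost model

This part covers the XGBoost model used in the competition. The hyperparameters are based on a public notebook, with L1 and L2 regularization added on top. The regularization improved the score both offline and on the leaderboard, so it counts as one of the score-boosting tricks.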

Loading dependencies

import os
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import cufflinks as cf
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import *
from tqdm.notebook import tqdm

pd.set_option('display.max_columns', 500)
warnings.filterwarnings("ignore")
print("XGBoost version:", xgb.__version__)

Loading the data

Set path to the directory where the data is stored.

# path = './'
path = '../input/jane-street-market-prediction/'

train = pd.read_csv(path+'train.csv')
features = pd.read_csv(path+'features.csv')
example_test = pd.read_csv(path+'example_test.csv')
sample_prediction_df = pd.read_csv(path+'example_sample_submission.csv')
print ("Data is loaded!")

Reducing the dataset's memory usage

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            # print("******************************")
            # print("Column: ",col)
            # print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all(): 
                NAlist.append(col)
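                # Assign back instead of using inplace on a column selection,
                # and keep mn in sync so the dtype checks below see the fill value
                props[col] = props[col].fillna(mn - 1)
                mn = mn - 1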
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)    
            
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            
            # Print new column type
            # print("dtype after: ",props[col].dtype)
            # print("******************************")
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist

train, _ = reduce_mem_usage(train)
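
As a point of comparison, the same kind of downcasting can be sketched with pandas' built-in pd.to_numeric. This is an alternative, not the function used above: the helper name downcast_numeric and the toy frame are made up for illustration, and unlike reduce_mem_usage it leaves NaNs in place instead of filling them with the column minimum minus one.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        else:
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy stand-in for train.csv (hypothetical values)
toy = pd.DataFrame({
    "date": np.arange(5, dtype=np.int64),
    "feature_1": np.linspace(0.0, 1.0, 5),
    "feature_2": [0.5, np.nan, 1.5, 2.5, 3.5],
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Keeping the NaNs means the mean imputation in the next step still has something to fill, at the cost of never producing integer columns for features with missing values.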

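Setting up the training set and labels

No feature engineering is done here. A small amount of feature engineering did raise the offline score, but it brought no improvement online, so it was left out of the final model. Dates that a model flagged as outliers (anomalous data) are removed, which was worth 200 points on the leaderboard; see the linked reference for the exact algorithm.
The label is resp_3 rather than resp; the reasons are given in Part 1 of this series.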

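# Dates flagged as outliers by the model-based procedure mentioned above
exclude = set([2,5,19,26,29,36,37,43,63,77,87,173,262,264,268,270,276,294,347,499])
train = train[~train.date.isin(exclude)]

features = [c for c in train.columns if 'feature' in c]

# Column means for every feature except feature_0; reused at inference time.
# reduce_mem_usage may already have replaced NaNs with (column min - 1),
# in which case the fillna below leaves the training data unchanged.
f_mean = train[features[1:]].mean()
train[features[1:]] = train[features[1:]].fillna(f_mean)

# Keep only rows with positive weight
train = train[train.weight>0]

# Binary targets: 1 when the corresponding return is positive
train['action'] = ((train['resp'].values) > 0).astype('int')
train['action1'] = ((train['resp_1'].values) > 0).astype('int')
train['action2'] = ((train['resp_2'].values) > 0).astype('int')
train['action3'] = ((train['resp_3'].values) > 0).astype('int')
train['action4'] = ((train['resp_4'].values) > 0).astype('int')

X = train.loc[:, train.columns.str.contains('feature')]
y = train.loc[:, 'action3'].astype('int').values  # train on the resp_3 label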
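
The post refers to offline scores several times without showing how the validation set was built. One plausible setup, which is an assumption here and not necessarily what the author used, is to hold out the most recent dates so that validation never looks into the past of the training data. The helper split_by_date and the toy frame below are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_date(df, n_valid_dates, date_col="date"):
    """Split df into (train, valid), with valid being the last n_valid_dates dates."""
    if n_valid_dates < 1:
        raise ValueError("n_valid_dates must be at least 1")
    dates = np.sort(df[date_col].unique())
    valid_dates = dates[-n_valid_dates:]
    is_valid = df[date_col].isin(valid_dates)
    return df[~is_valid], df[is_valid]


# Toy data: 10 dates with 3 rows each (values are made up)
toy = pd.DataFrame({
    "date": np.repeat(np.arange(10), 3),
    "resp_3": np.tile([-0.1, 0.2, 0.3], 10),
})
toy["action3"] = (toy["resp_3"] > 0).astype(int)

tr, va = split_by_date(toy, n_valid_dates=2)
print(len(tr), len(va), sorted(va["date"].unique()))
```

With a split like this, the comparison between regularization strengths or round counts can be made on the held-out dates before anything is submitted.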

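XGBoost model and training

The hyperparameters added here are the L1 and L2 regularization strengths (reg_alpha and reg_lambda). A value of 10 gave the best offline result and also the best leaderboard score.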

clf2 = xgb.XGBClassifier(
      n_estimators=400,
      max_depth=11,
      learning_rate=0.05,
      subsample=0.90,
      colsample_bytree=0.7,
      missing=-999,
      random_state=2020,
      tree_method='gpu_hist',  # GPU histogram algorithm; XGBoost 2.0+ deprecates this in favour of tree_method='hist' with device='cuda'
      reg_alpha=10,
      reg_lambda=10,
)
  
clf2.fit(X, y)
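
For reference, the two added parameters correspond to the penalty terms in XGBoost's regularized objective. The symbols below are not used in the original post: T is the number of leaves in a tree, w_j the weight of leaf j, and G_j and H_j the sums of first- and second-order gradients of the loss over the samples in leaf j.

```latex
\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|

w_j^{*} = -\frac{\operatorname{sign}(G_j)\,\max\left(|G_j| - \alpha,\ 0\right)}{H_j + \lambda}
```

Here reg_lambda is λ and reg_alpha is α. Setting both to 10 shrinks every leaf weight toward zero and sets a leaf's weight to exactly zero when its gradient sum is no larger than α in magnitude, which is one way to read why the regularized model generalizes better on noisy returns.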

Generating predictions

import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

tofill = f_mean.values.reshape((1,-1))
for (test_df, sample_prediction_df) in iter_test:
    
    
    if test_df['weight'].values[0] == 0:
        sample_prediction_df.action = 0
    else:
        X_test = test_df.loc[:, features].values
        if np.isnan(X_test.sum()):
            X_test[0,1:] = np.where(np.isnan(X_test[0,1:]), tofill, X_test[0,1:])
        y_preds = int((clf2.predict_proba(X_test)[0][1])>0.5)
        sample_prediction_df.action = y_preds
    env.predict(sample_prediction_df)
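
The imputation and thresholding inside the loop can be tried out without the janestreet environment. The sketch below is a stand-alone illustration with made-up numbers: fill_missing and to_action are hypothetical helpers, the first column is skipped to mirror features[1:] above, and a fixed probability stands in for clf2.predict_proba.

```python
import numpy as np


def fill_missing(X, fill_values, skip_first=True):
    """Return a copy of X with NaNs replaced column-wise by fill_values."""
    X = np.array(X, dtype=float, copy=True)
    start = 1 if skip_first else 0
    block = X[:, start:]  # view into X, so writes land in the copy
    mask = np.isnan(block)
    # Broadcast the 1-D fill vector across rows, then pick the masked cells
    block[mask] = np.broadcast_to(fill_values, block.shape)[mask]
    return X


def to_action(prob_positive, threshold=0.5):
    """Map a predicted probability to the 0/1 action, as in the loop above."""
    return int(prob_positive > threshold)


raw = np.array([[1.0, np.nan, 3.0],
                [np.nan, 5.0, np.nan]])
fill = np.array([10.0, 20.0])  # stand-in for f_mean.values

filled = fill_missing(raw, fill)
print(filled)
print(to_action(0.62), to_action(0.41))
```

Because it works on whole arrays, the same helper would also handle batches with more than one row, which the indexing on row 0 in the loop does not.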

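Conclusion

The XGBoost model given here scores 79xx on the leaderboard.
We gained some score by removing outliers, and adding L1 and L2 regularization improved the score both online and offline. Note that the model here runs for 400 rounds; offline, the best-scoring model was already overfitting after 100 rounds (no further useful information could be extracted), so the XGBoost model in the final submission was trained for only 100 rounds.
On the other hand, changing the random seed does not move the XGBoost score much, which suggests that the model really did learn useful information from the data rather than landing on a lucky random fluctuation.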
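
The seed check described above can be organized as a small loop. The sketch below is only a scaffold under stated assumptions: seed_spread is a hypothetical helper, and toy_score is a stand-in that returns a fixed value plus seed-dependent noise. In practice it would be replaced by a function that trains the model with random_state=seed and returns its offline score.

```python
import numpy as np


def seed_spread(score_for_seed, seeds):
    """Call score_for_seed once per seed and summarise how much the score moves."""
    scores = np.array([score_for_seed(s) for s in seeds], dtype=float)
    return {
        "scores": scores,
        "mean": scores.mean(),
        "std": scores.std(ddof=1) if len(scores) > 1 else 0.0,
        "range": scores.max() - scores.min(),
    }


def toy_score(seed):
    """Stand-in scorer: constant signal of 100 plus small seed-dependent noise."""
    rng = np.random.default_rng(seed)
    return 100.0 + rng.normal(scale=1.0)


summary = seed_spread(toy_score, seeds=[2020, 2021, 2022, 2023, 2024])
print(summary["mean"], summary["std"], summary["range"])
```

A spread that is small relative to the gain from a change, such as adding regularization, is what supports the claim that the gain is not a seed artifact.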