A Code Framework for Traditional Machine Learning and Data Mining Competitions
阿新 • Published: 2018-12-11
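The code for a traditional data mining competition typically follows this framework:
1. Import libraries
2. Read the data files
3. Define the feature-building functions
4. Call the functions to build the features
5. Split the dataset into features and labels
6. Cross-validate the model
7. Train the model and predict
8. Write out the result file
To add new features in the hope of improving the score, only parts 3 and 4 of the framework need to be extended.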
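As a rough illustration of extending parts 3 and 4, the sketch below defines one extra feature function and merges its output by `EID`. The function name `get_alter_count_feature`, the column `ALTERNO`, the `alter_count` feature, and the toy values are all made up for this example; only the `EID` key and the left-merge pattern come from the framework itself.

```python
import pandas as pd


# Part 3: a hypothetical feature-building function.
# It counts how many rows each EID has in a raw table.
def get_alter_count_feature(df):
    feat = df.groupby('EID').size().reset_index(name='alter_count')
    return feat


# Toy stand-ins for a raw table and the base entity list (made-up values).
alter = pd.DataFrame({'EID': [1, 1, 2], 'ALTERNO': ['01', '05', '01']})
base = pd.DataFrame({'EID': [1, 2, 3]})

# Part 4: call the function, then left-merge so every EID is kept.
alter_feat = get_alter_count_feature(alter)
dataset = pd.merge(base, alter_feat, on='EID', how='left')

# EIDs with no rows in the raw table get NaN after the merge; treat them as 0.
dataset['alter_count'] = dataset['alter_count'].fillna(0).astype(int)
```

Because every feature table is keyed on `EID`, a new function like this plugs in without touching the validation, training, or output steps.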
# coding:utf-8

# 1. Import libraries
import numpy as np
import pandas as pd
...

# 2. Read the data files
train = pd.read_csv('../data/input/train.csv')
test = pd.read_csv('../data/input/evaluation_public.csv')
...

# 3. Define the feature-building functions
def get_entbase_feature(df):
    ...

def get_alter_feature(df):
    ...

...

# 4. Call the functions to build the features
entbase_feat = get_entbase_feature(entbase)
alter_feat = get_alter_feature(alter)
...

# 5. Split the dataset into features and labels
# Merge all feature tables on EID, then attach them to the train and test sets
dataset = pd.merge(entbase_feat, alter_feat, on='EID', how='left')
...
trainset = pd.merge(train, dataset, on='EID', how='left')
testset = pd.merge(test, dataset, on='EID', how='left')

train_feature = trainset.drop(['TARGET', 'ENDDATE'], axis=1)
train_label = trainset.TARGET.values
test_feature = testset
test_index = testset.EID.values

# 6. Cross-validate the model
...
iterations, best_score = xgb_cv(train_feature, train_label, params, config['folds'], config['rounds'])
...

# 7. Train the model and predict
...
model, pred = xgb_predict(train_feature, train_label, test_feature, iterations, params)
...

# 8. Write out the result file
res = store_result(test_index, pred, 0.18, '1207-xgb-%f(r%d)' % (best_score, iterations))
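The framework calls `store_result` without showing its body. The following is only a guess at what such a helper might do, assuming the third argument (0.18 above) is a probability threshold and the fourth is a tag used as the file name. The output columns `PRED_LABEL` and `PROB` and the `out_dir` parameter are hypothetical and do not come from the original code.

```python
import os

import numpy as np
import pandas as pd


def store_result(test_index, pred, threshold, name, out_dir=None):
    """Hypothetical sketch: threshold the predictions and optionally save a CSV."""
    pred = np.asarray(pred, dtype=float)
    result = pd.DataFrame({
        'EID': test_index,
        # Assumption: a sample is labeled 1 when its score exceeds the threshold.
        'PRED_LABEL': (pred > threshold).astype(int),
        'PROB': pred,
    })
    if out_dir is not None:
        # Assumption: the tag passed by the caller becomes the file name.
        result.to_csv(os.path.join(out_dir, name + '.csv'), index=False)
    return result


# Toy usage with made-up IDs and scores; nothing is written to disk here.
demo = store_result(np.array([101, 102, 103]), [0.05, 0.30, 0.18], 0.18, 'demo')
```

Embedding the cross-validation score and the iteration count in the file name, as the framework's call does, makes it easy to trace each submission file back to the run that produced it.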