
Tmall Repeat Buyer Prediction Competition: Model Training, Validation, and Evaluation

Tags: Tianchi competition (Tmall repeat buyer prediction), deep learning, machine learning


Tianchi competition page: link

Theoretical background

  • Classification is a supervised learning process: given a large amount of labeled data, the task is to predict the label of an unseen sample. Problems are either binary or multiclass.

  • Logistic regression, despite the name, is a classification algorithm: it maps the output of a linear function through the sigmoid function to estimate a probability and assign a class.

  • The sigmoid function

    • A squashing (normalizing) function that maps any continuous value into the range (0, 1); a continuous score becomes a discrete class after thresholding.

    • The mapping: $f(x) = \dfrac{1}{1 + e^{-x}}$

    • from sklearn.linear_model import LogisticRegression
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split

      # Logistic regression needs standardized features
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      clf = LogisticRegression(random_state=2020, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
  • K-nearest-neighbours (KNN) classification

    • Compute the distance (e.g. the Euclidean distance) between the query point and every point in the training data.

    • Take the class labels of the most similar (closest) training samples.

    • Count how often each class appears among the k nearest points.

    • Return the most frequent class among those k points as the predicted class of the query point.

    • from sklearn.neighbors import KNeighborsClassifier
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split

      # Distance-based models need standardized features
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
  • Gaussian naive Bayes classification

    • $P(A \mid B) = \dfrac{P(A, B)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

    • from sklearn.naive_bayes import GaussianNB
      from sklearn.preprocessing import StandardScaler

      # Standardize the features
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      clf = GaussianNB().fit(X_train, y_train)
      
  • Ensemble learning classifiers

    • Bagging: draw m bootstrap samples, train a learner on each, then combine the learners with a voting/averaging strategy.

    • Boosting: train on a weighted training set, update the sample weights according to the learner's error rate, and retrain iteratively.

    • Random forest

    • LightGBM

    • Extremely randomized trees (Extra Trees, ET)

      • Built from multiple decision trees.
      • A random forest is a bagging model (bootstrap samples); extremely randomized trees use all training samples for every tree.
      • A random forest searches for the best split attribute within a random feature subset, whereas extremely randomized trees pick split values completely at random.
  • Model validation metrics

    • Metrics and their sklearn.metrics functions (a usage sketch follows this list):

      | Metric                | Description                              | sklearn.metrics function |
      | --------------------- | ---------------------------------------- | ------------------------ |
      | Accuracy              | fraction of correct predictions          | accuracy_score           |
      | Precision             | fraction of predicted positives correct  | precision_score          |
      | Recall                | fraction of actual positives recovered   | recall_score             |
      | F1                    | F1 score                                 | f1_score                 |
      | Classification Report | per-class precision/recall/F1 report     | classification_report    |
      | Confusion Matrix      | confusion matrix                         | confusion_matrix         |
      | ROC                   | ROC curve                                | roc_curve                |
      | AUC                   | area under the ROC curve                 | auc                      |
    • Precision and recall

      • Imagine a slightly unreliable banknote validator: it should reject counterfeit notes and keep (accept) genuine ones, but it sometimes makes mistakes.
      • Precision: of the notes it keeps, the fraction that are genuine = genuine notes kept / (genuine notes kept + counterfeit notes kept).
      • Recall: of all genuine notes, the fraction that get kept = genuine notes kept / (genuine notes kept + genuine notes wrongly rejected).
    • The F1 score

      • A weighted harmonic mean of precision and recall.
      • $F_a = \dfrac{(a^2 + 1)\,P\,R}{a^2\,P + R}$
      • When a = 1 this becomes the familiar F1 score: $F1 = \dfrac{2PR}{P + R}$
    • Classification report

      • Reports all three metrics (precision, recall, and F1) for each class.
    • Confusion matrix

      |            | Predicted = 1       | Predicted = 0       |
      | ---------- | ------------------- | ------------------- |
      | Actual = 1 | TP (True Positive)  | FN (False Negative) |
      | Actual = 0 | FP (False Positive) | TN (True Negative)  |
    • ROC

      • The x-axis is the FPR (False Positive Rate).
      • The y-axis is the TPR (True Positive Rate).
      • The ideal point is TPR = 1 and FPR = 0; the closer the ROC curve hugs the (0, 1) corner and the farther it lies from the 45-degree diagonal, the better the classifier.
    • AUC

      • The area under the ROC curve.
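
As a quick reference for the metrics listed above, here is a minimal, self-contained sketch using sklearn.metrics on a small made-up binary example (the y_true / y_score values are purely illustrative, not competition data):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix,
                             roc_curve, auc)

# Made-up true labels and predicted positive-class probabilities
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]   # hard labels at a 0.5 threshold

print(accuracy_score(y_true, y_pred))          # accuracy
print(precision_score(y_true, y_pred))         # precision
print(recall_score(y_true, y_pred))            # recall
print(f1_score(y_true, y_pred))                # F1 = 2PR / (P + R)
print(confusion_matrix(y_true, y_pred))        # rows = actual, cols = predicted: [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred))   # precision / recall / F1 per class

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # ROC is computed from the probabilities
print(auc(fpr, tpr))                               # AUC = area under the ROC curve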

1. Setting up cross-validation

# 1. Simple cross-validation with cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5, scoring='f1_macro')
print(scores)

# 2. Split the data with ShuffleSplit
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=2020)
scores = cross_val_score(clf, train, target, cv=cv)

# 3. Split the data with KFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))

# 4. Split the data with StratifiedKFold, which keeps the label proportions equal in every fold (see the sketch after this block)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for k, (train_index, test_index) in enumerate(skf.split(train, target)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
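
To see what "keeps the label proportions equal" means in practice, here is a tiny self-contained check (the y_demo labels below are made up and unrelated to the competition data); every test fold ends up with the same 2:1 class ratio as the full label vector:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 8 + [1] * 4)           # made-up, imbalanced labels
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

skf_demo = StratifiedKFold(n_splits=4)
for k, (tr_idx, te_idx) in enumerate(skf_demo.split(X_demo, y_demo)):
    # np.bincount shows how many samples of each class landed in the test fold
    print(k, np.bincount(y_demo[te_idx]))      # prints [2 1] for every fold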

2. Hyperparameter tuning

from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_jobs=-1)

parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5]
}
clf = GridSearchCV(clf, param_grid=parameters, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)
print(clf.cv_results_)
print(clf.best_params_)
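
Once the search has been fitted, GridSearchCV refits the best parameter combination on the whole training split, so the tuned model can be inspected and evaluated on the held-out data; a minimal continuation of the block above:

print(clf.best_score_)                            # best mean cross-validated score (precision_macro)
print(clf.best_estimator_)                        # the refit RandomForestClassifier
print(clf.best_estimator_.score(X_test, y_test))  # accuracy on the held-out 30% split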

3. Different classification models

# LR (logistic regression) model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)

# KNN model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

# Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# Standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)

# Bagging model
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Random forest model
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Extra Trees model
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)

# AdaBoost model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=100)

# GBDT (gradient boosting) model
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# LightGBM model
import lightgbm
clf = lightgbm
train_matrix = clf.Dataset(X_train, label=y_train)
test_matrix = clf.Dataset(X_test, label=y_test)
params = {
          'boosting_type': 'gbdt',
          'objective': 'multiclass',
          'metric': 'multi_logloss',
          'min_child_weight': 1.5,
          'num_leaves': 2**5,
          'lambda_l2': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'learning_rate': 0.03,
          'seed': 2020,
          "num_class": 2,
          'silent': True,
          }
num_round = 10000
early_stopping_rounds = 100
model = clf.train(params, 
                  train_matrix,
                  num_round,
                  valid_sets=test_matrix,
                  early_stopping_rounds=early_stopping_rounds)
pre = model.predict(X_valid, num_iteration=model.best_iteration)  # X_valid: hold-out features prepared elsewhere

# XGBoost model
import xgboost
clf = xgboost
train_matrix = clf.DMatrix(X_train, label=y_train, missing=-1)
test_matrix = clf.DMatrix(X_test, label=y_test, missing=-1)
z = clf.DMatrix(X_valid, label=y_valid, missing=-1)  # X_valid / y_valid: hold-out set prepared elsewhere
params = {'booster': 'gbtree',
          'objective': 'multi:softprob',
          'eval_metric': 'mlogloss',
          'gamma': 1,
          'min_child_weight': 1.5,
          'max_depth': 5,
          'lambda': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.03,
          'tree_method': 'exact',
          'seed': 2020,
          "num_class": 2
          }

num_round = 10000
early_stopping_rounds = 100
watchlist = [(train_matrix, 'train'),
             (test_matrix, 'eval')]
model = clf.train(params,
                  train_matrix,
                  num_boost_round=num_round,
                  evals=watchlist,
                  early_stopping_rounds=early_stopping_rounds)
pre = model.predict(z,ntree_limit=model.best_ntree_limit)
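
Both boosters above use a two-class multiclass objective, so model.predict returns one probability column per class. Here is a minimal sketch (assuming pre is the prediction array produced by either block above) of pulling out the positive-class probability, which is what the repeat-buyer submission needs:

import numpy as np

pre = np.asarray(pre)                               # shape (n_samples, num_class)
prob_repeat = pre[:, 1] if pre.ndim == 2 else pre   # column 1 is P(label == 1)
print(prob_repeat[:5])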

4. Model fusion (stacking)

# Additional imports needed by the stacking helpers below
import numpy as np
from sklearn.metrics import mean_squared_error, log_loss
from sklearn.linear_model import Lasso, LinearRegression, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, ExtraTreesRegressor)

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf):
    valid_y_pre = np.zeros((train_y.shape[0],1))           # out-of-fold predictions on the training set
    test = np.zeros((test_x.shape[0],1))                   # averaged predictions on the test set
    test_y_pre_k = np.empty((splits,test_x.shape[0],1))    # per-fold test predictions; `splits` is set globally in section 5
    
    cv_scores = []
    for i ,(train_idx,test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        if clf_name in ['rf','ada','gb','et','lr','en','ls','svr','kr1']:
            clf.fit(tr_x,tr_y)
            te_y_pre = clf.predict(te_x).reshape(-1,1)
            valid_y_pre[test_idx] = te_y_pre
            test_y_pre_k[i,:] = clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y,te_y_pre))
        elif clf_name in ['xgb']:
            train_matrix = clf.DMatrix(tr_x,label=tr_y,missing=-1)
            test_matrix = clf.DMatrix(te_x,label=te_y,missing=-1)
            z = clf.DMatrix(test_x,missing=-1)
            params = {
                'booster': 'gbtree',        # tree-based booster
                'eval_metric': 'rmse',
                'gamma': 0.1,               # minimum loss reduction required to split a node; larger = more conservative
                'min_child_weight': 1,      # minimum sum of instance weight in a child; below this, stop splitting; larger = more conservative
                'max_depth': 8,
                'lambda': 3,                # L2 regularization coefficient
                'subsample': 0.8,           # row subsampling ratio per tree; smaller = more conservative
                'colsample_bytree': 0.8,    # fraction of columns sampled per tree
                'colsample_bylevel': 0.8,   # fraction of columns sampled at each split level
                'eta': 0.03,                # step-size shrinkage applied to feature weights
                'tree_method': 'auto',      # tree construction algorithm
                'seed': 2020,
                'nthread': 8                # number of threads
            }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix,'train'),(test_matrix,'eval')]
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_boost_round = num_round,
                                  evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(test_matrix,
                                         ntree_limit=model.best_ntree_limit).reshape(-1,1)
                valid_y_pre[test_idx] = te_y_pre
                
                test_y_pre_k[i,:] = model.predict(z,
                                                ntree_limit=model.best_ntree_limit).reshape(-1,1)
                
                cv_scores.append(mean_squared_error(te_y,te_y_pre))
        elif clf_name in ['lgb']:
            train_matrix = clf.Dataset(tr_x,label=tr_y)
            test_matrix = clf.Dataset(te_x,label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression_l2',
                'min_child_weight': 1,
                'metric': 'mse',
                'num_leaves': 31,
                'lambda_l2': 3,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'learning_rate': 0.01,
                'seed': 2020,
                'nthread': -1,
                'silent': True, # suppress verbose output
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_round,
                                  valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(te_x,
                                         num_iteration = model.best_iteration).reshape(-1,1)
                valid_y_pre[test_idx] = te_y_pre
                
                test_y_pre_k[i,:] = model.predict(test_x,
                                        num_iteration = model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y,te_y_pre))
        else:
            raise ValueError("please add new clf")
        print("%s now score is:" % clf_name, cv_scores)
        
    test[:] = test_y_pre_k.mean(axis=0)
    print("%s_score_list:" % clf_name,cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    #print('{}_mean_squared_error: {}'.format(clf_name,mean_squared_error(y_valid,test)))
    #opt_models_new[clf_name] = mean_squared_error(y_valid,test)
    return valid_y_pre.reshape(-1,1),test.reshape(-1,1)

def ls_reg(x_train, y_train, x_valid, kf):
    ls_reg = Lasso(alpha=0.0005)
    ls_train,ls_test = stacking_reg(ls_reg,x_train,y_train,x_valid,'ls',kf)   
    return ls_train,ls_test,'ls_reg'

def svr_reg(x_train, y_train, x_valid, kf):
    svr_reg = SVR(kernel='linear')
    svr_train,svr_test = stacking_reg(svr_reg,x_train,y_train,x_valid,'svr',kf)   
    return svr_train,svr_test,'svr_reg'

def lr_reg(x_train, y_train, x_valid, kf):
    lr_reg = LinearRegression(n_jobs=-1)
    lr_train,lr_test = stacking_reg(lr_reg,x_train,y_train,x_valid,'lr',kf)   
    return lr_train,lr_test,'lr_reg'

def en_reg(x_train, y_train, x_valid, kf):
    en_reg = ElasticNet(alpha=0.0005, l1_ratio=.9, )
    en_train,en_test = stacking_reg(en_reg,x_train,y_train,x_valid,'en',kf)   
    return en_train,en_test,'en_reg'

def gb_reg(x_train, y_train, x_valid, kf):
    gbdt = GradientBoostingRegressor(
                                     n_estimators=250,
                                     random_state=2020,
                                     max_features='auto',
                                     verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt,x_train,y_train,
                                         x_valid,"gb",kf)
    return gbdt_train, gbdt_test, "gb_reg"

def rf_reg(x_train, y_train, x_valid, kf):
    randomforest = RandomForestRegressor(
                                         n_estimators=350,
                                         n_jobs=-1,
                                         random_state=2020,
                                         max_features='auto',
                                         verbose=1)
    rf_train, rf_test = stacking_reg(randomforest,x_train,y_train,
                                     x_valid,"rf",kf)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf):
    adaboost = AdaBoostRegressor(n_estimators=800,
                                 random_state=2020,
                                 learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost,x_train,y_train,
                                       x_valid,"ada",kf)
    return ada_train, ada_test, "ada_reg"

def et_reg(x_train, y_train, x_valid, kf):
    extratree = ExtraTreesRegressor(n_estimators=600,
                                    max_depth=32,
                                    max_features="auto",
                                    n_jobs=-1,
                                    random_state=2020,
                                    verbose=1)
    et_train, et_test = stacking_reg(extratree,x_train,y_train,
                                     x_valid,"et",kf)
    return et_train, et_test, "et_reg"

def xgb_reg(x_train, y_train, x_valid, kf):
    xgb_train, xgb_test = stacking_reg(xgboost,x_train,y_train,
                                       x_valid,"xgb",kf)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, ):
    lgb_train, lgb_test = stacking_reg(lightgbm,x_train,y_train,
                                       x_valid,"lgb",kf)
    return lgb_train, lgb_test, "lgb_reg"

def stacking_clf(clf, train_x, train_y, test_x, clf_name, kf):
    valid_y_pre = np.zeros((train_y.shape[0],1))           # out-of-fold predictions on the training set
    test = np.zeros((test_x.shape[0],1))                   # averaged predictions on the test set
    test_y_pre_k = np.empty((splits,test_x.shape[0],1))    # per-fold test predictions
    
    cv_scores = []
    for i ,(train_idx,test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        if clf_name in ["rf","ada","gb","et","lr","knn","gnb"]:
            clf.fit(tr_x,tr_y)
            te_y_pre = clf.predict(te_x).reshape(-1,1)
            valid_y_pre[test_idx] = te_y_pre
            test_y_pre_k[i,:] = clf.predict(test_x).reshape(-1,1)
            cv_scores.append(log_loss(te_y,te_y_pre))
        elif clf_name in ['xgb']:
            train_matrix = clf.DMatrix(tr_x,label=tr_y,missing=-1)
            test_matrix = clf.DMatrix(te_x,label=te_y,missing=-1)
            z = clf.DMatrix(test_x,missing=-1)
            params = {
                'booster': 'gbtree',        # tree-based booster
                'objective': 'multi:softprob',
                'eval_metric': 'mlogloss',
                'num_class': 2,
                'gamma': 1,                 # minimum loss reduction required to split a node; larger = more conservative
                'min_child_weight': 1.5,    # minimum sum of instance weight in a child; below this, stop splitting; larger = more conservative
                'max_depth': 8,
                'lambda': 5,                # L2 regularization coefficient
                'subsample': 0.8,           # row subsampling ratio per tree; smaller = more conservative
                'colsample_bytree': 0.8,    # fraction of columns sampled per tree
                'colsample_bylevel': 0.8,   # fraction of columns sampled at each split level
                'eta': 0.03,                # step-size shrinkage applied to feature weights
                'tree_method': 'exact',     # tree construction algorithm
                'seed': 2020,
                'nthread': -1               # number of threads
            }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix,'train'),(test_matrix,'eval')]
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_boost_round = num_round,
                                  evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(test_matrix,
                                         ntree_limit=model.best_ntree_limit)
                # keep the probability of the positive class (label 1)
                valid_y_pre[test_idx] = te_y_pre[:,1].reshape(-1,1)
                
                test_y_pre_k[i,:] = model.predict(z,
                                                ntree_limit=model.best_ntree_limit)[:,1].reshape(-1,1)
                
                cv_scores.append(log_loss(te_y,valid_y_pre[test_idx]))
        elif clf_name in ['lgb']:
            train_matrix = clf.Dataset(tr_x,label=tr_y)
            test_matrix = clf.Dataset(te_x,label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class':2,
                'metric': 'multi_logloss',
                'min_child_weight': 1.5,
                'num_leaves': 32,
                'lambda_l2': 5,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'learning_rate': 0.01,
                'seed': 2020,
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_round,
                                  valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(te_x,
                                         num_iteration = model.best_iteration)
                # keep the probability of the positive class (label 1)
                valid_y_pre[test_idx] = te_y_pre[:,1].reshape(-1,1)
                
                test_y_pre_k[i,:] = model.predict(test_x,
                                        num_iteration = model.best_iteration)[:,1].reshape(-1,1)
                cv_scores.append(log_loss(te_y,valid_y_pre[test_idx]))
        else:
            raise ValueError("please add new clf")
        print("%s now score is:" % clf_name, cv_scores)
        
    test[:] = test_y_pre_k.mean(axis=0)
    print("%s_score_list:" % clf_name,cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    #print('{}_mean_squared_error: {}'.format(clf_name,mean_squared_error(y_valid,test)))
    #opt_models_new[clf_name] = mean_squared_error(y_valid,test)
    return valid_y_pre.reshape(-1,1),test.reshape(-1,1)

def rf_clf(x_train, y_train, x_valid, kf):
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1, random_state=2020, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf)
    return rf_train, rf_test,"rf"

def ada_clf(x_train, y_train, x_valid, kf):
    adaboost = AdaBoostClassifier(n_estimators=250, random_state=2020, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf)
    return ada_train, ada_test,"ada"

def gb_clf(x_train, y_train, x_valid, kf):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=300, subsample=0.8, random_state=2020,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf)
    return gbdt_train, gbdt_test,"gb"

def et_clf(x_train, y_train, x_valid, kf):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017,verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf)
    return et_train, et_test,"et"

def xgb_clf(x_train, y_train, x_valid, kf):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf)
    return xgb_train, xgb_test,"xgb"

def lgb_clf(x_train, y_train, x_valid, kf):
    lgb_train, lgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf)
    return lgb_train, lgb_test,"lgb"

def gnb_clf(x_train, y_train, x_valid, kf):
    gnb=GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf)
    return gnb_train, gnb_test,"gnb"

def lr_clf(x_train, y_train, x_valid, kf):
    logisticregression = LogisticRegression(n_jobs=-1,random_state=2020, solver='lbfgs', multi_class='multinomial')
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf):
    kneighbors=KNeighborsClassifier(n_neighbors=150,n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "knn", kf)
    return knn_train, knn_test, "knn"

5. Training and validation

from sklearn.model_selection import KFold, StratifiedKFold

splits = 3
kf = KFold(n_splits=3, shuffle=True, random_state=0)
sk = StratifiedKFold(n_splits=splits, shuffle=True, random_state=2020)
# clf_name = [rf_clf,ada_clf,lr_clf,gb_clf,et_clf,xgb_clf,lgb_clf,
#            lr_reg,en_reg,et_reg,ls_reg,rf_reg,gb_reg,xgb_reg,lgb_reg]
clf_name = [lgb_reg, xgb_reg]
test_pre_k = np.empty((len(clf_name), test.shape[0], 1))

def model_bagging(X_train, y_train, test, clf_name):
    # Run every base model, collect its test predictions, and average them
    for k, clf in enumerate(clf_name):
        tmp_train, tmp_test, name = clf(X_train, y_train, test, sk)
        test_pre_k[k, :] = tmp_test
    test_pre = test_pre_k.mean(axis=0)
    return test_pre

test_pre = model_bagging(X, y, test, clf_name)
# print('bagging_mean_squared_error: %s' % mean_squared_error(Y_valid, test_pre))
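
model_bagging only averages the test-set predictions of the base models. The out-of-fold training predictions each wrapper also returns (tmp_train above) can feed a second-level learner instead. Below is a minimal stacking sketch under the same assumptions (X, y, test and the wrappers defined above); stacking_level2 is a hypothetical helper name, and LogisticRegression is used here only as an example meta-model:

from sklearn.linear_model import LogisticRegression

def stacking_level2(X_train, y_train, test, clf_name):
    # Collect out-of-fold train predictions and averaged test predictions from every base model
    train_feats, test_feats = [], []
    for clf in clf_name:
        tmp_train, tmp_test, name = clf(X_train, y_train, test, sk)
        train_feats.append(tmp_train)
        test_feats.append(tmp_test)
    stack_train = np.hstack(train_feats)   # shape: (n_train_samples, n_base_models)
    stack_test = np.hstack(test_feats)     # shape: (n_test_samples, n_base_models)

    # Fit a simple meta-model on the stacked out-of-fold predictions
    meta = LogisticRegression(solver='lbfgs')
    meta.fit(stack_train, np.ravel(y_train))
    return meta.predict_proba(stack_test)[:, 1]   # probability of a repeat purchase

# Example usage with the classifier wrappers:
# test_prob = stacking_level2(X, y, test, [lgb_clf, xgb_clf])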

6. Preliminary results

(Screenshot of the preliminary results omitted.)