Kaggle Starter Project: Titanic Survival Prediction (5) Validation and Implementation
Original Kaggle competition: https://www.kaggle.com/c/titanic
Original kernel: A Data Science Framework: To Achieve 99% Accuracy
First, we draw a heatmap of the Pearson correlation coefficients. The Pearson coefficient is a simple, concise measure of linear correlation between two variables; see any statistics reference for the details.
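The original kernel wraps this plot in a small seaborn helper. A minimal sketch along the same lines, assuming data1 is the cleaned training DataFrame used throughout this series:

import seaborn as sns
import matplotlib.pyplot as plt

def correlation_heatmap(df):
    # Pearson correlation of every pair of numeric columns; values lie in [-1, 1]
    corr = df.corr()
    _, ax = plt.subplots(figsize=(14, 12))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
                square=True, linewidths=0.1, ax=ax)
    ax.set_title('Pearson Correlation of Features')
    plt.show()

correlation_heatmap(data1)  # data1: the training set prepared in earlier parts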
Now we finally get to fitting the final model. We use a voting scheme: first, build a list of estimators that contains every algorithm we want to combine:
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc', ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
    ('gpc', gaussian_process.GaussianProcessClassifier()),

    #GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    ('lr', linear_model.LogisticRegressionCV()),

    #Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),

    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', neighbors.KNeighborsClassifier()),

    #SVM: http://scikit-learn.org/stable/modules/svm.html
    ('svc', svm.SVC(probability=True)),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier())
]
Here comes the important part: we pass this list to VotingClassifier(). Voting comes in two flavors, hard and soft. Hard voting takes a majority vote over the predicted class labels, while soft voting averages the predicted class probabilities and picks the class with the highest average.
#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est, voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv = cv_split)
vote_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting Training w/bin score mean: {:.2f}".format(vote_hard_cv['train_score'].mean()*100))
print("Hard Voting Test w/bin score mean: {:.2f}".format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}".format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)

#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est, voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv = cv_split)
vote_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting Training w/bin score mean: {:.2f}".format(vote_soft_cv['train_score'].mean()*100))
print("Soft Voting Test w/bin score mean: {:.2f}".format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}".format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)
Here are the results for the two voting schemes:
Next comes the extremely time- and compute-hungry step: tuning the hyperparameters of every algorithm with GridSearchCV. We define the candidate values for each algorithm's parameters (collected in one list), loop over the algorithms, run a grid search on each one, and finally print the best parameters and the runtime for every algorithm.
#WARNING: Running is very computationally intensive and time expensive.
#Code is written for experimental/developmental purposes and not production ready!

#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]

grid_param = [
    [{
        #AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
        'n_estimators': grid_n_estimator, #default=50
        'learning_rate': grid_learn, #default=1
        #'algorithm': ['SAMME', 'SAMME.R'], #default='SAMME.R'
        'random_state': grid_seed
    }],

    [{
        #BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
        'n_estimators': grid_n_estimator, #default=10
        'max_samples': grid_ratio, #default=1.0
        'random_state': grid_seed
    }],

    [{
        #ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
        'n_estimators': grid_n_estimator, #default=10
        'criterion': grid_criterion, #default="gini"
        'max_depth': grid_max_depth, #default=None
        'random_state': grid_seed
    }],

    [{
        #GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
        #'loss': ['deviance', 'exponential'], #default='deviance'
        'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- best found: {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
        'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime
        #'criterion': ['friedman_mse', 'mse', 'mae'], #default="friedman_mse"
        'max_depth': grid_max_depth, #default=3
        'random_state': grid_seed
    }],

    [{
        #RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
        'n_estimators': grid_n_estimator, #default=10
        'criterion': grid_criterion, #default="gini"
        'max_depth': grid_max_depth, #default=None
        'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- best found: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
        'random_state': grid_seed
    }],

    [{
        #GaussianProcessClassifier
        'max_iter_predict': grid_n_estimator, #default: 100
        'random_state': grid_seed
    }],

    [{
        #LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
        'fit_intercept': grid_bool, #default: True
        #'penalty': ['l1','l2'],
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
        'random_state': grid_seed
    }],

    [{
        #BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
        'alpha': grid_ratio, #default: 1.0
    }],

    #GaussianNB -
    [{}],

    [{
        #KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
        'n_neighbors': [1,2,3,4,5,6,7], #default: 5
        'weights': ['uniform', 'distance'], #default: 'uniform'
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }],

    [{
        #SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
        #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
        #'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': [1,2,3,4,5], #default=1.0
        'gamma': grid_ratio, #default: auto
        'decision_function_shape': ['ovo', 'ovr'], #default: ovr
        'probability': [True],
        'random_state': grid_seed
    }],

    [{
        #XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
        'learning_rate': grid_learn, #default: .3
        'max_depth': [1,2,4,6,8,10], #default: 2
        'n_estimators': grid_n_estimator,
        'seed': grid_seed
    }]
]

start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip(vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip

    #print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
    #print(param)

    start = time.perf_counter()
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param)

run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
print('-'*10)
As you can see, grid search is very time-consuming, and this is with only about a thousand rows. My i5-8250U ran for several minutes, and the author's runtime (shown in the screenshots) was somewhat longer than mine. With a dataset of millions or tens of millions of rows you would clearly need far more compute power and time.
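One hedged note that is not part of the original kernel: GridSearchCV accepts an n_jobs parameter, and setting n_jobs=-1 runs the candidate fits on all available CPU cores, which usually shortens wall-clock time on a multi-core machine:

# same clf/param loop as above, only parallelised across cores
best_search = model_selection.GridSearchCV(
    estimator=clf[1], param_grid=param,
    cv=cv_split, scoring='roc_auc',
    n_jobs=-1)  # -1 = use every available core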
Then, with each algorithm's best parameters in hand, we apply the voting classifiers again (see the sketch below). The results follow.
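The submission code further down relies on grid_hard (and optionally grid_soft), which are not shown in the snippets above. Roughly how the original kernel builds them, as a sketch: set_params() in the tuning loop has already written the best parameters back into each estimator in vote_est, so we simply build new voting classifiers from the same list.

#Hard Vote with tuned models
grid_hard = ensemble.VotingClassifier(estimators = vote_est, voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv = cv_split)
grid_hard.fit(data1[data1_x_bin], data1[Target])
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}".format(grid_hard_cv['test_score'].mean()*100))

#Soft Vote with tuned models
grid_soft = ensemble.VotingClassifier(estimators = vote_est, voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv = cv_split)
grid_soft.fit(data1[data1_x_bin], data1[Target])
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}".format(grid_soft_cv['test_score'].mean()*100))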
Finally, it is time to produce the output. The author provides the final prediction code, together with the submission score, for every method we covered earlier:
#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)


#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)


#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])


#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators': grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters: {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])


#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])


#random forest w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])


#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])


#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters: {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])


#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])


#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned= 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])

#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])


#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("../working/submit.csv", index=False)

print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)
I won't paste a screenshot of the final printout here; it has little bearing on the result we actually need, being nothing more than count summaries of the individual features.
A file named 'submit.csv' has been generated in the working directory; submit it to Kaggle directly.
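Before uploading, a quick sanity check (not part of the original kernel, just a hedged suggestion): the Titanic test set contains 418 passengers, so the submission file should have 418 rows and exactly the two columns PassengerId and Survived.

import pandas as pd

check = pd.read_csv("../working/submit.csv")
assert list(check.columns) == ['PassengerId', 'Survived']  # required column names
assert len(check) == 418                                   # one row per test-set passenger
assert check['Survived'].isin([0, 1]).all()                # predictions must be 0 or 1
print(check.head())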
Step 7: Optimize and Strategize
In the end we are surprised to find that, despite using so many algorithms and doing so much parameter tuning, the final accuracy is almost identical to the decision tree we derived by hand! The author's closing point, however, is that running a handful of algorithms on this single dataset does not support any reliable conclusions. We should keep the following points in mind:
1. The training set and the test set do not follow the same distribution, so there can indeed be a large gap between our own CV score and the accuracy of a submission on Kaggle.
2. On the same dataset, the decision-tree-based algorithms seem to converge to the same accuracy once their parameters are tuned.
3. Setting parameter tuning aside, no MLA managed to beat the handmade algorithm.
So, to achieve better results, we still need to invest in more data processing and feature engineering.
The end.