
Kaggle -- Digit Recognizer (implemented in Python)

Using XGBoost with some simple parameter tuning, the accuracy reaches about 97%.

The leaderboard ranking is low, but don't mind those details; I'm a beginner just getting started.

1. Environment Setup

This uses the Anaconda distribution (Python 3.6).
First, download the xgboost wheel file from https://www.lfd.uci.edu/~gohlke/pythonlibs/
Choose the wheel that matches your Python version and architecture.
Open cmd, cd into the folder containing the file, and run pip install xgboost-0.7-cp36-cp36m-win_amd64.whl.
Then run import xgboost in a Python shell to verify that the installation succeeded.
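
For example, a quick check from a Python shell (the expected version string assumes the 0.7 wheel above):

import xgboost as xgb
print(xgb.__version__)  # should print '0.7' for the wheel installed above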

2. Code Implementation

# -*- coding: utf-8 -*-
import xgboost as xgb
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

def createDataSet():
    trainDataSet = pd.read_csv('train.csv')
    testDataSet = pd.read_csv('test.csv')
    trainDataSet = np.array(trainDataSet)
    testDataSet = np.array(testDataSet)
    trainData = trainDataSet[:, 1:]   # all pixel columns
    trainLabels = trainDataSet[:, 0]  # the first column is the digit label
    testData = testDataSet
    return trainData, trainLabels, testData

def getPredict(datas, labels):
    x_train, x_test, y_train, y_test = train_test_split(datas, labels, test_size = 0.1)
    param = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',  # multi-class classification
        'num_class': 10,               # number of classes; used together with multi:softmax
        'gamma': 0.1,                  # controls post-pruning; larger is more conservative, usually 0.1 or 0.2
        'max_depth': 12,               # tree depth; deeper trees overfit more easily
        'lambda': 2,                   # L2 regularization on weights; larger values make overfitting less likely
        'subsample': 0.8,              # row subsampling of the training instances
        'colsample_bytree': 0.7,       # column subsampling when growing each tree
        'min_child_weight': 5,         # minimum sum of instance hessians (h) in a leaf (default 1);
                                       # for 0-1 classification with h around 0.01, a value of 1 means a leaf
                                       # needs roughly 100 samples; smaller values overfit more easily
        'silent': False,               # True suppresses all output
        'learning_rate': 0.05,
        'seed': 1000
    }
    xgb_train = xgb.DMatrix(data = x_train, label = y_train)
    xgb_val = xgb.DMatrix(data = x_test, label = y_test)
    xgb_test = xgb.DMatrix(x_test)
    watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
    model = xgb.train(params = param,
                      dtrain = xgb_train,
                      num_boost_round = 5000,       # maximum number of boosting rounds
                      evals = watchlist,
                      early_stopping_rounds = 100)  # stop once the val score has not improved for 100 rounds
    print('best best_ntree_limit:', model.best_ntree_limit)
    model.save_model('1.model')  # save the model so it can be reloaded below
    y_pred = model.predict(xgb_test)
    print(accuracy_score(y_test, y_pred))

trainData, trainLabels, testData = createDataSet()
getPredict(trainData, trainLabels)

xgbPredict = xgb.DMatrix(testData)
# reload the trained model
model = xgb.Booster()
model.load_model('1.model')
y_pred = model.predict(xgbPredict)
print(y_pred)

# write the submission file
f = open('submission.csv', 'w', encoding = 'utf-8')
f.write('ImageId,Label\n')
for i in range(len(y_pred)):
    f.write(str(i + 1) + ',' + str(int(y_pred[i])) + '\n')
f.close()
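
One detail worth noting: the script prints model.best_ntree_limit but never uses it. When training stops early, the booster still contains every tree built up to that point, so predictions are usually restricted to the best iteration. A minimal sketch, reusing model and xgb_test from getPredict above:

# predict with only the trees up to the best early-stopping iteration
y_pred = model.predict(xgb_test, ntree_limit = model.best_ntree_limit)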

3. Parameter Tuning Notes

Common XGBoost parameters (a cross-validation tuning sketch follows the objective list below):
max_depth: maximum depth of a tree. Default 6; typical values are 3-10.
eta: shrinkage step size used in updates to prevent overfitting; after each boosting step, the weights of newly added features are scaled by eta. Typically set in [0.01, 0.2].
silent: 0 prints runtime messages, 1 runs silently. Default 0; 0 is recommended, since the output helps with understanding the model and with tuning. In practice, setting it to 1 often still does not fully silence the output.
objective: default reg:linear. Defines the learning task and the corresponding objective. The available objectives include:
“reg:linear” – linear regression.
“reg:logistic” – logistic regression.
“binary:logistic” – logistic regression for binary classification; outputs probabilities.
“binary:logitraw” – logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
“count:poisson” – Poisson regression for count data; outputs the mean of a Poisson distribution (in Poisson regression, max_delta_step defaults to 0.7 to safeguard optimization).
“multi:softmax” – multi-class classification using the softmax objective; also requires num_class (the number of classes).
“multi:softprob” – same as softmax, but outputs an ndata * nclass vector that can be reshaped into an ndata-by-nclass matrix; each row holds the probability of that sample belonging to each class.
“rank:pairwise” – set XGBoost to do ranking by minimizing the pairwise loss.
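
To make these notes concrete, below is a minimal cross-validation sketch using xgb.cv; the max_depth/eta grids and round counts are illustrative assumptions, and xgb_train is the DMatrix built in Section 2:

# a minimal tuning sketch: cross-validate a few max_depth / eta
# combinations with 5-fold CV (illustrative values, not the settings above)
for max_depth in (6, 8, 12):
    for eta in (0.05, 0.1):
        params = {'objective': 'multi:softmax', 'num_class': 10,
                  'max_depth': max_depth, 'eta': eta, 'seed': 1000}
        cv = xgb.cv(params, xgb_train,
                    num_boost_round = 500,
                    nfold = 5,
                    metrics = 'merror',            # multi-class classification error rate
                    early_stopping_rounds = 50,
                    seed = 1000)
        print(max_depth, eta, cv['test-merror-mean'].min())  # best mean CV error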

4. Keras CNN Implementation

Accuracy reaches 0.99585.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential 
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau


# load the Kaggle data and split features from labels
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

target = train['label']
train = train.drop(['label'], axis = 1)

# scale pixel values from [0, 255] to [0, 1]
train = train / 255.0
test = test / 255.0

# reshape the flat 784-pixel rows into 28x28x1 grayscale images
train = train.values
test = test.values
train = train.reshape(-1, 28, 28, 1)
test = test.reshape(-1, 28, 28, 1)

# one-hot encode the digit labels
target = to_categorical(target, num_classes = 10)

x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.2, random_state = 2018)

model = Sequential()

# block 1: two 5x5 convolutions, then 2x2 max pooling and dropout
model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', 
                 activation ='relu', input_shape = (28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

# block 2: two 3x3 convolutions, then max pooling and dropout
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

# block 3: two 3x3 convolutions, then max pooling and dropout
model.add(Conv2D(filters = 128, kernel_size = (3,3), padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 128, kernel_size = (3,3), padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

# classifier head: flatten, one dense layer, dropout, softmax over the 10 digits
model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))

# RMSprop optimizer; the callback below lowers the learning rate on plateaus
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
model.compile(optimizer = optimizer, loss = "categorical_crossentropy", metrics=["accuracy"])

epochs = 30
batch_size = 86

# halve the learning rate when val accuracy has not improved for 3 epochs
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

# data augmentation: small random rotations, zooms, and shifts; flips stay
# disabled because mirrored digits are no longer valid digits
img = ImageDataGenerator(featurewise_center=False, 
                        samplewise_center=False, 
                        featurewise_std_normalization=False,  
                        samplewise_std_normalization=False,  
                        zca_whitening=False,  
                        rotation_range=10,  
                        zoom_range = 0.1,  
                        width_shift_range=0.1,  
                        height_shift_range=0.1,  
                        horizontal_flip=False,  
                        vertical_flip=False)

img.fit(x_train)


# train on augmented batches while validating on the untouched hold-out split
model.fit_generator(img.flow(x_train, y_train, batch_size = batch_size),
                    validation_data = (x_val, y_val),
                    steps_per_epoch = x_train.shape[0] // batch_size, 
                    epochs = epochs,
                    callbacks = [learning_rate_reduction])

# predict class probabilities for the test set
results = model.predict(test)

print(results)
# take the most likely class for each image
results = np.argmax(results, axis = 1)
print(results)

# build the submission file; ImageId starts at 1
results = pd.Series(results, name="Label")
submission = pd.concat([pd.Series(range(1,28001), name = "ImageId"), results], axis = 1)

submission.to_csv("submission/cnn_mnist.csv", index = False)
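
Beyond the leaderboard score, the held-out validation split gives a quick sanity check. A minimal sketch, reusing model, x_val, and y_val from the script above, that prints the confusion matrix of the validation predictions:

from sklearn.metrics import confusion_matrix

# compare predicted classes against the true classes on the validation split
val_pred = np.argmax(model.predict(x_val), axis = 1)
val_true = np.argmax(y_val, axis = 1)
print(confusion_matrix(val_true, val_pred))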
