Kaggle -- Digit Recognizer (Python implementation)
Using xgboost with some simple parameter tuning, the accuracy reaches about 97%.
My ranking is low, so please don't mind the details; I'm a beginner just getting started.
1. Environment setup
This uses Anaconda (Python 3.6).
First, download the xgboost .whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/,
choosing the build that matches your Python version.
Open cmd, cd into the folder containing the file, and run pip install xgboost-0.7-cp36-cp36m-win_amd64.whl.
Run import xgboost to verify that the installation succeeded.
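A quick sanity check from a Python shell (the expected version string assumes the 0.7 wheel above):
# verify the installation and print the installed version
import xgboost as xgb
print(xgb.__version__)  # should print 0.7 for the wheel above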
2. Code implementation
# -*- coding: utf-8 -*-
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
def createDataSet():
    # read the Kaggle train/test CSVs
    trainDataSet = pd.read_csv('train.csv')
    testDataSet = pd.read_csv('test.csv')
    trainDataSet = np.array(trainDataSet)
    testDataSet = np.array(testDataSet)
    trainData = trainDataSet[:, 1:]   # all pixel columns
    trainLabels = trainDataSet[:, 0]  # first column is the label
    testData = testDataSet
    return trainData, trainLabels, testData
def getPredict(datas, labels):
    x_train, x_test, y_train, y_test = train_test_split(datas, labels, test_size = 0.1)
    param = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',  # multi-class classification
        'num_class': 10,               # number of classes, used together with multi:softmax
        'gamma': 0.1,                  # controls post-pruning; larger is more conservative, typically 0.1-0.2
        'max_depth': 12,               # tree depth; deeper trees overfit more easily
        'lambda': 2,                   # L2 regularization on the weights; larger values make the model less prone to overfitting
        'subsample': 0.8,              # random row sampling of the training instances
        'colsample_bytree': 0.7,       # column sampling when building each tree
        'min_child_weight': 5,
        # Defaults to 1: the minimum sum of the hessian h in a leaf. For an imbalanced 0-1
        # classification task where h is around 0.01, min_child_weight = 1 means a leaf needs
        # at least 100 samples. This parameter strongly affects the result: the smaller it is,
        # the easier it is to overfit.
        'silent': False,               # set to True to suppress runtime messages
        'learning_rate': 0.05,
        'seed': 1000
    }
    xgb_train = xgb.DMatrix(data = x_train, label = y_train)
    xgb_val = xgb.DMatrix(data = x_test, label = y_test)
    xgb_test = xgb.DMatrix(x_test)
    watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
    model = xgb.train(params = param,
                      dtrain = xgb_train,
                      num_boost_round = 5000,          # maximum number of boosting rounds
                      evals = watchlist,
                      early_stopping_rounds = 100)     # stop when the validation score has not improved for 100 rounds
    print('best_ntree_limit:', model.best_ntree_limit)
    # save the model so it can be reloaded below
    model.save_model('1.model')
    y_pred = model.predict(xgb_test)
    print(accuracy_score(y_test, y_pred))
trainData, trainLabels, testData = createDataSet()
getPredict(trainData, trainLabels)
xgbPredict = xgb.DMatrix(testData)
# after training, reload the saved model
model = xgb.Booster()
model.load_model('1.model')
y_pred = model.predict(xgbPredict)
print(y_pred)
# write the submission file
f = open('submission.csv', 'w', encoding = 'utf-8')
f.write('ImageId,Label\n')
for i in range(len(y_pred)):
    f.write(str(i + 1) + ',' + str(int(y_pred[i])) + '\n')
f.close()
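Because training uses early stopping, prediction can sometimes be improved by using only the trees up to the best validation round. A minimal sketch, assuming the 0.7-era ntree_limit argument of Booster.predict and a Booster that is still in memory (best_ntree_limit is not preserved when reloading from '1.model'):
# use only the boosting rounds up to the best validation score
y_pred = model.predict(xgbPredict, ntree_limit = model.best_ntree_limit)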
3. Parameter tuning notes
Commonly set xgboost parameters (a small grid-search sketch follows this list):
max_depth: maximum depth of a tree; default 6, typical values 3-10.
eta: shrinkage step used in updates to prevent overfitting; after each boosting step the weights of new features are shrunk. Typically set in [0.01, 0.2].
silent: 0 prints runtime messages, 1 runs silently without printing them; default 0. 0 is recommended, since the output during training helps with understanding the model and with tuning. In practice, setting it to 1 often fails to silence the output anyway.
objective: default reg:linear; defines the learning task and the corresponding objective. The options are:
"reg:linear" – linear regression.
"reg:logistic" – logistic regression.
"binary:logistic" – binary logistic regression; the output is a probability.
"binary:logitraw" – binary logistic regression; the output is the raw score wTx.
"count:poisson" – Poisson regression for count data; the output is the mean of a Poisson distribution (in this mode, max_delta_step defaults to 0.7 to safeguard optimization).
"multi:softmax" – multi-class classification with the softmax objective; requires setting num_class (the number of classes).
"multi:softprob" – same as softmax, but outputs a vector of length ndata * nclass that can be reshaped into an ndata-by-nclass matrix; each row gives the probability of the sample belonging to each class.
"rank:pairwise" – set XGBoost to do ranking by minimizing the pairwise loss.
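To search over these parameters instead of fixing them by hand, one option is a small grid search through the sklearn wrapper. A minimal sketch, assuming the XGBClassifier wrapper is available; the candidate values and the data subset are illustrative, not the settings used above:
# hypothetical grid search over two of the parameters described above
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [6, 9, 12],       # illustrative candidate values
    'learning_rate': [0.05, 0.1],
}
grid = GridSearchCV(XGBClassifier(), param_grid, scoring = 'accuracy', cv = 3)
grid.fit(trainData[:5000], trainLabels[:5000])  # subset keeps the search fast
print(grid.best_params_, grid.best_score_)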
Keras CNN implementation:
Accuracy reaches 0.99585.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.utils.np_utils import to_categorical
import matplotlib.pyplot as plt
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
target = train['label']
train = train.drop(['label'], axis = 1)
# scale pixel values to [0, 1]
train = train / 255.0
test = test / 255.0
train = train.values
test = test.values
# reshape flat 784-pixel rows into 28x28 single-channel images
train = train.reshape(-1, 28, 28, 1)
test = test.reshape(-1, 28, 28, 1)
# one-hot encode the labels
target = to_categorical(target, num_classes = 10)
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.2, random_state = 2018)
model = Sequential()
# block 1: two 5x5 conv layers with 32 filters
model.add(Conv2D(filters = 32, kernel_size = (5, 5), padding = 'Same',
                 activation = 'relu', input_shape = (28, 28, 1)))
model.add(Conv2D(filters = 32, kernel_size = (5, 5), padding = 'Same',
                 activation = 'relu'))
model.add(MaxPool2D(pool_size = (2, 2)))
model.add(Dropout(0.25))
# block 2: two 3x3 conv layers with 64 filters
model.add(Conv2D(filters = 64, kernel_size = (3, 3), padding = 'Same',
                 activation = 'relu'))
model.add(Conv2D(filters = 64, kernel_size = (3, 3), padding = 'Same',
                 activation = 'relu'))
model.add(MaxPool2D(pool_size = (2, 2), strides = (2, 2)))
model.add(Dropout(0.25))
# block 3: two 3x3 conv layers with 128 filters
model.add(Conv2D(filters = 128, kernel_size = (3, 3), padding = 'Same',
                 activation = 'relu'))
model.add(Conv2D(filters = 128, kernel_size = (3, 3), padding = 'Same',
                 activation = 'relu'))
model.add(MaxPool2D(pool_size = (2, 2), strides = (2, 2)))
model.add(Dropout(0.25))
# classifier head
model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))
optimizer = RMSprop(lr = 0.001, rho = 0.9, epsilon = 1e-08, decay = 0.0)
model.compile(optimizer = optimizer, loss = "categorical_crossentropy", metrics = ["accuracy"])
epochs = 30
batch_size = 86
# halve the learning rate when validation accuracy has plateaued for 3 epochs
learning_rate_reduction = ReduceLROnPlateau(monitor = 'val_acc',
                                            patience = 3,
                                            verbose = 1,
                                            factor = 0.5,
                                            min_lr = 0.00001)
# data augmentation: small random rotations, zooms, and shifts;
# flips are disabled because they would change digit identity (e.g. 6 vs 9)
img = ImageDataGenerator(featurewise_center = False,
                         samplewise_center = False,
                         featurewise_std_normalization = False,
                         samplewise_std_normalization = False,
                         zca_whitening = False,
                         rotation_range = 10,
                         zoom_range = 0.1,
                         width_shift_range = 0.1,
                         height_shift_range = 0.1,
                         horizontal_flip = False,
                         vertical_flip = False)
img.fit(x_train)
# keep the returned History object so the learning curves can be plotted afterwards
history = model.fit_generator(img.flow(x_train, y_train, batch_size = batch_size),
                              validation_data = (x_val, y_val),
                              steps_per_epoch = x_train.shape[0] // batch_size,
                              epochs = epochs,
                              callbacks = [learning_rate_reduction])
results = model.predict(test)
print(results)
# take the class with the highest predicted probability
results = np.argmax(results, axis = 1)
print(results)
results = pd.Series(results, name = "Label")
submission = pd.concat([pd.Series(range(1, 28001), name = "ImageId"), results], axis = 1)
submission.to_csv("submission/cnn_mnist.csv", index = False)
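To inspect convergence, the curves stored in the History object captured above can be plotted with the matplotlib import already present. The metric keys 'acc'/'val_acc' assume the old Keras naming, consistent with the monitor used earlier:
# plot training vs. validation accuracy per epoch
plt.plot(history.history['acc'], label = 'train acc')
plt.plot(history.history['val_acc'], label = 'val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()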