Deep Learning for the Kaggle Digit Recognizer
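I have been studying deep learning recently, so I picked the Digit Recognizer competition on Kaggle for practice. I solve it with two toolboxes and compare their results.
The two toolboxes are DeepLearnToolbox and Caffe.
For a walkthrough of the DeepLearnToolbox source code, see: http://blog.csdn.net/lu597203933/article/details/46576017
For learning Caffe, see: http://caffe.berkeleyvision.org/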
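Part 1: DeepLearnToolbox
DeepLearnToolbox is based on MATLAB and is very simple. Reading its source code helps a lot in understanding how convolutional neural networks and similar models work.
Here I mainly preprocess the dataset provided by the Digit Recognizer competition so that it fits DeepLearnToolbox. There are two .m files: predeal.m and cnntest.m.
The only thing you need to change is the addpath paths. The code is commented in detail, so please read it yourself.
Code
predeal.m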
% Use DeepLearnToolbox to solve the Digit Recognizer on Kaggle
clear; clc
trainFile = 'train.csv';
testFile = 'test.csv';

M = csvread(trainFile, 1);      % read everything in the CSV file except the first (header) row
train_x = M(:, 2:end);          % pixel data starts at column 2
label = M(:, 1)';               % column 1 holds the labels
label(label == 0) = 10;         % map 0 to 10, otherwise the next statement cannot handle it (indices must be >= 1)
train_y = full(sparse(label, 1:size(train_x, 1), 1));   % turn the labels into a one-hot matrix
train_x = double(reshape(train_x', 28, 28, size(train_x, 1))) / 255;

%% Prepare the data to predict
M = csvread(testFile, 1);       % read everything in the CSV file except the first (header) row
test_x = double(reshape(M', 28, 28, size(M, 1))) / 255;
clear label M testFile trainFile

addpath D:\DeepLearning\DeepLearnToolbox-master\data\   % change these paths
addpath D:\DeepLearning\DeepLearnToolbox-master\CNN
addpath D:\DeepLearning\DeepLearnToolbox-master\util

rand('state', 0)
cnn.layers = {   %%% number of feature maps, kernel size and other properties of each layer
    struct('type', 'i')                                      % input layer
    struct('type', 'c', 'outputmaps', 6, 'kernelsize', 5)    % convolution layer
    struct('type', 's', 'scale', 2)                          % subsampling layer
    struct('type', 'c', 'outputmaps', 12, 'kernelsize', 5)   % convolution layer
    struct('type', 's', 'scale', 2)                          % subsampling layer
};

opts.alpha = 0.01;      % learning rate
opts.batchsize = 50;    % mini-batch gradient descent: each update uses only 50 samples
opts.numepochs = 25;    % number of epochs

cnn = cnnsetup(cnn, train_x, train_y);          % initialize the weights and biases of every layer
cnn = cnntrain(cnn, train_x, train_y, opts);    % training: backpropagation and the iteration loop
test_y = cnntest(cnn, test_x);                  % predict on the test set
test_y(test_y == 10) = 0;                       % map label 10 back to 0
test_y = test_y';
M = [(1:length(test_y))' test_y(:)];
csvwrite('test_y.csv', M);
figure; plot(cnn.rL);
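The label handling in predeal.m can be illustrated outside MATLAB. The following is a minimal Python sketch, not the author's code: because MATLAB indices start at 1, digit 0 is stored as class 10 before the one-hot matrix is built, and mapped back to 0 after prediction. The function names to_one_hot and from_class_index are hypothetical.

```python
# Sketch of the label encoding used in predeal.m (names are illustrative).

def to_one_hot(labels, num_classes=10):
    """Return a num_classes x N matrix (list of rows); digit 0 uses class 10."""
    matrix = [[0] * len(labels) for _ in range(num_classes)]
    for col, digit in enumerate(labels):
        cls = 10 if digit == 0 else digit   # 1-based class index, as in MATLAB
        matrix[cls - 1][col] = 1            # convert to a 0-based row for Python
    return matrix


def from_class_index(cls):
    """Map a 1-based predicted class index back to a digit (10 -> 0)."""
    return 0 if cls == 10 else cls


one_hot = to_one_hot([0, 3, 9])
```

Each column of the matrix contains exactly one 1, which is the same layout that full(sparse(label, 1:N, 1)) produces.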
cnntest.m
function [test_y] = cnntest(net, x)
    % feedforward
    net = cnnff(net, x);
    [~, test_y] = max(net.o);
end
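Result: the score obtained with DeepLearnToolbox is not very good, only 0.94586.
Part 2: Caffe for the Digit Recognizer
Caffe ships with an MNIST example for digit recognition, but the data from the official site is in binary format and the result is just a single accuracy number, so the example cannot be reused as-is.
The process is as follows:
1: Convert the given CSV data to LMDB format
Here I wrote a program, convert_data_to_lmdb.cpp, in the mnist directory to process the data.
The code is as follows: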
#include <cstdio>     // snprintf
#include <fstream>    // ifstream
#include <iostream>
#include <sstream>
#include <string>

#include "boost/scoped_ptr.hpp"
#include "gflags/gflags.h"
#include "glog/logging.h"

#include "caffe/proto/caffe.pb.h"
#include "caffe/util/db.hpp"
#include "caffe/util/io.hpp"
#include "caffe/util/rng.hpp"

using namespace caffe;
using namespace std;
using std::pair;
using boost::scoped_ptr;

/* edited by Zack
 * argv[1] the input file, argv[2] the output file */

DEFINE_string(backend, "lmdb", "The backend for storing the result");  // FLAGS_backend defaults to "lmdb"

int main(int argc, char **argv) {
  ::google::InitGoogleLogging(argv[0]);

#ifndef GFLAGS_GFLAGS_H_
  namespace gflags = google;
#endif

  if (argc < 3) {
    LOG(ERROR) << "please check the input arguments!";
    return 1;
  }
  ifstream infile(argv[1]);
  if (!infile) {
    LOG(ERROR) << "please check the input arguments!";
    return 1;
  }

  string str;
  int count = 0;
  int rows = 28;
  int cols = 28;
  unsigned char *buffer = new unsigned char[rows * cols];
  stringstream ss;

  Datum datum;               // this data structure stores the data and the label
  datum.set_channels(1);     // number of channels
  datum.set_height(rows);    // rows
  datum.set_width(cols);     // cols

  scoped_ptr<db::DB> db(db::GetDB(FLAGS_backend));         // new DB object
  db->Open(argv[2], db::NEW);                              // open the lmdb file to store the data
  scoped_ptr<db::Transaction> txn(db->NewTransaction());   // new Transaction object to put and commit the data

  const int kMaxKeyLength = 256;   // buffer for the key
  char key_cstr[kMaxKeyLength];

  bool flag = false;
  while (getline(infile, str)) {
    if (flag == false) {   // skip the header row
      flag = true;
      continue;
    }
    size_t beg = 0;
    size_t end = 0;
    int str_index = 0;
    // for test.csv, enable the next line ---------- change 1
    // datum.set_label(0);
    while ((end = str.find_first_of(',', beg)) != string::npos) {
      string dig_str = str.substr(beg, end - beg);
      int pixes;
      ss.clear();
      ss << dig_str;
      ss >> pixes;
      // for test.csv, delete this if block ---------- change 2
      if (beg == 0) {
        datum.set_label(pixes);   // the first column is the label
        beg = ++end;
        continue;
      }
      buffer[str_index++] = (unsigned char)pixes;
      beg = ++end;
    }
    // the last value on the line has no trailing comma
    string dig_str = str.substr(beg);
    int pixes;
    ss.clear();
    ss << dig_str;
    ss >> pixes;
    buffer[str_index++] = (unsigned char)pixes;
    datum.set_data(buffer, rows * cols);

    int length = snprintf(key_cstr, kMaxKeyLength, "%08d", count);
    // Put in db
    string out;
    CHECK(datum.SerializeToString(&out));      // serialize to string
    txn->Put(string(key_cstr, length), out);   // put both the key and the value
    if (++count % 1000 == 0) {                 // commit every 1000 records
      // Commit db
      txn->Commit();
      txn.reset(db->NewTransaction());
      LOG(ERROR) << "Processed " << count << " files.";
    }
  }
  // write the last batch
  if (count % 1000 != 0) {
    txn->Commit();
    LOG(ERROR) << "Processed " << count << " files.";
  }
  delete[] buffer;
  return 0;
}
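The per-row logic of the converter is easier to see in isolation. Below is a minimal Python sketch of it, not a replacement for the C++ program: it splits one CSV line into a label and 784 pixel bytes, and builds the zero-padded key. The names parse_row and make_key are hypothetical, and the has_label switch stands in for the two manual changes needed for test.csv.

```python
# Sketch of the per-row conversion done by convert_data_to_lmdb.cpp.
# parse_row and make_key are illustrative names, not Caffe APIs.

def parse_row(line, has_label=True):
    """Split one CSV line into (label, 784 pixel bytes)."""
    values = [int(v) for v in line.strip().split(",")]
    if has_label:
        label, pixels = values[0], values[1:]
    else:
        label, pixels = 0, values   # test.csv has no label column; use a dummy 0
    if len(pixels) != 28 * 28:
        raise ValueError("expected 784 pixels, got %d" % len(pixels))
    return label, bytes(pixels)


def make_key(index):
    """Zero-padded record key, matching the "%08d" format in the C++ code."""
    return "%08d" % index


row = "7," + ",".join(["0"] * 783 + ["255"])
label, data = parse_row(row)
```

Zero-padding the key keeps the records in insertion order when the database sorts keys lexicographically, which matters later when predictions are matched back to image ids.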
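Then we run make all -j8 to compile the code.
The corresponding binary is then generated in the build directory.
[Figure: the generated binary in the build directory]
Then run
./build/examples/mnist/convert_data_to_lmdb.bin examples/mnist/kaggle/data/train.csv examples/mnist/kaggle/mnist_train_lmdb --backend=lmdb
This produces the training data in LMDB format. Since test.csv has no labels, the code needs two small adjustments for the test data; both places are marked in the code above.
Then run make all -j8 again, followed by
./build/examples/mnist/convert_data_to_lmdb.bin examples/mnist/kaggle/data/test.csv examples/mnist/kaggle/mnist_test_lmdb --backend=lmdb
This produces the LMDB file for the test data.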
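2: Train on the training data to obtain a model
While training, Caffe evaluates the model on a test dataset every test_interval iterations (test_iter sets how many batches each evaluation runs). We can therefore take the first 1000 rows of train.csv and build a validation LMDB from them, following the same procedure as above.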
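Carving out the validation rows can be done with any CSV tool. As a minimal sketch in Python (not part of the original workflow), the function below copies the header plus the first rows of a CSV stream. The name split_holdout is hypothetical, and the header is kept because the converter above skips the first line of its input.

```python
import csv
import io


def split_holdout(reader, writer, num_rows=1000):
    """Copy the header and the first num_rows data rows; return rows written."""
    header = next(reader)
    writer.writerow(header)
    written = 0
    for row in reader:
        if written >= num_rows:
            break
        writer.writerow(row)
        written += 1
    return written


# Demo on in-memory data. With real files, open train.csv and a new output
# file (for example a hypothetical val.csv) with newline="" instead.
src = io.StringIO("label,pixel0\n1,0\n2,5\n3,9\n")
dst = io.StringIO()
count = split_holdout(csv.reader(src), csv.writer(dst), num_rows=2)
```

The resulting file can then be passed to convert_data_to_lmdb.bin in the same way as train.csv.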
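Copy lenet_solver.prototxt and lenet_train_test.prototxt from the mnist folder into the kaggle folder, then update the file paths they reference and the batch sizes accordingly. For details, see the downloadable files (link at the end of this post).
Then run ./build/tools/caffe train --solver=examples/mnist/kaggle/lenet_solver.prototxt, which gives us lenet_iter_10000.caffemodel.
3: Extract the prob-layer features of the test set
Here we use the extract_features.cpp source file in the tools folder. That program writes its results in LMDB format, however, so I modified the source code as follows: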
#include <stdio.h>   // for snprintf
#include <stdlib.h>  // for atoi
#include <string.h>  // for strcmp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "google/protobuf/text_format.h"

#include "caffe/blob.hpp"
#include "caffe/common.hpp"
#include "caffe/net.hpp"
#include "caffe/proto/caffe.pb.h"
#include "caffe/util/db.hpp"
#include "caffe/util/io.hpp"
#include "caffe/vision_layers.hpp"

using caffe::Blob;
using caffe::Caffe;
using caffe::Datum;
using caffe::Net;
using boost::shared_ptr;
using std::string;
namespace db = caffe::db;

template<typename Dtype>
int feature_extraction_pipeline(int argc, char** argv);

int main(int argc, char** argv) {
  return feature_extraction_pipeline<float>(argc, argv);
//  return feature_extraction_pipeline<double>(argc, argv);
}

template<typename Dtype>
int feature_extraction_pipeline(int argc, char** argv) {
  ::google::InitGoogleLogging(argv[0]);
  const int num_required_args = 7;  // at least 7 arguments are required
  if (argc < num_required_args) {
    LOG(ERROR)<<
    "This program takes in a trained network and an input data layer, and then"
    " extract features of the input data produced by the net.\n"
    "Usage: extract_features  pretrained_net_param"
    "  feature_extraction_proto_file  extract_feature_blob_name1[,name2,...]"
    "  save_feature_dataset_name1[,name2,...]  num_mini_batches  db_type"
    "  [CPU/GPU] [DEVICE_ID=0]\n"
    "Note: you can extract multiple features in one pass by specifying"
    " multiple feature blob names and dataset names separated by ','."
    " The names cannot contain white space characters and the number of blobs"
    " and datasets must be equal.";
    return 1;
  }
  int arg_pos = num_required_args;  // index of the first optional argument

  if (argc > arg_pos && strcmp(argv[arg_pos], "GPU") == 0) {  // whether to use the GPU
    LOG(ERROR)<< "Using GPU";
    uint device_id = 0;
    if (argc > arg_pos + 1) {
      device_id = atoi(argv[arg_pos + 1]);
      CHECK_GE(device_id, 0);
    }
    LOG(ERROR) << "Using Device_id=" << device_id;
    Caffe::SetDevice(device_id);
    Caffe::set_mode(Caffe::GPU);
  } else {
    LOG(ERROR) << "Using CPU";
    Caffe::set_mode(Caffe::CPU);
  }

  arg_pos = 0;  // the name of the executable
  std::string pretrained_binary_proto(argv[++arg_pos]);  // the trained model

  // Expected prototxt contains at least one data layer such as
  //  the layer data_layer_name and one feature blob such as the
  //  fc7 top blob to extract features.
  /*
   layers {
     name: "data_layer_name"
     type: DATA
     data_param {
       source: "/path/to/your/images/to/extract/feature/images_leveldb"
       mean_file: "/path/to/your/image_mean.binaryproto"
       batch_size: 128
       crop_size: 227
       mirror: false
     }
     top: "data_blob_name"
     top: "label_blob_name"
   }
   layers {
     name: "drop7"
     type: DROPOUT
     dropout_param {
       dropout_ratio: 0.5
     }
     bottom: "fc7"
     top: "fc7"
   }
   */
  std::string feature_extraction_proto(argv[++arg_pos]);  // the network structure
  shared_ptr<Net<Dtype> > feature_extraction_net(
      new Net<Dtype>(feature_extraction_proto, caffe::TEST));  // new net object, sets up each layer
  feature_extraction_net->CopyTrainedLayersFrom(pretrained_binary_proto);  // load the trained weights

  std::string extract_feature_blob_names(argv[++arg_pos]);  // which blobs' features to extract
  std::vector<std::string> blob_names;
  boost::split(blob_names, extract_feature_blob_names, boost::is_any_of(","));  // several blobs can be extracted, each stored in its own directory

  std::string save_feature_dataset_names(argv[++arg_pos]);  // where to store the features
  std::vector<std::string> dataset_names;
  boost::split(dataset_names, save_feature_dataset_names,  // each dataset name stores one blob's features
               boost::is_any_of(","));
  CHECK_EQ(blob_names.size(), dataset_names.size()) <<
      " the number of blob names and dataset names must be equal";
  size_t num_features = blob_names.size();  // how many blobs to extract

  for (size_t i = 0; i < num_features; i++) {
    CHECK(feature_extraction_net->has_blob(blob_names[i]))
        << "Unknown feature blob name " << blob_names[i]
        << " in the network " << feature_extraction_proto;
  }

  int num_mini_batches = atoi(argv[++arg_pos]);  // number of mini-batches to run

  // init the DB and Transaction for all blobs you want to extract features
  std::vector<shared_ptr<db::DB> > feature_dbs;       // one DB object per blob
  std::vector<shared_ptr<db::Transaction> > txns;     // one Transaction object per blob

  // edit by Zack
  // std::string strfile = "/home/hadoop/caffe/textileImage/features/probTest";
  std::string strfile = argv[argc-1];                 // path prefix of the txt output, taken from the last argument
  std::vector<std::ofstream*> vec(num_features, 0);   // one txt output stream per blob

  const char* db_type = argv[++arg_pos];  // storage format of the output, here lmdb
  for (size_t i = 0; i < num_features; ++i) {
    LOG(INFO)<< "Opening dataset " << dataset_names[i];  // dataset_names[i] stores the features in db_type format
    shared_ptr<db::DB> db(db::GetDB(db_type));           // the type of the db
    db->Open(dataset_names.at(i), db::NEW);              // open the directory that stores the features
    feature_dbs.push_back(db);                           // put the db into the vector
    shared_ptr<db::Transaction> txn(db->NewTransaction());  // the transaction for this db
    txns.push_back(txn);                                 // put the transaction into the vector

    // edit by Zack
    std::stringstream ss;
    ss.clear();
    string index;
    ss << i;
    ss >> index;
    std::string str = strfile + index + ".txt";          // e.g. <prefix>0.txt for the first blob
    vec[i] = new std::ofstream(str.c_str());
  }

  LOG(ERROR)<< "Extracting Features";

  Datum datum;
  const int kMaxKeyStrLength = 100;
  char key_str[kMaxKeyStrLength];  // buffer for the key
  std::vector<Blob<float>*> input_vec;
  std::vector<int> image_indices(num_features, 0);  // number of images processed so far, per blob

  for (int batch_index = 0; batch_index < num_mini_batches; ++batch_index) {
    feature_extraction_net->Forward(input_vec);
    for (int i = 0; i < num_features; ++i) {  // loop over the blobs to extract, e.g. fc7, fc8
      const shared_ptr<Blob<Dtype> > feature_blob = feature_extraction_net
          ->blob_by_name(blob_names[i]);
      int batch_size = feature_blob->num();                   // number of images, i.e. the batch size
      int dim_features = feature_blob->count() / batch_size;  // feature dimension per image in this blob
      const Dtype* feature_blob_data;                         // the feature values
      for (int n = 0; n < batch_size; ++n) {
        datum.set_height(feature_blob->height());      // set the height
        datum.set_width(feature_blob->width());        // set the width
        datum.set_channels(feature_blob->channels());  // set the channels
        datum.clear_data();                            // clear data
        datum.clear_float_data();                      // clear float_data
        feature_blob_data = feature_blob->cpu_data() +
            feature_blob->offset(n);                   // features of the n-th image
        for (int d = 0; d < dim_features; ++d) {
          datum.add_float_data(feature_blob_data[d]);
          (*vec[i]) << feature_blob_data[d] << " ";    // save the features to txt
        }
        (*vec[i]) << std::endl;
        // LOG(ERROR)<< "dim" << dim_features;
        int length = snprintf(key_str, kMaxKeyStrLength, "%010d",
            image_indices[i]);                         // key: index of the image
        string out;
        CHECK(datum.SerializeToString(&out));               // serialize to string
        txns.at(i)->Put(std::string(key_str, length), out); // put into the transaction
        ++image_indices[i];                                 // next key
        if (image_indices[i] % 1000 == 0) {                 // commit every 1000 images
          txns.at(i)->Commit();
          txns.at(i).reset(feature_dbs.at(i)->NewTransaction());
          LOG(ERROR)<< "Extracted features of " << image_indices[i] <<
              " query images for feature blob " << blob_names[i];
        }
      }  // for (int n = 0; n < batch_size; ++n)
    }  // for (int i = 0; i < num_features; ++i)
  }  // for (int batch_index = 0; batch_index < num_mini_batches; ++batch_index)

  // write the last batch
  for (int i = 0; i < num_features; ++i) {
    if (image_indices[i] % 1000 != 0) {  // commit the last partial batch of images
      txns.at(i)->Commit();
    }
    // edit by Zack
    vec[i]->close();
    delete vec[i];
    LOG(ERROR)<< "Extracted features of " << image_indices[i] <<
        " query images for feature blob " << blob_names[i];
    feature_dbs.at(i)->Close();
  }

  LOG(ERROR)<< "Successfully extracted the features!";
  return 0;
}
With this change, the output of the prob layer (the final class probabilities) is also written to a txt file.
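For readers unfamiliar with what the prob layer computes: it is a Softmax layer that turns the 10 raw scores of the last inner-product layer into probabilities that sum to 1. A minimal Python sketch of that operation follows; it illustrates the standard formula and is not Caffe's implementation, and the function name softmax is simply chosen here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)                           # subtract the max to avoid overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

Because softmax preserves the ordering of the scores, the predicted digit is the index of the largest probability on each line of the txt file.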
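I also adjusted the network definition: since we only need prediction, the training-only parameters in the network can all be removed.
The code of deploy.prototxt is as follows: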
name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/kaggle/mnist_test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  inner_product_param {
    num_output: 500
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param {
    num_output: 10
  }
}
layer {
  name: "prob"
  type: "Softmax"
  bottom: "ip2"
  top: "prob"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "prob"
  bottom: "label"
  top: "accuracy"
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
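To check how the layer sizes in this definition fit together, the spatial dimensions can be traced by hand. The Python sketch below does that arithmetic for a 28x28 input, assuming no padding (none is set in the prototxt). The helper name conv_out is hypothetical. Caffe rounds pooling output sizes up rather than down, but for these sizes both give the same result.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


size = 28                       # input image is 28x28
size = conv_out(size, 5)        # conv1: 5x5 kernel, stride 1 -> 24
size = conv_out(size, 2, 2)     # pool1: 2x2 window, stride 2 -> 12
size = conv_out(size, 5)        # conv2: 5x5 kernel, stride 1 -> 8
size = conv_out(size, 2, 2)     # pool2: 2x2 window, stride 2 -> 4
ip1_inputs = 50 * size * size   # 50 feature maps of 4x4 feed ip1
scale = 1.0 / 256               # the transform_param scale, 0.00390625
```

So ip1 maps 800 inputs to 500 units, and ip2 maps those to the 10 digit classes. The scale factor brings the raw pixel values from [0, 255] into [0, 1).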
Then run
./build/tools/extract_features.bin examples/mnist/kaggle/lenet_iter_10000.caffemodel examples/mnist/kaggle/deploy.prototxt prob examples/mnist/kaggle/features 280 lmdb /home/hadoop/caffe/caffe-master/examples/mnist/kaggle/feature
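Here 280 is the number of mini-batches: batch_size is set to 100 in deploy.prototxt, so 280 x 100 = 28000 covers the whole test set. /home/hadoop/caffe/caffe-master/examples/mnist/kaggle/feature is the path prefix of the txt file where the extracted features are saved (the program appends the blob index and .txt). examples/mnist/kaggle/lenet_iter_10000.caffemodel holds the trained weights, and examples/mnist/kaggle/deploy.prototxt is the network structure.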
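The two derived values in this command, the mini-batch count and the name of the txt output, can be sketched in a few lines of Python. This is only an illustration of the arithmetic and of the prefix + index + ".txt" naming used in the modified C++ code; the function names are hypothetical.

```python
import math


def num_mini_batches(num_samples, batch_size):
    """Number of forward passes needed to cover every sample at least once."""
    return math.ceil(num_samples / batch_size)


def feature_file(prefix, blob_index):
    """Name of the txt output for one blob, as built in the modified extractor."""
    return prefix + str(blob_index) + ".txt"


batches = num_mini_batches(28000, 100)
path = feature_file("examples/mnist/kaggle/feature", 0)
```

With a single blob (prob), the only output is the file ending in feature0.txt, which is the file used in the next step.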
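4: Post-process the resulting txt
After the three steps above we have feature0.txt, which stores a 28000 x 10 matrix: for each sample, the probability of it belonging to each class. Running the following MATLAB code then produces the submission file Kaggle requires. The final accuracy is 0.98986, and the ranking went up by 400+ places. Great!!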
% Caffe: post-processing of the extracted probabilities
clear; clc;
feature = load('feature0.txt');
feature = feature';
[~, test_y] = max(feature);             % index of the most probable class for each sample
[M, N] = size(test_y);
test_y = test_y - repmat([1], M, N);    % indices are 1-based; subtract 1 to get digits 0-9
test_y = test_y';
M = [(1:length(test_y))' test_y(:)];
csvwrite('test_y3.csv', M);
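The same post-processing can be sketched in Python for readers without MATLAB. This is an illustrative equivalent, not the author's script: it takes the argmax of each line of probabilities and writes the rows with an ImageId,Label header. The header is an assumption based on Kaggle's submission format for this competition (the MATLAB code above writes no header), and both function names are hypothetical.

```python
import csv
import io


def predict_digits(lines):
    """For each line of space-separated probabilities, return the argmax index (0-9)."""
    digits = []
    for line in lines:
        probs = [float(v) for v in line.split()]
        if probs:                                  # skip blank lines
            digits.append(probs.index(max(probs)))
    return digits


def write_submission(digits, out):
    """Write ImageId,Label rows; image ids start at 1."""
    writer = csv.writer(out)
    writer.writerow(["ImageId", "Label"])          # assumed Kaggle header
    for image_id, digit in enumerate(digits, start=1):
        writer.writerow([image_id, digit])


sample = ["0.1 0.7 0.2 ", "0.9 0.05 0.05 "]        # stand-in for lines of feature0.txt
digits = predict_digits(sample)
out = io.StringIO()
write_submission(digits, out)
```

Python indices are 0-based, so no subtraction is needed here, unlike in the MATLAB version.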
All files and code can be downloaded from: https://github.com/zack6514/zackcoding