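MNIST Machine Learning Dataset
Introduction
When studying machine learning, the first task is to prepare a common dataset so that results can be compared with other algorithms. Here I wrote a method for loading the MNIST dataset and wrapped it in a class; its main job is to convert the MNIST dataset into training data in numpy.array() format. Let's go straight to the code below (the main point is really how to read a binary file with Python)!
Original MNIST dataset site: http://yann.lecun.com/exdb/mnist/
GitHub source download: dataset (original archives + extracted files + digit images in JPG format), Python source files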
File layout
/utils/data_util.py Methods for loading the MNIST dataset
/utils/test.py Test file: a simple KNN test on the MNIST dataset
/data/train-images.idx3-ubyte Training set X
/dataset/train-labels.idx1-ubyte Training set y
/dataset/data/t10k-images.idx3-ubyte Test set X
/dataset/data/t10k-labels.idx1-ubyte Test set y
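Explanation of the MNIST dataset
After extracting the MNIST files, you will find that they are not in a standard image format. The image data is stored in binary files. Each sample image is 28*28 pixels.
The MNIST structure is as follows, taking train-images as the example: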
[code]
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
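First, the data is stored in binary, so the file must be opened in 'rb' mode. Second, only the [value] column is actually present in the file; [type] and the other columns are descriptive and are not stored in the data file. The [offset] column shows that the pixels start at offset 0016, and one 32-bit int takes 4 bytes, so before reading any pixel we must read four 32-bit integers: magic number, number of images, number of rows, and number of columns. struct.unpack_from() is convenient for this.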
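As a minimal sketch of the header read described above, the snippet below parses the four big-endian 32-bit integers from an in-memory buffer. The function name `parse_idx3_header` and the synthetic two-image buffer are assumptions made for illustration; they are not part of the original source.

```python
import struct

IMAGE_MAGIC = 2051  # 0x00000803, the magic number listed in the layout above


def parse_idx3_header(buf):
    """Return (magic, num_images, num_rows, num_cols) from an IDX3 image buffer.

    Hypothetical helper, for illustration only.
    """
    magic, num_images, num_rows, num_cols = struct.unpack_from('>IIII', buf, 0)
    if magic != IMAGE_MAGIC:
        raise ValueError('unexpected magic number: %d' % magic)
    return magic, num_images, num_rows, num_cols


# Synthetic stand-in for the real file: a 16-byte header plus two blank 28x28 images.
fake_file = struct.pack('>IIII', IMAGE_MAGIC, 2, 28, 28) + bytes(2 * 28 * 28)

header = parse_idx3_header(fake_file)
pixel_offset = struct.calcsize('>IIII')  # pixels start right after the header
print(header, pixel_offset)
```

Checking the magic number first is a cheap way to catch a wrong file (for example, a label file passed where an image file was expected) before any pixels are decoded.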
Source code
Notes:
'>IIII' reads four unsigned 32-bit integers in big-endian order
'>784B' reads 784 unsigned bytes in big-endian order
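To make the two format strings concrete, here is a small sketch using only Python's standard `struct` module. It shows the sizes the formats describe and why the `>` prefix matters for the 32-bit fields; the sample byte strings are made up for illustration.

```python
import struct

# Size in bytes described by each format string.
header_size = struct.calcsize('>IIII')  # four 4-byte unsigned ints
image_size = struct.calcsize('>784B')   # 784 one-byte pixels (28 * 28)

# The same four bytes decoded with both byte orders.
raw = b'\x00\x00\x08\x03'
big_endian = struct.unpack('>I', raw)[0]     # the image-file magic number
little_endian = struct.unpack('<I', raw)[0]  # a different, wrong value

# Byte order has no effect on single bytes; 'B' always yields values in 0..255.
pixels = struct.unpack('>3B', b'\x00\x7f\xff')

print(header_size, image_size, big_endian, little_endian, pixels)
```

Reading the header with the wrong byte order would yield nonsense counts, which is why the big-endian prefix is applied to every format string.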
The data_util.py file
[code]
# -*- coding: utf-8 -*-
"""
Created on Thu Feb 25 14:40:06 2016
load MNIST dataset
@author: liudiwei
"""
import numpy as np
import struct
import matplotlib.pyplot as plt
import os


class DataUtils(object):
    """Load the MNIST dataset.
    Output format: numpy.array()

    Usage:
    from data_util import DataUtils
    def main():
        trainfile_X = '../dataset/MNIST/train-images.idx3-ubyte'
        trainfile_y = '../dataset/MNIST/train-labels.idx1-ubyte'
        testfile_X = '../dataset/MNIST/t10k-images.idx3-ubyte'
        testfile_y = '../dataset/MNIST/t10k-labels.idx1-ubyte'

        train_X = DataUtils(filename=trainfile_X).getImage()
        train_y = DataUtils(filename=trainfile_y).getLabel()
        test_X = DataUtils(testfile_X).getImage()
        test_y = DataUtils(testfile_y).getLabel()

        # The following saves the images to local files
        #path_trainset = "../dataset/MNIST/imgs_train"
        #path_testset = "../dataset/MNIST/imgs_test"
        #if not os.path.exists(path_trainset):
        #    os.mkdir(path_trainset)
        #if not os.path.exists(path_testset):
        #    os.mkdir(path_testset)
        #DataUtils(outpath=path_trainset).outImg(train_X, train_y)
        #DataUtils(outpath=path_testset).outImg(test_X, test_y)

        return train_X, train_y, test_X, test_y
    """

    def __init__(self, filename=None, outpath=None):
        self._filename = filename
        self._outpath = outpath

        self._tag = '>'  # big-endian byte order
        self._twoBytes = 'II'  # two 32-bit ints: label file header
        self._fourBytes = 'IIII'  # four 32-bit ints: image file header
        self._pictureBytes = '784B'  # one 28*28 image
        self._labelByte = '1B'  # one label
        self._twoBytes2 = self._tag + self._twoBytes
        self._fourBytes2 = self._tag + self._fourBytes
        self._pictureBytes2 = self._tag + self._pictureBytes
        self._labelByte2 = self._tag + self._labelByte

    def getImage(self):
        """
        Convert the MNIST binary image file into pixel feature data
        """
        with open(self._filename, 'rb') as binfile:  # open the file in binary mode
            buf = binfile.read()
        index = 0
        numMagic, numImgs, numRows, numCols = struct.unpack_from(self._fourBytes2, buf, index)
        # Use the same (big-endian) format string for the size as for the read
        index += struct.calcsize(self._fourBytes2)
        images = []
        for i in range(numImgs):
            imgVal = struct.unpack_from(self._pictureBytes2, buf, index)
            index += struct.calcsize(self._pictureBytes2)
            imgVal = list(imgVal)
            # Binarize: every pixel value greater than 1 becomes 1
            for j in range(len(imgVal)):
                if imgVal[j] > 1:
                    imgVal[j] = 1
            images.append(imgVal)
        return np.array(images)

    def getLabel(self):
        """
        Convert the MNIST binary label file into the corresponding digit labels
        """
        with open(self._filename, 'rb') as binFile:
            buf = binFile.read()
        index = 0
        magic, numItems = struct.unpack_from(self._twoBytes2, buf, index)
        index += struct.calcsize(self._twoBytes2)
        labels = []
        for x in range(numItems):
            im = struct.unpack_from(self._labelByte2, buf, index)
            index += struct.calcsize(self._labelByte2)
            labels.append(im[0])
        return np.array(labels)

    def outImg(self, arrX, arrY):
        """
        Output PNG images from the generated features and digit labels
        """
        m, n = np.shape(arrX)
        # Each image is 28*28 = 784 bytes
        # As written, only the first image is saved; use range(m) to save all of them
        for i in range(1):
            img = np.array(arrX[i])
            img = img.reshape(28, 28)
            outfile = str(i) + "_" + str(arrY[i]) + ".png"
            plt.figure()
            plt.imshow(img, cmap='binary')  # show the image in black and white
            plt.savefig(self._outpath + "/" + outfile)
            plt.close()  # release the figure after saving
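As an alternative sketch, the per-image `struct.unpack_from` loop can be replaced by a single `numpy.frombuffer` call, which avoids building 60000 Python lists. This is a swapped-in technique, not the author's method. The function names `load_images` and `load_labels` and the tiny synthetic buffers are assumptions for illustration, and the label magic number 2049 (0x00000801) comes from the MNIST format description rather than from the code above. `np.minimum(images, 1)` reproduces the thresholding in `getImage()`, where every value greater than 1 becomes 1.

```python
import struct

import numpy as np


def load_images(buf, binarize=True):
    """Decode an IDX3 image buffer into an array of shape (num_images, rows * cols)."""
    magic, num_images, rows, cols = struct.unpack_from('>IIII', buf, 0)
    if magic != 2051:
        raise ValueError('not an IDX3 image buffer')
    pixels = np.frombuffer(buf, dtype=np.uint8,
                           count=num_images * rows * cols, offset=16)
    images = pixels.reshape(num_images, rows * cols)
    # Same effect as the loop in getImage(): every value above 1 is clamped to 1.
    return np.minimum(images, 1) if binarize else images


def load_labels(buf):
    """Decode an IDX1 label buffer into a 1-D array of digit labels."""
    magic, num_items = struct.unpack_from('>II', buf, 0)
    if magic != 2049:
        raise ValueError('not an IDX1 label buffer')
    return np.frombuffer(buf, dtype=np.uint8, count=num_items, offset=8)


# Synthetic stand-ins for the real files: two 2x2 "images" and two labels.
image_buf = struct.pack('>IIII', 2051, 2, 2, 2) + bytes([0, 1, 128, 255, 0, 0, 7, 0])
label_buf = struct.pack('>II', 2049, 2) + bytes([5, 0])

X = load_images(image_buf)
y = load_labels(label_buf)
print(X.tolist(), y.tolist())
```

With real data, the buffers would come from reading the files in 'rb' mode. Note that arrays returned by `np.frombuffer` over a `bytes` object are read-only, so copy them before modifying in place.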
The test.py file: a simple test of the KNN algorithm. The code is as follows:
[code]
# -*- coding: utf-8 -*-
"""
Created on Thu Feb 25 16:09:58 2016
Test MNIST dataset
@author: liudiwei
"""
from sklearn import neighbors
from data_util import DataUtils
import datetime


def main():
    trainfile_X = '../dataset/MNIST/train-images.idx3-ubyte'
    trainfile_y = '../dataset/MNIST/train-labels.idx1-ubyte'
    testfile_X = '../dataset/MNIST/t10k-images.idx3-ubyte'
    testfile_y = '../dataset/MNIST/t10k-labels.idx1-ubyte'
    train_X = DataUtils(filename=trainfile_X).getImage()
    train_y = DataUtils(filename=trainfile_y).getLabel()
    test_X = DataUtils(testfile_X).getImage()
    test_y = DataUtils(testfile_y).getLabel()
    return train_X, train_y, test_X, test_y


def testKNN():
    train_X, train_y, test_X, test_y = main()
    startTime = datetime.datetime.now()
    knn = neighbors.KNeighborsClassifier(n_neighbors=3)
    knn.fit(train_X, train_y)
    match = 0
    for i in range(len(test_y)):
        # predict() expects a 2D array, so reshape the single sample to (1, 784)
        predictLabel = knn.predict(test_X[i].reshape(1, -1))[0]
        if predictLabel == test_y[i]:
            match += 1
    endTime = datetime.datetime.now()
    print('use time: ' + str(endTime - startTime))
    print('error rate: ' + str(1 - (match * 1.0 / len(test_y))))


if __name__ == "__main__":
    testKNN()
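For readers unfamiliar with what the classifier does, the following is a minimal pure-Python sketch of k-nearest-neighbours voting and of the error-rate calculation used in test.py. It is not scikit-learn's implementation, and the function names and the toy 2-D points are made up for illustration.

```python
from collections import Counter


def knn_predict(train_X, train_y, sample, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbours.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, sample)), label)
        for row, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def error_rate(predictions, truths):
    """Fraction of mismatches, i.e. 1 - match / len(truths) as in test.py."""
    match = sum(1 for p, t in zip(predictions, truths) if p == t)
    return 1 - match / len(truths)


# Toy data: two well-separated clusters labelled 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(1, 1), (8, 8), (0, 2)]
test_y = [0, 1, 1]  # the last label is deliberately wrong

predictions = [knn_predict(train_X, train_y, s) for s in test_X]
print(predictions, error_rate(predictions, test_y))
```

On MNIST each sample would be a 784-element pixel vector instead of a 2-D point, and a brute-force search like this would be slow, which is one reason to use a library classifier in practice.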
The main method returns the data directly in numpy.array() format: train_X, train_y, test_X, test_y. If you need the data, just call main()!
For more machine learning articles, visit: http://www.csuldw.com.