Brute-force fix: UnicodeEncodeError when importing MNIST data on Win10 + TensorFlow + Python 3.6
阿新 • Published: 2019-02-12
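Background
MNIST is the most commonly used and most basic dataset for TensorFlow.
So it is often used as a smoke test after a fresh TensorFlow install, when setting up the GPU version, or when trying out TensorFlow operations you have not used before.
However, the way of loading the MNIST data given in the official documentation:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
raises an error when run directly: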
  File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\zmq\utils\jsonapi.py", line 43, in dumps
    s = s.encode('utf8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd5' in position 2416: surrogates not allowed
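The traceback does not say where the lone surrogate '\udcd5' came from. One plausible source, offered here as an assumption rather than a diagnosis, is a non-UTF-8 byte (for example from a path or message in the system's local code page) that Python decoded with the surrogateescape error handler. The sketch below reproduces that mechanism with a single hypothetical byte, 0xD5:

```python
# Assumption: a raw byte 0xD5 (not valid UTF-8 on its own) reached Python
# and was decoded with the "surrogateescape" error handler.
raw = b"\xd5"
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))  # '\udcd5'

# Strict UTF-8 encoding refuses lone surrogates, as in the traceback.
try:
    text.encode("utf8")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print(exc.reason)  # surrogates not allowed

# Encoding with the same handler restores the original byte.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

This only illustrates how such a character can arise and why the encode step fails; it does not fix the error inside zmq.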
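There is plenty of material online about this problem, including solutions, but I am not fond of them:
- Many answers assume Linux; for someone like me who only has Win10, some of those methods are questionable.
- Someone new to TensorFlow, or not fluent in Python, needs a simple and blunt way to use the MNIST data. Spending learning effort on the encode problem, or on reading MNIST from binary files, just wears down the revolutionary zeal.
So why not read the MNIST binary files directly and write a knock-off module that is used almost the same way as the official import? Wouldn't that be nice?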
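Preparation
The code below reads uncompressed IDX files. By default it expects a mnist\ folder in the working directory containing x_train.idx3-ubyte, y_train.idx1-ubyte, x_test.idx3-ubyte and y_test.idx1-ubyte (the names used in the code).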
Code
Create mnist.py in the working directory (the imitation starts with the name), with the following contents:
# -*- coding: utf-8 -*-
import numpy as np
import struct
# Read the images; returns a numpy array of shape [num_samples, image_width * image_height]
def read_img(path, filename):
    with open(path + filename, 'rb') as bitfile:
        buffer = bitfile.read()
    # Header: magic number, image count, rows, columns (big-endian uint32)
    head = struct.unpack_from('>IIII', buffer, 0)
    print('load head:', head)
    imgNum = head[1]
    height = head[2]
    width = head[3]
    bits = imgNum * width * height
    bitsString = '>' + str(bits) + 'B'
    offset = struct.calcsize('>IIII')
    imgs = struct.unpack_from(bitsString, buffer, offset)
    imgs = np.reshape(imgs, [imgNum, width * height])
    print('load image finished')
    return imgs
# Read the ground-truth labels; returns a one-hot numpy array of shape [num_samples, 10]
def read_label(path, filename):
    with open(path + filename, 'rb') as bitfile:
        buffer = bitfile.read()
    # Header: magic number, label count (big-endian uint32)
    head = struct.unpack_from('>II', buffer, 0)
    print('load head:', head)
    labelNum = head[1]
    labelString = '>' + str(labelNum) + 'B'
    offset = struct.calcsize('>II')
    label = struct.unpack_from(labelString, buffer, offset)
    # One-hot encode: row i gets a 1.0 in the column of its digit
    labels = np.zeros([labelNum, 10])
    for i in range(labelNum):
        labels[i, label[i]] = 1.0
    print('load labels finished')
    return labels
class train(object):
    def __init__(self, path='mnist\\'):
        self.images = read_img(path, 'x_train.idx3-ubyte')
        self.labels = read_label(path, 'y_train.idx1-ubyte')
        self.it_img = iter(self.images)
        self.it_label = iter(self.labels)

    # Return the next batch_size samples as (images, labels)
    def next_batch(self, batch_size):
        batch_img = []
        batch_label = []
        try:
            for _ in range(batch_size):
                batch_img.append(next(self.it_img))
                batch_label.append(next(self.it_label))
        except StopIteration:
            # Data exhausted: raise so the caller can stop or create a new object
            raise StopIteration('no more training samples')
        return np.array(batch_img), np.array(batch_label)
class test(object):
    def __init__(self, path='mnist\\'):
        self.images = read_img(path, 'x_test.idx3-ubyte')
        self.labels = read_label(path, 'y_test.idx1-ubyte')
        self.it_img = iter(self.images)
        self.it_label = iter(self.labels)

    # Return the next batch_size samples as (images, labels)
    def next_batch(self, batch_size):
        batch_img = []
        batch_label = []
        try:
            for _ in range(batch_size):
                batch_img.append(next(self.it_img))
                batch_label.append(next(self.it_label))
        except StopIteration:
            # Data exhausted: raise so the caller can stop or create a new object
            raise StopIteration('no more test samples')
        return np.array(batch_img), np.array(batch_label)
There are two classes, train and test; create the corresponding object for the data you want to read.
Usage
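Unpacking tens of millions of pixels one byte at a time with struct works but is slow. As an alternative sketch (not part of the module above), np.frombuffer can read the payload in one step. The function names read_idx_images and read_idx_labels and the tiny demo files are made up for illustration; the magic numbers 2051 and 2049 are the ones printed by the header lines in this post.

```python
import os
import struct
import tempfile

import numpy as np


def read_idx_images(filepath):
    """Return a uint8 array of shape [num_images, rows * cols]."""
    with open(filepath, "rb") as f:
        buf = f.read()
    # Header: magic number, image count, rows, columns (big-endian uint32).
    magic, num, rows, cols = struct.unpack_from(">IIII", buf, 0)
    if magic != 2051:
        raise ValueError("not an IDX3 image file: magic=%d" % magic)
    pixels = np.frombuffer(buf, dtype=np.uint8, offset=16, count=num * rows * cols)
    return pixels.reshape(num, rows * cols)


def read_idx_labels(filepath, num_classes=10):
    """Return a one-hot float array of shape [num_labels, num_classes]."""
    with open(filepath, "rb") as f:
        buf = f.read()
    # Header: magic number, label count (big-endian uint32).
    magic, num = struct.unpack_from(">II", buf, 0)
    if magic != 2049:
        raise ValueError("not an IDX1 label file: magic=%d" % magic)
    labels = np.frombuffer(buf, dtype=np.uint8, offset=8, count=num)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes)[labels]


# Tiny synthetic files (2 images of 2x3 pixels) so the sketch runs without MNIST.
tmpdir = tempfile.mkdtemp()
img_path = os.path.join(tmpdir, "demo.idx3-ubyte")
lbl_path = os.path.join(tmpdir, "demo.idx1-ubyte")
with open(img_path, "wb") as f:
    f.write(struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12)))
with open(lbl_path, "wb") as f:
    f.write(struct.pack(">II", 2049, 2) + bytes([3, 7]))

demo_imgs = read_idx_images(img_path)
demo_labels = read_idx_labels(lbl_path)
print(demo_imgs.shape, demo_labels.shape)  # (2, 6) (2, 10)
```

Note that this version returns uint8 pixels and a read-only array, so copy it (for example with astype) before modifying values in place.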
In[1]: import mnist
In[2]: train=mnist.train()
load head: (2051, 60000, 28, 28)
load image finished
load head: (2049, 60000)
load labels finished
In[3]: test=mnist.test()
load head: (2051, 10000, 28, 28)
load image finished
load head: (2049, 10000)
load labels finished
In[4]: img,label=train.next_batch(batch_size=5)
In[5]: img
Out[5]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
In[6]: label
Out[6]:
array([[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
In[7]: import numpy as np
In[8]: np.shape(img)
Out[8]: (5, 784)
In[9]: np.shape(label)
Out[9]: (5, 10)
That is more or less ready to use straight out of the bag.
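To read the one-hot rows in the session above as plain digits, np.argmax along axis 1 is enough. The array below is copied from the label output shown earlier; nothing else is assumed.

```python
import numpy as np

# The one-hot label batch printed in the session above.
label = np.array([
    [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
    [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
    [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
    [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
    [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
])

# The column index of the 1.0 in each row is the digit.
digits = np.argmax(label, axis=1)
print(digits)  # [5 0 4 1 9]
```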
For now it only includes the next_batch method; more can be added later if the need arises.
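One possible extension, sketched here purely as an illustration, is a batcher that shuffles the samples and starts over when an epoch runs out. The class name Batcher, its seed parameter, and the stand-in data are all hypothetical and not part of the module above.

```python
import numpy as np


class Batcher(object):
    """Hypothetical helper: shuffled mini-batches that restart every epoch."""

    def __init__(self, images, labels, seed=0):
        self.images = np.asarray(images)
        self.labels = np.asarray(labels)
        self.rng = np.random.RandomState(seed)
        self.order = self.rng.permutation(len(self.images))
        self.pos = 0

    def next_batch(self, batch_size):
        # Reshuffle and start a new epoch when too few samples remain;
        # the leftover tail of the old epoch is dropped.
        if self.pos + batch_size > len(self.order):
            self.order = self.rng.permutation(len(self.images))
            self.pos = 0
        idx = self.order[self.pos:self.pos + batch_size]
        self.pos += batch_size
        # The same indices select images and labels, so pairs stay aligned.
        return self.images[idx], self.labels[idx]


# Stand-in data: 10 "images" of 4 pixels, sample i labelled as class i.
# With mnist.py one would pass train.images and train.labels instead.
fake_images = np.arange(10 * 4).reshape(10, 4)
fake_labels = np.eye(10)
batcher = Batcher(fake_images, fake_labels)
for _ in range(5):
    img, label = batcher.next_batch(batch_size=4)
print(img.shape, label.shape)  # (4, 4) (4, 10)
```

Indexing with a shuffled permutation keeps the whole dataset in memory, which is fine for a dataset of this size.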