
Brute-force fix: UnicodeEncodeError when importing MNIST data on Win10 + TensorFlow + Python 3.6

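Background

MNIST is the most commonly used, and also the simplest, dataset in the TensorFlow world.
So it is often used as a smoke test after a fresh TensorFlow install, when setting up the GPU build, or when trying out a TensorFlow feature for the first time.
However, the way of loading MNIST given in the official documentation: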

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

fails with an error when run directly:

File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\zmq\utils\jsonapi.py", line 43, in dumps
s = s.encode('utf8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd5' in position 2416: surrogates not allowed
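
As a side note on where a character like '\udcd5' can come from, here is a minimal sketch rather than a diagnosis of this exact traceback. Python's surrogateescape error handler maps an undecodable byte 0xNN to the lone surrogate U+DCNN, so a byte 0xD5 that is not valid UTF-8 (possibly text in a legacy Windows code page, which is only an assumption here) would become '\udcd5'. Encoding that character back as strict UTF-8 then fails with the same message as above.

```python
# A single byte that is not valid UTF-8 on its own.
raw = b"\xd5"

# surrogateescape carries the bad byte through as a lone surrogate (U+DC00 + 0xD5).
smuggled = raw.decode("utf-8", errors="surrogateescape")
print(repr(smuggled))

# Strict UTF-8 encoding refuses lone surrogates, as in the traceback above.
try:
    smuggled.encode("utf8")
    error_reason = None
except UnicodeEncodeError as exc:
    error_reason = exc.reason
    print("UnicodeEncodeError:", exc.reason)

# Encoding with the same handler restores the original byte.
restored = smuggled.encode("utf-8", errors="surrogateescape")
```

If this is indeed the mechanism, the failure has nothing to do with the MNIST files themselves, which is one more reason to simply sidestep the official loader.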

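There is plenty of material about this problem online, including solutions, but I don't like them:

  1. Many answers assume Linux; for someone like me who only has Win10, some of the methods are questionable.
  2. Someone new to TensorFlow, or not yet fluent in Python, needs a simple, blunt way to use the MNIST data. Spending study effort on encoding issues, or on reading MNIST from binary files, just wears down one's enthusiasm.

So why not read the MNIST binary files directly and write a knock-off module that is used almost the same way as the official one?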

Preparation

Code

Create mnist.py in the working directory (the imitation starts with the name) with the following content:

# -*- coding: utf-8 -*-
import struct

import numpy as np


# Read the images; returns a numpy array of shape [num_samples, rows*cols].
def read_img(path, filename):
    with open(path + filename, 'rb') as bitfile:
        buffer = bitfile.read()
    # IDX3 header: magic number, image count, rows, cols (big-endian uint32).
    head = struct.unpack_from('>IIII', buffer, 0)
    print('load head:', head)
    imgNum = head[1]
    rows = head[2]
    cols = head[3]
    bits = imgNum * rows * cols
    bitsString = '>' + str(bits) + 'B'
    offset = struct.calcsize('>IIII')
    imgs = struct.unpack_from(bitsString, buffer, offset)
    imgs = np.reshape(imgs, [imgNum, rows * cols])
    print('load image finished')
    return imgs


# Read the ground-truth labels; returns a one-hot numpy array of shape [num_samples, 10].
def read_label(path, filename):
    with open(path + filename, 'rb') as bitfile:
        buffer = bitfile.read()
    # IDX1 header: magic number, label count (big-endian uint32).
    head = struct.unpack_from('>II', buffer, 0)
    print('load head:', head)
    labelNum = head[1]
    labelString = '>' + str(labelNum) + 'B'
    offset = struct.calcsize('>II')
    label = struct.unpack_from(labelString, buffer, offset)
    labels = np.zeros([labelNum, 10])
    for i in range(labelNum):
        labels[i, label[i]] = 1.0
    print('load labels finished')
    return labels


class train(object):
    def __init__(self, path='mnist\\'):
        self.images = read_img(path, 'x_train.idx3-ubyte')
        self.labels = read_label(path, 'y_train.idx1-ubyte')
        self.it_img = iter(self.images)
        self.it_label = iter(self.labels)

    def next_batch(self, batch_size):
        # Note: once the data runs out, this returns the StopIteration class
        # itself as an end-of-data marker instead of an (images, labels) pair.
        try:
            batch_img = []
            batch_label = []
            for _ in range(batch_size):
                batch_img.append(next(self.it_img))
                batch_label.append(next(self.it_label))
            return np.array(batch_img), np.array(batch_label)
        except StopIteration:
            return StopIteration


class test(object):
    def __init__(self, path='mnist\\'):
        self.images = read_img(path, 'x_test.idx3-ubyte')
        self.labels = read_label(path, 'y_test.idx1-ubyte')
        self.it_img = iter(self.images)
        self.it_label = iter(self.labels)

    def next_batch(self, batch_size):
        # Same end-of-data behavior as train.next_batch.
        try:
            batch_img = []
            batch_label = []
            for _ in range(batch_size):
                batch_img.append(next(self.it_img))
                batch_label.append(next(self.it_label))
            return np.array(batch_img), np.array(batch_label)
        except StopIteration:
            return StopIteration

There are two classes, train and test; create the corresponding object to load each dataset.
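
To make the file layout that the loader relies on concrete, here is a small sketch that parses IDX bytes built in memory, so it runs without the real MNIST files. The function names parse_idx_images and parse_idx_labels are made up for this illustration and are not part of mnist.py. It also swaps in np.frombuffer for the large struct.unpack_from call, which is an alternative I am suggesting, not what the module above does.

```python
import struct

import numpy as np


def parse_idx_images(buffer):
    """Parse IDX3 image bytes into a [count, rows*cols] uint8 array."""
    magic, count, rows, cols = struct.unpack_from(">IIII", buffer, 0)
    if magic != 2051:
        raise ValueError("not an IDX3 image file, magic = %d" % magic)
    offset = struct.calcsize(">IIII")  # 16-byte header
    pixels = np.frombuffer(buffer, dtype=np.uint8,
                           count=count * rows * cols, offset=offset)
    return pixels.reshape(count, rows * cols)


def parse_idx_labels(buffer, num_classes=10):
    """Parse IDX1 label bytes into a [count, num_classes] one-hot array."""
    magic, count = struct.unpack_from(">II", buffer, 0)
    if magic != 2049:
        raise ValueError("not an IDX1 label file, magic = %d" % magic)
    offset = struct.calcsize(">II")  # 8-byte header
    digits = np.frombuffer(buffer, dtype=np.uint8, count=count, offset=offset)
    one_hot = np.zeros((count, num_classes))
    one_hot[np.arange(count), digits] = 1.0
    return one_hot


# Tiny synthetic "files": two 2x2 images and two labels (3 and 7).
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
fake_labels = struct.pack(">II", 2049, 2) + bytes([3, 7])

imgs = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(imgs.shape, labels.shape)
```

The magic numbers 2051 and 2049 are the same values that show up in the "load head" lines printed by the module.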

Usage


In[1]: import mnist

In[2]: train=mnist.train()

load head: (2051, 60000, 28, 28)
load image finished
load head: (2049, 60000)
load labels finished

In[3]: test=mnist.test()

load head: (2051, 10000, 28, 28)
load image finished
load head: (2049, 10000)
load labels finished

In[4]: img,label=train.next_batch(batch_size=5)

In[5]: img
Out[5]: 
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In[6]: label
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

In[7]: import numpy as np

In[8]: np.shape(img)
Out[8]: (5, 784)

In[9]: np.shape(label)
Out[9]: (5, 10)
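
A short sketch of what one might do with such a batch: turn the one-hot rows back into digits and reshape the flat pixel rows into 28x28 pictures. The label array below is copied from the output above. The img array is only an all-zero stand-in with the same shape, and the scaling to [0, 1] is an optional step I am assuming is wanted, since this module returns raw 0..255 pixel values.

```python
import numpy as np

# The one-hot labels printed in the session above.
label = np.array([
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
], dtype=float)

# Index of the 1.0 in each row is the digit.
digits = np.argmax(label, axis=1)
print(digits)  # [5 0 4 1 9]

# Stand-in for the img batch (real data would come from next_batch).
img = np.zeros((5, 784), dtype=np.uint8)

# One 28x28 picture per sample, e.g. for plotting.
pictures = img.reshape(-1, 28, 28)

# Optional: scale raw 0..255 pixels to floats in [0, 1].
scaled = img.astype(np.float32) / 255.0
```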

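That is basically ready to use straight out of the bag.

For now the module only provides the next_batch method; more can be added later if the need arises.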
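
As one possible extension, here is a sketch of a batcher that reshuffles and starts over when the data runs out, so a training loop can keep calling next_batch across epochs. The class name EpochBatcher, its seed parameter, and the choice to drop an incomplete final batch are all assumptions made for illustration; none of this is in mnist.py.

```python
import numpy as np


class EpochBatcher(object):
    """Hypothetical wrapper: shuffled batches that wrap around between epochs."""

    def __init__(self, images, labels, seed=0):
        self.images = np.asarray(images)
        self.labels = np.asarray(labels)
        self.rng = np.random.RandomState(seed)
        self._start_epoch()

    def _start_epoch(self):
        # Draw a fresh random order over all samples.
        self.order = self.rng.permutation(len(self.images))
        self.position = 0

    def next_batch(self, batch_size):
        # Not enough samples left: drop the remainder and reshuffle.
        if self.position + batch_size > len(self.order):
            self._start_epoch()
        picked = self.order[self.position:self.position + batch_size]
        self.position += batch_size
        return self.images[picked], self.labels[picked]


# Toy data: 6 "images" of 2 pixels each; image i holds the values 2*i and 2*i+1.
toy_images = np.arange(12).reshape(6, 2)
toy_labels = np.arange(6)

batcher = EpochBatcher(toy_images, toy_labels)
batches = [batcher.next_batch(4) for _ in range(3)]
for x, y in batches:
    print(x.shape, y)
```

With the real data, one would pass train.images and train.labels from the module above in place of the toy arrays.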