Converting a Bunch to an HDF5 File: Efficient Storage for CIFAR and Other Datasets

阿新 • Published: 2018-06-29


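For how to package datasets into a Bunch, see the improvements described in "Customizing an AI-Dedicated Database" (『AI 專屬數據庫的定制』).

PyTables combines Python with the HDF5 database/file standard. It is designed specifically to optimize I/O performance and make the most of the available hardware, and it also supports compression.

All of the code below was run in a Jupyter Notebook: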

import sys 
sys.path.append('E:/xinlib')
from base.filez import DataBunch

import tables as tb
import numpy as np


def bunch2hdf5(root):
    '''
    Only the Cifar10, Cifar100, MNIST, and Fashion MNIST datasets are wrapped here;
    users can append further datasets themselves.

    Node titles are kept in Chinese: 訓練數據 = training data, 訓練標簽 = training labels,
    測試數據 = test data, 測試標簽 = test labels, 超類 = superclass (coarse),
    子類 = subclass (fine), 標簽名稱 = label names.
    '''
    db = DataBunch(root)
    filters = tb.Filters(complevel=7, shuffle=False)
    # The file is opened with compression filters, hence the `.h5c` suffix; `.h5` works as well.
    # Note: create_array() writes plain, unchunked Array nodes, which HDF5 cannot compress,
    # so these filters only take effect for chunked nodes (e.g. ones made with create_carray()).
    with tb.open_file(f'{root}X.h5c', 'w', filters=filters, title="Xinet's dataset") as h5:
        for name in db.keys():
            # One group per dataset; the group title records the dataset's URL.
            h5.create_group('/', name, title=f'{db[name].url}')
            if name != 'cifar100':
                h5.create_array(h5.root[name], 'trainX', db[name].trainX, title='訓練數據')
                h5.create_array(h5.root[name], 'trainY', db[name].trainY, title='訓練標簽')
                h5.create_array(h5.root[name], 'testX', db[name].testX, title='測試數據')
                h5.create_array(h5.root[name], 'testY', db[name].testY, title='測試標簽')
            else:
                # Cifar100 has two label sets: coarse (superclass) and fine (subclass).
                h5.create_array(h5.root[name], 'trainX', db[name].trainX, title='訓練數據')
                h5.create_array(h5.root[name], 'testX', db[name].testX, title='測試數據')
                h5.create_array(h5.root[name], 'train_coarse_labels', db[name].train_coarse_labels, title='超類訓練標簽')
                h5.create_array(h5.root[name], 'test_coarse_labels', db[name].test_coarse_labels, title='超類測試標簽')
                h5.create_array(h5.root[name], 'train_fine_labels', db[name].train_fine_labels, title='子類訓練標簽')
                h5.create_array(h5.root[name], 'test_fine_labels', db[name].test_fine_labels, title='子類測試標簽')

        # Store the human-readable label names for the two CIFAR datasets.
        for k in ['cifar10', 'cifar100']:
            for name in db[k].meta.keys():
                name = name.decode()  # the meta keys are bytes
                if name.endswith('names'):
                    label_names = np.asanyarray([label_name.decode() for label_name in db[k].meta[name.encode()]])
                    h5.create_array(h5.root[k], name, label_names, title='標簽名稱')
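
One caveat about the filters passed to open_file: nodes made with create_array() are stored unchunked (chunkshape := None in the file listing), and HDF5 only applies compression filters to chunked data. The sketch below is not part of the original notebook. It shows how the same kind of data could be written with create_carray() so that the zlib filter actually applies. The file names and the all-zero stand-in data are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Stand-in data (assumption): a small, highly compressible uint8 image batch.
data = np.zeros((100, 32, 32, 3), dtype=np.uint8)

filters = tb.Filters(complevel=7, complib='zlib', shuffle=False)
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, 'plain.h5')      # hypothetical file name
packed_path = os.path.join(tmp_dir, 'packed.h5c')   # hypothetical file name

# create_array: contiguous (unchunked) storage, so nothing is compressed.
with tb.open_file(plain_path, 'w') as h5:
    h5.create_array('/', 'trainX', data, title='plain Array')

# create_carray: chunked storage, so the filters are applied chunk by chunk.
with tb.open_file(packed_path, 'w') as h5:
    h5.create_carray('/', 'trainX', obj=data, filters=filters,
                     title='compressed CArray')

# Reading back works the same way as for a plain Array.
with tb.open_file(packed_path) as h5:
    restored = h5.root.trainX.read()
    stored_complevel = h5.root.trainX.filters.complevel

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(plain_size, packed_size)
```

Real image data compresses far less than an all-zero array, so the size difference on Cifar or MNIST will be smaller than in this toy case.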

Completing the conversion from `Bunch` to `HDF5`

root = 'E:/Data/Zip/'
bunch2hdf5(root)

h5c = tb.open_file('E:/Data/Zip/X.h5c')
h5c

File(filename=E:/Data/Zip/X.h5c, title="Xinet's dataset", mode='r', root_uep='/', filters=Filters(complevel=7, complib='zlib', shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) "Xinet's dataset"
/cifar10 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
/cifar10/label_names (Array(10,)) '標簽名稱'
  atom := StringAtom(itemsize=10, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/testX (Array(10000, 32, 32, 3)) '測試數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/testY (Array(10000,)) '測試標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar10/trainX (Array(50000, 32, 32, 3)) '訓練數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/trainY (Array(50000,)) '訓練標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
/cifar100/coarse_label_names (Array(20,)) '標簽名稱'
  atom := StringAtom(itemsize=30, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/fine_label_names (Array(100,)) '標簽名稱'
  atom := StringAtom(itemsize=13, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/testX (Array(10000, 32, 32, 3)) '測試數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/test_coarse_labels (Array(10000,)) '超類測試標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/test_fine_labels (Array(10000,)) '子類測試標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/trainX (Array(50000, 32, 32, 3)) '訓練數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/train_coarse_labels (Array(50000,)) '超類訓練標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/train_fine_labels (Array(50000,)) '子類訓練標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/fashion_mnist (Group) 'https://github.com/zalandoresearch/fashion-mnist'
/fashion_mnist/testX (Array(10000, 28, 28, 1)) '測試數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/fashion_mnist/testY (Array(10000,)) '測試標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/fashion_mnist/trainX (Array(60000, 28, 28, 1)) '訓練數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/fashion_mnist/trainY (Array(60000,)) '訓練標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/mnist (Group) 'http://yann.lecun.com/exdb/mnist'
/mnist/testX (Array(10000, 28, 28, 1)) '測試數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/mnist/testY (Array(10000,)) '測試標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/mnist/trainX (Array(60000, 28, 28, 1)) '訓練數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/mnist/trainY (Array(60000,)) '訓練標簽'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

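The structure above shows that Cifar10, Cifar100, MNIST, and Fashion MNIST have all been packaged, together with their accompanying dataset information, such as the label names and the numeric features (stored as arrays).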

%%time
arr = h5c.root.cifar100.trainX.read() # reading the data is very fast

Wall time: 125 ms

arr.shape

(50000, 32, 32, 3)

h5c.root

/ (RootGroup) "Xinet's dataset"
  children := ['cifar10' (Group), 'cifar100' (Group), 'fashion_mnist' (Group), 'mnist' (Group)]

How to use `X.h5c`

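Below, we use Cifar100 as an example to demonstrate our self-built dataset file X.h5c. (I have uploaded it to Baidu Cloud, link: https://pan.baidu.com/s/1nzaicwHmFZH9Xgf2foSw6Q, password: bl2e, so you can download it and use it directly. You can also generate it yourself, which I recommend, since doing so deepens your understanding of the datasets.)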

cifar100 = h5c.root.cifar100
cifar100

/cifar100 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
  children := ['coarse_label_names' (Array), 'fine_label_names' (Array), 'testX' (Array), 'test_coarse_labels' (Array), 'test_fine_labels' (Array), 'trainX' (Array), 'train_coarse_labels' (Array), 'train_fine_labels' (Array)]

'coarse_label_names' holds the coarse-grained (superclass) label names, while 'fine_label_names' holds the fine-grained label names.

You can fetch the contents directly with the read() method, or by indexing.

coarse_label_names = cifar100.coarse_label_names[:]
# or
coarse_label_names = cifar100.coarse_label_names.read()
coarse_label_names.astype('str')

array(['aquatic_mammals', 'fish', 'flowers', 'food_containers',
       'fruit_and_vegetables', 'household_electrical_devices',
       'household_furniture', 'insects', 'large_carnivores',
       'large_man-made_outdoor_things', 'large_natural_outdoor_scenes',
       'large_omnivores_and_herbivores', 'medium_mammals',
       'non-insect_invertebrates', 'people', 'reptiles', 'small_mammals',
       'trees', 'vehicles_1', 'vehicles_2'], dtype='<U30')
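
Both access styles also accept a range, which means a node does not have to be loaded in full. The following is a minimal sketch rather than part of the original notebook: it writes a tiny throwaway file (the path and the `labels` node are made up for the demo) and then reads it back in three ways.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Throwaway demo file (hypothetical path and node name).
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with tb.open_file(path, 'w') as h5:
    h5.create_array('/', 'labels', np.arange(20, dtype=np.int32))

with tb.open_file(path) as h5:
    node = h5.root.labels
    n_rows = node.nrows                   # number of rows, no data read
    everything = node.read()              # whole node as a NumPy array
    first_five = node[:5]                 # slice: only rows 0..4 are read
    middle = node.read(start=5, stop=10)  # same idea through read()

print(n_rows, first_five, middle)
```

Opening the file in a with block also closes it automatically, which the interactive h5c handle used in this article does not do.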

fine_label_names = cifar100.fine_label_names[:].astype('str')
fine_label_names

array(['apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee',
       'beetle', 'bicycle', 'bottle', 'bowl', 'boy', 'bridge', 'bus',
       'butterfly', 'camel', 'can', 'castle', 'caterpillar', 'cattle',
       'chair', 'chimpanzee', 'clock', 'cloud', 'cockroach', 'couch',
       'crab', 'crocodile', 'cup', 'dinosaur', 'dolphin', 'elephant',
       'flatfish', 'forest', 'fox', 'girl', 'hamster', 'house',
       'kangaroo', 'keyboard', 'lamp', 'lawn_mower', 'leopard', 'lion',
       'lizard', 'lobster', 'man', 'maple_tree', 'motorcycle', 'mountain',
       'mouse', 'mushroom', 'oak_tree', 'orange', 'orchid', 'otter',
       'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'plate',
       'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'road',
       'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk',
       'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar',
       'sunflower', 'sweet_pepper', 'table', 'tank', 'telephone',
       'television', 'tiger', 'tractor', 'train', 'trout', 'tulip',
       'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'woman',
       'worm'], dtype='<U13')

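'testX' and 'trainX' hold the test data and the training data respectively, and the other nodes are named in the same way.

For example, we can take a look at the training data and labels: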

trainX = cifar100.trainX
train_coarse_labels = cifar100.train_coarse_labels
train_coarse_labels[:]  # read the labels into a NumPy array

array([11, 15,  4, ...,  8,  7,  1])
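
Because the labels are integer ids and the names are stored in id order, NumPy indexing is enough to turn one into the other. A small sketch with made-up stand-in arrays (three names and four labels, not the real Cifar100 contents):

```python
import numpy as np

# Stand-ins (assumption): names as stored in the file (bytes) and a few label ids.
coarse_names = np.array([b'aquatic_mammals', b'fish', b'flowers']).astype(str)
labels = np.array([2, 0, 1, 2], dtype=np.int32)

# Indexing the name array with the label array gives one name per sample.
readable = coarse_names[labels]
print(readable)

# Number of samples per class id.
counts = np.bincount(labels, minlength=len(coarse_names))
print(counts)
```

With the real file, the same indexing would use the arrays read from coarse_label_names and train_coarse_labels.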

trainX has shape (50000, 32, 32, 3). To fetch the data, we can again use either indexing or read():

train_data = trainX[:]
print(train_data[0].shape)
print(train_data.dtype)

(32, 32, 3)
uint8

Of course, we can also compute on trainX directly:

for x in cifar100.trainX:
    y = x * 2
    break

print(y.shape)

(32, 32, 3)
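
One thing to watch when computing on these images: they are stored as uint8, so arithmetic such as x * 2 stays in uint8 and wraps around past 255. The sketch below uses a few made-up pixel values to show the effect and the usual workaround of casting to a float type first.

```python
import numpy as np

# Hypothetical pixel values, stored as uint8 like the images in the file.
x = np.array([[100, 200], [127, 128]], dtype=np.uint8)

wrapped = x * 2                            # still uint8: 200 * 2 = 400 wraps to 144
scaled = x.astype(np.float32) * 2          # cast first, then scale: 400.0
normalized = x.astype(np.float32) / 255.0  # common preprocessing to [0, 1]

print(wrapped.dtype, wrapped.tolist())
print(scaled.dtype, scaled.tolist())
```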

h5c.get_node(h5c.root.cifar100, 'trainX')

/cifar100/trainX (Array(50000, 32, 32, 3)) '訓練數據'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None

Going one step further, we can define an iterator to fetch the data in batches:

trainX = cifar100.trainX
train_coarse_labels = cifar100.train_coarse_labels

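def data_iter(X, Y, batch_size):
    n = X.nrows
    idx = np.arange(n)
    # Shuffle only training nodes (those whose name starts with 'train').
    if X.name.startswith('train'):
        np.random.shuffle(idx)
    for i in range(0, n, batch_size):
        # Row indices of the current batch; the last batch may be smaller.
        k = idx[i: min(n, i + batch_size)].tolist()
        yield np.take(X, k, 0), np.take(Y, k, 0)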

for x, y in data_iter(trainX, train_coarse_labels, 8):
    print(x.shape, y)
    break

(8, 32, 32, 3) [ 7  7  0 15  4  8  8  3]
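
A possible variation on this iterator, offered as a sketch rather than as the author's method: shuffle the order of the batches and read each batch as one contiguous slice, then shuffle the samples inside the batch in memory. It only needs len() and slicing, so it should work on NumPy arrays and on sliceable on-disk nodes alike. The name chunked_data_iter, its parameters, and the stand-in arrays are all hypothetical. The trade-off is a weaker shuffle, since neighbouring samples always land in the same batch.

```python
import numpy as np


def chunked_data_iter(X, Y, batch_size, shuffle=True, seed=None):
    """Yield (x, y) batches, reading each batch as one contiguous slice."""
    rng = np.random.default_rng(seed)
    n = len(X)
    starts = np.arange(0, n, batch_size)
    if shuffle:
        rng.shuffle(starts)               # visit the batches in random order
    for start in starts:
        start = int(start)
        stop = min(n, start + batch_size)
        xb, yb = X[start:stop], Y[start:stop]
        if shuffle:
            perm = rng.permutation(stop - start)
            xb, yb = xb[perm], yb[perm]   # shuffle inside the batch, in memory
        yield xb, yb


# Stand-in data (assumption): 10 samples with 2 features each, labels 0..9,
# built so that the first feature of sample i equals 2 * i.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

batches = list(chunked_data_iter(X, Y, batch_size=4, seed=0))
for xb, yb in batches:
    print(xb.shape, yb)
```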
