TensorFlow Data Input: TFRecords Explained / Image Preprocessing with TFRecords

阿新 • Published: 2018-12-09

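1. Overview

In TensorFlow, data can be preprocessed with TFRecords as well as with tf.data.Dataset. Compared with preprocessing on the fly in tf.data.Dataset, TFRecords has the following advantages and disadvantages (my own summary):

Saves preprocessing compute during training. With TFRecords, the raw data is preprocessed once and saved in a specific format as a TFRecords file. Training then only has to read the stored data, which can save a considerable amount of computation.
Preprocessing logic can be arbitrarily complex. With TFRecords, preprocessing can be written in any Python code rather than being limited to TensorFlow's predefined operations, which leaves a lot of flexibility.
Lower memory use during training. This is probably because no complex preprocessing has to run at training time.
The processed data can be several times larger than the raw data. This applies to image data: image files are compressed, so the originals are fairly small. After preprocessing, pixel values are stored as floats or integers, so the processed data can be several times larger than the original files.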
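
To make the size trade-off concrete, the back-of-the-envelope sketch below computes how much space one preprocessed image needs when every value is stored as a 4-byte float, using the 227x227x3 shape defined in globals.py later in this article. The helper raw_size_bytes is a hypothetical name used only for illustration, and the figure ignores the small per-record overhead of the TFRecords format.

```python
# Rough size of one preprocessed image, ignoring TFRecords framing overhead.
IMAGE_SHAPE = (227, 227, 3)  # same shape as in globals.py


def raw_size_bytes(shape, bytes_per_value):
    """Number of bytes needed to store one value per pixel channel."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


as_float32 = raw_size_bytes(IMAGE_SHAPE, 4)  # a float_list stores 4-byte floats
as_uint8 = raw_size_bytes(IMAGE_SHAPE, 1)    # one byte per value, uncompressed

print(as_float32)              # 618348 bytes, roughly 604 KiB per image
print(as_float32 // as_uint8)  # 4
```

A compressed PNG of the same image is usually smaller than even the uncompressed uint8 figure, which is where the "several times larger" estimate comes from.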

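This article works through a practical example of how to use TFRecords. Section 2 covers how to preprocess data (using images as the example) and save it to a TFRecords file. Section 3 covers how to read the TFRecords file and use the data in TensorFlow and Keras.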

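2. Preprocessing the data

In my project, the preprocessing requirements are: first read image paths and their labels (the class each image belongs to) from a text file; then load each image from its path and scale the pixel values from [0, 255] to [-1.0, 1.0]; finally save the processed data and the corresponding labels to a TFRecords file. The text file is named train.txt. Each line represents one image sample and consists of the image path and its label. A few lines are shown below: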

data/M-PIE/test/001/001_01_01_051_09.png 0
data/M-PIE/test/001/001_01_01_051_10.png 0
data/M-PIE/test/002/002_01_01_051_19.png 1
data/M-PIE/test/002/002_01_01_051_09.png 1
data/M-PIE/test/003/003_01_01_051_14.png 2
data/M-PIE/test/003/003_01_01_051_03.png 2
data/M-PIE/test/004/004_01_01_051_05.png 3
data/M-PIE/test/004/004_01_01_051_06.png 3
...
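
A list file in this format can be parsed with a few lines of plain Python. The sketch below is only an illustration of the file layout (the article itself uses np.loadtxt in Section 2.3); parse_list and SAMPLE are hypothetical names, and the code assumes that the label is always the last whitespace-separated field.

```python
# Parse lines of the form "<image path> <integer label>".
SAMPLE = """\
data/M-PIE/test/001/001_01_01_051_09.png 0
data/M-PIE/test/002/002_01_01_051_19.png 1
"""


def parse_list(text):
    """Return a list of (path, label) tuples, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so the label is the final field.
        path, label = line.rsplit(None, 1)
        pairs.append((path, int(label)))
    return pairs


pairs = parse_list(SAMPLE)
print(pairs[0])  # ('data/M-PIE/test/001/001_01_01_051_09.png', 0)
```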

2.1. Constants

All constants are defined in a single globals.py file so that they are not scattered throughout the code and are easy to change later.

# coding=utf-8
# Python 3 compatibility
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import random
import numpy as np
import tensorflow as tf

# ----------------------- Constants --------------------------
# Random seed (an arbitrary value)
SEED = 1213

# Number of classes, which is also the number of units in the network's last layer
NUM_CLASSES = 285

# Shape of an image after preprocessing
IMAGE_SHAPE = (227, 227, 3)
# Number of pixel values in a preprocessed image
IMAGE_SIZE = IMAGE_SHAPE[0] * IMAGE_SHAPE[1] * IMAGE_SHAPE[2]

# Maximum pixel value of an unprocessed image
IMAGE_DEPTH = 255

# Number of training epochs
NUM_TRAIN_EPOCH = 400
# Training batch size
TRAIN_BATCH_SIZE = 128

# Path of the text file listing image-label pairs for the training set
TRAIN_LIST = 'data/train.txt'
# Where the preprocessed training set is saved
TRAIN_TFRECORDS = 'data/train.tfrecords'

# Path of the text file listing image-label pairs for the validation set
VAL_LIST = 'data/test.txt'
# Where the preprocessed validation set is saved
VAL_TFRECORDS = 'data/test.tfrecords'

# ------------------------------------------------------

def set_seed():
    """
    Fix the random seeds so that random processes give the same,
    reproducible result on every run.
    """
    os.environ['PYTHONHASHSEED'] = str(SEED)
    np.random.seed(seed=SEED)
    tf.set_random_seed(seed=SEED)
    random.seed(SEED)
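
The effect of fixing a seed can be checked in isolation. The sketch below uses only the standard random module rather than NumPy or TensorFlow, and shuffled is a hypothetical helper; it simply shows that two shuffles started from the same seed produce the same order.

```python
import random

SEED = 1213  # same arbitrary value as in globals.py


def shuffled(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result


first = shuffled(range(10), SEED)
second = shuffled(range(10), SEED)
print(first == second)  # True: the same seed gives the same order
```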

2.2. Imports

From here on, all code lives in preprocess.py. The whole file is under 100 lines.

# coding=utf-8
# The three __future__ imports below keep Python 2 code compatible with Python 3
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
import cv2

import globals as _g
# Fix the random seeds
_g.set_seed()

2.3. Reading image-label pairs from train.txt

This part is very simple: a single NumPy function does the job reliably.

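def main(list_name, record_file_name):
    """
    Preprocess the images and save the preprocessed data to a TFRecords file.
    :param list_name: path of the text file containing image-label pairs
    :param record_file_name: path of the TFRecords file to write
    """
    # Read the image-label pairs; the result looks like
    # [[path1, label1], [path2, label2], ...]
    lists_and_labels = np.loadtxt(list_name, dtype=str).tolist()
    # Shuffle the dataset, treating each image-label pair as one unit
    np.random.shuffle(lists_and_labels)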

2.4. Preprocessing and saving the images

This code is the rest of the main function. It is quite simple, so here is the code first:

    # Create a TFRecordWriter, which writes the TFRecords file
    writer = tf.python_io.TFRecordWriter(record_file_name)

    for file_name, label in lists_and_labels:
        # Read and preprocess the image with read_image; the result is a numpy array
        img = read_image(file_name)
        # Reshape img from _g.IMAGE_SHAPE to [_g.IMAGE_SIZE, ]
        img_reshape = np.reshape(img, [_g.IMAGE_SIZE, ])
        print(file_name, img.shape, img_reshape.shape)

        # Build the feature dictionary. Only the label and the raw image data
        # are needed here; to store the image path as well, add one more feature.
        feature = {
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            'image_raw': tf.train.Feature(float_list=tf.train.FloatList(value=img_reshape.tolist()))
        }

        # Create an Example
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        # Write the serialized Example to the file
        writer.write(example.SerializeToString())

    writer.close()

About read_image
The read_image function used above reads the image with OpenCV, then uses NumPy to convert the data type and scale the values from [0, 255] to [-1.0, 1.0]:

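def read_image(file_name):
    """
    Read and preprocess an image.
    :param file_name: path of the image
    :return: numpy array with shape _g.IMAGE_SHAPE
    """
    # Read the image; img is a numpy array with dtype=np.uint8.
    # IMREAD_UNCHANGED keeps the file's own channel count, so this assumes
    # that the source images have 3 channels.
    img = cv2.imread(file_name, cv2.IMREAD_UNCHANGED)
    # Resize img. cv2.resize expects the size as (width, height); the target
    # is square here, so the order does not matter.
    img = cv2.resize(img, _g.IMAGE_SHAPE[0:2])
    # Convert the data type of img
    img = img.astype(dtype=np.float32)
    # Scale the pixel values from [0, 255] to [-1.0, 1.0]
    img -= _g.IMAGE_DEPTH / 2
    img /= _g.IMAGE_DEPTH / 2
    return img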
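
The scaling step is plain arithmetic: subtract half of the maximum value, then divide by the same amount. The sketch below checks the end points on single numbers without OpenCV or NumPy; scale_pixel is a hypothetical helper, not part of preprocess.py.

```python
IMAGE_DEPTH = 255  # maximum raw pixel value, as in globals.py


def scale_pixel(value):
    """Map a raw pixel value in [0, 255] to [-1.0, 1.0]."""
    half = IMAGE_DEPTH / 2.0  # 127.5
    return (value - half) / half


print(scale_pixel(0))    # -1.0
print(scale_pixel(255))  # 1.0
```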

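About tf.train.Feature
A tf.train.Example is made up of many tf.train.Feature entries (roughly speaking). A tf.train.Feature accepts one of the following three kinds of data, and most other types can be converted to one of them:

bytes_list (string, byte)
float_list (float32, float64)
int64_list (bool, enum, int32, uint32, int64, uint64)

To convert a standard type into a tf.train.Feature, the following helper functions can be used: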

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Note that the value argument of each tf.train.*List class is a list. Some examples of using the functions above:

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))
print(_float_feature(np.exp(1)))
print(_int64_feature(True))
print(_int64_feature(1))

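In this article the label is stored as an int64_list and the image data as a float_list. I chose float_list for two reasons: it keeps the reading code simpler, and it saves CPU time when reading. The drawback is that the file takes up more space.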

2.5. Calling main

Call main on the training set and the validation set to run the preprocessing:

if __name__ == '__main__':
    main(_g.TRAIN_LIST, _g.TRAIN_TFRECORDS)
    main(_g.VAL_LIST, _g.VAL_TFRECORDS)

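That completes the data preprocessing.

3. Reading the preprocessed data

This section shows how to read a TFRecords file with tf.data.TFRecordDataset. A TFRecordDataset can serve as the input both to a model written in plain TensorFlow and to a Keras model, which is very convenient. There are other ways to read TFRecords files, but they are not covered here. The code in this section lives in inputs_tfrecords.py.

3.1. Imports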

# coding=utf-8
# Python 3 compatibility
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import multiprocessing as mt
import tensorflow as tf
import cv2
import globals as _g

_g.set_seed()

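3.2. Defining the TFRecordDataset

TFRecordDataset is very similar to tf.data.Dataset, so it is not described in detail here. For an explanation of the functions used, see Section 2 of my other blog post, tf.data.Dataset影象預處理詳解 (tf.data.Dataset image preprocessing explained).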

def prepare_dataset(record_name, list_name):
    """
    Create a dataset from the TFRecords file given by record_name.
    :param record_name: path of the TFRecords file
    :param list_name: path of the image-label list file that corresponds to record_name
    """
    # Create the TFRecordDataset
    dataset = tf.data.TFRecordDataset([record_name])
    # Call _parse_function on every record to decode the TFRecords data
    dataset = dataset.map(_parse_function, mt.cpu_count())
    # Set the batch size (important)
    dataset = dataset.batch(_g.TRAIN_BATCH_SIZE)
    # Repeat the dataset indefinitely
    dataset = dataset.repeat()
    # Return the dataset and the number of steps in one epoch
    return dataset, compute_steps(list_name)

About _parse_function
_parse_function parses one TFRecords record. It is implemented as follows:

def _parse_function(record):
    # Feature dictionary, matching the one used when writing the TFRecords file
    features = {
        'label': tf.FixedLenFeature([], tf.int64, default_value=0),
        'image_raw': tf.FixedLenFeature([_g.IMAGE_SIZE, ], tf.float32,)
    }

    # Parse a single record (one Example as it was saved) using the features above
    example = tf.parse_single_example(record, features)

    # Reshape the image from [_g.IMAGE_SIZE, ] back to _g.IMAGE_SHAPE
    image = tf.reshape(example['image_raw'], _g.IMAGE_SHAPE)

    # If the dataset is passed to Keras functions such as model.fit, the label
    # needs to be one-hot encoded. In plain TensorFlow this is usually not
    # needed, and example['label'] can be returned directly.
    one_hot_label = tf.one_hot(example['label'], _g.NUM_CLASSES)

    return image, one_hot_label

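The first argument of tf.FixedLenFeature is the shape of the feature (its number of elements). For a single integer, pass [] and optionally set default_value to 0. For a list of many values, pass the length of the feature (it must match the number of values stored in Section 2.4); setting default_value is not recommended in that case. The second argument is the feature's data type.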
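
The stored length has to equal the product of the image dimensions because the writer flattens the array in row-major order and the reader reshapes it back in the same order. The index arithmetic behind that round trip can be sketched without TensorFlow. Here flat_index is a hypothetical helper, and the tiny shape is chosen only to keep the output readable.

```python
SHAPE = (2, 3, 3)  # (rows, cols, channels); a tiny stand-in for (227, 227, 3)
SIZE = SHAPE[0] * SHAPE[1] * SHAPE[2]


def flat_index(row, col, channel, shape=SHAPE):
    """Position of element [row, col, channel] in the flattened array."""
    _, cols, channels = shape
    return (row * cols + col) * channels + channel


# Visit every element in row-major order and collect its flat position.
indices = [flat_index(r, c, ch)
           for r in range(SHAPE[0])
           for c in range(SHAPE[1])
           for ch in range(SHAPE[2])]

print(indices == list(range(SIZE)))  # True: each slot is used exactly once
```

If the length given to the reader differed from SIZE, some values would be missing or left over, which is why the two numbers have to agree.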

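About compute_steps
compute_steps returns how many steps one epoch of training needs. The calculation is simple: count the samples in the list_name file that corresponds to record_name (one sample per line, so this is the number of lines), divide by the batch size, and round up: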

def compute_steps(list_name):
    # Read all image-label pairs
    lists_and_labels = np.loadtxt(list_name, dtype=str).tolist()
    # Divide by the batch size and round up
    return np.ceil(len(list(lists_and_labels)) / _g.TRAIN_BATCH_SIZE).astype(np.int32)
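
As a worked example of the rounding, the sketch below applies the same formula to a few made-up sample counts. The counts and the helper name steps_per_epoch are assumptions for illustration; only the batch size of 128 comes from globals.py.

```python
import math

TRAIN_BATCH_SIZE = 128  # as in globals.py


def steps_per_epoch(num_samples, batch_size=TRAIN_BATCH_SIZE):
    """Number of batches needed to see every sample once."""
    return int(math.ceil(num_samples / float(batch_size)))


print(steps_per_epoch(1000))  # 8: seven full batches plus one of 104
print(steps_per_epoch(1024))  # 8: exactly eight full batches
print(steps_per_epoch(1025))  # 9
```

Rounding up means the last batch of an epoch can contain fewer than 128 samples.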

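3.3. Verifying that the data is read correctly

Checking the preprocessing is fairly simple: fetch images and labels from the dataset, save the images to disk, and inspect them.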

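def save_image(file_name, image):
    """
    Save image to the location given by file_name.
    """
    # Scale the values from [-1.0, 1.0] back to [0, 255]
    image *= _g.IMAGE_DEPTH / 2
    image += _g.IMAGE_DEPTH / 2
    # Round before converting the type, because astype truncates toward zero
    image = np.rint(image).astype(dtype=np.uint8)
    # Write the image to disk
    cv2.imwrite(file_name, image)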
    
def inputs_test():
    dataset, steps = prepare_dataset(_g.TRAIN_TFRECORDS, _g.TRAIN_LIST)

    print('shapes:', dataset.output_shapes)
    print('types:', dataset.output_types)
    print('steps: ', steps)

    next_op = dataset.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        for i in range(10):
            image, label = sess.run(next_op)
            print(image.shape, label.shape)
            save_image('logs/%d.png' % i, image[0])
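
Restoring pixel values is the inverse of the scaling in read_image, with one subtlety: the stored values are 32-bit floats, so a restored value may come back slightly below the original integer, and a cast that truncates would then be off by one. The sketch below emulates 32-bit storage with the standard struct module (an approximation of what the pipeline does, not the same code path) and shows that rounding recovers all 256 values. The names to_float32 and restore are hypothetical.

```python
import struct

HALF = 255 / 2.0  # 127.5


def to_float32(value):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', value))[0]


def restore(scaled):
    """Invert the scaling and round to the nearest integer pixel value."""
    return int(round(scaled * HALF + HALF))


scaled = [to_float32((v - HALF) / HALF) for v in range(256)]
restored = [restore(s) for s in scaled]
# Count how many values plain truncation (int) would get wrong instead.
truncation_errors = sum(int(s * HALF + HALF) != v
                        for v, s in zip(range(256), scaled))

print(restored == list(range(256)))  # True
print(truncation_errors)
```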

3.4. Using the dataset in a TensorFlow model

The idea is to fetch data from the dataset and pass it to sess.run through feed_dict:

import inputs_tfrecords
...

def train():
    # Training set
    dataset, steps = inputs_tfrecords.prepare_dataset(_g.TRAIN_TFRECORDS, _g.TRAIN_LIST)
    # Validation set
    val_dataset, val_steps = inputs_tfrecords.prepare_dataset(_g.VAL_TFRECORDS, _g.VAL_LIST)

    print('shapes:', dataset.output_shapes)
    print('types:', dataset.output_types)
    print('steps: ', steps)

    # Build the input shape. The batch dimension is None because the last
    # batch of an epoch can be smaller than TRAIN_BATCH_SIZE.
    shape = [None] + list(_g.IMAGE_SHAPE)
    # Define the placeholders (labels are one-hot, see _parse_function)
    img = tf.placeholder(tf.float32, shape=shape, name='image')
    lab = tf.placeholder(tf.float32, shape=[None, _g.NUM_CLASSES], name='label')
    # Define the training op
    train_op = ...


    # Train
    next_op = dataset.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        for i in range(steps):
            image, label = sess.run(next_op)
            print(image.shape, label.shape)
            # Feed the placeholders themselves, not their names
            sess.run([train_op], feed_dict={img: image, lab: label})
            ...

3.5. Using the dataset in Keras

This is even simpler:

import inputs_tfrecords
...

def train():
    # Training set
    dataset, steps = inputs_tfrecords.prepare_dataset(_g.TRAIN_TFRECORDS, _g.TRAIN_LIST)
    # Validation set
    val_dataset, val_steps = inputs_tfrecords.prepare_dataset(_g.VAL_TFRECORDS, _g.VAL_LIST)

    print('shapes:', dataset.output_shapes)
    print('types:', dataset.output_types)
    print('steps: ', steps)

    # Build the model
    model = tf.keras.Sequential()
    ...
    # Train
    model.fit(dataset, epochs=_g.NUM_TRAIN_EPOCH, steps_per_epoch=steps,
              validation_data=val_dataset, validation_steps=val_steps)
