TensorFlow Importing Data

    The tf.data API lets you build complex input pipelines. For example, an image pipeline might aggregate data from a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. A text-model pipeline might involve extracting symbols from raw text, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it easier to handle large amounts of data, different data formats, and complicated transformations.

    The tf.data API introduces two abstractions:

    (1) tf.data.Dataset represents a sequence of elements, in which each element contains one or more Tensor objects. For example, in an image pipeline, an element might be a single training example, with a pair of tensors representing the image data and its label. There are two distinct ways to create a dataset:

        1) Creating a source (for example, Dataset.from_tensor_slices()) constructs a dataset from one or more tf.Tensor objects.

        2) Applying a transformation (for example, Dataset.batch()) constructs a dataset from one or more tf.data.Dataset objects.

    (2) tf.data.Iterator provides the main way to extract elements from a dataset. Iterator.get_next() yields the next element of a Dataset when executed, and acts as the interface between the input pipeline and the model. The simplest iterator is the "one-shot iterator", which is tied to a particular Dataset and iterates through it once. For more sophisticated uses, the Iterator.initializer operation lets you reinitialize and parameterize an iterator with different datasets, so that you can, for example, iterate over training and validation data multiple times within the same program.

1. Basic mechanics

    This section describes the fundamentals of creating different kinds of Dataset and Iterator objects, and how to extract data from them.

    To start an input pipeline, you must first define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset.

    Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on tf.data.Dataset. For example, you can apply per-element transformations such as Dataset.map() (which applies a function to each element) and multi-element transformations such as Dataset.batch(). See tf.data.Dataset for the complete list of transformations.

    The most common way to consume values from a Dataset is to make an iterator object that provides access to one element of the dataset at a time (for example, by calling Dataset.make_one_shot_iterator()). A tf.data.Iterator provides two operations: Iterator.initializer, which initializes the state of the iterator, and Iterator.get_next(), which returns tf.Tensor objects representing the next element. Depending on your use case, you might choose a different kind of iterator; the options are described below.
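
    As a minimal sketch of this flow (the tensor values, functions, and batch size below are illustrative assumptions, not taken from the text above):

import tensorflow as tf

# Source: build a Dataset from an in-memory tensor.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(6))

# Chained transformations: square each element, then group pairs into batches.
dataset = dataset.map(lambda x: x * x).batch(2)

# Iterator: the interface between the input pipeline and the rest of the graph.
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_batch))  # ==> "[0 1]"
  print(sess.run(next_batch))  # ==> "[4 9]"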

  Dataset structure

    A dataset comprises elements that all have the same structure. An element contains one or more tf.Tensor objects, called components. Each component has a tf.DType representing the type of elements in the tensor, and a tf.TensorShape representing the static shape of each element. The Dataset.output_types and Dataset.output_shapes properties let you inspect the type and shape of each component of a dataset element. The nested structure of these properties maps to the structure of an element, which may be a single tensor, a tuple of tensors, or a nested tuple of tensors. For example:

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)  # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random_uniform([4]),
    tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)  # ==> (tf.float32, (tf.float32, tf.int32))
print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"

    It is often convenient to give a name to each component of an element, for example when they represent different features of a training example. In addition to tuples, you can use collections.namedtuple or a dictionary mapping strings to tensors to represent a single element of a Dataset.

dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"
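
    The paragraph above also mentions collections.namedtuple; a hedged sketch of that variant (the Example name, its fields, and the values are illustrative assumptions):

import collections

# Hypothetical named structure for a single element.
Example = collections.namedtuple("Example", ["feature", "label"])

dataset = tf.data.Dataset.from_tensor_slices(
    Example(feature=tf.random_uniform([4, 10]),
            label=tf.random_uniform([4], maxval=10, dtype=tf.int32)))
print(dataset.output_types)   # ==> "Example(feature=tf.float32, label=tf.int32)"
print(dataset.output_shapes)  # ==> "Example(feature=(10,), label=())"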

    Dataset transformations support datasets of any structure. When using the Dataset.map(), Dataset.flat_map(), and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of that function:

dataset1 = dataset1.map(lambda x: ...)

dataset2 = dataset2.flat_map(lambda x, y: ...)

# Note: Argument destructuring is not available in Python 3.
dataset3 = dataset3.filter(lambda x, (y, z): ...)
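
    As a hedged sketch, concrete versions of these transformations for the dataset1, dataset2, and dataset3 defined above might look like the following (the function bodies are illustrative assumptions, not from the original text); note that in Python 3 the nested pair arrives as a single tuple argument:

# dataset1 elements are single float32 tensors of shape (10,).
dataset1 = dataset1.map(lambda x: x * 2.0)

# dataset2 elements are pairs (scalar float32, int32 vector of shape (100,)).
# `flat_map` must return a Dataset; here each pair is flattened into the 100
# scalar entries of its vector component.
dataset2 = dataset2.flat_map(
    lambda x, y: tf.data.Dataset.from_tensor_slices(y))

# dataset3 elements have the structure (x, (y, z)); in Python 3 the nested pair
# arrives as a single tuple argument, so it is indexed inside the function.
dataset3 = dataset3.filter(lambda x, pair: tf.reduce_all(pair[1] < 50))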

  Creating an iterator

    Once you have built a Dataset to represent your input data, the next step is to create an Iterator to access elements from that dataset. The tf.data API supports the following iterators, in increasing order of sophistication:

  • one-shot
  • initializable
  • reinitializable
  • feedable

    A one-shot iterator is the simplest form of iterator: it only supports iterating once through a dataset, with no need for explicit initialization. One-shot iterators handle almost all of the cases that queue-based input pipelines support, but they do not support parameterization. Using the Dataset.range() example:

dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

for i in range(100):
  value = sess.run(next_element)
  assert i == value

    An initializable iterator requires you to run an explicit iterator.initializer operation before using it. In exchange for this inconvenience, it lets you parameterize the definition of the dataset using one or more tf.placeholder() tensors that are fed when the iterator is initialized.

max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Initialize an iterator over a dataset with 10 elements.
sess.run(iterator.initializer, feed_dict={max_value: 10})
for i in range(10):
  value = sess.run(next_element)
  assert i == value

# Initialize the same iterator over a dataset with 100 elements.
sess.run(iterator.initializer, feed_dict={max_value: 100})
for i in range(100):
  value = sess.run(next_element)
  assert i == value

    A reinitializable iterator can be initialized from multiple different Dataset objects. For example, you might have a training input pipeline that applies random perturbations to the input images to improve generalization, and a validation input pipeline that evaluates predictions on unmodified data. These pipelines typically use different Dataset objects that have the same structure.

# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
for _ in range(20):
  # Initialize an iterator over the training dataset.
  sess.run(training_init_op)
  for _ in range(100):
    sess.run(next_element)

  # Initialize an iterator over the validation dataset.
  sess.run(validation_init_op)
  for _ in range(50):
    sess.run(next_element)

    A feedable iterator can be used together with tf.placeholder to select which Iterator to use in each call to tf.Session.run, via the familiar feed_dict mechanism. It offers the same functionality as a reinitializable iterator, but it does not require you to initialize the iterator from the start of a dataset when you switch between iterators. For example, using the same training and validation datasets as in the example above, you can use tf.data.Iterator.from_string_handle to define a feedable iterator that lets you switch between the two datasets:

# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64)).repeat()
validation_dataset = tf.data.Dataset.range(50)

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())

# Loop forever, alternating between training and validation.
while True:
  # Run 200 steps using the training dataset. Note that the training dataset is
  # infinite, and we resume from where we left off in the previous `while` loop
  # iteration.
  for _ in range(200):
    sess.run(next_element, feed_dict={handle: training_handle})

  # Run one pass over the validation dataset.
  sess.run(validation_iterator.initializer)
  for _ in range(50):
    sess.run(next_element, feed_dict={handle: validation_handle})

  Consuming values from an iterator

    The Iterator.get_next() method returns one or more tf.Tensor objects that correspond to the symbolic next element of the iterator. Each time these tensors are evaluated, they take the value of the next element in the underlying dataset; once the end of the dataset is reached, running Iterator.get_next() raises a tf.errors.OutOfRangeError, as the example below shows.

dataset = tf.data.Dataset.range(5)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Typically `result` will be the output of a model, or an optimizer's
# training operation.
result = tf.add(next_element, next_element)

sess.run(iterator.initializer)
print(sess.run(result))  # ==> "0"
print(sess.run(result))  # ==> "2"
print(sess.run(result))  # ==> "4"
print(sess.run(result))  # ==> "6"
print(sess.run(result))  # ==> "8"
try:
  sess.run(result)
except tf.errors.OutOfRangeError:
  print("End of dataset")  # ==> "End of dataset"

  Saving iterator state

    The tf.contrib.data.make_saveable_from_iterator function creates a SaveableObject from an iterator, which can be used to save and restore the current state of the iterator. A saveable object created this way can be added to the tf.train.Saver variables list or to the tf.GraphKeys.SAVEABLE_OBJECTS collection, and is then saved and restored in the same manner as a tf.Variable.

# Create saveable object from iterator.
saveable = tf.contrib.data.make_saveable_from_iterator(iterator)

# Save the iterator state by adding it to the saveable objects collection.
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)
saver = tf.train.Saver()

with tf.Session() as sess:

  if should_checkpoint:
    saver.save(sess, path_to_checkpoint)

# Restore the iterator state.
with tf.Session() as sess:
  saver.restore(sess, path_to_checkpoint)

2. Reading input data

    If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices():

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

    Note that the approach above embeds the data in the TensorFlow graph as tf.constant() operations, which can waste a lot of memory for large arrays. As an alternative, you can define the Dataset in terms of tf.placeholder() tensors and feed the NumPy arrays when you initialize an Iterator over the dataset.

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
dataset = ...
iterator = dataset.make_initializable_iterator()

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})

  Consuming TFRecord data

    The tf.data API supports a variety of file formats, so you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as part of an input pipeline.

# Creates a dataset that reads all of the examples from two files.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

    The filenames argument can be a string, a list of strings, or a tf.Tensor of strings. Therefore, when you have two sets of files for training and validation purposes, you can represent the filenames with a tf.placeholder(tf.string) and initialize the iterator from the appropriate filenames:

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()

# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.

# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})

  Consuming text data

    Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files: given one or more filenames, it produces one string-valued element per line of those files.

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)

    By default, a TextLineDataset yields every line of each file, which may not be desirable if, for example, a file starts with a header line or contains comments. Such lines can be removed with the Dataset.skip() and Dataset.filter() transformations. To apply these transformations to each file separately, use Dataset.flat_map() to create a nested Dataset for each file:

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]

dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))

3. Preprocessing data with Dataset.map()

    The Dataset.map(f) transformation produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that is commonly applied to lists (and other structures) in functional programming languages.
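
    A minimal sketch of this pattern (the values here are illustrative):

dataset = tf.data.Dataset.range(5)      # elements: 0, 1, 2, 3, 4
dataset = dataset.map(lambda x: x * x)  # elements: 0, 1, 4, 9, 16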

  Parsing tf.Example protocol buffer messages

    Many input pipelines extract tf.train.Example protocol buffer messages from files in TFRecord format. Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.

# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
  features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
              "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features["image"], parsed_features["label"]

# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)

  Decoding image data and resizing it

    When training a neural network on real-world image data, it is often necessary to convert images of different sizes to a common size, so that they can be batched into a fixed shape:

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_jpeg(image_string)
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)

  Applying arbitrary Python logic with tf.py_func()

    For performance reasons, you should use TensorFlow operations to preprocess your data whenever possible. However, it is sometimes useful to call upon external Python libraries when parsing your input data. In that case, invoke the tf.py_func() operation in a Dataset.map() transformation.

import cv2

# Use a custom OpenCV function to read the image, instead of the standard
# TensorFlow `tf.read_file()` operation.
def _read_py_function(filename, label):
  image_decoded = cv2.imread(filename.decode(), cv2.IMREAD_GRAYSCALE)
  return image_decoded, label

# Use standard TensorFlow operations to resize the image to a fixed shape.
def _resize_function(image_decoded, label):
  image_decoded.set_shape([None, None, None])
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

filenames = ["/var/data/image1.jpg", "/var/data/image2.jpg", ...]
labels = [0, 37, 29, 1, ...]

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        _read_py_function, [filename, label], [tf.uint8, label.dtype])))
dataset = dataset.map(_resize_function)

4. Batching dataset elements

  Simple batching

    The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as tf.stack(): for each component i, all elements must have a tensor of exactly the same shape.

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

print(sess.run(next_element))  # ==> ([0, 1, 2,   3],   [ 0, -1,  -2,  -3])
print(sess.run(next_element))  # ==> ([4, 5, 6,   7],   [-4, -5,  -6,  -7])
print(sess.run(next_element))  # ==> ([8, 9, 10, 11],   [-8, -9, -10, -11])

  Batching tensors with padding

    Many models (for example, sequence models) work with input data that can have varying size. To handle this case, the Dataset.padded_batch() transformation lets you batch tensors of different shapes by specifying one or more dimensions in which they may be padded.

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

print(sess.run(next_element))  # ==> [[0, 0, 0], [1, 0, 0], [2, 2, 0], [3, 3, 3]]
print(sess.run(next_element))  # ==> [[4, 4, 4, 4, 0, 0, 0],
                               #      [5, 5, 5, 5, 5, 0, 0],
                               #      [6, 6, 6, 6, 6, 6, 0],
                               #      [7, 7, 7, 7, 7, 7, 7]]

5. Training workflows

  Processing multiple epochs

    The simplest way to iterate over a dataset for multiple epochs is to use the Dataset.repeat() transformation. For example, to create a dataset that repeats its input for 10 epochs:

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.repeat(10)
dataset = dataset.batch(32)
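
    Applying Dataset.repeat() with no arguments repeats the input indefinitely and gives no signal at the boundary between epochs. If you want to be notified at the end of each epoch, a common pattern (sketched below, not spelled out in the text above) is to omit repeat() and catch the tf.errors.OutOfRangeError that is raised when the iterator is exhausted, then re-initialize it:

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

num_epochs = 10
for _ in range(num_epochs):
  sess.run(iterator.initializer)
  while True:
    try:
      sess.run(next_element)
    except tf.errors.OutOfRangeError:
      break
  # [Perform end-of-epoch computations here.]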

  Randomly shuffling input data

    The Dataset.shuffle() transformation randomly shuffles the input dataset using an algorithm similar to tf.RandomShuffleQueue: it maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat()

  Using high-level APIs