High performance models in TensorFlow

This document and accompanying scripts detail how to build highly scalable models that target a variety of system types and network topologies. The techniques in this document utilize some low-level TensorFlow Python primitives. In the future, many of these techniques will be incorporated into high-level APIs.

Input Pipeline

The Performance Guide explains how to identify possible input pipeline issues and best practices. We found that using tf.FIFOQueue and tf.train.queue_runner could not saturate multiple current generation GPUs when using large inputs and processing at higher samples per second, such as training ImageNet with AlexNet. This is due to the use of Python threads as the underlying implementation; the overhead of Python threads is too large.

Another approach, which we have implemented in the scripts, is to build an input pipeline using the native parallelism in TensorFlow. Our implementation is made up of 3 stages:

  • I/O reads: Choose and read image files from disk.
  • Image Processing: Decode image records into images, preprocess, and organize into mini-batches.
  • CPU-to-GPU Data Transfer: Transfer images from CPU to GPU.

The dominant part of each stage is executed in parallel with the other stages using data_flow_ops.StagingArea. StagingArea is a queue-like operator similar to tf.FIFOQueue. The difference is that StagingArea offers simpler functionality and can be executed on both CPU and GPU in parallel with other stages. Breaking the input pipeline into 3 stages that operate independently in parallel is scalable and takes full advantage of large multi-core environments. The rest of this section details the stages, followed by details about using data_flow_ops.StagingArea.

Parallelize I/O Reads

data_flow_ops.RecordInput is used to parallelize reading from disk. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed into its own large internal pool and when it has loaded at least half of its capacity, it produces output tensors.

This op has its own internal threads that are dominated by I/O time and consume minimal CPU, which allows it to run smoothly in parallel with the rest of the model.

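As a rough illustration, the sketch below shows how RecordInput might be wired up; the file pattern and the parallelism/buffer values are placeholders, not tuned recommendations.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# A minimal sketch: read serialized TFRecord examples with background threads.
record_input = data_flow_ops.RecordInput(
    file_pattern='/data/imagenet/train-*-of-*',  # placeholder shard pattern
    parallelism=64,       # background reader threads (illustrative value)
    buffer_size=10000,    # internal record pool (illustrative value)
    batch_size=256,
    name='record_input')

# The yield op produces a batch of serialized records, which is split into
# individual scalar string tensors so each record gets its own chain of
# preprocessing ops.
records = record_input.get_yield_op()
records = tf.split(records, num_or_size_splits=256, axis=0)
records = [tf.reshape(record, []) for record in records]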

Parallelize Image Processing

After images are read from RecordInput they are passed as tensors to the image processing pipeline. To make the image processing pipeline easier to explain, assume that the input pipeline is targeting 8 GPUs with a batch size of 256 (32 per GPU).

256 records are read and processed individually in parallel. This starts with 256 independent RecordInput read ops in the graph. Each read op is followed by an identical set of ops for image preprocessing that are considered independent and executed in parallel. The image preprocessing ops include operations such as image decoding, distortion, and resizing.

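Continuing from the RecordInput sketch above, one possible shape of the per-record preprocessing is sketched below; the TFRecord feature names and the particular distortions are assumptions for illustration only.

import tensorflow as tf

def preprocess(serialized_example):
    # Assumed feature names; real datasets may use a different schema.
    features = tf.parse_single_example(
        serialized_example,
        features={'image/encoded': tf.FixedLenFeature([], tf.string),
                  'image/class/label': tf.FixedLenFeature([], tf.int64)})
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)  # decode
    image = tf.image.random_flip_left_right(image)                       # distort
    image = tf.image.resize_images(image, [224, 224])                    # resize
    return image, features['image/class/label']

# One independent chain of preprocessing ops per record; all 256 run in parallel.
processed = [preprocess(record) for record in records]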

Once the images are through preprocessing, they are concatenated together into 8 tensors, each with a batch size of 32. Rather than using tf.concat for this purpose, which is implemented as a single op that waits for all the inputs to be ready before concatenating them together, tf.parallel_stack is used. tf.parallel_stack allocates an uninitialized tensor as an output, and each input tensor is written to its designated portion of the output tensor as soon as the input is available.

When all the input tensors are finished, the output tensor is passed along in the graph. This effectively hides all the memory latency with the long tail of producing all the input tensors.

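A sketch of forming the 8 per-GPU tensors with tf.parallel_stack, using the processed records from the sketch above (the 8 GPU / 32 per GPU split mirrors the example in the text):

import tensorflow as tf

processed_images = [image for image, _ in processed]  # from the sketch above
per_gpu_batches = []
for gpu_index in range(8):
    shard = processed_images[gpu_index * 32:(gpu_index + 1) * 32]
    # Unlike tf.concat, parallel_stack writes each input into its slice of the
    # output tensor as soon as that input is ready, instead of waiting for all 32.
    per_gpu_batches.append(tf.parallel_stack(shard))  # shape [32, 224, 224, 3]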

Parallelize CPU-to-GPU Data Transfer

Continuing with the assumption that the target is 8 GPUs with a batch size of 256 (32 per GPU). Once the input images are processed and concatenated together by the CPU, we have 8 tensors each with a batch-size of 32.

TensorFlow enables tensors from one device to be used on any other device directly. TensorFlow inserts implicit copies to make the tensors available on any devices where they are used. The runtime schedules the copy between devices to run before the tensors are actually used. However, if the copy cannot finish in time, the computation that needs those tensors will stall and result in decreased performance.

In this implementation, data_flow_ops.StagingArea is used to explicitly schedule the copy in parallel. The end result is that when computation starts on the GPU, all the tensors are already available.

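A minimal sketch of the explicit copy for a single GPU (the scripts build one such staging area per GPU); the host tensors here are stand-ins for one GPU's preprocessed batch:

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

with tf.device('/cpu:0'):
    host_images = tf.random_uniform([32, 224, 224, 3])                 # stand-in batch
    host_labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)

with tf.device('/gpu:0'):
    gpu_stage = data_flow_ops.StagingArea(
        dtypes=[tf.float32, tf.int32],
        shapes=[[32, 224, 224, 3], [32]])
    # Running the put op performs the host-to-device copy; it is fetched as
    # part of each training step so the copy overlaps with other work.
    copy_op = gpu_stage.put([host_images, host_labels])
    # By the time these tensors are consumed by the model, they are already on the GPU.
    images, labels = gpu_stage.get()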

Software Pipelining

With all the stages capable of being driven by different processors, data_flow_ops.StagingArea is used between them so they run in parallel. StagingArea is a queue-like operator similar to tf.FIFOQueue that offers simpler functionality and can be executed on both CPU and GPU.

Before the model starts running all the stages, the input pipeline stages are warmed up to prime the staging buffers in between with one set of data. During each run step, one set of data is read from the staging buffers at the beginning of each stage, and one set is pushed at the end.

For example, suppose there are three stages, A, B, and C, with two staging areas, S1 and S2, in between. During the warm up, we run:

Warm up:
Step 1: A0
Step 2: A1  B0

Actual execution:
Step 3: A2  B1  C0
Step 4: A3  B2  C1
Step 5: A4  B3  C2

After the warm up, S1 and S2 each have one set of data in them. For each step of the actual execution, one set of data is consumed from each staging area, and one set is added to each.

Benefits of using this scheme:

  • All stages are non-blocking, since the staging areas always have one set of data after the warm up.
  • Each stage can run in parallel since they can all start immediately.
  • The staging buffers have a fixed memory overhead. They will have at most one extra set of data.
  • Only a single session.run() call is needed to run all stages of the step, which makes profiling and debugging much easier.

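A minimal, single-GPU sketch of the warm-up scheme (one staging area shown; the scripts place one at each stage boundary, per GPU):

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Stand-ins for the output of the previous stage (e.g. the input pipeline).
images = tf.random_uniform([32, 224, 224, 3])
labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)

stage = data_flow_ops.StagingArea(dtypes=[tf.float32, tf.int32],
                                  shapes=[[32, 224, 224, 3], [32]])
put_op = stage.put([images, labels])        # end of the producing stage
staged_images, staged_labels = stage.get()  # start of the consuming stage
loss = tf.reduce_mean(staged_images)        # stand-in for the compute stage

with tf.Session() as sess:
    sess.run(put_op)  # warm up: prime the staging buffer with one set of data
    for _ in range(100):
        # A single run() consumes one staged set and pushes the next one,
        # so both stages run in parallel and neither blocks on an empty buffer.
        sess.run([loss, put_op])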

Best Practices in Building High-Performance Models

Collected below are a couple of additional best practices that can improve performance and increase the flexibility of models.

Build the model with both NHWC and NCHW

Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. On GPU, NCHW is faster. But on CPU, NHWC is sometimes faster.

Building a model to support both data formats keeps the model flexible and capable of operating optimally regardless of platform. The benchmark script was written to support both NCHW and NHWC. NCHW should always be used when training with GPUs. NHWC is sometimes faster on CPU. A flexible model can be trained on GPUs using NCHW, with inference done on CPU using NHWC and the weights obtained from training.

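As a rough sketch of parameterizing the layout (using the 'channels_first'/'channels_last' names that tf.layers uses for NCHW/NHWC; the layer and shapes are only illustrative):

import tensorflow as tf

def conv_relu(inputs, filters, data_format):
    # tf.layers uses 'channels_first' (NCHW) and 'channels_last' (NHWC).
    fmt = 'channels_first' if data_format == 'NCHW' else 'channels_last'
    return tf.layers.conv2d(inputs, filters, kernel_size=3, padding='same',
                            data_format=fmt, activation=tf.nn.relu)

# NCHW when training on GPU, NHWC otherwise; the variables are identical either way.
data_format = 'NCHW' if tf.test.is_built_with_cuda() else 'NHWC'
shape = [32, 3, 224, 224] if data_format == 'NCHW' else [32, 224, 224, 3]
net = conv_relu(tf.random_uniform(shape), filters=64, data_format=data_format)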

Use Fused Batch-Normalization

The default batch-normalization in TensorFlow is implemented as composite operations. This is very general, but often leads to suboptimal performance. An alternative is to use fused batch-normalization, which often has much better performance on GPU. Below is an example of using tf.contrib.layers.batch_norm to implement fused batch-normalization.

bn = tf.contrib.layers.batch_norm(
          input_layer, fused=True, data_format='NCHW',
          scope=scope)

Variable Distribution and Gradient Aggregation

During training, training variable values are updated using aggregated gradients and deltas. In the benchmark script, we demonstrate that with the flexible and general-purpose TensorFlow primitives, a diverse range of high-performance distribution and aggregation schemes can be built.

Three examples of variable distribution and aggregation were included in the script:

  • parameter_server where each replica of the training model reads the variables from a parameter server and updates the variable independently. When each model needs the variables, they are copied over through the standard implicit copies added by the TensorFlow runtime. The example script illustrates using this method for local training, distributed synchronous training, and distributed asynchronous training.
  • replicated places an identical copy of each training variable on each GPU. The forward and backward computation can start immediately as the variable data is immediately available. Gradients are accumulated across all GPUs, and the aggregated total is applied to each GPU's copy of the variables to keep them in sync.
  • distributed_replicated places an identical copy of the training parameters on each GPU along with a master copy on the parameter servers. The forward and backward computation can start immediately as the variable data is immediately available. Gradients are accumulated across all GPUs on each server and then the per-server aggregated gradients are applied to the master copy. After all workers do this, each worker updates its copy of the variable from the master copy.

Below are additional details about each approach.

Parameter Server Variables

The most common way trainable variables are managed in TensorFlow models is parameter server mode.

In a distributed system, each worker process runs the same model, and parameter server processes own the master copies of the variables. When a worker needs a variable from a parameter server, it refers to it directly. The TensorFlow runtime adds implicit copies to the graph to make the variable value available on the computation device that needs it. When a gradient is computed on a worker, it is sent to the parameter server that owns the particular variable, and the corresponding optimizer is used to update the variable.

There are some techniques to improve throughput:

  • The variables are spread among parameter servers based on their size, for load balancing.
  • When each worker has multiple GPUs, gradients are accumulated across the GPUs and a single aggregated gradient is sent to the parameter server. This reduces the network bandwidth and the amount of work done by the parameter servers.

For coordinating between workers, a very common mode is async updates, where each worker updates the master copy of the variables without synchronizing with other workers. In our model, we demonstrate that it is fairly easy to introduce synchronization across workers so updates for all workers are finished in one step before the next step can start.

The parameter server method can also be used for local training. In this case, instead of spreading the master copies of variables across parameter servers, they are either on the CPU or spread across the available GPUs.

Due to the simple nature of this setup, this architecture has gained a lot of popularity within the community.

This mode can be used in the script by passing --variable_update=parameter_server.

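The benchmark script manages this placement internally; as a general illustration with stock TensorFlow, variables can be spread across parameter servers (here by byte size, matching the load-balancing point above) with tf.train.replica_device_setter. The cluster addresses below are placeholders.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222', 'ps1.example.com:2222'],   # placeholder hosts
    'worker': ['worker0.example.com:2222'],
})

# Spread variables across the ps tasks by byte size for load balancing.
ps_strategy = tf.contrib.training.GreedyLoadBalancingStrategy(
    2, tf.contrib.training.byte_size_load_fn)

with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        worker_device='/job:worker/task:0',
        ps_strategy=ps_strategy)):
    # Variables land on the ps tasks; compute ops stay on the worker.
    weights = tf.get_variable('weights', shape=[1024, 1024])
    loss = tf.reduce_sum(tf.matmul(weights, weights))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)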

Replicated Variables

In this design, each GPU on the server has its own copy of each variable. The values are kept in sync across GPUs by applying the fully aggregated gradient to each GPU's copy of the variable.

The variables and data are available at the start of training, so the forward pass of training can start immediately. Gradients are aggregated across the devices and the fully aggregated gradient is then applied to each local copy.

Gradient aggregation across the server can be done in different ways:

  • Using standard TensorFlow operations to accumulate the total on a single device (CPU or GPU) and then copy it back to all GPUs.
  • Using NVIDIA® NCCL, described below in the NCCL section.

This mode can be used in the script by passing --variable_update=replicated.

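A small, self-contained sketch of the idea (the tiny model, shapes, and plain gradient-descent update are placeholders for the benchmark's real networks and optimizers):

import tensorflow as tf

NUM_GPUS = 2
LR = 0.01

def tower(scope, images, labels):
    # Tiny stand-in model; each tower creates its own copy of the variables.
    with tf.variable_scope(scope):
        logits = tf.layers.dense(images, 10)
        return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

images = tf.random_uniform([NUM_GPUS, 32, 784])                 # one slice per GPU
labels = tf.random_uniform([NUM_GPUS, 32], maxval=10, dtype=tf.int32)

tower_grads, tower_vars = [], []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        loss = tower('tower_%d' % i, images[i], labels[i])
        var_copies = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope='tower_%d' % i)
        tower_vars.append(var_copies)
        tower_grads.append(tf.gradients(loss, var_copies))

updates = []
for grads, copies in zip(zip(*tower_grads), zip(*tower_vars)):
    aggregated = tf.add_n(grads)   # total gradient for this variable across GPUs
    for copy in copies:            # apply it to every GPU's copy to keep them in sync
        updates.append(tf.assign_sub(copy, LR * aggregated))
train_op = tf.group(*updates)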

Replicated Variables in Distributed Training

The replicated method for variables can be extended to distributed training. One way to do this, like the replicated mode, is to aggregate the gradients fully across the cluster and apply them to each local copy of the variable. This may be shown in a future version of the scripts; the scripts currently present a different variation, described here.

In this mode, in addition to each GPU's copy of the variables, a master copy is stored on the parameter servers. As with the replicated mode, training can start immediately using the local copies of the variables.

As the gradients of the weights become available, they are sent back to the parameter servers and all local copies are updated:

  1. All the gradients from the GPUs on the same worker are aggregated together.
  2. Aggregated gradients from each worker are sent to the parameter server that owns the variable, where the specified optimizer is used to update the master copy of the variable.
  3. Each worker updates its local copy of the variable from the master. In the example model, this is done with a cross-replica barrier that waits for all the workers to finish updating the variables, and fetches the new variable only after the barrier has been released by all replicas. Once the copy finishes for all variables, this marks the end of a training step, and a new step can start.

Although this sounds similar to the standard use of parameter servers, the performance is often better in many cases. This is largely due to the fact that the computation can happen without any delay, and much of the copy latency of early gradients can be hidden by later computation layers.

This mode can be used in the script by passing --variable_update=distributed_replicated.
