
TensorFlow Performance: Model Optimization

Performance

Performance is an important consideration when training machine learning models.

Performance speeds up and scales research while also providing end users with near-instant predictions.

This section provides details on the high-level APIs to use, along with best practices for building and training high-performance models and for quantizing models for the lowest latency and highest throughput in inference.


  • Performance Guide contains a collection of best practices for optimizing your TensorFlow code.
  • Data input pipeline guide describes the tf.data API for building efficient data input pipelines for TensorFlow.
  • Benchmarks contains a collection of benchmark results for a variety of hardware configurations.

The TensorFlow Model Optimization Toolkit is a set of techniques for optimizing models for inference:

  • Overview, which introduces the model optimization toolkit.
  • Post-training quantization, which describes post-training quantization.

XLA (Accelerated Linear Algebra) is an experimental compiler for linear algebra that optimizes TensorFlow computations. The following guides explore XLA:

  • XLA Overview, which introduces XLA.
  • Broadcasting Semantics, which describes XLA's broadcasting semantics.
  • Developing a new back end for XLA, which explains how to re-target TensorFlow in order to optimize the performance of the computational graph for particular hardware.
  • Using JIT Compilation, which describes the XLA JIT compiler that compiles and runs parts of TensorFlow graphs via XLA in order to optimize performance.
  • Operation Semantics, which is a reference manual describing the semantics of operations in the ComputationBuilder interface.
  • Shapes and Layouts, which details the Shape protocol buffer.
  • tfcompile, a standalone tool that compiles TensorFlow graphs into executable code in order to optimize performance.

Model optimization

Inference efficiency is a critical issue when deploying machine learning models to mobile devices.

Whereas the computational demand for training grows with the number of models trained on different architectures, the computational demand for inference grows in proportion to the number of users.

The TensorFlow Model Optimization Toolkit minimizes the complexity of inference: model size, latency, and power consumption.


Use cases

Model optimization is useful for:

  • Deploying models to edge devices with restrictions on processing, memory, or power consumption. For example, mobile and Internet of Things (IoT) devices.
  • Reducing the payload size for over-the-air model updates.
  • Executing on hardware constrained to fixed-point operations.
  • Optimizing models for special-purpose hardware accelerators.

Optimization methods

Model optimization uses multiple techniques:

  • Reduced parameter count, for example, pruning and structured pruning (a magnitude-pruning sketch follows this list).
  • Reduced representational precision, for example, quantization.
  • Updating the original model topology to a more efficient one, with reduced parameters or faster execution, for example, tensor decomposition methods and distillation.
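To make the first technique concrete, below is a minimal NumPy sketch of magnitude-based pruning: it zeroes out roughly the smallest-magnitude fraction of a weight matrix. This is an illustration only, not the toolkit's pruning API; `prune_by_magnitude` and its `sparsity` parameter are hypothetical names.

import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    # Zero out roughly the smallest-magnitude `sparsity` fraction of weights.
    # (Hypothetical helper for illustration; ties at the threshold may prune
    # slightly more than the requested fraction.)
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return weights * (np.abs(weights) > threshold)

w = np.random.randn(4, 4).astype(np.float32)
w_pruned = prune_by_magnitude(w, sparsity=0.5)
print("nonzero before:", np.count_nonzero(w), "after:", np.count_nonzero(w_pruned))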

Model quantization

Quantizing deep neural networks uses techniques that allow for reduced precision representations of weights and, optionally, activations for both storage and computation.


Quantization provides several benefits:


  • Support on existing CPU platforms.
  • Quantizing activations reduces memory access costs for reading and storing intermediate activations.
  • Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization.

TensorFlow Lite provides several levels of support for quantization.


Post-training quantization quantizes weights and activations post training and is very easy to use. Quantization-aware training allows for training networks that can be quantized with minimal accuracy drop and is only available for a subset of convolutional neural network architectures.

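As a rough sketch of how this worked with the TF 1.x contrib API (assuming the `tf.contrib.quantize` graph rewriter that shipped with TensorFlow at the time; details vary by version), the training graph is rewritten in place so that fake-quantization ops simulate 8-bit inference during training:

import tensorflow as tf

# Build the usual float training graph first (model, loss, optimizer).
# ... model definition elided ...

# Rewrite the graph in place: fake-quant ops are inserted on weights and
# activations so training learns ranges that survive quantization.
tf.contrib.quantize.create_training_graph(
    input_graph=tf.get_default_graph(), quant_delay=0)

# At export time, a matching rewrite prepares the graph for quantized inference:
# tf.contrib.quantize.create_eval_graph(input_graph=tf.get_default_graph())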

Latency and accuracy results

Below are the results of the latency and accuracy of post-training quantization and quantization-aware training on a few models.


All latency numbers are measured on Pixel 2 devices using a single big core.


As the toolkit improves, so will the numbers here:


| Model | Top-1 Accuracy (Original) | Top-1 Accuracy (Post Training Quantized) | Top-1 Accuracy (Quantization Aware Training) | Latency (Original) (ms) | Latency (Post Training Quantized) (ms) | Latency (Quantization Aware Training) (ms) | Size (Original) (MB) | Size (Optimized) (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mobilenet-v1-1-224 | 0.709 | 0.657 | 0.70 | 180 | 145 | 80.2 | 16.9 | 4.3 |
| Mobilenet-v2-1-224 | 0.719 | 0.637 | 0.709 | 117 | 121 | 80.3 | 14 | 3.6 |
| Inception_v3 | 0.78 | 0.772 | 0.775 | 1585 | 1187 | 637 | 95.7 | 23.9 |
| Resnet_v2_101 | 0.770 | 0.768 | N/A | 3973 | 2868 | N/A | 178.3 | 44.9 |

Table 1: Benefits of model quantization for select CNN models

Choice of quantization tool

As a starting point, check if the models in the TensorFlow Lite model repository can work for your application. If not, we recommend that users start with the post-training quantization tool since this is broadly applicable and does not require training data.

For cases where the accuracy and latency targets are not met, or hardware accelerator support is important, quantization-aware training is the better option.


Post-training quantization

Post-training quantization is a general technique to reduce the model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights to 8-bits of precision from floating-point.


import tensorflow as tf

# Convert a SavedModel with post-training weight quantization enabled.
converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model_dir)
converter.post_training_quantize = True
tflite_quantized_model = converter.convert()

# Write the quantized flatbuffer to disk.
with open("quantized_model.tflite", "wb") as f:
    f.write(tflite_quantized_model)



At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels. This conversion is done once and cached to reduce latency.

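A minimal NumPy sketch of that one-time conversion, using the min/max linear representation described later in this section (variable names are illustrative):

import numpy as np

def dequantize(q, range_min, range_max):
    # Map uint8 values back to floats spread linearly over [range_min, range_max].
    scale = (range_max - range_min) / 255.0
    return range_min + q.astype(np.float32) * scale

q_weights = np.array([0, 128, 255], dtype=np.uint8)
# Done once per weight tensor; the float result is cached for later use.
w_float = dequantize(q_weights, range_min=-10.0, range_max=30.0)
print(w_float)  # approximately [-10.0, 10.08, 30.0]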

To further improve latency, hybrid operators dynamically quantize activations to 8-bits and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored using floating-point, so the speedup with hybrid ops is less than a full fixed-point computation.
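The sketch below illustrates the idea in plain NumPy (not TensorFlow Lite's actual kernels, and using a simplified symmetric per-tensor scheme): the activation range is measured at runtime, both operands are quantized to 8 bits, and the matrix multiplication runs in integer arithmetic with a single float rescale at the end.

import numpy as np

def quantize_symmetric(x):
    # Dynamically quantize a float tensor to int8 with a per-tensor scale.
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

activations = np.random.randn(2, 8).astype(np.float32)  # arrive at runtime
weights = np.random.randn(8, 4).astype(np.float32)      # quantized offline

q_act, s_act = quantize_symmetric(activations)
q_w, s_w = quantize_symmetric(weights)

# Integer matmul with int32 accumulation, then one float rescale.
acc = q_act.astype(np.int32) @ q_w.astype(np.int32)
result = acc.astype(np.float32) * (s_act * s_w)

print(np.max(np.abs(result - activations @ weights)))  # small quantization error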

Hybrid ops are available for the most compute-intensive operators in a network.

Since weights are quantized post-training, there could be an accuracy loss, particularly for smaller networks. Pre-trained fully quantized models are provided for specific networks in the TensorFlow Lite model repository. It is important to check the accuracy of the quantized model to verify that any degradation in accuracy is within acceptable limits. There is a tool to evaluate TensorFlow Lite model accuracy.

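A minimal sketch of such a check with the TF 1.x Python interpreter API (assuming `tf.contrib.lite.Interpreter`, which shipped alongside the converter used above; this is not the official evaluation tool, and `eval_images`/`eval_labels` are a hypothetical labeled evaluation set):

import numpy as np
import tensorflow as tf

interpreter = tf.contrib.lite.Interpreter(model_path="quantized_model.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

correct = 0
for image, label in zip(eval_images, eval_labels):  # hypothetical eval set
    interpreter.set_tensor(input_index, image[np.newaxis, ...])
    interpreter.invoke()
    correct += int(np.argmax(interpreter.get_tensor(output_index)) == label)

print("Top-1 accuracy:", correct / len(eval_labels))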

If the accuracy drop is too high, consider using quantization-aware training.

Representation for quantized tensors

TensorFlow approaches the conversion of floating-point arrays of numbers into 8-bit representations as a compression problem.

Since the weights and activation tensors in trained neural network models tend to have values distributed across comparatively small ranges (for example, -15 to +15 for weights, or -500 to 1000 for image model activations), and since neural nets tend to be robust to noise, the error introduced by quantizing to a small set of values keeps the precision of the overall results within an acceptable threshold.

A chosen representation must perform fast calculations, especially the large matrix multiplications that comprise the bulk of the computations while running a model.


The range is represented with two floats that store the overall minimum and maximum values, corresponding to the lowest and highest quantized values.


Each entry in the quantized array represents a float value in that range, distributed linearly between the minimum and maximum.


For example, with a minimum of -10.0, a maximum of 30.0, and an 8-bit array, the quantized values represent the following:


| Quantized | Float |
| --- | --- |
| 0 | -10.0 |
| 128 | 10.0 |
| 255 | 30.0 |

Table 2: Example quantized value range
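The rows of Table 2 follow directly from the linear mapping; a quick check in plain Python:

range_min, range_max = -10.0, 30.0
scale = (range_max - range_min) / 255.0  # float step per quantized unit

for q in (0, 128, 255):
    print(q, "->", range_min + q * scale)
# 0 -> -10.0, 128 -> 10.08 (the ~10.0 row in Table 2), 255 -> 30.0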

The advantages of this representation format are:

  • It efficiently represents an arbitrary magnitude of ranges.
  • The values don't have to be symmetrical.
  • The format represents both signed and unsigned values.
  • The linear spread makes multiplications straightforward (see the sketch below).
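To see why the linear spread makes multiplication straightforward, note that each float is range_min + q * scale, so a product of two quantized tensors expands into an integer product plus scalar-weighted correction terms (elementwise here for brevity; the same expansion applies inside a matrix multiply). A small NumPy check of that expansion, illustrative only, not a production kernel:

import numpy as np

def linear_params(x):
    # Per-tensor minimum and scale for an 8-bit linear mapping.
    lo, hi = float(x.min()), float(x.max())
    return lo, (hi - lo) / 255.0

a = np.random.uniform(-15, 15, size=(3, 3)).astype(np.float32)
b = np.random.uniform(-15, 15, size=(3, 3)).astype(np.float32)

lo_a, s_a = linear_params(a)
lo_b, s_b = linear_params(b)
qa = np.round((a - lo_a) / s_a).astype(np.int32)
qb = np.round((b - lo_b) / s_b).astype(np.int32)

# a * b == (lo_a + qa*s_a) * (lo_b + qb*s_b); only qa*qb is a full integer
# product, the remaining terms are cheap scalar-weighted corrections.
approx = lo_a * lo_b + (lo_a * s_b) * qb + (lo_b * s_a) * qa + (s_a * s_b) * (qa * qb)

print(np.max(np.abs(approx - a * b)))  # small quantization error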