
[TensorFlow] Quantization

1. Question 1: How does TensorFlow do quantization and dequantization?
Details 
According to the blog post "How to Quantize Neural Networks with TensorFlow" (https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/) (key reference!), TensorFlow quantizes values before they go into a layer. After being processed by the layer, the values are dequantized. TensorFlow quantizes values by rescaling them into the range 0 to 255, so it needs to keep "min" and "max" in order to dequantize the values.
I would like to ask: 1. How are the "min" and "max" in the outputs of a "quantization" op determined? I mean, if we simply find the minimum and maximum values and map them to 0 and 255, we will get overflow or underflow when doing convolution. 2. How are the "min" and "max" in the outputs of a "convolution" op determined? Both weights and activations are quantized, so there are two sets of "min" and "max". How does a convolution op combine them into a single set of "min" and "max"?


Answer 
TensorFlow uses, among others, gemmlowp for low-precision matrix multiplications. Although 8-bit values are used as inputs, intermediate results are 32-bit values. These 32-bit values are converted back to 8-bit before the results are returned.

From https://github.com/google/gemmlowp/blob/master/doc/low-precision.md :

To avoid overflow, we internally accumulate results on more than 8 bits, and at the end we keep only some significant 8 bits. 
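
To make this concrete, here is a minimal NumPy sketch of the idea (an illustration only, not TensorFlow's or gemmlowp's actual kernels): products of 8-bit operands are accumulated in a 32-bit accumulator, and the accumulator is then rescaled back down to 8 bits. The "min"/"max" outputs of a quantized convolution come from the range of that 32-bit accumulator, which is what the requantization step uses to keep only 8 significant bits.

import numpy as np

rng = np.random.default_rng(0)
a_q = rng.integers(0, 256, size=(4, 64), dtype=np.uint8)   # quantized activations
w_q = rng.integers(0, 256, size=(64, 8), dtype=np.uint8)   # quantized weights

# Multiply 8-bit operands but accumulate in int32 so nothing overflows.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)

# Keep only 8 significant bits: map the accumulator's observed range back
# onto [0, 255] and record the new min/max for later dequantization.
acc_min, acc_max = int(acc.min()), int(acc.max())
out_q = np.round((acc - acc_min) * 255.0 / (acc_max - acc_min)).astype(np.uint8)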

2. How to Quantize Neural Networks with TensorFlow (blog post / official guide)
2.1 Quantizing an existing model and testing it
Code: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/python/quantize_graph.py

curl http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz -o /tmp/inceptionv3.tgz
tar xzf /tmp/inceptionv3.tgz -C /tmp/
bazel build tensorflow/contrib/quantization/tools:quantize_graph
bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
--input=/tmp/classify_image_graph_def.pb \
--output_node_names="softmax" --output=/tmp/quantized_graph.pb \
--mode=eightbit


This will produce a new model that runs the same operations as the original, but with eight bit calculations internally, and all weights quantized as well. If you look at the file size, you’ll see it’s about a quarter of the original (23MB versus 91MB). You can still run this model using exactly the same inputs and outputs though, and you should get equivalent results. Here’s an example:

bazel build tensorflow/examples/label_image:label_image
bazel-bin/tensorflow/examples/label_image/label_image \
--input_graph=/tmp/quantized_graph.pb \
--input_width=299 \
--input_height=299 \
--mean_value=128 \
--std_value=128 \
--input_layer_name="Mul:0" \
--output_layer_name="softmax:0"


2.2 What representation is used for quantized tensors?
We approach converting floating-point arrays of numbers into eight-bit representations as a compression problem. We know that the weight and activation tensors in trained neural network models tend to have values distributed across comparatively small ranges (for example, you might have -15 to +15 for weights, or -500 to 1000 for activations on an image model, though the exact numbers vary). We also know from experiment that neural networks tend to be very robust in the presence of noise, so the noise-like error introduced by quantizing down to a small set of values does not hurt the precision of the overall result very much. We also want to pick a representation that is easy to perform calculations on, especially the large matrix multiplications that form the bulk of the work needed to run a model.
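
As a sketch of that representation (my own illustration, not TensorFlow's internal code): store the tensor as uint8 codes plus the float min and max, so that code 0 decodes to min, code 255 decodes to max, and the 256 codes are linearly spaced in between.

import numpy as np

def quantize(x):
    """Compress a float array to uint8 plus the (min, max) needed to decode it."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0          # assumes x_max > x_min
    q = np.round((x - x_min) / scale).astype(np.uint8)
    return q, x_min, x_max

def dequantize(q, x_min, x_max):
    """Decode: code 0 -> x_min, code 255 -> x_max, linear in between."""
    scale = (x_max - x_min) / 255.0
    return q.astype(np.float32) * scale + x_min

x = np.random.randn(1000).astype(np.float32) * 5.0
q, lo, hi = quantize(x)
err = np.abs(dequantize(q, lo, hi) - x).max()   # at most about (hi - lo) / 255 / 2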


2.3 Eight-bit arithmetic using gemmlowp

2.3.1 Quantization implementation code (important!)
 


2.3.2 How it works: the low-precision paradigm in gemmlowp, and how it is implemented
“Low-precision” means that the input and output matrix entries are integers on at most 8 bits. The scalar type is uint8_t. 
gemmlowp is flexible enough to support multiple low-precision paradigms, i.e. multiple ways that a meaning is attached to 8-bit values so that a computation can rely on an 8-bit GEMM provided by gemmlowp.

The gemmlowp documentation builds this paradigm up step by step:
* Building a quantization paradigm from first principles
* Quantization as an affine map
* Domain-specific constraint: the real value 0 must be exactly representable
* The final form of the quantization equation
* Quantizing a matrix multiplication
* Implementation of quantized matrix multiplication
* How this is implemented in gemmlowp
* How this differs from the older legacy gemmlowp quantization paradigm
* Example code illustrating the new quantization paradigm
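
The core of the paradigm sketched in that list (paraphrased from the gemmlowp doc; the helper below is my own sketch, not gemmlowp's C++ API) is the affine map real = scale * (quantized - zero_point), where scale is a positive float and zero_point is the uint8 code that represents the real value 0.0, so that 0 is always exactly representable.

def choose_quantization_params(rmin, rmax, num_bits=8):
    """Pick (scale, zero_point) for real = scale * (q - zero_point).
    Assumes rmax > rmin."""
    qmin, qmax = 0, 2 ** num_bits - 1
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # the range must contain 0.0
    scale = (rmax - rmin) / (qmax - qmin)
    # zero_point is the integer code onto which the real value 0.0 falls.
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

scale, zp = choose_quantization_params(-1.0, 2.5)
# The code zp decodes to exactly 0.0: scale * (zp - zp) == 0.0
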
3. Google MobileNet quantization implementation
This section shows how fake-quantization ops are inserted during training.
3.1 mobilenet_v1_train.py

def build_model():
  """Builds graph for model to train with rewrites for quantization.
  Returns:
    g: Graph with fake quantization ops and batch norm folding suitable for
    training quantized weights.
    train_tensor: Train op for execution during training.
  """
  g = tf.Graph()
  with g.as_default(), tf.device(
      tf.train.replica_device_setter(FLAGS.ps_tasks)):
    inputs, labels = imagenet_input(is_training=True)
    with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=True)):
      logits, _ = mobilenet_v1.mobilenet_v1(
          inputs,
          is_training=True,
          depth_multiplier=FLAGS.depth_multiplier,
          num_classes=FLAGS.num_classes)

    tf.losses.softmax_cross_entropy(labels, logits)

    # Call rewriter to produce graph with fake quant ops and folded batch norms
    # quant_delay delays start of quantization till quant_delay steps, allowing
    # for better model accuracy.
    if FLAGS.quantize:
      tf.contrib.quantize.create_training_graph(quant_delay=get_quant_delay())


3.2 quantize_graph.py

# Create the training graph with fake-quantization ops (simulated quantization)
def create_training_graph(input_graph=None, quant_delay=0):
  """Rewrites a training input_graph in place for simulated quantization.
  Variables added by the rewrite get added to the global variables collection.
  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.
  The default value of quant_delay is suitable for finetuning an already trained
  floating point model (recommended).
  If one wants to train a quantized model from scratch, quant_delay should be
  set to the number of steps it take the floating point model to converge.
  Quantization will be activated at this point and effectively finetune the
  model. If quant_delay is not provided when training from scratch, training can
  often fail.
  Args:
    input_graph: The tf.Graph to be transformed.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.
  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """
  # TODO(raghuramank) Need to have freeze_bn_delay be a function of batch size
  # Currently the values below are hardcoded for mobilenetV1 on imagenet
  # Please use the experimental API if you need to tune these values.
  freeze_bn_delay = None

  _create_graph(
      input_graph=input_graph,
      is_training=True,
      quant_delay=quant_delay,
      freeze_bn_delay=freeze_bn_delay)

# Continue into _create_graph
def _create_graph(input_graph=None,
                  is_training=True,
                  weight_bits=8,
                  activation_bits=8,
                  quant_delay=None,
                  freeze_bn_delay=None,
                  scope=None):
  """Rewrites an input_graph in place for simulated quantization.
  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.
  Args:
    input_graph: The tf.Graph to be transformed, if None then defaults to the
      default graph.
    is_training: Whether quantizing training or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.
    freeze_bn_delay: Number of steps after which moving mean and variance are
      frozen and used instead of batch statistics during training.
      freeze_bn_delay should be greater than quant_delay and should correspond
      to the number of steps when training has almost converged
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.
  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """

  if input_graph is None:
    input_graph = ops.get_default_graph()
  with input_graph.as_default():
    fold_batch_norms.FoldBatchNorms(
        input_graph,
        freeze_batch_norm_delay=freeze_bn_delay,
        is_training=is_training)
    quantize.Quantize(
        input_graph,
        is_training,
        quant_delay=quant_delay,
        weight_bits=weight_bits,
        activation_bits=activation_bits,
        scope=scope)
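
The FoldBatchNorms pass merges each batch-norm layer into the conv/fc layer that feeds it, so the rewritten graph only has folded weights and a bias to fake-quantize. A rough NumPy sketch of the folding arithmetic (the actual rewriter also handles the training-time transition controlled by freeze_batch_norm_delay):

import numpy as np

def fold_batch_norm(weights, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Fold y = gamma * (conv(x, W) - mean) / sqrt(var + eps) + beta
    into a single conv with folded weights and a bias.
    `weights` has shape [..., out_channels]; the BN params have shape [out_channels]."""
    std = np.sqrt(moving_var + eps)
    folded_weights = weights * (gamma / std)          # per-output-channel rescale
    folded_bias = beta - gamma * moving_mean / std    # absorb the BN shift
    return folded_weights, folded_bias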


3.3 quantize.py

def Quantize(graph,
             is_training,
             weight_bits=8,
             activation_bits=8,
             ema_decay=0.999,
             quant_delay=None,
             vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
             scope=None):
  """Updates graph with quantization operations.
  Currently we quantize the following tensors:
  * Conv/MatMul: Quantize the weights if it matches.
  * Activation: Quantize the output if it matches.
  * Bypass/Post-activation Bypass: Quantize both input and output
    if it matches.
  Args:
    graph: Graph to modify.
    is_training: Whether quantizing training graph or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.
  Raises:
    ValueError: When quantization fails.
  """
  # ...
  for layer_match in _FindLayersToQuantize(graph):
    # ...
    _InsertQuantOp(
        add_context,
        'act_quant',
        layer_match.activation_op,
        consumer_ops,
        is_training,
        moving_avg=True,
        ema_decay=ema_decay,
        quant_delay=quant_delay,
        vars_collection=vars_collection,
        bits=activation_bits,
        init_min=0.0,
        producer_scope=scope)

# Continue into _FindLayersToQuantize
def _FindLayersToQuantize(graph):
  """Matches layers in graph to quantize.
  The following patterns get matched. Nodes surrounded by [] will be
  optionally matched:
          weight|folded_weight
                /
         conv|fc
            |
    [post_conv_correction]
            |
     biasadd|folded_bias
            |
         [bypass]
            |
        activation
            |
   [post_activation_bypass]
  Match replacements:
    If weight|folded_weight is found, FakeQuant is added afterwards.
    If bypass is found, FakeQuant is added before and after.
    If activation is found, FakeQuant is added afterwards.
    If post_activation_bypass is found, FakeQuant is added afterwards.
  Args:
    graph: Graph to perform match on.
  Returns:
    list of _LayerMatches.
  """
# Next: _InsertQuantOp
def _InsertQuantOp(context,
                   name,
                   producer,
                   consumers,
                   is_training,
                   moving_avg=True,
                   init_min=-6.0,
                   init_max=6.0,
                   bits=8,
                   ema_decay=0.999,
                   quant_delay=None,
                   vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
                   narrow_range=False,
                   producer_scope=None,
                   consumer_scope=None):
  """Inserts a quant op between a producer op and (multiple) consumer ops.
  Args:
    context: Context where producer and consumer operations are nested.
    name: Name for the new quantization op within the context.
    producer: Producer operation of the pairs where quantization will be
      inserted.
    consumers: Consumer operations of the pairs.
    is_training: Whether quantizing training graph or eval graph.
    moving_avg: Specifies whether to use exponential moving average or just
      the last value seen.
    init_min: Starting minimum value for the new quantization op.
    init_max: Starting maximum value for the new quantization op.
    bits: Number of bits to use for quantization, must be between 2 and 8.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    narrow_range: Whether to use the narrow quantization range
      [1; 2^bits - 1] or wide range [0; 2^bits - 1].
    producer_scope: The restriction of producer scope. If not None, the new op
      will be inserted only when the producer is in this scope.
    consumer_scope: The restriction of producer scope. If not None, the new op
      will be inserted only when all the consumers are in this scope.
  Raises:
    ValueError: When producer operation is not directly connected to the
      consumer operation.
  """
  # Next: inside the body of _InsertQuantOp
  ### Create the FakeQuant op: moving-average range (activations) or last-value range (weights)
  if moving_avg:
    quant = (
        quant_ops.MovingAvgQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            ema_decay=ema_decay,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))
  else:
    quant = (
        quant_ops.LastValueQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))
### Optionally delay activating quantization until quant_delay global steps
  if quant_delay and quant_delay > 0:
    activate_quant = math_ops.greater_equal(
        common.CreateOrGetQuantizationStep(),
        quant_delay,
        name=name_prefix + '/activate_quant')
    quant = control_flow_ops.cond(
        activate_quant,
        lambda: quant,
        lambda: inputs,
        name=name_prefix + '/delayed_quant')
### Reroute consumer ops so they read the quantized tensor
  if consumers:
    tensors_modified_count = graph_editor.reroute_ts(
        [quant], [inputs], can_modify=consumers)
    # Some operations can have multiple output tensors going to the same
    # consumer. Since consumers is a set, we need to ensure that
    # tensors_modified_count is greater than or equal to the length of the set
    # of consumers.
    if tensors_modified_count < len(consumers):
      raise ValueError('No inputs quantized for ops: [%s]' % ', '.join(
          [consumer.name for consumer in consumers]))
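
For intuition, the FakeQuant nodes created by MovingAvgQuantize / LastValueQuantize keep the tensor in floating point but pass it through a quantize/dequantize round trip, so the rest of the graph trains against the quantization error. A minimal NumPy sketch of that forward computation (the real TensorFlow ops additionally nudge the range so 0.0 is exactly representable, and define a straight-through gradient):

import numpy as np

def fake_quant(x, range_min, range_max, num_bits=8, narrow_range=False):
    """Simulate quantization in float: clamp, snap to the 2^bits grid, decode."""
    qmin = 1 if narrow_range else 0
    qmax = 2 ** num_bits - 1
    scale = (range_max - range_min) / (qmax - qmin)
    clamped = np.clip(x, range_min, range_max)
    q = np.round((clamped - range_min) / scale)   # integer code on the grid
    return q * scale + range_min                  # back to float, now quantized

x = np.linspace(-8.0, 8.0, 9)
print(fake_quant(x, -6.0, 6.0))   # values outside the default [-6, 6] range are clamped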


4. Other Sources
1. tensorflow/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
2. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/quantization_utils.h#L32
3. tensorflow/contrib/quantization/tools:quantize_graph (converts an existing model into a quantized graph)
4. tf.quantize
Defined in tensorflow/python/ops/array_ops.py.

tf.quantize(
    input,
    min_range,
    max_range,
    T,
    mode='MIN_COMBINED',
    round_mode='HALF_AWAY_FROM_ZERO',
    name=None
)
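
A minimal usage sketch under the TF 1.x graph/session API (tf.dequantize is the inverse op; exact outputs depend on the mode):

import tensorflow as tf  # TF 1.x

x = tf.constant([-1.0, 0.0, 1.0, 2.5])
# Quantize into uint8 over the caller-supplied range [-1.0, 2.5]; the op also
# returns the min/max actually used, which are needed to dequantize later.
q, out_min, out_max = tf.quantize(x, min_range=-1.0, max_range=2.5, T=tf.quint8)
x_hat = tf.dequantize(q, out_min, out_max)

with tf.Session() as sess:
    print(sess.run([q, x_hat]))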


5. Fixed Point Quantization
5.1 Quantization training with TensorFlow 
TensorFlow can train models with quantization in the loop. Because training requires small gradient adjustments, floating point values are still used. To keep models as floating point while adding the quantization error in the training loop, fake quantization nodes simulate the effect of quantization in the forward and backward passes.

Since it’s difficult to add these fake quantization operations to all the required locations in the model, there’s a function available that rewrites the training graph. To create a fake quantized training graph:

# Build forward pass of model.
loss = tf.losses.get_total_loss()

# Call the training rewrite which rewrites the graph in-place with
# FakeQuantization nodes and folds batchnorm for training. It is
# often needed to fine tune a floating point model for quantization
# with this training tool. When training from scratch, quant_delay
# can be used to activate quantization after training to converge
# with the float graph, effectively fine-tuning the model.
tf.contrib.quantize.create_training_graph(quant_delay=2000000)

# Call backward pass optimizer as usual.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
optimizer.minimize(loss)


The rewritten eval graph is non-trivially different from the training graph since the quantization ops affect the batch normalization step. Because of this, we’ve added a separate rewrite for the eval graph:

# Build eval model
logits = tf.nn.softmax_cross_entropy_with_logits(...)

# Call the eval rewrite which rewrites the graph in-place with
# FakeQuantization nodes and fold batchnorm for eval.
tf.contrib.quantize.create_eval_graph()

# Save the checkpoint and eval graph proto to disk for freezing
# and providing to TFLite.
with open(eval_graph_file, 'w') as f:
  f.write(str(g.as_graph_def()))
saver = tf.train.Saver()
saver.save(sess, checkpoint_name)


Methods to rewrite the training and eval graphs are an active area of research and experimentation. Although rewrites and quantized training might not work or improve performance for all models, we are working to generalize these techniques.

5.2 Generating fully quantized models 
The previously demonstrated after-rewrite eval graph only simulates quantization. To generate real fixed point computations from a trained quantization model, convert it to a fixed point kernel. Tensorflow Lite supports this conversion from the graph resulting from create_eval_graph.

First, create a frozen graph that will be the input for the TensorFlow Lite toolchain:

bazel build tensorflow/python/tools:freeze_graph && \
  bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=eval_graph_def.pb \
  --input_checkpoint=checkpoint \
  --output_graph=frozen_eval_graph.pb --output_node_names=outputs


Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully quantized TensorFlow Lite model:

bazel build tensorflow/contrib/lite/toco:toco && \
  ./bazel-bin/tensorflow/contrib/lite/toco/toco \
  --input_file=frozen_eval_graph.pb \
  --output_file=tflite_model.tflite \
  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
  --inference_type=QUANTIZED_UINT8 \
  --input_shape="1,224, 224,3" \
  --input_array=input \
  --output_array=outputs \
  --std_value=127.5 --mean_value=127.5


See the documentation for tf.contrib.quantize and TensorFlow Lite.

6. quantize.cc (toco/graph_transformations): GetOrComputeMinMax
const MinMax& GetOrComputeMinMax(Model* model, const string& array_name) {
  auto& array = model->GetArray(array_name);
  // Normally we should have a MinMax recorded on this Array,
  // so we just use it.
  if (array.minmax != nullptr) {
    return *array.minmax;
  }

  // We don't have a MinMax. That's bad news: we need
  // the graph to provide MinMax info for all arrays in order
  // for inference to reproduce faithfully the same quantization
  // error as the training process had.
  //
  // But we still want to support a fallback for constant arrays,
  // just using the plain min and max computed from array elements.
  // We should hopefully never rely on that in production, as that
  // will not give very good accuracy as that typically won't be
  // exactly what the training process used. But it will be useful
  // to allow easily trying out quantization even if the graph
  // lacks some minmax information.
  if (array.buffer != nullptr) {
    LOG(WARNING)
        << "Constant array " << array_name
        << " lacks MinMax information. To make up for that, we will now compute"
        << " the MinMax from actual array elements. That will result in"
        << " quantization parameters that probably do not match whichever "
           "arithmetic"
        << " was used during training, and thus will probably be a cause of "
           "poor"
        << " inference accuracy.";
    CHECK(array.buffer->type == ArrayDataType::kFloat);
    const auto& data = array.GetBuffer<ArrayDataType::kFloat>().data;
    // We always want [min, max] to contain 0.
    float min = 0.f;
    float max = 0.f;
    for (auto val : data) {
      min = std::min(min, val);
      max = std::max(max, val);
    }
    if (min == 0.f && max == 0.f) {
      // Prevent downstream anger from quantized math that expects min and max
      // to not be equal.
      max = 1.f;
    }
    auto& minmax = array.GetOrCreateMinMax();
    minmax.min = min;
    minmax.max = max;
    return minmax;
  }

  LOG(FATAL) << "Array " << array_name
             << " does not have MinMax information, "
                "and is not a constant array. Cannot "
                "proceed with quantization.";
}
---------------------
Author: kuanzi0001
Source: CSDN
Original post: https://blog.csdn.net/yifen4234/article/details/80382956
Copyright notice: this is an original post by the blogger; please include a link to it when reposting.