TensorFlow Object Detection API source code reading notes: RPN
Update:
It is recommended to first read 從程式設計實現角度學習Faster R-CNN (the Zhihu article, "Learning Faster R-CNN from an implementation perspective"), which is more intuitive. The source code here is highly abstracted, so these notes may look somewhat messy.
These two pieces in faster_rcnn_meta_arch.py correspond to the 3x3 and 1x1 convolutions inside the RPN described in the Zhihu article:
rpn_box_predictor_features = slim.conv2d(rpn_features_to_crop, ...)
self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
The AnchorTargetCreator in the Zhihu article, which uses IoU to select 256 of the 20000+ candidate anchors for classification and box regression (computing the RPN loss), corresponds to:
target_assigner.batch_assign_targets;
self._first_stage_sampler = sampler.BalancedPositiveNegativeSampler, which acts on first_stage_minibatch_size;
here the ~20000 is determined by the spatial size of the feature map fed to the RPN and the number of anchor types per location, and 256 corresponds to first_stage_minibatch_size (see protos/faster_rcnn.proto; a quick sanity check of these numbers follows at the end of this Update section);
in short, all of this happens inside def _loss_rpn. (proposal=2000)
The ProposalCreator in the Zhihu article (inside the RPN, select a certain number of the tens of thousands of anchors by score, e.g. 12000/6000, adjust their sizes and positions, run NMS, and keep the top-scoring 2000/300 as RoIs) corresponds to:
def _postprocess_rpn
first_stage_max_proposals=300
The ProposalTargetCreator in the Zhihu article, which selects a subset (e.g. 128) of the 2000/300 candidates to pool for training Fast R-CNN, corresponds to:
hard_example_miner is not used
_unpad_proposals_and_sample_box_classifier_batch
second_stage_batch_size
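A back-of-the-envelope check of where these numbers come from (the ~600x1000 input image size and stride 16 are assumptions for illustration):
feature_h, feature_w = 600 // 16, 1000 // 16          # ~37 x 62 feature map at stride 16
anchors_per_location = 9                              # 3 scales x 3 aspect ratios
print(feature_h * feature_w * anchors_per_location)   # 20646, i.e. the "20000+" candidate anchors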
Old:
'''RPN overview. Note: in this analysis many terms follow the wording of the original paper, which differs from the names used in the code.
'''
FasterRCNNFeatureExtractor.extract_proposal_features actually calls
FasterRCNNResnetV1FeatureExtractor._extract_proposal_features,
which produces the first stage RPN features used as the input to the RPN.
_extract_rpn_feature_map of class FasterRCNNMetaArch(model.DetectionModel):
calls the above feature extractor's _extract_proposal_features and returns
rpn_box_predictor_features: A 4-D float32 tensor with shape
[batch, height, width, depth] to be used for predicting proposal boxes
and corresponding objectness scores. '''This is the intermediate layer obtained from the sliding window.'''
rpn_features_to_crop: A 4-D float32 tensor with shape
[batch, height, width, depth] representing image features to crop using
the proposal boxes. '''This is simply the feature map produced by the feature extractor earlier.'''
anchors: A BoxList representing anchors (for the RPN) in
absolute coordinates.
'''Here grid_anchor_generator.GridAnchorGenerator is used to generate 9 anchor boxes per location (3 different scales and 3 aspect ratios). See the detailed analysis below.
'''
anchors = self._first_stage_anchor_generator.generate(
    [(feature_map_shape[1], feature_map_shape[2])])
'''The sliding window is applied to the conv feature map to produce the intermediate layer. first_stage_box_predictor_kernel_size: Kernel size to use for the convolution op just prior to RPN box predictions.
'''
with slim.arg_scope(self._first_stage_box_predictor_arg_scope):
kernel_size = self._first_stage_box_predictor_kernel_size
rpn_box_predictor_features = slim.conv2d(
rpn_features_to_crop,
self._first_stage_box_predictor_depth,
kernel_size=[kernel_size, kernel_size],
rate=self._first_stage_atrous_rate,
activation_fn=tf.nn.relu6)
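A toy shape check of this 3x3 "sliding window" convolution (the 38x50 feature map size and depths below are made-up illustration values, not the library defaults):
import tensorflow as tf
rpn_features_to_crop = tf.zeros([1, 38, 50, 1024])        # hypothetical backbone feature map
rpn_box_predictor_features = tf.keras.layers.Conv2D(
    512,                    # first_stage_box_predictor_depth
    kernel_size=3,          # first_stage_box_predictor_kernel_size
    padding='same',
    activation=tf.nn.relu6)(rpn_features_to_crop)
print(rpn_box_predictor_features.shape)   # (1, 38, 50, 512): spatial resolution is preserved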
'''According to the paper, the next step is for the intermediate layer to enter the cls and reg layers.
'''
def _predict_rpn_proposals(self, rpn_box_predictor_features):
which goes into
self._first_stage_box_predictor.predict
self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
'''Box predictors are classes that take a high level image feature map as input and produce two predictions, (1) a tensor encoding box locations, and (2) a tensor encoding classes for each box. class ConvolutionalBoxPredictor(BoxPredictor) is examined in detail below.
'''
'''One step further and we reach the loss. This predict function can return a single prediction_dict covering both stages, which then goes into the loss.
'''
def predict(self, preprocessed_inputs)
def loss(self, prediction_dict, scope=None):
'''loss calls the first-stage loss computation. Examined in detail below.
'''
def _loss_rpn
'''Anchor generation
object_detection/anchor_generators/grid_anchor_generator.py
'''
def _generate  # called through the generate function of the parent class core/anchor_generator.py
grid_height, grid_width = feature_map_shape_list[0]
# Multidimensional analog of numpy.meshgrid
scales_grid, aspect_ratios_grid = ops.meshgrid(self._scales,
self._aspect_ratios)
scales_grid = tf.reshape(scales_grid, [-1])
aspect_ratios_grid = tf.reshape(aspect_ratios_grid, [-1])
return tile_anchors(grid_height,
grid_width,
scales_grid,
aspect_ratios_grid,
self._base_anchor_size,
self._anchor_stride,
self._anchor_offset)
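A quick NumPy illustration of the meshgrid step (illustrative values): every (scale, aspect_ratio) pair becomes one anchor shape per grid point, e.g. 3 scales x 3 ratios = 9 anchors per location.
import numpy as np
scales = [0.5, 1.0, 2.0]
aspect_ratios = [0.5, 1.0, 2.0]
s, r = np.meshgrid(scales, aspect_ratios)
print(list(zip(s.ravel(), r.ravel())))   # 9 (scale, aspect_ratio) combinations
print(s.size)                            # 9 anchors per grid point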
'''Verify by hand using the test script.
base_anchor_size = [10, 10]#default=[256, 256]
anchor_stride = [19, 19]#default=[16, 16]
anchor_offset = [0, 0]
scales = [0.5, 1.0, 2.0]
aspect_ratios = [1.0]
exp_anchor_corners = [[-2.5, -2.5, 2.5, 2.5], [-5., -5., 5., 5.],
[-10., -10., 10., 10.], [-2.5, 16.5, 2.5, 21.5],
[-5., 14., 5, 24], [-10., 9., 10, 29],
[16.5, -2.5, 21.5, 2.5], [14., -5., 24, 5],
[9., -10., 29, 10], [16.5, 16.5, 21.5, 21.5],
[14., 14., 24, 24], [9., 9., 29, 29]]
feature_map_shape_list=[(2, 2)] #asks for anchors that correspond
to a 2x2 layer
grid_height, grid_width = 2,2
scales_grid, aspect_ratios_grid omitted; three combinations, so the whole feature map yields 2*2*3 = 12 anchors in total.
The anchor height and width are determined by scales, aspect_ratio, and base_anchor_size; straightforward.
The anchor centers are determined by range(grid), anchor_stride, and anchor_offset, where grid means grid_height and grid_width. For example, the first center is obviously 0 and the second is 19. Once you see that the grid is simply the set of points on the feature map at which anchors are generated, it all becomes simple. Question to think about: how are parameters such as base_anchor_size and anchor_stride configured?
How do the anchor centers line up with the sliding-window centers? The answer is that an anchor set is generated at every cell of the input feature map:
feature_map_shape = tf.shape(rpn_features_to_crop)
anchors = self._first_stage_anchor_generator.generate(
[(feature_map_shape[1], feature_map_shape[2])])
From this we can see that the anchor stride should be 1 x 16 = 16 (the feature extractor downsamples by 16, so each feature-map grid point corresponds to a 16x16 patch of the original image). As expected. The basic idea: a feature-map cell mapped back to the original image has a large receptive field of a single shape, so k anchor boxes of different sizes and shapes are introduced at each grid point to better enclose objects in the original image. This idea is further refined and extended in papers such as YOLOv2 and SSD.
'''
def tile_anchors
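A small NumPy sketch (not the library code) that reproduces exp_anchor_corners from the test above, assuming the [ymin, xmin, ymax, xmax] corner convention and ordering the output grid-point-major to match the expected list:
import numpy as np

base_anchor_size = np.array([10.0, 10.0])   # [height, width]
anchor_stride = np.array([19.0, 19.0])
anchor_offset = np.array([0.0, 0.0])
scales = np.array([0.5, 1.0, 2.0])
aspect_ratios = np.array([1.0])
grid_height, grid_width = 2, 2

# All (scale, aspect_ratio) combinations -> anchor heights and widths.
scales_grid, ratios_grid = np.meshgrid(scales, aspect_ratios)
scales_grid, ratios_grid = scales_grid.ravel(), ratios_grid.ravel()
heights = scales_grid / np.sqrt(ratios_grid) * base_anchor_size[0]
widths = scales_grid * np.sqrt(ratios_grid) * base_anchor_size[1]

anchors = []
for gy in range(grid_height):
    for gx in range(grid_width):
        ycenter = gy * anchor_stride[0] + anchor_offset[0]
        xcenter = gx * anchor_stride[1] + anchor_offset[1]
        for h, w in zip(heights, widths):
            anchors.append([ycenter - h / 2, xcenter - w / 2,
                            ycenter + h / 2, xcenter + w / 2])

print(np.array(anchors))   # 2*2*3 = 12 anchors, matching exp_anchor_corners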
"""生成locations and classes
object_detection/core/box_predictor.py.
It operates on the intermediate layer produced by the sliding window; the output is, for every anchor, a tensor encoding box locations and a tensor encoding classes for each box.
Additional convolutional layers may be inserted.
Location learning is nothing special either, just a convolution:
box_encodings = slim.conv2d(
net, num_predictions_per_location * self._box_code_size,
[self._kernel_size, self._kernel_size],
scope='BoxEncodingPredictor')
num_predictions_per_location is the number of anchors per location. Class learning is similar:
class_predictions_with_background = slim.conv2d(
net, num_predictions_per_location * num_class_slots,
[self._kernel_size, self._kernel_size], scope='ClassPredictor',
biases_initializer=tf.constant_initializer(
self._class_prediction_bias_init))
Question: what exactly is the box location learned here? It is simply the 'rpn_box_encodings' obtained when the predict function calls _predict_rpn_proposals.
self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
It is later used to compute the loss. What is its relation to the anchor's own location? It is actually the coordinate offset between the predicted box and the anchor box, as described in the paper.
"""
class ConvolutionalBoxPredictor(BoxPredictor)
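A minimal sketch of the resulting output channel counts for the RPN case (the 38x50x512 input is a made-up illustration; assuming num_class_slots = 2 for object vs. background, box_code_size = 4, and the 1x1 kernels mentioned in the Update section):
import tensorflow as tf
num_predictions_per_location = 9   # anchors per grid point
box_code_size = 4
num_class_slots = 2
net = tf.zeros([1, 38, 50, 512])   # intermediate layer from the 3x3 conv
box_encodings = tf.keras.layers.Conv2D(
    num_predictions_per_location * box_code_size, 1)(net)
class_predictions_with_background = tf.keras.layers.Conv2D(
    num_predictions_per_location * num_class_slots, 1)(net)
print(box_encodings.shape)                        # (1, 38, 50, 36)
print(class_predictions_with_background.shape)    # (1, 38, 50, 18)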
"""cls, reg loss
object_detection/meta_architectures/faster_rcnn_meta_arch.py
Here the loss is computed directly from rpn_box_encodings. Presumably batch_reg_targets is the offset between the anchor box and the ground truth. Checking the target_assigner.batch_assign_targets code:
def assign(self, anchors, groundtruth_boxes, groundtruth_labels=None,
**params):
reg_targets = self._create_regression_targets(anchors,
groundtruth_boxes,
match)
_create_regression_targets
matched_reg_targets = self._box_coder.encode(matched_gt_boxes,
matched_anchors)
box_coders/faster_rcnn_box_coder.py
tx = (xcenter - xcenter_a) / wa
ty = (ycenter - ycenter_a) / ha
tw = tf.log(w / wa)
th = tf.log(h / ha)
Following the trail we find it; consistent with the paper.
"""
def _loss_rpn
(batch_cls_targets, batch_cls_weights, batch_reg_targets,
batch_reg_weights, _) = target_assigner.batch_assign_targets(
self._proposal_target_assigner, box_list.BoxList(anchors),
groundtruth_boxlists, len(groundtruth_boxlists)*[None])
batch_cls_targets = tf.squeeze(batch_cls_targets, axis=2)
localization_losses = self._first_stage_localization_loss(
rpn_box_encodings, batch_reg_targets, weights=sampled_reg_indices)
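The first-stage localization loss compares rpn_box_encodings against batch_reg_targets with a smooth-L1 (Huber-style) penalty; a minimal NumPy sketch of smooth-L1 (the library version additionally multiplies by the per-anchor weights, i.e. sampled_reg_indices):
import numpy as np

def smooth_l1(pred, target):
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

print(smooth_l1(np.array([0.1, 2.0]), np.array([0.0, 0.0])))   # [0.005 1.5]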
Next, let's see how the loss formula is concretely implemented.
"""The ground-truth label is 1 if the anchor is positive, and is 0 if the anchor is negative.
An anchor is labeled as positive if:
(a) the anchor is the one with highest IoU overlap with a ground-truth box
(b) the anchor has an IoU overlap with a ground-truth box higher than 0.7
Negative labels are assigned to anchors with IoU lower than 0.3 for all ground-truth
boxes.
50%/50% ratio of positive/negative anchors in a minibatch.
"""
Based on the earlier analysis, the corresponding code should be
(batch_cls_targets, batch_cls_weights, batch_reg_targets,
batch_reg_weights, _) = target_assigner.batch_assign_targets(
self._proposal_target_assigner, box_list.BoxList(anchors),
groundtruth_boxlists, len(groundtruth_boxlists)*[None])
The target_assigner object called here is constructed like this:
self._proposal_target_assigner = target_assigner.create_target_assigner(
'FasterRCNN', 'proposal')
Going into the create_target_assigner function in core/target_assigner.py:
elif reference == 'FasterRCNN' and stage == 'proposal':
similarity_calc = sim_calc.IouSimilarity()
matcher = argmax_matcher.ArgMaxMatcher(matched_threshold=0.7,
unmatched_threshold=0.3,
force_match_for_each_row=True)
box_coder = faster_rcnn_box_coder.FasterRcnnBoxCoder(
scale_factors=[10.0, 10.0, 5.0, 5.0])
The concrete implementation is in:
from object_detection.matchers import argmax_matcher
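A simplified NumPy sketch of the matching rule (illustration only, not the library implementation; the real ArgMaxMatcher handles ties and the forced matches more carefully): each anchor is matched to the ground-truth box with the highest IoU, thresholded at 0.7/0.3, and force_match_for_each_row guarantees every ground-truth box keeps at least one positive anchor.
import numpy as np

def argmax_match(iou, matched_thr=0.7, unmatched_thr=0.3):
    # iou: [num_gt, num_anchors]
    matches = iou.argmax(axis=0)                 # best ground-truth box per anchor
    best_iou = iou.max(axis=0)
    labels = np.full(iou.shape[1], -1)           # -1: ignored (between the two thresholds)
    labels[best_iou >= matched_thr] = 1          # positive
    labels[best_iou < unmatched_thr] = 0         # negative
    forced = iou.argmax(axis=1)                  # force_match_for_each_row
    matches[forced] = np.arange(iou.shape[0])
    labels[forced] = 1
    return matches, labels

iou = np.array([[0.8, 0.2, 0.55],
                [0.1, 0.25, 0.6]])
print(argmax_match(iou))
# anchor 0 -> gt 0 (positive), anchor 1 -> negative,
# anchor 2 -> gt 1 (forced positive even though 0.6 < 0.7)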