
SSD Network Analysis: The MultiBoxLoss Layer

The MultiBoxLoss layer in the SSD network is the caffe implementation of the loss function proposed in Section 2.2 of the paper, and it is a very important part of the whole SSD network.

First, let's look at how the original paper describes this part:

Training objective. The SSD training objective is derived from the MultiBox objective [7,8] but is extended to handle multiple object categories. Let x_{ij}^{p}=\{1,0\} be an indicator for matching the i-th default box to the j-th ground truth box of category p. In the matching strategy above, we can have \sum_{i}x_{ij}^{p}\geq 1. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):

                                                  L(x,c,l,g)=\frac{1}{N}\left(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\right)          (1)

where N is the number of matched default boxes. If N = 0, we set the loss to 0. The localization loss is a Smooth L1 loss [6] between the predicted box (l) and the ground truth box (g) parameters. Similar to Faster R-CNN [2], we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h).

                                               L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in \{cx,cy,w,h\}}x_{ij}^{k}\,smooth_{L1}(l_{i}^{m}-\hat{g}_{j}^{m})

                                                    \hat{g}_{j}^{cx}=(g_{j}^{cx}-d_{i}^{cx})/d_{i}^{w}      \hat{g}_{j}^{cy}=(g_{j}^{cy}-d_{i}^{cy})/d_{i}^{h}          (2)

                                                                \hat{g}_{j}^{w}=\log\left(\frac{g_{j}^{w}}{d_{i}^{w}}\right)   \hat{g}_{j}^{h}=\log\left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)

The confidence loss is the softmax loss over multiple classes confidences (c).

                                       L_{conf}(x,c)=-\sum_{i\in Pos}^{N}x_{ij}^{p}\log(\hat{c}_{i}^{p})-\sum_{i\in Neg}\log(\hat{c}_{i}^{0})     where     \hat{c}_{i}^{p}=\frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}          (3)

and the weight term \alpha is set to 1 by cross validation.

Hard negative mining. After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training.
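Before moving on to the code, the encoding in equation (2) is worth pinning down with a minimal standalone sketch. This is not the SSD source itself (which operates on NormalizedBBox and also folds the prior variances into the target); Box and EncodeCenterSize below are hypothetical names used purely for illustration:

#include <cmath>

// Hypothetical box in (center_x, center_y, width, height) form.
struct Box { float cx, cy, w, h; };

// Encode a ground truth box g against a default (prior) box d, following
// equation (2). Variance scaling is deliberately omitted here.
Box EncodeCenterSize(const Box& g, const Box& d) {
  Box g_hat;
  g_hat.cx = (g.cx - d.cx) / d.w;  // center offsets, normalized by prior size
  g_hat.cy = (g.cy - d.cy) / d.h;
  g_hat.w = std::log(g.w / d.w);   // log-space size ratios
  g_hat.h = std::log(g.h / d.h);
  return g_hat;
}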

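Hard negative mining can be sketched the same way. Assuming a per-prior confidence loss has already been computed (conf_loss and is_positive below are hypothetical inputs; the real selection lives in MineHardExamples in bbox_util.cpp and is considerably more involved), the MAX_NEGATIVE selection essentially boils down to:

#include <algorithm>
#include <vector>

// Pick the unmatched priors with the highest confidence loss, so that
// negatives : positives <= neg_pos_ratio (3:1 in SSD).
std::vector<int> MineNegatives(const std::vector<float>& conf_loss,
                               const std::vector<bool>& is_positive,
                               float neg_pos_ratio) {
  std::vector<int> candidates;
  int num_pos = 0;
  for (int i = 0; i < static_cast<int>(conf_loss.size()); ++i) {
    if (is_positive[i]) { ++num_pos; } else { candidates.push_back(i); }
  }
  // Sort the unmatched priors by descending confidence loss.
  std::sort(candidates.begin(), candidates.end(),
            [&](int a, int b) { return conf_loss[a] > conf_loss[b]; });
  const int num_neg = std::min(static_cast<int>(candidates.size()),
                               static_cast<int>(neg_pos_ratio * num_pos));
  candidates.resize(num_neg);  // keep only the hardest negatives
  return candidates;
}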

With the above understood, we also need to look at the MultiBoxLoss layer's configurable parameters, i.e. the MultiBoxLossParameter message defined in the caffe.proto file, shown below (there are quite a few parameters; the English comments explain each one):

// Message that store parameters used by MultiBoxLossLayer
message MultiBoxLossParameter {
  // Localization loss type.
  enum LocLossType {
    L2 = 0;
    SMOOTH_L1 = 1;
  }
  optional LocLossType loc_loss_type = 1 [default = SMOOTH_L1];
  // Confidence loss type.
  enum ConfLossType {
    SOFTMAX = 0;
    LOGISTIC = 1;
  }
  optional ConfLossType conf_loss_type = 2 [default = SOFTMAX];
  // Weight for localization loss.
  optional float loc_weight = 3 [default = 1.0];
  // Number of classes to be predicted. Required!
  optional uint32 num_classes = 4;
  // If true, bounding boxes are shared among different classes.
  optional bool share_location = 5 [default = true];
  // Matching method during training.
  enum MatchType {
    BIPARTITE = 0;
    PER_PREDICTION = 1;
  }
  optional MatchType match_type = 6 [default = PER_PREDICTION];
  // If match_type is PER_PREDICTION, use overlap_threshold to
  // determine the extra matching bboxes.
  optional float overlap_threshold = 7 [default = 0.5];
  // Use prior for matching.
  optional bool use_prior_for_matching = 8 [default = true];
  // Background label id.
  optional uint32 background_label_id = 9 [default = 0];
  // If true, also consider difficult ground truth.
  optional bool use_difficult_gt = 10 [default = true];
  // If true, perform negative mining.
  // DEPRECATED: use mining_type instead.
  optional bool do_neg_mining = 11;
  // The negative/positive ratio.
  optional float neg_pos_ratio = 12 [default = 3.0];
  // The negative overlap upperbound for the unmatched predictions.
  optional float neg_overlap = 13 [default = 0.5];
  // Type of coding method for bbox.
  optional PriorBoxParameter.CodeType code_type = 14 [default = CORNER];
  // If true, encode the variance of prior box in the loc loss target instead of
  // in bbox.
  optional bool encode_variance_in_target = 16 [default = false];
  // If true, map all object classes to agnostic class. It is useful for learning
  // objectness detector.
  optional bool map_object_to_agnostic = 17 [default = false];
  // If true, ignore cross boundary bbox during matching.
  // Cross boundary bbox is a bbox who is outside of the image region.
  optional bool ignore_cross_boundary_bbox = 18 [default = false];
  // If true, only backpropagate on corners which are inside of the image
  // region when encode_type is CORNER or CORNER_SIZE.
  optional bool bp_inside = 19 [default = false];
  // Mining type during training.
  //   NONE : use all negatives.
  //   MAX_NEGATIVE : select negatives based on the score.
  //   HARD_EXAMPLE : select hard examples based on "Training Region-based Object Detectors with Online Hard Example Mining", Shrivastava et al.
  enum MiningType {
    NONE = 0;
    MAX_NEGATIVE = 1;
    HARD_EXAMPLE = 2;
  }
  optional MiningType mining_type = 20 [default = MAX_NEGATIVE];
  // Parameters used for non maximum suppression during hard example mining.
  optional NonMaximumSuppressionParameter nms_param = 21;
  optional int32 sample_size = 22 [default = 64];
  optional bool use_prior_for_nms = 23 [default = false];
}

A few of these parameters deserve extra explanation:

loc_weight: corresponds to the weight term \alpha in equation (1) above

num_classes: the number of object classes in the training set

share_location: whether, in localization prediction, predicted boxes are shared across object classes (without sharing, num_priors*num_classes predicted boxes are produced; with sharing, only num_priors. SSD uses the shared form: a predicted box can only belong to one ground truth box, and a ground truth box contains only one object, so sharing is sufficient and also speeds up execution)

use_prior_for_matching: whether to use the default boxes for matching (analyzed in the functions later on)

do_neg_mining: whether to perform negative sample mining (corresponding to hard negative mining above)

Having covered the settings in caffe.proto, let's look directly at how SSD invokes this layer (taken from train.prototxt of SSD300):

layer {
  name: "mbox_loss"
  type: "MultiBoxLoss"
  bottom: "mbox_loc"
  bottom: "mbox_conf"
  bottom: "mbox_priorbox"
  bottom: "label"
  top: "mbox_loss"
  include {
    phase: TRAIN
  }
  propagate_down: true
  propagate_down: true
  propagate_down: false
  propagate_down: false
  loss_param {
    normalization: VALID
  }
  multibox_loss_param {
    loc_loss_type: SMOOTH_L1
    conf_loss_type: SOFTMAX
    loc_weight: 1.0
    num_classes: 21
    share_location: true
    match_type: PER_PREDICTION
    overlap_threshold: 0.5
    use_prior_for_matching: true
    background_label_id: 0
    use_difficult_gt: true
    neg_pos_ratio: 3.0
    neg_overlap: 0.5
    code_type: CENTER_SIZE
    ignore_cross_boundary_bbox: false
    mining_type: MAX_NEGATIVE
  }
}

As can be seen, there are four input blobs and one output blob. The input blobs mean the following:

bottom[0]: stores the localization prediction of every predicted box, with size num×(num_priors_*loc_classes_*4), i.e. bottom[0]->num()=num and bottom[0]->channels()=num_priors_*loc_classes_*4 (where num is the batch size, i.e. the number of input images; num_priors_ is the number of default boxes; loc_classes_ depends on share_location_ and is 1 in SSD by default, i.e. predicted boxes are shared across classes; 4 stands for the four location parameters of each predicted box). This blob comes from the output of the mbox_loc layer in train.prototxt (a Concat layer that concatenates the location prediction data of the individual prediction layers into a single blob).

bottom[1]: stores the per-class confidence of every predicted box, with size num×(num_priors_*21), i.e. bottom[1]->num()=num and bottom[1]->channels()=num_priors_*21 (21 is specific to the VOC dataset, which has 21 classes including background). This blob comes from the output of the mbox_conf layer in train.prototxt (likewise a Concat layer).

bottom[2]: stores the top-left and bottom-right coordinates of every default box together with their variances, with size 1×2×(num_priors_*4). bottom[2]->num()=1 means all input images share one set of default boxes (possible because all input images have the same dimensions); bottom[2]->channels()=2, where the first channel holds the corner coordinates and the second holds the variances of those coordinates. This blob comes from the output of the mbox_priorbox layer in train.prototxt (likewise a Concat layer).

bottom[3]: stores the ground truth information; its exact size is not covered here and requires looking at how SSD's data layer loads annotations.
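(A hedged note for the curious: in the released SSD code, the annotated data layer appears to emit bottom[3] as a 1×1×num_gt×8 blob, each row holding item_id, group_label, instance_id, xmin, ymin, xmax, ymax and a difficult flag; this is consistent with the num_gt_ = bottom[3]->height() line in Reshape() below, but treat it as an assumption until the data layer itself is examined.)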

I. The multibox_loss_layer.hpp header

Not much commentary is needed here; it follows the standard pattern for defining a loss layer in caffe. A quick skim is enough; the real work is in the cpp file:

#ifndef CAFFE_MULTIBOX_LOSS_LAYER_HPP_
#define CAFFE_MULTIBOX_LOSS_LAYER_HPP_

#include <map>
#include <utility>
#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
#include "caffe/util/bbox_util.hpp"  //大部分函式都在這標頭檔案裡

#include "caffe/layers/loss_layer.hpp" 

namespace caffe {

/**
 * @brief Perform MultiBox operations. Including the following:
 *
 *  - decode the predictions.
 *  - perform matching between priors/predictions and ground truth.
 *  - use matched boxes and confidences to compute loss.
 *
 */
template <typename Dtype>
class MultiBoxLossLayer : public LossLayer<Dtype> {  // inherits from LossLayer
 public:
  explicit MultiBoxLossLayer(const LayerParameter& param)
      : LossLayer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "MultiBoxLoss"; }
  // bottom[0] stores the location predictions (the four parameters that determine each bounding box's position and size).
  // bottom[1] stores the confidence predictions (the confidence of each object class for every box).
  // bottom[2] stores the prior bounding boxes (the parameters of the "default boxes" in the paper).
  // bottom[3] stores the ground truth bounding boxes.
  virtual inline int ExactNumBottomBlobs() const { return 4; }  // four input blobs
  virtual inline int ExactNumTopBlobs() const { return 1; }  // one output blob

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  // The internal localization loss layer.
  shared_ptr<Layer<Dtype> > loc_loss_layer_;  // internal localization loss layer
  LocLossType loc_loss_type_;  // localization loss type
  float loc_weight_;  // localization loss weight
  // bottom vector holder used in Forward function.
  vector<Blob<Dtype>*> loc_bottom_vec_;  // holds the input blobs of the localization loss, used in the forward pass
  // top vector holder used in Forward function.
  vector<Blob<Dtype>*> loc_top_vec_;  // holds the output blob of the localization loss, also used in the forward pass
  // blob which stores the matched location prediction.
  Blob<Dtype> loc_pred_;  // stores the matched location predictions
  // blob which stores the corresponding matched ground truth.
  Blob<Dtype> loc_gt_;  // stores the corresponding matched ground truth
  // localization loss.
  Blob<Dtype> loc_loss_;  // stores the localization loss

  // The internal confidence loss layer.
  shared_ptr<Layer<Dtype> > conf_loss_layer_;  // internal confidence loss layer
  ConfLossType conf_loss_type_;  // confidence loss type
  // bottom vector holder used in Forward function.
  vector<Blob<Dtype>*> conf_bottom_vec_;  // holds the input blobs of the confidence loss, used in the forward pass
  // top vector holder used in Forward function.
  vector<Blob<Dtype>*> conf_top_vec_;  // holds the output blob of the confidence loss, also used in the forward pass
  // blob which stores the confidence prediction.
  Blob<Dtype> conf_pred_;  // stores the confidence predictions
  // blob which stores the corresponding ground truth label.
  Blob<Dtype> conf_gt_;  // stores the corresponding ground truth labels
  // confidence loss.
  Blob<Dtype> conf_loss_;  // confidence loss

  MultiBoxLossParameter multibox_loss_param_;  // stores this layer's parameters
  int num_classes_;  // number of object classes in the training set
  bool share_location_;  // whether bounding boxes are shared across object classes
  MatchType match_type_;  // matching type
  float overlap_threshold_;  // IoU threshold used for matching
  bool use_prior_for_matching_;  // whether to use the default boxes for matching
  int background_label_id_;  // label id of the background class
  bool use_difficult_gt_;  // whether to consider difficult ground truth
  bool do_neg_mining_;  // whether negative mining is to be performed
  float neg_pos_ratio_;  // negative:positive sample ratio
  float neg_overlap_;  // IoU upper bound used when selecting negative samples
  CodeType code_type_;  // encoding/decoding type
  bool encode_variance_in_target_;  // whether to encode the variance in the regression target
  bool map_object_to_agnostic_;  // whether to map all object classes to a class-agnostic one
  bool ignore_cross_boundary_bbox_;  // whether to ignore boxes extending outside the image (predicted/default boxes)
  bool bp_inside_;  // if true, when code_type_ is CORNER or CORNER_SIZE, only back-propagate on corners inside the image (default false)
  MiningType mining_type_;  // negative mining type

  int loc_classes_;  // number of location classes, determined by share_location_
  int num_gt_;  // number of ground truth boxes
  int num_;  // batch size
  int num_priors_;  // total number of default boxes per image

  int num_matches_;  // number of prediction/ground-truth match pairs over the whole batch
  int num_conf_;  // number of positive and negative samples used for the confidence loss
  vector<map<int, vector<int> > > all_match_indices_;  // indices of all matched pairs
  vector<vector<int> > all_neg_indices_;  // indices of all selected negative samples

  // How to normalize the loss.
  LossParameter_NormalizationMode normalization_;
};

}  // namespace caffe

#endif  // CAFFE_MULTIBOX_LOSS_LAYER_HPP_

II. The multibox_loss_layer.cpp file

The amount of code to read below is large, so brace yourself (most of the functions used in it are implemented in bbox_util.cpp).

(1) The LayerSetUp() function

template <typename Dtype>
void MultiBoxLossLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);  // since this inherits from LossLayer, first call LossLayer's LayerSetUp for the common initialization (all loss layers do this)
  if (this->layer_param_.propagate_down_size() == 0) {  // if no per-bottom back-propagation flags were specified, use the defaults:
    // bottom[0] and bottom[1] are back-propagated; bottom[2] and bottom[3] are not
    this->layer_param_.add_propagate_down(true);
    this->layer_param_.add_propagate_down(true);
    this->layer_param_.add_propagate_down(false);
    this->layer_param_.add_propagate_down(false);
  }
  // fetch this layer's configured parameters
  const MultiBoxLossParameter& multibox_loss_param =
      this->layer_param_.multibox_loss_param();
  multibox_loss_param_ = this->layer_param_.multibox_loss_param();

  num_ = bottom[0]->num();  // batch size
  num_priors_ = bottom[2]->height() / 4;  // number of default boxes (the 4 are a default box's corner coordinates, or the variances of those coordinates; see the PriorBox layer)
  // Get other parameters.
  CHECK(multibox_loss_param.has_num_classes()) << "Must provide num_classes.";  // the number of object classes in the training set must be given
  num_classes_ = multibox_loss_param.num_classes();
  CHECK_GE(num_classes_, 1) << "num_classes should not be less than 1.";
  share_location_ = multibox_loss_param.share_location();  // whether bounding boxes are shared across object classes
  loc_classes_ = share_location_ ? 1 : num_classes_;  // set loc_classes_ according to sharing
  background_label_id_ = multibox_loss_param.background_label_id();  // label id of the background class
  use_difficult_gt_ = multibox_loss_param.use_difficult_gt();  // whether to consider difficult ground truth
  mining_type_ = multibox_loss_param.mining_type();  // negative mining type
  if (multibox_loss_param.has_do_neg_mining()) {
    LOG(WARNING) << "do_neg_mining is deprecated, use mining_type instead.";
    do_neg_mining_ = multibox_loss_param.do_neg_mining();
    CHECK_EQ(do_neg_mining_,
             mining_type_ != MultiBoxLossParameter_MiningType_NONE);
  }
  do_neg_mining_ = mining_type_ != MultiBoxLossParameter_MiningType_NONE;  // determines whether negative mining will be performed

  if (!this->layer_param_.loss_param().has_normalization() &&
      this->layer_param_.loss_param().has_normalize()) {
    normalization_ = this->layer_param_.loss_param().normalize() ?  // if normalize is true use VALID mode, otherwise BATCH_SIZE mode
                     LossParameter_NormalizationMode_VALID :  // VALID normalizes by the number of valid outputs (samples), ignoring ignore_label
                     LossParameter_NormalizationMode_BATCH_SIZE;  // BATCH_SIZE normalizes by the batch size
  } else {
    normalization_ = this->layer_param_.loss_param().normalization();  // otherwise the normalization parameter decides (FULL/VALID/BATCH_SIZE); the default is VALID
  }

  if (do_neg_mining_) {
    CHECK(share_location_)
        << "Currently only support negative mining if share_location is true.";
  }

  vector<int> loss_shape(1, 1);
  // Set up localization loss layer.
  loc_weight_ = multibox_loss_param.loc_weight();  // localization loss weight, the alpha of the overall objective in Section 2.2 of the paper
  loc_loss_type_ = multibox_loss_param.loc_loss_type();  // localization loss type
  // fake shape.
  vector<int> loc_shape(1, 1);  // a one-element int vector initialized to 1
  loc_shape.push_back(4);
  loc_pred_.Reshape(loc_shape);  // stores the matched location predictions
  loc_gt_.Reshape(loc_shape);  // stores the corresponding matched ground truth
  loc_bottom_vec_.push_back(&loc_pred_);  // loc_bottom_vec_ holds loc_pred_ and loc_gt_, the input blobs
  loc_bottom_vec_.push_back(&loc_gt_);
  loc_loss_.Reshape(loss_shape);  // stores the localization loss
  loc_top_vec_.push_back(&loc_loss_);  // loc_top_vec_ holds loc_loss_, the output blob
  if (loc_loss_type_ == MultiBoxLossParameter_LocLossType_L2) {  // Euclidean (L2) loss
    LayerParameter layer_param;
    layer_param.set_name(this->layer_param_.name() + "_l2_loc");
    layer_param.set_type("EuclideanLoss");
    layer_param.add_loss_weight(loc_weight_);
    loc_loss_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);  // create a loss layer from layer_param (i.e. instantiate a registered EuclideanLoss layer)
    loc_loss_layer_->SetUp(loc_bottom_vec_, loc_top_vec_);  // call Layer::SetUp to initialize the EuclideanLoss layer
  } else if (loc_loss_type_ == MultiBoxLossParameter_LocLossType_SMOOTH_L1) {  // Smooth L1 loss (see Fast R-CNN for details)
    LayerParameter layer_param;
    layer_param.set_name(this->layer_param_.name() + "_smooth_L1_loc");
    layer_param.set_type("SmoothL1Loss");
    layer_param.add_loss_weight(loc_weight_);
    loc_loss_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    loc_loss_layer_->SetUp(loc_bottom_vec_, loc_top_vec_);  // call Layer::SetUp to initialize the SmoothL1Loss layer
  } else {
    LOG(FATAL) << "Unknown localization loss type.";
  }
  // Set up confidence loss layer.
  conf_loss_type_ = multibox_loss_param.conf_loss_type();  // confidence loss type
  conf_bottom_vec_.push_back(&conf_pred_);  // conf_pred_ and conf_gt_ are not sized yet (their sizes depend on the chosen loss type)
  conf_bottom_vec_.push_back(&conf_gt_);
  conf_loss_.Reshape(loss_shape);
  conf_top_vec_.push_back(&conf_loss_);
  if (conf_loss_type_ == MultiBoxLossParameter_ConfLossType_SOFTMAX) {  // softmax loss
    // first check that the background class lies within the configured classes
    CHECK_GE(background_label_id_, 0)
        << "background_label_id should be within [0, num_classes) for Softmax.";
    CHECK_LT(background_label_id_, num_classes_)
        << "background_label_id should be within [0, num_classes) for Softmax.";
    LayerParameter layer_param;
    layer_param.set_name(this->layer_param_.name() + "_softmax_conf");
    layer_param.set_type("SoftmaxWithLoss");
    layer_param.add_loss_weight(Dtype(1.));  // the loss weight is always 1, matching the confidence term of the overall objective in Section 2.2
    layer_param.mutable_loss_param()->set_normalization(
        LossParameter_NormalizationMode_NONE);
    SoftmaxParameter* softmax_param = layer_param.mutable_softmax_param();
    softmax_param->set_axis(1);
    // Fake reshape.
    vector<int> conf_shape(1, 1);
    conf_gt_.Reshape(conf_shape);
    conf_shape.push_back(num_classes_);
    conf_pred_.Reshape(conf_shape);
    conf_loss_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);  // instantiate a registered SoftmaxWithLoss layer
    conf_loss_layer_->SetUp(conf_bottom_vec_, conf_top_vec_);  // call Layer::SetUp to initialize the SoftmaxWithLoss layer
  } else if (conf_loss_type_ == MultiBoxLossParameter_ConfLossType_LOGISTIC) {  // logistic loss (sigmoid cross-entropy)
    LayerParameter layer_param;
    layer_param.set_name(this->layer_param_.name() + "_logistic_conf");
    layer_param.set_type("SigmoidCrossEntropyLoss");
    layer_param.add_loss_weight(Dtype(1.));
    // Fake reshape.
    vector<int> conf_shape(1, 1);
    conf_shape.push_back(num_classes_);
    conf_gt_.Reshape(conf_shape);
    conf_pred_.Reshape(conf_shape);
    conf_loss_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    conf_loss_layer_->SetUp(conf_bottom_vec_, conf_top_vec_);  // call Layer::SetUp to initialize the SigmoidCrossEntropyLoss layer
  } else {
    LOG(FATAL) << "Unknown confidence loss type.";
  }
}

LayerSetUp() serves two purposes:

1. Initialize the layer, loading the series of parameter values defined in caffe.proto;

2. Create a localization loss layer and a confidence loss layer (both built with fake shapes, i.e. initialized with placeholder input/output blob sizes), to be used in the subsequent forward computation and backward propagation.

The localization loss and the confidence loss in equation (1) can be computed separately: the two losses are summed, and they mostly involve different variables, so differentiation can be done separately too (even shared variables can be differentiated separately and the gradients added). Since they can be computed separately, we can reuse caffe's existing loss layers to implement the forward loss computation and backward gradient propagation for each. The layer-within-a-layer design here, embedding a localization loss layer and a confidence loss layer inside the MultiBoxLoss loss layer, is refreshing, and I learned a lot from it.
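To make the layer-within-a-layer pattern concrete, here is the internal layer's full lifecycle, condensed from the code in this post (the identifiers are the member variables introduced above; this is a condensed sketch, not a compilable unit on its own):

// 1. At setup time: create the internal layer from a LayerParameter and
//    initialize it with fake-shaped holder blobs.
LayerParameter layer_param;
layer_param.set_type("SmoothL1Loss");
layer_param.add_loss_weight(loc_weight_);
loc_loss_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
loc_loss_layer_->SetUp(loc_bottom_vec_, loc_top_vec_);

// 2. In every forward pass: resize the holder blobs to the number of matches
//    actually found, then run the internal layer.
loc_pred_.Reshape(loc_shape);
loc_gt_.Reshape(loc_shape);
loc_loss_layer_->Reshape(loc_bottom_vec_, loc_top_vec_);
loc_loss_layer_->Forward(loc_bottom_vec_, loc_top_vec_);

// 3. In the backward pass: back-propagate only into the prediction blob.
vector<bool> loc_propagate_down;
loc_propagate_down.push_back(true);   // predictions
loc_propagate_down.push_back(false);  // ground truth
loc_loss_layer_->Backward(loc_top_vec_, loc_propagate_down, loc_bottom_vec_);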

Note: for the SmoothL1LossLayer itself, see the linked post.
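For reference, the function itself is simple. A sketch of smooth L1 as defined in Fast R-CNN (this is the standard textbook definition, not code lifted from SmoothL1LossLayer):

#include <cmath>

// smooth_L1(x) = 0.5 * x^2   if |x| < 1
//              = |x| - 0.5   otherwise
float SmoothL1(float x) {
  const float abs_x = std::fabs(x);
  return abs_x < 1.0f ? 0.5f * x * x : abs_x - 0.5f;
}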

(2) The Reshape() function

template <typename Dtype>
void MultiBoxLossLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::Reshape(bottom, top);  // call LossLayer's Reshape for the common part
  num_ = bottom[0]->num();  // batch size
  num_priors_ = bottom[2]->height() / 4;  // number of default boxes
  num_gt_ = bottom[3]->height();  // number of ground truth boxes
  CHECK_EQ(bottom[0]->num(), bottom[1]->num());
  CHECK_EQ(num_priors_ * loc_classes_ * 4, bottom[0]->channels())
      << "Number of priors must match number of location predictions.";  // the default box parameters must match the location predictions
  CHECK_EQ(num_priors_ * num_classes_, bottom[1]->channels())
      << "Number of priors must match number of confidence predictions.";  // the number of default boxes must match the confidence predictions
}

This function is fairly simple: it just fetches a few counts and performs the corresponding consistency checks.
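As a worked example of these checks (assuming the standard SSD300 configuration): the six prediction layers contribute 38×38×4 + 19×19×6 + 10×10×6 + 5×5×6 + 3×3×4 + 1×1×4 = 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 default boxes, so with share_location: true and VOC's 21 classes, bottom[0]->channels() must be 8732×1×4 = 34928 and bottom[1]->channels() must be 8732×21 = 183372.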

(3) The Forward_cpu() function

// Forward pass
template <typename Dtype>
void MultiBoxLossLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* loc_data = bottom[0]->cpu_data();  // location predictions
  const Dtype* conf_data = bottom[1]->cpu_data();  // confidence predictions
  const Dtype* prior_data = bottom[2]->cpu_data();  // default box data
  const Dtype* gt_data = bottom[3]->cpu_data();  // ground truth data

  // Retrieve all ground truth.
  // note: the NormalizedBBox class is defined in caffe.proto
  map<int, vector<NormalizedBBox> > all_gt_bboxes;
  GetGroundTruth(gt_data, num_gt_, background_label_id_, use_difficult_gt_,
                 &all_gt_bboxes);

  // Retrieve all prior bboxes. It is same within a batch since we assume all
  // images in a batch are of same dimension.
  // retrieve the default boxes
  // note: all the feature maps of a layer share one set of default boxes
  vector<NormalizedBBox> prior_bboxes;
  vector<vector<float> > prior_variances;
  GetPriorBBoxes(prior_data, num_priors_, &prior_bboxes, &prior_variances);

  // Retrieve all predictions.
  vector<LabelBBox> all_loc_preds;
  GetLocPredictions(loc_data, num_, num_priors_, loc_classes_, share_location_,
                    &all_loc_preds);

  // Find matches between source bboxes and ground truth bboxes.
  // find the predicted boxes that match the ground truth boxes
  vector<map<int, vector<float> > > all_match_overlaps;
  FindMatches(all_loc_preds, all_gt_bboxes, prior_bboxes, prior_variances,
              multibox_loss_param_, &all_match_overlaps, &all_match_indices_);

  num_matches_ = 0;
  int num_negs = 0;
  // Sample hard negative (and positive) examples based on mining type.
  // negative mining (prevents negatives from vastly outnumbering positives and unbalancing training; see Section 2.2 of the paper)
  MineHardExamples(*bottom[1], all_loc_preds, all_gt_bboxes, prior_bboxes,
                   prior_variances, all_match_overlaps, multibox_loss_param_,
                   &num_matches_, &num_negs, &all_match_indices_,
                   &all_neg_indices_);  // all_match_indices_ holds all prediction/ground-truth match pairs; all_neg_indices_ holds all negative sample indices
  // forward pass of the localization loss
  if (num_matches_ >= 1) {  // the localization loss is computed only for matched predictions; unmatched ones contribute 0 and are skipped
    // Form data to pass on to loc_loss_layer_.
    vector<int> loc_shape(2);
    loc_shape[0] = 1;
    loc_shape[1] = num_matches_ * 4;
    loc_pred_.Reshape(loc_shape);
    loc_gt_.Reshape(loc_shape);
    Dtype* loc_pred_data = loc_pred_.mutable_cpu_data();
    Dtype* loc_gt_data = loc_gt_.mutable_cpu_data();
    // form the encoded prediction/ground-truth pairs (see Section 2.2 of the paper)
    EncodeLocPrediction(all_loc_preds, all_gt_bboxes, all_match_indices_,
                        prior_bboxes, prior_variances, multibox_loss_param_,
                        loc_pred_data, loc_gt_data);
    loc_loss_layer_->Reshape(loc_bottom_vec_, loc_top_vec_);  // call loc_loss_layer_'s Reshape to re-size it
    loc_loss_layer_->Forward(loc_bottom_vec_, loc_top_vec_);  // forward pass of loc_loss_layer_
  } else {
    loc_loss_.mutable_cpu_data()[0] = 0;
  }

  // Form data to pass on to conf_loss_layer_.
  // forward pass of the confidence loss (which, unlike the localization loss, is not restricted to the matched boxes)
  if (do_neg_mining_) { 
    num_conf_ = num_matches_ + num_negs;
  } else {
    num_conf_ = num_ * num_priors_;
  }
  if (num_conf_ >= 1) {
    // Reshape the confidence data.
    vector<int> conf_shape;
    if (conf_loss_type_ == MultiBoxLossParameter_ConfLossType_SOFTMAX) {
      conf_shape.push_back(num_conf_);
      conf_gt_.Reshape(conf_shape);
      conf_shape.push_back(num_classes_);
      conf_pred_.Reshape(conf_shape);
    } else if (conf_loss_type_ == MultiBoxLossParameter_ConfLossType_LOGISTIC) {
      conf_shape.push_back(1);
      conf_shape.push_back(num_conf_);
      conf_shape.push_back(num_classes_);
      conf_gt_.Reshape(conf_shape);
      conf_pred_.Reshape(conf_shape);
    } else {
      LOG(FATAL) << "Unknown confidence loss type.";
    }
    if (!do_neg_mining_) {  // if no negative mining is performed
      // Consider all scores.
      // Share data and diff with bottom[1].
      CHECK_EQ(conf_pred_.count(), bottom[1]->count());
      conf_pred_.ShareData(*(bottom[1]));  // conf_pred_ directly uses all the confidence predictions
    }
    Dtype* conf_pred_data = conf_pred_.mutable_cpu_data();
    Dtype* conf_gt_data = conf_gt_.mutable_cpu_data();
    caffe_set(conf_gt_.count(), Dtype(background_label_id_), conf_gt_data);  // initialize via caffe_set, using the background class label
    // gather the confidence predictions and ground truth labels needed for the loss
    EncodeConfPrediction(conf_data, num_, num_priors_, multibox_loss_param_,
                         all_match_indices_, all_neg_indices_, all_gt_bboxes,
                         conf_pred_data, conf_gt_data);
    conf_loss_layer_->Reshape(conf_bottom_vec_, conf_top_vec_);  // call conf_loss_layer_'s Reshape to re-size it
    conf_loss_layer_->Forward(conf_bottom_vec_, conf_top_vec_);  // forward pass of conf_loss_layer_
  } else {
    conf_loss_.mutable_cpu_data()[0] = 0;
  }

  top[0]->mutable_cpu_data()[0] = 0;
  if (this->layer_param_.propagate_down(0)) {
    // with VALID normalization, normalizer = num_matches_ (SSD uses VALID; this is the N of Section 2.2 of the paper)
    Dtype normalizer = LossLayer<Dtype>::GetNormalizer(
        normalization_, num_, num_priors_, num_matches_);
    top[0]->mutable_cpu_data()[0] +=
        loc_weight_ * loc_loss_.cpu_data()[0] / normalizer;
  }
  if (this->layer_param_.propagate_down(1)) {
    Dtype normalizer = LossLayer<Dtype>::GetNormalizer(
        normalization_, num_, num_priors_, num_matches_);
    top[0]->mutable_cpu_data()[0] += conf_loss_.cpu_data()[0] / normalizer;
  }
}

The various functions called in Forward_cpu() and implemented in bbox_util.cpp are analyzed in the linked post.
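Since the normalizer appears in both the forward and the backward pass, it is worth spelling out. Below is a paraphrase of LossLayer<Dtype>::GetNormalizer from the SSD fork of caffe (treat the exact code as an assumption; what matters is the behaviour of each mode):

#include <algorithm>
#include "caffe/proto/caffe.pb.h"  // for LossParameter_NormalizationMode

// valid_count is num_matches_ here; -1 means "not counted".
template <typename Dtype>
Dtype GetNormalizer(LossParameter_NormalizationMode mode,
                    int outer_num, int inner_num, int valid_count) {
  Dtype normalizer;
  switch (mode) {
    case LossParameter_NormalizationMode_FULL:
      normalizer = Dtype(outer_num * inner_num);  // every prediction counts
      break;
    case LossParameter_NormalizationMode_VALID:
      normalizer = (valid_count == -1) ?
          Dtype(outer_num * inner_num) : Dtype(valid_count);  // the N matched boxes
      break;
    case LossParameter_NormalizationMode_BATCH_SIZE:
      normalizer = Dtype(outer_num);  // the batch size
      break;
    default:  // NONE
      normalizer = Dtype(1);
      break;
  }
  // never divide by zero; this also covers the N = 0 case of equation (1)
  return std::max(Dtype(1.0), normalizer);
}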

(4) The Backward_cpu() function

// Backward pass
template <typename Dtype>
void MultiBoxLossLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {

  if (propagate_down[2]) {
    LOG(FATAL) << this->type()
        << " Layer cannot backpropagate to prior inputs.";  // the layer cannot back-propagate to the default boxes
  }
  if (propagate_down[3]) {
    LOG(FATAL) << this->type()
        << " Layer cannot backpropagate to label inputs.";  // the layer cannot back-propagate to the ground truth (labels)
  }  // hence the four propagate_down flags in the prototxt (the first two true, the last two false)

  // Back propagate on location prediction.
  if (propagate_down[0]) {
    Dtype* loc_bottom_diff = bottom[0]->mutable_cpu_diff();
    caffe_set(bottom[0]->count(), Dtype(0), loc_bottom_diff);  // zero-initialize loc_bottom_diff first
    if (num_matches_ >= 1) {
      vector<bool> loc_propagate_down;
      // Only back propagate on prediction, not ground truth.
      // back-propagation flags must be set when calling loc_loss_layer_'s Backward()
      loc_propagate_down.push_back(true);  // the location predictions need back-propagation
      loc_propagate_down.push_back(false);  // the ground truth does not
      loc_loss_layer_->Backward(loc_top_vec_, loc_propagate_down,
                                loc_bottom_vec_);
      // Scale gradient.
      // the gradient is scaled because the loss itself carries a scale factor, so the computed gradient must carry the same factor
      Dtype normalizer = LossLayer<Dtype>::GetNormalizer(
          normalization_, num_, num_priors_, num_matches_);
      // the scale factor is essentially loss weight / normalizer (top[0]->cpu_diff()[0] holds the loss weight;
      // see https://blog.csdn.net/qq_21368481/article/details/81950538 for details)
      Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer;
      caffe_scal(loc_pred_.count(), loss_weight, loc_pred_.mutable_cpu_diff());  // scale via caffe_scal
      // Copy gradient back to bottom[0].
      // use the match correspondences (stored in all_match_indices_) to copy each gradient to the right place
      const Dtype* loc_pred_diff = loc_pred_.cpu_diff();
      int count = 0;
      for (int i = 0; i < num_; ++i) {  // loop over all input images
        for (map<int, vector<int> >::iterator it =
             all_match_indices_[i].begin();
             it != all_match_indices_[i].end(); ++it) {  // loop over all object classes
          const int label = share_location_ ? 0 : it->first;
          const vector<int>& match_index = it->second;
          for (int j = 0; j < match_index.size(); ++j) {  // loop over all predicted boxes
            if (match_index[j] <= -1) {  // skip if there is no match
              continue;
            }
            // Copy the diff to the right place.
            int start_idx = loc_classes_ * 4 * j + label * 4;  // compute the target index
            caffe_copy<Dtype>(4, loc_pred_diff + count * 4,
                              loc_bottom_diff + start_idx);  // copy via caffe_copy
            ++count;
          }
        }
        loc_bottom_diff += bottom[0]->offset(1);  // advance the pointer by one image via offset
      }
    }
  }

  // Back propagate on confidence prediction.
  if (propagate_down[1]) {
    Dtype* conf_bottom_diff = bottom[1]->mutable_cpu_diff();
    caffe_set(bottom[1]->count(), Dtype(0), conf_bottom_diff);  // zero-initialize conf_bottom_diff first
    if (num_conf_ >= 1) {
      vector<bool> conf_propagate_down;
      // Only back propagate on prediction, not ground truth.
      conf_propagate_down.push_back(true);
      conf_propagate_down.push_back(false);
      conf_loss_layer_->Backward(conf_top_vec_, conf_propagate_down,
                                 conf_bottom_vec_);  // invoke conf_loss_layer_'s backward pass
      // Scale gradient. (same reasoning as in the localization part above)
      Dtype normalizer = LossLayer<Dtype>::GetNormalizer(
          normalization_, num_, num_priors_, num_matches_);
      Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer;
      caffe_scal(conf_pred_.count(), loss_weight,
                 conf_pred_.mutable_cpu_diff());
      // Copy gradient back to bottom[1].
      // scatter the gradients via the positive sample indices (in all_match_indices_) and negative sample indices (in all_neg_indices_)
      const Dtype* conf_pred_diff = conf_pred_.cpu_diff();
      if (do_neg_mining_) {
        int count = 0;
        for (int i = 0; i < num_; ++i) {  // loop over all input images
          // Copy matched (positive) bboxes scores' diff.
          // first copy the gradients of the positive samples
          const map<int, vector<int> >& match_indices = all_match_indices_[i];
          for (map<int, vector<int> >::const_iterator it =
               match_indices.begin(); it != match_indices.end(); ++it) {  // loop over all object classes
            const vector<int>& match_index = it->second;
            CHECK_EQ(match_index.size(), num_priors_);
            for (int j = 0; j < num_priors_; ++j) {  // loop over all predicted boxes
              if (match_index[j] <= -1) {  // skip if there is no match
                continue;
              }
              // Copy the diff to the right place.
              caffe_copy<Dtype>(num_classes_,
                                conf_pred_diff + count * num_classes_,
                                conf_bottom_diff + j * num_classes_);  // copy to the right place via caffe_copy
              ++count;
            }
          }
          // Copy negative bboxes scores' diff.
          // then copy the gradients of the negative samples
          for (int n = 0; n < all_neg_indices_[i].size(); ++n) {
            int j = all_neg_indices_[i][n];
            CHECK_LT(j, num_priors_);
            caffe_copy<Dtype>(num_classes_,
                              conf_pred_diff + count * num_classes_,
                              conf_bottom_diff + j * num_classes_);  // copy to the right place via caffe_copy
            ++count;
          }
          conf_bottom_diff += bottom[1]->offset(1);  // advance the pointer by one image via offset
        }
      } else {
        // The diff is already computed and stored.
        bottom[1]->ShareDiff(conf_pred_);  // without negative mining, simply share conf_pred_'s diff
      }
    }
  }

  // After backward, remove match statistics.
  // after back-propagation, clear the index containers used during this iteration
  all_match_indices_.clear();
  all_neg_indices_.clear();
}

The backward pass essentially just calls the Backward_cpu() of the localization and confidence loss layers created earlier, again separately, and then uses the match indices and negative sample indices collected in Forward_cpu() to scatter the gradients. The localization loss only updates the predictions matched to a ground truth box; the confidence loss only updates the confidences of the selected positive and negative samples; everything else gets no update, since it never entered the loss computation (its gradient is simply 0). Which parameters get updated therefore differs from one input image to the next, somewhat like dropout.
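As a small worked example of the diff-scatter indexing (a hypothetical case with share_location: true, so label = 0): for a matched prior j, start_idx = loc_classes_*4*j + label*4 = 4j, so its four location gradients land at offsets 4j .. 4j+3 of that image's slice of bottom[0], while its num_classes_ confidence gradients land at offsets j*num_classes_ .. (j+1)*num_classes_ - 1 of bottom[1].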