
Translation of “R-FCN: Object Detection via Region-based Fully Convolutional Networks”

Abstract

  We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170 ms per image, 2.5-20× faster than the Faster R-CNN counterpart. Code is made publicly available at: r-fcn-github.

1 Introduction

  A prevalent family [8, 6, 18] of deep networks for object detection can be divided into two subnetworks by the Region-of-Interest (RoI) pooling layer [6]: (i) a shared, “fully convolutional” subnetwork independent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This decomposition [8] historically resulted from the pioneering classification architectures, such as AlexNet [10] and VGG Nets [23], that consist of two subnetworks by design: a convolutional subnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus the (last) spatial pooling layer in image classification networks is naturally turned into the RoI pooling layer in object detection networks [8, 6, 18].

  But recent state-of-the-art image classification networks such as Residual Nets (ResNets) [9] and GoogLeNets [24, 26] are by design fully convolutional (only the last layer is fully-connected, and it is removed and replaced when fine-tuning for object detection). By analogy, it appears natural to use all convolutional layers to construct the shared, convolutional subnetwork in the object detection architecture, leaving the RoI-wise subnetwork no hidden layer. However, as empirically investigated in this work, this naïve solution turns out to have considerably inferior detection accuracy that does not match the network’s superior classification accuracy. To remedy this issue, in the ResNet paper [9] the RoI pooling layer of the Faster R-CNN detector [18] is unnaturally inserted between two sets of convolutional layers; this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation.

  We argue that the aforementioned unnatural design is caused by a dilemma of increasing translation invariance for image classification vs. respecting translation variance for object detection. On one hand, the image-level classification task favors translation invariance: the shift of an object inside an image should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation-invariant as possible are preferable, as evidenced by the leading results on ImageNet classification [9, 24, 26]. On the other hand, the object detection task needs localization representations that are translation-variant to an extent. For example, translation of an object inside a candidate box should produce meaningful responses for describing how well the candidate box overlaps the object. We hypothesize that deeper convolutional layers in an image classification network are less sensitive to translation. To address this dilemma, the ResNet paper’s detection pipeline [9] inserts the RoI pooling layer into the convolutions: this region-specific operation breaks down translation invariance, and the post-RoI convolutional layers are no longer translation-invariant when evaluated across different regions. However, this design sacrifices training and testing efficiency, since it introduces a considerable number of region-wise layers (Table 1).

Table 1

  In this paper, we develop a framework called Region-based Fully Convolutional Network (R-FCN) for object detection. Our network consists of shared, fully convolutional architectures, as is the case of FCN [15]. To incorporate translation variance into FCN, we construct a set of position-sensitive score maps by using a bank of specialized convolutional layers as the FCN output. Each of these score maps encodes the position information with respect to a relative spatial position (e.g., “to the left of an object”). On top of this FCN, we append a position-sensitive RoI pooling layer that shepherds information from these score maps, with no weight (convolutional/fc) layers following. The entire architecture is learned end-to-end. All learnable layers are convolutional and shared on the entire image, yet they encode the spatial information required for object detection. Figure 1 illustrates the key idea and Table 1 compares the methodologies among region-based detectors.

Figure 1

  Using the 101-layer Residual Net (ResNet-101) [9] as the backbone, our R-FCN yields competitive results of 83.6% mAP on the PASCAL VOC 2007 set and 82.0% on the 2012 set. Meanwhile, our results are achieved at a test-time speed of 170 ms per image using ResNet-101, which is 2.5× to 20× faster than the Faster R-CNN + ResNet-101 counterpart in [9]. These experiments demonstrate that our method manages to address the dilemma between invariance/variance on translation, and that fully convolutional image-level classifiers such as ResNets can be effectively converted to fully convolutional object detectors. Code is made publicly available at: r-fcn-github.

2 Our approach

2.1 Overview

  Following R-CNN [7], we adopt the popular two-stage object detection strategy [7, 8, 6, 18, 1, 22] that consists of: (i) region proposal, and (ii) region classification. Although methods that do not rely on region proposal do exist (e.g., [17, 14]), region-based systems still possess leading accuracy on several benchmarks [5, 13, 20]. We extract candidate regions by the Region Proposal Network (RPN) [18], which is a fully convolutional architecture in itself. Following [18], we share the features between RPN and R-FCN. Figure 2 shows an overview of the system.

Figure 2

  Given the proposal regions (RoIs), the R-FCN architecture is designed to classify the RoIs into object categories and background. In R-FCN, all learnable weight layers are convolutional and are computed on the entire image. The last convolutional layer produces a bank of $k^2$ position-sensitive score maps for each category, and thus has a $k^2(C + 1)$-channel output layer with C object categories (+1 for background). The bank of $k^2$ score maps corresponds to a k × k spatial grid describing relative positions. For example, with k × k = 3 × 3, the 9 score maps encode the cases of {top-left, top-center, top-right, …, bottom-right} of an object category. R-FCN ends with a position-sensitive RoI pooling layer. This layer aggregates the outputs of the last convolutional layer and generates scores for each RoI. Unlike [8, 6], our position-sensitive RoI layer conducts selective pooling, and each of the k × k bins aggregates responses from only one score map out of the bank of k × k score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps. Figure 1 illustrates this idea. Figures 3 and 4 visualize an example. The details are introduced as follows.
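To make the score-map layout concrete, here is a toy sketch in NumPy; the sizes and the channel ordering are illustrative assumptions, not taken from the paper:

```python
import numpy as np

C, k = 20, 3                              # e.g. PASCAL VOC: 20 categories, a 3 x 3 grid
H, W = 40, 60                             # illustrative spatial size of the feature map

# Output of the last conv layer: a bank of k^2 * (C + 1) position-sensitive score maps.
bank = np.zeros((k * k * (C + 1), H, W))

# Viewed as (k*k, C+1, H, W), axis 0 enumerates the relative positions
# {top-left, top-center, ..., bottom-right}; each slice holds one full set
# of (C + 1) per-category score maps specialized for that position.
grid = bank.reshape(k * k, C + 1, H, W)
print(grid.shape)                         # (9, 21, 40, 60)
```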

Figures 3 and 4

2.2 Backbone architecture

  The incarnation of R-FCN in this paper is based on ResNet-101 [9], though other networks [10, 23] are applicable. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc layer and only use the convolutional layers to compute feature maps. We use the ResNet-101 released by the authors of [9], pre-trained on ImageNet [20]. The last convolutional block in ResNet-101 is 2048-d, and we attach a randomly initialized 1024-d 1×1 convolutional layer for reducing the dimension (to be precise, this increases the depth in Table 1 by 1). Then we apply the $k^2(C + 1)$-channel convolutional layer to generate the score maps, as introduced next.
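As a rough sketch of this construction (assuming PyTorch/torchvision; the variable names and the input size are illustrative, and the paper's stride/dilation details are omitted here):

```python
import torch
import torch.nn as nn
import torchvision

C, k = 20, 3

# ResNet-101 pre-trained on ImageNet; keep only the convolutional layers by
# dropping the global average pooling layer and the 1000-class fc layer.
resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
backbone = nn.Sequential(*list(resnet.children())[:-2])   # 2048-d output

# Randomly initialized 1024-d 1x1 conv for dimension reduction (the extra
# layer that increases the depth in Table 1 by 1).
reduce_dim = nn.Conv2d(2048, 1024, kernel_size=1)
# The k^2 * (C + 1)-channel conv that generates the position-sensitive score maps.
score_conv = nn.Conv2d(1024, k * k * (C + 1), kernel_size=1)

image = torch.randn(1, 3, 600, 800)                        # a dummy input image
score_maps = score_conv(reduce_dim(backbone(image)))
print(score_maps.shape)                                    # (1, 189, 19, 25)
```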

2.3 Position-sensitive score maps & position-sensitive RoI pooling

  To explicitly encode position information into each RoI, we divide each RoI rectangle into k × k bins by a regular grid. For an RoI rectangle of size w × h, a bin is of size ≈ $\frac{w}{k} \times \frac{h}{k}$ [8, 6]. In our method, the last convolutional layer is constructed to produce $k^2$ score maps for each category. Inside the (i, j)-th bin (0 ≤ i, j ≤ k − 1), we define a position-sensitive RoI pooling operation that pools only over the (i, j)-th score map:

$$r_c(i, j \mid \Theta) = \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}(x + x_0,\ y + y_0 \mid \Theta) / n \qquad (1)$$

  Here $r_c(i, j)$ is the pooled response in the (i, j)-th bin for the c-th category, $z_{i, j, c}$ is one score map out of the $k^2(C + 1)$ score maps, $(x_0, y_0)$ denotes the top-left corner of an RoI, $n$ is the number of pixels in the bin, and $\Theta$ denotes all learnable parameters of the network. The (i, j)-th bin spans $\lfloor i\frac{w}{k} \rfloor \le x < \lceil (i + 1)\frac{w}{k} \rceil$ and $\lfloor j\frac{h}{k} \rfloor \le y < \lceil (j + 1)\frac{h}{k} \rceil$. The operation of Eqn.(1) is illustrated in Figure 1, where a color represents a pair of (i, j). Eqn.(1) performs average pooling (as we use throughout this paper), but max pooling can be conducted as well.
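As a minimal sketch of Eqn.(1), the following NumPy function performs position-sensitive average pooling over a single RoI; the channel layout and the integer RoI coordinates are simplifying assumptions for illustration, not the paper's exact interface:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, C):
    """score_maps: array of shape (k*k*(C+1), H, W); channel (i*k + j)*(C+1) + c
    is assumed to hold the (i, j)-th map z_{i,j,c} of category c.
    roi: (x0, y0, w, h) in score-map coordinates (integers, for simplicity)."""
    x0, y0, w, h = roi
    r = np.zeros((C + 1, k, k))            # pooled responses r_c(i, j)
    for i in range(k):                     # bin index along x
        for j in range(k):                 # bin index along y
            # Bin (i, j) spans floor(i*w/k) <= x < ceil((i+1)*w/k), likewise in y.
            xs = slice(x0 + int(np.floor(i * w / k)), x0 + int(np.ceil((i + 1) * w / k)))
            ys = slice(y0 + int(np.floor(j * h / k)), y0 + int(np.ceil((j + 1) * h / k)))
            for c in range(C + 1):
                # Selective pooling: bin (i, j) reads only its own score map.
                z = score_maps[(i * k + j) * (C + 1) + c]
                r[c, i, j] = z[ys, xs].mean()   # average over the n pixels in the bin
    return r
```

For example, with the toy `bank` built earlier, `ps_roi_pool(bank, (10, 10, 24, 18), k, C)` returns a (C + 1, k, k) array of pooled responses for that RoI.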

  The $k^2$ position-sensitive scores then vote on the RoI. In this paper we simply vote by averaging the scores, producing a $(C + 1)$-dimensional vector for each RoI: $r_c(\Theta) = \sum_{(i, j)} r_c(i, j \mid \Theta)$. Then we compute the softmax responses across categories: $s_c(\Theta) = e^{r_c(\Theta)} / \sum_{c' = 0}^{C} e^{r_{c'}(\Theta)}$. They are used for evaluating the cross-entropy loss during training and for ranking the RoIs during inference.
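Continuing the sketch above, the vote and the softmax can be written as follows (the paper's text says "averaging" while the formula writes a plain sum; the names are again illustrative):

```python
import numpy as np

def vote_and_softmax(r):
    """r: (C+1, k, k) pooled responses, e.g. from ps_roi_pool above."""
    r_c = r.sum(axis=(1, 2))       # r_c(Θ) = Σ_{(i,j)} r_c(i, j | Θ); use .mean(...) to average
    e = np.exp(r_c - r_c.max())    # numerically stabilized exponentials
    return e / e.sum()             # s_c(Θ) = e^{r_c(Θ)} / Σ_{c'} e^{r_{c'}(Θ)}
```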