【Faster RCNN】《Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks》
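NIPS, in full the Conference and Workshop on Neural Information Processing Systems, is an international conference on machine learning and computational neuroscience. It is held every December and is organized by the NIPS Foundation. NIPS is a top-tier machine learning conference, and the China Computer Federation (CCF) ranking of international academic conferences lists it as a class-A conference in artificial intelligence.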
1 Motivation
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.
The authors propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals (10 ms per image).
2 Innovation
RPN, trained end to end.
Instead of an image pyramid (Figure 1(a)) or a filter pyramid (Figure 1(b)), the paper uses anchors (Figure 1(c)), which can be described as a pyramid of regression references.
3 Advantages
- 5 fps (including all steps) on a GPU with VGG
- state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image
- the foundation of the 1st-place winning entries in the ILSVRC and COCO 2015 competitions (e.g., ResNet)
4 Methods
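Selective Search (SS) is slow, and although EdgeBoxes reaches 0.2 seconds per image, that is still about as long as the detection step itself. An obvious idea is to implement these algorithms on the GPU, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.
The related-work section first covers object proposals and then deep networks for object detection (mainly R-CNN, Fast R-CNN, and OverFeat). I find its summaries of R-CNN and OverFeat particularly incisive: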
R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression).
In the OverFeat method, a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object.
4.1 RPN
Note: RPN is class-agnostic (see 【R-FCN】《R-FCN: Object Detection via Region-based Fully Convolutional Networks》).
4.1.1 Anchors
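The RPN slides over the last shared convolutional layer: ZF has 5 shareable convolutional layers (256-d features) and VGG has 13 (512-d features).
In the 2k scores, the 2 stands for object or not object and k is the number of anchors at each 3×3 sliding-window position; in the 4k coordinates, the 4 stands for the bbox coordinates.
The effect of the ratios and scales is shown below: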
- Translation-Invariant anchors
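Compared with MultiBox, Faster R-CNN's anchors are computed convolutionally, so they are translation-invariant, and the output layer needs far fewer parameters: (4+2) × k × dimension, which is 6 × 9 × 512 = 27,648 (about 2.8 × 10^4) for k = 9 and VGG's 512-d features. Fewer parameters mean less risk of overfitting on small datasets such as PASCAL VOC.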
- Multi-Scale Anchors as Regression References
Unlike an image pyramid or a filter pyramid, the authors use a pyramid of anchors (different scales and ratios). This is more cost-efficient because it only relies on images and feature maps of a single scale and uses filters (sliding windows on the feature map) of a single size.
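To make the anchor pyramid concrete, here is a minimal NumPy sketch that builds k = 9 reference boxes and replicates them at every feature-map position. The scales (128, 256, 512), the three aspect ratios, and the stride of 16 follow the paper's defaults; the function names, the height/width ratio convention, and centering each anchor on its cell are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred at (0, 0) as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        area = float(s) ** 2
        for r in ratios:              # r = height / width (assumed convention)
            w = np.sqrt(area / r)     # keeps w * h equal to s ** 2
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(base, feat_h, feat_w, stride=16):
    """Replicate the base anchors at every sliding-window position of the feature map."""
    xs = (np.arange(feat_w) + 0.5) * stride   # cell centres in image coordinates
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None, :, :]).reshape(-1, 4)

base = make_base_anchors()
# A 600 x 1000 image with stride 16 gives roughly a 38 x 63 feature map.
all_anchors = shift_anchors(base, feat_h=38, feat_w=63)
print(base.shape, all_anchors.shape)
```

Because the same nine reference boxes are reused at every position, only one image scale and one filter size are needed, which is the cost argument made above.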
4.1.2 Loss Function
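Each anchor gets a binary label, object or not. An anchor is positive if its IoU with some ground-truth box is above 0.7, or if it has the highest IoU with a ground-truth box; it is negative if its IoU is below 0.3 for all ground-truth boxes. The remaining anchors do not contribute to training.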
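The labelling rule can be sketched in a few lines of NumPy. The thresholds are parameters whose defaults are the paper's 0.7 and 0.3; the function names and the use of -1 for ignored anchors are assumptions of this sketch.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes, both (x1, y1, x2, y2)."""
    a = anchors[:, None, :]
    g = gt_boxes[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thresh] = 0
    labels[max_iou > pos_thresh] = 1
    # The best-matching anchor(s) for each ground-truth box are positive as well,
    # provided they overlap the box at all.
    best_per_gt = iou.max(axis=0)
    is_best = (iou == best_per_gt[None, :]) & (best_per_gt[None, :] > 0)
    labels[is_best.any(axis=1)] = 1
    return labels

anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30], [52, 52, 62, 62]], dtype=float)
gt_boxes = np.array([[0, 0, 10, 10], [50, 50, 60, 60]], dtype=float)
labels = label_anchors(anchors, gt_boxes)
print(labels)  # [ 1 -1  0  1]
```

The last anchor overlaps its ground-truth box with an IoU of only about 0.47, yet it is labelled positive because it is the best match for that box; this is the case the max-IoU rule exists for.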
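The loss for an image is
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where
- $i$: index of an anchor in the mini-batch
- $p_i$: predicted probability of anchor $i$ being an object
- $p_i^*$: ground-truth label, 1 if the anchor is positive, 0 if the anchor is negative
- $t_i$: 4 parameterized coordinates of the predicted bounding box
- $t_i^*$: parameterized coordinates of the ground-truth box associated with a positive anchor
- $L_{cls}$: log loss
- $L_{reg}$: Smooth L1 loss; it is multiplied by $p_i^*$, which means the regression loss is activated only for positive anchors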
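The two terms are normalized by $N_{cls}$ and $N_{reg}$ (this normalization is not required and could be simplified), and $\lambda$ balances them:
- $N_{cls}$ is set to the mini-batch size, e.g., 256
- $N_{reg}$ is set to the number of anchor locations (~2400)
- $\lambda$ is set to 10, so the two loss terms are weighted roughly equally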
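As a rough illustration of how the two terms combine, the following NumPy sketch evaluates the loss for a handful of anchors. The normalizers 256 and 2400 and the weight 10 come from the text above; the function names and the array layout (one row per sampled anchor) are assumptions of this sketch.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p, p_star: (N,) objectness and 0/1 labels; t, t_star: (N, 4) box parameters."""
    eps = 1e-12                                    # avoids log(0)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)      # summed over the 4 coordinates
    # p_star switches the regression term off for negative anchors.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.zeros((2, 4))
t_star = np.array([[2.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
print(rpn_loss(p, p_star, t, t_star))
```

In this example the second anchor is negative, so its large regression error contributes nothing; only its classification term counts.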
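The box regression uses the following parameterization:
$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$
$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)$$
Here $x$ and $y$ are the center coordinates of the predicted box, and $w$ and $h$ are its width and height.
$x$, $x_a$, and $x^*$ refer to the predicted box, the anchor box, and the ground-truth box respectively; the same notation applies to $y$, $w$, and $h$.
This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box. Put simply, the loss compares (the offset of the predicted box from the anchor) with (the offset of the ground-truth box from the anchor).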
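The parameterization is easy to check numerically. The sketch below encodes a ground-truth box relative to an anchor and decodes it back; the helper names and the corner format (x1, y1, x2, y2) are assumptions made for illustration.

```python
import numpy as np

def to_cxcywh(boxes):
    """Convert (x1, y1, x2, y2) boxes to center x, center y, width, height."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return boxes[:, 0] + 0.5 * w, boxes[:, 1] + 0.5 * h, w, h

def encode(boxes, anchors):
    """Offsets (t_x, t_y, t_w, t_h) of boxes relative to anchors."""
    x, y, w, h = to_cxcywh(boxes)
    xa, ya, wa, ha = to_cxcywh(anchors)
    return np.stack([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)], axis=1)

def decode(t, anchors):
    """Invert encode: turn offsets back into (x1, y1, x2, y2) boxes."""
    xa, ya, wa, ha = to_cxcywh(anchors)
    x = t[:, 0] * wa + xa
    y = t[:, 1] * ha + ya
    w = np.exp(t[:, 2]) * wa
    h = np.exp(t[:, 3]) * ha
    return np.stack([x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h], axis=1)

anchor = np.array([[0.0, 0.0, 10.0, 10.0]])
gt = np.array([[2.0, 2.0, 12.0, 22.0]])
t_star = encode(gt, anchor)          # regression target for this anchor
print(t_star, decode(t_star, anchor))
```

The same `encode` call produces both the targets (ground truth versus anchor) and, applied to a predicted box, the prediction offsets that the Smooth L1 loss compares.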
Note: the bbox regression here differs from that of Fast R-CNN and SPPnet.
In Fast R-CNN and SPPnet, bbox regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes.
In Faster R-CNN, the bbox regression is per scale and per aspect ratio: to account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights.
4.1.3 Training RPNs
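Each mini-batch randomly samples 256 anchors from a single image. Sampling uniformly from all anchors would bias the loss towards negative samples, since they dominate, so positive and negative anchors are sampled at a ratio of up to 1:1. If an image has fewer than 128 positive anchors, the mini-batch is padded with negative ones.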
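A minimal sketch of this sampling rule, assuming the anchors have already been labelled 1 (positive), 0 (negative), or -1 (ignored); the function name and the label encoding are illustrative assumptions.

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick up to batch_size // 2 positives and fill the rest with negatives."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)   # pad with negatives if positives are scarce
    keep_pos = rng.permutation(pos)[:n_pos]
    keep_neg = rng.permutation(neg)[:n_neg]
    return keep_pos, keep_neg

# 10 positives, 1000 negatives, 50 ignored anchors
labels = np.concatenate([np.ones(10), np.zeros(1000), -np.ones(50)]).astype(int)
pos_idx, neg_idx = sample_minibatch(labels, rng=np.random.default_rng(0))
print(len(pos_idx), len(neg_idx))  # 10 246
```

With only 10 positives available, the remaining 246 slots are filled with negatives, which is the padding behaviour described above.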
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01.
4.2 Sharing Features for RPN and Fast R-CNN
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks.
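- Alternating training (the method used in the paper)
- Approximate joint training (gives slightly better results than alternating training)
- Non-approximate joint training
The authors use alternating training, specifically 4-step alternating training:
- Train RPN (initialized from ImageNet; RPN and Fast R-CNN do not share parameters yet)
- Train Fast R-CNN (initialized from ImageNet, using the proposals generated by the RPN in place of SS proposals; still not shared)
- Use the parameters trained in the previous step to initialize RPN, keep the shared convolutional layers fixed, and fine-tune only the layers unique to RPN (shared)
- Using the proposals from the re-trained RPN, fine-tune the unique layers of Fast R-CNN, i.e. the head (shared)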
A similar alternating training can be run for more iterations, but we have observed negligible improvements.
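The four steps can be summarized as a small schedule table. The sketch below is only a bookkeeping aid: the `Step` class and its field names are hypothetical, and no training is performed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    train: str            # which network is trained in this step
    init_from: str        # where the backbone weights come from
    proposals: str        # proposals fed to Fast R-CNN ("-" when training RPN)
    conv_trainable: bool  # are the backbone convolutional layers updated?

SCHEDULE = [
    Step("RPN",        "ImageNet",               "-",          True),
    Step("Fast R-CNN", "ImageNet",               "step-1 RPN", True),
    Step("RPN",        "step-2 Fast R-CNN",      "-",          False),
    Step("Fast R-CNN", "shared conv from step 3", "step-3 RPN", False),
]

for i, s in enumerate(SCHEDULE, start=1):
    state = "trainable, not shared" if s.conv_trainable else "frozen, shared"
    print(f"step {i}: train {s.train:<10} init={s.init_from:<24} conv layers {state}")
```

The point to notice is that the convolutional layers stop changing after step 2, which is what lets the two networks share them in steps 3 and 4.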
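4.3 Implementation Details
- Training and testing both use a single scale: images are rescaled so that the shorter side is s = 600 pixels
- Image pyramid: trades accuracy against speed (not used)
- Anchors: 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1; see Table 1, where the red text marks the default anchors (2:1) and the listed values are the results after bbox regression
- During training, anchors that cross image boundaries are discarded; at test time, boxes that cross the boundary are clipped to the image
- RPN proposals overlap heavily, so non-maximum suppression (NMS) is applied with an IoU threshold of 0.7. NMS does not harm the ultimate detection accuracy, but it reduces the number of proposals. The paper trains with the top-2000 proposals. Why is the NMS overlap threshold set to 0.7?
Looking at the figure above, there are three cases: 1:1, 1:2, and 2:1. If the ground truth is the same size as the 1:1 anchor, then its IoU with the 1:2 and 2:1 anchors is identical, so the same object would produce two kinds of feature maps, which hinders learning. Setting the IoU threshold to 0.7 mitigates this as far as possible (this is only one possible explanation).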
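For reference, greedy NMS with an IoU threshold of 0.7 followed by a top-N cut can be sketched as follows. This is a plain NumPy illustration with assumed names (`nms`, `top_n`), not the optimized routine a real detector would use.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_n=None):
    """Greedy NMS over (N, 4) boxes in (x1, y1, x2, y2); returns kept indices, best first."""
    order = np.argsort(-scores)            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if top_n is not None and len(keep) >= top_n:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]    # drop boxes that overlap the kept one too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [0, 0, 10, 5], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.6])
print(nms(boxes, scores))  # [0, 2, 3]
```

The second box overlaps the first with an IoU of about 0.82 and is suppressed, while the third (IoU 0.5) survives; a lower threshold would remove it too.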
5 Experiments
5.1 Ablation Experiments
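- Rows 1, 2, and 3: row 3 is best; the fewer proposals also reduce the region-wise fully-connected layers' cost (see Table 5)
- Rows 3 vs. 4: sharing features is better
- Rows 3 vs. 6: RPN + Fast R-CNN beats SS + Fast R-CNN (the proposals used at training and test time differ)
- Rows 4 vs. 8: NMS has little effect
- Rows 7 and 11 are close, while the gap between rows 9 and 11 is clear: ranking by cls score matters
- Rows 6 vs. 12: the reg layer matters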
5.2 Results on VOC 07/12
5.3 Speed (ms)
5.4 recall-to-IoU
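Dropping the number of RPN proposals from 2000 to 300 gives almost the same result.
5.5 Comparison with one-stage detection (OverFeat)
5.6 Results on COCO
Replacing VGG with ResNet and using an ensemble won the COCO 2015 object detection competition.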
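Note: the reshape exists for the softmax operation, which requires the class dimension to come first (right after the batch dimension). If there are 2 classes (object or not), the prediction is class-agnostic; if the classes are, for example, the 20+1 classes of the VOC dataset, it is class-specific.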
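The reshape can be illustrated with NumPy. This sketch assumes a Caffe-style (N, C, H, W) layout in which the 2k score channels hold the k background scores followed by the k object scores; that ordering, the tensor sizes, and the variable names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, k, H, W = 1, 9, 38, 63
cls_score = np.random.default_rng(0).normal(size=(N, 2 * k, H, W))

# Fold the 2k channels into 2 classes so softmax runs over axis 1 (object vs. not).
reshaped = cls_score.reshape(N, 2, k * H, W)
prob = softmax(reshaped, axis=1)

# Restore the original layout; channels k..2k-1 now hold the objectness probabilities.
prob = prob.reshape(N, 2 * k, H, W)
fg_prob = prob[:, k:, :, :]
print(fg_prob.shape)
```

Each anchor's two scores sit k channels apart, so after the reshape they land at the same position along the class axis and are normalized as a pair.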
