Paper Reading: Object Detection

阿新 • Published: 2018-12-10

Object Detection

Rich feature hierarchies for accurate object detection and semantic segmentation

CVPR’14

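Problem

The best-performing earlier methods were complex
Localizing objects with a sliding window is a challenge

Method

Testing:
- Run a region proposal method on the input image to get about 2000 candidate regions
- Run the CNN once on each region to extract a fixed-length feature vector
- Score each feature vector with an SVM
Training:
- Supervised pre-training: pre-train on the ILSVRC 2012 dataset
- Domain-specific fine-tuning: fine-tune the network for the target detection domain
- Object category classifiers: train one linear SVM per class, using the standard hard negative mining method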
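
The test-time flow above can be sketched roughly as follows. This is only an illustration: `warp`, `extract_features`, `svm_scores` and `detect` are hypothetical names, the "CNN" is replaced by a stand-in that flattens the warped patch, and the region proposals are assumed to be given.

```python
def warp(image, box, size):
    """Nearest-neighbor resize of the region box=(x0, y0, x1, y1) to size x size."""
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    return [[image[y0 + (i * h) // size][x0 + (j * w) // size]
             for j in range(size)] for i in range(size)]


def extract_features(patch):
    """Stand-in for the CNN (assumption): flatten the patch into a fixed-length vector."""
    return [v for row in patch for v in row]


def svm_scores(feature, weights, biases):
    """One linear SVM per class: score_c = w_c . f + b_c."""
    return [sum(w * f for w, f in zip(w_c, feature)) + b_c
            for w_c, b_c in zip(weights, biases)]


def detect(image, proposals, weights, biases, size=2):
    """Warp every proposal to the same size, extract features, score with the SVMs."""
    results = []
    for box in proposals:
        feature = extract_features(warp(image, box, size))
        results.append((box, svm_scores(feature, weights, biases)))
    return results
```

Because every proposal is warped to the same size before feature extraction, proposals of any shape yield feature vectors of the same length, which is what lets one set of per-class SVM weights score all of them.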

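References

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

ECCV’14

Problem

Previous methods require a fixed-size input image, which is obtained by cropping or warping
Cropping loses content and warping introduces unnecessary distortion

Method

Proposes spatial pyramid pooling (SPP)
An SPP layer is added after the last convolutional layer
The feature map from the convolutional layers is divided into 1x1, 2x2, 4x4, … bins, and max pooling is applied within each bin, so the output length is fixed regardless of the input size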
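
A minimal sketch of the pooling step described above, for a single channel stored as a 2-D list. The function name `spp` and the floor/ceil bin boundaries are assumptions made for illustration, not code from the paper.

```python
import math


def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool one channel over an n x n grid for each pyramid level n."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end, so the bins cover the whole map.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled
```

With levels 1, 2 and 4 the output always has 1 + 4 + 16 = 21 values per channel, whether the feature map is 5x7 or 8x4, so the fully connected layers that follow see a fixed-length input.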

Takeaways

The idea behind the Bag-of-Words (BoW) approach
a global average pooling is used to reduce the model size and also reduce overfitting;
a global average pooling is used on the testing stage after all fc layers to improve accuracy
a global max pooling is used for weakly supervised object recognition
The global pooling operation corresponds to the traditional Bag-of-Words method

References

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

NIPS’15

Problem

Previous methods: SPPnet, Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck
Region proposal methods typically rely on inexpensive features and economical inference schemes (for example, Selective Search)

Method

Proposes the Region Proposal Network (RPN), built on an anchor box mechanism
On top of these conv features, we construct RPNs by adding two additional conv layers:
- one that encodes each conv map position into a short (e.g., 256-d) feature vector
- and a second that, at each conv map position, outputs an objectness score and regressed bounds for k region proposals relative to various scales and aspect ratios at that location (k = 9 is a typical value).
Training: we propose a simple training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed
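
To make k = 9 concrete, here is a rough sketch of anchor generation with 3 scales and 3 aspect ratios per feature-map position. The stride of 16 and the specific scale and ratio values are assumptions for illustration; `make_anchors` is a hypothetical helper, not the authors' code.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) anchors, k = len(scales) * len(ratios) per position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature-map cell in input-image pixels.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while setting height / width = r.
                    w, h = s / math.sqrt(r), s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

For each of these anchors the second conv layer predicts one objectness score and four box offsets, so a 2x3 feature map already gives 2 * 3 * 9 = 54 candidate boxes.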

References

SSD: Single Shot MultiBox Detector

ECCV’16

Problem

these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.

Method

…that does not resample pixels or features for bounding box hypotheses
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps
To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio
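
As a rough illustration of "a fixed set of default bounding boxes" over feature maps of different scales, the sketch below spaces the box scale linearly across the feature maps and counts the boxes. The helper names are hypothetical, and the values s_min = 0.2, s_max = 0.9 and the feature-map sizes used in the note afterwards are assumptions about a 300x300 configuration, not something stated in these notes.

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale for each of m feature maps, spaced linearly from s_min to s_max."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def count_default_boxes(feature_map_sizes, boxes_per_location):
    """Total number of default boxes over all (square) feature maps."""
    return sum(n * n * b for n, b in zip(feature_map_sizes, boxes_per_location))
```

Under those assumptions, `count_default_boxes((38, 19, 10, 5, 3, 1), (4, 6, 6, 6, 4, 4))` gives 8732 boxes, each of which gets its own category scores and box offsets from a small convolutional filter.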

References

Feature Pyramid Networks for Object Detection

CVPR’17

Problem

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive
In other words, feature pyramids are common in traditional computer vision algorithms, but deep learning methods try to avoid multi-scale processing because it multiplies the computational cost
The advantage of a pyramid structure is that the features at every level are semantically strong, including the high-resolution lower levels
In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost

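Method

The authors observe that each level of a convolutional network corresponds to features at one scale
FPN (Feature Pyramid Networks) consists of a bottom-up pathway, a top-down pathway, and lateral connections
- Bottom-up pathway
  - The authors define layers whose feature maps have the same size as belonging to the same stage, and use only the feature map of the last layer of each stage for the subsequent operations. This is easy to understand: within a stage all feature maps have the same size, and the deepest one has the strongest features
  - Specifically, for ResNets, the authors use the feature activations output by the last residual block of each stage
- Top-down pathway and lateral connections
  - The higher-level feature map is upsampled (nearest-neighbor upsampling) and then merged with the feature map of the preceding level through a lateral connection; see the original paper (3) for details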
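
The top-down pathway can be sketched on single-channel maps as follows. This is a simplified illustration: the function names are hypothetical, each level is assumed to be exactly half the size of the one below it, and the convolutions applied on the lateral connections and after merging are left out.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(top, lateral):
    """Element-wise sum of the upsampled higher-level map and the lateral map."""
    up = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down(bottom_up_maps):
    """bottom_up_maps is ordered fine to coarse; returns merged maps in the same order."""
    merged = [bottom_up_maps[-1]]
    for lateral in reversed(bottom_up_maps[:-1]):
        merged.append(merge(merged[-1], lateral))
    return merged[::-1]
```

Each merged map keeps the resolution of its bottom-up level while also carrying information passed down from the coarser, more semantic levels.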

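Takeaways

The authors build the feature pyramid from the structure of the convolutional network itself
The algorithm uses both the high resolution of low-level features and the strong semantics of high-level features, and makes predictions by fusing features from these different levels
Predictions are made independently on each fused feature level, which differs from conventional feature fusion

References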

R-FCN: Object Detection via Region-based Fully Convolutional Networks

NIPS’16

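Problem

On the one hand, image-level classification favors translation invariance (shifting an object within an image should not change the prediction)
On the other hand, object detection has to localize the object, so it needs translation variance

Method

To incorporate translation variance into the FCN, we construct a set of position-sensitive score maps that encode position information with respect to a relative spatial position.
On top of the FCN, we append a position-sensitive RoI pooling layer that gathers information from these score maps, with no weight layers following it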
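
A rough sketch of position-sensitive RoI pooling on plain Python lists. The data layout (`score_maps[c][i * k + j]` is the 2-D map for class c and bin (i, j)) and the name `ps_roi_pool` are assumptions for illustration; RoI coordinates are in feature-map cells and end-exclusive.

```python
import math


def ps_roi_pool(score_maps, roi, k):
    """Return one score per class for roi = (x0, y0, x1, y1), using a k x k grid of bins."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    class_scores = []
    for class_maps in score_maps:
        bin_means = []
        for i in range(k):
            r0, r1 = y0 + (i * h) // k, y0 + math.ceil((i + 1) * h / k)
            for j in range(k):
                c0, c1 = x0 + (j * w) // k, x0 + math.ceil((j + 1) * w / k)
                # Bin (i, j) reads only from its own dedicated score map.
                fmap = class_maps[i * k + j]
                cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
                bin_means.append(sum(cells) / len(cells))
        # Voting: average the k * k position-sensitive responses.
        class_scores.append(sum(bin_means) / len(bin_means))
    return class_scores
```

The pooling itself has no learned weights. With k = 2 and score maps that each respond only in their own quadrant of a 4x4 map, an RoI covering the whole map scores 1.0, while an RoI shifted onto the top-left quadrant scores 0.25, which is the translation-variant behavior the method is after.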

Takeaways

Spatial position can be determined by voting over the score maps

References

PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

NIPS’16

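Problem

Highly accurate detection algorithms have a heavy computational cost
This paper proposes PVANET, a lightweight feature-extraction network architecture for object detection

Method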

The key design principle is “less channels with more layers”
our networks adopted some recent building blocks
Concatenated rectified linear unit (C.ReLU) is applied to the early stage of our CNNs (i.e., first several layers from the network input) to reduce the number of computations by half without losing accuracy.
Inception [3] is applied to the remaining of our feature generation sub-network
We adopted the idea of multi-scale representation like HyperNet [4] that combines several intermediate outputs
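
The core of C.ReLU can be sketched as below: the convolution only needs to produce half of the channels, and the other half comes from concatenating the negated response. This is a simplified illustration with a hypothetical function name; it shows only the concatenation step, not the rest of the block used in the paper.

```python
def crelu(channels):
    """Concatenate ReLU(x) and ReLU(-x) along the channel axis, doubling the channel count."""
    positive = [[max(v, 0.0) for v in ch] for ch in channels]
    negative = [[max(-v, 0.0) for v in ch] for ch in channels]
    return positive + negative
```

For example, `crelu([[1.0, -2.0], [-3.0, 4.0]])` turns 2 channels into 4, which is where the saving in convolution cost for the early layers comes from.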

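Takeaways

C.ReLU can speed up the early layers by about 2x without losing accuracy
Inception is among the most cost-effective building blocks for capturing both small and large objects in the input image
Multi-scale representation takes several levels of detail and non-linearity into account at the same time

References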

DSSD : Deconvolutional Single Shot Detector

CVPR’17

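Problem

Further improve accuracy

Method

Replace the base network of SSD, VGG, with ResNet-101
Then add a deconvolution module and a prediction module
Skip connections

Takeaways

A better feature-extraction network and added context information help improve accuracy

References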

DSOD: Learning Deeply Supervised Object Detectors from Scratch

ICCV’17

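Problem

Why train a detection model from scratch instead of fine-tuning a pre-trained model?

Pre-trained models are usually trained on image classification datasets such as ImageNet, and may not transfer to the data of the detection task (medical images, for example)
The architecture of a pre-trained model is fixed, so modifying it is cumbersome
The training objective of a pre-trained classification network usually differs from the detection objective, so a pre-trained model is not necessarily the best choice for detection

Method

The plain connection on the left shows the feature fusion used in SSD; for a 300*300 input image, features at 6 different scales are fused
Each dashed rectangle contains a 1*1 convolution and a 3*3 convolution. This is a bottleneck: the 1*1 convolution mainly reduces the number of channels and thereby the computation of the 3*3 convolution
The dense connection on the right shows the feature fusion introduced in this paper, which borrows the idea of DenseNet
The left dashed rectangle of the dense connection closely resembles the right dashed rectangle of the plain connection. The difference is the number of channels: the 3*3 convolution in the dense connection has half as many channels as the corresponding 3*3 convolution in the plain connection. This is mainly because in the plain connection the input of each bottleneck is simply the output of the previous bottleneck, whereas in the dense connection the input of each bottleneck is the concatenation of the outputs of all preceding bottlenecks
The right rectangle of the dense connection is a down-sampling block, containing a 2x2 max pooling (for down-sampling) and a 1x1 convolution (for reducing the number of channels). The authors also note that down-sampling first and applying the 1x1 convolution afterwards mainly reduces computation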
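
The two cost arguments above (the 1*1 bottleneck, and pooling before the 1x1 convolution) can be checked with simple multiply-accumulate counts. The helpers and the channel numbers in the note afterwards are hypothetical and chosen only for illustration; pooling cost and biases are ignored.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k convolution with stride 1 and 'same' padding."""
    return h * w * c_in * c_out * k * k


def bottleneck_macs(h, w, c_in, c_mid, c_out):
    """1x1 convolution down to c_mid channels, then a 3x3 convolution to c_out."""
    return conv_macs(h, w, c_in, c_mid, 1) + conv_macs(h, w, c_mid, c_out, 3)


def conv_then_pool_macs(h, w, c_in, c_out):
    """1x1 convolution at full resolution, 2x2 pooling afterwards."""
    return conv_macs(h, w, c_in, c_out, 1)


def pool_then_conv_macs(h, w, c_in, c_out):
    """2x2 pooling first, so the 1x1 convolution sees a quarter of the positions."""
    return conv_macs(h // 2, w // 2, c_in, c_out, 1)
```

On a 10x10 map with 512 input and output channels, a direct 3x3 convolution costs 235,929,600 multiply-accumulates, while a bottleneck with 128 intermediate channels costs 65,536,000. Pooling before the 1x1 convolution cuts that convolution's cost by a factor of 4.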

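References

Training Region-based Object Detectors with Online Hard Example Mining

CVPR’16

Problem

The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune.

Method

The paper proposes online hard example mining (OHEM), an efficient algorithm for training region-based ConvNet detectors. It suppresses easy examples and some examples that occur in small numbers, which makes training more efficient
The method builds on the well-known bootstrapping technique (widely used with SVMs) and modifies SGD so that several heuristics and hyperparameters of region-based ConvNet training can be removed, while giving more accurate and stable detection results
- In OHEM, given an image and its selective search RoIs, the conv feature map is computed as usual. In the green part (a), a read-only RoI network runs a forward pass over the feature map and all RoIs, and the Hard RoI module then uses the losses of these RoIs to select B examples. In the red part (b), the selected examples (hard examples) are fed into the RoI network for a forward and backward pass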
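
The selection step of the Hard RoI module can be sketched as picking the B RoIs with the largest loss. This is a minimal illustration with a hypothetical function name; it assumes the per-RoI losses from the read-only forward pass are already available and leaves out any de-duplication of heavily overlapping RoIs.

```python
def select_hard_rois(losses, batch_size):
    """Return the indices of the batch_size RoIs with the highest loss, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:batch_size]
```

For example, `select_hard_rois([0.1, 2.3, 0.05, 1.7, 0.9], 2)` returns `[1, 3]`; only those RoIs would take part in the forward and backward pass of part (b).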

References

Focal Loss for Dense Object Detection

ICCV’17

Problem

one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far
In other words, one-stage detectors are fast and simple, but not as accurate as two-stage detectors
The motivation for focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without giving up its speed

Method

Proposes focal loss
- $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
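- This loss is a modification of the standard cross-entropy loss
- By down-weighting easy examples, it makes the model focus on hard examples during training
- When an example is misclassified, $p_t$ is small, the modulating factor $(1 - p_t)^{\gamma}$ is close to 1, and the loss is unaffected; as $p_t \to 1$, the factor goes to 0 and the loss of well-classified examples is down-weighted
- The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. Increasing $\gamma$ strengthens the effect of the modulating factor; in the experiments $\gamma = 2$ works best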
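
A small sketch of the binary form of the loss for a single prediction, to make the effect of the modulating factor concrete. The function name and the default alpha = 0.25 are assumptions for illustration; gamma = 2 follows the note above.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 2, an easy positive with p = 0.9 has its cross-entropy term scaled by (1 - 0.9)^2 = 0.01, while a hard positive with p = 0.1 is scaled by 0.81, so hard examples dominate the total loss. Setting gamma = 0 recovers the alpha-weighted cross-entropy.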

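References

Soft-NMS – Improving Object Detection With One Line of Code

ICCV’17

Problem

The red box and the green box are the current detections, with scores 0.95 and 0.80. With traditional NMS, the highest-scoring red box is selected first, and the green box is then removed because it overlaps the red box too much
In addition, the NMS threshold is hard to choose: set too low, it leads to the situation just described (the green box is removed because of its large overlap with the red box); set too high, it tends to increase false positives

Method

Idea: instead of bluntly removing every box whose IoU exceeds the threshold, lower its confidence score

A confidence threshold is then specified, and the detections whose final score exceeds it are kept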
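
A minimal sketch of this idea using a Gaussian score decay. The helper names, sigma = 0.5 and the final score threshold of 0.001 are assumptions for illustration; boxes are (x0, y0, x1, y1) tuples.

```python
import math


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS. Returns (box, decayed_score) pairs, highest score first."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        kept.append((box, score))
        # Decay the scores of overlapping boxes instead of deleting them.
        candidates = [(b, s * math.exp(-iou(box, b) ** 2 / sigma))
                      for b, s in candidates]
        # Drop only boxes whose decayed score falls below the confidence threshold.
        candidates = [(b, s) for b, s in candidates if s > score_threshold]
    return kept
```

For a red box (0, 0, 10, 10) with score 0.95 and a green box (2, 0, 12, 10) with score 0.80, the IoU is 2/3. Hard NMS with a threshold of 0.5 would delete the green box, while this sketch keeps it with a reduced score.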

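References

Light-Head R-CNN: In Defense of Two-Stage Object Detector

arXiv:1711

Problem

Two-stage detectors are accurate but slow
Most of the cost is in the proposal-based recognition stage, which the authors call the “head”

Method

Slim down the head
- Large separable convolution + thin feature map to speed up the algorithm.
- Use an FC layer instead of global average pooling to reduce the loss of spatial information and improve accuracy.
Add other tricks, such as PSRoI with RoIAlign, multi-scale training, and OHEM, to further improve accuracy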
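
To see why a large separable convolution with a thin output is cheap, the sketch below compares weight counts. It assumes the separable convolution is built from two parallel branches (k x 1 followed by 1 x k, and 1 x k followed by k x 1) whose outputs are summed; both this structure and the numbers in the note afterwards are assumptions for illustration, and biases are ignored.

```python
def dense_conv_params(k, c_in, c_out):
    """Weights of a single k x k convolution."""
    return k * k * c_in * c_out


def large_separable_conv_params(k, c_in, c_mid, c_out):
    """Weights of two parallel branches of two 1-D convolutions each (assumed structure)."""
    one_branch = k * c_in * c_mid + k * c_mid * c_out
    return 2 * one_branch
```

With hypothetical values k = 15, 2048 input channels, 64 intermediate channels and 490 output channels, the dense convolution needs 225,792,000 weights and the separable version 4,872,960, so a large kernel becomes affordable when the intermediate and output feature maps are thin.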

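Takeaways

Analyze a problem before trying to solve it, so you know where effort will actually pay off
This paper is more engineering-oriented and practical

References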
