An Analysis of Scale Invariance in Object Detection – SNIP: Paper Walkthrough

阿新 • Published: 2018-10-18

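Preface

I originally planned to write my usual overview, but then I came across an excellent and detailed write-up, so I have copied it here. I only summarize the paper briefly up front and skip the details, since the quoted article covers them thoroughly.

Paper Overview

Quoted Article

The following content is from: http://lowrank.science/SNIP/

This post records some notes on the following CVPR 2018 Oral paper.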

Singh B, Davis L S. An Analysis of Scale Invariance in Object Detection–SNIP[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3578-3587.
Paper link: https://arxiv.org/abs/1711.08189
Code link: https://github.com/bharatsingh430/snip

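Argument

The paper opens with a fact: image classification has already reached super-human level performance, but object detection is still very far from that. So the authors ask: Why is object detection so much harder than image classification?

The authors' explanation is large scale variation across object instances. This scale variation exists not only within the target dataset itself, but also between the pre-training dataset and the target dataset.

For the extreme scale variation within the target dataset, the authors give the figure below. The vertical axis, CDF, is the cumulative distribution function; the relative scale is presumably the ratio of the object's width or height to the image's width or height. The figure shows that a large share of COCO objects (about half) lie below a relative scale of 0.1, i.e. cover less than 1% of the image area. There are really two problems here:
- First, the objects themselves are small: how can small objects get a good feature representation, so that the CNN can detect them at all?
- Second, because most COCO objects are small, the ratio between their scales is actually large: an object at scale 0.0001 and one at scale 0.1 differ by a factor of 1000. Since half of all objects lie below 0.1, the extremely small scales also contain many objects and cannot be ignored. In other words, it is the presence of a large, non-negligible number of very small objects that makes scale variation in detection datasets so large. For COCO, the CNN must therefore classify well across objects ranging from a tiny scale to a very large one (with a large ratio between the two, e.g. 0.0001 vs 0.9). That is, it must be robust to extreme scale variation, i.e. scale-invariant.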

[Figure: CDF of object instances versus relative scale]
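The relationship between relative scale, image area, and the CDF can be sketched in a few lines. This is only an illustration: it assumes relative scale is the square root of the box-to-image area ratio (which makes a relative scale of 0.1 equal to 1% of the area, consistent with the numbers above), and the boxes are made up.

```python
import math

def relative_scale(box_w, box_h, img_w, img_h):
    """Assumed definition: sqrt of the fraction of image area covered by the box."""
    return math.sqrt((box_w * box_h) / (img_w * img_h))

def empirical_cdf(values, threshold):
    """Fraction of values that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical (width, height) boxes inside a 640x480 image, for illustration only.
boxes = [(6, 5), (20, 15), (96, 72), (160, 120), (576, 432)]
scales = [relative_scale(w, h, 640, 480) for w, h in boxes]

frac_small = empirical_cdf(scales, 0.1)   # share of boxes with relative scale <= 0.1
area_at_0_1 = 0.1 ** 2                    # relative scale 0.1 covers 1% of the image area
ratio = 0.1 / 0.0001                      # spread between the two small scales quoted above

print(f"share below 0.1: {frac_small:.2f}, area at 0.1: {area_at_0_1:.2%}, ratio: {ratio:.0f}x")
```

Even when both objects count as "small", the ratio between their scales can be three orders of magnitude, which is the point of the second bullet.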

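For the scale variation between the pre-training dataset and the target dataset, the authors use the term domain-shift. ImageNet is built for image classification, so its objects generally have a fairly large scale, whereas object scales in detection datasets vary widely. Features pre-trained on a large-object dataset such as ImageNet can be expected to work poorly when applied directly to the small objects in detection; this is the effect of domain-shift.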

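In the end, object detection still performs poorly because a large number of very small objects exist, and small objects are hard to detect for three reasons:

1. Because small objects are small, their scales differ greatly from one another (as a ratio: the denominator is tiny, so the quotient is large). The detector needs strong scale invariance, which a CNN does not have by design.

2. Because small objects are small, features pre-trained on a dataset dominated by medium-scale objects, such as ImageNet, work poorly when used directly to detect them, due to domain-shift.

3. There is a conflict between the coarse representation that results from a CNN extracting semantic features and the fine resolution needed to detect small objects. Small objects are hard to represent well in a coarse representation and are likely to be ignored.
- The deeper layers of modern CNNs have large strides (32 pixels) that lead to a very coarse representation of the input image, which makes small object detection very challenging.
- So, fundamentally, large strides make the representation of the original image very coarse, and in such a coarse representation small objects are easily overlooked.
- The same problem exists in semantic segmentation, where we want a representation that is both fine-resolution and semantic. This is why some of the later improvements mirror what semantic segmentation methods do.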
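A quick back-of-the-envelope calculation makes the stride problem concrete. The sketch below simply divides an object's side length by the stride; the object sizes are arbitrary examples, and the 32-pixel stride is the one quoted above.

```python
def cells_covered(object_px, stride=32):
    """Side length of an object measured in feature-map cells at a given stride."""
    return object_px / stride

for size in (16, 32, 64, 256):
    print(f"{size:>3}px object -> {cells_covered(size):.2f} cells per side at stride 32")
```

A 16-pixel object spans only half a cell at stride 32, so it has to share a single feature vector with its surroundings, while a 256-pixel object still covers an 8x8 patch of the feature map.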

To alleviate the problems caused by scale variation and by small object instances themselves, the current approaches are roughly the following:

features from the layers near to the input, referred to as shallow(er) layers, are combined with deeper layers for detecting small object instances [23, 34, 1, 13, 27]
- Typical examples are FPN and SSD.
- This route addresses difficulty 3 above, coarse representation vs fine resolution.
- But the authors point out that the high level semantic features (at conv5) generated even by feature pyramid networks will not be useful for classifying small objects. High-level features do not help when the object is small. This is reasonable: the route does nothing about difficulties 1 and 2, so it is naturally helpless against them.

dilated/deformable convolution is used to increase receptive fields for detecting large objects [32, 7, 37, 8]
- This route also addresses difficulty 3, aiming for a fine-resolution representation at the end.

independent predictions at layers of different resolutions are used to capture object instances of different scales [36, 3, 22]
- This route again addresses difficulty 3: predictions are made at a suitable level of feature abstraction (resolution).

context is employed for disambiguation [38, 39, 10]
- I am not familiar with this line of work and would need to read the papers.

training is performed over a range of scales [7, 8, 15] or, inference is performed on multiple scales of an image pyramid
- For small objects this route amounts to upsampling. It is brute-force but effective, and it addresses both difficulty 1 (scale variation) and difficulty 3 (small objects leave little trace in a coarse representation). It has its own problems, which are discussed later.

predictions are combined using nonmaximum suppression [7, 8, 2, 33]

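In short, to detect small objects you either solve the problem, i.e. build a good feature representation for small objects, or eliminate the problem itself by getting rid of small objects and upsampling them all into large ones. As long as we have no good scale-invariant feature representation for small objects, upsampling looks like the more workable option. Several questions still need answering, though:

Does upsampling actually help?

How exactly should upsampling be done?

What should be upsampled? Only training, only test, or both?

If both are upsampled, how should they be used together? Should every scale be traversed, or should the scale be fixed to match that of the pre-training dataset?

In the paper itself, the authors ask the following two questions: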

Is it critical to upsample images for obtaining good performance for object detection? Even though the typical size of images in detection datasets is 480x640, why is it a common practice to up-sample them to 800x1200? Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?

When fine-tuning an object detector from a pre-trained image classification model, should the resolution of the training object instances be restricted to a tight range (from 64x64 to 256x256) after appropriately re-scaling the input images, or should all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training after up-sampling input images?

In the paper, the authors answer these questions in turn:

First, up-sampling is very important for small object detection, which is why for detection datasets it is a common practice to up-sample 480x640 to 800x1200.

Pre-training a CNN on ImageNet and then fine-tuning it on detection datasets for detecting small object instances does work and is what the paper advocates, provided that both fine-tuning and testing are done on the special pyramid the paper proposes.

To remove domain-shift during fine-tuning, it works best to restrict the size of training object instances to a tight range (from 64x64 to 256x256), so that they match the object scales of the pre-training dataset, rather than letting all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training.

To sum up, the paper's contribution, or argument, is that a detector should be trained with an image pyramid, but only objects within a fixed scale range should take part in training. The authors call this training scheme Scale Normalization for Image Pyramids (SNIP). At its core this is a paper about how to use an image pyramid, so the rest of the paper compares different ways of using one.

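The two most typical ways are:

scale-specific detectors：variation in scale is handled by training separate detectors - one for each scale range.
- One detector is responsible for objects of one scale.
- The samples here are presumably from datasets without an image pyramid applied. In that case each scale has fewer samples, so each detector is trained without using all of the samples.

scale invariant detector：training a single object detector with all training samples
- Although this is called a scale invariant detector, that is only wishful thinking: a CNN has no inherent scale invariance. Even if it ends up showing some ability to detect multi-scale objects, this is only an "illusion: the CNN simply uses its strong fitting ability to forcibly memorize objects at different scales, which wastes a large amount of capacity" [1]. In other words, the capacity is not spent where it should be.

So there is a trade-off. A scale-specific detector does not use all samples, which may hurt performance; a scale invariant detector wastes a large amount of capacity on forcibly memorizing objects at different scales instead of learning semantic information, which also hurts performance. Ideally we would not have to choose and could have both: use all samples, and avoid wasting capacity on memorizing objects at different scales. This turns out to be achievable.

SNIP uses an image pyramid so that every object has one representation whose scale matches that of the ImageNet pre-training data, and it trains only on the samples that, after the image pyramid, match that scale. This both uses all samples and spends the capacity on learning semantic information.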
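The selection rule described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the pyramid scales and the boxes are hypothetical, object size is taken as the square root of the box area, and the valid range of 64 to 256 pixels is the tight range quoted earlier.

```python
import math

VALID_RANGE = (64, 256)                 # tight range quoted above, in rescaled pixels
PYRAMID_SCALES = (0.5, 1.0, 2.0, 4.0)   # hypothetical pyramid levels

def snip_assignments(boxes, scales=PYRAMID_SCALES, valid=VALID_RANGE):
    """Map each pyramid scale to the indices of boxes that train at that scale."""
    lo, hi = valid
    assignments = {}
    for s in scales:
        keep = []
        for i, (w, h) in enumerate(boxes):
            size = math.sqrt(w * h) * s   # object size after rescaling the image by s
            if lo <= size <= hi:
                keep.append(i)            # in range: this box participates in training
        assignments[s] = keep             # boxes not listed are ignored at this scale
    return assignments

# Hypothetical (width, height) boxes in original-image pixels.
boxes = [(20, 20), (100, 80), (400, 300)]
assignments = snip_assignments(boxes)
used = {i for kept in assignments.values() for i in kept}
print(assignments, "every box used:", used == set(range(len(boxes))))
```

In this toy example the tiny box only trains at the largest pyramid level and the large box only at the smallest one, yet every box is used at least once, which is the "use all samples without memorizing every scale" property the text describes.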

Evidence

The authors set up two groups of experiments, in "3. Image Classification at Multiple Scales" and "5. Data Variation or Correct Scale?".

Fine-tuning, whether or not?

Section "3. Image Classification at Multiple Scales" studies the effect of domain shift. Beyond that, the authors are also answering another question: given that domain shift matters, should we still fine-tune, i.e. initialize from pre-trained weights, or would it be better to train from scratch directly on the detection target dataset?

The authors run three experiments, which show that despite the domain shift, pre-training followed by fine-tuning is still the right approach. This answers a question the authors posed at the start:

Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?

The answer is: yes, we can.

Moreover, domain shift does not exist only between the pre-training dataset and the target dataset. At test time we usually build an image pyramid to detect small objects, shrinking and enlarging the image, so the object scales in the test pyramid also differ from those seen in training. This is a reminder that domain shift must be taken into account whenever an image pyramid is used.

So there is domain shift between pre-training data and training data, and also between training data and test data. Although both are called domain shift, they differ somewhat: the shift between pre-training data and training data comes from the scale distribution of objects at their original resolution, while the shift between training data and test data comes from using an image pyramid at test time.

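Naive Multi-Scale Inference

This experiment takes weights pre-trained on the full-resolution dataset and applies them directly to the target dataset, without fine-tuning.

For detection, the lesson of this experiment concerns the domain shift between training data and test data caused by the image pyramid.

The network is trained on ImageNet at its original size and then tested on images that have been down-sampled and then up-sampled again.

Down-sampling the original image produces a low-resolution image; up-sampling that low-resolution image back to the training image size simulates the up-sampling done inside a pyramid. Detection ultimately performs classification on a region proposal, so although this experiment examines how a resolution mismatch between training set and test set affects classification, it also explains the effect of the same mismatch in detection.

Here, resolution refers to how sharp the image is.

The conclusion, naturally, is that the larger the resolution gap between training set and test set, the worse the result, so the two should have the same resolution.

This shows that simply enlarging small objects at test time does not work; the enlarged small objects must actually take part in training.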
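The down-sample-then-up-sample step can be mimicked with NumPy to see how much detail is lost. This is a rough sketch under stated assumptions: a random array stands in for a test image, average pooling and nearest-neighbour repetition stand in for whatever resizing the authors used, and the factors are arbitrary.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a 2-D image by an integer factor (sides must be divisible by it)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    """Nearest-neighbour up-sampling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
full = rng.random((224, 224))        # stand-in for a full-resolution test image

errors = {}
for factor in (2, 4, 7):             # 224 -> 112, 56, 32 pixels per side
    degraded = upsample(downsample(full, factor), factor)
    errors[factor] = float(np.mean((degraded - full) ** 2))
    print(f"factor {factor}: shape {degraded.shape}, MSE vs original {errors[factor]:.4f}")
```

The degraded image has the same pixel dimensions as the training images but carries less information, and the gap grows with the down-sampling factor. That mismatch is what the classifier trained on sharp images faces at test time.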

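Resolution Specific Classifiers

This experiment trains from scratch directly on the low-resolution target dataset, without pre-training.

The network in Naive Multi-Scale Inference is designed for full-resolution data and is relatively complex. The B in CNN-B presumably stands for Base, i.e. the baseline network; it simulates how a baseline network trained at full resolution performs when tested on low-resolution images.

The network in Resolution Specific Classifiers is trained and tested on low-resolution data, but to make it applicable to low-resolution images a simplified network is used, hence the name CNN-S. The training and test resolutions now match, but the network is simpler and has less capacity, which can also hurt prediction quality.

The question is then which hurts more: the simplified network, or the domain shift caused by mismatched training and test resolutions. The experimental results show that CNN-S is far better than CNN-B; note that this assumes sufficient data.

The conclusion is that, given enough data, domain shift causes a large performance loss; that is, a CNN does not learn scale invariance. One can foresee that even when testing on an image pyramid, a CNN will perform poorly on objects at scales it never saw during training, which underlines how important it is for training data and test data to share the same scale.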

Fine-tuning High-Resolution Classifiers

This experiment first pre-trains on the full-resolution pre-training dataset and then fine-tunes on the low-resolution target dataset. To be fed into the network, the low-resolution images of course have to be up-sampled.

Because this fine-tunes on top of CNN-B, it is called CNN-B-FT.

CNN-B-FT is clearly better than CNN-S. This shows that, rather than cutting the network down to a simple low-capacity one to fit low-resolution data, it is better to pre-train on the full-resolution dataset and fine-tune on the low-resolution dataset.

This is quite reasonable: compared with the random initialization of learning from scratch, pre-trained weights at least provide a sensible starting point, and training on the target dataset happens in the end anyway. Note, however, that during fine-tuning the target dataset is up-scaled to the same size as the pre-training dataset, presumably to keep object sizes consistent between the two.

Fine-tuning, how?

Training on 800 x 1400, test on 1400 x 2000

This simulates the case where only inference is performed on multiple scales of an image pyramid: training on 800 x 1400 images and then testing on 1400 x 2000 images is a strategy commonly used to detect small objects.

This is the baseline against which everything below is compared; it is called 800-all.

Training on 1400 x 2000, test on 1400 x 2000

This setting up-samples small objects, and training and test are at the same scale, yet the result is only marginally, almost negligibly, better than 800-all.

In the authors' words, up-sampling blows up the medium-to-large objects which degrades performance: medium-size objects become too big to be correctly classified.

My own understanding is that up-sampling reduces the domain shift between target dataset and pre-training dataset for small objects, but increases it for medium-size objects. Many medium objects turn into very large ones, whose scale no longer matches that of most objects in a pre-training dataset such as ImageNet.
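The "blow-up" effect is easy to check with a little arithmetic. The sketch below assumes a source image with a 480-pixel short side (the typical size quoted earlier) resized so that the short side becomes 800 or 1400 pixels; the object sizes are hypothetical, and the 64 to 256 pixel window is the tight range mentioned above.

```python
def resized_size(obj_px, src_short=480, dst_short=800):
    """Object side length after resizing the image's short side from src to dst."""
    return obj_px * dst_short / src_short

for obj in (16, 64, 200):   # hypothetical small, medium, and larger objects
    at_800 = resized_size(obj, dst_short=800)
    at_1400 = resized_size(obj, dst_short=1400)
    in_range = 64 <= at_1400 <= 256
    print(f"{obj:>3}px -> {at_800:6.1f}px @800, {at_1400:6.1f}px @1400, in 64-256 @1400: {in_range}")
```

Under these assumptions the 64-pixel object lands comfortably inside the window at 1400, but the 200-pixel object grows to almost 600 pixels, far beyond it. A single up-sampling factor cannot put every object at a suitable scale.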

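Scale specific detectors

To remove the performance drop caused by scale variation, which makes the CNN spend its capacity on memorizing rather than on learning semantics, the authors train only on small objects within a certain scale range. Scale variation is gone, but the amount of training data shrinks.

The results show that this is even worse than 800-all: removing medium-to-large objects does not help the CNN learn semantics. In other words, dropping the samples at some scales hurts semantic learning, and feeding the CNN samples at all scales and forcing it to memorize them hurts semantic learning as well.

Using only small objects also works poorly because there is too little data; large objects are actually very helpful for semantic information. Using only part of the data is worse than using all of it, even if all of it is not used particularly well.

Multi-Scale Training (MST)

An image pyramid generates several resolutions, and a single CNN is trained to fit objects at all of these resolutions. The final result is about the same as 800-all.

This shows that a CNN does not learn scale invariance. Forcing it to remember objects of different sizes damages its ability to learn semantics, so whatever gain comes from the extra data produced by the image pyramid is offset by the loss in learned semantics.

The ideal detector therefore uses all of the samples, while making sure every sample it is fed lies within a suitable scale range, so that the CNN can devote its capacity to learning semantic information.

So do features learned by a DNN have neither rotation invariance nor scale invariance? Both seem to be illusions built from piles of data, or rather rote memorization by different neurons using the network's capacity.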

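Conclusion

There are two lines of attack on scale variation. One is to increase the model's ability to learn scale variation so that it can handle it; the other is to reduce the scale variation present in the data, which simplifies the task. The authors choose the latter, which can be called simple and blunt, or treating the symptoms rather than the cause.

To give a CNN real scale invariance, one still has to think about which architectures take scale invariance into account by design, and about how to extract or learn scale invariance from data.

Besides scale invariance, a CNN does not really learn rotation invariance either. If rotation invariance matters for your target dataset, consider an approach analogous to this paper's.

Thoughts

I like this paper a lot. It gives those of us doing applied work a clear template for how to do applied research: carefully analyze the causes behind an existing problem, then find the means to solve it, rather than stacking up fancy, fashionable components. It is an example worth learning from.

Reference

Below are some references I used while writing these notes; they helped me a great deal in trying to understand SNIP.

[1] Selected CVPR18 Detection Papers (Part 2)
[2] Object Detection Paper Reading: An Analysis of Scale Invariance in Object Detection – SNIP
[3] Object Detection - SNIPER-Efficient Multi-Scale Training - Paper Notes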
