[Object Detection] On "adding convolutional and fully connected layers to a pretrained network can improve performance"
The YOLO paper states: "Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]."
[28] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.
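Object detection systems generally use a pretrained convolutional network to extract features and then add further layers on top of it, forming "Networks on Convolutional feature maps" (NoCs). A NoC can be an SVM, an MLP, or a ConvNet.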
SVM vs MLP as NoC:
Table 1 shows the results of using MLP as NoC. Here we randomly initialize the weights by Gaussian distributions. The accuracy of NoC with 2 to 4 fc layers increases with the depth. Compared with the SVM classifier trained on the RoI features (“SVM on RoI”, equivalent to a 1-fc structure), the 4-fc NoC as a classifier on the same features has 7.8% higher mAP. Note that in this comparison the NoC classifiers have no pre-training (randomly initialized). The gain is solely because that MLPs are better classifiers than single-layer SVMs.
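To make the structure of an MLP NoC concrete, here is a minimal pure-Python sketch of a region-wise classifier with a configurable number of fc layers and Gaussian-initialized weights, as the quote describes. It is not the paper's implementation: the names `make_fc_noc` and `forward`, the tiny layer sizes, the standard deviation of 0.01, the ReLU on hidden layers, and the softmax output are all assumptions made for illustration.

```python
import math
import random


def make_fc_noc(in_dim, hidden_dim, num_fc, num_classes, std=0.01, seed=0):
    """Build an MLP NoC with `num_fc` fc layers in total (hypothetical helper).

    The last fc layer is the classifier. Weights are drawn from a Gaussian,
    so nothing here is pre-trained. The std value is an assumption.
    """
    rng = random.Random(seed)
    dims = [in_dim] + [hidden_dim] * (num_fc - 1) + [num_classes]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        weights = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]
        biases = [0.0] * d_out
        layers.append((weights, biases))
    return layers


def forward(layers, roi_feature):
    """Run one flattened RoI feature vector through the NoC.

    Returns class probabilities (softmax output is an assumption).
    """
    x = list(roi_feature)
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# A 4-fc NoC over a toy 32-dimensional RoI feature; 21 = 20 VOC classes + background.
noc = make_fc_noc(in_dim=32, hidden_dim=16, num_fc=4, num_classes=21)
probs = forward(noc, [0.5] * 32)
print(len(noc), len(probs))
```

Changing `num_fc` from 2 to 4 gives the depths compared in the quote. With `num_fc=1` the NoC is a single linear layer, matching the 1-fc structure that the quote equates with "SVM on RoI".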
Using ConvNet as NoC:
In recent detection systems [12], [13], [14], [23], [24], conv layers in the pre-trained models are thought of as region-independent feature extractors, and thus are shared on the entire image without being aware of the regions that are of interest. Although this is a computationally efficient solution, it misses the opportunities of using conv layers to learn region-aware features that are fit to the regions of interest (instead of full images).
We investigate using 1 to 3 additional conv layers (with ReLU) in a NoC. We use 256 conv filters for the ZF net and 512 for the VGG net. The conv filters have a spatial size of 3×3 and a padding of 1, so the m×m spatial resolution is unchanged. After the last additional conv layer, we apply three fc layers as in the above MLP case. For example, we denote a NoC with 2 conv layers as “c256-c256-f4096-f4096-f21”.
When using VOC 07 trainval for training, the mAP is nearly unchanged when using 1 additional conv layer, but drops when using more conv layers. We observe that the degradation is a result of overfitting. The VOC 07 trainval set is too small to train deeper models. However, NoCs with conv layers show improvements when trained on the VOC 07+12 trainval set (Table 2). For this training set, the 3fc NoC baseline is lifted to 56.5% mAP. The advanced 2conv3fc NoC improves over this baseline to 58.9%. This justifies the effects of the additional conv layers. Table 2 also shows that the mAP gets saturated when using 3 additional conv layers.
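The conv layers of a pretrained network can be regarded as region-independent feature extractors: they know nothing about the regions of interest and are shared across the whole image. Adding conv layers inside the NoC makes it possible to learn region-aware features that are fitted specifically to the regions of interest.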
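The "c256-c256-f4096-f4096-f21" notation from the quote can be unpacked with a short sketch that walks such a spec and reports each layer's output shape and parameter count. It also checks that a 3×3 filter with padding 1 keeps the m×m resolution. The helper names `conv_out_size` and `describe_noc` are hypothetical, and the input of 256 channels at m = 6 is an assumed ZF-style RoI feature size, not a figure taken from the text above.

```python
def conv_out_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a conv layer (standard convolution arithmetic)."""
    return (size + 2 * padding - kernel) // stride + 1


def describe_noc(spec, in_channels, m):
    """Walk a spec such as "c256-c256-f4096-f4096-f21".

    Returns a list of (layer_name, output_shape, parameter_count) tuples.
    Conv layers are assumed to be 3x3 with padding 1, as in the quoted setup.
    """
    channels, size = in_channels, m
    flat = None  # becomes the vector length once the first fc layer is reached
    rows = []
    for token in spec.split("-"):
        kind, width = token[0], int(token[1:])
        if kind == "c":
            if flat is not None:
                raise ValueError("conv layer after an fc layer is not supported")
            params = width * channels * 3 * 3 + width  # weights + biases
            size = conv_out_size(size)
            channels = width
            rows.append((token, (channels, size, size), params))
        elif kind == "f":
            if flat is None:
                flat = channels * size * size  # flatten the m x m feature map
            params = width * flat + width  # weights + biases
            flat = width
            rows.append((token, (flat,), params))
        else:
            raise ValueError(f"unknown layer token: {token}")
    return rows


# Assumed input: 256 channels on a 6x6 RoI feature map.
for name, shape, params in describe_noc("c256-c256-f4096-f4096-f21", in_channels=256, m=6):
    print(name, shape, params)
```

Under these assumed sizes the first fc layer holds far more parameters than either added conv layer, so the region-aware conv layers add relatively little to the size of the NoC.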
[28] summarizes several findings:
The following key observations can be concluded from the above subsections:
(i) A deeper region-wise classifier is useful and is in general orthogonal to deeper feature maps.
(ii) A convolutional region-wise classifier is more effective than an MLP-based region-wise classifier.