Large Kernel Matters -- Improve Semantic Segmentation by Global Convolutional Network
The large kernel (and the effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously.
We propose a Global Convolutional Network (GCN) to address both the classification and localization issues in semantic segmentation.
We also suggest a residual-based boundary refinement to further refine the object boundaries.
Results: 82.2% (vs. 80.2%) on the PASCAL VOC 2012 dataset and 76.9% (vs. 71.8%) on the Cityscapes dataset.
Introduction
For the classification task, models are required to be invariant to various transformations such as translation and rotation. But for the localization task, models should be transformation-sensitive.
Method
First, from the localization view, the structure must be fully convolutional, without any fully-connected layer or global pooling layer as used by many classification networks, since these discard localization information. Second, from the classification view, motivated by the densely-connected structure of classification models, the kernel size of the convolutional structure should be as large as possible.
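The key trick behind a large kernel at low cost is separability: stacking a k×1 and a 1×k convolution covers a k×k receptive field with only 2k weights instead of k². A minimal single-channel numpy sketch (illustration only, not the paper's multi-channel implementation) verifies that the stacked pair equals one rank-1 k×k kernel:

```python
import numpy as np

def conv2d(img, k):
    # 'valid' 2D cross-correlation with nested loops (illustration only)
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
k = 7
img = rng.standard_normal((16, 16))
u = rng.standard_normal((k, 1))  # k x 1 kernel
v = rng.standard_normal((1, k))  # 1 x k kernel

# k x 1 followed by 1 x k == a single (rank-1) k x k kernel,
# but with 2k parameters instead of k^2
stacked = conv2d(conv2d(img, u), v)
full = conv2d(img, u @ v)
print(np.allclose(stacked, full))   # same output
print(2 * k, "params vs", k * k)    # 14 vs 49
```

The actual GCN module sums two such branches (k×1 then 1×k, and 1×k then k×1) with no nonlinearity between them, so its cost grows linearly rather than quadratically in k.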
We model boundary alignment as a residual structure: S' = S + R(S), where S is the coarse score map and R(·) is the residual branch.
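The residual refinement can be sketched in a few lines. This is a single-channel simplification assuming R is a conv → ReLU → conv branch (the paper's Boundary Refinement block operates per class channel); the 3×3 weights here are arbitrary placeholders:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3x3(x, w):
    # 'same' 3x3 convolution with zero padding, single channel (illustration)
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 8))            # coarse score map
w1, w2 = rng.standard_normal((2, 3, 3)) * 0.1  # placeholder weights

# Boundary Refinement: S' = S + R(S), with R = conv -> ReLU -> conv
S_refined = S + conv3x3(relu(conv3x3(S, w1)), w2)
```

Because the refinement is additive, a branch with zero weights leaves the score map unchanged, so the block can only improve on (never destroy) the coarse prediction as training starts.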
3. Experiments
We evaluate our approach on the standard benchmarks PASCAL VOC 2012 [11, 10] and Cityscapes [8]. PASCAL VOC 2012 has 1464 images for training, 1449 for validation, and 1456 for testing, covering 20 object classes plus one background class. We also use the Semantic Boundaries Dataset [13] as an auxiliary dataset, resulting in 10,582 training images. We choose the state-of-the-art ResNet-152 [14] (pretrained on ImageNet [28]) as our base model for fine-tuning. During training, we use standard SGD [20] with batch size 1, momentum 0.99, and weight decay 0.0005. Data augmentations such as mean subtraction and horizontal flipping are also applied.
We pad each input image to 512 × 512 so that the top-most feature map is 16 × 16.
Only odd kernel sizes k are used, to avoid alignment errors (the range of k can be chosen according to the size of the final feature map).
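The alignment argument is simple arithmetic: for stride 1 and 'same' output size, a kernel of size k needs k − 1 total padding, which splits evenly between the two sides only when k is odd. A quick check (my own illustration of the parity rule, not code from the paper):

```python
# Total padding for 'same' output with stride 1 is k - 1.
# Odd k splits it evenly; even k forces asymmetric padding,
# shifting the output by half a pixel relative to the input grid.
def same_padding(k):
    total = k - 1
    return total // 2, total - total // 2

for k in (3, 4, 7, 8):
    left, right = same_padding(k)
    print(k, (left, right), "aligned" if left == right else "misaligned")
```

With even k the feature map drifts by half a pixel per layer, which is exactly the misalignment the notes say odd kernels avoid.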
Although the GCN structure increases the parameter count, comparison with experiment C shows that the performance gain is not simply due to having more parameters.
The GCN model mainly improves accuracy in internal regions, with a smaller effect on boundary regions (for pixels at the center of large objects, GCN makes the task closer to a "pure" classification problem).