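How to Implement the SSD Object Detection Algorithm with PyTorch

阿新 • Published: 2021-11-17

Introduction

SSD stands for Single Shot MultiBox Detector. Like YOLO, it is a one-stage object detection algorithm. Because it is single-stage, it does not need to generate region proposals, so it can run at a high frame rate. SSD also predicts objects from feature maps at multiple scales, so its mAP can match or even exceed that of Faster R-CNN. In this post we walk through how SSD works in detail and implement it with PyTorch.

Model structure

VGG16

As the figure above shows, SSD uses VGG16 as its backbone, with the following modifications:

The fully connected layers FC6 and FC7 are removed and replaced with the convolutional layers Conv6 and Conv7

The max pooling layer in front of Conv6 is changed from MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) to MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False), so this pooling layer no longer shrinks the feature map fed into Conv6
Conv6 uses a 3×3 dilated convolution with a dilation rate of 6 to enlarge the receptive field
Conv7 uses an ordinary 1×1 convolution

Putting this together, the VGG16 used in SSD can be implemented as: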

from torch import nn


def vgg16(batch_norm=False) -> nn.ModuleList:
    """ Build the VGG16 backbone used by SSD

    Parameters
    ----------
    batch_norm: bool
        whether to add a batch normalization layer after each convolutional layer
    """
    layers = []
    in_channels = 3
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256,
           256, 'C', 512, 512, 512, 'M', 512, 512, 512]

    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        elif v == 'C':
            # ceil_mode=True rounds the 75x75 input up to a 38x38 output
            layers.append(nn.MaxPool2d(2, 2, ceil_mode=True))
        else:
            conv = nn.Conv2d(in_channels, v, 3, padding=1)

            # add a batch normalization layer if requested
            if batch_norm:
                layers.extend([conv, nn.BatchNorm2d(v), nn.ReLU(True)])
            else:
                layers.extend([conv, nn.ReLU(True)])

            in_channels = v

    # replace the original fc6 and fc7 fully connected layers with convolutional layers
    layers.extend([
        nn.MaxPool2d(3, 1, 1),
        nn.Conv2d(512, 1024, 3, padding=6, dilation=6),  # conv6 uses a dilated convolution to enlarge the receptive field
        nn.ReLU(True),
        nn.Conv2d(1024, 1024, 1),                        # conv7
        nn.ReLU(True)
    ])

    layers = nn.ModuleList(layers)
    return layers

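The figure below shows the difference between the original VGG16 (the torchvision.models implementation) and the VGG16 used in SSD. The two diverge starting from (30): MaxPool2d:

Extra Feature Layers

To detect large objects better, SSD appends several convolutional blocks after VGG16: Conv8_2, Conv9_2, Conv10_2 and Conv11_2. Their structure is: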

self.extras = nn.ModuleList([
    nn.Conv2d(1024, 256, 1),                        # conv8_2
    nn.Conv2d(256, 512, 3, stride=2, padding=1),
    nn.Conv2d(512, 128, 1),                         # conv9_2
    nn.Conv2d(128, 256, 3, stride=2, padding=1),
    nn.Conv2d(256, 128, 1),                         # conv10_2
    nn.Conv2d(128, 256, 3),
    nn.Conv2d(256, 128, 1),                         # conv11_2
    nn.Conv2d(128, 256, 3),
])

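Prior boxes

As mentioned earlier, SSD detects objects using multi-scale feature maps. "Multi-scale" simply means feature maps of different sizes. Suppose we feed a 300×300×3 image into the SSD network. After a series of convolution and pooling layers, we obtain the following 6 feature maps that are used for detection:

The 38×38×512 feature map output by Conv4_3
The 19×19×1024 feature map output by Conv7
The 10×10×512 feature map output by Conv8_2
The 5×5×256 feature map output by Conv9_2
The 3×3×256 feature map output by Conv10_2
The 1×1×256 feature map output by Conv11_2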
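These sizes can be checked by hand with the usual output-size formula for convolution and pooling layers. The snippet below is only a minimal sketch and not part of the original code: `out_size` and `ssd300_feature_map_sizes` are hypothetical helper names, and the layer parameters are copied from the `vgg16()` and `extras` definitions shown earlier.

```python
import math


def out_size(size, kernel, stride=1, padding=0, dilation=1, ceil_mode=False):
    """Spatial size after one conv/pool layer (standard output-size formula)."""
    q = (size + 2 * padding - dilation * (kernel - 1) - 1) / stride
    return (math.ceil(q) if ceil_mode else math.floor(q)) + 1


def ssd300_feature_map_sizes(size=300):
    """Trace the side length of the six detection feature maps."""
    sizes = []
    # 3x3 convolutions with padding=1 keep the size, so only the pools matter here
    size = out_size(size, 2, 2)                      # first 'M'
    size = out_size(size, 2, 2)                      # second 'M'
    size = out_size(size, 2, 2, ceil_mode=True)      # 'C' pool
    sizes.append(size)                               # conv4_3
    size = out_size(size, 2, 2)                      # third 'M'
    size = out_size(size, 3, 1, padding=1)           # pool before conv6
    size = out_size(size, 3, padding=6, dilation=6)  # conv6 (dilated)
    size = out_size(size, 1)                         # conv7
    sizes.append(size)
    # extra layers: (stride, padding) of the 3x3 convolution in each block;
    # the 1x1 convolutions in front of them keep the size
    for stride, padding in [(2, 1), (2, 1), (1, 0), (1, 0)]:
        size = out_size(size, 3, stride, padding)
        sizes.append(size)
    return sizes


feature_map_sizes = ssd300_feature_map_sizes()
```

Note how `ceil_mode=True` on the third pooling layer is what turns the 75×75 input into a 38×38 map rather than 37×37.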

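As the network gets deeper, the feature maps get smaller and the features of small objects may be lost, so large feature maps are used to detect small objects and small feature maps to detect large objects. As shown in sub-figure (a) below, the cat marked with the blue box is fairly small, so it should be detected on the larger 8×8 feature map (8×8 is used here only to contrast with 4×4; the real network does not use an 8×8 map), while the dog marked with the red box is larger and should be detected on the 4×4 feature map. But how exactly are these feature maps used for detection? This is where the prior box (called default box in the paper) comes in. As shown in sub-figure (b), at every pixel of every feature map we generate a set of boxes with different aspect ratios \(a_r\) and scales \(scale\), called prior boxes. Once we know the input image size (300×300 here), \(a_r\) and \(scale\), the sizes of these priors are determined.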

Computing \(scale\)

Number the 6 feature maps listed above from 0 to 5, and write the \(scale\) of the smallest square prior in the \(k\)-th feature map as \(s_k\). Then \(s_k\) is defined as:

\[s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1} (k-1), \quad k \in [1, m] \]

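Here \(s_{max}\) is 0.9 and \(s_{min}\) is 0.2. The formula shows that \(s_k\) grows with \(k\). In other words, the smaller the feature map, the larger \(s_k\), because small feature maps are used to detect large objects. In the paper the authors directly set \(s_0\) to \(s_{min}/2\), i.e. 0.1, and the \(s_k\) of the remaining 5 feature maps are given by the formula above with \(m=5\): \(s_1=0.2\), \(s_2=0.375\), \(s_3=0.55\), \(s_4=0.725\), \(s_5=0.9\). Multiplying \(s_k\) by the input image size 300 gives the actual size of the priors, \([30,\ 60,\ 112.5,\ 165,\ 217.5,\ 270]\). This raises a question: image dimensions are integers, so how can a prior have a fractional size? The formula for \(s_k\) is therefore tweaked to: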

\[s_k=\left(\lfloor s_{min}\times100 \rfloor + \lfloor \frac{\lfloor s_{max}\times100 \rfloor - \lfloor s_{min}\times 100 \rfloor}{m-1}\rfloor (k-1)\right)/100, \quad k \in [1, m] \]

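Recomputing gives \(s_1=0.2\), \(s_2=0.37\), \(s_3=0.54\), \(s_4=0.71\), \(s_5=0.88\), and the corresponding prior sizes are \([30,\ 60,\ 111,\ 162,\ 213,\ 264]\). This gives the size of the smallest square prior in each feature map. The figure above shows that, besides this small square, there is also a large square. How is its size determined? The paper uses the geometric mean: \(s_k'=\sqrt{s_ks_{k+1}}\), and for the smallest feature map \(s_{k+1}\) is taken to be 1.05. A small grumble: the defining formula for \(s_k\) is not very meaningful. Simply listing a few \(s_k\) values that yield integer prior sizes would do the job, so there seems to be no need to make it this complicated _(:3」∠)_.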
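The arithmetic behind the floored formula can be reproduced in a few lines. This is a sketch under stated assumptions rather than the author's code: `ssd_box_sizes` and `big_square_size` are hypothetical names, and the scales are passed as integer percentages (20 and 90) so that the floor operations become exact integer divisions.

```python
from math import sqrt


def ssd_box_sizes(image_size=300, s_min_pct=20, s_max_pct=90, m=5):
    """Return (min_sizes, max_sizes) in pixels for the m + 1 feature maps."""
    step = (s_max_pct - s_min_pct) // (m - 1)   # floor(70 / 4) = 17
    # scales in percent: s_0 = s_min / 2, then s_1 .. s_m from the floored formula
    pcts = [s_min_pct // 2] + [s_min_pct + step * k for k in range(m)]
    # one more step past s_m gives 105, which reproduces the 1.05 used for the last map
    pcts.append(pcts[-1] + step)
    sizes = [p * image_size // 100 for p in pcts]
    return sizes[:-1], sizes[1:]


def big_square_size(min_size, max_size):
    """Side of the large square prior: geometric mean of s_k and s_{k+1}."""
    return sqrt(min_size * max_size)


min_sizes, max_sizes = ssd_box_sizes()
```

With the defaults, `min_sizes` is the list of small-square sizes quoted above, and `max_sizes` is the same list shifted by one map with 315 (that is, 1.05 × 300) appended.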

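Aspect ratios \(a_r\)

The paper gives \(a_r\) as \(\left\{1,\ 1',\ 2,\ \frac{1}{2},\ 3,\ \frac{1}{3} \right\}\). The first two aspect ratios correspond to the small square \(s_k\) and the large square \(s_k'\), and the remaining 4 correspond to 4 rectangular priors whose width and height are given by \(w_k^a=s_k\sqrt{a_r},\ h_k^a=s_k/\sqrt{a_r}\). In practice not every feature map uses all of \(\left\{1,\ 1',\ 2,\ \frac{1}{2},\ 3,\ \frac{1}{3} \right\}\): the first and the last two feature maps only use \(\left\{1,\ 1',\ 2,\ \frac{1}{2} \right\}\) to generate priors. The widths and heights of the priors are fractions here because, in practice, the ground-truth box coordinates \((x_{min},\ y_{min},\ x_{max},\ y_{max})\), i.e. the top-left and bottom-right corners, are normalized.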
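To make the width and height formulas concrete, here is a small sketch that lists the shapes of all priors at a single location. The name `prior_shapes` and its argument layout are assumptions made for illustration only. Each ratio in `ratios` yields two rectangles, \(a_r\) and \(1/a_r\).

```python
from math import sqrt


def prior_shapes(s_k, s_k_next, ratios=(2, 3)):
    """Return the (width, height) of every prior generated at one location."""
    s_k_prime = sqrt(s_k * s_k_next)
    shapes = [(s_k, s_k), (s_k_prime, s_k_prime)]   # ratios 1 and 1'
    for ar in ratios:
        shapes.append((s_k * sqrt(ar), s_k / sqrt(ar)))   # ratio ar
        shapes.append((s_k / sqrt(ar), s_k * sqrt(ar)))   # ratio 1 / ar
    return shapes


six_boxes = prior_shapes(0.2, 0.37)                 # maps using {1, 1', 2, 1/2, 3, 1/3}
four_boxes = prior_shapes(0.1, 0.2, ratios=(2,))    # maps using {1, 1', 2, 1/2}
```

A useful sanity check is that every rectangular prior has the same area as the small square, \(s_k^2\), since \(w_k^a h_k^a = s_k^2\).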

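Generating the priors

We use the center \((c_x,\ c_y)\) of each pixel \((i,\ j)\) to generate the priors \((c_x,\ c_y,\ w_k^a,\ h_k^a)\), where \(c_x=\frac{i+0.5}{|f_k|}\), \(c_y=\frac{j+0.5}{|f_k|}\). Here \(|f_k|\) is the size of the feature map, and dividing by it normalizes the center coordinates. Note: the image origin is the top-left corner, with \(x\) increasing to the right and \(y\) increasing downward. When the image is stored as a matrix, the column index \(j\) is the \(x\) coordinate and the row index \(i\) is the \(y\) coordinate, so the code uses \(c_x=\frac{j+0.5}{|f_k|}\), \(c_y=\frac{i+0.5}{|f_k|}\).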

# coding:utf-8
from itertools import product
from math import sqrt

import torch


class PriorBox:
    """ Generates the prior boxes """

    def __init__(self, image_size=300, feature_maps: list = None, min_sizes: list = None,
                 max_sizes: list = None, aspect_ratios: list = None, steps: list = None, **kwargs):
        """
        Parameters
        ----------
        image_size: int
            image size

        feature_maps: list
            feature map sizes

        min_sizes: list
            size of the smallest square prior in each feature map

        max_sizes: list
            size of the smallest square prior in the next feature map

        aspect_ratios: list
            aspect ratios

        steps: list
            step of each feature map, i.e. roughly how many input pixels one cell covers
        """
        self.image_size = image_size
        self.feature_maps = feature_maps or [38, 19, 10, 5, 3, 1]
        self.min_sizes = min_sizes or [30, 60, 111, 162, 213, 264]
        self.max_sizes = max_sizes or [60, 111, 162, 213, 264, 315]
        self.steps = steps or [8, 16, 32, 64, 100, 300]
        self.aspect_ratios = aspect_ratios or [
            [2], [2, 3], [2, 3], [2, 3], [2], [2]]

    def __call__(self):
        """ Generate all prior boxes

        Returns
        -------
        boxes: Tensor of shape `(n_priors, 4)`
            prior boxes
        """
        boxes = []

        for k, f in enumerate(self.feature_maps):
            f_k = self.image_size / self.steps[k]

            for i, j in product(range(f), repeat=2):
                # center coordinates, x increases to the right and y increases downward
                cx = (j+0.5) / f_k
                cy = (i+0.5) / f_k

                # aspect ratios 1 and 1'
                s_k = self.min_sizes[k]/self.image_size
                s_k_prime = sqrt(s_k*self.max_sizes[k]/self.image_size)

                boxes.append([cx, cy, s_k, s_k])
                boxes.append([cx, cy, s_k_prime, s_k_prime])

                # compute width and height for the remaining aspect ratios
                for ar in self.aspect_ratios[k]:
                    boxes.append([cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)])
                    boxes.append([cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)])

        boxes = torch.Tensor(boxes).clamp(min=0, max=1)
        return boxes

Now let's count the priors generated on each feature map:

Feature map 1: \(38\times38\times4=5776\)
Feature map 2: \(19 \times 19 \times 6 = 2166\)
Feature map 3: \(10 \times 10 \times 6 = 600\)
Feature map 4: \(5\times 5\times 6=150\)
Feature map 5: \(3\times 3\times 4=36\)
Feature map 6: \(1\times 1\times 4=4\)

So a 300×300 image yields \(5776+2166+600+150+36+4=8732\) priors in total.
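The same count can be derived directly from the `PriorBox` defaults: each location gets 2 square priors plus 2 rectangles per entry in `aspect_ratios`. The helper name `count_priors` below is hypothetical, and the snippet is only a quick check, not part of the original code.

```python
def count_priors(feature_maps, aspect_ratios):
    """Number of priors per feature map: f * f * (2 squares + 2 per extra ratio)."""
    return [f * f * (2 + 2 * len(ratios))
            for f, ratios in zip(feature_maps, aspect_ratios)]


per_map = count_priors([38, 19, 10, 5, 3, 1],
                       [[2], [2, 3], [2, 3], [2, 3], [2], [2]])
total = sum(per_map)
```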

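Location offsets and confidences

So far we only have a dense pile of priors. We do not yet know whether a prior contains an object, or which object if it does. And because priors sit at fixed positions, using them directly to localize objects would be very inaccurate, so we also need to predict the offsets between the priors and the ground-truth boxes (already normalized). For each feature map we apply the Detector & Classifier module in the figure above. It consists of convolutional layers that predict the offsets between priors and ground-truth boxes, as well as the confidence (between 0 and 1) that a prior contains an object of each class.

Suppose we want to detect n_classes classes (including background), the feature map has in_channels channels, and k priors are generated at each pixel. Each prior needs a confidence for each of the n_classes classes, so the classifier has k*n_classes output channels. Likewise, each offset consists of 4 values, so the detector has k*4 output channels. In code: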

self.confs = nn.ModuleList([
    nn.Conv2d(512, n_classes*4, 3, padding=1),
    nn.Conv2d(1024, n_classes*6, 3, padding=1),
    nn.Conv2d(512, n_classes*6, 3, padding=1),
    nn.Conv2d(256, n_classes*6, 3, padding=1),
    nn.Conv2d(256, n_classes*4, 3, padding=1),
    nn.Conv2d(256, n_classes*4, 3, padding=1),
])
self.locs = nn.ModuleList([
    nn.Conv2d(512, 4*4, 3, padding=1),
    nn.Conv2d(1024, 4*6, 3, padding=1),
    nn.Conv2d(512, 4*6, 3, padding=1),
    nn.Conv2d(256, 4*6, 3, padding=1),
    nn.Conv2d(256, 4*4, 3, padding=1),
    nn.Conv2d(256, 4*4, 3, padding=1),
])

The first feature map additionally goes through an \(\text{L2 Normalization}\) step before detection. According to the paper:

Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation.

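Because the features output by Conv4_3 have a different (larger) scale than those of the other layers, and no \(\text{Batch Normalization}\) is used, an \(\text{L2 Normalization}\) is added to remove the scale difference. The authors do not stop at plain normalization, though: each channel is also multiplied by a learnable scaling factor \(scale\), initialized to 20. For more on normalization, see the blog post 《為什麼要做特徵歸一化/標準化？》 (Why normalize or standardize features?). The code for this normalization is: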

import torch
from torch import Tensor, nn
from torch.nn import init


class L2Norm(nn.Module):
    """ L2 normalization """

    def __init__(self, n_channels: int, scale=20):
        """
        Parameters
        ----------
        n_channels: int
            number of channels

        scale: float
            initial value of the learnable scaling factor
        """
        super().__init__()
        self.gamma = scale
        self.eps = 1e-10
        self.n_channels = n_channels
        self.weight = nn.Parameter(Tensor(self.n_channels))
        self.reset_parameters()

    def reset_parameters(self):
        init.constant_(self.weight, self.gamma)

    def forward(self, x: Tensor):
        # L2 norm over the channel dimension at every spatial location
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt()+self.eps
        x = torch.div(x, norm)
        # reshape weight to [1, n_channels, 1, 1]
        y = x*self.weight[None, ..., None, None]
        return y
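To see what this layer does at a single spatial location, the following pure-Python sketch applies the same two steps (divide by the L2 norm across channels, then multiply by the per-channel weight) to one channel vector. `l2_normalize` is a hypothetical name used for illustration, and the weights are fixed at their initial value of 20 rather than learned.

```python
from math import sqrt


def l2_normalize(channels, weights, eps=1e-10):
    """Normalize one location's channel vector to unit L2 norm, then rescale per channel."""
    norm = sqrt(sum(c * c for c in channels)) + eps
    return [w * c / norm for c, w in zip(channels, weights)]


scaled = l2_normalize([3.0, 4.0], [20.0, 20.0])
scaled_norm = sqrt(sum(v * v for v in scaled))
```

With all weights equal to 20, the feature norm at that location becomes 20 regardless of the input magnitude, which is the behavior the quoted passage of the paper describes.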

At this point the network can predict the offsets and class confidences of the priors. Here is the forward pass of the SSD network:

class SSD(nn.Module):
    """ SSD neural network model """

    def __init__(self, n_classes: int, variance=(0.1, 0.2), top_k=200, conf_thresh=0.01,
                 nms_thresh=0.45, image_size=300, **config):
        """
        Parameters
        ----------
        n_classes: int
            number of classes to predict, including background

        variance: Tuple[float, float]
            variances of the prior boxes

        top_k: int
            maximum number of bounding boxes per class

        conf_thresh: float
            confidence threshold

        nms_thresh: float
            IoU threshold used in nms

        image_size: int
            image size

        **config:
            configuration for prior box generation
        """
        super().__init__()

        if len(variance) != 2:
            raise ValueError("variance must have exactly 2 elements")

        self.n_classes = n_classes
        self.image_size = image_size
        config['image_size'] = image_size

        # generate the prior boxes
        self.priorbox_generator = PriorBox(**config)
        self.prior = Tensor(self.priorbox_generator())

        # sub-modules
        self.vgg = vgg16()
        self.l2norm = L2Norm(512, 20)
        self.extras = nn.ModuleList([
            # same as above, omitted
        ])
        self.confs = nn.ModuleList([
            # same as above, omitted
        ])
        self.locs = nn.ModuleList([
            # same as above, omitted
        ])

        # introduced shortly
        self.detector = Detector(
            n_classes, variance, top_k, conf_thresh, nms_thresh)

    def forward(self, x: Tensor):
        """
        Parameters
        ----------
        x: Tensor of shape `(N, 3, H, W)`
            image data

        Returns
        -------
        loc: Tensor of shape `(N, n_priors, 4)`
            offsets

        conf: Tensor of shape `(N, n_priors, n_classes)`
            class confidences

        prior: Tensor of shape `(n_priors, 4)`
            prior boxes
        """
        loc = []
        conf = []
        sources = []

        # batch size
        N = x.size(0)

        # compute the feature map output by conv4_3
        for layer in self.vgg[:23]:
            x = layer(x)

        # store the l2-normalized output of conv4_3
        sources.append(self.l2norm(x))

        # run the remaining layers of vgg16
        for layer in self.vgg[23:]:
            x = layer(x)

        # store the feature map output by conv7
        sources.append(x)

        # compute the feature maps of the extra layers
        for i, layer in enumerate(self.extras):
            x = F.relu(layer(x), inplace=True)
            if i % 2 == 1:
                sources.append(x)

        # run the classifier and detector heads, moving channels to the last dimension for stacking
        for x, conf_layer, loc_layer in zip(sources, self.confs, self.locs):
            loc.append(loc_layer(x).permute(0, 2, 3, 1).contiguous())
            conf.append(conf_layer(x).permute(0, 2, 3, 1).contiguous())

        # final shapes are (batch_size, n_priors, n_classes) and (batch_size, n_priors, 4)
        conf = torch.cat([i.view(N, -1) for i in conf], dim=1)
        loc = torch.cat([i.view(N, -1) for i in loc], dim=1)

        return loc.view(N, -1, 4), conf.view(N, -1, self.n_classes), self.prior

    @torch.no_grad()
    def predict(self, x: Tensor):
        """
        Parameters
        ----------
        x: Tensor of shape `(N, 3, H, W)`
            image data

        Returns
        -------
        out: Tensor of shape `(N, n_classes, top_k, 5)`
            detection results; the first four elements of the last dimension are the box
            coordinates `(xmin, ymin, xmax, ymax)` and the last element is the confidence
        """
        loc, conf, prior = self(x)
        return self.detector(loc, F.softmax(conf, dim=-1), prior.to(loc.device))

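Encoding and decoding

So far we have deliberately avoided saying how the offsets are used. Let's go into that now. Suppose the \(i\)-th prior is \(d_i=(d_i^{cx},\ d_i^{cy},\ d_i^w,\ d_i^h)\) and the ground-truth box is \(g_j=(g_j^{cx},\ g_j^{cy},\ g_j^w,\ g_j^h)\). Then the ground-truth offsets are: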

\[\hat{g}_j^{cx}=(g_j^{cx}-\ d_i^{cx})/d_i^w\\ \hat{g}_j^{cy}=(g_j^{cy}-\ d_i^{cy})/d_i^h\\ \hat{g}_j^{w}=\log(\frac{g_j^{w}}{d_i^w})\\ \hat{g}_j^{h}=\log(\frac{g_j^{h}}{d_i^h}) \]

In practice the code also uses a pair of variances \(variances=(center\_variance,\ size\_variance)\), with \(variances=(0.1,\ 0.2)\), which changes the formulas above to:

\[\hat{g}_j^{cx}=(g_j^{cx}-\ d_i^{cx})/d_i^w/center\_variance\\ \hat{g}_j^{cy}=(g_j^{cy}-\ d_i^{cy})/d_i^h/center\_variance\\ \hat{g}_j^{w}=\log(\frac{g_j^{w}}{d_i^w})/size\_variance\\ \hat{g}_j^{h}=\log(\frac{g_j^{h}}{d_i^h})/size\_variance \]

The author explains this operation as follows:

It is used to encode the ground truth box w.r.t. the prior box. You can check this function. Note that it is used in the original MultiBox paper by Erhan etal. It is also used in Faster R-CNN as well. I think the major goal of including the variance is to scale the gradient. Of course you can also think of it as approximate a gaussian distribution with variance of 0.1 around the box coordinates.

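In other words, \(variances\) rescales the gradients, which can help speed up training.

With \(variances\) in place, the predicted bounding box is decoded from the predicted offsets \(l=(l^{cx},\ l^{cy},\ l^w,\ l^h)\) and the prior as follows: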

\[g^{cx}_{\text{predict}}=d^w*center\_variance*l^{cx}+d^{cx}\\ g^{cy}_{\text{predict}}=d^h*center\_variance*l^{cy}+d^{cy}\\ g^{w}_{\text{predict}}=d^we^{size\_variance*l^w}\\ g^{h}_{\text{predict}}=d^he^{size\_variance*l^h} \]

Following these formulas, encoding and decoding are implemented as:

def encode(prior: Tensor, matched_bbox: Tensor, variance: tuple):
    """ Encode the offsets between the priors and their matched bounding boxes

    Parameters
    ----------
    prior: Tensor of shape `(n_priors, 4)`
        prior boxes in `(cx, cy, w, h)` form

    matched_bbox: Tensor of shape `(n_priors, 4)`
        matched bounding boxes in `(xmin, ymin, xmax, ymax)` form

    variance: Tuple[float, float]
        variances of the prior boxes

    Returns
    -------
    g: Tensor of shape `(n_priors, 4)`
        encoded offsets
    """
    matched_bbox = corner_to_center(matched_bbox)
    g_cxcy = (matched_bbox[:, :2]-prior[:, :2]) / (variance[0]*prior[:, 2:])
    g_wh = torch.log(matched_bbox[:, 2:]/prior[:, 2:]+1e-5) / variance[1]
    return torch.cat((g_cxcy, g_wh), dim=1)


def decode(loc: Tensor, prior: Tensor, variance: tuple):
    """ Decode bounding box positions from the offsets and the priors

    Parameters
    ----------
    loc: Tensor of shape `(n_priors, 4)`
        predicted offsets, encoded relative to the priors

    prior: Tensor of shape `(n_priors, 4)`
        prior boxes in `(cx, cy, w, h)` form

    variance: Tuple[float, float]
        variances of the prior boxes

    Returns
    -------
    g: Tensor of shape `(n_priors, 4)`
        bounding boxes in `(xmin, ymin, xmax, ymax)` form
    """
    bbox = torch.cat((
        prior[:, :2] + prior[:, 2:] * variance[0] * loc[:, :2],
        prior[:, 2:] * torch.exp(variance[1] * loc[:, 2:])), dim=1)
    bbox = center_to_corner(bbox)
    return bbox
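A quick way to convince yourself that the two sets of formulas are inverses is to run a single box through both. The sketch below works on plain tuples in `(cx, cy, w, h)` form for both the prior and the ground-truth box; `encode_box` and `decode_box` are hypothetical single-box counterparts written for illustration and skip the corner/center conversion and the `1e-5` term used in the tensor code.

```python
from math import exp, log


def encode_box(prior, gt, variance=(0.1, 0.2)):
    """Offsets of a ground-truth box relative to a prior, both in (cx, cy, w, h) form."""
    d_cx, d_cy, d_w, d_h = prior
    g_cx, g_cy, g_w, g_h = gt
    return ((g_cx - d_cx) / d_w / variance[0],
            (g_cy - d_cy) / d_h / variance[0],
            log(g_w / d_w) / variance[1],
            log(g_h / d_h) / variance[1])


def decode_box(prior, loc, variance=(0.1, 0.2)):
    """Recover a box in (cx, cy, w, h) form from a prior and its offsets."""
    d_cx, d_cy, d_w, d_h = prior
    l_cx, l_cy, l_w, l_h = loc
    return (d_w * variance[0] * l_cx + d_cx,
            d_h * variance[0] * l_cy + d_cy,
            d_w * exp(variance[1] * l_w),
            d_h * exp(variance[1] * l_h))


prior = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.48, 0.3, 0.1)
offsets = encode_box(prior, gt)
recovered = decode_box(prior, offsets)
```

Here a center shift of 0.05 on a prior of width 0.2 becomes an offset of 2.5 after dividing by the center variance 0.1, which illustrates how the variances enlarge the regression targets.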

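Removing redundant predicted boxes

Consider this problem: we have generated a large number of predicted boxes, and neighboring boxes very likely detect the same object. Drawing all predictions would cover the image with a dense mass of boxes. How do we fix this? This is where the non-maximum suppression (NMS) algorithm comes in.

Intersection over Union

NMS relies on Intersection over Union (IoU), which measures how much two predicted boxes overlap. Suppose we have two predicted boxes \(b_0=(x_{min0},\ y_{min0},\ x_{max0},\ y_{max0})\) and \(b_1=(x_{min1},\ y_{min1},\ x_{max1},\ y_{max1})\). Keeping \(b_0\) fixed and moving \(b_1\) around, there are four ways they can overlap, as shown in the figure below. The IoU is \(IOU=C/(A+B-C)\), i.e. the area of the intersection divided by the area of the union. Although the figure shows four overlap cases, they can all be computed in the same way, \(C=w_c*h_c\):

The width of the intersection \(C\) is \(w_c=x_2-x_1\), where \(x_2=\min\{x_{max0},\ x_{max1}\}\) and \(x_1=\max\{x_{min0},\ x_{min1} \}\);
The height of the intersection \(C\) is \(h_c=y_2-y_1\), where \(y_2=\min\{y_{max0},\ y_{max1}\}\) and \(y_1=\max\{y_{min0},\ y_{min1} \}\);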
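The two formulas translate almost literally into code. This is a minimal single-pair sketch with a hypothetical function name `iou`. The one detail the formulas leave implicit is the non-overlapping case: if \(w_c\) or \(h_c\) is not positive the boxes do not intersect, so the intersection is treated as 0.

```python
def iou(b0, b1):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    w_c = min(b0[2], b1[2]) - max(b0[0], b1[0])
    h_c = min(b0[3], b1[3]) - max(b0[1], b1[1])
    if w_c <= 0 or h_c <= 0:
        return 0.0   # the boxes do not overlap
    inter = w_c * h_c
    area0 = (b0[2] - b0[0]) * (b0[3] - b0[1])
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    return inter / (area0 + area1 - inter)


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))    # intersection 1, union 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

Checking each side separately matters: if both \(w_c\) and \(h_c\) are negative, their product is positive and would look like a valid intersection area.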

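The NMS procedure

Suppose the class we want to detect is the sports car in the figure above. The inputs to NMS are all predicted boxes, boxes, and their confidences for containing a sports car, scores. For a 300×300 input image, the predicted boxes have shape \((8732, 4)\): one box per row, 8732 in total. The NMS procedure is:

Sort boxes by scores in descending order and keep only the first top_k boxes;
Initialize an empty list keep to hold the boxes that survive;
Pick the box \(b_0\) with the highest confidence from boxes and add it to keep;
Compute the IoU between this box and each remaining box. If the IoU between some box \(b_i\) and \(b_0\) is greater than the threshold overlap_thresh, remove \(b_i\) from boxes;
Repeat the steps above until boxes is empty, then return keep

The corresponding code is: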

import torch
from torch import Tensor


def nms(boxes: Tensor, scores: Tensor, overlap_thresh=0.5, top_k=200):
    """ Non-maximum suppression, removes redundant predicted boxes

    Parameters
    ----------
    boxes: Tensor of shape `(n_priors, 4)`
        predicted boxes in `(xmin, ymin, xmax, ymax)` form

    scores: Tensor of shape `(n_priors, )`
        confidence of each prior for one class

    overlap_thresh: float
        IoU threshold; boxes whose IoU with a kept box exceeds it are removed,
        so the larger the value, the more boxes are kept

    top_k: int
        maximum number of predicted boxes to keep

    Returns
    -------
    indexes: LongTensor of shape `(n, )`
        indexes of the kept bounding boxes
    """
    keep = []
    if boxes.numel() == 0:
        return torch.LongTensor(keep)

    # area of each predicted box
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = (x2-x1)*(y2-y1)

    # sort the scores in descending order and keep the first top_k indexes
    _, indexes = scores.sort(dim=0, descending=True)
    indexes = indexes[:top_k]

    while indexes.numel():
        i = indexes[0]
        keep.append(i)

        # stop when this was the last index
        if indexes.numel() == 1:
            break

        # intersection between the current box and the other boxes
        right = x2[indexes].clamp(max=x2[i].item())
        left = x1[indexes].clamp(min=x1[i].item())
        bottom = y2[indexes].clamp(max=y2[i].item())
        top = y1[indexes].clamp(min=y1[i].item())
        # clamp width and height separately so two negative extents cannot
        # multiply into a positive "intersection"
        inter = (right-left).clamp(min=0)*(bottom-top).clamp(min=0)

        # compute iou
        iou = inter/(area[i]+area[indexes]-inter)

        # keep the boxes whose iou is below the threshold; a box's iou with itself is 1
        indexes = indexes[iou < overlap_thresh]

    return torch.LongTensor(keep)
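For readers who want to trace the procedure on a tiny example without tensors, here is a list-based sketch of the same greedy loop. `box_iou` and `nms_lists` are hypothetical names, and the code favors readability over speed; it is not a replacement for the tensor version above.

```python
def box_iou(b0, b1):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    w = max(0.0, min(b0[2], b1[2]) - max(b0[0], b1[0]))
    h = max(0.0, min(b0[3], b1[3]) - max(b0[1], b1[1]))
    inter = w * h
    union = ((b0[2] - b0[0]) * (b0[3] - b0[1])
             + (b1[2] - b1[0]) * (b1[3] - b1[1]) - inter)
    return inter / union


def nms_lists(boxes, scores, overlap_thresh=0.5, top_k=200):
    """Greedy NMS on plain lists, returns the indexes of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score
        keep.append(best)
        # drop every box that overlaps the kept one too much
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < overlap_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_lists(boxes, scores)
```

The first two boxes overlap with an IoU of 81/119, about 0.68, so the lower-scoring one is suppressed, while the third box does not overlap anything and survives.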

Applying NMS to the predicted boxes of every class gives the following code:

# coding:utf-8
import torch
from torch import Tensor

from utils.box_utils import decode, nms


class Detector:
    """ Post-processes the output of the SSD network, used at test time """

    def __init__(self, n_classes: int, variance: list, top_k=200, conf_thresh=0.01, nms_thresh=0.45) -> None:
        """
        Parameters
        ----------
        n_classes: int
            number of classes, including background

        variance: Tuple[float, float]
            variances of the prior boxes

        top_k: int
            maximum number of predicted boxes

        conf_thresh: float
            confidence threshold

        nms_thresh: float
            IoU threshold used in nms; the larger it is, the more predicted boxes are kept
        """
        self.n_classes = n_classes
        self.conf_thresh = conf_thresh
        self.nms_thresh = nms_thresh
        self.variance = variance
        self.top_k = top_k

    def __call__(self, loc: Tensor, conf: Tensor, prior: Tensor):
        """ Generate the predicted boxes

        Parameters
        ----------
        loc: Tensor of shape `(N, n_priors, 4)`
            predicted offsets

        conf: Tensor of shape `(N, n_priors, n_classes)`
            class confidences, must already have gone through softmax

        prior: Tensor of shape `(n_priors, 4)`
            prior boxes

        Returns
        -------
        out: Tensor of shape `(N, n_classes, top_k, 5)`
            detection results; the first four elements of the last dimension are the box
            coordinates `(xmin, ymin, xmax, ymax)` and the last element is the confidence
        """
        N = loc.size(0)

        # an image may contain several objects of the same class, hence the extra n_classes dimension
        out = torch.zeros(N, self.n_classes, self.top_k, 5)

        for i in range(N):
            # decode the bounding boxes
            bbox = decode(loc[i], prior, self.variance)
            conf_scores = conf[i].clone()   # Shape: (n_priors, n_classes)

            for c in range(1, self.n_classes):
                # filter out confidences below the threshold
                mask = conf_scores[:, c] > self.conf_thresh
                scores = conf_scores[:, c][mask]    # Shape: (n_prior', )

                # skip this class if no prior predicted it
                if scores.size(0) == 0:
                    continue

                # filter out the boxes whose confidence is below the threshold
                boxes = bbox[mask]

                # non-maximum suppression removes the redundant boxes
                indexes = nms(boxes, scores, self.nms_thresh, self.top_k)
                out[i, c, :len(indexes)] = torch.cat(
                    (boxes[indexes], scores[indexes].unsqueeze(1)), dim=1)

        return out

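Training the model

Matching priors to ground-truth boxes

Before training, we need to label which priors contain an object and which do not. Priors that contain an object are called positive samples, and those that do not are negative samples (corresponding to the background class). If a prior contains an object, we also need to determine its class. To decide which samples are positive and which are negative, we match priors to ground-truth boxes as follows:

Compute the IoU between all priors and all ground-truth boxes in an image. With n_priors priors and n_objects ground-truth boxes, the resulting IoU matrix iou has shape (n_priors, n_objects), and its element at row \(i\), column \(j\) is the IoU between the \(i\)-th prior and the \(j\)-th ground-truth box;
Taking the max of iou along dim=0 gives, for each ground-truth box, the prior with the highest IoU. Taking the max along dim=1 gives, for each prior, the ground-truth box with the highest IoU;
Mark the best-matching prior of each ground-truth box as a positive sample and everything else as negative. This raises a problem: there are many priors but very few positives, so positives and negatives are heavily imbalanced, which makes training difficult. Hence the next step;
For priors not yet marked positive, if the IoU with their best-matching ground-truth box is greater than the threshold overlap_thresh, mark them as positive as well.

In summary, every ground-truth box is matched to at least one prior, whereas a prior does not necessarily have a matching ground-truth box. The matching code is: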

def match(overlap_thresh: float, prior: Tensor, bbox: Tensor, variance: tuple, label: Tensor):
    """ Match the priors to the ground-truth bounding boxes

    Parameters
    ----------
    overlap_thresh: float
        IoU threshold

    prior: Tensor of shape `(n_priors, 4)`
        prior boxes in `(cx, cy, w, h)` form

    bbox: Tensor of shape `(n_objects, 4)`
        ground-truth bounding boxes in `(xmin, ymin, xmax, ymax)` form

    variance: Tuple[float, float]
        variances of the prior boxes

    label: Tensor of shape `(n_objects, )`
        class labels

    Returns
    -------
    loc: Tensor of shape `(n_priors, 4)`
        encoded offsets between the priors and their matched bounding boxes

    conf: Tensor of shape `(n_priors, )`
        class of the object in each prior
    """
    # IoU between every prior and every ground-truth box in the image
    iou = jaccard_overlap(center_to_corner(prior), bbox)

    # iou and index of the best prior for each bounding box, shape (n_objects, )
    best_prior_iou, best_prior_index = iou.max(dim=0)

    # iou and index of the best bounding box for each prior, shape (n_priors, )
    best_bbox_iou, best_bbox_index = iou.max(dim=1)

    # a prior claimed by a bounding box must keep that match even if its iou is below
    # the threshold, so fill in a value greater than 1
    best_bbox_iou.index_fill_(0, best_prior_index, 2)
    for i in range(len(best_prior_index)):
        best_bbox_index[best_prior_index[i]] = i

    # bounding box matched to each prior, shape (n_priors, 4)
    matched_bbox = bbox[best_bbox_index]

    # class of the object in each prior, shape (n_priors, ); +1 leaves index 0 for background
    conf = label[best_bbox_index]+1
    conf[best_bbox_iou < overlap_thresh] = 0

    # encode the prior positions
    loc = encode(prior, matched_bbox, variance)

    return loc, conf


def jaccard_overlap(prior: Tensor, bbox: Tensor):
    """ Compute the IoU between the priors and the ground-truth bounding boxes,
    all given as `(xmin, ymin, xmax, ymax)`

    Parameters
    ----------
    prior: Tensor of shape `(A, 4)`
        prior boxes

    bbox: Tensor of shape  `(B, 4)`
        ground-truth bounding boxes

    Returns
    -------
    iou: Tensor of shape `(A, B)`
        IoU matrix
    """
    A = prior.size(0)
    B = bbox.size(0)

    # broadcast xmax, ymax and xmin, ymin of both inputs to a common shape (A, B, 2),
    # then take the smaller xmax and ymax and the larger xmin and ymin:
    # W = min(xmax) - max(xmin), H = min(ymax) - max(ymin)
    xy_max = torch.min(prior[:, 2:].unsqueeze(1).expand(A, B, 2),
                       bbox[:, 2:].broadcast_to(A, B, 2))
    xy_min = torch.max(prior[:, :2].unsqueeze(1).expand(A, B, 2),
                       bbox[:, :2].broadcast_to(A, B, 2))

    # intersection area
    inter = (xy_max-xy_min).clamp(min=0)
    inter = inter[:, :, 0]*inter[:, :, 1]

    # area of each rectangle
    area_prior = ((prior[:, 2]-prior[:, 0]) *
                  (prior[:, 3]-prior[:, 1])).unsqueeze(1).expand(A, B)
    area_bbox = ((bbox[:, 2]-bbox[:, 0]) *
                 (bbox[:, 3]-bbox[:, 1])).broadcast_to(A, B)

    return inter/(area_prior+area_bbox-inter)

After matching, we have ground-truth offsets of shape (n_priors, 4) and class labels of shape (n_priors, ). Negative samples are not used when computing the localization loss.
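The matching rules are easier to follow on a small IoU matrix. The sketch below mirrors the two rules (threshold matching per prior, then a forced match for the best prior of each ground-truth box) using nested lists. `match_priors` is a hypothetical name, and background is marked with -1 here instead of the label shift used in `match()`.

```python
def match_priors(iou, overlap_thresh=0.5):
    """iou[i][j] is the IoU of prior i and ground-truth box j.

    Returns, for every prior, the index of its matched box or -1 for background.
    """
    n_priors, n_objects = len(iou), len(iou[0])
    matched = []
    for i in range(n_priors):
        # best ground-truth box for this prior, accepted only above the threshold
        j = max(range(n_objects), key=lambda col: iou[i][col])
        matched.append(j if iou[i][j] >= overlap_thresh else -1)
    for j in range(n_objects):
        # every ground-truth box keeps its best prior, even below the threshold
        best_prior = max(range(n_priors), key=lambda row: iou[row][j])
        matched[best_prior] = j
    return matched


iou_matrix = [[0.60, 0.10],
              [0.20, 0.30],
              [0.10, 0.05],
              [0.55, 0.20]]
assignment = match_priors(iou_matrix)
```

In this example prior 1 is matched to box 1 although their IoU is only 0.30, because it is the best prior that box has. Priors 0 and 3 pass the threshold for box 0, and prior 2 stays background.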

Loss function

The loss function consists of a confidence loss and a localization loss, and is defined as:

\[L(x,c,l,g)=\frac1N\left( L_{conf}(x,c)+\alpha L_{loc}(x,l,g) \right) \]

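Here \(x\in \{0,1 \}\), \(c\) is the predicted confidence, \(l\) is the predicted offset, \(g\) is the ground-truth offset, and \(N\) is the number of positive samples. If \(N\) is 0, the loss is set to 0 as well. In practice \(\alpha\) is set to 1.

Localization loss

The localization loss is computed as: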

\[L_{loc}(x,l,g)=\sum_{i\in Pos}^N \sum_{m\in \{ cx,cy,w,h\}}x_{ij}^k\ {\rm smooth_{L1}}\left( l_i^m-\hat{g}_j^m \right) \]

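This looks complicated, so let's break it down. Suppose an image has n_objects ground-truth boxes, SSD generates n_priors priors, and matching yields n_positives positive samples. \(x_{ij}^k\) indicates whether the class of the \(i\)-th positive sample is the same as that of the \(j\)-th ground-truth box: 1 if so, 0 otherwise. \(l_i\) is the predicted offset of the \(i\)-th positive sample, and \(\hat{g}^m_j\) is the ground-truth offset between the \(i\)-th positive sample and the \(j\)-th ground-truth box. In practice, we simply slice loc_t (output by match()) and loc_pred (predicted by the network) with the indexes of the positive samples and compute the loss with \(\rm{smooth_{L_1}}\).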
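The text does not spell out \(\rm{smooth_{L_1}}\), so here is the usual definition as a small sketch: quadratic for small errors and linear for large ones. The transition point `beta=1.0` is assumed here because it is the common default, and `smooth_l1` and `loc_loss` are hypothetical names rather than functions from the original code.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 of a single error value (assumed transition point beta)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def loc_loss(loc_pred, loc_true):
    """Sum of smooth L1 over all 4 offsets of all positive samples."""
    return sum(smooth_l1(p - t)
               for pred, true in zip(loc_pred, loc_true)
               for p, t in zip(pred, true))


# two positive samples, offsets in (cx, cy, w, h) order
pred = [(0.5, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0)]
true = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = loc_loss(pred, true)
```

The small error of 0.5 contributes 0.125 (quadratic branch) and the large error of 2.0 contributes 1.5 (linear branch), so outliers are penalized less harshly than with a squared loss.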

Confidence loss

The confidence loss is computed as follows, and is in fact just the cross-entropy loss:

\[L_{conf}(x, c) = -\sum_{i\in Pos}^{N} x_{ij}^{p} \log (\hat{c}_{i}^{p}) - \sum_{i\in N_{eg}} \log (\hat{c}_{i}^{0}),\quad \text{where}\ \hat{c}_{i}^{p} = \frac {\exp(c_{i}^{p})}{\sum_{ p} \exp(c_{i}^{p})} \]

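To balance positives and negatives, the paper samples the negatives with hard negative mining so that the ratio of positives to negatives is \(1:3\). The procedure is:

Given the number of positive samples \(N_p\) and the positive-to-negative ratio \(1:3\), the number of negatives is \(N_{eg}=3N_p\);
Compute the confidence loss of the negative samples
Sort the negatives by confidence loss and use the top \(N_{eg}\) negatives with the highest loss to compute \(L_{conf}\)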
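The three steps can be sketched for a single image with plain lists. `hard_negative_mining` is a hypothetical helper written for illustration. It ranks only the true negatives, which is a slight simplification of the tensor code in the loss class below (that code ranks all priors after zeroing the loss of the positives).

```python
def hard_negative_mining(losses, positive, neg_pos_ratio=3):
    """Pick the indexes of the negatives with the highest confidence loss.

    losses:   per-prior confidence loss
    positive: per-prior flag, True for positive samples
    """
    n_positive = sum(positive)
    n_negative = min(neg_pos_ratio * n_positive, len(losses) - 1)
    negatives = [i for i, is_pos in enumerate(positive) if not is_pos]
    negatives.sort(key=lambda i: losses[i], reverse=True)   # hardest first
    return sorted(negatives[:n_negative])


losses = [0.1, 2.0, 0.5, 0.05, 1.2, 0.3, 0.9, 0.7]
positive = [True, False, False, False, False, False, False, False]
hard_negatives = hard_negative_mining(losses, positive)
```

With one positive sample, the three negatives with the largest loss (indexes 1, 4 and 6) are selected and the rest are ignored in \(L_{conf}\).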

Computing the confidence loss of the negatives involves a small trick that requires some algebra:

\[\begin{align} -\log (\hat{c}_{i}^{0}) &= -\log \left(\frac {\exp(c_{i}^{0})}{\sum_{p} \exp(c_{i}^{p})}\right)\\ &=-\log \left(\exp(c_i^0) \right) + \log\left( \sum_{p} \exp(c_{i}^{p}) \right)\\ &=-c_i^0+ \log\left( \sum_{p} \exp(c_{i}^{p}) \right)\\ &=-c_i^0+ \log\left( \sum_{p} \exp(c_{i}^{p}-c) \right)+c\\ &\text{where}\quad c=\max_{p}\, c_i^p \end{align} \]
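Subtracting the maximum before exponentiating is what keeps this computation from overflowing. The sketch below shows the trick on a plain list of logits. The loss code imports a `log_sum_exp` from `utils.box_utils` whose body is not shown in this post, so the version here is only an assumed, list-based illustration of the same idea, not that function.

```python
from math import exp, log


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(v))) over one prior's class logits."""
    c = max(logits)
    return log(sum(exp(v - c) for v in logits)) + c


def background_loss(logits):
    """-log(softmax(logits)[0]), the confidence loss of a negative sample."""
    return log_sum_exp(logits) - logits[0]


stable = log_sum_exp([1000.0, 1000.0])   # exp(1000.0) alone would overflow
loss_uniform = background_loss([0.0, 0.0])
```

Because the largest shifted logit is exactly 0, every exponential lies in \((0, 1]\) and the sum is at least 1, so neither overflow nor \(\log 0\) can occur.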

Code

# coding:utf-8
from typing import Tuple, List

import torch
from torch import nn
from torch import Tensor
from torch.nn import functional as F

from utils.box_utils import match, log_sum_exp


class SSDLoss(nn.Module):
    """ Loss function """

    def __init__(self, n_classes: int, variance=(0.1, 0.2), overlap_thresh=0.5, neg_pos_ratio=3, use_gpu=True, **kwargs):
        """
        Parameters
        ----------
        n_classes: int
            number of classes, including background

        variance: Tuple[float, float]
            variances of the prior boxes

        overlap_thresh: float
            IoU threshold, 0.5 by default

        neg_pos_ratio: int
            ratio of negative to positive samples, 3:1 by default

        use_gpu: bool
            whether to use the gpu
        """
        super().__init__()

        if len(variance) != 2:
            raise ValueError("variance must have exactly 2 elements")

        self.use_gpu = use_gpu
        self.variance = variance
        self.n_classes = n_classes
        self.neg_pos_ratio = neg_pos_ratio
        self.overlap_thresh = overlap_thresh

    def forward(self, pred: Tuple[Tensor, Tensor, Tensor], target: List[Tensor]):
        """ Compute the loss

        Parameters
        ----------
        pred: Tuple[Tensor]
            predictions of the SSD network

        target: list of shape `(N, )`
            list of labels, each of shape `(n_objects, 5)`, holding the bounding box
            positions and classes; an image may contain more than one object
        """
        loc_pred, conf_pred, prior = pred
        N = conf_pred.size(0)
        n_priors = prior.size(0)

        # match the priors to the ground-truth boxes, loc_t holds the encoded offsets
        loc_t = torch.Tensor(N, n_priors, 4)
        conf_t = torch.Tensor(N, n_priors)
        prior = prior.detach()
        for i in range(N):
            bbox = target[i][:, :-1].detach()
            label = target[i][:, -1].detach()
            loc_t[i], conf_t[i] = match(
                self.overlap_thresh, prior, bbox, self.variance, label)

        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()

        # positive mask; the index has shape (N, n_priors, 4) and selects all
        # positive samples into a one-dimensional vector
        positive = conf_t > 0   # Shape: (N, n_priors)
        positive_index = positive.unsqueeze(-1).expand_as(loc_pred)

        # localization loss
        loc_positive = loc_pred[positive_index].view(-1, 4)
        loc_t = loc_t[positive_index].view(-1, 4)
        loc_loss = F.smooth_l1_loss(loc_positive, loc_t, reduction='sum')

        # hard negative mining, conf_logP has shape (N*n_priors, 1)
        batch_conf_pred = conf_pred.view(-1, self.n_classes)
        conf_logP = log_sum_exp(
            batch_conf_pred)-batch_conf_pred.gather(1, conf_t.type(torch.int64).view(-1, 1))

        # exclude the positives and rank the negatives by loss, conf_logP has shape (N, n_priors)
        conf_logP = conf_logP.view(N, -1)
        conf_logP[positive] = 0
        _, loss_index = conf_logP.sort(dim=1, descending=True)
        _, loss_rank = loss_index.sort(dim=1)

        # pick N_eg negatives according to the negative:positive ratio
        n_positive = positive.long().sum(dim=1, keepdim=True)
        n_negative = torch.clamp(
            self.neg_pos_ratio*n_positive, max=positive.size(1)-1)
        # negative mask, shape: (N, n_priors)
        negative = loss_rank < n_negative.expand_as(loss_rank)

        # confidence loss
        index = (positive+negative).unsqueeze(2).expand_as(conf_pred).gt(0)
        conf_pred = conf_pred[index].view(-1, self.n_classes)
        conf_t = conf_t[(positive+negative) > 0]
        conf_loss = F.cross_entropy(conf_pred, conf_t.type(torch.int64), reduction='sum')

        # divide the losses by the number of positive samples
        n_positive = n_positive.detach().sum()
        loc_loss /= n_positive
        conf_loss /= n_positive

        return loc_loss, conf_loss

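Data augmentation

To make the model more robust, we can augment the original data, for example by changing the image size, hue, contrast, color channel order and so on. The paper also applies a random crop, which may return one of the following three kinds of image:

Do nothing and return the original image
Randomly crop a region such that the minimum IoU between the cropped region and the bounding boxes in the original image is greater than some threshold; the paper uses 0.1, 0.3, 0.5, 0.7 and 0.9
Randomly crop a region with no IoU requirement on the crop

The aspect ratio of a randomly cropped region should lie between 0.5 and 2. After cropping, a bounding box inside the crop may cover only a small part of the original box, so the paper keeps only the boxes whose center falls inside the cropped region (what is actually kept is the overlap between the box and the crop). The random crop is implemented as follows: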

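# assumed imports: `random.uniform(high=...)` in the crop code relies on numpy.random;
# `randchoice` and `jaccard_overlap_numpy` are project helpers that are not shown here
import numpy as np
from numpy import ndarray, random


class Transformer:
    """ Abstract base class for image augmentation """

    def transform(self, image: ndarray, bbox: ndarray, label: ndarray):
        """ Augment the input image

        Parameters
        ----------
        image: `~np.ndarray` of shape `(H, W, 3)`
            image in RGB or HSV mode, RGB unless stated otherwise

        bbox: `~np.ndarray` of shape `(n_objects, 4)`
            bounding boxes

        label: `~np.ndarray` of shape `(n_objects, )`
            class labels

        Returns
        -------
        image, bbox, label:
            augmented data
        """
        raise NotImplementedError("the augmentation method must be overridden")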

class RandomSampleCrop(Transformer):
    """ Random crop """

    def __init__(self):
        super().__init__()
        self.sample_options = [
            # return the original image
            None,
            # random crop with an IoU threshold between the crop and the bounding boxes
            (0.1, None),
            (0.3, None),
            (0.5, None),
            (0.7, None),
            (0.9, None),
            # random crop without an IoU requirement
            (None, None),
        ]

    def transform(self, image: ndarray, bbox: ndarray, label: ndarray):
        h, w, _ = image.shape

        while True:
            mode = randchoice(self.sample_options)

            # return the original image
            if mode is None:
                return image, bbox, label

            min_iou, max_iou = mode
            if min_iou is None:
                min_iou = float('-inf')
            if max_iou is None:
                max_iou = float('inf')

            # try at most 50 times to avoid an endless loop
            for _ in range(50):

                # randomly pick the width and height of the sampled region
                ww = random.uniform(0.3*w, w)
                hh = random.uniform(0.3*h, h)

                # require an aspect ratio between 0.5 and 2
                if not 0.5 <= hh/ww <= 2:
                    continue

                # the four coordinates of the patch
                left = random.uniform(high=w-ww)
                top = random.uniform(high=h-hh)
                # np.int was removed from recent NumPy releases, the builtin int is equivalent
                rect = np.array([left, top, left+ww, top+hh], dtype=int)

                # discard the patch if the IoU constraint is not met
                iou = jaccard_overlap_numpy(rect, bbox)
                if iou.min() > max_iou or iou.max() < min_iou:
                    continue

                # cut out the patch
                patch = image[rect[1]:rect[3], rect[0]:rect[2]]

                # check whether the box centers fall inside the patch
                centers = (bbox[:, :2]+bbox[:, 2:])/2
                m1 = (centers[:, 0] > rect[0]) & (centers[:, 1] > rect[1])
                m2 = (centers[:, 0] < rect[2]) & (centers[:, 1] < rect[3])
                mask = m1 & m2

                # discard the patch if no box center falls inside it
                if not mask.any():
                    continue

                # boxes whose center lies inside the patch, and their labels
                bbox_ = bbox[mask].copy()
                label_ = label[mask]

                # shift the boxes so that the top-left corner of the patch becomes the origin
                bbox_[:, :2] = np.clip(bbox_[:, :2]-rect[:2], 0, np.inf)
                bbox_[:, 2:] = np.clip(
                    bbox_[:, 2:]-rect[:2], 0, rect[2:]-rect[:2]-1)

                return patch, bbox_, label_
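The box bookkeeping at the end of `transform` (keep a box only if its center lies in the patch, then clip it to the patch and shift the origin) can be illustrated on plain tuples. `boxes_in_patch` is a hypothetical helper written for this example. Unlike the code above, it clips to the patch border itself rather than to the patch size minus 1.

```python
def boxes_in_patch(bboxes, rect):
    """Keep boxes whose center is inside rect and express them in patch coordinates.

    bboxes: list of (xmin, ymin, xmax, ymax)
    rect:   (left, top, right, bottom) of the cropped patch
    """
    left, top, right, bottom = rect
    kept = []
    for xmin, ymin, xmax, ymax in bboxes:
        cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
        if not (left < cx < right and top < cy < bottom):
            continue   # the center falls outside the patch
        # clip to the patch, then move the origin to its top-left corner
        kept.append((max(xmin, left) - left, max(ymin, top) - top,
                     min(xmax, right) - left, min(ymax, bottom) - top))
    return kept


inside = boxes_in_patch([(10, 10, 50, 50), (80, 80, 120, 120)], (0, 0, 60, 60))
clipped = boxes_in_patch([(40, 40, 100, 70)], (0, 0, 80, 60))
```

In the second call the box center (70, 55) is inside the patch, so the box is kept, but only its overlap with the patch survives.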

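For more image augmentation operations, see 【SSD演算法】史上最全程式碼解析-資料篇. There are simply too many to fit here orz.

Afterword

That about covers how SSD works. As for evaluating the model, please refer to other blog posts; I will not go into it here (~~definitely not because I have run out of steam~~). The code is available on GitHub. That's all~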