
Multi-Object Tracking Survey 2021

Background

MOT Categories

Track-by-detection

  • Detection

  • ReID (Data Association)

    • ReID should be sufficient
  • Bounding Boxes

    • Features of ROI
  • Siamese / Re-ID / IoU (Appearance Affinity)

  • Motion

    • the projection of non-linear 3D motion into the 2D image domain still poses a challenging problem for many models.

An important assumption here: the tracking result is always a subset of the detection result.

Two-Stage Milestones: SORT and DeepSORT

Detector alone = +18.9% (in SORT, simply upgrading the detector improves tracking performance by up to 18.9%)

SORT

  • Faster-RCNN
  • Kalman Filter
    • \(\mathbf{o}_i = [x,y,s,r,\dot{x},\dot{y},\dot{s}]\)
      • Position and area are modeled as changing at constant velocity (constant rate of change), while the aspect ratio r is kept constant (a minimal Kalman sketch follows this list)
    • linear constant velocity model / independent of other objects and camera motion.
  • Tracklet Init and Deletion
    • Accept all detections (> minHeight & minConfidence)
    • Immediate deletion of tracks that go unmatched for a single frame
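
A minimal sketch (not the original SORT code) of the linear constant-velocity Kalman model above; the matrix layout and the helper functions `predict`/`update` are my own naming, and the noise covariances Q and R are left to the caller.

```python
import numpy as np

# State: [x, y, s, r, dx, dy, ds] -- center, area, aspect ratio and their rates.
# Constant-velocity transition: x/y/s change by their velocities; r stays fixed.
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0  # x += dx, y += dy, s += ds (per frame)

# Only [x, y, s, r] are observed by the detector.
H = np.zeros((4, 7))
H[0, 0] = H[1, 1] = H[2, 2] = H[3, 3] = 1.0

def predict(x, P, Q):
    """One Kalman predict step under the linear constant-velocity model."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z, R):
    """Standard Kalman update with a detection z = [x, y, s, r]."""
    y = z - H @ x                     # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ y
    P = (np.eye(7) - K @ H) @ P
    return x, P
```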

DeepSORT

  • Faster-RCNN (detector)
  • Association cost (see the sketch after this list)
    • \[c = \lambda D_1(\text{track}_i,\text{detection}_j) + (1-\lambda)D_2(\text{track}_i,\text{detection}_j) \]
    • Motion
      • \(D_1\): Mahalanobis distance
    • Appearance
      • \(D_2\): cosine similarity of deep features
      • ReID pretrained model
  • Tracklet Init and Deletion
    • Tentative for the first 3 frames (> minHeight & minConfidence)
    • Tentative tracks that are not confirmed within those 3 frames are deleted
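
A hedged sketch of the combined association cost \(c = \lambda D_1 + (1-\lambda)D_2\) above; `mean`, `cov_inv`, and `track_feats` are hypothetical per-track state, not DeepSORT's actual variable names.

```python
import numpy as np

def mahalanobis_sq(det_xyah, mean, cov_inv):
    """D1: squared Mahalanobis distance between a detection and the
    track's Kalman-predicted distribution (mean, inverse covariance)."""
    d = det_xyah - mean
    return float(d @ cov_inv @ d)

def cosine_distance(det_feat, track_feats):
    """D2: smallest cosine distance between the detection's ReID feature
    and the gallery of features stored for the track."""
    det = det_feat / np.linalg.norm(det_feat)
    gal = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    return float(1.0 - (gal @ det).max())

def combined_cost(d1, d2, lam=0.5):
    """c = lambda * D1 + (1 - lambda) * D2 (weighted sum, lambda is a knob)."""
    return lam * d1 + (1.0 - lam) * d2
```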

Joint Detection and Tracking (Detection-by-tracking)

『IDEA [Detection (which is itself multi-object) is an upper bound for multi-object tracking]

The core is joint learning.

TODO 『2021-12-21 [CenterTrack]

Eval

Specifically, the mapping between ground truth and hypotheses is established as follows: if the ground truth object \(o_i\) and the hypothesis \(h_j\) are matched in frame \(t-1\), and in frame \(t\) \(\operatorname{IoU}(o_i, h_j) \geq 0.5\), then \(o_i\) and \(h_j\) are matched in that frame, even if there exists another hypothesis \(h_k\) such that \(\operatorname{IoU}(o_i, h_j) < \operatorname{IoU}(o_i, h_k)\), considering the continuity constraint. After the matching from previous frames has been performed, the remaining objects are tried to be matched with the remaining hypotheses, still using a 0.5 IoU threshold. The ground truth bounding boxes that cannot be associated with a hypothesis are counted as false negatives (FN), and the hypotheses that cannot be associated with a real bounding box are marked as false positives (FP). Also, every time a ground truth object tracking is interrupted and later resumed it is counted as a fragmentation, while every time a tracked ground truth object ID is incorrectly changed during the tracking duration it is counted as an ID switch. Then, the simple metrics computed are the following:

Detection

  • [ ] DETR

Transformer

https://jalammar.github.io/illustrated-transformer/

  • RNN
    • Maintain the hidden state
  • Transformer
    • Global self-attention (see the sketch below)
    • Contextualized Embedding
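
A minimal single-head scaled dot-product self-attention sketch (no masking, no multi-head split), just to make "global self-attention / contextualized embedding" concrete.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d_model). Each output token is a weighted sum of all
    value vectors, so every position attends to every other (global)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # contextualized embeddings
```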

SiamMOT = DeepSORT + Motion Model

Motivation & Background

  • Traditional MOT
    • Bipartite graph matching
      • Nodes maintain appearance + motion features
      • A global optimization problem
      • No explicit frame-to-frame association
      • improving local linking over consecutive frames rather than building an offline graph to re-identify instances across large temporal gaps.

SORT & DeepSORT

Drawback of Siamese networks: they cannot distinguish between different objects.

TODO 『2021-12-21 [Faster RCNN]

TODO 『2021-12-21 [CenterTrack] SKIPPED

Contribution

  • SiamMOT: Faster RCNN + Motion Model
    • consecutive frames & local linking
    • Based on SORT

Pipeline

Arch

Given \(\mathbf{F}^{t},\mathbf{F}^{t+1};\ \mathbf{R}_{i}^{t}\ (\text{region bounding box})\)

Faster-RCNN

  • \(R_{i}^{t}\) is the tracking result at time \(t\)
  • \(\mathbf{D}^{t+1} = \text{Detector}(\mathbf{F}^{t+1})\)

Region-based Siamese Tracker

  • \(\mathbf{f}_{R_i}^{t} = \operatorname{ROIAlign}(R_{i}^{t},\text{dim})\)

    • feature of region (object) i in frame t
  • \(\mathbf{f}_{S_i}^{t+1} = \operatorname{ROIAlign}(S_{i}^{t+1},\text{dim})\)

    • feature of the search area for object i in frame t+1
    • \(S_{i}^{t+1} = \operatorname{Expand}(R_{i}^{t},\text{factor})\)
    • (assuming an object does not move drastically between two consecutive frames)
  • \(\mathbf{R}^{t+1} = \text{SpatialMatching}(\mathbf{Mot}^{t+1},\mathbf{Det}^{t+1} )\)

    • \(\mathbf{Mot}^{t+1},\mathbf{Det}^{t+1}\) denote the motion model's prediction and the detector's output, respectively
    • both are obtained within the prior-expanded search region
    • \(\mathbf{Mot}^{t+1}\) corresponds to the cross-correlation regression
    • \(\mathbf{Det}^{t+1}\) corresponds to the detection results
\[\mathbf{Mot}_{i}^{t+1} = \operatorname{SiameseTracker}(\mathbf{f}_{R_i}^{t},\mathbf{f}_{S_i}^{t+1}) \]
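
A rough sketch of this step, assuming torchvision's `roi_align` and a placeholder `siamese_tracker` module; the expansion factor and output size are illustrative, not the paper's exact values.

```python
import torch
from torchvision.ops import roi_align

def expand(box, factor=2.0):
    """Expand R_i^t into the search region S_i^{t+1} around the same center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def track_one(feat_t, feat_t1, box_t, siamese_tracker, out_size=15):
    """f_{R_i}^t and f_{S_i}^{t+1} via ROIAlign, then a (placeholder)
    Siamese tracker predicts Mot_i^{t+1} inside the search region."""
    search = expand(box_t)
    # roi_align takes a list of (K, 4) box tensors, one per image in the batch
    f_r = roi_align(feat_t,  [box_t.unsqueeze(0)],  output_size=out_size)
    f_s = roi_align(feat_t1, [search.unsqueeze(0)], output_size=out_size)
    return siamese_tracker(f_r, f_s)   # -> Mot_i^{t+1}
```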

Core: the Siamese tracker

  • Implicit MM
\[\text{[Confidence; Offset]} = \operatorname{MLP}(\mathbf{f}_{R_i}^{t},\mathbf{f}_{S_i}^{t+1}) \]
\[\mathbf{Mot}_{i}^{t+1} = \operatorname{Modify}(R_{i}^{t},\text{Offset}) \]

『IDEA [Why use a search region]

  • [ ] I have not fully figured this out

❓ It feels like this part may be written incorrectly

Because it is not necessarily a trajectory, there is no motion correlation to rely on

  • Explicit MM
    • channel-wise cross-correlation
    • response map

Cross-correlation is implemented as a convolution operation: this actually exploits the extensibility (flexibility) of convolution.

This includes a confidence map

\[\mathbf{v}_i(x,y) = \text{Confidence}_i(x,y) \]

as well as a per-pixel regression box

\[\mathbf{p}_i(x,y) = [l; t; r; b] \]

Both v and p are per-pixel quantities defined on the 16×16 feature map

Finally, the optimal location is selected:

\[(x,y) = \mathop{\arg\max}\limits_{x,y}\big(\mathbf{v}_i(x,y)\odot\boldsymbol{\eta}_i(x,y)\big) \]

The penalty map is introduced to discourage dramatic movements
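
A sketch of how the best position could be picked from the confidence map with a penalty that discourages large movements; the Gaussian penalty form is my assumption, the source only states that a penalty map is used.

```python
import numpy as np

def select_best(conf, boxes, sigma=4.0):
    """conf: (H, W) confidence map v_i; boxes: (H, W, 4) per-pixel [l, t, r, b].
    A center-peaked Gaussian penalty discourages dramatic movements
    (assumed form -- the exact penalty map is not specified here)."""
    H, W = conf.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    penalty = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    score = conf * penalty
    y, x = np.unravel_index(np.argmax(score), score.shape)
    return (x, y), boxes[y, x]        # best location and its regressed box
```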

『IDEA [I have always felt that ROI-based regression is insufficient, because the convolution's perception of object changes does not exist at inference time; the parameters are already frozen]

『IDEA [Bounding-box regression learns a generic set of parameters that regress offsets (adjustments) from features. In principle this seems limited, yet it works very well]

❓ This is called a motion model here, but I don't think it really is one: no motion model is actually built, and in essence it is the same as SiamFC.

First it uses the channel independent correlation operation to allow the network to explicitly learn a matching function between the same instance in sequential frames. Second, it enables a mechanism for finer-grained pixel-level supervision which is important to reduce the cases of falsely matching to distractors.

So why does it work so well?

Training

\[Loss = l_{rpn} + l_{detect} + l_{motion} \]

\(l_{motion}\) is the loss function mentioned earlier

Short Occlusion

Inference

Spatial Matching

Solver

Some matching rules


CorrTrack/TLD = FairMOT+Corr.

Overall, this paper is not written very clearly.

There is no code release and no clear architecture diagram.

Motivation & Background

  • The structurally local receptive property of convolutional networks
    • Instances with the same semantics look highly similar, causing strong interference (distractors)
  • Cannot effectively capture long-range spatial and temporal dependencies
  • Task characteristics of MOT: handling multi-scale features / uncertain object sizes (the same as in detection), so the multi-scale ideas from detection apply

Similarity

\[\mathbf{A}_{i j}=\operatorname{dist}\left(\mathbf{f}\left(\mathbf{d}_{t}^{i}\right), \hat{\mathbf{f}}\left(\mathbf{T}_{t-1}^{j}\right)\right)+\alpha \operatorname{IoU}\left(\mathbf{d}_{t}^{i}, \hat{\mathbf{d}}_{t}^{j}\right) \]
  • Semantic similarity is computed from the previous frame / previous ROI
    • The spatial correlation is limited by the detector
    • Many similar surrounding instances (especially in pedestrian scenes)
    • Leading to ID switches
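
A small sketch of the affinity matrix \(\mathbf{A}_{ij}\) defined above, with cosine similarity standing in for the unspecified \(\operatorname{dist}(\cdot)\) term (an assumption) and a plain IoU term weighted by \(\alpha\).

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def affinity_matrix(det_feats, det_boxes, trk_feats, trk_boxes, alpha=0.5):
    """A[i, j] = dist(f(d_i), f_hat(T_j)) + alpha * IoU(d_i, d_hat_j),
    with cosine similarity standing in for dist()."""
    D = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    T = trk_feats / np.linalg.norm(trk_feats, axis=1, keepdims=True)
    A = D @ T.T                                    # (num_det, num_trk)
    for i, db in enumerate(det_boxes):
        for j, tb in enumerate(trk_boxes):
            A[i, j] += alpha * iou(db, tb)
    return A
```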

the correlation information between the cropped image patches is lost directly, and the adjacency spatial relationship is only retained in coordinates

Need to distinguish distractors

❓ But is this necessary? Neighboring objects do not move that drastically.

The authors may also mean what I would call a "touching switch" (contact drift, my own term).

  • Based on FairMOT

FairMOT

Pipeline

Arch

  • Generic feature extraction
  • Jointly learn correlations from spatio-temporal dependencies while making detection predictions
  • Associate detections with the closest tracks

Spatial Local Correlation Layers

The goal is to obtain context correlation features, i.e. spatial features fused with contextual information

Spatial Correlation

Reference

This improvement is actually quite intuitive

It merely sets a local receptive field by hand

In effect it performs a (local) self-attention (see the sketch after the list below)

  • Attention weights \(\alpha(x,y) = \text{NeighborsCorrAt}(x,y)\)
  • \[\text{Correlated Feature} = \mathbf{F}_{C}^{l} = \sum_{x,y} \alpha(x,y) \mathbf{F}_{t}^{l}(x,y) \]
    • The superscript \(l\) denotes the level in the feature pyramid
    • At this point \(l\) is not actually used yet
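
A toy sketch of such a hand-crafted local self-attention: each pixel is correlated with its \((2r+1)\times(2r+1)\) neighbourhood, the correlations are softmax-normalised into weights \(\alpha\), and neighbour features are re-aggregated. The function name and the softmax normalisation are my assumptions.

```python
import torch
import torch.nn.functional as F

def local_correlation_attention(feat, radius=3):
    """feat: (1, C, H, W). For each pixel, correlate it with its local
    neighbourhood, softmax the correlations, and aggregate neighbour
    features -- a hand-crafted local self-attention."""
    _, C, H, W = feat.shape
    k = 2 * radius + 1
    # Neighbour features: (1, C*k*k, H*W) -> (H*W, k*k, C)
    neigh = F.unfold(feat, kernel_size=k, padding=radius)
    neigh = neigh.view(C, k * k, H * W).permute(2, 1, 0)
    center = feat.view(C, H * W).permute(1, 0).unsqueeze(2)     # (H*W, C, 1)
    corr = torch.bmm(neigh, center).squeeze(2)                  # (H*W, k*k)
    alpha = torch.softmax(corr / C ** 0.5, dim=1)               # attention weights
    out = torch.bmm(alpha.unsqueeze(1), neigh).squeeze(1)       # (H*W, C)
    return out.permute(1, 0).reshape(1, C, H, W)
```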

Spatial Corr. on FPN

Feature pyramid

Dilated convolution (Dilated Conv.)

\[[0,R\times D \times 2^l] \]

What does this mean?

Temporal Correlation

I don't quite understand this part

This is the same formula as before, exactly identical

colorization as a proxy task

The loss function is not even mentioned here

Actually, our method intensively perform siamese tracking operations \(M\times N\) to increases the discrimination.

Self-Supervised Feature Learning

TransTrack

Motivation

The Query-Key Mechanism Is Promising

For the same object, its feature in different frames is highly similar, which enables the query-key mechanism to output ordered object sets. This inspiration should also be beneficial to the MOT task.

How to transfer q-k from SOT to MOT

The authors argue that the most serious problem is new-coming objects, for which no corresponding query exists; in SOT, by contrast, the target is guaranteed to remain in the frame.

A desirable solution should be able to well capture new coming objects and propagate previously detected objects to the following frames at the same time

  • New-coming Objects

Traditional Way

  • [ ] Traditional target-initialization methods

Pipeline

Arch

The Q-K here does not refer to the Q/K inside attention

  • one Encoder -> key
    • Input: extracted features(2-frames)
  • 2 parallel decoders -> query?
    • Object Detection(DETR)
    • Track
      • Appearance & Location Information
  • Box Association
  • [ ] What is the iterative process?

  • set prediction

    • 2 sets
    • (DETR) object query for detection
      • (NO NMS)
    • features(of objects on track) as track query
      • provide consistent object information to maintain tracklets.

  • simple IoU matching to generate the final ordered object set from them.

  • previous frames -> data association

Input and Output

Training

Inference

Target initialization

TransTrack first detects objects on the first frame, where the feature maps are from two copies of the first frame.

occlusions and short-term disappearing

Specifically, if a tracking box is unmatched, it keeps as an “inactive” tracking box until it remains unmatched for K consecutive frames. Inactive tracking boxes can be matched to detection boxes and regain their ID.
we choose K = 32.
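
A sketch of this "inactive for up to K frames" bookkeeping; `Track` is a hypothetical container and the matching step itself (IoU matching between track boxes and detection boxes) is abstracted into a `matches` dictionary.

```python
from dataclasses import dataclass

K = 32  # an unmatched track stays "inactive" for up to K consecutive frames

@dataclass
class Track:                     # hypothetical container, not TransTrack's code
    track_id: int
    box: tuple
    missed: int = 0              # consecutive unmatched frames

def update_tracks(tracks, matches, detections):
    """matches: {track_index: detection_index} from IoU matching.
    Matched tracks are refreshed (an inactive track regains its ID);
    unmatched tracks age and are dropped after K consecutive misses."""
    kept = []
    for i, trk in enumerate(tracks):
        if i in matches:
            trk.box = detections[matches[i]]
            trk.missed = 0
            kept.append(trk)
        else:
            trk.missed += 1
            if trk.missed < K:   # keep as inactive, can still be re-matched
                kept.append(trk)
    return kept
```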


Why use a Transformer / the dominant reason

  • Decent framework

  • No Prior

  • [ ] Advantages of Transformers in vision tasks


Summary

JDT paradigm

What's Different from TD

  • TD = Detection + Association

  • Joint Learning/Task-Driven


Questions

  • The rectangular-box estimate may be inaccurate
    • SiamMask
    • Segmentation is the upper bound of Detection
    • May hurt accuracy
  • ID Switch


TODO

  • [ ] What does it take to say a paper has really been understood
    • [ ] Being able to write its pseudocode
    • [ ] Being able to state the authors' motivation

Appendix

Kalman Filter

SORT

DeepSORT

  • \(D_1\)
    • Mahalanobis distance computed under the Kalman filter's predicted distribution
    • \(M(\text{track}_i,\text{det}_j;S_{\text{Kalman Filter}})\)
  • \(D_2\)

    • Cosine similarity of deep ReID features
    • Matching on appearance features alone is already enough to track.
  • Cascade matching

If a track is occluded for a long time, the Kalman filter's repeated predictions make its state distribution spread out (the covariance grows). Now suppose two tracks compete for the same detection: the one that has been occluded longer often ends up with a smaller Mahalanobis distance (it appears "closer"), because the covariance matrix used in the Mahalanobis distance has become larger. The detection therefore tends to be assigned to the track that has been lost for longer, while intuitively it should be assigned to the most recently seen track. DeepSORT therefore introduces a matching cascade that gives higher matching priority to tracks that have been updated more recently (see the sketch below).
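
A minimal sketch of that matching cascade: tracks are grouped by time since their last update and matched in increasing-age order, so recently seen tracks get priority. The `match(...)` callback (e.g. Hungarian assignment on the combined cost) and the `time_since_update` attribute are placeholders.

```python
def matching_cascade(tracks, detections, match, max_age=30):
    """Match detections to tracks in order of track age (frames since the
    last update), giving recently updated tracks higher priority."""
    unmatched_dets = list(range(len(detections)))
    matches = {}
    for age in range(1, max_age + 1):
        if not unmatched_dets:
            break
        candidates = [i for i, t in enumerate(tracks) if t.time_since_update == age]
        if not candidates:
            continue
        # match() returns {track_index: detection_index} for this age group
        new_matches = match(candidates, unmatched_dets, tracks, detections)
        matches.update(new_matches)
        unmatched_dets = [d for d in unmatched_dets if d not in new_matches.values()]
    return matches, unmatched_dets
```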

  • Handling occlusion

IoU matching

unconfirmed and unmatched tracks of age n = 1. This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization

Faster-RCNN

RPN + Fast R-CNN

  • RPN
    • Predefined Anchors
      • ROI Pooling
  • 2 MLP Heads
    • Classification
    • Coord. Regression: refine the candidate boxes

Feature Pyramid

Problems to solve

  • Multi-scale objects
    • Fine-grained information is lost
    • Small objects cannot be recovered
  • Image pyramid
  • A single high-level feature map
  • Directly taking features from different layers
  • Feature pyramid
    • Downsampling
    • Upsampling
      • Nearest-neighbor interpolation
    • 1×1 convolution, lateral connection + merge (see the sketch below)
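
A compact sketch of the FPN top-down pathway described above: 1×1 lateral convolutions, nearest-neighbour upsampling, element-wise merging, and a 3×3 smoothing convolution; the channel counts are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Builds P-levels from backbone features [C3, C4, C5] (coarsest last)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):               # feats: [C3, C4, C5]
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down: upsample the coarser map (nearest neighbour) and add it.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]   # [P3, P4, P5]
```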

Dilated Conv.

  • Enlarges the receptive field without the information loss caused by pooling (example below)
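
A one-line comparison in PyTorch: with dilation=2 a 3×3 kernel covers an effective 5×5 extent while keeping the same output resolution (no pooling).

```python
import torch.nn as nn

conv_plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # 3x3 receptive field
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # effective 5x5 extent, same output size
```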

Non-local Module

The original design intent of the convolution kernel

  • Capture local fine structure / pattern matching
    • Can be viewed as local filtering

Goal: long-range dependencies

Ways to enlarge the receptive field

  • Stacking layers
    • A lot of information is lost during the sampling (downsampling) process
  • Fully connected / Attention

Multi Scale Perception

  • ROI Pooling/ROI Align

    • Essentially pooling methods, i.e. lossy sampling
  • FPN

    • Feature pyramid
    • Fuses features at different scales reasonably well

FairMOT

  • Object detection: treated as center-based bounding-box regression on a high-resolution feature map
    • = anchor-free CenterNet-style head: class heatmap + box regression (rather than Faster R-CNN's anchor-based head)
  • Network architecture adapted to the ReID task
    • Similar to a feature pyramid
    • Multi-scale fusion

The core lies in the design of the loss functions

  • Data association
    • Based on DeepSORT
    • It essentially just replaces the ReID features
    • Making them task-driven