Multi-Object Tracking Survey 2021
- TransTrack(Init One Stage)/(TrackFormer)
- TransCenter(Trans + Regression)
- SiamMOT
- CorrTrack(Detection)
- CenterTrack (joint detection and tracking)
MPNTracker
Background
MOT Categories
Track-by-detection
-
Detection
-
ReID(Data Association)
ReID
should be sufficient
-
Bounding Boxes
- Features of ROI
-
Siamese/Re-ID/IOU(Appearance Affinity)
-
Motion
-
the projection of non-linear 3D motion into the 2D image domain still poses a challenging problem for many models.?
-
An important assumption here: the tracking results are always a subset of the detection results
Two-Stage
Milestones: SORT
and DeepSORT
Detector = +18.9%
- Faster-RCNN
- Kalman Filter
- \(\mathbf{o}_i = [\texttt{r},x,y,s,\dot{x},\dot{y},\dot{s}]\)
- Position and area are treated as changing at a uniform rate (their rate of change is constant)
- linear constant velocity model / independent of other objects and camera motion.
- Tracklet Init and Deletion
- Accept all detections (> minHeight & minConfidence)
- Immediate Deleting
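The constant-velocity assumption above can be sketched as a single prediction step. This is only a minimal illustration: it propagates the state mean, omits the covariance update and the measurement update of a full Kalman filter, and the function name `predict_constant_velocity` and the time step `dt` are hypothetical. The state ordering follows the one written in these notes.

```python
def predict_constant_velocity(state, dt=1.0):
    """One prediction step of a linear constant-velocity model.

    state uses the ordering from the notes: [r, x, y, s, vx, vy, vs].
    The aspect ratio r is assumed constant; x, y and s move at their
    current rates, and the rates themselves stay unchanged.
    """
    r, x, y, s, vx, vy, vs = state
    return [r, x + vx * dt, y + vy * dt, s + vs * dt, vx, vy, vs]


# Hypothetical track: centre (100, 50), area 2000, moving right and up.
state = [0.5, 100.0, 50.0, 2000.0, 2.0, -1.0, 10.0]
predicted = predict_constant_velocity(state)
```

Each object is predicted independently, which matches the note that the model ignores other objects and camera motion.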
DeepSORT
ref
- Faster-RCNN
- \[c = \lambda D_1(\text{track}_i,\text{detection}_j) + (1-\lambda)D_2(\text{track}_i,\text{detection}_j) \]
- Motion
- \(D_1\): Mahalanobis distance
- Appearance
- \(D_2\): cosine distance between deep appearance features
- ReID Pretrained Model
- Tracklet Init and Deletion
- Tentative = 3 frames (> minHeight & minConfidence)
- Delete after 3 frames
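A minimal sketch of the combined cost \(c = \lambda D_1 + (1-\lambda) D_2\). Two simplifying assumptions are made here and are not taken from the notes: the Kalman covariance is treated as diagonal (so the Mahalanobis distance reduces to a variance-weighted sum), and the squared distance is used. All function names are hypothetical.

```python
import math


def mahalanobis_sq_diag(z, mean, var):
    """Squared Mahalanobis distance, assuming a DIAGONAL covariance (simplification)."""
    return sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mean, var))


def cosine_distance(a, b):
    """1 - cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def combined_cost(d1, d2, lam):
    """c = lam * D1 + (1 - lam) * D2, as in the formula above."""
    return lam * d1 + (1.0 - lam) * d2


# Hypothetical track/detection pair.
d1 = mahalanobis_sq_diag([2.0, 0.0], [0.0, 0.0], [4.0, 1.0])
d2 = cosine_distance([1.0, 0.0], [1.0, 0.0])
cost = combined_cost(d1, d2, lam=0.5)
```

Setting `lam` to 0 makes the association rely on appearance alone, and setting it to 1 makes it rely on motion alone.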
Joint Detection and Tracking (Detection-by-tracking)
『IDEA [Detection (which is itself multi-object) is an upper bound for multi-object tracking]』
The core is Joint Learning
TODO 『2021-12-21 [CenterTrack]』
Eval
Specifically, the mapping between ground truth and hypotheses is established as follows: if the ground truth object \(o_i\) and the hypothesis \(h_j\) are matched in frame \(t-1\), and in frame \(t\) the \(\operatorname{IoU}(o_i, h_j) \geq 0.5\), then \(o_i\) and \(h_j\) are matched in that frame, even if there exists another hypothesis \(h_k\) such that \(\operatorname{IoU}(o_i, h_j) < \operatorname{IoU}(o_i, h_k)\), considering the continuity constraint. After the matching from previous frames has been performed, the remaining objects are tried to be matched with the remaining hypotheses, still using a 0.5 IoU threshold. The ground truth bounding boxes that cannot be associated with a hypothesis are counted as false negatives (FN), and the hypotheses that cannot be associated with a real bounding box are marked as false positives (FP). Also, every time a ground truth object tracking is interrupted and later resumed is counted as a fragmentation, while every time a tracked ground truth object ID is incorrectly changed during the tracking duration is counted as an ID switch. Then, the simple metrics computed are the following:
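The per-frame matching rule described above can be sketched as follows. This is a simplified illustration, not the benchmark's reference code: the remaining objects are matched greedily by descending IoU (an assumption; an optimal assignment could be used instead), and fragmentation and ID-switch counting are left out. The names `iou` and `match_frame` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_frame(gt, hyp, prev_match, thr=0.5):
    """gt, hyp: dict id -> box. prev_match: dict gt_id -> hyp_id from frame t-1.

    Returns (match, fn, fp) for the current frame.
    """
    match, used = {}, set()
    # Step 1: continuity constraint. Keep last frame's pairs if IoU >= thr,
    # even when another hypothesis would overlap better.
    for g, h in prev_match.items():
        if g in gt and h in hyp and iou(gt[g], hyp[h]) >= thr:
            match[g] = h
            used.add(h)
    # Step 2: greedily match what is left, best IoU first (simplification).
    candidates = sorted(
        ((iou(gt[g], hyp[h]), g, h)
         for g in gt if g not in match
         for h in hyp if h not in used),
        reverse=True,
    )
    for score, g, h in candidates:
        if score < thr:
            break
        if g in match or h in used:
            continue
        match[g] = h
        used.add(h)
    fn = len(gt) - len(match)   # ground truth boxes without a hypothesis
    fp = len(hyp) - len(used)   # hypotheses without a ground truth box
    return match, fn, fp
```

With a previous match to hypothesis `b`, the ground truth stays on `b` as long as their IoU is at least 0.5, even if hypothesis `a` overlaps it perfectly.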
Detection
- [ ] DETR
Transformer
https://jalammar.github.io/illustrated-transformer/
- RNN
- Maintain the hidden state
- Transformer
- Self-Attention Global
Contextualized Embedding
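To make "global self-attention" and "contextualized embedding" concrete, here is a minimal sketch of scaled dot-product self-attention. It assumes identity query/key/value projections, which is a simplification (a real Transformer layer learns these projections and uses multiple heads). The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length float vectors. Every output vector is a
    weighted mix of ALL input vectors, which is what makes it contextualized.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


contextualized = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Unlike an RNN, no hidden state is carried from token to token: each token attends to every other token directly.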
SiamMOT = DeepSORT + Motion Model
Motivation & Background
- Traditional MOT
- Bipartite graph matching
- Nodes maintain appearance + motion features
- A global optimization problem
- No explicit inter-frame association
- improving local linking over consecutive frames rather than building an offline graph to re-identify instances across large temporal gaps.
SORT & DeepSORT
- Siamese network
- Response map
Drawback of Siamese networks: they cannot tell different objects apart
TODO 『2021-12-21 [Faster RCNN]』
TODO 『2021-12-21 [center Track] SKIPPED』
Contribution
- SiamMOT: Faster RCNN + Motion Model
- consecutive frames & local linking
- Based on SORT
Pipeline
Arch
Given
\(\mathbf{F}^{t},\mathbf{F}^{t+1};\mathbf{R}_{i}^{t}(\text{Region, Bounding Box})\)
Faster-RCNN
- \(R_{i}^{t}\) is the tracking result at time t
- \(\mathbf{D}^{t+1} = \text{Detector}(\mathbf{F}^{t+1})\)
Region-based Siamese Tracker
-
\(\mathbf{f}_{R_i}^{t} = \operatorname{ROIAlign}(R_{i}^{t},\text{dim})\)
- feature of Region(Object) i in frame t
-
\(\mathbf{f}_{S_i}^{t+1} = \operatorname{ROIAlign}(S_{i}^{t+1},\text{dim})\)
- feature of the search area of object i in frame t+1
- \(S_{i}^{t+1} = \operatorname{Expand}(R_{i}^{t},\text{factor})\)
- (assuming an object does not move far between two consecutive frames)
-
\(\mathbf{R}^{t+1} = \text{SpatialMatching}(\mathbf{Mot}^{t+1},\mathbf{Det}^{t+1} )\)
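- \(\mathbf{Mot}^{t+1},\mathbf{Det}^{t+1}\) denote the motion model's predictions and the object detector's outputs, respectively
- Both are obtained within the region expanded from the prior
- \(\mathbf{Mot}^{t+1}\) corresponds to cross-correlation regression
- \(\mathbf{Det}^{t+1}\) corresponds to the detection results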
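The \(\operatorname{Expand}\) step that turns a tracked box into a search area can be sketched as scaling the box about its centre. This is an assumed reading of the operation: the function name `expand`, the `(x1, y1, x2, y2)` box layout, and the absence of clipping to the image border are all choices made for illustration.

```python
def expand(box, factor):
    """Scale a box (x1, y1, x2, y2) about its centre by `factor`.

    No clipping to the image bounds is done here (simplification).
    """
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    half_w = (box[2] - box[0]) * factor / 2.0
    half_h = (box[3] - box[1]) * factor / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Hypothetical tracked box in frame t, doubled to get the search area in t+1.
search_area = expand((10.0, 10.0, 20.0, 30.0), 2.0)
```

A larger factor tolerates faster motion but lets more distractors into the search area.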
Core: Siamese Tracker
Implicit MM
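『IDEA [Why use a search region?]』
- [ ] I have not figured this out yet
❓This feels like it might be written incorrectly
because it is not necessarily a trajectory, so there is no motion correlation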
Explicit MM
- channel-wise cross-correlation
- response map
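Via correlation
Cross-correlation as a convolution: this effectively exploits the extensibility (flexibility) of the convolution operation
The output includes the confidence
\[\mathbf{v}_i(x,y) = \text{Confidence}_i(x,y) \]
and the regressed box
\[\mathbf{p}_i(x,y) = [l; t; r; b] \]
Both \(\mathbf{v}\) and \(\mathbf{p}\) are defined per pixel on a 16×16 feature map
Finally, find the optimal location
\[(x^{*},y^{*}) = \mathop{\arg\max}\limits_{x,y}(\mathbf{v}_i \odot \boldsymbol{\eta}_i) \]
where \(\boldsymbol{\eta}_i\) is the penalty map. The penalty map is introduced to discourage dramatic movements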
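A small sketch of the location search, assuming the quantity being maximised is the element-wise product of the confidence map and the penalty map. That scoring rule is an assumption made for this illustration, and the function name `best_location` is hypothetical.

```python
def best_location(conf, penalty):
    """Return (x, y) maximising conf[y][x] * penalty[y][x].

    conf, penalty: 2D lists of the same shape. The penalty map down-weights
    locations far from the previous position (assumed scoring rule).
    """
    best_score, best_xy = float("-inf"), None
    for y, (conf_row, pen_row) in enumerate(zip(conf, penalty)):
        for x, (c, p) in enumerate(zip(conf_row, pen_row)):
            if c * p > best_score:
                best_score, best_xy = c * p, (x, y)
    return best_xy


conf = [[0.2, 0.9],
        [0.8, 0.1]]
penalty = [[1.0, 0.5],
           [1.0, 1.0]]
xy = best_location(conf, penalty)
```

In this toy case the raw confidence peaks at the top-right cell, but the penalty halves it, so the bottom-left cell wins.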
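『IDEA [I have always felt that ROI-based regression is insufficient, because the convolution's awareness of object changes does not exist at inference time: the parameters are already frozen]』
『IDEA [Bounding-box regression learns generic parameters that regress an offset (adjustment) from features. In theory this should not be enough, yet it works very well]』
❓This is called a Motion Model, but I do not think it is one. No motion model is actually built here. In essence it is the same as SiamFC.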
First it uses the channel independent correlation operation to allow the network to explicitly learn a matching function between the same instance in sequential frames. Second, it enables a mechanism for finer-grained pixel-level supervision which is important to reduce the cases of falsely matching to distractors.
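The channel-independent correlation mentioned above can be illustrated with plain nested lists: each template channel slides over the matching search channel and produces its own response map. This is a didactic sketch with a hypothetical function name, not the paper's implementation (which would use batched tensor operations).

```python
def xcorr_channelwise(template, search):
    """Channel-wise (depth-wise) cross-correlation.

    template: C x h x w nested lists, search: C x H x W nested lists.
    Returns C response maps of size (H-h+1) x (W-w+1); channels are never mixed.
    """
    out = []
    for t, s in zip(template, search):
        h, w = len(t), len(t[0])
        H, W = len(s), len(s[0])
        response = [
            [sum(t[i][j] * s[y + i][x + j] for i in range(h) for j in range(w))
             for x in range(W - w + 1)]
            for y in range(H - h + 1)
        ]
        out.append(response)
    return out


template = [[[1, 0],
             [0, 1]]]
search = [[[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]]
response_maps = xcorr_channelwise(template, search)
```

Because each location of the response map has its own value, a per-pixel loss can be attached to it, which is the finer-grained supervision the quote refers to.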
So why does it work so well?
Training
\[Loss = l_{rpn} + l_{detect} + l_{motion} \]
\(l_{motion}\) is the loss function mentioned earlier
Short Occlusion
Inference
Spatial Matching
Solver
Some matching rules
CorrTrack/TLD = FairMOT+Corr.
Overall, this paper is not very clearly written
There is no code and no clear architecture diagram
Motivation & Background
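- Convolutional networks are structurally limited to local perception
- Individuals of the same semantic class are highly similar, causing strong interference
- Long-range spatial and temporal dependencies cannot be captured effectively
- Characteristics of the MOT task: multi-scale features must be handled / object size is unknown (same as in detection); cf. how detection handles multi-scale features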
Similarity
\[\mathbf{A}_{i j}=\operatorname{dist}\left(\mathbf{f}\left(\mathbf{d}_{t}^{i}\right), \hat{\mathbf{f}}\left(\mathbf{T}_{t-1}^{j}\right)\right)+\alpha \operatorname{IoU}\left(\mathbf{d}_{t}^{i}, \hat{\mathbf{d}}_{t}^{j}\right) \]
- Semantic similarity is computed from the previous frame / previous ROI
- Spatial correlation is limited by the detector
- Many similar individuals nearby (especially in pedestrian scenes)
- Leads to ID switches
the correlation information between the cropped image patches is lost directly, and the adjacency spatial relationship is only retained in coordinates
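DISTRACTORS need to be distinguished
❓But is this really necessary? Neighboring objects do not move that abruptly.
Perhaps the authors mean a "touching switch" (identity drift on contact; my own name for it)
- Based on FairMOT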
Pipeline
Arch
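- Generic feature extraction
- Jointly learn correlation from spatio-temporal dependencies and predict detections
- Associate each detection with the closest track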
Spatial Local Correlation Layers
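The goal is to obtain context correlation features (spatial features fused with context)
Spatial Correlation
- The authors take the Non-local Module as a reference and make a neighborhood-based modification
The modification is actually quite intuitive
It merely imposes a hand-set local receptive field
In effect it performs a self-attention
- Attention weight \(\alpha(x,y) = \text{NeighborsCorrAt}(x,y)\)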
-
\[\text{Correlated Feature} = \mathbf{F}_{C}^{l} = \sum_{x,y} \alpha(x,y) \mathbf{F}_{t}^{l}(x,y)
\]
- The superscript \(l\) denotes the level in the feature pyramid
- At this point \(l\) is not yet used
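The neighborhood-restricted attention above can be sketched for a single location. Two assumptions are made for illustration: the correlation is a dot product between the centre feature and each neighbor, and the weights are normalised with a softmax. Neither detail is confirmed by the notes, and the function name and `radius` parameter are hypothetical.

```python
import math


def local_correlated_feature(feat, y0, x0, radius):
    """Correlated feature at (y0, x0) from a (2*radius+1)^2 neighborhood.

    feat: H x W x C nested lists. Weights alpha come from a softmax over
    dot-product correlations with the centre feature (assumed form).
    """
    H, W = len(feat), len(feat[0])
    query = feat[y0][x0]
    neighbors = [
        (y, x)
        for y in range(max(0, y0 - radius), min(H, y0 + radius + 1))
        for x in range(max(0, x0 - radius), min(W, x0 + radius + 1))
    ]
    scores = [sum(a * b for a, b in zip(query, feat[y][x])) for y, x in neighbors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    channels = len(query)
    return [
        sum(a * feat[y][x][c] for a, (y, x) in zip(alpha, neighbors))
        for c in range(channels)
    ]


# Hypothetical 3x3 feature map with 2 channels, identical everywhere.
feat = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
center = local_correlated_feature(feat, 1, 1, radius=1)
```

Setting `radius` large enough to cover the whole map recovers a non-local (global) attention; a small radius is the hand-set local receptive field.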
Spatial Corr. on FPN
\[[0,R\times D \times 2^l] \]
What does this mean?
Temporal Correlation
I do not really understand this
This is an earlier formula, exactly the same
colorization as a proxy task
The loss function is not even mentioned here
Actually, our method intensively performs siamese tracking operations \(M\times N\) times to increase the discrimination.
Self-Supervised Feature Learning
TransTrack
Motivation
Query-key Promising
For the same object, its feature in different frames is highly similar, which enables the query-key mechanism to output ordered object sets. This inspiration should also be beneficial to the MOT task.
How to transfer q-k from SOT to MOT
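The authors argue that the most serious problem is new-coming objects, which have no corresponding query. In SOT, by contrast, the target is guaranteed to be in the frame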
A desirable solution should be able to well capture new coming objects and propagate previously detected objects to the following frames at the same time
- New-coming Objects
Traditional Way
- [ ] Traditional object initialization methods
Pipeline
Arch
Q-K here does not refer to the query/key inside attention
- one Encoder ->
key
- Input: extracted features(2-frames)
- 2 Parallel Decoder ->
query
?- Object Detection(DETR)
- Track
- Appearance & Location Information
- Box Association
- KM (Kuhn-Munkres) algorithm
- Maximum-weight bipartite matching
-
[ ] What is the iterative procedure?
-
set prediction
- 2 sets
- (DETR)
object query
for detection
- (NO NMS)
- features(of objects on track) as
track query
-
provide consistent object information to maintain tracklets.
-
-
simple IoU matching to generate the final ordered object set from them.
-
previous frames -> data association
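The box association step (IoU matching between track boxes and detection boxes) can be sketched as below. To stay dependency-free, this example finds the best one-to-one assignment by exhaustive search, which stands in for the KM algorithm and is only practical for a handful of boxes. The 0.5 threshold and all function names are assumptions.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign(track_boxes, det_boxes, thr=0.5):
    """Maximum-total-IoU one-to-one assignment by brute force.

    Returns a list of (track_index, detection_index) pairs with IoU >= thr.
    Exhaustive search replaces the KM algorithm here (small inputs only).
    """
    n, m = len(track_boxes), len(det_boxes)
    if n == 0 or m == 0:
        return []
    if n <= m:
        candidates = (list(zip(range(n), p)) for p in permutations(range(m), n))
    else:
        candidates = (list(zip(p, range(m))) for p in permutations(range(n), m))
    best = max(
        candidates,
        key=lambda pairs: sum(iou(track_boxes[i], det_boxes[j]) for i, j in pairs),
    )
    return [(i, j) for i, j in best if iou(track_boxes[i], det_boxes[j]) >= thr]


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 20, 31, 30), (0, 1, 10, 11), (50, 50, 60, 60)]
pairs = assign(tracks, dets)
```

Detections left unmatched (index 2 here) are the candidates for new tracks, which is how new-coming objects enter the ordered set.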
Input and Output
Training
Inference
Object initialization
TransTrack first detects objects on the first frame, where the feature maps are from two copies of the first frame.
Occlusions and short-term disappearance
Specifically, if a tracking box is unmatched, it keeps as an “inactive” tracking box until it remains unmatched for K consecutive frames. Inactive tracking boxes can be matched to detection boxes and regain their ID.
we choose K = 32.
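The inactive-track rule quoted above can be sketched as a small state machine. The class name `Track`, its attributes, and the state labels are hypothetical; only the rule itself (inactive while unmatched, dropped after K consecutive unmatched frames, ID regained on a match) comes from the quote.

```python
class Track:
    """Minimal track lifecycle: active -> inactive -> removed."""

    def __init__(self, track_id):
        self.track_id = track_id
        self.missed = 0          # consecutive unmatched frames
        self.state = "active"

    def update(self, matched, k=32):
        """Advance one frame; `matched` says whether a detection was associated."""
        if self.state == "removed":
            return self.state    # a removed track never comes back
        if matched:
            self.missed = 0
            self.state = "active"   # an inactive track regains its ID
        else:
            self.missed += 1
            self.state = "removed" if self.missed >= k else "inactive"
        return self.state


track = Track(track_id=7)
for _ in range(5):
    track.update(matched=False)   # short occlusion
state_after_rematch = track.update(matched=True)
```

A larger K survives longer occlusions at the cost of keeping more stale boxes around as matching candidates.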
Why using Transformer/ dominant reason
-
Decent Framework
-
No Prior
-
[ ] Advantages of Transformers in vision tasks
Summary
JDT paradigm
What's Different from TD
-
TD = Detection + Association
-
Joint Learning/Task-Driven
Questions
- Bounding-box estimates may be inaccurate
- SiamMask
- Segmentation is the upper bound of Detection
- May affect accuracy
-
ID Switch
TODO
- [ ] How do you read a paper so that you can be said to understand it?
- [ ] Be able to write pseudocode
- [ ] State the authors' motivation
Appendix
Kalman Filter
SORT
DeepSORT
- \(D_1\)
- Mahalanobis distance computed on the Kalman filter distribution
- \(M(\text{track}_i,\text{dec}_j;S_{\text{Kalman Filter}})\)
-
\(D_2\)
- Cosine distance between deep ReID features
- Matching on appearance features alone is also enough to track.
-
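Matching cascade
If a track has been occluded for a long time, the Kalman filter's repeated predictions spread out its probability mass (the uncertainty grows). If two tracks then compete for the same detection, the one occluded longer often gets the smaller Mahalanobis distance (appears closer).
The covariance matrix in the Mahalanobis distance
thus biases detections toward tracks that have been lost longer, whereas intuitively the detection should go to the most recently seen track. DeepSORT therefore introduces a matching cascade that gives higher priority to more frequently seen objects.
- Handles occlusion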
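A minimal sketch of the cascade idea: tracks are offered detections in order of how recently they were matched, so a recently seen track gets first pick. Within each age level this example matches greedily, which is a simplification (a linear assignment per level could be used instead). The function name, the `cost` callback, and `max_cost` gating are hypothetical.

```python
def matching_cascade(tracks, detections, cost, max_cost, max_age):
    """tracks: dict track_id -> frames since last match (1 = seen last frame).
    detections: list of detection ids. cost(track_id, det_id) -> float.

    Returns dict track_id -> det_id. Younger ages are matched first.
    """
    matches = {}
    free = list(detections)
    for age in range(1, max_age + 1):
        for tid in [t for t, a in tracks.items() if a == age]:
            if not free:
                return matches
            best = min(free, key=lambda d: cost(tid, d))
            if cost(tid, best) <= max_cost:   # gate out implausible pairs
                matches[tid] = best
                free.remove(best)
    return matches


# The long-occluded track has the LOWER cost, yet the recent track wins.
tracks = {"long_occluded": 10, "recent": 1}
costs = {("long_occluded", "d0"): 0.1, ("recent", "d0"): 0.3}
result = matching_cascade(tracks, ["d0"], lambda t, d: costs[(t, d)],
                          max_cost=0.5, max_age=30)
```

Without the cascade, a single global assignment would hand `d0` to the long-occluded track because of its inflated covariance.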
IoU matching
unconfirmed and unmatched tracks of age n = 1. This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization
Faster-RCNN
FPN + Fast RCNN
- RPN
Predefined Anchors
ROI Pooling
- 2 MLP Heads
- Classification
Coord. Regression
: refines the candidate boxes
Feature Pyramid
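Problems to solve
- Multiple scales
- Loss of fine-grained information
- Small objects cannot be reconstructed
Image pyramid
Single high-level feature
Directly taking features from different layers
Feature pyramid
- Downsampling
- Upsampling
- Nearest-neighbor interpolation
- 1×1 convolution, lateral connection + merge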
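The top-down step of a feature pyramid (nearest-neighbor upsampling followed by a merge with the lateral connection) can be sketched on single-channel maps. It is assumed that the lateral map has already passed through its 1×1 convolution and that the merge is an element-wise sum; the function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list (each value becomes a 2x2 block)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse, lateral):
    """Upsample the coarser level and add it to the lateral (finer) level.

    `lateral` is assumed to be the output of the 1x1 lateral convolution.
    """
    up = upsample_nearest_2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2]]                       # 1 x 2 high-level map
lateral = [[10, 10, 10, 10],
           [20, 20, 20, 20]]            # 2 x 4 finer map
merged = merge_top_down(coarse, lateral)
```

The merged map keeps the spatial resolution of the finer level while carrying the semantics of the coarser one.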
Dilated Conv.
- Enlarges the receptive field without the information loss caused by pooling
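How much a dilation enlarges the receptive field follows from the standard effective kernel size formula \(k_{\text{eff}} = k + (k-1)(d-1)\). A tiny sketch, with a hypothetical function name:

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


# A 3x3 kernel with dilation 2 covers a 5x5 window with the same 9 weights.
span = effective_kernel_size(3, 2)
```

The number of weights stays the same; only the spacing between the sampled positions grows.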
Non-local Module
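- Captures local fine structure / pattern matching
- Can be viewed as local filtering
Goal: long-range dependencies
Ways to enlarge the receptive field
- Stacking
- A lot of information is lost during sampling
- Fully connected / Attention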
Multi Scale Perception
- ROI Pooling / ROI Align
- Essentially a pooling method: lossy sampling
- FPN
- Feature pyramid
- Fuses features from different scales well
FairMOT
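- Object detection: treated as center-based bounding-box regression on a high-resolution feature map
- = Faster RCNN = clsHeatMap + BBReg
- Network architecture adapted to the ReID task
- Similar to a feature pyramid
- Multi-scale fusion
The core lies in the design of the loss function
- For data association
- Based on DeepSORT
- In fact it only replaces the ReID features
- which become Task-Driven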