
MARS: A Video Benchmark for Large-Scale Person Re-identification

Megvii's introduction of the algorithm: http://www.sohu.com/a/207091906_418390

A blog summary of MARS: https://blog.csdn.net/baidu_39622935/article/details/82867177

Open-source re-id code: https://blog.csdn.net/qq_21997625/article/details/80937939

Whether re-id is image-based or video-based, the pipeline first extracts features and then applies metric learning.
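Whatever the feature extractor, the second stage ultimately reduces to ranking a gallery by distance to a probe. A minimal NumPy sketch, using plain Euclidean distance as a stand-in for a learned metric (all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted by ascending distance to the probe;
    index 0 is the rank-1 match."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
gallery = rng.standard_normal((10, 128))              # 10 gallery features
probe = gallery[3] + 0.01 * rng.standard_normal(128)  # noisy copy of identity 3
ranking = rank_gallery(probe, gallery)                # ranking[0] should be 3
```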


Video-based person re-identification:
1) Feature extraction:

Hand-crafted methods: LOMO, HOG3D, GEI features, etc.

Deep learning methods: IDE (ID-discriminative Embedding) trains a classification network (fine-tuned from an ImageNet-pretrained model). The first five layers are convolutional; fc6 and fc7 are fully connected layers with 1,024 neurons each; fc8 is a classification layer with one output per identity. The network is trained as an ordinary classification task, and after training the output of fc7 is used as the extracted feature.
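A minimal NumPy sketch of the IDE test-time forward pass described above, with random weights standing in for trained ones and an assumed 9,216-dim flattened conv output; only fc7 is kept as the descriptor, since fc8 is discarded after training:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ide_feature(pool5, W6, b6, W7, b7):
    """Test-time IDE descriptor: run the two 1024-unit fully connected
    layers and return the fc7 activation; the fc8 classification layer
    is only used during training and is dropped here."""
    fc6 = relu(pool5 @ W6 + b6)
    fc7 = relu(fc6 @ W7 + b7)
    return fc7

rng = np.random.default_rng(0)
pool5 = rng.standard_normal(9216)              # flattened conv output (assumed size)
W6 = 0.01 * rng.standard_normal((9216, 1024))
W7 = 0.01 * rng.standard_normal((1024, 1024))
feat = ide_feature(pool5, W6, np.zeros(1024), W7, np.zeros(1024))  # 1024-d descriptor
```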

Worked example:

For video-based re-id, most work processes a video sequence frame by frame. The MARS paper's first approach instead divides each sequence into small 8*8*6 spatio-temporal blocks and extracts a 96-dim HOG3D feature from each block. Because sequences vary in length, each sequence yields an n_i * 96 feature matrix; a bag-of-words model then encodes it into a fixed 2,000-dim vector, which works for sequences of any length.
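The bag-of-words step can be sketched as follows: each of the n_i local HOG3D descriptors is assigned to its nearest codeword, and the normalized assignment histogram is the fixed-length sequence feature. The random codebook below is a stand-in for one learned by clustering:

```python
import numpy as np

def bow_encode(local_feats, codebook):
    """Quantize a variable-length set of local descriptors (n_i x d)
    against a codebook (k x d) and return a fixed k-dim, L1-normalized
    histogram of codeword assignments."""
    # squared Euclidean distances, computed without a huge 3-D temporary
    d2 = ((local_feats ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2.0 * local_feats @ codebook.T)
    assign = d2.argmin(axis=1)  # nearest codeword per block
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
codebook = rng.standard_normal((2000, 96))  # k = 2000 codewords, as in the paper
h_short = bow_encode(rng.standard_normal((12, 96)), codebook)  # 12 HOG3D blocks
h_long = bow_encode(rng.standard_normal((87, 96)), codebook)   # 87 HOG3D blocks
```

Sequences of 12 and 87 blocks both map to the same 2,000-dim representation.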

In the paper's second method, the IDE feature, every frame of a sequence is fed through a CNN to extract a per-frame feature, and all frame features are then pooled into a single sequence-level descriptor (max pooling for MARS and PRID, average pooling for iLIDS).
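The temporal pooling step is a one-liner per mode; a sketch assuming 1,024-dim per-frame features:

```python
import numpy as np

def pool_sequence(frame_feats, mode="max"):
    """Collapse per-frame CNN features (n_frames x d) into one
    sequence-level descriptor of dimension d."""
    if mode == "max":                # max pooling (used for MARS and PRID)
        return frame_feats.max(axis=0)
    return frame_feats.mean(axis=0)  # average pooling (used for iLIDS)

rng = np.random.default_rng(0)
frames = rng.standard_normal((30, 1024))  # 30 frames of 1024-d IDE features
seq_max = pool_sequence(frames, "max")
seq_avg = pool_sequence(frames, "avg")
```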
2) Metric learning:

KISSME, XQDA, etc., which offer both high efficiency and high accuracy.
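A minimal NumPy sketch of the KISSME idea: differences of similar and dissimilar feature pairs are modeled as zero-mean Gaussians, and the learned metric is the difference of their inverse covariance matrices. The synthetic pairs below are purely illustrative:

```python
import numpy as np

def kissme(x1, x2, similar):
    """KISSME: model the differences of similar and dissimilar feature
    pairs as zero-mean Gaussians; the Mahalanobis-style metric is
    M = inv(Cov_similar) - inv(Cov_dissimilar)."""
    d = x1 - x2
    cov_s = d[similar].T @ d[similar] / similar.sum()
    cov_d = d[~similar].T @ d[~similar] / (~similar).sum()
    return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

def pair_dist(a, b, M):
    """Squared distance under metric M for each row pair."""
    d = a - b
    return np.einsum("ij,jk,ik->i", d, M, d)

# synthetic pairs: the first half are "same identity" (small differences)
rng = np.random.default_rng(0)
n, dim = 400, 8
x1 = rng.standard_normal((n, dim))
similar = np.arange(n) < n // 2
scale = np.where(similar[:, None], 0.1, 1.0)
x2 = x1 + scale * rng.standard_normal((n, dim))
M = kissme(x1, x2, similar)
```

Under the learned M, similar pairs should score much smaller distances than dissimilar pairs on average.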
 

Abstract

This paper considers person re-identification (re-id) in videos.
We introduce a new video re-id dataset, named Motion Analysis and Re-identification Set (MARS), a video extension of the Market-1501 dataset.
To our knowledge, MARS is the largest video re-id dataset to date.
Containing 1,261 IDs and around 20,000 tracklets, it provides rich visual information compared to image-based datasets.
Meanwhile, MARS is a step closer to practical applications.
The tracklets are automatically generated by the Deformable Part Model (DPM) pedestrian detector and the GMMCP tracker.
A number of false detection/tracking results are also included as distractors which would exist predominantly in practical video databases.
 Extensive evaluation of the state-of-the-art methods including the space-time descriptors and CNN is presented.
We show that CNN in classification mode can be trained from scratch using the consecutive bounding boxes of each identity.
The learned CNN embedding outperforms other competing methods significantly and has good generalization ability on other video re-id datasets upon fine-tuning.


INTRODUCTION

With respect to the “probe-to-gallery” pattern, there are four re-id strategies:
image-to-image, image-to-video, video-to-image, and video-to-video.
Among them, the first mode is the most studied in the literature, and previous methods in image-based re-id [5,24,35] were developed to cope with the small amount of training data.


The second mode can be viewed as a special case of “multi-shot”,
and the third one involves multiple queries.
 Intuitively, the video-to-video pattern, which is our focus in this paper, is more favorable because both probe and gallery units contain much richer visual information than single images.
Empirical evidences confirm that the video-to-video strategy is superior to the others (Fig. 3).


Currently, a few video re-id datasets exist [4, 15, 28, 36].
They are limited in scale: typically several hundred identities are contained, and the number of image sequences is only about double that (Table 1).
Without large-scale data, the scalability of algorithms is less-studied and methods that fully utilize data richness are less likely to be exploited.
In fact, the evaluation in [43] indicates that re-id performance drops considerably in large-scale databases.


Moreover, image sequences in these video re-id datasets are generated by hand-drawn bboxes. This process is extremely expensive, requiring intensive human labor.
And yet, in terms of bounding box quality, hand-drawn bboxes are biased towards the ideal situation, where pedestrians are well-aligned.
But in reality, pedestrian detectors lead to partial occlusion or misalignment, which may have a non-negligible effect on re-id accuracy [43].
Another side-effect of hand-drawn box sequences is that each identity has one box sequence under a camera.
 This happens because there are no natural break points inside each sequence.
But in automatically generated data, a number of tracklets are available for each identity due to missed detections or tracking failures.
As a result, in practice one identity will have multiple probes and multiple sequences as ground truths. It remains unsolved how to make use of these visual cues.


In light of the above discussions, it is of importance to 1) introduce large-scale and real-life video re-id datasets and

2) design effective methods that fully utilize the rich visual data.
 To this end, this paper contributes in collecting and annotating a new person re-identification dataset, named “Motion Analysis and Re-identification Set” (MARS) (Fig. 1).
Overall, MARS is featured in several aspects.
 First, MARS has 1,261 identities and around 20,000 video sequences, making it the largest video re-id dataset to date.
Second, instead of hand-drawn bboxes, we use the DPM detector [11] and GMMCP tracker [7] for pedestrian detection and tracking, respectively.
Third, MARS includes a number of distractor tracklets produced by false detection or tracking result. Finally, the multiple-query and multiple-ground truth mode will enable future research in fields such as query re-formulation and search re-ranking [45].


Apart from the extensive tests of the state-of-the-art re-id methods, this paper evaluates two important features:
1) motion features including HOG3D [18] and the gait [13] feature, and
2) the ID-discriminative Embedding (IDE) [46], which learns a CNN descriptor in classification mode.
Our results show that although motion features achieve impressive results on small datasets, they are less effective on MARS due to intensive changes in pedestrian activity.
In contrast, the IDE descriptor learned on the MARS training set significantly outperforms the other competing features, and demonstrates good generalization ability on the other two video datasets after fine-tuning.
