Pseudo-3D Residual Networks: Algorithm Notes
An ICCV 2017 paper. In video classification and understanding, it is natural to extend the 2D convolutions of the image domain to 3D convolutions. Although 3D convolution can extract spatial and temporal features at the same time, its computational cost and model size are both too large. This paper therefore redesigns the 3D convolution used for video and proposes the Pseudo-3D Residual Net (P3D ResNet). The idea is reminiscent of Inception v3, which replaced a 3×3 convolution with stacked 1×3 and 3×1 convolutions; here a 3×3×3 convolution is replaced by a 1×3×3 convolution plus a 3×1×1 convolution. The former extracts spatial features (essentially the same as a 2D convolution), while the latter extracts temporal features, since the third-from-last dimension of the input is the number of frames.
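To make the decomposition concrete, here is a minimal PyTorch-style sketch (my own illustration, not the official code; the channel counts are placeholders):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 3, 64  # example channel counts, not taken from the paper

# Full 3D convolution: one kernel covers time, height and width at once.
conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# P3D-style decomposition of the 3x3x3 kernel:
spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))   # "S": 1x3x3, spatial only
temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # "T": 3x1x1, temporal only

x = torch.randn(1, in_ch, 16, 160, 160)  # N x C x L x H x W, one 16-frame clip
print(conv3d(x).shape)                   # 16 x 160 x 160 preserved
print(temporal(spatial(x)).shape)        # same output size with far fewer parameters
```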
Figure 1 compares several models in depth, model size, and video-classification accuracy on the Sports-1M dataset. The P3D ResNet there is built by modifying ResNet-152. Its depth is no longer 152 because each modified residual block contains 3 or 4 convolution layers instead of the 3 layers of the original ResNet bottleneck (see Figure 3 for details), so the final network has 199 layers; the network in the official GitHub code is also the 199-layer version. The ResNet-152 in Figure 1 is fine-tuned directly on Sports-1M. The 199-layer P3D ResNet is somewhat larger than this fine-tuned ResNet-152, but its accuracy is clearly higher, and compared with C3D (trained from scratch on Sports-1M) it improves considerably in both accuracy and model size. The speed-up is another highlight; a detailed speed comparison comes later.
Given that we want to replace a 3×3×3 convolution with a 1×3×3 convolution and a 3×1×1 convolution, how to combine the two is itself a design question. Figure 2 shows the three residual structures used in P3D ResNet, where S denotes the spatial 2D filters (the 1×3×3 convolutions) and T denotes the temporal 1D filters (the 3×1×1 convolutions).
Figure 3 gives the detailed layer layout of the three P3D residual structures and compares them with the original ResNet residual block. The extra depth of P3D ResNet comes mainly from P3D-A and P3D-C; a sketch of the three combinations follows below.
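As I understand Figures 2 and 3, P3D-A places S and T in series, P3D-B runs them in parallel and sums them, and P3D-C feeds S's output both through T and directly to the output. A simplified sketch (my own, assuming PyTorch; the bottleneck 1×1×1 convolutions, BN layers, and exact widths of the real blocks are omitted):

```python
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    """Simplified residual block illustrating the three S/T combinations.

    variant='A': x -> S -> T                 (serial)
    variant='B': S(x) + T(x)                 (parallel)
    variant='C': S(x) + T(S(x))              (compromise)
    The 1x1x1 bottleneck convolutions of the real blocks are left out.
    """
    def __init__(self, channels, variant='A'):
        super().__init__()
        self.variant = variant
        self.S = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))  # spatial 1x3x3
        self.T = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))  # temporal 3x1x1
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        if self.variant == 'A':
            out = self.T(self.relu(self.S(x)))
        elif self.variant == 'B':
            out = self.S(x) + self.T(x)
        else:  # 'C'
            s = self.relu(self.S(x))
            out = s + self.T(s)
        return self.relu(x + out)  # residual connection

x = torch.randn(1, 64, 16, 56, 56)
for v in ['A', 'B', 'C']:
    print(v, P3DBlock(64, v)(x).shape)
```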
Table 1 compares the speed of the models and their accuracy on the UCF101 dataset. The ResNet-50 row is a ResNet-50 fine-tuned on UCF101.
The P3D-A, P3D-B and P3D-C ResNets are trained as follows: "The architectures of three P3D ResNet variants are all initialized with ResNet-50 except for the additional temporal convolutions and are further fine-tuned on UCF101." In other words, the 1×3×3 convolutions can be initialized from the original 3×3 convolutions of ResNet-50, but the 3×1×1 convolutions cannot, since ResNet-50 has no kernels of that shape; they are therefore randomly initialized and fine-tuned directly on the video dataset. "For each P3D ResNet variant, the dimension of input video clip is set as 16 × 160 × 160 which is randomly cropped from the resized non-overlapped 16-frame clip with the size of 16 × 182 × 242." During training each batch contains 128 clips, each clip contains 16 frames, and each frame is 160×160, so the input tensor has shape 128×3×16×160×160. Why 16 frames? It is tied to the network structure: the code shows 4 temporal poolings, exactly enough to reduce 16 down to 1. At test time 20 clips are sampled from each video, each consisting of 16 frames; this is detailed later. On the input size, the paper says: "Given a video clip with the size of c×l×h×w where c, l, h and w denotes the number of channels, clip length, height and width of each frame, respectively." The clip length l is the 16 frames mentioned here.
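A minimal sketch of the initialization idea (my illustration, not the official training code; the pretrained 2D weight tensor here is hypothetical): the 1×3×3 kernel is just the 2D 3×3 kernel with a singleton temporal dimension, so pretrained 2D weights can be copied over, while the 3×1×1 kernel has no 2D counterpart and keeps its random initialization.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained weights of a ResNet-50 3x3 conv: [out, in, 3, 3]
w2d = torch.randn(64, 64, 3, 3)

spatial = nn.Conv3d(64, 64, (1, 3, 3), padding=(0, 1, 1), bias=False)   # S
temporal = nn.Conv3d(64, 64, (3, 1, 1), padding=(1, 0, 0), bias=False)  # T

# Copy the 2D kernel into the 1x3x3 kernel by adding a temporal axis of size 1.
with torch.no_grad():
    spatial.weight.copy_(w2d.unsqueeze(2))  # -> [64, 64, 1, 3, 3]

# The 3x1x1 kernel has no ImageNet counterpart; it stays randomly initialized
# and is learned during fine-tuning on the video data.
```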
Table 1 shows that, with only a small increase in model size, the speed improves greatly (9 clips/s is 144 frames/s, since each clip has 16 frames) and the accuracy gain is also clear. Moreover: "By additionally pursuing structural diversity, P3D ResNet makes the absolute improvement over P3D-A ResNet, P3D-B ResNet and P3D-C ResNet by 0.5%, 1.4% and 1.2% in accuracy respectively, indicating that enhancing structural diversity with going deep could improve the power of neural networks."
The final P3D ResNet is obtained by connecting the three variants in an alternating order, as shown in Figure 4.
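A sketch of that interleaving, reusing the simplified P3DBlock defined above (torch and nn are imported there; the real network follows the ResNet-152 stage layout, with widths, strides and downsampling omitted here):

```python
# Cycle through the three variants A -> B -> C -> A -> ... along the depth of the network.
variants = ['A', 'B', 'C']
num_blocks = 9  # placeholder count; the actual per-stage counts follow ResNet-152
blocks = nn.Sequential(*[P3DBlock(64, variants[i % 3]) for i in range(num_blocks)])

x = torch.randn(1, 64, 16, 56, 56)
print(blocks(x).shape)
```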
Table 2 compares results on Sports-1M, which contains 487 classes and roughly 1.13 million videos. Clip hit@1 is the clip-level top-1 accuracy, Video hit@1 is the video-level top-1 accuracy, and Video hit@5 is the video-level top-5 accuracy. In Table 2, Deep Video uses an AlexNet-like network for classification; the difference between its Single Frame and Slow Fusion variants is the number of input frames, the latter computing clip-level and video-level accuracy from 10 frames, hence the higher numbers. "Convolutional Pooling exploits max-pooling over the final convolutional layer of GoogleNet across each clip's frames", i.e. max-pooling over 120 frames, so its accuracy is higher but it is obviously much slower. C3D is either trained from scratch or pre-trained on the I380K dataset and then fine-tuned on Sports-1M. "ResNet-152 is fine-tuned and employed on one frame from each clip to produce a clip-level score", i.e. the clip-level score is decided by a single frame; ResNet-152 and Deep Video (Single Frame) differ only in network architecture. The speed of the 199-layer P3D ResNet should be above 2 clips/s, since the paper states that each clip takes less than 0.5 s to process.
To sum up how Table 2 is obtained at test time: "We randomly sample 20 clips from each video and adopt a single center crop per clip, which is propagated through the network to obtain a clip-level prediction score. The video-level score is computed by averaging all the clip-level scores of a video." Clip-level accuracy is straightforward: one clip (16 frames) is fed to the trained model, the pool5 layer outputs a 2048-dim feature, and a fully connected layer maps it to 487 outputs (the 487 Sports-1M classes). For video-level accuracy, the 2048-dim outputs of all clips from the same video are averaged, and the averaged 2048-dim feature is then passed through the 487-way fully connected layer to produce the output.
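A small sketch of the test-time aggregation, written in the simpler "average the clip-level scores" form quoted from the paper (the model and function name here are hypothetical, for illustration only):

```python
import torch

def video_level_score(model, clips):
    """clips: tensor of shape [20, 3, 16, 160, 160], one center crop per sampled clip.

    Each clip is pushed through the trained network to get a 487-way clip-level score;
    the video-level score is the average of the 20 clip-level scores.
    """
    with torch.no_grad():
        clip_scores = model(clips)   # [20, 487]
    return clip_scores.mean(dim=0)   # [487] video-level score

# Clip-level accuracy: argmax of each row of clip_scores vs. the video label.
# Video-level accuracy: argmax of the averaged score vs. the video label.
```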
Experimental results:
First, the five datasets used in the experiments:
UCF101 and ActivityNet are two of the most popular video action recognition benchmarks.
UCF101 consists of 13,320 videos from 101 action categories. Three training/test splits are provided by the dataset organisers and each split in UCF101 includes about 9.5K training and 3.7K test videos.
The ActivityNet dataset is a large-scale video benchmark for human activity understanding.
The latest released version of the dataset (v1.3) is exploited, which contains 19,994 videos from 200 activity categories. The 19,994 videos are divided into 10,024, 4,926 and 5,044 videos for training, validation and test set, respectively.
ASLAN is a dataset on action similarity labeling task, which is to predict the similarity between videos. The dataset includes 3,697 videos from 432 action categories.
YUPENN and Dynamic Scene are two sets for the scenario of scene recognition. In between, YUPENN is comprised of 14 scene categories each containing 30 videos. Dynamic Scene consists of 13 scene classes with 10 videos per class.
Table 3 compares this paper's method with others on the UCF101 dataset. The methods fall into three groups: end-to-end CNN architectures with fine-tuning, CNN-based representation extractors + linear SVM, and methods fused with IDT. In the Accuracy column, the numbers in parentheses are for inputs that include optical flow in addition to video frames. The P3D ResNet here should be the 199-layer model. IDT is a hand-crafted feature. TSN (ECCV 2016) is among the strongest methods to date. With video frames alone, P3D ResNet even outperforms some networks that take both frames and optical flow as input, e.g. references [25], [29] and [37], and the comparison with C3D shows a clear advantage for P3D ResNet. "In addition, by performing the recent state-of-the-art encoding method [22] on the activations of res5c layer in P3D ResNet, the accuracy can achieve 90.5%, making the improvement over the global representation from pool5 layer in P3D ResNet by 1.9%." For some reason this result is not listed in the table.
Table 4 compares this paper's method with others on the ActivityNet dataset.
Table 5 compares action-similarity results on the ASLAN dataset, where the task is to decide whether a pair of videos presents the same action.
See the original paper for more experimental results.
The authors also list three directions for future work, all worth trying, especially the third one, i.e. feeding both video frames and optical flow into the model, since this works very well for other methods (the parenthesized numbers in Table 3). From the paper: "Our future works are as follows. First, attention mechanism will be incorporated into our P3D ResNet for further enhancing representation learning. Second, an elaborated study will be conducted on how the performance of P3D ResNet is affected when increasing the frames in each video clip in the training. Third, we will extend P3D ResNet learning to other types of inputs, e.g., optical flow or audio."