
MyDLNote-Inpainting: A Quick Read of ECCV 2020 Video Inpainting Papers

DVI: Depth Guided Video Inpainting for Autonomous Driving

[paper] [code]

This paper performs video inpainting with the help of 3D point cloud data. The method tackles a particularly important case: a region that stays occluded throughout the entire video (the following two papers, by contrast, implicitly assume the region to be completed is visible in some other key frame).

Abstract

To get clear street-view and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud.

To obtain clear street views and photo-realistic simulation for autonomous driving, the paper presents an automatic video inpainting algorithm that removes traffic agents from videos and synthesizes the missing regions under the guidance of depth/point clouds.

By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem where an occluded area has never been visible in the entire video.

A dense 3D map is built by stitching the point clouds together, and the frames of a video are geometrically correlated through this shared 3D map. To fill a target inpainting region in one frame, pixels from other frames can be transformed directly into the current frame with correct occlusion handling. Moreover, multiple videos can be fused through 3D point cloud registration, which makes it possible to inpaint a target video using several source videos. The motivation is to solve the long-time occlusion problem, where the occluded area is never visible in the entire video.

To our knowledge, we are the first to fuse multiple videos for video inpainting. To verify the effectiveness of our approach, we build a large inpainting dataset in the real urban road environment with synchronized images and Lidar data including many challenge scenes, e.g., long time occlusion.

This is also the first work to fuse multiple videos for video inpainting.

Depth Guided Video Inpainting

Overall architecture of the model: a 3D map is first built by stitching all point clouds together and is then projected onto individual frames. With the resulting dense depth map and the known extrinsic camera parameters, candidate colors can be sampled from other frames to fill the holes in the current frame. A belief-propagation-based regularization is then applied to keep the pixel colors inside the inpainted region consistent.

Fig. 1. Frame-wise point clouds (a) are stitched into a 3D map (b) using LOAM. The 3D map is projected onto a frame (c) to generate a depth map. For each pixel in the target region (e), we use its depth (d) as guidance to sample colors from other frames (f). Final pixel values are determined by BP regularization and color harmonization to ensure photometric consistency. (g) shows the final inpainting result.
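The geometric core of the pipeline is plain pinhole-camera math: project the stitched point cloud into the target frame to get a dense depth map, then back-project each hole pixel to 3D and re-project it into another frame to fetch a candidate color. Below is a minimal numpy sketch of that sampling step, assuming known intrinsics K and world-to-camera extrinsics (R, t) for both frames; the BP regularization and color harmonization steps are left out, and the function names are mine, not from the released code.

```python
import numpy as np

def project_depth_map(pts_world, K, R, t, h, w):
    """Project Nx3 world points into a camera and z-buffer them into a depth map."""
    cam = pts_world @ R.T + t                       # world -> camera coordinates
    z = cam[:, 2]
    keep = z > 1e-3                                 # points in front of the camera
    uvz = cam[keep] @ K.T                           # pinhole projection (homogeneous)
    u = np.round(uvz[:, 0] / uvz[:, 2]).astype(int)
    v = np.round(uvz[:, 1] / uvz[:, 2]).astype(int)
    depth = np.full((h, w), np.inf)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[keep][inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)      # keep the nearest point: correct occlusion
    return depth

def sample_candidate_color(uv, depth, K, R_tgt, t_tgt, R_src, t_src, src_img):
    """Back-project one target pixel with its depth, re-project it into a source frame, sample its color."""
    u, v = uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # pixel -> camera ray (z component is 1)
    p_world = R_tgt.T @ (ray * depth[v, u] - t_tgt) # camera -> world
    p_src = R_src @ p_world + t_src                 # world -> source camera
    if p_src[2] <= 0:
        return None
    proj = K @ p_src
    us, vs = int(round(proj[0] / proj[2])), int(round(proj[1] / proj[2]))
    if 0 <= vs < src_img.shape[0] and 0 <= us < src_img.shape[1]:
        return src_img[vs, us]                      # candidate color for the hole pixel
    return None
```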

Learning Joint Spatial-Temporal Transformations for Video Inpainting

[paper] [github]

This paper applies the Transformer to video inpainting. Transformers have already been adopted in many other fields.

Abstract

State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos.

State-of-the-art approaches use attention models to complete a frame by searching for the missing content in reference frames, and then complete the whole video frame by frame. However, these approaches can produce inconsistent attention results along the spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in the video.

In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.

This paper proposes to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, the missing regions of all input frames are filled simultaneously by self-attention, and STTN is optimized with a spatial-temporal adversarial loss.

Spatial-Temporal Transformer Networks

Overall design

  • Problem formulation:

The intuition is that an occluded region in a current frame would probably be revealed in a region from a distant frame, especially when a mask is large or moving slowly. To fill missing regions in a target frame, it is more effective to borrow useful contents from the whole video by taking both neighboring frames and distant frames as conditions. To simultaneously complete all the input frames in a single feed-forward process, we formulate the video inpainting task as a “multi-to-multi” problem. Based on the Markov assumption [11], we simplify the “multi-to-multi” problem and denote it as:
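Written out, with \hat{Y} denoting the completed frames (my notation), Eq. (1) should read roughly:

$$ p(\hat{Y}^T_1 \mid X^T_1) = \prod_{t=1}^{T} p(\hat{Y}^{t+n}_{t-n} \mid X^{t+n}_{t-n},\ X^T_{1,s}) $$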

where X^{t+n}_{t-n} denotes a short clip of neighboring frames with a center moment t and a temporal radius n. X^T_{1,s} denotes distant frames that are uniformly sampled from the videos X^T_1 in a sampling rate of s. Since X^T_{1,s} can usually cover most key frames of the video, it is able to describe “the whole story” of the video. Under this formulation, video inpainting models are required to not only preserve temporal consistency in neighboring frames, but also make the completed frames to be coherent with “the whole story” of the video.

A core intuition of the paper:

A region that is occluded in one frame can be compensated from other frames in which that region is not occluded. The distant frames X^T_{1,s}, uniformly sampled from the whole video X^T_1, are taken to describe “the whole story” of the video.

The model therefore completes the current frame using not only the neighboring frames from t-n to t+n, but also those “whole story” key frames X^T_{1,s} (see the small sampling sketch below).
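A tiny illustrative helper for assembling the conditioning set (the name sample_condition_frames is mine, not from the paper's code), with n the temporal radius and s the sampling rate of the distant frames:

```python
def sample_condition_frames(T, t, n=2, s=10):
    """Indices of neighboring frames [t-n, t+n] plus distant frames sampled every s frames."""
    neighbors = [i for i in range(t - n, t + n + 1) if 0 <= i < T]
    distant = list(range(0, T, s))     # the uniformly sampled "whole story" frames X^T_{1,s}
    return neighbors, distant

# e.g. for a 50-frame video and target frame t = 20:
# neighbors = [18, 19, 20, 21, 22], distant = [0, 10, 20, 30, 40]
```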

  • Network design:

The overview of the proposed Spatial-Temporal Transformer Networks (STTN) is shown in Figure 2. As indicated in Eq. (1), STTN takes both neighboring frames X^{t+n}_{t-n} and distant frames X^T_{1,s} as conditions, and complete all the input frames simultaneously. Specifically, STTN consists of three components, including a frame-level encoder, multi-layer multi-head spatial-temporal transformers, and a frame-level decoder. The frame-level encoder is built by stacking several 2D convolution layers with strides, which aims at encoding deep features from low-level pixels for each frame. Similarly, the frame-level decoder is designed to decode features back to frames. Spatial-temporal transformers are the core component, which aims at learning joint spatial-temporal transformations for all missing regions in the deep encoding space.

The proposed STTN consists of three components (a rough PyTorch sketch follows this list):

frame-level encoder: built by stacking several strided 2D convolution layers; it encodes deep features from the low-level pixels of each frame.

multi-layer multi-head spatial-temporal transformers: the core component, which learns joint spatial-temporal transformations for all missing regions in the deep encoding space.

frame-level decoder: decodes the features back into frames.
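As a reading aid, here is a heavily simplified PyTorch sketch of that three-part layout. It is my own toy code, not the authors' implementation: the real STTN extracts patches at multiple scales for its multi-head attention and is trained with a spatial-temporal adversarial loss.

```python
import torch
import torch.nn as nn

class TinySTTN(nn.Module):
    """Toy STTN-style layout: frame-level encoder, joint spatial-temporal
    attention over tokens from all frames, frame-level decoder."""

    def __init__(self, dim=128, n_layers=4, n_heads=4):
        super().__init__()
        # frame-level encoder: strided 2D convs shared across frames (input = masked RGB + mask)
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # spatial-temporal transformer: every token attends to tokens of every frame
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # frame-level decoder: upsample features back to frames
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames, masks):
        # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W) with 1 = missing; H, W divisible by 4
        B, T, _, H, W = frames.shape
        x = torch.cat([frames * (1 - masks), masks], dim=2).flatten(0, 1)   # (B*T, 4, H, W)
        feat = self.encoder(x)                                              # (B*T, C, h, w)
        C, h, w = feat.shape[1:]
        tokens = feat.view(B, T, C, h * w).permute(0, 1, 3, 2).reshape(B, T * h * w, C)
        tokens = self.transformer(tokens)            # fill all frames' holes in one pass
        feat = tokens.reshape(B, T, h * w, C).permute(0, 1, 3, 2).reshape(B * T, C, h, w)
        return self.decoder(feat).view(B, T, 3, H, W)
```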

Short-Term and Long-Term Context Aggregation Network for Video Inpainting

[paper]

Abstract

Existing methods either suffer from inaccurate short-term context aggregation or rarely explore long-term frame information. In this work, we present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.

Existing methods either suffer from inaccurate short-term context aggregation or rarely exploit long-term frame information. This work presents a new context aggregation network that effectively uses both short-term and long-term frame information for video inpainting.

In the encoding stage, we propose boundary-aware short-term context aggregation, which aligns and aggregates, from neighbor frames, local regions that are closely related to the boundary context of missing regions into the target frame. Furthermore, we propose dynamic long-term context aggregation to globally refine the feature map generated in the encoding stage using long-term frame features, which are dynamically updated throughout the inpainting process.

In the encoding stage, boundary-aware short-term context aggregation is proposed: local regions from neighboring frames that are closely related to the boundary context of the missing regions are aligned and aggregated into the target frame (a small illustration follows the BSCA figure below). In addition, dynamic long-term context aggregation is proposed to globally refine the feature map generated in the encoding stage using long-term frame features, which are dynamically updated throughout the inpainting process.

Short-Term and Long-Term Context Aggregation Network

  • Network Overview

Fig. 3. Overview of our proposed network. In the encoding stage, we conduct Boundary-aware Short-term Context Aggregation (BSCA) (Sec. 3.2) using short-term frame information from neighbor frames, which is beneficial to context aggregation and generating temporally consistent contents. In the decoding stage, we propose the Dynamic Long-term Context Aggregation (DLCA) (Sec. 3.3), which utilizes dynamically updated long-term frame information to refine the encoding-generated feature map.

In the encoding stage, Boundary-aware Short-term Context Aggregation (BSCA) uses short-term information from neighboring frames, which benefits context aggregation and produces temporally consistent content. In the decoding stage, Dynamic Long-term Context Aggregation (DLCA) uses dynamically updated long-term frame information to refine the feature map generated by the encoder. A convolutional LSTM (Conv-LSTM) layer is adopted.

Boundary-aware Short-term Context Aggregation

Fig. 4. Left: Boundary-aware Short-term Context Aggregation (BSCA) module. Right: The boundary-aware context alignment operation in BSCA. Here, l ∈ {1/2, 1/4, 1/8} refers to the encoding scale.
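The "boundary context" is the band of known pixels that immediately surrounds the missing region. Below is a small illustration of how such a band can be obtained and used to weight contributions from (already aligned) neighbor-frame features; this is my own sketch of the general idea, not the paper's alignment operation.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_ring(mask, width=8):
    """mask: HxW bool, True = missing. Return the ring of known pixels around the hole."""
    return binary_dilation(mask, iterations=width) & ~mask

def aggregate_neighbors(target_feat, neighbor_feats, mask, width=8):
    """Weight each aligned neighbor feature map by how well it matches the target
    on the boundary ring, then blend them inside the hole."""
    ring = boundary_ring(mask, width)
    weights = []
    for nf in neighbor_feats:                      # each nf: HxWxC, assumed already aligned
        diff = np.abs(nf[ring] - target_feat[ring]).mean()
        weights.append(np.exp(-diff))              # closer boundary context -> larger weight
    weights = np.array(weights) / (np.sum(weights) + 1e-8)
    blended = sum(w * nf for w, nf in zip(weights, neighbor_feats))
    out = target_feat.copy()
    out[mask] = blended[mask]                      # only the missing region is replaced
    return out
```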

Dynamic Long-term Context Aggregation

Sampling long-term reference frames at fixed positions ignores the motion diversity of videos, so it inevitably brings in irrelevant or even noisy information: different videos have different motion patterns (e.g., moving slowly or back and forth), which leads to different context dependencies between frames. The selected long-term reference information therefore has to be contextually relevant to the current target frame. This paper adopts a dynamic strategy to exploit long-term reference information effectively. The structure of this decoding-stage context aggregation module is shown in the figure: the feature map produced in the encoding stage is refined by 1) dynamically updating the long-term features and 2) non-local aggregation (a rough sketch follows Fig. 5).

Fig. 5. The Dynamic Long-term Context Aggregation (DLCA) module.
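A rough PyTorch sketch of that two-step refinement, with my own module and parameter names; a simple convolutional gate stands in here for the Conv-LSTM update mentioned above, and the non-local step is written as plain dot-product attention.

```python
import torch
import torch.nn as nn

class DynamicLongTermAggregation(nn.Module):
    """Toy DLCA-style refinement: keep a long-term feature state, update it
    dynamically, and refine the target feature map by non-local aggregation."""

    def __init__(self, dim=128):
        super().__init__()
        self.update_gate = nn.Conv2d(2 * dim, dim, 3, padding=1)  # stand-in for the Conv-LSTM
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def update_state(self, state, new_feat):
        # blend the running long-term features with a newly selected reference feature
        gate = torch.sigmoid(self.update_gate(torch.cat([state, new_feat], dim=1)))
        return gate * new_feat + (1 - gate) * state

    def forward(self, target_feat, state):
        # target_feat, state: (B, C, h, w)
        B, C, h, w = target_feat.shape
        q = self.q(target_feat).flatten(2).transpose(1, 2)    # (B, hw, C)
        k = self.k(state).flatten(2)                          # (B, C, hw)
        v = self.v(state).flatten(2).transpose(1, 2)          # (B, hw, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)        # non-local similarity to long-term features
        refined = (attn @ v).transpose(1, 2).reshape(B, C, h, w)
        return target_feat + refined                          # residual refinement of the encoder feature map
```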