
MyDLNote-Inpainting: 2020 ECCV VCNet: A Robust Approach to Blind Image Inpainting 盲影象補全

2020 ECCV VCNet: A Robust Approach to Blind Image Inpainting

Keywords: Blind image inpainting · visual consistency · spatial normalization · generative adversarial networks

[paper]

Abstract

Blind inpainting is a task to automatically complete visual contents without specifying masks for missing areas in an image. Previous work assumes known missing-region-pattern, limiting the application scope.

This explains what blind image inpainting means, and the problem with the conventional setting where the missing-region pattern is given.

We instead relax the assumption by defining a new blind inpainting setting, making training a neural system robust against various unknown missing region patterns. Specifically, we propose a two-stage visual consistency network (VCN) to estimate where to fill (via masks) and generate what to fill. In this procedure, the unavoidable potential mask prediction errors lead to severe artifacts in the subsequent repairing. To address it, our VCN predicts semantically inconsistent regions first, making mask prediction more tractable. Then it repairs these estimated missing regions using a new spatial normalization, making VCN robust to mask prediction errors. Semantically convincing and visually compelling content can be generated.

What this paper does: without a given missing region, automatically identify the areas of an image that need inpainting, mark them accurately as the missing region, and complete them with high quality.

Specifically, a two-stage visual consistency network (VCN) is proposed:

1. Estimate where to fill: predict the semantically inconsistent regions;

2. Generate what to fill: repair the estimated missing regions with a new spatial normalization.

Extensive experiments show that our method is effective and robust in blind image inpainting. And our VCN allows for a wide spectrum of applications.

Introduction

Paragraph 1 poses the problem:

We note the requirement of having accurate masks makes it difficult to be practical in several scenarios where masks are not available, e.g., graffiti and raindrop removal (Fig. 1). Users need to carefully locate corrupted regions manually, where inaccurate masks may lead to inferior results. We in this paper analyze blind inpainting that automatically finds pixels to complete, and propose a suitable solution based on image context understanding.

The problem: in real applications, the regions to be completed are not marked in advance, as the figure below shows. Users have to mark corrupted regions manually and carefully, which is not only tedious but also not necessarily accurate.

Paragraph 2, differences from existing methods:

Existing work [3, 24] on blind inpainting assumes that the missing areas are filled with constant values or Gaussian noise. Thus the corrupted areas can be identified easily and almost perfectly based on noise patterns. This oversimplified assumption could be problematic when corrupted areas are with unknown content. To improve the applicability, we relax the assumption and propose the versatile blind inpainting task. We solve it by taking deeper semantics of the input image into overall consideration and detecting more semantically meaningful inconsistency based on the context in contrast to previous blind inpainting.

現有的盲補全的演算法假設缺失的區域用定值或高斯噪聲填充。因此,可以很容易並完美地識別汙染區域的噪聲模式。當損壞的區域包含未知內容時,這種過於簡單的假設可能會產生問題。為了提高其適用性,本文的模型放寬了假設,提出了通用的盲補全任務。本文的解決方法是綜合考慮輸入影象更深層次的語義,並基於上下文檢測更多語義意義上的不一致性,即盲補全的區域不再是簡單的定值或高斯噪聲形式,而是多樣的、與正常背景語義不連續的區域。

Paragraph 3, difficulty and strategy:

Note that blind inpainting without assuming the damage patterns is highly ill-posed. This is because the unknown degraded regions need to be located based on their difference from the intact ones instead of their known characteristics, and the uncertainties in this prediction make the further inpainting challenging.

We address it in two aspects, i.e., a new data generation approach and a novel network architecture.

The difficulty of predicting the regions to complete: the unknown degraded regions must be located by their difference from the intact regions rather than by known characteristics of their own, and the uncertainty of this prediction makes the subsequent inpainting challenging.

The paper's strategy tackles this from two directions: a new sample-generation method that produces diverse degraded regions, and a new image-inpainting network.

Paragraph 4, on training-data generation:

For training data collection, if we only take common black or noise pixels in damaged areas as input, the network may detect these patterns as features instead of utilizing the contextual semantics as we need. In this scenario, the damage for training should be diverse and complicated enough so that the contextual inconsistency instead of the pattern in damage can be extracted. Our first contribution, therefore, is the new strategy to generate diverse training data where natural images are adopted as the filling content with random strokes.

The paper's first contribution is a new way to generate blind-inpainting training samples, aiming for degraded regions diverse and complicated enough that the network, when predicting the missing region, learns to recognize semantically inconsistent regions rather than memorizing damage patterns.

Paragraph 5, on the inpainting model:

For model design, our framework consists of two stages of mask prediction and robust inpainting.

A discriminative model is used to conduct binary pixelwise classification to predict inconsistent areas.

With the mask estimated, we use it to guide the inpainting process. Though this framework is intuitive, its specific designs to address the biggest issue in this framework are non-trivial: how to neutralize the generation degradation brought by inevitable mask estimation errors in the first stage. To cope with this challenge, we propose a probabilistic context normalization (PCN) to spatially transfers contextual information in different neural layers, enhancing information aggregation of the inpainting network based on the mask prediction probabilities.

We experimentally validate that it outperforms other existing approaches exploiting masks, e.g., concatenating mask with the input image and using convolution variants (like Partial Convolution [22] or Gated Convolution [44]) to employ masks, in evaluation.

The model has two parts: mask estimation and inpainting.

Mask estimation: a discriminative model performs binary pixel-wise classification to predict inconsistent regions.

Inpainting: the estimated mask guides the completion process. Although this framework is intuitive, the design that addresses its biggest issue is non-trivial: how to neutralize the generation degradation caused by the inevitable mask-estimation errors of the first stage (in other words, the estimated mask may be inaccurate, these errors harm image generation, and the question is how to cancel this negative effect). To cope with this, the paper proposes probabilistic context normalization (PCN), which transfers contextual information spatially across different neural layers and strengthens the inpainting network's information aggregation based on the mask prediction probabilities.

The last two paragraphs of the original Introduction state the experimental conclusions and main contributions.

Robust Blind Inpainting

Notation and task description:

For this task, the input is only a degraded image {I} \in R^{h\times w\times c} (contaminated by unknown visual signals), and the output is expected to be a plausible image \widehat{O} \in R^{h\times w\times c} , approaching ground truth {O} \in R^{h\times w\times c} of I. The degraded image I in the blind inpainting setting is formulated as

I = O \odot (1-M) + N \odot M, (1)

where {M} \in R^{h\times w\times 1} is a binary region mask (with value 0 for known pixels and 1 otherwise), and {N} \in R^{h\times w\times c} is a noisy visual signal. \odot is the Hadamard product operator. Given I, we predict \widehat{O} (an estimate of O) with latent variables M and N. Also, Eq. (1) is the means to produce training tuples \langle I_i, O_i, M_i, N_i \rangle_{i=1,...,m}.

As Eq. (1) shows, training requires the dataset to provide, for each sample: the degraded image I, the image ground truth O, the degraded-region ground truth M, and the filling signal N inside that region. M says where the image is degraded; N says what the degradation is.

N indicates what and M indicates where
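The composition in Eq. (1) is a one-liner; below is a minimal NumPy sketch with toy shapes, where the random O and N merely stand in for a ground-truth image and a real-image filling patch:

```python
import numpy as np

h, w, c = 64, 64, 3
O = np.random.rand(h, w, c)      # ground-truth image (toy data)
N = np.random.rand(h, w, c)      # noisy filling signal (a real-image patch in the paper)
M = np.zeros((h, w, 1))          # binary mask: 1 = corrupted, 0 = known
M[16:48, 16:48] = 1.0

# Eq. (1): I = O ⊙ (1 - M) + N ⊙ M  (the mask broadcasts over channels)
I = O * (1.0 - M) + N * M

# Known pixels keep O; masked pixels come from N
assert np.allclose(I[0, 0], O[0, 0])
assert np.allclose(I[32, 32], N[32, 32])
```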

Training Data Generation

This section presents how degraded images are generated; specifically, how the filling signal N and the region M are produced. First, an important design principle:

The key for defining N is to make it indistinguishable as much as possible from I on image pattern, so that the model cannot decide if a local patch is corrupted without seeing the image context. Then a neural system trained with such data has the potential to work on unknown contamination.

The key to defining N is to make it as indistinguishable as possible from I in image pattern, so that the model cannot decide whether a local patch is corrupted without seeing the image context. A neural system trained on such data then has the potential to work on unknown contamination.

Then the concrete methods:

In this paper, we use real-world image patches to form N. This ensures that local patches between N and I are indistinguishable, enforcing the model to draw an inference based on contextual information, which eventually improves the generalization ability for real-world data.

Further, we alleviate any priors introduced by M in training via employing free-form strokes [44]. Existing blind or non-blind inpainting methods often generate the arbitrary size of a rectangle or text-shaped masks. However, this is not suitable for our task, because it may encourage the model to locate the corrupted part based on the rectangle shape. Free-form masks can largely diversify the shape of masks, making the model harder to infer corrupted regions with shape information.

Also, we note that direct blending image O and N using Eq. (1) would lead to noticeable edges, which are strong indicators to distinguish among noisy areas. This will inevitably sacrifice the semantic understanding capability of the used model. Thus, we dilate the M into \widetilde{M} by the iterative Gaussian smoothing in [37] and employ alpha blending in the contact regions between O and N.

1. N is formed from real image patches. Given only a local patch, N and I cannot be distinguished; this forces the model to infer from contextual information, which ultimately improves generalization to real-world data.

2. For generating M, any prior is removed: shape, position, and size are completely arbitrary. Free-form masks greatly diversify the shape of M, making it hard for the model to infer corrupted regions from shape information.

3. Directly filling N into M leaves noticeable boundaries. The paper uses Gaussian smoothing and alpha blending to join O and N over the region given by M.
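A sketch of step 3, using an iterated 3x3 box blur as a stand-in for the iterative Gaussian smoothing of [37] (the paper's actual smoothing and dilation parameters are not reproduced here):

```python
import numpy as np

def smooth_mask(M, iters=4):
    """Soften and dilate a binary mask M by repeated 3x3 box blurring.
    Stand-in for the iterative Gaussian smoothing of [37]."""
    Mt = M.astype(float)
    for _ in range(iters):
        P = np.pad(Mt, 1, mode="edge")
        Mt = sum(P[di:di + Mt.shape[0], dj:dj + Mt.shape[1]]
                 for di in range(3) for dj in range(3)) / 9.0
    return np.maximum(Mt, M)  # dilation: never shrink the original mask

h, w = 64, 64
O = np.random.rand(h, w, 3)              # toy ground truth
N = np.random.rand(h, w, 3)              # toy filling patch
M = np.zeros((h, w)); M[20:44, 20:44] = 1.0

Mt = smooth_mask(M)[..., None]           # soft, dilated \tilde{M} in [0, 1]
I = O * (1.0 - Mt) + N * Mt              # alpha blending at the contact region
```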

Inpainting Method: Visual Consistent Network (VCN)

The first two paragraphs serve as an overview.

VCN has two sub-modules, i.e. Mask Prediction Network (MPN) and Robust Inpainting Network (RIN). MPN is to predict potential visually inconsistent areas of a given image, while RIN is to inpaint inconsistent parts based on the predicted mask and the image context. Note that these two submodules are correlated. MPN provides an inconsistency mask \hat{M}\in R^{h\times w\times 1} , where \hat{M}_p \in [0, 1], helping RIN locate inconsistent regions. On the other hand, by leveraging local and global semantic context, RIN largely regularizes MPN, enforcing it to focus on these regions instead of simply fitting our generated data.

VCN has two modules:

1. Mask prediction network (MPN): predicts the visually inconsistent regions;

2. Robust inpainting network (RIN): inpaints the inconsistent regions.

MPN and RIN are correlated:

The inconsistency mask \hat{M} estimated by MPN helps RIN localize. Conversely, since the image completed by RIN must be semantically correct both locally and globally, this requirement strongly regularizes MPN's estimates.

Our proposed VCN is robust to blind image inpainting in the given relativistic generalized setting. Its robustness is shown in two aspects. MPN of VCN can predict the regions to be repaired with decent performance even the contamination patterns are new to the trained model. More importantly, RIN of VCN synthesizes plausible and convincing visual content for the predicted missing regions, robust against mask prediction errors.

A question: the training set provides a wide variety of filling content, but if the filling content N in a test image never appeared in the training set, does the blind inpainting algorithm remain stable?

The answer is yes. MPN predicts the mask region from semantic inconsistency, not from the content itself. Meanwhile RIN, by generating convincing content, pushes the mask prediction to become more accurate.

First, the MPN network:

  • Mask Prediction Network (MPN)

MPN learns the mapping F from the degraded image I to the binary mask \hat{M}; M is the given ground-truth mask region.

The first technique used is a self-adaptive loss function:

Here \tau balances the proportion between the non-degraded region 1-M and the degraded region M. The first term of Eq. (2) computes a cross-entropy term between M and \hat{M}; the second term is analogous.
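Eq. (2) itself is not reproduced in this note; the following is a hedged sketch of a τ-balanced binary cross-entropy of the kind described (the paper's exact weighting may differ):

```python
import numpy as np

def self_adaptive_bce(M, M_hat, tau=0.5, eps=1e-7):
    """Sketch of a tau-balanced BCE in the spirit of Eq. (2):
    tau weights the degraded region M, (1 - tau) the clean region 1 - M.
    The exact formulation in the paper may differ."""
    pos = -tau * M * np.log(M_hat + eps)                       # term on M
    neg = -(1.0 - tau) * (1.0 - M) * np.log(1.0 - M_hat + eps)  # term on 1 - M
    return float(np.mean(pos + neg))

M = np.zeros((8, 8)); M[2:6, 2:6] = 1.0
good = self_adaptive_bce(M, np.clip(M, 0.01, 0.99))  # near-perfect prediction
bad = self_adaptive_bce(M, np.full_like(M, 0.5))     # uninformative prediction
assert good < bad
```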

Two points about this model deserve attention:

Note that \hat{M} is an estimated soft mask where 0 ≤ \hat{M}_p ≤ 1 for ∀p, although we employ a binary version for M in Eq. (1). It means the damaged pixels are not totally abandoned in the following inpainting process. The softness of \hat{M} enables the differentiability of the whole network. Additionally, it lessens error accumulation caused by pixel misclassification, since pixels whose status (damaged or not) MPN is uncertain about are still utilized in the later process.

Note: the predicted \hat{M} is a soft mask, taking values between 0 and 1, rather than the usual hard mask that is either 0 or 1.

The softness of \hat{M} makes the whole network differentiable. It also reduces the error accumulated from pixel misclassification, since pixels whose status (damaged or not) MPN is uncertain about are still used in later processing.

Note that the objective of MPN is to detect all corrupted regions. Thus it tends to predict large corrupted regions for an input corrupted image, which is shown in Fig. 3(c). As a result, it makes the subsequent inpainting task too difficult to achieve. To make the task more tractable, we instead propose to detect the inconsistency region of the image, as shown in Fig. 3(d), which is much smaller. If these regions are correctly detected, other corrupted regions can be naturally blended to the image, leading to realistic results. In the following, we show that by jointly learning MPN with RIN, the MPN eventually locates inconsistency regions instead of all corrupted ones.

Note: MPN's objective is to detect all corrupted regions, so it tends to predict an oversized corrupted region for a corrupted input, as in figure (c) below.

The fix proposed here is to detect only the locally inconsistent regions, as in figure (d) below, which are much smaller; this is achieved by jointly training RIN and MPN.

Next, the RIN network:

  • Robust Inpainting Network (RIN)

RIN's goal is the mapping that, given the degraded image I and the region indicated by the predicted mask, generates O.

RIN is structured in an encoder-decoder fashion with probabilistic contextual blocks (PCB). PCB is a residual block variant armed with a new normalization (Fig. 4), incorporating spatial information with the predicted mask.

The RIN module contains probabilistic contextual blocks (PCB).

First question: why propose and use probabilistic context normalization (PCN)?

With the predicted mask \hat{M}, repairing corrupted regions requires knowledge inference from context, and being skeptical to the mask for error propagation from the previous stage. A naive solution is to concatenate the mask with the image and feed them to a network. However, this way captures context semantics only in deeper layers, and does not consider the mask prediction error explicitly. To improve contextual modeling and minimize mask error propagation, it would be better if the transfer is done in all building blocks, driven by the estimated mask confidence. Hence, we propose a probabilistic context normalization (PCN, Fig. 4) to transfer contextual information in different layers.

The concern is that mask prediction errors propagate into the content-completion task. The solution is to let contextual information transfer happen in every building block, driven by the estimated mask confidence. As the figure below shows, the predicted mask is fed into every layer of RIN; when RIN is optimized, its error also back-propagates into MPN, which in turn improves the accuracy of MPN's predictions.

The full formulation of PCN follows:

Our PCN module is composed of the context feature transfer term and the feature preserving term. The former transfers mean and variance from known features to unknown areas, both indicated by the estimated soft mask \hat{M} (H defined below is its downsampled version). It is a learnable convex combination of feature statistics from the predicted known areas and unknown ones. The feature preserving term keeps the features in the known areas (of high confidence) intact. The formulation of PCN is given as

PCN(X) = [\beta \mathcal{T}(X_P, X_Q) + (1-\beta) X_P] \odot H + X \odot \bar{H}, (3)

and the operator \mathcal{T}(·) is to conduct instance internal statistics transfer as

\mathcal{T}(X_P, X_Q) = \sigma(X_Q, \bar{H}) \cdot \frac{X_P - \mu(X_P, H)}{\sigma(X_P, H)} + \mu(X_Q, \bar{H}), (4)

where X is the input feature map of PCN, and H is nearest-neighbor downsampled from \hat{M}, which shares the same height and width with X. \bar{H} = 1- H indicates the regions that MPN considers clean. X_P = X\odot H and X_Q = X\odot \bar{H} . β is a learnable channel-wise vector (\beta \in R^{1\times 1\times c} and β ∈ [0, 1]) computed from X by a squeeze-and-excitation module [12] as

\beta = f(\bar{x}), (5)

where \bar{x} \in R^{1\times 1\times c} is also a channel-wise vector computed by average pooling X, and f(·) is the excitation function composed of two fully-connected layers with activation functions (ReLU and Sigmoid, respectively).
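The β computation of Eq. (5) is a standard squeeze-and-excitation pipeline; here is a toy NumPy sketch with made-up weights and a channel-reduction factor of 2 (an assumption — the paper's reduction ratio is not stated here):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 8                                                          # toy channel count
W1, b1 = rng.normal(size=(c, c // 2)) * 0.1, np.zeros(c // 2)  # made-up FC weights
W2, b2 = rng.normal(size=(c // 2, c)) * 0.1, np.zeros(c)

def excitation(x_bar):
    """f(.): two fully-connected layers, ReLU then Sigmoid, mapping x_bar to beta."""
    h = np.maximum(x_bar @ W1 + b1, 0.0)          # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # Sigmoid keeps beta in (0, 1)

X = rng.random((16, 16, c))
x_bar = X.mean(axis=(0, 1))   # global average pooling over height and width
beta = excitation(x_bar)      # channel-wise beta, as used in Eq. (3)
```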

µ(·, ·) and σ(·, ·) in Eq. (4) compute the weighted average and standard deviation respectively in the following manner:

\mu(Y, T) = \frac{\sum_{i,j} Y_{i,j} T_{i,j}}{\sum_{i,j} T_{i,j} + \epsilon}, \quad \sigma(Y, T) = \sqrt{\frac{\sum_{i,j} (Y_{i,j} - \mu(Y, T))^2 T_{i,j}}{\sum_{i,j} T_{i,j} + \epsilon}}, (6)

where Y is a feature map, T is a soft mask with the same size as Y, and \epsilon is a small positive constant. i and j are the indexes of height and width, respectively.
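These weighted statistics translate directly to NumPy; with an all-ones mask they reduce to the plain per-channel mean and standard deviation:

```python
import numpy as np

EPS = 1e-5  # the small positive constant in the denominators

def weighted_mean(Y, T):
    """mu(Y, T): spatial average of feature map Y weighted by soft mask T (per channel)."""
    return (Y * T).sum(axis=(0, 1)) / (T.sum(axis=(0, 1)) + EPS)

def weighted_std(Y, T):
    """sigma(Y, T): standard deviation of Y under the same mask weighting."""
    mu = weighted_mean(Y, T)
    var = (((Y - mu) ** 2) * T).sum(axis=(0, 1)) / (T.sum(axis=(0, 1)) + EPS)
    return np.sqrt(var + EPS)

Y = np.random.rand(16, 16, 8)   # toy feature map
T = np.ones((16, 16, 1))        # all-ones mask -> plain mean/std
assert np.allclose(weighted_mean(Y, T), Y.mean(axis=(0, 1)), atol=1e-3)
```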

This long passage simply gives the mathematical form of PCN.

Eq. (5) shows that \beta is produced by an SE module (i.e., a channel attention module).

An interpretation of PCN:

Prior work [8, 15] showed that feature mean and variance from an image are related to its semantics and texture. The feature statistics propagation by PCN helps regenerate inconsistent areas by leveraging contextual mean and variance. This is intrinsically different from existing methods that implicitly achieve this goal in deep layers, as we explicitly accomplish it in each building block. Thus PCN is beneficial to the learning and performance of blind inpainting. More importantly, RIN keeps robust considering potential errors in \hat{M} from MPN, although RIN is guided by \hat{M} for repairing.

Prior work has shown that the feature mean and variance of an image relate to its semantics and texture. The contextual mean and variance propagated by PCN thus help regenerate the inconsistent regions.
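Putting the pieces together, a minimal NumPy sketch of one PCN forward pass; β is fixed to a constant here instead of being produced by the SE module of Eq. (5), which is a simplification:

```python
import numpy as np

EPS = 1e-5  # small constant from Eq. (6)

def wmean(Y, T):
    """mu(Y, T): spatial average of Y weighted by soft mask T (per channel)."""
    return (Y * T).sum(axis=(0, 1)) / (T.sum(axis=(0, 1)) + EPS)

def wstd(Y, T):
    """sigma(Y, T): weighted standard deviation under the same mask."""
    mu = wmean(Y, T)
    return np.sqrt((((Y - mu) ** 2) * T).sum(axis=(0, 1)) / (T.sum(axis=(0, 1)) + EPS) + EPS)

def pcn(X, H, beta=0.5):
    """Context transfer on the predicted-corrupted region H, feature preserving
    on the clean region 1 - H. beta is a constant here, not the SE output."""
    Hbar = 1.0 - H
    XP, XQ = X * H, X * Hbar
    # T(X_P, X_Q): re-normalize masked features with clean-region statistics
    transferred = wstd(XQ, Hbar) * (XP - wmean(XP, H)) / (wstd(XP, H) + EPS) + wmean(XQ, Hbar)
    mixed = beta * transferred + (1.0 - beta) * XP
    return mixed * H + X * Hbar

X = np.random.rand(16, 16, 4)                  # toy feature map
H = np.zeros((16, 16, 1)); H[4:12, 4:12] = 1.0 # downsampled mask (binary toy example)
Y = pcn(X, H)
assert np.allclose(Y * (1 - H), X * (1 - H))   # known features pass through intact
```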

Second question: the overall training objective.

Four loss terms are used in total:

L1 loss: the mean absolute error between the completed image and the ground-truth image;

VGG loss: needs no further explanation;

ID-MRF loss:

ID-MRF loss [37, 25] is employed as our texture consistency term. It computes the sum of the patch-wise difference between neural patches from the generated content and those from the corresponding ground truth using a relative similarity measure. It enhances generated image details by minimizing discrepancy with its most similar patch from the ground truth.

It computes the sum of patch-wise differences between neural patches of the generated content and those of the corresponding ground truth under a relative similarity measure, enhancing the generated detail by minimizing the discrepancy with the most similar ground-truth patch.

Adversarial loss: WGAN-GP is used.

For the adversarial term, WGAN-GP [10, 1] is adopted as

\mathcal{L}_{adv} = -\mathbb{E}_{I \sim \mathbb{P}_I}[D(\widehat{O})],

where \mathbb{P} denotes data distribution, and D is a discriminator for the adversarial training. Its corresponding learning objective for the discriminator is given as
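As a sanity check of the WGAN-GP objectives, here is a toy NumPy sketch with a linear critic D(x) = w·x, whose input gradient is w everywhere so the gradient penalty is computable in closed form (λ = 10 is the usual WGAN-GP choice; all names here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)                 # toy linear critic D(x) = w . x (illustrative)
D = lambda x: x @ w

real = rng.normal(size=(32, 8))        # stands in for ground-truth images O
fake = rng.normal(size=(32, 8))        # stands in for generated images \hat{O}

# Critic (discriminator) Wasserstein term: E[D(fake)] - E[D(real)]
d_loss = D(fake).mean() - D(real).mean()

# Gradient penalty on random interpolates; for a linear critic, grad_x D(x) = w
eps = rng.uniform(size=(32, 1))
interp = eps * real + (1.0 - eps) * fake
grad_norm = np.linalg.norm(np.broadcast_to(w, interp.shape), axis=1)
gp = ((grad_norm - 1.0) ** 2).mean()

lambda_gp = 10.0                       # common WGAN-GP weight (assumption)
d_total = d_loss + lambda_gp * gp      # discriminator objective
g_adv = -D(fake).mean()                # generator adversarial term
```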