Paper: Generating Images with Perceptual Similarity Metrics based on Deep Networks
Main content: The paper starts from the observation that images generated by conventional deep networks tend to be blurry, and analyzes the cause: image details are not all preserved in the features, so a loss in image space tends to average over all likely locations of those details, which makes the generated image blurry. However, the exact locations of the details are not important; what matters is the distribution of these details. The goal is therefore a model that, by measuring distances in a suitable feature space, is invariant to irrelevant transformations yet sensitive to local image statistics. The difficulty is that feature representations are contractive: many images, including fake images, can map to the same feature vector. The overall remedy is to add a natural image prior and further loss terms. (Many papers in semi-supervised learning and image generation follow this pattern: first point out an intuitive phenomenon or problem, then offer their own analysis; even when the underlying cause is essentially the same as in the works they cite, the authors phrase the explanation somewhat differently, which gives the analysis some novelty, and the actual fix still comes down to modifying or adding loss terms.)
Important sentences:
1. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric reflects perceptual similarity of images much better and, thus, leads to better results.
2. We demonstrate two examples of use cases of the proposed loss: (1) networks that invert the AlexNet convolutional network; (2) a modified version of a variational autoencoder that generates realistic high-resolution random images.
3. The precise location of all details is not preserved in the features. A loss in image space leads to averaging all likely locations of details, hence the reconstruction looks blurry. (This is the analysis of the phenomenon.) However, exact locations of all fine details are not important for perceptual similarity of images. What is important is the distribution of these details.
4. Our main insight is that invariance to irrelevant transformations and sensitivity to local image statistics can be achieved by measuring distances in a suitable feature space. (This is the solution idea.) The more recent semi-supervised papers I have read so far all tend to work in the latent space; does that correspond to the feature representation here? In the end, how to find effective features still seems to play a crucial role in the image domain, and a large part of why CNNs work so well is their effective feature selection.
5. In fact, convolutional networks provide a feature representation with desirable properties. They are invariant to small, smooth deformations but sensitive to perceptually important image properties, like salient edges and textures. Downsampling by itself already provides some local translation and rotation invariance.
6. Since feature representations are typically contractive, feature similarity does not automatically mean image similarity. (This explanation is somewhat similar to the cause of missing modes in GANs: there, most of the modes may be steered by the high values of D because G still performs poorly at that stage, so most images end up corresponding to similar modes, producing the missing-mode phenomenon, and the generated images are blurry because they lack diversity.) In practice this leads to high-frequency artifacts. To force the network to generate realistic images, we introduce a natural image prior based on adversarial training, as proposed by Goodfellow et al. [1]. We train a discriminator network to distinguish the output of the generator from real images based on local image statistics. A combination of similarity in an appropriate feature space with adversarial training yields the best results. (See the combined-loss sketch after this list.)
7. These go beyond simple distances in image space and can capture complex and perceptually important properties of images.
8. Loss in feature space. Given a differentiable comparator C, C may be fixed or may be trained; for example, it can be a part of the generator or the discriminator. L_feat alone does not provide a good loss for training. It is known (Mahendran & Vedaldi, 2015) that optimizing just for similarity in the feature space typically leads to high-frequency artifacts. This is because for each natural image there are many non-natural images mapped to the same feature vector (unless the feature representation is specifically designed to map natural and non-natural images far apart, such as the one extracted from the discriminator of a GAN). Therefore, a natural image prior is necessary to constrain the generated images to the manifold of natural images. (A formula for L_feat and two small sketches follow below.)
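The feature loss mentioned in sentence 8 can be written compactly. A plausible write-up, assuming the paper's notation of a generator G with inputs x_i, target images y_i, and a differentiable comparator C:

```latex
% Feature-space loss with a differentiable comparator C (notation assumed for illustration)
\mathcal{L}_{\mathrm{feat}} = \sum_i \big\lVert C\big(G(x_i)\big) - C(y_i) \big\rVert_2^2
```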
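A minimal sketch of what measuring distance in a feature space instead of image space (sentences 1 and 8) can look like in practice. This is not the paper's code; the use of PyTorch, of torchvision's pretrained AlexNet, and of its final convolutional feature map as the comparator layer are all assumptions for illustration.

```python
import torch
import torchvision.models as models

# Fixed comparator C: the convolutional part of a pretrained AlexNet (an assumed choice).
extractor = models.alexnet(pretrained=True).features.eval()
for p in extractor.parameters():
    p.requires_grad_(False)  # C is frozen; gradients would flow only into the images

def feature_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between deep features of two image batches (N, 3, 224, 224)."""
    return ((extractor(img_a) - extractor(img_b)) ** 2).mean()

# Usage with random tensors (real inputs should be ImageNet-normalized images):
print(feature_distance(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)))
```

The intended behavior is that a small spatial shift changes the features very little, while removing edges and texture (e.g., blurring) changes them a lot, which is the sensitivity pattern described in sentences 4 and 5.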
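The combination referenced in sentence 6 (image-space loss + feature-space loss + adversarial natural-image prior) could be sketched as below. The network arguments, the BCE-based GAN objective, and the weighting factors are assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def generator_loss(generated, target, comparator, discriminator,
                   w_img=1.0, w_feat=1.0, w_adv=0.01):
    # Image-space term: keeps the low-frequency content in place.
    loss_img = F.mse_loss(generated, target)
    # Feature-space term: distance measured by the comparator C.
    loss_feat = F.mse_loss(comparator(generated), comparator(target))
    # Adversarial term: the generator tries to make D label its output as real.
    d_fake = discriminator(generated)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return w_img * loss_img + w_feat * loss_feat + w_adv * loss_adv

def discriminator_loss(discriminator, generated, real):
    # Standard GAN discriminator objective: real -> 1, generated -> 0.
    d_real = discriminator(real)
    d_fake = discriminator(generated.detach())
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
```

Here the adversarial term plays the role of the natural image prior: it pushes the generator away from the non-natural images that happen to share the same features, while the feature term supplies the perceptual similarity signal.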