they use only low-level image features of the target image to inform the texture transfer.



a fundamental prerequisite is to find image representations that independently model variations in the semantic image content and the style in which it is presented.



It was shown that Convolutional Neural Networks trained with sufficient labeled data on specific tasks such as object recognition learn to extract high-level image content in generic feature representations that generalise across datasets and even to other visual information processing tasks.

Next, the authors briefly describe how a CNN handles the content and the style of an image, and propose a novel method they call A Neural Algorithm of Artistic Style.


In fact, our style transfer algorithm combines a parametric texture model based on Convolutional Neural Networks with a method to invert their image representations


The authors make a few small modifications to the VGG-19 network:

  • scale the network weights so that the mean activation of each convolutional filter, over images and positions, equals one
  • replace the max pooling operation with average pooling
  • do not use any of the fully connected layers
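The pooling change in the list above can be illustrated on a single feature map. This is a minimal NumPy sketch, not the authors' code: the helper name `pool2x2` and the 4 × 4 input are hypothetical, and only non-overlapping 2 × 2 windows are handled.

```python
import numpy as np

def pool2x2(feature_map, mode="avg"):
    """Pool a 2-D feature map over non-overlapping 2 x 2 windows."""
    if mode not in ("avg", "max"):
        raise ValueError("mode must be 'avg' or 'max'")
    h, w = feature_map.shape
    # Drop a trailing odd row/column, then expose each 2 x 2 window on axes 1 and 3.
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3)) if mode == "avg" else blocks.max(axis=(1, 3))

# Hypothetical 4 x 4 feature map with values 0..15.
feature_map = np.arange(16, dtype=np.float64).reshape(4, 4)
avg_pooled = pool2x2(feature_map, "avg")   # what the modified network uses
max_pooled = pool2x2(feature_map, "max")   # what the original VGG-19 uses
```

Average pooling lets every activation in a window contribute to the output, whereas max pooling passes on only the largest one.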


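Anyone familiar with CNNs knows that an image $x$ is encoded at every layer of the network. If a convolutional layer has $N_l$ filters, it outputs $N_l$ feature maps, each of size $M_l$ (the height times the width of the feature map). The responses of layer $l$ can therefore be stored in a matrix $F^l$ of size $N_l \times M_l$, whose element $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$.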
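As a small illustration of this layout, the sketch below flattens a layer's activations into the feature matrix. It assumes the activations arrive as an array of shape (filters, height, width); the helper name `to_feature_matrix` and the toy sizes are hypothetical.

```python
import numpy as np

def to_feature_matrix(activations):
    """Flatten a (N_l, H, W) activation array into the N_l x M_l matrix F^l."""
    n_filters, height, width = activations.shape
    # Row i holds the vectorised feature map of filter i; M_l = height * width.
    return activations.reshape(n_filters, height * width)

# Hypothetical layer output: 3 filters, each producing a 2 x 4 feature map.
activations = np.arange(24, dtype=np.float64).reshape(3, 2, 4)
F = to_feature_matrix(activations)
print(F.shape)  # (3, 8)
```

Position $j$ here runs over the feature map row by row, so `F[i, j]` equals `activations[i, j // width, j % width]`.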


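  • $p$: the original image
  • $x$: the image being generated, initialised as white noise
  • $P^l$: the content features of the original image in layer $l$
  • $F^l$: the content features of the white-noise image in layer $l$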

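Gradient descent is then applied to the image $x$ to generate a synthetic image whose features match those of the original, using the following loss function: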

$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$
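The gradient with respect to the image $x$ can be computed by backpropagation, and the white-noise image $x$ is adjusted along this gradient until its feature matrix matches that of the original image.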
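The content loss and the descent loop can be sketched in NumPy as follows. This is a toy under a loud assumption: the "network" is treated as the identity, so updating the feature matrix stands in for updating the image, whereas a real CNN needs backpropagation to carry the gradient down to the pixels. The function names, the step size 0.5 and the random features are all hypothetical.

```python
import numpy as np

def content_loss(F, P):
    """Content loss: 0.5 * sum_ij (F_ij - P_ij)^2."""
    return 0.5 * np.sum((F - P) ** 2)

def content_loss_grad(F, P):
    """Gradient of the content loss with respect to the feature matrix F."""
    return F - P

rng = np.random.default_rng(0)
P = rng.random((3, 8))            # hypothetical content features of the original image
F = rng.standard_normal((3, 8))   # features of the white-noise image

# Toy assumption: identity network, so F is updated directly.
initial = content_loss(F, P)
for _ in range(30):
    F = F - 0.5 * content_loss_grad(F, P)
final = content_loss(F, P)
```

With this step size each update halves the gap between `F` and `P`, so the loss shrinks towards zero and the two feature matrices end up matching.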


higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction very much


In contrast, reconstructions from the lower layers simply reproduce the exact pixel values of the original image




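To obtain a representation of the style, the authors use a feature space designed to capture texture information. This feature space can describe style because it contains the correlations between the different filter responses within a layer: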

This feature space can be built on top of the filter responses in any layer of the network. It consists of the correlations between the different filter responses


$$G^l_{ij} = \sum_{k} F^l_{ik} F^l_{jk}$$
Here $G^l_{ij}$ is the correlation (inner product) between the vectorised $i$-th and $j$-th feature maps in layer $l$, i.e. one part of the style representation.
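The Gram matrix defined above is a single matrix product of the feature matrix with its own transpose. A minimal NumPy sketch, using a hypothetical 2 × 3 feature matrix:

```python
import numpy as np

def gram_matrix(F):
    """G_ij = sum_k F_ik * F_jk for a feature matrix F of shape (N_l, M_l)."""
    return F @ F.T

# Hypothetical layer with 2 filters and 3 spatial positions.
F = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
G = gram_matrix(F)               # [[5, 2], [2, 10]]

# Reordering the spatial positions (columns) leaves the Gram matrix unchanged.
G_shuffled = gram_matrix(F[:, ::-1])
```

Because the sum runs over all positions $k$, the Gram matrix discards where in the image a response occurred, which is why it describes texture rather than layout.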

Combining the feature correlations of several layers yields the overall style representation of the image:

By including the feature correlations of multiple layers, we obtain a stationary, multi-scale representation of the input image.

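As with the content reconstruction, the authors first initialise a white-noise image $x$. Let $a$ denote the style image, and let $G^l$ and $A^l$ denote the style representations in layer $l$ of the white-noise image and the style image, respectively. Gradient descent then moves the style representation of the white-noise image towards that of the style image, using the following loss function: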

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

$$\mathcal{L}_{style}(a, x) = \sum_{l=0}^{L} w_l E_l$$
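The two formulas can be sketched in NumPy as below. The function names, the choice of two layers, the layer sizes and the equal weights are assumptions made for illustration only; the style image's Gram matrices are assumed to have been computed beforehand.

```python
import numpy as np

def gram_matrix(F):
    """G_ij = sum_k F_ik * F_jk."""
    return F @ F.T

def layer_style_loss(F, A):
    """E_l = sum_ij (G_ij - A_ij)^2 / (4 * N_l^2 * M_l^2)."""
    n_filters, n_positions = F.shape
    G = gram_matrix(F)
    return np.sum((G - A) ** 2) / (4.0 * n_filters**2 * n_positions**2)

def style_loss(features, style_grams, weights):
    """L_style = sum_l w_l * E_l over the chosen layers."""
    return sum(w * layer_style_loss(F, A)
               for F, A, w in zip(features, style_grams, weights))

rng = np.random.default_rng(0)
# Hypothetical features from two layers: (N_l, M_l) = (4, 16) and (8, 4).
style_features = [rng.random((4, 16)), rng.random((8, 4))]
noise_features = [rng.standard_normal((4, 16)), rng.standard_normal((8, 4))]
style_grams = [gram_matrix(F) for F in style_features]
weights = [0.5, 0.5]  # assumed equal weighting of the two layers

loss_noise = style_loss(noise_features, style_grams, weights)
loss_self = style_loss(style_features, style_grams, weights)  # exactly 0.0
```

The loss is zero when the generated image's Gram matrices equal those of the style image, and positive otherwise.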