
[GAN Paper 01] Translation: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Published as a conference paper at ICLR 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen

NVIDIA and Aalto University

Part I: Paper Translation

ABSTRACT

We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CELEBA images at 1024². We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CELEBA dataset.

Abstract:

We describe a new training methodology for GANs. The key idea is to let both the generator and discriminator grow progressively: starting from a low resolution, we keep adding new layers during training so that the model captures increasingly fine details. This both speeds up training and makes it more stable, and the generated images are of unprecedented quality, e.g., CELEBA images at 1024×1024. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 on the unsupervised CIFAR10 dataset. In addition, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we propose a new metric for evaluating GAN results, covering both image quality and variation. As an additional contribution, we construct a higher-quality version of the CELEBA dataset.

1 INTRODUCTION

Generative methods that produce novel samples from high-dimensional data distributions, such as images, are finding widespread use, for example in speech synthesis (van den Oord et al., 2016a), image-to-image translation (Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017), and image inpainting (Iizuka et al., 2017). Currently the most prominent approaches are autoregressive models (van den Oord et al., 2016b;c), variational autoencoders (VAE) (Kingma & Welling, 2014), and generative adversarial networks (GAN) (Goodfellow et al., 2014). Currently they all have significant strengths and weaknesses. Autoregressive models – such as PixelCNN – produce sharp images but are slow to evaluate and do not have a latent representation as they directly model the conditional distribution over pixels, potentially limiting their applicability. VAEs are easy to train but tend to produce blurry results due to restrictions in the model, although recent work is improving this (Kingma et al., 2016). GANs produce sharp images, albeit only in fairly small resolutions and with somewhat limited variation, and the training continues to be unstable despite recent progress (Salimans et al., 2016; Gulrajani et al., 2017; Berthelot et al., 2017; Kodali et al., 2017). Hybrid methods combine various strengths of the three, but so far lag behind GANs in image quality (Makhzani & Frey, 2017; Ulyanov et al., 2017; Dumoulin et al., 2016).

1 Introduction

Generative methods that produce novel samples from high-dimensional data distributions such as images are finding widespread use, for example in speech synthesis (van den Oord et al., 2016a), image-to-image translation (Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017), and image inpainting (Iizuka et al., 2017). Currently the most prominent approaches are autoregressive models (van den Oord et al., 2016b;c), variational autoencoders (VAE) (Kingma & Welling, 2014), and GANs (Goodfellow et al., 2014). At present they all have significant strengths and weaknesses. Autoregressive models – such as PixelCNN – produce sharp images but are slow to evaluate, and they have no latent representation because they model the conditional distribution over pixels directly, which potentially limits their applicability. VAEs are easy to train but, due to restrictions in the model, tend to produce blurry results, although recent work is improving this (Kingma et al., 2016). GANs produce sharp images, albeit only at fairly small resolutions and with somewhat limited variation, and despite recent progress (Salimans et al., 2016; Gulrajani et al., 2017) their training remains unstable. Hybrid methods combine strengths of these three approaches, but so far still lag behind GANs in image quality (Makhzani & Frey, 2017).

Typically, a GAN consists of two networks: generator and discriminator (aka critic). The generator produces a sample, an image, from a latent code, and the distribution of these images should ideally be indistinguishable from the training distribution. Since it is generally infeasible to engineer a function that tells whether that is the case, a discriminator network is trained to do the assessment, and since networks are differentiable, we also get a gradient we can use to steer both networks to the right direction. Typically, the generator is of main interest – the discriminator is an adaptive loss function that gets discarded once the generator has been trained.

Typically, a GAN consists of two networks: a generator and a discriminator (aka critic). The generator produces a sample, e.g., an image, from a latent code, and the distribution of these generated images should ideally be indistinguishable from the distribution of the training images. Since it is generally infeasible to engineer a function that tells whether a sample is generated or comes from the training data, a discriminator network is trained to do this assessment, and since networks are differentiable, we also get a gradient that can be used to steer both networks in the right direction. Typically, the generator is the party of main interest – the discriminator is an adaptive loss function that is discarded once the generator has been trained.

There are multiple potential problems with this formulation. When we measure the distance between the training distribution and the generated distribution, the gradients can point to more or less random directions if the distributions do not have substantial overlap, i.e., are too easy to tell apart (Arjovsky & Bottou, 2017). Originally, Jensen-Shannon divergence was used as a distance metric (Goodfellow et al., 2014), and recently that formulation has been improved (Hjelm et al., 2017) and a number of more stable alternatives have been proposed, including least squares (Mao et al., 2016b), absolute deviation with margin (Zhao et al., 2017), and Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017). Our contributions are largely orthogonal to this ongoing discussion, and we primarily use the improved Wasserstein loss, but also experiment with least-squares loss.

There are multiple potential problems with this formulation. For example, when we measure the distance between the training distribution and the generated distribution, the gradients can point in more or less random directions if the two distributions do not have substantial overlap, i.e., are too easy to tell apart (Arjovsky & Bottou, 2017). Originally, Jensen-Shannon divergence was used as the distance metric (Goodfellow et al., 2014); recently that formulation has been improved (Hjelm et al., 2017) and a number of more stable alternatives have been proposed, including least squares (Mao et al., 2016b), absolute deviation with margin (Zhao et al., 2017), and Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017). Our contributions are largely orthogonal to this ongoing discussion; we primarily use the improved Wasserstein loss, but also experiment with the least-squares loss.

The generation of high-resolution images is difficult because higher resolution makes it easier to tell the generated images apart from training images (Odena et al., 2017), thus drastically amplifying the gradient problem. Large resolutions also necessitate using smaller minibatches due to memory constraints, further compromising training stability. Our key insight is that we can grow both the generator and discriminator progressively, starting from easier low-resolution images, and add new layers that introduce higher-resolution details as the training progresses. This greatly speeds up training and improves stability in high resolutions, as we will discuss in Section 2.

Generating high-resolution images is difficult because higher resolution makes it easier for the discriminator to tell the generated images apart from the training images (Odena et al., 2017), thus drastically amplifying the gradient problem. Large resolutions also necessitate smaller minibatches due to memory constraints, which further compromises training stability. Our key insight is that we can grow both the generator and discriminator progressively, starting from easier low-resolution images and adding new layers that introduce higher-resolution details as training progresses. This greatly speeds up training and improves stability at high resolutions, as we discuss in Section 2.

The GAN formulation does not explicitly require the entire training data distribution to be represented by the resulting generative model. The conventional wisdom has been that there is a tradeoff between image quality and variation, but that view has been recently challenged (Odena et al., 2017). The degree of preserved variation is currently receiving attention and various methods have been suggested for measuring it, including inception score (Salimans et al., 2016), multi-scale structural similarity (MS-SSIM) (Odena et al., 2017), birthday paradox (Arora & Zhang, 2017), and explicit tests for the number of discrete modes discovered (Metz et al., 2016). We will describe our method for encouraging variation in Section 3, and propose a new metric for evaluating the quality and variation in Section 5.

The GAN formulation does not explicitly require that the entire training data distribution be represented by the resulting generative model. The conventional wisdom has been that there is a tradeoff between image quality and variation, but that view has recently been challenged (Odena et al., 2017). The degree of preserved variation is currently receiving attention, and various methods have been suggested for measuring it, including the inception score (Salimans et al., 2016), multi-scale structural similarity (MS-SSIM) (Odena et al., 2017), the birthday paradox (Arora & Zhang, 2017), and explicit tests for the number of discrete modes discovered (Metz et al., 2016). We describe our method for encouraging variation in Section 3, and propose a new metric for evaluating quality and variation in Section 5.

Section 4.1 discusses a subtle modification to the initialization of networks, leading to a more balanced learning speed for different layers. Furthermore, we observe that mode collapses traditionally plaguing GANs tend to happen very quickly, over the course of a dozen minibatches. Commonly they start when the discriminator overshoots, leading to exaggerated gradients, and an unhealthy competition follows where the signal magnitudes escalate in both networks. We propose a mechanism to stop the generator from participating in such escalation, overcoming the issue (Section 4.2).

Section 4.1 discusses a subtle modification to the initialization of the networks that leads to a more balanced learning speed across different layers. Furthermore, we observe that the mode collapses that traditionally plague GANs tend to happen very quickly, over the course of a dozen minibatches. Commonly they start when the discriminator overshoots, leading to exaggerated gradients, and an unhealthy competition follows in which the signal magnitudes escalate in both networks. We propose a mechanism to stop the generator from participating in such escalation, overcoming the issue (Section 4.2).

We evaluate our contributions using the CELEBA, LSUN, and CIFAR10 datasets. We improve the best published inception score for CIFAR10. Since the datasets commonly used in benchmarking generative methods are limited to a fairly low resolution, we have also created a higher quality version of the CELEBA dataset that allows experimentation with output resolutions up to 1024 × 1024 pixels. This dataset and our full implementation are available at https://github.com/tkarras/progressive_growing_of_gans, trained networks can be found at https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U along with result images, and a supplementary video illustrating the datasets, additional results, and latent space interpolations is at https://youtu.be/XOxxPcy5Gr4.

We evaluate our contributions using the CELEBA, LSUN, and CIFAR10 datasets. For CIFAR10 we improve the best published inception score. Since the datasets commonly used for benchmarking generative methods are limited to fairly low resolutions, we have also created a higher-quality version of the CELEBA dataset that allows experimentation with output resolutions up to 1024 × 1024 pixels; we are preparing to release this dataset. Our full implementation is available at https://github.com/tkarras/progressive_growing_of_gans, the trained networks together with result images can be found at https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U, and a supplementary video illustrating the datasets, additional results, and latent space interpolations is at https://youtu.be/XOxxPcy5Gr4.

2 PROGRESSIVE GROWING OF GANS

Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks as visualized in Figure 1. This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.

2 Progressive Growing of GANs

Our primary contribution is a training methodology for GANs: we start with low-resolution images and then progressively increase the resolution by adding layers to the networks, as visualized in Figure 1. This incremental nature allows the training to first discover the large-scale structure of the image distribution and then shift attention to increasingly fine-scale detail, instead of having to learn all scales simultaneously.


Figure 1: Our training starts with both the generator (G) and discriminator (D) having a low spatial resolution of 4×4 pixels. As the training advances, we incrementally add layers to G and D, thus increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here N × N refers to convolutional layers operating on N × N spatial resolution. This allows stable synthesis at high resolutions and also speeds up training considerably. On the right we show six example images generated using progressive growing at 1024 × 1024.

We use generator and discriminator networks that are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout the training process. When new layers are added to the networks, we fade them in smoothly, as illustrated in Figure 2. This avoids sudden shocks to the already well-trained, smaller-resolution layers. Appendix A describes the structure of the generator and discriminator in detail, along with other training parameters.

The generator and discriminator networks we use are mirror images of each other, and we always grow the two in synchrony. All existing layers in both networks remain trainable throughout the training process. When new layers are added to the networks, we fade them in smoothly, as illustrated in Figure 2. This avoids sudden shocks to the already well-trained, smaller-resolution layers. Appendix A describes the structure of the generator and discriminator in detail, along with the other training parameters.


Figure 2: When doubling the resolution of the generator (G) and discriminator (D) we fade in the new layers smoothly. This example illustrates the transition from 16 × 16 images (a) to 32 × 32 images (c). During the transition (b) we treat the layers that operate on the higher resolution like a residual block, whose weight α increases linearly from 0 to 1. Here 2× and 0.5× refer to doubling and halving the image resolution using nearest-neighbor filtering and average pooling, respectively. toRGB represents a layer that projects feature vectors to RGB colors and fromRGB does the reverse; both use 1 × 1 convolutions. When training the discriminator, we feed in real images that are downscaled to match the current resolution of the network. During a resolution transition, we interpolate between two resolutions of the real images, similarly to how the generator output combines the two resolutions.
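The fade-in in Figure 2(b) amounts to a simple linear blend between the nearest-neighbor-upsampled output of the old, lower-resolution branch and the output of the newly added layers. The NumPy sketch below illustrates the blending on the generator side; the helper names `upsample_nn` and `fade_in_output` are ours, not from the authors' released code:

```python
import numpy as np

def upsample_nn(x):
    """Nearest-neighbor 2x upsampling of an NCHW feature map (the paper's 2x box)."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def fade_in_output(lowres_rgb, highres_rgb, alpha):
    """Blend the old toRGB branch (upsampled) with the new one.

    alpha ramps linearly from 0 to 1 over the transition, treating the
    new high-resolution layers as a residual block (Figure 2b).
    """
    return (1.0 - alpha) * upsample_nn(lowres_rgb) + alpha * highres_rgb

# Example: transitioning from 16x16 to 32x32 RGB output.
low  = np.random.randn(4, 3, 16, 16)   # output of the 16x16 toRGB layer
high = np.random.randn(4, 3, 32, 32)   # output of the new 32x32 toRGB layer
blended = fade_in_output(low, high, alpha=0.3)
print(blended.shape)  # (4, 3, 32, 32)
```

The same blend is mirrored in the discriminator, where the downsampled fromRGB path fades out as the new layers fade in.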

We observe that the progressive training has several benefits. Early on, the generation of smaller images is substantially more stable because there is less class information and fewer modes (Odena et al., 2017). By increasing the resolution little by little we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. 1024² images. This approach has conceptual similarity to recent work by Chen & Koltun (2017). In practice it stabilizes the training sufficiently for us to reliably synthesize megapixel-scale images using WGAN-GP loss (Gulrajani et al., 2017) and even LSGAN loss (Mao et al., 2016b).

Another benefit is the reduced training time. With progressively growing GANs most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2–6 times faster, depending on the final output resolution.

We observe that progressive training has several benefits. Early on, the generation of smaller images is substantially more stable because there is less class information and there are fewer modes (Odena et al., 2017). By increasing the resolution little by little, we are continuously asking a much simpler question than the end goal of discovering a mapping from latent vectors to, e.g., 1024×1024 images. This approach is conceptually similar to recent work by Chen & Koltun (2017). In practice it stabilizes the training sufficiently for us to reliably synthesize megapixel-scale images using the WGAN-GP loss (Gulrajani et al., 2017) and even the LSGAN loss (Mao et al., 2016b).

Another benefit is the reduced training time. With progressively growing GANs, most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2–6 times faster, depending on the final output resolution.

The idea of growing GANs progressively is related to the work of Wang et al. (2017), who use multiple discriminators that operate on different spatial resolutions. That work in turn is motivated by Durugkar et al. (2016) who use one generator and multiple discriminators concurrently, and Ghosh et al. (2017) who do the opposite with multiple generators and one discriminator. Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017) define a generator and discriminator for each level of an image pyramid. These methods build on the same observation as our work – that the complex mapping from latents to high-resolution images is easier to learn in steps – but the crucial difference is that we have only a single GAN instead of a hierarchy of them. In contrast to early work on adaptively growing networks, growing neural gas (Fritzke, 1995) and neuro evolution of augmenting topologies (Stanley & Miikkulainen, 2002) that grow networks greedily, we simply defer the introduction of pre-configured layers. In that sense our approach resembles layer-wise training of autoencoders (Bengio et al., 2007).

The idea of growing GANs progressively is related to the work of Wang et al. (2017), who use multiple discriminators operating on different spatial resolutions. That work is in turn motivated by Durugkar et al. (2016), who use one generator and multiple discriminators concurrently, and by Ghosh et al. (2017), who do the opposite with multiple generators and one discriminator. Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017) define a generator and discriminator for each level of an image pyramid. These methods build on the same observation as our work – that the complex mapping from latents to high-resolution images is easier to learn in steps – but the crucial difference is that we have only a single GAN instead of a hierarchy of them. Compared with early work on adaptively growing networks, such as growing neural gas (Fritzke, 1995) and neuroevolution of augmenting topologies (Stanley & Miikkulainen, 2002), which grow networks greedily, we simply defer the introduction of pre-configured layers. In that sense our approach resembles layer-wise training of autoencoders (Bengio et al., 2007).

3 INCREASING VARIATION USING MINIBATCH STANDARD DEVIATION

GANs have a tendency to capture only a subset of the variation found in training data, and Salimans et al. (2016) suggest "minibatch discrimination" as a solution. They compute feature statistics not only from individual images but also across the minibatch, thus encouraging the minibatches of generated and training images to show similar statistics. This is implemented by adding a minibatch layer towards the end of the discriminator, where the layer learns a large tensor that projects the input activation to an array of statistics. A separate set of statistics is produced for each example in a minibatch and it is concatenated to the layer's output, so that the discriminator can use the statistics internally. We simplify this approach drastically while also improving the variation.

3 Increasing Variation Using Minibatch Standard Deviation

GANs have a tendency to capture only a subset of the variation found in the training data, and Salimans et al. (2016) suggest "minibatch discrimination" as a solution. They compute feature statistics not only from individual images but also across the minibatch, thus encouraging the minibatches of generated and training images to show similar statistics. This is implemented by adding a minibatch layer towards the end of the discriminator; the layer learns a large tensor that projects the input activation to an array of statistics. A separate set of statistics is produced for each example in the minibatch and concatenated to the layer's output, so that the discriminator can use the statistics internally. We simplify this approach drastically while also improving the variation.

Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation for each feature in each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end (see Appendix A.1 for details). We experimented with a richer set of statistics, but were not able to improve the variation further. In parallel work, Lin et al. (2017) provide theoretical insights about the benefits of showing multiple images to the discriminator.

Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation of each feature at each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate that value and concatenate it over all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end (see Appendix A.1 for details). We experimented with a richer set of statistics, but were not able to improve the variation further. In parallel work, Lin et al. (2017) provide theoretical insights about the benefits of showing multiple images to the discriminator.
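A minimal NumPy sketch of this layer, following the description above (the function name is ours, and we assume the whole minibatch forms a single group):

```python
import numpy as np

def minibatch_stddev_feature(x):
    """Append the minibatch-stddev feature map described in Section 3.

    x: activations of shape (N, C, H, W).
    Returns (N, C + 1, H, W): the per-feature, per-location standard
    deviation over the minibatch, averaged into a single scalar and
    broadcast as one extra constant feature map.
    """
    std = x.std(axis=0)            # (C, H, W): stddev over the minibatch
    mean_std = std.mean()          # one value over all features/locations
    extra = np.full((x.shape[0], 1, x.shape[2], x.shape[3]), mean_std)
    return np.concatenate([x, extra], axis=1)

acts = np.random.randn(16, 512, 4, 4)
print(minibatch_stddev_feature(acts).shape)  # (16, 513, 4, 4)
```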

Alternative solutions to the variation problem include unrolling the discriminator (Metz et al., 2016) to regularize its updates, and a "repelling regularizer" (Zhao et al., 2017) that adds a new loss term to the generator, trying to encourage it to orthogonalize the feature vectors in a minibatch. The multiple generators of Ghosh et al. (2017) also serve a similar goal. We acknowledge that these solutions may increase the variation even more than our solution – or possibly be orthogonal to it – but leave a detailed comparison to a later time.

Alternative solutions to the variation problem include unrolling the discriminator (Metz et al., 2016) to regularize its updates, and a "repelling regularizer" (Zhao et al., 2017) that adds a new loss term to the generator, trying to encourage it to orthogonalize the feature vectors in a minibatch. The multiple generators of Ghosh et al. (2017) also serve a similar goal. We acknowledge that these solutions may increase the variation even more than our solution – or possibly be orthogonal to it – but leave a detailed comparison to a later time.

4 NORMALIZATION IN GENERATOR AND DISCRIMINATOR

GANs are prone to the escalation of signal magnitudes as a result of unhealthy competition between the two networks. Most if not all earlier solutions discourage this by using a variant of batch normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016) in the generator, and often also in the discriminator. These normalization methods were originally introduced to eliminate covariate shift. However, we have not observed that to be an issue in GANs, and thus believe that the actual need in GANs is constraining signal magnitudes and competition. We use a different approach that consists of two ingredients, neither of which include learnable parameters.

4 Normalization in Generator and Discriminator

GANs are prone to the escalation of signal magnitudes as a result of unhealthy competition between the two networks. Most if not all earlier solutions discourage this by using a variant of batch normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016) in the generator, and often also in the discriminator. These normalization methods were originally introduced to eliminate covariate shift. However, we have not observed covariate shift to be an issue in GANs, and thus believe that what GANs actually need is a constraint on signal magnitudes and competition. We use a different approach consisting of two ingredients, neither of which includes learnable parameters.

4.1 EQUALIZED LEARNING RATE

We deviate from the current trend of careful weight initialization, and instead use a trivial N(0, 1) initialization and then explicitly scale the weights at runtime. To be precise, we set $\hat{w}_i = w_i / c$, where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).

4.1 Equalized Learning Rate

We deviate from the current trend of careful weight initialization and instead use a trivial N(0, 1) initialization, then explicitly scale the weights at runtime. To be precise, we set $\hat{w}_i = w_i / c$, where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle; it relates to the scale-invariance of commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario that modern initializers cause, and thus it is possible for a learning rate to be both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
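A sketch of the idea under the fan-in convention of He's initializer (the helper names are illustrative, not the authors' code): the stored weights are plain N(0, 1) draws, and the division by c happens in every forward pass, so the optimizer always sees unit-scale parameters.

```python
import numpy as np

def he_constant(fan_in):
    """Per-layer constant c so that w_hat = w / c matches He initialization.

    He et al. (2015) draw weights with std sqrt(2 / fan_in); dividing
    N(0, 1) weights by c = sqrt(fan_in / 2) yields the same effective scale.
    """
    return np.sqrt(fan_in / 2.0)

def forward_weight(w, fan_in):
    """Runtime-scaled weight w_hat = w / c, applied at every training step."""
    return w / he_constant(fan_in)

# Stored weights: trivial N(0, 1) initialization for a 3x3 conv, 512 -> 256.
w = np.random.randn(256, 512, 3, 3)
fan_in = 512 * 3 * 3
w_hat = forward_weight(w, fan_in)
print(w_hat.std())  # ~sqrt(2 / fan_in): the effective He-initialized scale
```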

4.2 PIXELWISE FEATURE VECTOR NORMALIZATION IN GENERATOR

To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer. We do this using a variant of "local response normalization" (Krizhevsky et al., 2012), configured as $b_{x,y} = a_{x,y} \big/ \sqrt{\tfrac{1}{N}\sum_{j=0}^{N-1}\big(a_{x,y}^{j}\big)^{2} + \epsilon}$, where $\epsilon = 10^{-8}$, N is the number of feature maps, and $a_{x,y}$ and $b_{x,y}$ are the original and normalized feature vector in pixel (x, y), respectively. We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.

4.2 Pixelwise Feature Vector Normalization in Generator

To prevent a scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer. We do this using a variant of "local response normalization" (Krizhevsky et al., 2012), configured as

$b_{x,y} = a_{x,y} \big/ \sqrt{\tfrac{1}{N}\sum_{j=0}^{N-1}\big(a_{x,y}^{j}\big)^{2} + \epsilon}$,

where $\epsilon = 10^{-8}$, N is the number of feature maps, and $a_{x,y}$ and $b_{x,y}$ are the original and normalized feature vector in pixel (x, y), respectively. We find it surprising that this crude constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it effectively prevents the escalation of signal magnitudes when needed.
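A minimal NumPy sketch of this pixelwise normalization over an NCHW activation tensor (the function name is ours):

```python
import numpy as np

def pixel_norm(a, eps=1e-8):
    """Pixelwise feature vector normalization (Section 4.2).

    a: activations of shape (N, C, H, W). Each pixel's feature vector
    a_{x,y} is scaled to roughly unit length:
    b = a / sqrt(mean_j a_j^2 + eps).
    """
    return a / np.sqrt((a ** 2).mean(axis=1, keepdims=True) + eps)

feat = np.random.randn(4, 512, 8, 8) * 50.0   # deliberately large magnitudes
out = pixel_norm(feat)
print(np.sqrt((out ** 2).mean(axis=1)).mean())  # ~1.0 regardless of input scale
```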

5 MULTI-SCALE STATISTICAL SIMILARITY FOR ASSESSING GAN RESULTS

In order to compare the results of one GAN to another, one needs to investigate a large number of images, which can be tedious, difficult, and subjective. Thus it is desirable to rely on automated methods that compute some indicative metric from large image collections. We noticed that existing methods such as MS-SSIM (Odena et al., 2017) find large-scale mode collapses reliably but fail to react to smaller effects such as loss of variation in colors or textures, and they also do not directly assess image quality in terms of similarity to the training set.

5 Multi-Scale Statistical Similarity for Assessing GAN Results

In order to compare the results of one GAN to another, one needs to investigate a large number of images, which can be tedious, difficult, and subjective. It is therefore desirable to rely on automated methods that compute some indicative metric from large image collections. We noticed that existing methods such as MS-SSIM (Odena et al., 2017) find large-scale mode collapses reliably, but fail to react to smaller effects such as loss of variation in colors or textures, and they also do not directly assess image quality in terms of similarity to the training set.

We build on the intuition that a successful generator will produce samples whose local image structure is similar to the training set over all scales. We propose to study this by considering the multi-scale statistical similarity between distributions of local image patches drawn from Laplacian pyramid (Burt & Adelson, 1987) representations of generated and target images, starting at a low-pass resolution of 16 × 16 pixels. As per standard practice, the pyramid progressively doubles until the full resolution is reached, each successive level encoding the difference to an up-sampled version of the previous level.

Our intuition is that a successful generator will produce samples whose local image structure is similar to the training set over all scales. We propose to study this by considering the multi-scale statistical similarity between the distributions of local image patches drawn from Laplacian pyramid (Burt & Adelson, 1987) representations of the generated and target images, starting at a low-pass resolution of 16 × 16 pixels. As per standard practice, the pyramid progressively doubles until the full resolution is reached, with each successive level encoding the difference to an up-sampled version of the previous level.
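As a concrete illustration of the pyramid construction described above, here is a rough NumPy/SciPy sketch; the bilinear resampling via `scipy.ndimage.zoom` is our simplification, since the exact filters are not specified here:

```python
import numpy as np
from scipy.ndimage import zoom

def laplacian_pyramid(img, n_levels):
    """Build a Laplacian pyramid (Burt & Adelson, 1987) - a rough sketch.

    img: float array (H, W, 3). Each level stores the difference between
    the image at that resolution and an upsampled version of the next
    coarser level, so levels encode distinct spatial frequency bands.
    """
    levels = []
    cur = img
    for _ in range(n_levels - 1):
        low = zoom(cur, (0.5, 0.5, 1), order=1)           # downsample 2x
        up = zoom(low, (cur.shape[0] / low.shape[0],
                        cur.shape[1] / low.shape[1], 1), order=1)
        levels.append(cur - up)                           # band-pass residual
        cur = low
    levels.append(cur)                                    # low-pass residue
    return levels

pyr = laplacian_pyramid(np.random.rand(128, 128, 3), n_levels=4)
print([l.shape for l in pyr])  # [(128,128,3), (64,64,3), (32,32,3), (16,16,3)]
```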

A single Laplacian pyramid level corresponds to a specific spatial frequency band. We randomly sample 16384 images and extract 128 descriptors from each level in the pyramid, giving us 2.1M descriptors per level. Each descriptor is a 7 × 7 pixel neighborhood with 3 color channels, denoted by $x \in \mathbb{R}^{7 \times 7 \times 3} = \mathbb{R}^{147}$. We denote the patches from level $l$ of the training set and the generated set by $\{x_i^l\}$ and $\{y_i^l\}$, respectively. We first normalize $\{x_i^l\}$ and $\{y_i^l\}$ with respect to the mean and standard deviation of each color channel, and then estimate the statistical similarity by computing their sliced Wasserstein distance $\mathrm{SWD}(\{x_i^l\}, \{y_i^l\})$, an efficiently computable randomized approximation to the earth mover's distance, using 512 projections (Rabin et al., 2011).
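A compact NumPy sketch of the sliced Wasserstein distance between two equally sized descriptor sets, following Rabin et al. (2011); the function and parameter names are ours:

```python
import numpy as np

def sliced_wasserstein(A, B, n_projections=512, seed=0):
    """Sliced Wasserstein distance between two equally sized descriptor sets.

    A randomized approximation to the earth mover's distance: project both
    sets onto random unit directions, sort the 1-D projections, and average
    the resulting 1-D transport costs.
    """
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_projections):
        d = rng.standard_normal(A.shape[1])
        d /= np.linalg.norm(d)
        pa, pb = np.sort(A @ d), np.sort(B @ d)
        dists.append(np.abs(pa - pb).mean())  # 1-D EMD between sorted samples
    return float(np.mean(dists))

# Toy usage: two sets of 2048 patch descriptors (7*7*3 = 147 dimensions).
x = np.random.randn(2048, 147)
y = np.random.randn(2048, 147) + 0.5
print(sliced_wasserstein(x, y, n_projections=64))
```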

Intuitively a small Wasserstein distance indicates that the distribution of the patches is similar, meaning that the training images and generator samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.

Intuitively, a small Wasserstein distance indicates that the distributions of the patches are similar, meaning that the training images and generator samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.

6 EXPERIMENTS

In this section we discuss a set of experiments that we conducted to evaluate the quality of our results. Please refer to Appendix A for detailed description of our network structures and training configurations. We also invite the reader to consult the accompanying video (https://youtu.be/G06dEcZ-QTg) for additional result images and latent space interpolations.

In this section we will distinguish between the network structure (e.g., convolutional layers, resizing), training configuration (various normalization layers, minibatch-related operations), and training loss (WGAN-GP, LSGAN).

6 Experiments

In this section we discuss a set of experiments that we conducted to evaluate the quality of our results. Please refer to Appendix A for a detailed description of our network structures and training configurations. We also invite the reader to consult the accompanying video (https://youtu.be/XOxxPcy5Gr4) for additional result images and latent space interpolations. In this section we distinguish between the network structure (e.g., convolutional layers, resizing), the training configuration (various normalization layers, minibatch-related operations), and the training loss (WGAN-GP, LSGAN).

6.1 IMPORTANCE OF INDIVIDUAL CONTRIBUTIONS IN TERMS OF STATISTICAL SIMILARITY

We will first use the sliced Wasserstein distance (SWD) and multi-scale structural similarity (MS-SSIM) (Odena et al., 2017) to evaluate the importance of our individual contributions, and also perceptually validate the metrics themselves. We will do this by building on top of a previous state-of-the-art loss function (WGAN-GP) and training configuration (Gulrajani et al., 2017) in an unsupervised setting using the CELEBA (Liu et al., 2015) and LSUN BEDROOM (Yu et al., 2015) datasets in 128² resolution. CELEBA is particularly well suited for such comparison because the training images contain noticeable artifacts (aliasing, compression, blur) that are difficult for the generator to reproduce faithfully. In this test we amplify the differences between training configurations by choosing a relatively low-capacity network structure (Appendix A.2) and terminating the training once the discriminator has been shown a total of 10M real images. As such the results are not fully converged.

6.1 Importance of Individual Contributions in Terms of Statistical Similarity

We first use the sliced Wasserstein distance (SWD) and multi-scale structural similarity (MS-SSIM) (Odena et al., 2017) to evaluate the importance of our individual contributions, and also to perceptually validate the metrics themselves. We do this by building on top of a previous state-of-the-art loss function (WGAN-GP) and training configuration (Gulrajani et al., 2017) in an unsupervised setting, using the CELEBA (Liu et al., 2015) and LSUN BEDROOM (Yu et al., 2015) datasets at 128×128 resolution. CELEBA is particularly well suited for such a comparison because the training images contain noticeable artifacts (aliasing, compression, blur) that are difficult for the generator to reproduce faithfully. In this test we amplify the differences between training configurations by choosing a relatively low-capacity network structure (Appendix A.2) and terminating the training once the discriminator has been shown a total of 10M real images. As such, the results are not fully converged.


Table 1: Sliced Wasserstein distance (SWD) between the generated and training samples (Section 5) and multi-scale structural similarity (MS-SSIM) among the generated samples for several training setups at 128 × 128 resolution. For SWD, each column represents one level of the Laplacian pyramid, and the last column gives the average of the distances.


Figure 3: (a) – (g) CELEBA examples corresponding to the rows in Table 1. These are intentionally non-converged. (h) Our converged result. Notice that some images show aliasing and some are not sharp – this is a flaw of the dataset, which the model learns to replicate faithfully.

Table 1 lists the numerical values for SWD and MS-SSIM in several training configurations, where our individual contributions are cumulatively enabled one by one on top of the baseline (Gulrajani et al., 2017). The MS-SSIM numbers were averaged from 10000 pairs of generated images, and SWD was calculated as described in Section 5. Generated CELEBA images from these configurations are shown in Figure 3. Due to space constraints, the figure shows only a small number of examples for each row of the table, but a significantly broader set is available in Appendix H. Intuitively, a good evaluation metric should reward plausible images that exhibit plenty of variation in colors, textures, and viewpoints. However, this is not captured by MS-SSIM: we can immediately see that configuration (h) generates significantly better images than configuration (a), but MS-SSIM remains approximately unchanged because it measures only the variation between outputs, not similarity to the training set. SWD, on the other hand, does indicate a clear improvement.

Table 1 lists the numerical values of SWD and MS-SSIM for several training configurations, where our individual contributions are cumulatively enabled one by one on top of the baseline (Gulrajani et al., 2017). The MS-SSIM numbers were averaged over 10000 pairs of generated images, and SWD was calculated as described in Section 5. Figure 3 shows generated CELEBA images from these configurations. Due to space constraints, the figure shows only a small number of examples for each row of the table, but a significantly broader set is available in Appendix H. Intuitively, a good evaluation metric should reward plausible images that exhibit plenty of variation in colors, textures, and viewpoints. However, this is not captured by MS-SSIM: we can immediately see that configuration (h) generates significantly better images than configuration (a), but MS-SSIM remains approximately unchanged because it measures only the variation between outputs, not the similarity to the training set. SWD, on the other hand, does indicate a clear improvement.

The first training configuration (a) corresponds to Gulrajani et al. (2017), featuring batch normalization in the generator, layer normalization in the discriminator, and minibatch size of 64. (b) enables progressive growing of the networks, which results in sharper and more believable output images. SWD correctly finds the distribution of generated images to be more similar to the training set.

The first training configuration (a) corresponds to Gulrajani et al. (2017), featuring batch normalization in the generator, layer normalization in the discriminator, and a minibatch size of 64. (b) enables progressive growing of the networks, which results in sharper and more believable output images. SWD correctly finds the distribution of the generated images to be more similar to the training set.

Our primary goal is to enable high output resolutions, and this requires reducing the size of minibatches in order to stay within the available memory budget. We illustrate the ensuing challenges in (c) where we decrease the minibatch size from 64 to 16. The generated images are unnatural, which is clearly visible in both metrics. In (d), we stabilize the training process by adjusting the hyperparameters as well as by removing batch normalization and layer normalization (Appendix A.2).

Our primary goal is to enable high output resolutions, and this requires reducing the minibatch size in order to stay within the available memory budget. In (c) we illustrate the challenges this brings when the minibatch size is decreased from 64 to 16: the generated images are unnatural, which is clearly visible in both metrics. In (d), we stabilize the training process by adjusting the hyperparameters as well as by removing batch normalization and layer normalization (Appendix A.2).

As an intermediate test (e∗), we enable minibatch discrimination (Salimans et al., 2016), which somewhat surprisingly fails to improve any of the metrics, including MS-SSIM that measures output variation. In contrast, our minibatch standard deviation (e) improves the average SWD scores and images. We then enable our remaining contributions in (f) and (g), leading to an overall improvement in SWD and subjective visual quality. Finally, in (h) we use a non-crippled network and longer training – we feel the quality of the generated images is at least comparable to the best published results so far.

As an intermediate test (e∗), we enable minibatch discrimination (Salimans et al., 2016), which somewhat surprisingly fails to improve any of the metrics, including MS-SSIM, which measures output variation. In contrast, our minibatch standard deviation (e) improves the average SWD scores and the images. We then enable our remaining contributions in (f) and (g), leading to an overall improvement in SWD and in subjective visual quality. Finally, in (h) we use a non-crippled network and longer training – we feel the quality of the generated images is at least comparable to the best published results so far.


Figure 4: Effect of progressive growing on training speed and convergence. The timings were measured on a single NVIDIA Tesla P100 GPU. (a) Statistical similarity with respect to wall-clock time for Gulrajani et al. (2017) using the CELEBA dataset at 128 × 128 resolution. Each curve shows the sliced Wasserstein distance on one level of the Laplacian pyramid, and the vertical line indicates the point where we stop the training in Table 1. (b) The same curves with progressive growing enabled. The short vertical lines indicate the points where we double the resolution in G and D. (c) Effect of progressive growing on raw training speed at 1024 × 1024 resolution.

6.2 CONVERGENCE AND TRAINING SPEED

Figure 4 illustrates the effect of progressive growing in terms of the SWD metric and raw image throughput. The first two plots correspond to the training configuration of Gulrajani et al. (2017) without and with progressive growing. We observe that the progressive variant offers two main benefits: it converges to a considerably better optimum and also reduces the total training time by about a factor of two. The improved convergence is explained by an implicit form of curriculum learning that is imposed by the gradually increasing network capacity. Without progressive growing, all layers of the generator and discriminator are tasked with simultaneously finding succinct intermediate representations for both the large-scale variation and the small-scale detail. With progressive growing, however, the existing low-resolution layers are likely to have already converged early on, so the networks are only tasked with refining the representations by increasingly smaller-scale effects as new layers are introduced. Indeed, we see in Figure 4(b) that the largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent throughout the rest of the training. The smaller-scale curves (32, 64, 128) level off one by one as the resolution is increased, but the convergence of each curve is equally consistent. With non-progressive training in Figure 4(a), each scale of the SWD metric converges roughly in unison, as could be expected.

Figure 4 illustrates the effect of progressive growing in terms of the SWD metric and the raw image throughput. The first two plots correspond to the training configuration of Gulrajani et al. (2017) without and with progressive growing. We observe that the progressive variant offers two main benefits: it converges to a considerably better optimum and roughly halves the total training time. The improved convergence is explained by an implicit form of curriculum learning imposed by the gradually increasing network capacity. Without progressive growing, all layers of the generator and discriminator must simultaneously find succinct intermediate representations for both the large-scale variation and the small-scale detail. With progressive growing, however, the existing low-resolution layers have likely already converged early on, so the networks are only tasked with refining the representations with increasingly small-scale effects as new layers are introduced. Indeed, we see in Figure 4(b) that the largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent throughout the rest of the training. The smaller-scale curves (32, 64, 128) level off one by one as the resolution is increased, but the convergence of each curve is equally consistent. With the non-progressive training in Figure 4(a), each scale of the SWD metric converges roughly in unison, as could be expected.

The speedup from progressive growing increases as the output resolution grows. Figure 4(c) shows training progress, measured in the number of real images shown to the discriminator, as a function of training time when the training progresses all the way to 1024² resolution. We see that progressive growing gains a significant head start because the networks are shallow and quick to evaluate at the beginning. Once the full resolution is reached, the image throughput is equal between the two methods. The plot shows that the progressive variant reaches approximately 6.4 million images in 96 hours, whereas it can be extrapolated that the non-progressive variant would take about 520 hours to reach the same point. In this case, the progressive growing offers roughly a 5.4× speedup (520/96 ≈ 5.4).

6.3 HIGH-RESOLUTION IMAGE GENERATION USING CELEBA-HQ DATASET

To meaningfully demonstrate our results at high output resolutions, we need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in GAN literature are limited to relatively low resolutions ranging from 32² to 480². To this end, we created a high-quality version of the CELEBA dataset consisting of 30000 of the images at 1024 × 1024 resolution. We refer to Appendix C for further details about the generation of this dataset.

6.3 High-Resolution Image Generation Using the CELEBA-HQ Dataset

To meaningfully demonstrate our results at high output resolutions, we need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in the GAN literature are limited to relatively low resolutions, ranging from 32×32 to 480×480. To this end, we created a high-quality version of the CELEBA dataset consisting of 30000 images at 1024 × 1024 resolution. We refer to Appendix C for further details about the generation of this dataset.

Our contributions allow us to deal with high output resolutions in a robust and efficient fashion. Figure 5 shows selected 1024 × 1024 images produced by our network. While megapixel GAN results have been shown before in another dataset (Marchesi, 2017), our results are vastly more varied and of higher perceptual quality. Please refer to Appendix F for a larger set of result images as well as the nearest neighbors found from the training data. The accompanying video shows latent space interpolations and visualizes the progressive training. The interpolation works so that we first randomize a latent code for each frame (512 components sampled individually from N(0, 1)), then blur the latents across time with a Gaussian (σ = 45 frames @ 60Hz), and finally normalize each vector to lie on a hypersphere.

Our contributions allow us to deal with high output resolutions in a robust and efficient fashion. Figure 5 shows selected 1024 × 1024 images produced by our network. While megapixel GAN results have been shown before on another dataset (Marchesi, 2017), our results are vastly more varied and of higher perceptual quality. Please refer to Appendix F for a larger set of result images, as well as the nearest neighbors found in the training data. The accompanying video shows latent space interpolations and visualizes the progressive training. The interpolation works as follows: we first randomize a latent code for each frame (512 components sampled individually from N(0, 1)), then blur the latents across time with a Gaussian (σ = 45 frames @ 60 Hz), and finally normalize each vector to lie on a hypersphere.
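A small sketch of that interpolation scheme (our own helper; the hypersphere radius sqrt(512) is an assumption, since the text only says the vectors are normalized onto a hypersphere):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_latents(n_frames=600, dim=512, sigma=45.0, seed=0):
    """Latent-space interpolation as described for the video.

    Draw an independent N(0, 1) latent per frame, blur across time with a
    Gaussian (sigma = 45 frames @ 60 Hz), then renormalize each vector onto
    a hypersphere (radius sqrt(dim) assumed here).
    """
    z = np.random.default_rng(seed).standard_normal((n_frames, dim))
    z = gaussian_filter1d(z, sigma=sigma, axis=0, mode='wrap')  # blur in time
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z * np.sqrt(dim) / norms

frames = smooth_latents()
print(frames.shape, np.linalg.norm(frames[0]))  # (600, 512), ~22.6
```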

We trained the network on 8 Tesla V100 GPUs for 4 days, after which we no longer observed qualitative differences between the results of consecutive training iterations. Our implementation used an adaptive minibatch size depending on the current output resolution so that the available memory budget was optimally utilized.

We trained the network on 8 Tesla V100 GPUs for 4 days, after which we no longer observed qualitative differences between the results of consecutive training iterations. Our implementation used an adaptive minibatch size that depends on the current output resolution, so that the available memory budget was optimally utilized.

In order to demonstrate that our contributions are largely orthogonal to the choice of a loss function, we also trained the same network using LSGAN loss instead of WGAN-GP loss. Figure 1 shows six examples of 1024² images produced using our method using LSGAN. Further details of this setup are given in Appendix B.

To demonstrate that our contributions are largely orthogonal to the choice of loss function, we also trained the same network using the LSGAN loss instead of the WGAN-GP loss. Figure 1 shows six examples of 1024×1024 images produced with our method using LSGAN; further details of this setup are given in Appendix B.


Figure 5: 1024 × 1024 images generated using the CELEBA-HQ dataset. See Appendix F for a larger set of results, and the accompanying video for latent space interpolations. On the right, two images from an earlier megapixel GAN by Marchesi (2017) show limited detail and variation.


Figure 6: Visual quality comparison in LSUN BEDROOM; pictures copied from the cited articles.

[Figure 7: selected examples from seven LSUN categories at 256 × 256]

6.4 LSUN RESULTS

Figure 6 shows a purely visual comparison between our solution and earlier results in LSUN BEDROOM. Figure 7 gives selected examples from seven very different LSUN categories at 256². A larger, non-curated set of results from all 30 LSUN categories is available in Appendix G, and the video demonstrates interpolations. We are not aware of earlier results in most of these categories, and while some categories work better than others, we feel that the overall quality is high.

6.4 LSUN Results

Figure 6 shows a purely visual comparison between our solution and earlier results on LSUN BEDROOM. Figure 7 gives selected 256×256 examples from seven very different LSUN categories. A larger, non-curated set of results from all 30 LSUN categories is available in Appendix G, and the video demonstrates interpolations. We are not aware of earlier results in most of these categories, and while some categories work better than others, we feel that the overall quality is high.

6.5 CIFAR10 INCEPTION SCORES

The best inception scores for CIFAR10 (10 categories of 32 × 32 RGB images) we are aware of are 7.90 for unsupervised and 8.87 for label conditioned setups (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by "ghosts" that necessarily appear between classes in the unsupervised setting, while label conditioning can remove many such transitions.

6.5 CIFAR10 Inception Scores

The best inception scores for CIFAR10 (10 categories of 32 × 32 RGB images) that we are aware of are 7.90 for the unsupervised and 8.87 for the label-conditioned setup (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by the "ghosts" that necessarily appear between classes in the unsupervised setting, while label conditioning can remove many such transitions.

When all of our contributions are enabled, we get an inception score of 8.80 in the unsupervised setting. Appendix D shows a representative set of generated images, along with a more comprehensive list of results from earlier methods. The network and training setup were the same as for CELEBA, with the progression naturally limited to 32×32 resolution. The only customization was to the WGAN-GP regularization term $\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - \gamma)^2 / \gamma^2\big]$. Gulrajani et al. (2017) used γ = 1.0, which corresponds to 1-Lipschitz, but we noticed that it is in fact significantly better to prefer fast transitions (γ = 750) to minimize the ghosts. We have not tried this trick with other datasets.

7 DISCUSSION

While the quality of our results is generally high compared to earlier work on GANs, and the training is stable in large resolutions, there is a long way to true photorealism. Semantic sensibility and understanding dataset-dependent constraints, such as certain objects being straight rather than curved, leaves a lot to be desired. There is also room for improvement in the micro-structure of the images. That said, we feel that convincing realism may now be within reach, especially in CELEBA-HQ.

7 Discussion

While the quality of our results is generally high compared to earlier work on GANs, and the training is stable at large resolutions, there is still a long way to true photorealism. Semantic sensibility and understanding of dataset-dependent constraints, such as certain objects being straight rather than curved, leave a lot to be desired. There is also room for improvement in the micro-structure of the images. That said, we feel that convincing realism may now be within reach, especially for CELEBA-HQ.

Appendix:

[Table 2: network architectures of the full-resolution generator and discriminator used with CELEBA-HQ]

A NETWORK STRUCTURE AND TRAINING CONFIGURATION

A.1 1024 × 1024 NETWORKS USED FOR CELEBA-HQ

Table 2 shows network architectures of the full-resolution generator and discriminator that we use with the CELEBA-HQ dataset. Both networks consist mainly of replicated 3-layer blocks that we introduce one by one during the course of the training. The last Conv 1 × 1 layer of the generator corresponds to the toRGB block in Figure 2, and the first Conv 1 × 1 layer of the discriminator similarly corresponds to fromRGB. We start with 4 × 4 resolution and train the networks until we have shown the discriminator 800k real images in total. We then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during 800k images, etc.

Table 2 shows the network architectures of the full-resolution generator and discriminator that we use with the CELEBA-HQ dataset. Both networks consist mainly of replicated 3-layer blocks that we introduce one by one during the course of the training. The last Conv 1 × 1 layer of the generator corresponds to the toRGB block in Figure 2, and the first Conv 1 × 1 layer of the discriminator similarly corresponds to fromRGB. We start at 4 × 4 resolution and train the networks until we have shown the discriminator 800k real images in total. We then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during the following 800k images, and so on.
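The alternation described above can be written down as a simple schedule; the sketch below is a hypothetical helper, not the released code, and only prints the phase sequence:

```python
def progressive_schedule(max_resolution=1024, images_per_phase=800_000):
    """Growth schedule from Appendix A.1: start at 4x4, then alternate
    fade-in and stabilization phases of 800k real images each, doubling
    the resolution at every fade-in."""
    phases = [('stabilize', 4)]
    res = 8
    while res <= max_resolution:
        phases.append(('fade-in', res))
        phases.append(('stabilize', res))
        res *= 2
    return [(kind, res, images_per_phase) for kind, res in phases]

for kind, res, n in progressive_schedule():
    print(f'{kind:9s} at {res}x{res} for {n} images')
```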

Our latent vectors correspond to random points on a 512-dimensional hypersphere, and we represent training and generated images in [-1, 1]. We use leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer that uses linear activation. We do not employ batch normalization, layer normalization, or weight normalization in either network, but we perform pixelwise normalization of the feature vectors after each Conv 3×3 layer in the generator as described in Section 4.2. We initialize all bias parameters to zero and all weights according to the normal distribution with unit variance. However, we scale the weights with a layer-specific constant at runtime as described in Section 4.1. We inject the across-minibatch standard deviation as an additional feature map at 4 × 4 resolution toward the end of the discriminator as described in Section 3. The upsampling and downsampling operations in Table 2 correspond to 2 × 2 element replication and average pooling, respectively.

Our latent vectors correspond to random points on a 512-dimensional hypersphere, and we represent the training and generated images in [-1, 1]. We use leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer, which uses linear activation. We do not employ batch normalization, layer normalization, or weight normalization in either network, but we perform pixelwise normalization of the feature vectors after each Conv 3×3 layer in the generator, as described in Section 4.2. We initialize all bias parameters to zero and all weights according to the normal distribution with unit variance. However, we scale the weights with a layer-specific constant at runtime, as described in Section 4.1. We inject the across-minibatch standard deviation as an additional feature map at 4 × 4 resolution toward the end of the discriminator, as described in Section 3. The upsampling and downsampling operations in Table 2 correspond to 2 × 2 element replication and average pooling, respectively.

We train the networks using Adam (Kingma & Ba, 2015) with α = 0.001, β₁ = 0, β₂ = 0.99, and ε = 10⁻⁸. We do not use any learning rate decay or rampdown, but for visualizing the generator output at any given point during the training, we use an exponential running average of the generator weights with decay 0.999. We use a minibatch size of 16 for resolutions 4²–128², and then gradually decrease the size according to 256² → 14, 512² → 6, and 1024² → 3 to avoid exceeding the available memory budget. We use the WGAN-GP loss, but unlike Gulrajani et al. (2017), we alternate between optimizing the generator and discriminator on a per-minibatch basis, i.e., we set n_critic = 1. Additionally, we introduce a fourth term into the discriminator loss with an extremely small weight to keep the discriminator output from drifting too far away from zero. To be precise, we set $L' = L + \epsilon_{\mathrm{drift}} \, \mathbb{E}_{x \in P_r}[D(x)^2]$, where $\epsilon_{\mathrm{drift}} = 0.001$.
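A sketch of the resulting critic loss in NumPy (our own helper names; the gradient-penalty term is taken as given, and ε_drift = 0.001 follows the text above):

```python
import numpy as np

def wgan_gp_drift_loss(d_real, d_fake, grad_penalty, eps_drift=0.001):
    """WGAN-GP critic loss with the fourth 'drift' term described above.

    d_real, d_fake: discriminator outputs on real/generated minibatches.
    grad_penalty: the usual WGAN-GP gradient penalty (lambda included).
    The drift term eps_drift * E[D(x)^2] keeps the outputs near zero.
    """
    wgan = d_fake.mean() - d_real.mean()
    drift = eps_drift * np.mean(d_real ** 2)
    return wgan + grad_penalty + drift

print(wgan_gp_drift_loss(np.random.randn(16), np.random.randn(16), 0.05))
```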

A.2 OTHER NETWORKS

Whenever we need to operate on a spatial resolution lower than 1024 × 1024, we do that by leaving out an appropriate number of copies of the replicated 3-layer block in both networks. Furthermore, Section 6.1 uses a slightly lower-capacity version, where we halve the number of feature maps in Conv 3 × 3 layers at the 16 × 16 resolution, and divide by 4 in the subsequent resolutions. This leaves 32 feature maps to the last Conv 3 × 3 layers. In Table 1 and Figure 4 we train each resolution for a total 600k images instead of 800k, and also fade in new layers for the duration of 600k images.

A.2 Other Networks

Whenever we need to operate on a spatial resolution lower than 1024 × 1024, we do so by leaving out an appropriate number of copies of the replicated 3-layer block in both networks. Furthermore, Section 6.1 uses a slightly lower-capacity version, in which we halve the number of feature maps in the Conv 3 × 3 layers at the 16 × 16 resolution and divide by 4 in the subsequent resolutions. This leaves 32 feature maps for the last Conv 3 × 3 layers. In Table 1 and Figure 4 we train each resolution for a total of 600k images instead of 800k, and also fade in new layers over the duration of 600k images.

For the "Gulrajani et al. (2017)" case in Table 1,we follow their training configuration as closely as possible. In particular,we set α = 0:0001,β2 = 0:9,ncritic = 5,drift = 0,and minibatch size 64. We disable progressive resolution,minibatch stddev,as well as weight scaling at runtime,and initialize all weights using He‘s initializer (He et al.,2015). Furthermore,we modify the generator by replacing LReLU with ReLU,linear activation with tanh in the last layer,and pixelwise normalization with batch normalization. In the discriminator,we add layer normalization to all Conv 3 × 3 and Conv 4 × 4 layers. For the latent vectors,we use 128 components sampled independently from the normal distribution.

For the "Gulrajani et al. (2017)" case in Table 1, we follow their training configuration as closely as possible. In particular, we set α = 0.0001, β₂ = 0.9, n_critic = 5, ε_drift = 0, and a minibatch size of 64. We disable progressive resolution, minibatch stddev, and the runtime weight scaling, and we initialize all weights using He's initializer (He et al., 2015). Furthermore, we modify the generator by replacing LReLU with ReLU, the linear activation with tanh in the last layer, and pixelwise normalization with batch normalization. In the discriminator, we add layer normalization to all Conv 3 × 3 and Conv 4 × 4 layers. For the latent vectors, we use 128 components sampled independently from the normal distribution.

B LEAST-SQUARES GAN (LSGAN) AT 1024 × 1024

We find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN. For example, the 1024² images in Figure 1 are LSGAN-based.

B Least-Squares GAN (LSGAN) at 1024 × 1024

We find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN. For example, the 1024×1024 images in Figure 1 are LSGAN-based.

On top of the techniques described in Sections 2–4, we need one additional hack with LSGAN that prevents the training from spiraling out of control when the dataset is too easy for the discriminator, and the discriminator gradients are at risk of becoming meaningless as a result. We adaptively increase the magnitude of multiplicative Gaussian noise in the discriminator as a function of the discriminator's output. The noise is applied to the input of each Conv 3 × 3 and Conv 4 × 4 layer. There is a long history of adding noise to the discriminator, and it is generally detrimental for the image quality (Arjovsky et al., 2017); ideally one would never have to do that, which according to our tests is the case for WGAN-GP (Gulrajani et al., 2017). The magnitude of noise is determined as $0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$, where $\hat{d}_t = 0.1\,d_t + 0.9\,\hat{d}_{t-1}$ is an exponential moving average of the discriminator output $d$. The motivation behind this hack is that LSGAN is seriously unstable when $d$ approaches (or exceeds) 1.0.

On top of the techniques described in Sections 2–4, we need one additional hack with LSGAN that prevents the training from spiraling out of control when the dataset is too easy for the discriminator and the discriminator gradients are consequently at risk of becoming meaningless. We adaptively increase the magnitude of multiplicative Gaussian noise in the discriminator as a function of the discriminator's output. The noise is applied to the input of each Conv 3 × 3 and Conv 4 × 4 layer. There is a long history of adding noise to the discriminator; it is generally detrimental to image quality (Arjovsky et al., 2017), and ideally one would never have to do it, which according to our tests is the case for WGAN-GP (Gulrajani et al., 2017). The magnitude of the noise is determined as $0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$, where $\hat{d}_t = 0.1\,d_t + 0.9\,\hat{d}_{t-1}$ is an exponential moving average of the discriminator output $d$. The motivation behind this hack is that LSGAN becomes seriously unstable when $d$ approaches (or exceeds) 1.0.
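A small sketch of this adaptive noise mechanism (our own class; the multiplicative form x·(1 + strength·n) is an assumption about how the noise is applied, while the EMA and magnitude follow the formula above):

```python
import numpy as np

class AdaptiveNoise:
    """EMA-gated Gaussian noise for the LSGAN discriminator (Appendix B).

    Tracks d_hat, an exponential moving average of the discriminator
    output, and turns noise on only as d_hat climbs toward 1.0.
    """
    def __init__(self, beta=0.9):
        self.beta = beta
        self.d_hat = 0.0

    def update(self, d_out):
        # d_hat_t = 0.1 * d_t + 0.9 * d_hat_{t-1}
        self.d_hat = self.beta * self.d_hat + (1.0 - self.beta) * float(d_out)
        return 0.2 * max(0.0, self.d_hat - 0.5) ** 2   # noise magnitude

    def apply(self, x, d_out):
        strength = self.update(d_out)
        # multiplicative Gaussian noise on a conv-layer input (assumed form)
        return x * (1.0 + strength * np.random.randn(*x.shape))

noise = AdaptiveNoise()
acts = np.random.randn(16, 64, 32, 32)
print(noise.apply(acts, d_out=0.9).shape)  # (16, 64, 32, 32)
```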

Chinese translation source:

https://blog.csdn.net/liujunru2013/article/details/78545882