Generative Variational Autoencoder for High Resolution Image Synthesis
This article presents our research on high resolution image generation using a Generative Variational Autoencoder.
Important Points
Introduction
The training of deep neural networks requires hundreds or even thousands of images. A lack of labelled datasets, especially for medical images, often hinders progress, so it becomes imperative to create additional training data. Another actively researched area is image generation with generative adversarial networks (GANs). With this technique, new images can be generated by training on the existing images in a dataset; the new images are realistic yet different from the original data. There are two main approaches to data augmentation with GANs: image-to-image translation and sampling from a random distribution. The main challenge with GANs is the mode collapse problem, i.e. the generated images are quite similar to each other and there is not enough variety in the generated output.
Another approach for image generation uses Variational Autoencoders (VAEs). This architecture contains a decoder, also known as the generative network, which takes a latent encoding as input and outputs the parameters of a conditional distribution over the observation. The encoder, also known as the inference network, takes an observation as input and outputs the parameters of the conditional distribution over the latent representation. During training, VAEs use the reparameterization trick, in which a sample from a Gaussian distribution is expressed as a deterministic function of the distribution parameters plus independent noise, so that gradients can flow through the sampling step. The main challenge with VAEs is that they are not able to generate sharp images.
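As an illustration, here is a minimal sketch of the reparameterization trick in PyTorch (not the exact code used in this work):

```python
import torch

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The sample is a deterministic function of (mu, log_var) plus independent
    noise, so gradients can flow back through mu and log_var.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), same shape as std
    return mu + eps * std

# Example: a batch of 4 latent vectors of size 100
mu = torch.zeros(4, 100, requires_grad=True)
log_var = torch.zeros(4, 100, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                   # gradients reach mu and log_var
```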
Dataset
The following datasets are used for training and evaluation:

- MNIST
- CELEBA-HQ
- LSUN (bedroom and other categories)
VAE vs. Our Network
We show how, instead of performing inference in the way shown in the original VAE architecture, we can add an error vector to the original data and multiply by the standard deviation. The resulting term is fed to the encoder and mapped to the latent space. In the decoder, the error vector is similarly added to the latent vector and multiplied by the standard deviation. In this manner, we use the encoder much as in the original VAE, while we replace the decoder with a discriminator and change the loss function accordingly. The comparison between the VAE architecture and ours is shown in Fig 1.
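A minimal sketch of this noise injection, assuming e1 and e2 are drawn from a standard normal distribution and sigma denotes the corresponding standard deviation (the function names are our own):

```python
import torch

def noisy_image(x, sigma_x=1.0):
    """Perturb the image vector before the encoder: (x + e1) * sigma_x."""
    e1 = torch.randn_like(x)   # e1 ~ N(0, I)
    return (x + e1) * sigma_x

def noisy_latent(z, sigma_z=1.0):
    """Perturb the latent vector on the decoder side: (z + e2) * sigma_z."""
    e2 = torch.randn_like(z)   # e2 ~ N(0, I)
    return (z + e2) * sigma_z
```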
Figure 1: Comparison between the standard VAE and our network, where e1 and e2 denote samples from a noise distribution, x denotes the image vector, z denotes the latent space vector, f and g denote the encoder and decoder functions respectively, and + and * denote the addition and concatenation operators.

Our architecture can be seen both as an extension of the VAE and of the GAN. The former view is straightforward, since the extension only requires changing the loss function of the decoder; the latter view follows by recalling that a GAN essentially works on the concept of a zero-sum game that maintains a Nash equilibrium between the generator and the discriminator. In our case, the encoder from the VAE and the discriminator from the GAN play a zero-sum game and compete with each other. As training proceeds, the loss decreases in both cases until it stabilizes.
Network Architecture
The network architecture used in this work is explained in the points below:
Our network architecture is shown in Fig 2.
Figure 2: Our network architecture

Architecture Details
The generator and discriminator layerwise architecture details are shown in Table 1 and Table 2 respectively. We denote a ResNet block as consisting of the following layers: a convolutional layer, a max pooling layer, 30 percent dropout between the layers, and a batch normalization layer.
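A minimal PyTorch-style sketch of such a block; the kernel sizes, channel counts, and the placement of the skip connection are our own assumptions, not taken from Table 1 or Table 2:

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Illustrative block: conv -> batch norm -> dropout -> residual add -> max pool."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout2d(p=0.3)                       # 30 percent dropout between the layers
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # match channels for the residual path

    def forward(self, x):
        out = self.drop(self.bn(self.conv(x)))
        out = out + self.skip(x)                               # residual connection
        return self.pool(out)

# Example: 64 -> 128 channels, halving spatial resolution
block = ResNetBlock(64, 128)
y = block(torch.randn(1, 64, 32, 32))                          # y has shape (1, 128, 16, 16)
```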
Algorithm
The algorithm used in this work is trained using Stochastic Gradient Descent (SGD) as shown below:
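As a rough illustration only, a training step in the spirit of an adversarial autoencoder, under our own assumptions about the loss terms and what the discriminator sees, might look like the following sketch:

```python
import torch

def train_step(encoder, discriminator, x, opt_enc, opt_disc, sigma=1.0):
    """One SGD step of the zero-sum game between the VAE encoder and the discriminator.

    Assumptions (not from the paper): the discriminator scores latent codes and
    outputs a probability in (0, 1); e1, e2 ~ N(0, I); sigma is a fixed scale.
    """
    e1 = torch.randn_like(x)
    z = encoder((x + e1) * sigma)                 # latent code from the perturbed image

    e2 = torch.randn_like(z)
    z_prior = torch.randn_like(z)                 # sample from the prior

    # Discriminator update: prior samples treated as "real", encoder codes as "fake".
    d_prior = discriminator((z_prior + e2) * sigma)
    d_enc = discriminator((z.detach() + e2) * sigma)
    loss_disc = -(torch.log(d_prior + 1e-8) + torch.log(1.0 - d_enc + 1e-8)).mean()
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    # Encoder update: try to make its codes indistinguishable from prior samples.
    loss_enc = -torch.log(discriminator((z + e2) * sigma) + 1e-8).mean()
    opt_enc.zero_grad()
    loss_enc.backward()
    opt_enc.step()
    return loss_enc.item(), loss_disc.item()
```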
Experiments
All the generated samples are generator outputs from random latent vectors. We normalize all data into the range [-1, 1] and use two evaluation metrics to measure the performance of our network. The first measures the distribution distance between real and generated samples using maximum mean discrepancy (MMD) scores. The second evaluates generation diversity with the multi-scale structural similarity metric (MS-SSIM). Table 4 compares MMD and MS-SSIM scores with previous state-of-the-art architectures.
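For reference, a simple sketch of an MMD estimate with a Gaussian kernel; the bandwidth and the use of flattened pixels as features are our own choices, not the paper's exact protocol:

```python
import torch

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dist = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd(real, fake, bandwidth=1.0):
    """Biased MMD^2 estimate between two sample sets of shape (n, d)."""
    k_rr = gaussian_kernel(real, real, bandwidth).mean()
    k_ff = gaussian_kernel(fake, fake, bandwidth).mean()
    k_rf = gaussian_kernel(real, fake, bandwidth).mean()
    return k_rr + k_ff - 2 * k_rf

# Example with flattened images normalized to [-1, 1]
real = torch.rand(64, 784) * 2 - 1
fake = torch.rand(64, 784) * 2 - 1
print(mmd(real, fake).item())   # lower means the two distributions are closer
```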
We noticed that the model with a small latent vector size of 100 suffers from severe mode collapse. The best results are obtained using a moderately large latent vector size. Table 5 compares the effect of different latent variable sizes on the MMD and MS-SSIM scores respectively.
As can be seen, a latent variable size of 1000 produces the best results among those compared. Mode collapse is observed at both low and high latent variable sizes, which is one of the main challenges faced when training GANs.
Four common evaluation metrics have been used in the literature for testing the performance of generative models. These are log-likelihood, reconstruction error, ELBO and KL divergence.
The log-likelihood is calculated by finding the parameters that maximize the log-likelihood of the observed samples. The reconstruction error is the distance between an original data point and its projection onto a lower-dimensional subspace. The KL divergence term in our optimization problem is intractable to minimize directly, so we maximize the ELBO instead. The KL divergence measures how similar the generated probability distribution is to the true probability distribution. The comparison of our model against the original VAE architecture on the MNIST dataset using these evaluation metrics is shown in Table 6.
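For reference, the standard ELBO decomposition in its textbook form, with approximate posterior q_phi(z|x), likelihood p_theta(x|z), and prior p(z):

```latex
\log p_\theta(x) \;\geq\; \mathrm{ELBO}(x)
  \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The gap between the two sides equals D_KL(q_phi(z|x) || p_theta(z|x)), so maximizing the ELBO implicitly minimizes the intractable posterior KL divergence.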
We compare our log probability value with those obtained by previous state-of-the-art methods in Table 7. The log probability is an important evaluation metric in that it reflects the diversity of the generated samples.
Results
We present the generated images on all 3 datasets used for testing. The models were trained for 1000 iterations. The images generated using the CELEBA-HQ dataset are shown in Fig 3.
Figure 3: 1024 × 1024 images generated using the CELEBA-HQ dataset.

The images generated using the LSUN BEDROOM dataset are shown in Fig 4.
Figure 4: 256 × 256 images generated using the LSUN BEDROOM dataset

The images generated from different LSUN categories are shown in Fig 5.
Figure 5: Sample 256 × 256 images generated from different LSUN categories

We compare our results with previous state-of-the-art networks on the MNIST dataset in Fig 6.
Figure 6: Generated MNIST images a) GAN b) WGAN c) VAE d) GVAE

Conclusions
In this blog, we presented a new training procedure for Variational Autoencoders based on generative models. This makes the inference model much more flexible, allowing it to represent almost any posterior distribution over the latent variables. Our network was trained and tested on 3 publicly available datasets. When evaluated using MMD, MS-SSIM, log-likelihood, reconstruction error, ELBO and KL divergence, our network beats the previous state-of-the-art algorithms. Using generative model approaches to generate additional training data, especially in fields like medical imaging, could be revolutionary, as there is a shortage of medical data for training deep convolutional neural network architectures.
Translated from: https://towardsdatascience.com/generative-variational-autoencoder-for-high-resolution-image-synthesis-48dd98d4dcc2