Introduction
Single-image super-resolution is the process of generating a high-resolution image that is consistent with an input low-resolution image. It falls under the broad family of image-to-image translation tasks, including colorization, in-painting, and de-blurring. Like many such inverse problems, image super-resolution is challenging because multiple output images may be consistent with a single input image, and the conditional distribution of output images given the input typically does not conform well to simple parametric distributions, e.g., a multivariate Gaussian. Accordingly, while simple regression-based methods with feedforward convolutional nets may work for super-resolution at low magnification ratios, they often lack the high-fidelity details needed for high magnification ratios.
Deep generative models have seen success in learning complex empirical distributions of images (e.g., [3], [4]). Autoregressive models [5], [6], variational autoencoders (VAEs) [7], [8], Normalizing Flows (NFs) [9], [10], and GANs [11], [12], [13] have shown convincing image generation results and benefited conditional tasks such as image super-resolution [14], [15], [16], [17], [18]. However, existing techniques often suffer from various limitations; autoregressive models are prohibitively expensive for high-resolution image generation, NFs and VAEs often yield sub-optimal sample quality, and GANs require carefully designed regularization and optimization tricks to tame optimization instability and mode collapse [19], [20], [21], [22].
We propose SR3 (Super-Resolution via Repeated Refinement), a new approach to conditional image generation, inspired by recent work on Denoising Diffusion Probabilistic Models (DDPM) [1], [23], and denoising score matching [1], [24]. SR3 works by learning to transform a standard normal distribution into an empirical data distribution through a sequence of refinement steps, resembling Langevin dynamics. The key is a U-Net architecture [25] that is trained with a denoising objective to iteratively remove various levels of noise from an image. We adapt DDPMs to image-to-image translation by proposing a simple effective modification to the U-Net architecture. In contrast to GANs, which require inner-loop maximization, we minimize a well-defined loss function. Unlike autoregressive models, SR3 uses a constant number of inference steps regardless of output resolution.
SR3 models work well across a range of magnification factors and input resolutions (e.g., see Fig. 1), and they can be cascaded, e.g., going from 64×64 to 256×256, and then to 1024×1024.
Automated image quality scores like PSNR and SSIM do not reflect human preference well when the input resolution is low and the magnification factor is large (e.g., [14], [15], [17], [26], [27], [28]). These quality scores often penalize synthetic high-frequency details, such as hair texture, because synthetic details do not perfectly align with the original details. We therefore resort to human evaluation to compare the quality of super-resolution methods. We adopt a 2-alternative forced-choice (2AFC) paradigm in which human subjects are shown a low-resolution input and are required to select between a model output and a ground truth image (cf. [29]). With that data we calculate fool rate scores that capture both image quality and the consistency of model outputs with low-resolution inputs.
On a standard 8× face super-resolution task, SR3 achieves a human fool rate close to 50%, outperforming FSRGAN [14] and PULSE [17], which achieve fool rates of at most 34%. On a 4× task on natural images, SR3 outperforms ESRGAN [30], EnhanceNet [31], and SRFlow [32] on human evaluation, and a wide range of methods on the classification accuracy of a ResNet-50 classifier trained on high-resolution images. To demonstrate unconditional and class-conditional generation, we combine SR3 models in a cascade with low-resolution generative models, synthesizing high-resolution images such as 1024×1024 faces and 256×256 class-conditional ImageNet samples.
Conditional Denoising Diffusion Model
We are given a dataset of input-output image pairs, denoted $\mathcal{D} = \lbrace {\boldsymbol{x}}_{i}, {\boldsymbol{y}}_{i}\rbrace$, which represent samples drawn from an unknown conditional distribution $p({\boldsymbol{y}} \mid {\boldsymbol{x}})$; here ${\boldsymbol{x}}$ is a low-resolution image and ${\boldsymbol{y}}$ is the corresponding high-resolution image.
The conditional DDPM model generates a target image ${\boldsymbol{y}}_{0}$ in $T$ refinement steps. Starting from a pure noise image ${\boldsymbol{y}}_{T} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, the model iteratively refines the image through learned transition distributions $p_\theta({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{t}, {\boldsymbol{x}})$ conditioned on the source image ${\boldsymbol{x}}$, producing the sequence $({\boldsymbol{y}}_{T-1}, {\boldsymbol{y}}_{T-2}, \ldots, {\boldsymbol{y}}_{0})$.
Two representative SR3 outputs: (top) 8× face super-resolution at 16×16
The forward diffusion process
The distribution of intermediate images in the iterative refinement chain is defined in terms of a forward diffusion process that gradually adds Gaussian noise to the output via a fixed Markov chain, denoted $q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1})$.
2.1 Gaussian Diffusion Process
Following [1], [23], we define a forward Markovian diffusion process $q$ that gradually adds Gaussian noise to the high-resolution image ${\boldsymbol{y}}_{0}$ over $T$ iterations:
\begin{align*}
q({\boldsymbol{y}}_{1:T} \mid {\boldsymbol{y}}_{0}) &=\prod \nolimits _{t=1}^{T} q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1}), \tag{1}\\
q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1}) &=\mathcal {N}({\boldsymbol{y}}_{t} \mid \sqrt{\alpha _{t}}\; {\boldsymbol{y}}_{t-1}, (1 - \alpha _{t}) \boldsymbol{I}), \tag{2}
\end{align*}
where the hyper-parameters $0 < \alpha_{t} < 1$ determine the variance of the noise added at each iteration. Importantly, one can characterize the distribution of ${\boldsymbol{y}}_{t}$ given ${\boldsymbol{y}}_{0}$ in closed form by marginalizing out the intermediate steps:
\begin{equation*}
q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{0}) ~=~ \mathcal {N}({\boldsymbol{y}}_{t} \mid \sqrt{\gamma _{t}}\; {\boldsymbol{y}}_{0}, (1-\gamma _{t}) \boldsymbol{I}), \tag{3}
\end{equation*}
where $\gamma_{t} = \prod_{i=1}^{t} \alpha_{i}$. Furthermore, with some algebra one can derive the posterior distribution of ${\boldsymbol{y}}_{t-1}$ given $({\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t})$:
\begin{align*}
q&({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t}) = \mathcal {N}({\boldsymbol{y}}_{t-1} \mid \boldsymbol{\mu }, \sigma ^{2} \boldsymbol{I}) \\
&\boldsymbol{\mu } = \frac{\sqrt{\gamma _{t-1}}\;(1-\alpha _{t})}{1-\gamma _{t}}\; {\boldsymbol{y}}_{0} + \frac{\sqrt{\alpha _{t}}\;(1-\gamma _{t-1})}{1-\gamma _{t}}{\boldsymbol{y}}_{t} \\
&\sigma ^{2} = \frac{(1-\gamma _{t-1})(1-\alpha _{t})}{1-\gamma _{t}}. \tag{4}
\end{align*}
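To make these quantities concrete, here is a minimal NumPy sketch of the marginal sampling in (3) and the posterior in (4); the linear $\alpha$ schedule is a placeholder, not the schedule used for our models.

```python
import numpy as np

# Hypothetical noise schedule; the actual alpha values used in the paper differ.
T = 1000
alphas = 1.0 - np.linspace(1e-4, 0.02, T)   # per-step alpha_t, with 0 < alpha_t < 1
gammas = np.cumprod(alphas)                 # gamma_t = prod_{i<=t} alpha_i

def q_sample(y0, t, rng):
    """Sample y_t ~ q(y_t | y_0) directly, per Eq. (3)."""
    eps = rng.standard_normal(y0.shape)
    return np.sqrt(gammas[t]) * y0 + np.sqrt(1.0 - gammas[t]) * eps

def q_posterior(y0, yt, t):
    """Mean and variance of q(y_{t-1} | y_0, y_t), per Eq. (4)."""
    gamma_prev = gammas[t - 1] if t > 0 else 1.0
    coef0 = np.sqrt(gamma_prev) * (1.0 - alphas[t]) / (1.0 - gammas[t])
    coeft = np.sqrt(alphas[t]) * (1.0 - gamma_prev) / (1.0 - gammas[t])
    mean = coef0 * y0 + coeft * yt
    var = (1.0 - gamma_prev) * (1.0 - alphas[t]) / (1.0 - gammas[t])
    return mean, var
```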
2.2 Optimizing the Denoising Model
The key to inference with diffusion models (Section 2.3) is the denoising network. In our case it is conditioned on side information in the form of a source image ${\boldsymbol{x}}$. Specifically, we train a denoising model $f_\theta$ that takes as input the source image ${\boldsymbol{x}}$ and a noisy target image
\begin{equation*}
\widetilde{\boldsymbol{y}}= \sqrt{\gamma }\; {\boldsymbol{y}}_{0} + \sqrt{1-\gamma } \;{\boldsymbol{\epsilon }}, ~~~~~~ {\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I}), \tag{5}
\end{equation*}
Algorithm 1. Training a Denoising Model $f_\theta$
repeat
  $({\boldsymbol{x}}, {\boldsymbol{y}}_{0}) \sim p({\boldsymbol{x}}, {\boldsymbol{y}})$
  $\gamma \sim p(\gamma)$
  ${\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$
  Take a gradient descent step on
  $\nabla _\theta \big \Vert f_\theta ({\boldsymbol{x}}, \sqrt{\gamma }\; {\boldsymbol{y}}_{0} + \sqrt{1-\gamma }\; {\boldsymbol{\epsilon }}, \gamma) - {\boldsymbol{\epsilon }}\big \Vert ^{p}_{p}$
until converged
Algorithm 2. Inference in $T$ Iterative Refinement Steps
${\boldsymbol{y}}_{T} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$
for $t = T, \ldots, 1$ do
  ${\boldsymbol{\epsilon }}_{t} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$ if $t > 1$, else ${\boldsymbol{\epsilon }}_{t} = \boldsymbol{0}$
  ${\boldsymbol{y}}_{t-1} \leftarrow \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{\sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right) + \sqrt{1-\alpha _{t}}\; {\boldsymbol{\epsilon }}_{t}$
end for
return ${\boldsymbol{y}}_{0}$
In addition to a source image ${\boldsymbol{x}}$ and a noisy target image $\widetilde{\boldsymbol{y}}$, the denoising model $f_\theta$ takes as input the noise level $\gamma$, and is trained to predict the noise vector ${\boldsymbol{\epsilon }}$ by minimizing the objective
\begin{equation*}
\mathbb {E}_{({\boldsymbol{x}}, {\boldsymbol{y}})} \mathbb {E}_{{\boldsymbol{\epsilon }}, \gamma } \bigg \Vert f_\theta ({\boldsymbol{x}}, \underbrace{\sqrt{\gamma } \;{\boldsymbol{y}}_{0} + \sqrt{1-\gamma }\; {\boldsymbol{\epsilon }}}_{\widetilde{\boldsymbol{y}}}, \gamma) - {\boldsymbol{\epsilon }}\; \bigg \Vert ^{p}_{p}, \tag{6}
\end{equation*}
where ${\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$, $({\boldsymbol{x}}, {\boldsymbol{y}})$ is a sample from the training set, and $\gamma$ is drawn from the training noise-level distribution (see Section 2.5). Instead of regressing the output of $f_\theta$ to the target image ${\boldsymbol{y}}_{0}$, we regress to the noise vector ${\boldsymbol{\epsilon }}$, following [1]; given the noise level $\gamma$, the two parameterizations are related through (5).
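A minimal PyTorch-style sketch of one training step under (6) (i.e., Algorithm 1) is given below; `denoise_fn` stands in for the conditioned U-Net $f_\theta$, and `sample_gamma` is any noise-level sampler (see Section 2.5). This is an illustrative sketch, not the exact training code.

```python
import torch

def training_step(denoise_fn, optimizer, x_lr, y0, sample_gamma, p=1):
    """One gradient step on the objective in Eq. (6) (Algorithm 1).

    denoise_fn:   the conditioned U-Net f_theta(x, y_noisy, gamma).
    x_lr:         low-resolution source, already up-sampled to the target size.
    y0:           high-resolution target image, shape (B, C, H, W).
    sample_gamma: callable returning a batch of noise levels gamma in (0, 1).
    """
    gamma = sample_gamma(y0.shape[0]).view(-1, 1, 1, 1)
    eps = torch.randn_like(y0)
    y_noisy = gamma.sqrt() * y0 + (1.0 - gamma).sqrt() * eps     # Eq. (5)
    eps_pred = denoise_fn(x_lr, y_noisy, gamma.flatten())        # predict the noise
    loss = (eps_pred - eps).abs().pow(p).mean()                  # ||.||_p^p, Eq. (6)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```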
2.3 Inference via Iterative Refinement
Inference under our model is defined as a reverse Markovian process, which goes in the reverse direction of the forward diffusion process, starting from pure Gaussian noise ${\boldsymbol{y}}_{T} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$:
\begin{align*}
p_\theta ({\boldsymbol{y}}_{0:T} | {\boldsymbol{x}}) &=p({\boldsymbol{y}}_{T}) \prod \nolimits _{t=1}^{T} p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}}) \tag{7}\\
p({\boldsymbol{y}}_{T}) &=\mathcal {N}({\boldsymbol{y}}_{T} \mid \boldsymbol{0}, \boldsymbol{I}) \tag{8}\\
p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}}) &= \mathcal {N}({\boldsymbol{y}}_{t-1} \mid \mu _{\theta }({\boldsymbol{x}}, {{\boldsymbol{y}}}_{t}, \gamma _{t}), \sigma _{t}^{2}\boldsymbol{I}). \tag{9}
\end{align*}
Recall that the denoising model $f_\theta$ is trained to estimate the noise ${\boldsymbol{\epsilon }}$ given a noisy image and its noise level. Thus, given ${\boldsymbol{y}}_{t}$, we can approximate ${\boldsymbol{y}}_{0}$ by rearranging the terms in (5):
\begin{equation*}
\hat{{\boldsymbol{y}}}_{0} = \frac{1}{\sqrt{\gamma _{t}}} \left({\boldsymbol{y}}_{t} - \sqrt{1 - \gamma _{t}}\; f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right). \tag{10}
\end{equation*}
Substituting $\hat{{\boldsymbol{y}}}_{0}$ into the posterior $q({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t})$ in (4), we parameterize the mean of $p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}})$ as
\begin{equation*}
\mu _{\theta }({\boldsymbol{x}}, {{\boldsymbol{y}}}_{t}, \gamma _{t}) = \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{ \sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right), \tag{11}
\end{equation*}
Following this parameterization, each iteration of iterative refinement under our model takes the form,
\begin{equation*}
{\boldsymbol{y}}_{t-1} \leftarrow \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{ \sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right) + \sqrt{1 - \alpha _{t}}{\boldsymbol{\epsilon }}_{t},
\end{equation*}
where ${\boldsymbol{\epsilon }}_{t} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$.
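The full sampling loop of Algorithm 2 can be sketched as follows; `denoise_fn`, `alphas`, and `gammas` are placeholders for the trained model and its inference noise schedule.

```python
import torch

@torch.no_grad()
def sr3_sample(denoise_fn, x_lr, alphas, gammas, shape):
    """Iterative refinement per Eqs. (8)-(11); alphas/gammas are indexed by t = 1..T."""
    y = torch.randn(shape)                                     # y_T ~ N(0, I)
    T = len(alphas)
    for t in range(T, 0, -1):
        alpha_t, gamma_t = alphas[t - 1], gammas[t - 1]
        eps_pred = denoise_fn(x_lr, y, torch.full((shape[0],), gamma_t))
        # Posterior mean, Eq. (11)
        y = (y - (1 - alpha_t) / (1 - gamma_t) ** 0.5 * eps_pred) / alpha_t ** 0.5
        if t > 1:
            y = y + (1 - alpha_t) ** 0.5 * torch.randn_like(y)  # add noise with variance (1 - alpha_t)
    return y
```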
2.4 Justification of the Training Objective
Following Ho et al. [1], we justify the choice of the training objective in (6) for the probabilistic model outlined in (9) from a variational lower bound perspective. If the forward diffusion process is viewed as a fixed approximate posterior to the inference process, one can derive the following variational lower bound on the marginal log-likelihood
\begin{align*}
\mathbb {E}_{({\boldsymbol{x}},{\boldsymbol{y}}_{0})}\log p_\theta ({\boldsymbol{y}}_{0} | {\boldsymbol{x}}) &\geq \mathbb {E}_{{\boldsymbol{x}},{\boldsymbol{y}}_{0}}\mathbb {E}_{q({\boldsymbol{y}}_{1:T}|{\boldsymbol{y}}_{0})}\bigg [ \log p({\boldsymbol{y}}_{T}) \\
& + \sum _{t \geq 1} \log \frac{p_{\theta } ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}})}{q({\boldsymbol{y}}_{t}|{\boldsymbol{y}}_{t-1})} \bigg ]. \tag{12}
\end{align*}
Given the particular parameterization of the inference process outlined above, one can show [1] that the negative variational lower bound can be expressed as the following simplified loss, up to a constant weighting of each term for each time step
\begin{align*}
\mathbb {E}_{{\boldsymbol{x}},{\boldsymbol{y}}_{0},{\boldsymbol{\epsilon }}} \!\sum _{t=1}^{T} \frac{1}{T} \bigg \Vert \boldsymbol{\epsilon } - \boldsymbol{\epsilon }_{\theta }({\boldsymbol{x}}, \sqrt{\gamma _{t}} {\boldsymbol{y}}_{0} + \sqrt{1 - \gamma _{t}}\boldsymbol{\epsilon }, \gamma _{t}) \bigg \Vert ^{2}_{2}, \tag{13}
\end{align*}
Our approach is also linked to denoising score matching [35], [36], [37], [38] for training unnormalized energy functions for density estimation. These methods learn a parametric score function to approximate the gradient of the empirical data log-density. To make sure the gradient of the data log-density is well-defined, one often replaces each data point with a Gaussian distribution with a small variance. Song and Ermon [39] advocate the use of a multi-scale Gaussian mixture as the target density, where each data point is perturbed with different amounts of Gaussian noise, so that Langevin dynamics starting from pure noise can still yield reasonable samples.
One can view our approach as a variant of denoising score matching in which the target density is given by a mixture of the Gaussians $q(\widetilde{\boldsymbol{y}} \mid {\boldsymbol{y}}_{0}, \gamma)$ implied by (5), one per noise level $\gamma$ encountered during training. For a fixed ${\boldsymbol{y}}_{0}$ and $\gamma$, the score of this density is
\begin{equation*}
\frac{\mathrm{d} \log q(\widetilde{\boldsymbol{y}}\mid {\boldsymbol{y}}_{0}, \gamma)}{\mathrm{d} \widetilde{\boldsymbol{y}}} ~=~ -\frac{\widetilde{\boldsymbol{y}}- \sqrt{\gamma }\,{\boldsymbol{y}}_{0}}{1-\gamma } ~=~ -\frac{{\boldsymbol{\epsilon }}}{\sqrt{1-\gamma }}, \tag{14}
\end{equation*}
so the denoising model $f_\theta$, which is trained to predict ${\boldsymbol{\epsilon }}$, estimates this score up to a scale factor of $-1/\sqrt{1-\gamma }$.
2.5 SR3 Model Architecture and Noise Schedules
The SR3 architecture is similar to the U-Net in DDPM [1], with self-attention and modifications adapted from [40]; i.e., we replace the original DDPM residual blocks with residual blocks from BigGAN [41], and we re-scale skip connections by $1/\sqrt{2}$.
To condition the model on the input ${\boldsymbol{x}}$, we up-sample the low-resolution image to the target resolution using bicubic interpolation, and concatenate the result channel-wise with the noisy target image ${\boldsymbol{y}}_{t}$ as input to the U-Net.
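A sketch of this conditioning is given below; the U-Net internals are omitted and `unet` is a placeholder for the denoising network.

```python
import torch
import torch.nn.functional as F

def condition_and_denoise(unet, x_lr, y_t, gamma):
    """Up-sample the low-resolution input and concatenate it with y_t channel-wise."""
    x_up = F.interpolate(x_lr, size=y_t.shape[-2:], mode='bicubic', align_corners=False)
    inp = torch.cat([x_up, y_t], dim=1)   # (B, 2*C, H, W) input to the U-Net
    return unet(inp, gamma)               # gamma is provided as additional conditioning
```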
For our training noise schedule, we follow [33], and use a piece-wise uniform distribution for the noise level $\gamma$: we first sample a time step $t \sim U\lbrace 1, \ldots, T\rbrace$, and then sample $\gamma \sim U(\gamma_{t-1}, \gamma_{t})$.
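Our reading of this sampler [33] can be sketched as follows; the concrete $\gamma$ values are placeholders, and this sketch can be plugged into the training step shown in Section 2.2.

```python
import numpy as np

def sample_gamma(gammas, batch_size, rng):
    """Piece-wise uniform sampling of the noise level:
    pick t ~ U{1..T}, then gamma ~ U(gamma_{t-1}, gamma_t)."""
    T = len(gammas)
    bounds = np.concatenate([[1.0], gammas])   # gamma_0 = 1 (no noise); gammas decrease with t
    t = rng.integers(1, T + 1, size=batch_size)
    lo, hi = bounds[t], bounds[t - 1]          # interval (gamma_t, gamma_{t-1})
    return lo + (hi - lo) * rng.random(batch_size)
```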
For sample generation (or inference), early diffusion models [1], [40] required 1-2 K diffusion steps, making generation slow, especially for high resolution images. For more efficient generation we instead adapt recent techniques [33]. In particular, by conditioning the denoising model on the continuous noise level $\gamma$ rather than the discrete step index $t$, the noise schedule used at inference need not match the training schedule; given a budget of refinement steps, we can choose a much shorter inference schedule (via hyper-parameter search, cf. Section 4.6) without retraining the model.
Fig. 3 depicts the architecture used for both SR3 and our regression baselines. This denoising U-Net takes as input a noisy high resolution image and a low-resolution conditioning image that has been interpolated and up-sampled to the target resolution. Task-dependent parameters are summarized in Table 1. These architectures have more parameters than many existing networks for image super-resolution, motivated in part by other domains where performance scales with model capacity and dataset size. As shown in Section 4.5, even a simple Regression model with a large architecture can perform surprisingly well.
Depiction of the U-Net architecture of SR3. The low-resolution input image is interpolated and up-sampled to the target resolution, then concatenated channel-wise with the noisy target image before being passed through the network.
Related Work
SR3 is inspired by recent work on deep generative models and recent learning-based approaches to super-resolution.
Generative Models. Autoregressive models (ARs) [43], [44] can model exact data log likelihood, capturing rich distributions. However, their sequential generation of pixels is expensive, limiting application to low-resolution images. Normalizing flows [9], [10], [45] improve on sampling speed while modelling the exact data likelihood, but the need for invertible parameterized transformations with a tractable Jacobian determinant limits their expressiveness. VAEs [7], [46] offer fast sampling, but tend to underperform GANs and ARs in image quality [8]. Generative Adversarial Networks (GANs) [11] are popular for class conditional image generation and super-resolution. Nevertheless, the inner-outer loop optimization often requires tricks to stabilize training [19], [20], and conditional tasks like super-resolution usually require an auxiliary consistency-based loss to avoid mode collapse [16]. Cascades of GAN models have been used to generate higher resolution images [47].
Score matching [35] models the gradient of the data log-density with respect to the image. Score matching on noisy data, called denoising score matching [36], is equivalent to training a denoising autoencoder, and to DDPMs [1]. Denoising score matching over multiple noise scales with Langevin dynamics sampling from the learned score functions has recently been shown to be effective for high quality unconditional image generation [1], [24]. These models have also been generalized to continuous time [40]. Denoising score matching and diffusion models have also found success in shape generation [48], and speech synthesis [33]. We extend this method to super-resolution, with a simple learning objective, a constant number of inference generation steps, and high quality generation.
Super-Resolution. Numerous super-resolution methods have been proposed [16], [30], [31], [32], [49], [50], [51], [52], [53]. Much of the early work on super-resolution is regression based and trained with an MSE loss [49], [52], [54], [55], [56]. As such, they effectively estimate the posterior mean, yielding blurry images when the posterior is multi-modal [16], [17], [31]. Our regression baseline defined below is also a one-step regression model trained with MSE (cf. [52], [56]), but with a large U-Net architecture. SR3, by comparison, relies on a series of iterative refinement steps, each of which is trained with a regression loss. This difference permits our iterative approach to capture richer distributions. Further, rather than estimating the posterior mean, SR3 generates samples from the target posterior.
Autoregressive models have been used successfully for super-resolution and cascaded up-sampling [15], [18], [57], [58]. Nevertheless, the expense of inference limits their applicability to low-resolution images. SR3 can generate high-resolution images, e.g., 1024×1024, but with a constant number of refinement steps (often no more than 100).
GAN-based super-resolution methods have also found considerable success [12], [16], [17], [30], [31], [32], [59]. FSRGAN [14] and PULSE [17] in particular have demonstrated high quality face super-resolution results. However, many such GAN based methods are generally difficult to optimize, and often require auxiliary objective functions to ensure consistency with the low resolution inputs.
Normalizing flows have been used for super-resolution with a multi-scale approach [32], [60]. They are competitive with GAN models, and are capable of generating 1024×1024 images due in part to their efficient inference process. SR3 uses a series of reverse diffusion steps to transform a Gaussian distribution to an image distribution while flows require a deep and invertible network.
Experiments
We assess the effectiveness of SR3 in super-resolution on faces, natural images, and synthetic images obtained from a low-resolution generative model. The latter enables high-resolution image synthesis using cascaded models.
4.1 Datasets
We follow previous work [17], training face super-resolution models on Flickr-Faces-HQ (FFHQ) [61] and evaluating on CelebA-HQ [12]. For natural image super-resolution, we train on ImageNet 1K [62] and use the dev split for evaluation. We train unconditional face and class-conditional ImageNet generative models using DDPM on the same datasets. For training and testing, low-resolution images are obtained by bicubic down-sampling with anti-aliasing enabled. For ImageNet, we discard images whose shorter side is less than the target resolution. Following [41], we take the largest central crop, which is then resized to the target resolution using area resampling to obtain the high-resolution image.
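A sketch of this preprocessing using PIL is shown below; the exact filters in our pipeline may differ slightly, but the crop-then-resize structure is as described above.

```python
from PIL import Image

def make_pair(path, hr_size, scale):
    """Largest central crop -> HR via area resampling -> LR via anti-aliased bicubic."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    s = min(w, h)                                    # largest central square crop
    img = img.crop(((w - s) // 2, (h - s) // 2, (w + s) // 2, (h + s) // 2))
    hr = img.resize((hr_size, hr_size), Image.BOX)   # area resampling to target resolution
    lr = hr.resize((hr_size // scale,) * 2, Image.BICUBIC)  # PIL's bicubic resize is anti-aliased
    return lr, hr
```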
4.2 Training Details
We train SR3 and regression models for 1 M training steps, with a batch size of 256; this typically takes about four days on 64 TPUv3 chips. Given the large model capacity and large datasets, the models often continue to improve well beyond 1 M steps. We choose a checkpoint for the regression baseline based on peak PSNR on the held-out set. We do not perform any checkpoint selection on SR3 models and simply select the latest checkpoint. Consistent with [1], we use the Adam optimizer with a linear warmup schedule over 10 K training steps, followed by a fixed learning rate of 1e-4 for SR3 models and 1e-5 for regression models. We use a dropout rate of 0.2 for the 16×16 → 128×128 face super-resolution models.
4.3 Evaluation
We evaluate SR3 models on face and natural images:
Face super-resolution at $16\!\times \!16 \!\rightarrow \! 128\!\times \!128$ and $64\!\times \!64 \!\rightarrow \! 512\!\times \!512$, trained on FFHQ and evaluated on CelebA-HQ.
Natural image super-resolution at $64\!\times \!64 \!\rightarrow \! 256\!\times \!256$ and $56\!\times \!56 \!\rightarrow \! 224\!\times \!224$ pixels on ImageNet [62].
Unconditional $1024\!\times \!1024$ face generation by a cascade of 3 models, and class-conditional $256\!\times \!256$ ImageNet image generation by a cascade of 2 models.
We compare SR3 with EnhanceNet [31], ESRGAN [30], SRFlow [32], FSRGAN [14] and PULSE [17]. We also compare to a Regression baseline that shares the same architecture and model capacity as SR3. Importantly, this enables one to directly assess the advantages of iterative refinement over a single step regression model, ablating the effects of model size, architecture, and training data. Performance is assessed qualitatively and quantitatively, using human evaluation, FID scores and the classification accuracy of a pre-trained model on super-resolution outputs.
4.4 Qualitative Results
Fig. 4 compares SR3 and our Regression baseline for a 64×64
Super-resolution results (64×64
Three samples from SR3 applied to ImageNet test images (16×16
Fig. 6 shows outputs of our face super-resolution models (16×16 → 128×128 and 64×64 → 512×512).
Results of a SR3 model (64×64
Further qualitative comparisons are shown in Fig. 7, where SR3 is compared to SoTA GAN models [30], [31] and a Normalizing Flow model [32]. While the GAN- and Flow-based methods produce sharp details, they also tend to generate artifacts in regions with fine-grained texture (e.g., see the face of the jaguar, and the structure of the dockyard). By comparison, SR3 produces sharp images with plausible details and minimal artifacts. As discussed above, while the high resolution details are realistic, they are not expected to perfectly match the original reference image.
SR3 and state-of-the-art methods on 4× natural image super-resolution.
4.5 Quantitative Evaluation
Table 2 shows PSNR, SSIM [63], and Consistency scores for 16×16 → 128×128 face super-resolution.
4.5.1 Consistency With Low-Resolution Inputs
It is important for super-resolution outputs to be consistent with their low-resolution inputs. To measure this consistency, we compute the MSE between the down-sampled outputs and the low-resolution inputs. Table 2 shows that SR3 achieves the best consistency error, beating PULSE and FSRGAN by a significant margin and even slightly outperforming the Regression baseline. This demonstrates a key advantage of SR3 over state-of-the-art GAN-based methods: SR3 does not require any auxiliary objective function to ensure consistency with the low-resolution inputs.
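Concretely, the consistency score can be computed as follows (a sketch; we assume the same anti-aliased bicubic downsampling as the data pipeline described in Section 4.1):

```python
import torch
import torch.nn.functional as F

def consistency_mse(sr_output, lr_input, scale):
    """MSE between the downsampled super-resolution output and the low-resolution input."""
    down = F.interpolate(sr_output, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False, antialias=True)
    return F.mse_loss(down, lr_input).item()
```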
4.5.2 Classification Accuracy on Super-Resolution Outputs
Table 4 compares the outputs of 4× natural image super-resolution models on object classification accuracy. Following [31], [64] we apply 4× super-resolution models to 56×56 center crops from the validation set of ImageNet. Then, we report classification accuracy of a pre-trained ResNet-50 [65] model. Since SR3 models are trained on the task of 64×64
SR3 outperforms existing methods by a significant margin on both top-1 and top-5 classification errors, suggesting higher perceptual quality. The strong performance of the Regression model can be attributed to the model capacity and architecture, and in part because it was trained on ImageNet data. The improvement of SR3 over Regression can be viewed as a direct indication of the power of the diffusion framework and iterative refinement, as both models use the same architecture. These results also reaffirm the limits of conventional reference-based metrics in super-resolution, like PSNR and SSIM, for which the baseline Regression model exhibits higher performance.
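The evaluation protocol can be sketched as follows; the torchvision ResNet-50 weights and preprocessing shown here are stand-ins for the pre-trained classifier used in our experiments, and super-resolved images are assumed to lie in [0, 1].

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def sr_classification_accuracy(sr_model, lr_crops, labels):
    """Top-1 accuracy of a pre-trained ResNet-50 on super-resolved 224x224 crops."""
    weights = ResNet50_Weights.IMAGENET1K_V1
    classifier = resnet50(weights=weights).eval()
    preprocess = weights.transforms()          # standard resize / crop / normalization
    sr_images = sr_model(lr_crops)             # e.g., 56x56 center crops -> 224x224 outputs
    logits = classifier(preprocess(sr_images))
    return (logits.argmax(dim=1) == labels).float().mean().item()
```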
4.5.3 Human Evaluation (2AFC)
Direct human evaluation is one of the most desirable metrics for evaluating super-resolution models. While mean opinion score (MOS) is commonly used to measure image quality in this context, forced choice pairwise comparison has been found to be a more reliable method for such subjective quality assessments [69]. Furthermore, standard MOS studies do not capture consistency between low-resolution inputs and high-resolution outputs.
We use a 2-alternative forced-choice (2AFC) paradigm to measure how well humans can discriminate true images from those generated from a model. In Task-1 subjects were shown a low resolution input in between two high-resolution images, one being the real image (ground truth), and the other generated from the model. Subjects were asked “Which of the two images is a better high quality version of the low resolution image in the middle?” This task takes into account both image quality and consistency with the low resolution input. Task-2 is similar to Task-1, except that the low-resolution image was not shown, so subjects only had to select the image that was more photo-realistic. They were asked “Which image would you guess is from a camera?” Subjects viewed images for 3 seconds before responding. The source code for human evaluation can be found here.1
The subject fool rate is the fraction of trials on which a subject selects the model output over ground truth. Our fool rates for each model are based on 50 subjects, each of whom was shown 50 of the 100 images in the test set. Fig. 9 shows the fool rates for Task-1 (top), and for Task-2 (bottom). In both experiments, the fool rate of SR3 is close to 50%, indicating that SR3 produces images that are both photo-realistic and faithful to the low-resolution inputs. We find similar fool rates over a wide range of viewing durations up to 12 seconds.
Comparison on 4× face super-resolution (16×16
Face super-resolution human fool rates (higher is better, for photo-realistic samples one would expect a fool rate close to 50%). Outputs of four models are compared to ground truth. (top) Task-1, subjects are shown low-resolution inputs. (bottom) Task-2, inputs are not shown.
The fool rates for FSRGAN and PULSE in Task-1 are lower than the Regression baseline and SR3. The strength of SR3 over the Regression model reflects the benefits of iterative refinement in the diffusion model, since both models share the same architecture. We speculate that the PULSE optimization has failed to converge to high resolution images sufficiently close to the inputs. Indeed, when asked solely about image quality in Task-2 (Fig. 9 (bottom)), the PULSE fool rate increases significantly.
The fool rate for the Regression baseline is lower in Task-2 (Fig. 9 (bottom)) than Task-1. The regression model tends to generate images that are blurry, but nevertheless faithful to the low resolution input. We speculate that in Task-1, given the inputs, subjects are influenced by consistency, while in Task-2, ignoring consistency, they instead focus on image sharpness. SR3 and Regression samples used for human evaluation are provided here2
The results of a similar study with natural images, comparing SR3 with Regression, GAN-based models [30], [31] and a Flow-based model [32] on a subset of the ImageNet validation set are shown in Fig. 10. In this study images were displayed for 6 seconds and the input images were not displayed (i.e., Task-2). We used somewhat longer display times because natural images are more complex and cluttered than the face images. We did not show the input image because inconsistency between inputs and model outputs did not appear to be problematic with the baselines used. From Fig. 10 one can see that SR3 outperforms baselines by a substantial margin, suggesting higher perceptual quality. The regression model is significantly weaker in this case, which we attribute to the longer viewing time which makes it easier to discern the image blur.
ImageNet super-resolution fool rates. Model outputs are compared to ground truth with pairs of images shown for 6 seconds.
To further appreciate the experimental results, it is useful to visually compare outputs of different models on the same inputs, as in Fig. 8. FSRGAN exhibits distortions in the face region and struggles to generate glasses properly (e.g., top row). It also fails to recover texture details in the hair region (see bottom row). PULSE often produces images that differ significantly from the input image, both in the shape of the face and the background, and sometimes in gender too (see bottom row), presumably because the optimization fails to find a sufficiently good minimum. As noted above, our Regression baseline produces results that are consistent with the input but typically quite blurry. By comparison, the SR3 results are consistent with the input and contain more detailed image structure.
In addition to the aggregate fool rate results in Fig. 9, it is also interesting to inspect images that attain highest and lowest fool rates for a given technique. This provides insight into the nature of the problems that models exhibit, as well as cases in which the model outputs are good enough to regularly confuse people.
Fig. 11 displays the outputs of PULSE [17] and SR3 with the lowest and highest fool rates for Task-1 (the conditional task). Notice that images from PULSE for which the fool rate is low have obvious distortions, and the fool rates are lower than 10%. For SR3, by comparison, the images with the lowest fool rates are still reasonably good, with much higher fool rates of 14% and 19%. It is interesting to see that the best fool rates for SR3 on Task-1 are 84% and 88%. The corresponding original images for these examples are somewhat noisy, and as a consequence, interestingly, many subjects prefer the SR3 outputs.
Test cases with the lowest and highest fool rates for PULSE and SR3 in Task-1 (which compares model outputs to reference images, in the presence of low-resolution inputs). For privacy reasons, reference images are not shown.
4.6 Generation Speed
As discussed in Section 2.5, diffusion models typically require a large number of refinement steps during sample generation, and are therefore expensive compared to GANs. For more efficient inference, given a generation budget, SR3 determines the noise schedule using hyper-parameter search. Fig. 12 shows the resulting trade-off between image quality (FID) and efficiency (number of diffusion steps) for a 64×64
FID score versus number of inference steps for 64×64
4.7 Cascaded High-Resolution Image Synthesis
We also study cascaded image generation, where SR3 models at different scales are chained together with generative models, enabling high-resolution image synthesis. Cascaded generation allows one to train different models in parallel, and each model in the cascade solves a simpler task, requiring fewer parameters and less computation for training. Inference with cascaded models is also more efficient, especially for iterative refinement models. With cascaded generation we found it effective to use more refinement steps at low-resolutions, and fewer steps at higher resolutions. This was much more efficient than generating directly at high resolution without sacrificing image quality.
For cascaded face generation, as depicted in Fig. 14, we train a DDPM [1] model for unconditional 64×64 face generation, and chain it with two 4× SR3 models (64×64 → 256×256 and 256×256 → 1024×1024) to synthesize 1024×1024 face images.
Synthetic 1024×1024 faces, sampled from an unconditional 64×64 model, followed by two 4× SR3 models.
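The cascade amounts to a simple composition of samplers; in this sketch, `ddpm_sample_64` and the two SR3 samplers are placeholders for the trained models.

```python
import torch

@torch.no_grad()
def sample_faces_1024(ddpm_sample_64, sr3_64_to_256, sr3_256_to_1024, batch_size):
    """Unconditional 1024x1024 face synthesis via a 3-model cascade."""
    y64 = ddpm_sample_64(batch_size)     # unconditional 64x64 samples
    y256 = sr3_64_to_256(y64)            # 4x SR3: 64x64 -> 256x256
    y1024 = sr3_256_to_1024(y256)        # 4x SR3: 256x256 -> 1024x1024
    return y1024
```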
We also trained a set of Improved DDPM [70] models for class-conditional 64×64 ImageNet generation, and cascade them with a 4× SR3 model (64×64 → 256×256) to synthesize class-conditional 256×256 ImageNet samples.
Class-conditional 256×256 ImageNet samples. Each row represents samples from a specific ImageNet class, from top to bottom: Goldfish, Red Fox, Balloon, Monarch Butterfly, Church, Fire Truck. For a given label, we sample a 64×64 image from a class-conditional diffusion model, and then apply a 4× SR3 model.
Synthetic 256×256 ImageNet images. We draw a label at random, sample a 64×64 image from the corresponding class-conditional diffusion model, and then apply a 4× SR3 model.
As a way to quantitatively evaluate sample quality, Table 5 reports FID scores for the resulting class-conditional ImageNet samples. Our 2-stage model improves on VQ-VAE-2 [71], is comparable to deep BigGANs [41] at a truncation factor of 1.5, but underperforms them at a truncation factor of 1.0. Unlike BigGAN, our diffusion models do not provide control of sample quality versus sample diversity; this remains an interesting avenue for future research. Nichol and Dhariwal [70] concurrently trained cascaded generation models using super-resolution conditioned on class labels (SR3 is not conditioned on class labels), and also observed a similar trend with improved FID scores. The effectiveness of cascaded image generation indicates that SR3 models are robust to the precise distribution of inputs (i.e., the specific form of anti-aliasing and downsampling).
4.8 Ablation Studies on Cascaded Models
Table 6 reports results of ablations on a
We also explore the choice of
Discussion and Conclusion
SR3 leverages conditional diffusion models to address single-image super-resolution. It initializes the output image with pure Gaussian noise and iteratively refines it with a denoising model conditioned on the low-resolution input. We find that SR3 works well on natural and face images across a wide range of magnification factors, and as part of a cascading pipeline to generate high-resolution images. SR3 models outperform several GAN and Normalizing Flow baselines. Human studies, in which subjects are asked to discriminate model outputs from real images, yield SR3 fool rates close to 50% on faces and 40% on natural images, indicating that SR3 produces high-fidelity outputs. The success of SR3 is in part a function of large model capacity and large training datasets, motivating further exploration of scaling in future super-resolution work.
One practical issue with diffusion models is the computation cost of many refinement steps during inference. Our results indicate that one can trade sample quality for generation speed and achieve decent results in just 4 refinement steps. That said, recent and concurrent work proposes alternative approaches that can result in higher quality fast samplers for diffusion models [73], [74], [75], [76]. We further note that the use of self-attention, while powerful, also constrains the output dimension of our model; this will be addressed in future versions of SR3.
Finally, bias is an important issue with all generative models, including SR3. While in theory our log-likelihood based objective is mode covering (e.g., unlike some GAN-based objectives), we do observe some indication of mode dropping in SR3 outputs, e.g., the model consistently generates nearly the same output during sampling when conditioned on the same input. We also observe that the model generates overly smooth skin texture in face super-resolution, dropping moles, pimples, and piercings found in the reference image. SR3 should not be used in super-resolution products without further study of its potential biases. Nevertheless, diffusion models like SR3 may be useful in reducing dataset bias by generating synthetic data for underrepresented groups.
ACKNOWLEDGMENTS
We thank Jimmy Ba, Geoff Hinton, and Shingai Manjengwa, who kindly provided their face images for testing, and Ben Poole, Samy Bengio, and the Google Brain team for discussions and technical assistance. The authors thank the authors of [17] for generously providing samples for human evaluation.