Introduction
Single-image super-resolution is the process of generating a high-resolution image that is consistent with an input low-resolution image. It falls under the broad family of image-to-image translation tasks, including colorization, in-painting, and de-blurring. Like many such inverse problems, image super-resolution is challenging because multiple output images may be consistent with a single input image, and the conditional distribution of output images given the input typically does not conform well to simple parametric distributions, e.g., a multivariate Gaussian. Accordingly, while simple regression-based methods with feedforward convolutional nets may work for super-resolution at low magnification ratios, they often lack the high-fidelity details needed for high magnification ratios.
Deep generative models have seen success in learning complex empirical distributions of images (e.g., [3], [4]). Autoregressive models [5], [6], variational autoencoders (VAEs) [7], [8], Normalizing Flows (NFs) [9], [10], and GANs [11], [12], [13] have shown convincing image generation results and benefited conditional tasks such as image super-resolution [14], [15], [16], [17], [18]. However, existing techniques often suffer from various limitations; autoregressive models are prohibitively expensive for high-resolution image generation, NFs and VAEs often yield sub-optimal sample quality, and GANs require carefully designed regularization and optimization tricks to tame optimization instability and mode collapse [19], [20], [21], [22].
We propose SR3 (Super-Resolution via Repeated Refinement), a new approach to conditional image generation, inspired by recent work on Denoising Diffusion Probabilistic Models (DDPM) [1], [23], and denoising score matching [1], [24]. SR3 works by learning to transform a standard normal distribution into an empirical data distribution through a sequence of refinement steps, resembling Langevin dynamics. The key is a U-Net architecture [25] that is trained with a denoising objective to iteratively remove various levels of noise from an image. We adapt DDPMs to image-to-image translation by proposing a simple effective modification to the U-Net architecture. In contrast to GANs, which require inner-loop maximization, we minimize a well-defined loss function. Unlike autoregressive models, SR3 uses a constant number of inference steps regardless of output resolution.
SR3 models work well across a range of magnification factors and input resolutions (e.g., see Fig. 1), and they can be cascaded, e.g., going from 64×64 to 256×256, and then to 1024×1024.
Automated image quality scores like PSNR and SSIM do not reflect human preference well when the input resolution is low and the magnification factor is large (e.g., [14], [15], [17], [26], [27], [28]). These quality scores often penalize synthetic high-frequency details, such as hair texture, because synthetic details do not perfectly align with the original details. We therefore resort to human evaluation to compare the quality of super-resolution methods. We adopt a 2-alternative forced-choice (2AFC) paradigm in which human subjects are shown a low-resolution input and are required to select between a model output and a ground truth image (cf. [29]). With that data we calculate fool rate scores that capture both image quality and the consistency of model outputs with low-resolution inputs.
On a standard 8× face super-resolution task, SR3 achieves a human fool rate close to 50%, outperforming FSRGAN [14] and PULSE [17], which achieve fool rates of at most 34%. On a 4× task on natural images, SR3 outperforms ESRGAN [30], EnhanceNet [31], and SRFlow [32] on human evaluation, and a wide range of methods on the classification accuracy of a ResNet-50 classifier trained on high-resolution images. To demonstrate unconditional and class-conditional generation, we combine SR3 models in a cascade with low-resolution generative models, synthesizing high-resolution images such as 1024×1024 faces and 256×256 class-conditional ImageNet samples.
Conditional Denoising Diffusion Model
We are given a dataset of input-output image pairs, denoted $\mathcal{D} = \lbrace {\boldsymbol{x}}_{i}, {\boldsymbol{y}}_{i}\rbrace$, which represent samples drawn from an unknown conditional distribution $p({\boldsymbol{y}} \mid {\boldsymbol{x}})$; here ${\boldsymbol{x}}$ is a low-resolution image and ${\boldsymbol{y}}$ is the corresponding high-resolution image.
The conditional DDPM model generates a target image ${\boldsymbol{y}}_{0}$ in $T$ refinement steps. Starting from a pure noise image ${\boldsymbol{y}}_{T} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, the model iteratively refines the image through learned transition distributions $p_\theta({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{t}, {\boldsymbol{x}})$ conditioned on the source image ${\boldsymbol{x}}$, producing the sequence $({\boldsymbol{y}}_{T-1}, {\boldsymbol{y}}_{T-2}, \ldots, {\boldsymbol{y}}_{0})$.
Two representative SR3 outputs: (top) 8× face super-resolution at 16×16
The forward diffusion process
The distribution of intermediate images in the iterative refinement chain is defined in terms of a forward diffusion process that gradually adds Gaussian noise to the output via a fixed Markov chain, denoted $q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1})$.
2.1 Gaussian Diffusion Process
Following [1], [23], we define a forward Markovian diffusion process $q$ that gradually adds Gaussian noise to the high-resolution image ${\boldsymbol{y}}_{0}$ over $T$ iterations:
\begin{align*}
q({\boldsymbol{y}}_{1:T} \mid {\boldsymbol{y}}_{0}) &=\prod \nolimits _{t=1}^{T} q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1}), \tag{1}\\
q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{t-1}) &=\mathcal {N}({\boldsymbol{y}}_{t} \mid \sqrt{\alpha _{t}}\; {\boldsymbol{y}}_{t-1}, (1 - \alpha _{t}) \boldsymbol{I}), \tag{2}
\end{align*}
where the hyper-parameters $0 < \alpha_{t} < 1$ determine the variance of the noise added at each iteration. Importantly, one can characterize the distribution of ${\boldsymbol{y}}_{t}$ given ${\boldsymbol{y}}_{0}$ in closed form by marginalizing out the intermediate steps:
\begin{equation*}
q({\boldsymbol{y}}_{t} \mid {\boldsymbol{y}}_{0}) ~=~ \mathcal {N}({\boldsymbol{y}}_{t} \mid \sqrt{\gamma _{t}}\; {\boldsymbol{y}}_{0}, (1-\gamma _{t}) \boldsymbol{I}), \tag{3}
\end{equation*}
where $\gamma_{t} = \prod_{i=1}^{t} \alpha_{i}$. Furthermore, with some algebra one can derive the posterior distribution of ${\boldsymbol{y}}_{t-1}$ given $({\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t})$:
\begin{align*}
q&({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t}) = \mathcal {N}({\boldsymbol{y}}_{t-1} \mid \boldsymbol{\mu }, \sigma ^{2} \boldsymbol{I}) \\
&\boldsymbol{\mu } = \frac{\sqrt{\gamma _{t-1}}\;(1-\alpha _{t})}{1-\gamma _{t}}\; {\boldsymbol{y}}_{0} + \frac{\sqrt{\alpha _{t}}\;(1-\gamma _{t-1})}{1-\gamma _{t}}{\boldsymbol{y}}_{t} \\
&\sigma ^{2} = \frac{(1-\gamma _{t-1})(1-\alpha _{t})}{1-\gamma _{t}}. \tag{4}
\end{align*}
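To make these quantities concrete, here is a minimal NumPy sketch of the marginal sampling in (3) and the posterior in (4); the linear $\alpha$ schedule is a placeholder, not the schedule used for our models.

```python
import numpy as np

# Hypothetical noise schedule; the actual alpha values used in the paper differ.
T = 1000
alphas = 1.0 - np.linspace(1e-4, 0.02, T)   # per-step alpha_t, with 0 < alpha_t < 1
gammas = np.cumprod(alphas)                 # gamma_t = prod_{i<=t} alpha_i

def q_sample(y0, t, rng):
    """Sample y_t ~ q(y_t | y_0) directly, per Eq. (3)."""
    eps = rng.standard_normal(y0.shape)
    return np.sqrt(gammas[t]) * y0 + np.sqrt(1.0 - gammas[t]) * eps

def q_posterior(y0, yt, t):
    """Mean and variance of q(y_{t-1} | y_0, y_t), per Eq. (4)."""
    gamma_prev = gammas[t - 1] if t > 0 else 1.0
    coef0 = np.sqrt(gamma_prev) * (1.0 - alphas[t]) / (1.0 - gammas[t])
    coeft = np.sqrt(alphas[t]) * (1.0 - gamma_prev) / (1.0 - gammas[t])
    mean = coef0 * y0 + coeft * yt
    var = (1.0 - gamma_prev) * (1.0 - alphas[t]) / (1.0 - gammas[t])
    return mean, var
```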
2.2 Optimizing the Denoising Model
The key to inference with diffusion models (Section 2.3) is the denoising network. In our case it is conditioned on side information in the form of a source image ${\boldsymbol{x}}$. Specifically, we train a denoising model $f_\theta$ that takes as input the source image ${\boldsymbol{x}}$ and a noisy target image
\begin{equation*}
\widetilde{\boldsymbol{y}}= \sqrt{\gamma }\; {\boldsymbol{y}}_{0} + \sqrt{1-\gamma } \;{\boldsymbol{\epsilon }}, ~~~~~~ {\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I}), \tag{5}
\end{equation*}
Algorithm 1. Training a Denoising Model $f_\theta$
repeat
  $({\boldsymbol{x}}, {\boldsymbol{y}}_{0}) \sim p({\boldsymbol{x}}, {\boldsymbol{y}})$
  $\gamma \sim p(\gamma)$
  ${\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$
  Take a gradient descent step on
  $\nabla _\theta \big \Vert f_\theta ({\boldsymbol{x}}, \sqrt{\gamma }\; {\boldsymbol{y}}_{0} + \sqrt{1-\gamma }\; {\boldsymbol{\epsilon }}, \gamma) - {\boldsymbol{\epsilon }}\big \Vert ^{p}_{p}$
until converged
Algorithm 2. Inference in $T$ Iterative Refinement Steps
${\boldsymbol{y}}_{T} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$
for $t = T, \ldots, 1$ do
  ${\boldsymbol{\epsilon }}_{t} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$ if $t > 1$, else ${\boldsymbol{\epsilon }}_{t} = \boldsymbol{0}$
  ${\boldsymbol{y}}_{t-1} \leftarrow \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{\sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right) + \sqrt{1-\alpha _{t}}\; {\boldsymbol{\epsilon }}_{t}$
end for
return ${\boldsymbol{y}}_{0}$
In addition to a source image ${\boldsymbol{x}}$ and a noisy target image $\widetilde{\boldsymbol{y}}$, the denoising model $f_\theta$ takes as input the noise level $\gamma$, and is trained to predict the noise vector ${\boldsymbol{\epsilon }}$ by minimizing the objective
\begin{equation*}
\mathbb {E}_{({\boldsymbol{x}}, {\boldsymbol{y}})} \mathbb {E}_{{\boldsymbol{\epsilon }}, \gamma } \bigg \Vert f_\theta ({\boldsymbol{x}}, \underbrace{\sqrt{\gamma } \;{\boldsymbol{y}}_{0} + \sqrt{1-\gamma }\; {\boldsymbol{\epsilon }}}_{\widetilde{\boldsymbol{y}}}, \gamma) - {\boldsymbol{\epsilon }}\; \bigg \Vert ^{p}_{p}, \tag{6}
\end{equation*}
where ${\boldsymbol{\epsilon }}\sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$, $({\boldsymbol{x}}, {\boldsymbol{y}})$ is a sample from the training set, and $\gamma$ is drawn from the training noise-level distribution (see Section 2.5). Instead of regressing the output of $f_\theta$ to the target image ${\boldsymbol{y}}_{0}$, we regress to the noise vector ${\boldsymbol{\epsilon }}$, following [1]; given the noise level $\gamma$, the two parameterizations are related through (5).
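A minimal PyTorch-style sketch of one training step under (6) (i.e., Algorithm 1) is given below; `denoise_fn` stands in for the conditioned U-Net $f_\theta$, and `sample_gamma` is any noise-level sampler (see Section 2.5). This is an illustrative sketch, not the exact training code.

```python
import torch

def training_step(denoise_fn, optimizer, x_lr, y0, sample_gamma, p=1):
    """One gradient step on the objective in Eq. (6) (Algorithm 1).

    denoise_fn:   the conditioned U-Net f_theta(x, y_noisy, gamma).
    x_lr:         low-resolution source, already up-sampled to the target size.
    y0:           high-resolution target image, shape (B, C, H, W).
    sample_gamma: callable returning a batch of noise levels gamma in (0, 1).
    """
    gamma = sample_gamma(y0.shape[0]).view(-1, 1, 1, 1)
    eps = torch.randn_like(y0)
    y_noisy = gamma.sqrt() * y0 + (1.0 - gamma).sqrt() * eps     # Eq. (5)
    eps_pred = denoise_fn(x_lr, y_noisy, gamma.flatten())        # predict the noise
    loss = (eps_pred - eps).abs().pow(p).mean()                  # ||.||_p^p, Eq. (6)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```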
2.3 Inference via Iterative Refinement
Inference under our model is defined as a reverse Markovian process, which goes in the reverse direction of the forward diffusion process, starting from pure Gaussian noise ${\boldsymbol{y}}_{T} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$:
\begin{align*}
p_\theta ({\boldsymbol{y}}_{0:T} | {\boldsymbol{x}}) &=p({\boldsymbol{y}}_{T}) \prod \nolimits _{t=1}^{T} p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}}) \tag{7}\\
p({\boldsymbol{y}}_{T}) &=\mathcal {N}({\boldsymbol{y}}_{T} \mid \boldsymbol{0}, \boldsymbol{I}) \tag{8}\\
p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}}) &= \mathcal {N}({\boldsymbol{y}}_{t-1} \mid \mu _{\theta }({\boldsymbol{x}}, {{\boldsymbol{y}}}_{t}, \gamma _{t}), \sigma _{t}^{2}\boldsymbol{I}). \tag{9}
\end{align*}
Recall that the denoising model $f_\theta$ is trained to estimate the noise ${\boldsymbol{\epsilon }}$ given a noisy image and its noise level. Thus, given ${\boldsymbol{y}}_{t}$, we can approximate ${\boldsymbol{y}}_{0}$ by rearranging the terms in (5):
\begin{equation*}
\hat{{\boldsymbol{y}}}_{0} = \frac{1}{\sqrt{\gamma _{t}}} \left({\boldsymbol{y}}_{t} - \sqrt{1 - \gamma _{t}}\; f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right). \tag{10}
\end{equation*}
Substituting $\hat{{\boldsymbol{y}}}_{0}$ into the posterior $q({\boldsymbol{y}}_{t-1} \mid {\boldsymbol{y}}_{0}, {\boldsymbol{y}}_{t})$ in (4), we parameterize the mean of $p_\theta ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}})$ as
\begin{equation*}
\mu _{\theta }({\boldsymbol{x}}, {{\boldsymbol{y}}}_{t}, \gamma _{t}) = \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{ \sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right), \tag{11}
\end{equation*}
Following this parameterization, each iteration of iterative refinement under our model takes the form,
\begin{equation*}
{\boldsymbol{y}}_{t-1} \leftarrow \frac{1}{\sqrt{\alpha _{t}}} \left({\boldsymbol{y}}_{t} - \frac{1-\alpha _{t}}{ \sqrt{1 - \gamma _{t}}} f_{\theta }({\boldsymbol{x}}, {\boldsymbol{y}}_{t}, \gamma _{t}) \right) + \sqrt{1 - \alpha _{t}}{\boldsymbol{\epsilon }}_{t},
\end{equation*}
where ${\boldsymbol{\epsilon }}_{t} \sim \mathcal {N}(\boldsymbol{0},\boldsymbol{I})$.
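The full sampling loop of Algorithm 2 can be sketched as follows; `denoise_fn`, `alphas`, and `gammas` are placeholders for the trained model and its inference noise schedule.

```python
import torch

@torch.no_grad()
def sr3_sample(denoise_fn, x_lr, alphas, gammas, shape):
    """Iterative refinement per Eqs. (8)-(11); alphas/gammas are indexed by t = 1..T."""
    y = torch.randn(shape)                                     # y_T ~ N(0, I)
    T = len(alphas)
    for t in range(T, 0, -1):
        alpha_t, gamma_t = alphas[t - 1], gammas[t - 1]
        eps_pred = denoise_fn(x_lr, y, torch.full((shape[0],), gamma_t))
        # Posterior mean, Eq. (11)
        y = (y - (1 - alpha_t) / (1 - gamma_t) ** 0.5 * eps_pred) / alpha_t ** 0.5
        if t > 1:
            y = y + (1 - alpha_t) ** 0.5 * torch.randn_like(y)  # add noise with variance (1 - alpha_t)
    return y
```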
2.4 Justification of the Training Objective
Following Ho et al. [1], we justify the choice of the training objective in (6) for the probabilistic model outlined in (9) from a variational lower bound perspective. If the forward diffusion process is viewed as a fixed approximate posterior to the inference process, one can derive the following variational lower bound on the marginal log-likelihood
\begin{align*}
\mathbb {E}_{({\boldsymbol{x}},{\boldsymbol{y}}_{0})}\log p_\theta ({\boldsymbol{y}}_{0} | {\boldsymbol{x}}) &\geq \mathbb {E}_{{\boldsymbol{x}},{\boldsymbol{y}}_{0}}\mathbb {E}_{q({\boldsymbol{y}}_{1:T}|{\boldsymbol{y}}_{0})}\bigg [ \log p({\boldsymbol{y}}_{T}) \\
& + \sum _{t \geq 1} \log \frac{p_{\theta } ({\boldsymbol{y}}_{t-1} | {\boldsymbol{y}}_{t}, {\boldsymbol{x}})}{q({\boldsymbol{y}}_{t}|{\boldsymbol{y}}_{t-1})} \bigg ]. \tag{12}
\end{align*}
Given the particular parameterization of the inference process outlined above, one can show [1] that the negative variational lower bound can be expressed as the following simplified loss, up to a constant weighting of each term for each time step
\begin{align*}
\mathbb {E}_{{\boldsymbol{x}},{\boldsymbol{y}}_{0},{\boldsymbol{\epsilon }}} \!\sum _{t=1}^{T} \frac{1}{T} \bigg \Vert \boldsymbol{\epsilon } - \boldsymbol{\epsilon }_{\theta }({\boldsymbol{x}}, \sqrt{\gamma _{t}} {\boldsymbol{y}}_{0} + \sqrt{1 - \gamma _{t}}\boldsymbol{\epsilon }, \gamma _{t}) \bigg \Vert ^{2}_{2}, \tag{13}
\end{align*}
Our approach is also linked to denoising score matching [35], [36], [37], [38] for training unnormalized energy functions for density estimation. These methods learn a parametric score function to approximate the gradient of the empirical data log-density. To make sure the gradient of the data log-density is well-defined, one often replaces each data point with a Gaussian distribution with a small variance. Song and Ermon [39] advocate the use of a multi-scale Gaussian mixture as the target density, where each data point is perturbed with different amounts of Gaussian noise, so that Langevin dynamics starting from pure noise can still yield reasonable samples.
One can view our approach as a variant of denoising score matching in which the target density is given by a mixture of the Gaussians $q(\widetilde{\boldsymbol{y}} \mid {\boldsymbol{y}}_{0}, \gamma)$ implied by (5), one per noise level $\gamma$ encountered during training. For a fixed ${\boldsymbol{y}}_{0}$ and $\gamma$, the score of this density is
\begin{equation*}
\frac{\mathrm{d} \log q(\widetilde{\boldsymbol{y}}\mid {\boldsymbol{y}}_{0}, \gamma)}{\mathrm{d} \widetilde{\boldsymbol{y}}} ~=~ -\frac{\widetilde{\boldsymbol{y}}- \sqrt{\gamma }\,{\boldsymbol{y}}_{0}}{1-\gamma } ~=~ -\frac{{\boldsymbol{\epsilon }}}{\sqrt{1-\gamma }}, \tag{14}
\end{equation*}
so the denoising model $f_\theta$, which is trained to predict ${\boldsymbol{\epsilon }}$, estimates this score up to a scale factor of $-1/\sqrt{1-\gamma }$.
2.5 SR3 Model Architecture and Noise Schedules
The SR3 architecture is similar to the U-Net in DDPM [1], with self-attention and modifications adapted from [40]; i.e., we replace the original DDPM residual blocks with residual blocks from BigGAN [41], and we re-scale skip connections by $1/\sqrt{2}$.
To condition the model on the input ${\boldsymbol{x}}$, we up-sample the low-resolution image to the target resolution using bicubic interpolation, and concatenate the result channel-wise with the noisy target image ${\boldsymbol{y}}_{t}$ as input to the U-Net.
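A sketch of this conditioning is given below; the U-Net internals are omitted and `unet` is a placeholder for the denoising network.

```python
import torch
import torch.nn.functional as F

def condition_and_denoise(unet, x_lr, y_t, gamma):
    """Up-sample the low-resolution input and concatenate it with y_t channel-wise."""
    x_up = F.interpolate(x_lr, size=y_t.shape[-2:], mode='bicubic', align_corners=False)
    inp = torch.cat([x_up, y_t], dim=1)   # (B, 2*C, H, W) input to the U-Net
    return unet(inp, gamma)               # gamma is provided as additional conditioning
```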
For our training noise schedule, we follow [33], and use a piece-wise uniform distribution for the noise level $\gamma$: we first sample a time step $t \sim U\lbrace 1, \ldots, T\rbrace$, and then sample $\gamma \sim U(\gamma_{t-1}, \gamma_{t})$.
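Our reading of this sampler [33] can be sketched as follows; the concrete $\gamma$ values are placeholders, and this sketch can be plugged into the training step shown in Section 2.2.

```python
import numpy as np

def sample_gamma(gammas, batch_size, rng):
    """Piece-wise uniform sampling of the noise level:
    pick t ~ U{1..T}, then gamma ~ U(gamma_{t-1}, gamma_t)."""
    T = len(gammas)
    bounds = np.concatenate([[1.0], gammas])   # gamma_0 = 1 (no noise); gammas decrease with t
    t = rng.integers(1, T + 1, size=batch_size)
    lo, hi = bounds[t], bounds[t - 1]          # interval (gamma_t, gamma_{t-1})
    return lo + (hi - lo) * rng.random(batch_size)
```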
For sample generation (or inference), early diffusion models [1], [40] required 1-2 K diffusion steps, making generation slow, especially for high resolution images. For more efficient generation we instead adapt recent techniques [33]. In particular, by conditioning the denoising model on the continuous noise level $\gamma$ rather than the discrete step index $t$, the noise schedule used at inference need not match the training schedule; given a budget of refinement steps, we can choose a much shorter inference schedule (via hyper-parameter search, cf. Section 4.6) without retraining the model.
Fig. 3 depicts the architecture used for both SR3 and our regression baselines. This denoising U-Net takes as input a noisy high resolution image and a low-resolution conditioning image that has been interpolated and up-sampled to the target resolution. Task-dependent parameters are summarized in Table 1. These architectures have more parameters than many existing networks for image super-resolution, motivated in part by other domains where performance scales with model capacity and dataset size. As shown in Section 4.5, even a simple Regression model with a large architecture can perform surprisingly well.
Depiction of the U-Net architecture of SR3. The low-resolution input image is interpolated and up-sampled to the target resolution, then concatenated channel-wise with the noisy target image before being passed through the network.
Related Work
SR3 is inspired by recent work on deep generative models and recent learning-based approaches to super-resolution.
Generative Models. Autoregressive models (ARs) [43], [44] can model exact data log likelihood, capturing rich distributions. However, their sequential generation of pixels is expensive, limiting application to low-resolution images. Normalizing flows [9], [10], [45] improve on sampling speed while modelling the exact data likelihood, but the need for invertible parameterized transformations with a tractable Jacobian determinant limits their expressiveness. VAEs [7], [46] offer fast sampling, but tend to underperform GANs and ARs in image quality [8]. Generative Adversarial Networks (GANs) [11] are popular for class conditional image generation and super-resolution. Nevertheless, the inner-outer loop optimization often requires tricks to stabilize training [19], [20], and conditional tasks like super-resolution usually require an auxiliary consistency-based loss to avoid mode collapse [16]. Cascades of GAN models have been used to generate higher resolution images [47].
Score matching [35] models the gradient of the data log-density with respect to the image. Score matching on noisy data, called denoising score matching [36], is equivalent to training a denoising autoencoder, and to DDPMs [1]. Denoising score matching over multiple noise scales with Langevin dynamics sampling from the learned score functions has recently been shown to be effective for high quality unconditional image generation [1], [24]. These models have also been generalized to continuous time [40]. Denoising score matching and diffusion models have also found success in shape generation [48], and speech synthesis [33]. We extend this method to super-resolution, with a simple learning objective, a constant number of inference generation steps, and high quality generation.
Super-Resolution. Numerous super-resolution methods have been proposed [16], [30], [31], [32], [49], [50], [51], [52], [53]. Much of the early work on super-resolution is regression based and trained with an MSE loss [49], [52], [54], [55], [56]. As such, they effectively estimate the posterior mean, yielding blurry images when the posterior is multi-modal [16], [17], [31]. Our regression baseline defined below is also a one-step regression model trained with MSE (cf. [52], [56]), but with a large U-Net architecture. SR3, by comparison, relies on a series of iterative refinement steps, each of which is trained with a regression loss. This difference permits our iterative approach to capture richer distributions. Further, rather than estimating the posterior mean, SR3 generates samples from the target posterior.
Autoregressive models have been used successfully for super-resolution and cascaded up-sampling [15], [18], [57], [58]. Nevertheless, the expense of inference limits their applicability to low-resolution images. SR3 can generate high-resolution images, e.g., 1024×1024, but with a constant number of refinement steps (often no more than 100).
GAN-based super-resolution methods have also found considerable success [12], [16], [17], [30], [31], [32], [59]. FSRGAN [14] and PULSE [17] in particular have demonstrated high quality face super-resolution results. However, many such GAN based methods are generally difficult to optimize, and often require auxiliary objective functions to ensure consistency with the low resolution inputs.
Normalizing flows have been used for super-resolution with a multi-scale approach [32], [60]. They are competitive with GAN models, and are capable of generating 1024×1024 images due in part to their efficient inference process. SR3 uses a series of reverse diffusion steps to transform a Gaussian distribution to an image distribution while flows require a deep and invertible network.
Experiments
We assess the effectiveness of SR3 in super-resolution on faces, natural images, and synthetic images obtained from a low-resolution generative model. The latter enables high-resolution image synthesis using cascaded models.
4.1 Datasets
We follow previous work [17], training face super-resolution models on Flickr-Faces-HQ (FFHQ) [61] and evaluating on CelebA-HQ [12]. For natural image super-resolution, we train on ImageNet 1K [62] and use the dev split for evaluation. We train unconditional face and class-conditional ImageNet generative models using DDPM on the same datasets. For training and testing, low-resolution images are obtained by bicubic down-sampling with anti-aliasing enabled. For ImageNet, we discard images whose shorter side is less than the target resolution. Following [41], we take the largest central crop, which is then resized to the target resolution using area resampling to obtain the high-resolution image.
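A sketch of this preprocessing using PIL is shown below; the exact filters in our pipeline may differ slightly, but the crop-then-resize structure is as described above.

```python
from PIL import Image

def make_pair(path, hr_size, scale):
    """Largest central crop -> HR via area resampling -> LR via anti-aliased bicubic."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    s = min(w, h)                                    # largest central square crop
    img = img.crop(((w - s) // 2, (h - s) // 2, (w + s) // 2, (h + s) // 2))
    hr = img.resize((hr_size, hr_size), Image.BOX)   # area resampling to target resolution
    lr = hr.resize((hr_size // scale,) * 2, Image.BICUBIC)  # PIL's bicubic resize is anti-aliased
    return lr, hr
```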
4.2 Training Details
We train SR3 and regression models for 1 M training steps, with a batch size of 256; this typically takes about four days on 64 TPUv3 chips. Given the large model capacity and large datasets, the models often continue to improve well beyond 1 M steps. We choose a checkpoint for the regression baseline based on peak PSNR on the held-out set. We do not perform any checkpoint selection on SR3 models and simply select the latest checkpoint. Consistent with [1], we use the Adam optimizer with a linear warmup schedule over 10 K training steps, followed by a fixed learning rate of 1e-4 for SR3 models and 1e-5 for regression models. We use a dropout rate of 0.2 for the 16×16 → 128×128 face super-resolution models.
4.3 Evaluation
We evaluate SR3 models on face and natural images:
Face super-resolution at $16\!\times \!16 \!\rightarrow \! 128\!\times \!128$ and $64\!\times \!64 \!\rightarrow \! 512\!\times \!512$, trained on FFHQ and evaluated on CelebA-HQ.
Natural image super-resolution at $64\!\times \!64 \!\rightarrow \! 256\!\times \!256$ and $56\!\times \!56 \!\rightarrow \! 224\!\times \!224$ pixels on ImageNet [62].
Unconditional $1024\!\times \!1024$ face generation by a cascade of 3 models, and class-conditional $256\!\times \!256$ ImageNet image generation by a cascade of 2 models.
We compare SR3 with EnhanceNet [31], ESRGAN [30], SRFlow [32], FSRGAN [14] and PULSE [17]. We also compare to a Regression baseline that shares the same architecture and model capacity as SR3. Importantly, this enables one to directly assess the advantages of iterative refinement over a single step regression model, ablating the effects of model size, architecture, and training data. Performance is assessed qualitatively and quantitatively, using human evaluation, FID scores and the classification accuracy of a pre-trained model on super-resolution outputs.
4.4 Qualitative Results
Fig. 4 compares SR3 and our Regression baseline for a 64×64
Super-resolution results (64×64
Three samples from SR3 applied to ImageNet test images (16×16
Fig. 6 shows outputs of our face super-resolution models (16×16 → 128×128 and 64×64 → 512×512).
Results of a SR3 model (64×64
Further qualitative comparisons are shown in Fig. 7, where SR3 is compared to SoTA GAN models [30], [31] and a Normalizing Flow model [32]. While the GAN- and Flow-based methods produce sharp details, they also tend to generate artifacts in regions with fine-grained texture (e.g., see the face of the jaguar, and the structure of the dockyard). By comparison, SR3 produces sharp images with plausible details and minimal artifacts. As discussed above, while the high resolution details are realistic, they are not expected to perfectly match the original reference image.
SR3 and state-of-the-art methods on 4× natural image super-resolution.
4.5 Quantitative Evaluation
Table 2 shows PSNR, SSIM [63], and Consistency scores for 16×16 → 128×128 face super-resolution.
4.5.1 Consistency With Low-Resolution Inputs
It is important for super-resolution outputs to be consistent with their low-resolution inputs. To measure this consistency, we compute the MSE between the down-sampled outputs and the low-resolution inputs. Table 2 shows that SR3 achieves the best consistency error, beating PULSE and FSRGAN by a significant margin and even slightly outperforming the Regression baseline. This demonstrates a key advantage of SR3 over state-of-the-art GAN-based methods: SR3 does not require any auxiliary objective function to ensure consistency with the low-resolution inputs.
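Concretely, the consistency score can be computed as follows (a sketch; we assume the same anti-aliased bicubic downsampling as the data pipeline described in Section 4.1):

```python
import torch
import torch.nn.functional as F

def consistency_mse(sr_output, lr_input, scale):
    """MSE between the downsampled super-resolution output and the low-resolution input."""
    down = F.interpolate(sr_output, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False, antialias=True)
    return F.mse_loss(down, lr_input).item()
```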
4.5.2 Classification Accuracy on Super-Resolution Outputs
Table 4 compares the outputs of 4× natural image super-resolution models on object classification accuracy. Following [31], [64] we apply 4× super-resolution models to 56×56 center crops from the validation set of ImageNet. Then, we report classification accuracy of a pre-trained ResNet-50 [65] model. Since SR3 models are trained on the task of 64×64
SR3 outperforms existing methods by a significant margin on both top-1 and top-5 classification errors, suggesting higher perceptual quality. The strong performance of the Regression model can be attributed to the model capacity and architecture, and in part because it was trained on ImageNet data. The improvement of SR3 over Regression can be viewed as a direct indication of the power of the diffusion framework and iterative refinement, as both models use the same architecture. These results also reaffirm the limits of conventional reference-based metrics in super-resolution, like PSNR and SSIM, for which the baseline Regression model exhibits higher performance.
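The evaluation protocol can be sketched as follows; the torchvision ResNet-50 weights and preprocessing shown here are stand-ins for the pre-trained classifier used in our experiments, and super-resolved images are assumed to lie in [0, 1].

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def sr_classification_accuracy(sr_model, lr_crops, labels):
    """Top-1 accuracy of a pre-trained ResNet-50 on super-resolved 224x224 crops."""
    weights = ResNet50_Weights.IMAGENET1K_V1
    classifier = resnet50(weights=weights).eval()
    preprocess = weights.transforms()          # standard resize / crop / normalization
    sr_images = sr_model(lr_crops)             # e.g., 56x56 center crops -> 224x224 outputs
    logits = classifier(preprocess(sr_images))
    return (logits.argmax(dim=1) == labels).float().mean().item()
```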
4.5.3 Human Evaluation (2AFC)
Direct human evaluation is one of the most desirable metrics for evaluating super-resolution models. While mean opinion score (MOS) is commonly used to measure image quality in this context, forced choice pairwise comparison has been found to be a more reliable method for such subjective quality assessments [69]. Furthermore, standard MOS studies do not capture consistency between low-resolution inputs and high-resolution outputs.
We use a 2-alternative forced-choice (2AFC) paradigm to measure how well humans can discriminate true images from those generated from a model. In Task-1 subjects were shown a low resolution input in between two high-resolution images, one being the real image (ground truth), and the other generated from the model. Subjects were asked “Which of the two images is a better high quality version of the low resolution image in the middle?” This task takes into account both image quality and consistency with the low resolution input. Task-2 is similar to Task-1, except that the low-resolution image was not shown, so subjects only had to select the image that was more photo-realistic. They were asked “Which image would you guess is from a camera?” Subjects viewed images for 3 seconds before responding. The source code for human evaluation can be found here.1
The subject fool rate is the fraction of trials on which a subject selects the model output over ground truth. Our fool rates for each model are based on 50 subjects, each of whom was shown 50 of the 100 images in the test set. Fig. 9 shows the fool rates for Task-1 (top), and for Task-2 (bottom). In both experiments, the fool rate of SR3 is close to 50%, indicating that SR3 produces images that are both photo-realistic and faithful to the low-resolution inputs. We find similar fool rates over a wide range of viewing durations up to 12 seconds.
Comparison on 4× face super-resolution (16×16
Face super-resolution human fool rates (higher is better, for photo-realistic samples one would expect a fool rate close to 50%). Outputs of four models are compared to ground truth. (top) Task-1, subjects are shown low-resolution inputs. (bottom) Task-2, inputs are not shown.
The fool rates for FSRGAN and PULSE in Task-1 are lower than the Regression baseline and SR3. The strength of SR3 over the Regression model reflects the benefits of iterative refinement in the diffusion model, since both models share the same architecture. We speculate that the PULSE optimization has failed to converge to high resolution images sufficiently close to the inputs. Indeed, when asked solely about image quality in Task-2 (Fig. 9 (bottom)), the PULSE fool rate increases significantly.
The fool rate for the Regression baseline is lower in Task-2 (Fig. 9 (bottom)) than Task-1. The regression model tends to generate images that are blurry, but nevertheless faithful to the low resolution input. We speculate that in Task-1, given the inputs, subjects are influenced by consistency, while in Task-2, ignoring consistency, they instead focus on image sharpness. SR3 and Regression samples used for human evaluation are provided here2
The results of a similar study with natural images, comparing SR3 with Regression, GAN-based models [30], [31] and a Flow-based model [32] on a subset of the ImageNet validation set are shown in Fig. 10. In this study images were displayed for 6 seconds and the input images were not displayed (i.e., Task-2). We used somewhat longer display times because natural images are more complex and cluttered than the face images. We did not show the input image because inconsistency between inputs and model outputs did not appear to be problematic with the baselines used. From Fig. 10 one can see that SR3 outperforms baselines by a substantial margin, suggesting higher perceptual quality. The regression model is significantly weaker in this case, which we attribute to the longer viewing time which makes it easier to discern the image blur.
ImageNet super-resolution fool rates. Model outputs are compared to ground truth with pairs of images shown for 6 seconds.
To further appreciate the experimental results, it is useful to visually compare outputs of different models on the same inputs, as in Fig. 8. FSRGAN exhibits distortions in the face region and struggles to generate glasses properly (e.g., top row). It also fails to recover texture details in the hair region (see bottom row). PULSE often produces images that differ significantly from the input image, both in the shape of the face and the background, and sometimes in gender too (see bottom row), presumably because the optimization fails to find a sufficiently good minimum. As noted above, our Regression baseline produces results that are consistent with the input but typically quite blurry. By comparison, the SR3 results are consistent with the input and contain more detailed image structure.
In addition to the aggregate fool rate results in Fig. 9, it is also interesting to inspect images that attain highest and lowest fool rates for a given technique. This provides insight into the nature of the problems that models exhibit, as well as cases in which the model outputs are good enough to regularly confuse people.
Fig. 11 displays the outputs of PULSE [17] and SR3 with the lowest and highest fool rates for Task-1 (the conditional task). Notice that images from PULSE for which the fool rate is low have obvious distortions, and the fool rates are lower than 10%. For SR3, by comparison, the images with the lowest fool rates are still reasonably good, with much higher fool rates of 14% and 19%. It is interesting to see that the best fool rates for SR3 on Task-1 are 84% and 88%. The corresponding original images for these examples are somewhat noisy, and as a consequence, interestingly, many subjects prefer the SR3 outputs.
Test cases with the lowest and highest fool rates for PULSE and SR3 in Task-1 (which compares model outputs to reference images, in the presence of low-resolution inputs). For privacy reasons, reference images are not shown.
4.6 Generation Speed
As discussed in Section 2.5, diffusion models typically require a large number of refinement steps during sample generation, and are therefore expensive compared to GANs. For more efficient inference, given a generation budget, SR3 determines the noise schedule using hyper-parameter search. Fig. 12 shows the resulting trade-off between image quality (FID) and efficiency (number of diffusion steps) for a 64×64
FID score versus number of inference steps for 64×64
4.7 Cascaded High-Resolution Image Synthesis
We also study cascaded image generation, where SR3 models at different scales are chained together with generative models, enabling high-resolution image synthesis. Cascaded generation allows one to train different models in parallel, and each model in the cascade solves a simpler task, requiring fewer parameters and less computation for training. Inference with cascaded models is also more efficient, especially for iterative refinement models. With cascaded generation we found it effective to use more refinement steps at low-resolutions, and fewer steps at higher resolutions. This was much more efficient than generating directly at high resolution without sacrificing image quality.
For cascaded face generation, as depicted in Fig. 14, we train a DDPM [1] model for unconditional 64×64 face generation, and chain it with two 4× SR3 models (64×64 → 256×256 and 256×256 → 1024×1024) to synthesize 1024×1024 face images.
Synthetic 1024×1024 faces, sampled from an unconditional 64×64 model, followed by two 4× SR3 models.
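The cascade amounts to a simple composition of samplers; in this sketch, `ddpm_sample_64` and the two SR3 samplers are placeholders for the trained models.

```python
import torch

@torch.no_grad()
def sample_faces_1024(ddpm_sample_64, sr3_64_to_256, sr3_256_to_1024, batch_size):
    """Unconditional 1024x1024 face synthesis via a 3-model cascade."""
    y64 = ddpm_sample_64(batch_size)     # unconditional 64x64 samples
    y256 = sr3_64_to_256(y64)            # 4x SR3: 64x64 -> 256x256
    y1024 = sr3_256_to_1024(y256)        # 4x SR3: 256x256 -> 1024x1024
    return y1024
```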
We also trained a set of Improved DDPM [70] models for class-conditional 64×64 ImageNet generation, and cascade them with a 4× SR3 model (64×64 → 256×256) to synthesize class-conditional 256×256 ImageNet samples.
Class-conditional 256×256 ImageNet samples. Each row represents samples from a specific ImageNet class, from top to bottom: Goldfish, Red Fox, Balloon, Monarch Butterfly, Church, Fire Truck. For a given label, we sample a 64×64 image from a class-conditional diffusion model, and then apply a 4× SR3 model.
Synthetic 256×256 ImageNet images. We draw a label at random, sample a 64×64 image from the corresponding class-conditional diffusion model, and then apply a 4× SR3 model.
As a way to quantitatively evaluate sample quality, Table 5 reports FID scores for the resulting class-conditional ImageNet samples. Our 2-stage model improves on VQ-VAE-2 [71], is comparable to deep BigGANs [41] at a truncation factor of 1.5, but underperforms them at a truncation factor of 1.0. Unlike BigGAN, our diffusion models do not provide control of sample quality versus sample diversity; this remains an interesting avenue for future research. Nichol and Dhariwal [70] concurrently trained cascaded generation models using super-resolution conditioned on class labels (SR3 is not conditioned on class labels), and also observed a similar trend with improved FID scores. The effectiveness of cascaded image generation indicates that SR3 models are robust to the precise distribution of inputs (i.e., the specific form of anti-aliasing and downsampling).
4.8 Ablation Studies on Cascaded Models
Table 6 reports results of ablations on a
We also explore the choice of
Discussion and Conclusion
SR3 leverages conditional diffusion models to address single-image super-resolution. It initializes the output image with pure Gaussian noise and iteratively refines it with a denoising model conditioned on the low-resolution input. We find that SR3 works well on natural and face images across a wide range of magnification factors, and as part of a cascading pipeline to generate high-resolution images. SR3 models outperform several GAN and Normalizing Flow baselines. Human studies, in which subjects are asked to discriminate model outputs from real images, yield SR3 fool rates close to 50% on faces and 40% on natural images, indicating that SR3 produces high-fidelity outputs. The success of SR3 is in part a function of large model capacity and large training datasets, motivating further exploration of scaling in future super-resolution work.
One practical issue with diffusion models is the computation cost of many refinement steps during inference. Our results indicate that one can trade sample quality for generation speed and achieve decent results in just 4 refinement steps. That said, recent and concurrent work proposes alternative approaches that can result in higher quality fast samplers for diffusion models [73], [74], [75], [76]. We further note that the use of self-attention, while powerful, also constrains the output dimension of our model; this will be addressed in future versions of SR3.
Finally, bias is an important issue with all generative models, including SR3. While in theory our log-likelihood based objective is mode covering (e.g., unlike some GAN-based objectives), we do observe some indication of mode dropping in SR3 outputs, e.g., the model consistently generates nearly the same output during sampling when conditioned on the same input. We also observe that the model generates overly smooth skin texture in face super-resolution, dropping moles, pimples, and piercings found in the reference image. SR3 should not be used in super-resolution products without further study of its potential biases. Nevertheless, diffusion models like SR3 may be useful in reducing dataset bias by generating synthetic data for underrepresented groups.
ACKNOWLEDGMENTS
We thank Jimmy Ba, Geoff Hinton, and Shingai Manjengwa, who kindly provided their face images for testing, and Ben Poole, Samy Bengio, and the Google Brain team for discussions and technical assistance. The authors thank the authors of [17] for generously providing samples for human evaluation.