Journals & Magazines >Computational Visual Media >Volume: 11 Issue: 1

Exploring Contextual Priors for Real-World Image Super-Resolution

Abstract:

Real-world blind image super-resolution is a challenging problem due to the absence of target high resolution images for training. Inspired by the recent success of the s...Show More

Metadata

Abstract:

Real-world blind image super-resolution is a challenging problem due to the absence of target high resolution images for training. Inspired by the recent success of the single image generation based method SinGAN, we tackle this challenging problem with a refined model SR-SinGAN, which can learn to perform single real image super-resolution. Firstly, we empirically find that downsampled LR input with an appropriate size can improve the robustness of the generation model. Secondly, we introduce a global contextual prior to provide semantic information. This helps to remove distorted pixels and improve the output fidelity. Finally, we design an image gradient based local contextual prior to guide detail generation. It can alleviate generated artifacts in smooth areas while preserving rich details in densely textured regions (e.g., hair, grass). To evaluate the effectiveness of these contextual priors, we conducted extensive experiments on both artificial and real images. Results show that these priors can stabilize training and preserve output fidelity, improving the generated image quality. We furthermore find that these single image generation based methods work better for images with repeated textures compared to general images.

Published in: Computational Visual Media ( Volume: 11, Issue: 1, February 2025)

Page(s): 159 - 177

Date of Publication: 24 February 2025

ISSN Information:

DOI: 10.26599/CVM.2025.9450303

Contents

SECTION 1

Introduction

Image super-resolution (SR) is a classic low-level vision task that generates a high-resolution (HR) image from a low-resolution (LR) observation. According to different degradation models and conditions, we can divide SR works into four groups. The first uses the most straightforward setting and has attracted the most attention. It assumes that the downsampling kernel is known (e.g., bicubic) and that the LR image is noise-free. Under such assumptions, we can synthesize a large set of HR-LR training images and directly learn the mapping function. Most previous works, from SRCNN [2] to RCAN [3], adopt this assumption, with great success. The second group deals with a more complicated case, where the degradation model may include blurring, noise, and diverse downsampling kernels. This is closer to real usage, but the degradation model is still known and is limited to specific types of degraded images. Recent studies introduce additional priors, e.g., SRMD [4], or estimate model parameters, e.g., IKC [5], using deep models. The third group takes a step further and directly solves the real-world SR problem. They build new real-world SR datasets [6], [7] and adopt semi-supervised or unsupervised frameworks, e.g., CinCGAN [8]. They are potentially capable of dealing with a specific set of real-world images, such as photos taken using a phone. The last group faces the most challenging scenario, where the only information available is the input LR image. Specially, it is infeasible to synthesize paired training data for supervised learning (as in the first and second groups of methods), or to collect a set of LR images with similar degradation types for unsupervised learning (as for the third group). As the problem is very complicated, only a few works [1], [9], [10] attack this topic. In this paper, we focus on the hard but general case, and contribute to the last group of methods. For a detailed classification of recent blind SR methods, see Ref. [11]. One remaining research gap naturally reveals itself-implicit modelling from a single image, which our work addresses.

To distinguish our SR problem from other SR tasks, we term it blind image super resolution from a single image (BSI-SR). Blind indicates that the degradation model is unknown, while single image means that no other training images are available. The only information comes from the input LR image itself. Pioneering works [12]–[14] have exploited self-similarity across different scales of the LR image. Later approaches [9] propose to learn a deep model purely from internal patches. These methods have shown the possibility of performing SR from a single image, but are limited in performance and applications. Their most obvious weakness is that they depend strongly on internal similar patches, so do not work well on images with few repetitive patterns.

Most recently, Shaham et al. [1] introduce an unconditional framework, SinGAN, to generate new samples from a single image. It learns internal distributions of the input with a GAN pyramid, and is generally effective for natural images. One of its applications is image SR. This model is not restricted to specific image types, and has provided a new perspective for the blind SR problem. Nevertheless, directly applying SinGAN to SR has several problems. Firstly, it poorly preserves image structures, such as face and building outlines. Secondly, it usually generates wrong details and artifacts, especially in smooth areas. Note that structure preservation and detail generation are the two main issues used to evaluate image SR. Figures 1 and 3 show some results of SinGAN for natural images, face images, and noisy images, where we can observe distorted textures and inaccurate details.

Why does the original SinGAN fail to achieve satisfactory SR results? They can be mainly ascribed to the design of its generative network. Note that the primary goal of SinGAN is to generate diverse samples that have similar content but different configurations, so the learned patch distribution lacks global structure information. This makes the original SinGAN unsuitable for the SR task. To address the above issue, we introduce a global contextual prior to SinGAN, and propose an SR specific variant-SR-SinGAN. Taking the pipeline of SinGAN (see Fig. 2), we add a global contextual prior module and a local contextual prior module. The global contextual prior helps to preserve image structures, while the local contextual prior improves generated details. Extensive experiments and ablation studies demonstrate the effectiveness of these priors; the final SR-SinGAN achieves significantly superior performance to SinGAN in all cases, including natural images, face images, and noisy images.

Fig. 1

Super-resolution results of 4x for single image generated models. (a) Bicubic upsample. (b) SR results of SinGAN [1]. (c) SR results of the proposed SR-SinGAN. SR-SinGAN outperforms SinGAN in both outlines and structures.

Exploring Contextual Priors for Real-World Image Super-Resolution

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Introduction

Related Work

2.1 Single Image Super-Resolution

2.2 Blind Image Super-Resolution

2.3 Internal Learning Based Image Super-Resolution

Method

3.1 Problem and Approach

3.2 SinGAN Revisited

3.2.1 Framework

3.2.2 Training

3.2.3 Testing

3.3 Motivation

3.3.1 Natural Image Super-Resolution

3.3.2 Noisy Image Super-Resolution

3.3.3 Face Image Super-Resolution

3.3.4 Analysis

3.4 Global Contextual Prior

3.5 Local Contextual Prior

3.6 Initial Contextual Prior

3.6.1 Real Initial Input

3.6.2 Training Size of Initial Input

Experiments

4.1 Datasets and Evaluation Metrics

4.2 Comparator Methods

4.3 Training Details

4.4 Ablation Study

4.4.1 Initial Contextual Prior

4.4.2 Global Contextual Prior

4.4.3 Local Contextual Prior

4.5 Contextual Prior Evaluation Using LAM

4.6 Applications

4.6.1 Natural Image Super-Resolution

4.6.2 Face Image Super-Resolution

4.6.3 Noisy Image Super-Resolution

4.7 Limitations

4.8 Computation Cost

Conclusions

Declaration of Competing Interest

References