Introduction
Image super-resolution (SR) is a classic low-level vision task that generates a high-resolution (HR) image from a low-resolution (LR) observation. According to the degradation models and conditions assumed, SR works can be divided into four groups. The first uses the most straightforward setting and has attracted the most attention. It assumes that the downsampling kernel is known (e.g., bicubic) and that the LR image is noise-free. Under such assumptions, we can synthesize a large set of HR-LR training pairs and directly learn the mapping function. Most previous works, from SRCNN [2] to RCAN [3], adopt this assumption, with great success. The second group deals with a more complicated case, where the degradation model may include blurring, noise, and diverse downsampling kernels. This is closer to real usage, but the degradation model is still known and is limited to specific types of degraded images. Recent studies introduce additional priors, e.g., SRMD [4], or estimate model parameters using deep models, e.g., IKC [5]. The third group takes a step further and directly solves the real-world SR problem, building new real-world SR datasets [6], [7] and adopting semi-supervised or unsupervised frameworks, e.g., CinCGAN [8]. Such methods are potentially capable of dealing with a specific set of real-world images, such as photos taken with a phone. The last group faces the most challenging scenario, where the only information available is the input LR image. In particular, it is infeasible to synthesize paired training data for supervised learning (as in the first and second groups of methods), or to collect a set of LR images with similar degradation types for unsupervised learning (as in the third group). As the problem is very complicated, only a few works [1], [9], [10] attack this topic. In this paper, we focus on this hard but general case, and contribute to the last group of methods. For a detailed classification of recent blind SR methods, see Ref. [11]. One remaining research gap naturally reveals itself: implicit modelling from a single image, which our work addresses.
To distinguish our SR problem from other SR tasks, we term it blind image super-resolution from a single image (BSI-SR). Blind indicates that the degradation model is unknown, while single image means that no other training images are available: the only information comes from the input LR image itself. Pioneering works [12]–[14] exploited self-similarity across different scales of the LR image. Later approaches [9] learn a deep model purely from internal patches. These methods have shown the possibility of performing SR from a single image, but are limited in performance and applications. Their most obvious weakness is that they depend strongly on internal similar patches, so they do not work well on images with few repetitive patterns.
Most recently, Shaham et al. [1] introduced an unconditional framework, SinGAN, to generate new samples from a single image. It learns the internal patch distribution of the input with a GAN pyramid, and is generally effective for natural images. One of its applications is image SR. This model is not restricted to specific image types, and has provided a new perspective on the blind SR problem. Nevertheless, directly applying SinGAN to SR has several problems. Firstly, it poorly preserves image structures, such as face and building outlines. Secondly, it usually generates wrong details and artifacts, especially in smooth areas. Note that structure preservation and detail generation are the two main criteria used to evaluate image SR. Figures 1 and 3 show results of SinGAN for natural images, face images, and noisy images, where we can observe distorted textures and inaccurate details.
Why does the original SinGAN fail to achieve satisfactory SR results? The failures can be mainly ascribed to the design of its generative network. The primary goal of SinGAN is to generate diverse samples that have similar content but different configurations, so the learned patch distribution lacks global structure information. This makes the original SinGAN unsuitable for the SR task. To address this issue, we introduce a global contextual prior into SinGAN, and propose an SR-specific variant, SR-SinGAN. Building on the pipeline of SinGAN (see Fig. 2), we add a global contextual prior module and a local contextual prior module. The global contextual prior helps to preserve image structures, while the local contextual prior improves generated details. Extensive experiments and ablation studies demonstrate the effectiveness of these priors; the final SR-SinGAN achieves significantly better performance than SinGAN in all cases, including natural images, face images, and noisy images.
4× super-resolution results for single-image generation models. (a) Bicubic upsampling. (b) SR results of SinGAN [1]. (c) SR results of the proposed SR-SinGAN. SR-SinGAN outperforms SinGAN in both outlines and structures.
We have conducted extensive experiments on blind SR. The contributions of this work are summarized as follows:
an initial contextual prior, which uses a downsampled version of the LR image in place of Gaussian noise; this helps reconstruct realistic information about the target, while an appropriate input size during the initial training stage enhances the learning ability of the model,
an image-based global contextual prior (the training contextual prior) to preserve the structure of the generated image; unlike existing global context methods, we use downsampled versions of the LR image instead of learned features during training, and
a gradient-based noise prior (the local contextual prior) to generate faithful information and improve visual quality, allowing us to remove unwanted noise while preserving rich details.
Related Work
In this section, we briefly review the most relevant categories of related work on single image super-resolution, blind image super-resolution, and internal learning based image super-resolution.
Overview of the proposed method; the pyramid of GANs (a) belongs to the original SinGAN. For blind single image super-resolution, we introduce a global contextual prior and a local contextual prior.
2.1 Single Image Super-Resolution
Single image super-resolution (SISR) has been studied for many years. Some early approaches [15] rely on natural image statistics or image priors. Recently, deep learning has led to dramatic improvements in SR. Since Dong et al. [2] first proposed using a CNN for SR and achieved state-of-the-art performance, many CNN architectures have been studied for image restoration [16], [17]. Zhang et al. [3] present an efficient deep model to further improve SR performance. All the above-mentioned CNN-based methods aim to minimize the mean-square error (MSE) between the reconstructed image and the ground truth. To avoid overly smooth results and improve visual quality, GAN-based methods have been proposed, such as SRGAN [18], which combines an adversarial loss [19] and a perceptual loss as the final objective function to generate visually more pleasing images than MSE-loss based methods. Burst SR [20] differs in utilizing multiple noisy RAW images to generate a denoised, super-resolved RGB image.
The above methods cannot handle complicated types of degradation (e.g., noise, blurring) because of the assumed bicubic kernel. To deal with this problem, SRMD [4] integrates multiple types of degradation into a single SR network. To handle real images, BSRGAN [21] uses a more complex but practical degradation model, and Real-ESRGAN [22] introduces a high-order degradation modeling process to better simulate complex real-world degradation. However, training in all of these works is supervised, using large-scale synthetic paired data. Hence, these methods cannot be directly used for image restoration when paired training data are absent and the degradation types are unknown.
2.2 Blind Image Super-Resolution
Blind image super-resolution (blind SR) generalizes the problem by assuming an LR input with unknown degradations. An early attempt [23] at solving this problem explicitly estimates the unknown point spread function [5], [24]. A few recent works [6], [7] train on real-world SR datasets and adopt semi-supervised or unsupervised frameworks: e.g., the Cycle-in-Cycle network [8] learns a mapping from the original input image to a clean image space, using a framework that employs cycle consistency losses. The SR network itself is trained using only indirect supervision in the LR domain.
These works focus on the downsampling process to improve SR. However, SR is only performed on images matching the learned downsampling operation, so these methods are inapplicable to general real-world images. Bulat et al. [25] also focus on learning the downsampling process, but specifically address SR for face images, where strong content priors can be learned by the network. Unlike these methods, which need external training datasets to build models for specified tasks, we tackle the general SR problem using only the input image itself. To tackle the blind SR problem, some works [26]–[28] propose strategies to capture real LR-HR image pairs. However, these methods rely on complicated data collection procedures requiring specialized hardware, and are difficult and expensive to scale. In contrast, the proposed method does not need any additional data, greatly increasing its usefulness and applicability.
Some other works [5], [29], [30] also take advantage of priors to improve SR performance. An iterative kernel correction (IKC) method [5] was proposed for blur kernel estimation in blind SR. KernelGAN [30] proposes an image-specific internal GAN, and FKP [29] proposes a normalizing flow-based kernel prior for kernel modeling. However, it is very hard to obtain accurate kernels for very low-quality images.
2.3 Internal Learning Based Image Super-Resolution
Glasner et al. were the first to propose internal example-based methods [31], which exploit the self-similarity property to generate exemplar patches from the input image; several studies [31], [32] have further accelerated the implementation. Recently, some studies [2], [33] have taken advantage of deep networks to boost SR performance, and several internal patch based learning approaches [9], [10] have been proposed using deep convolutional neural networks (CNNs). ZSSR [9] is an image-specific SR approach that trains a lightweight network using only the test image itself, with extensive data augmentation. ZSSR assumes that internal information recurs at different scales of the target image. This method shows the possibility of SR from a single image, but is limited in performance and applications; its most obvious weakness is that it depends on internal similar patches, so it cannot work well on images with few repetitive patterns. Refs. [10], [34] propose the deep image prior (DIP) and Double-DIP, which reconstruct a degraded image by fine-tuning a randomly initialized CNN. DIP optimizes the model so that the generated image matches the target input, assuming that a deep CNN has a strong image generation capacity and tends to generate noise-free natural images. DIP has extensive applications to image super-resolution, denoising, and inpainting, but provides limited texture and detail generation; these methods tend to reconstruct smooth results due to the optimization objective. Consequently, as shown in Fig. 11, DIP may fail in tasks that require semantics beyond the target image.
The most closely related work to ours is SinGAN, which shows that it is possible to train a GAN on a single image to achieve various manipulation and restoration effects. However, SinGAN is designed to generate diverse samples with similar content, so the generated image lacks global structure information, as shown in Fig. 1. Thus, the original SinGAN cannot be directly applied to the SR task. Compared to SinGAN, the proposed method makes full use of the image information, so can provide richer semantics about the target image. It can thus restore input LR images more accurately, and generate more natural textures.
Method
3.1 Problem and Approach
Blind image super-resolution can be formulated with a degradation model
\begin{equation*}y=(x\otimes k)\downarrow_{s}+n\end{equation*}
where $y$ is the observed LR image, $x$ the underlying HR image, $k$ a blur kernel, $\downarrow_{s}$ downsampling by scale factor $s$, and $n$ additive noise. In the blind setting, both $k$ and $n$ are unknown, and only $y$ is available.
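As a concrete illustration, the following sketch (a hypothetical helper, assuming a grayscale NumPy image and a Gaussian blur kernel) synthesizes an LR observation under this degradation model:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, kernel_sigma=1.2, scale=4, noise_level=10.0):
    """Synthesize y = (x convolved with k), downsampled by s, plus noise n.

    hr           -- HR image as a float array in [0, 255] (grayscale here)
    kernel_sigma -- width of the blur kernel k; unknown in the blind setting
    scale        -- downsampling factor s
    noise_level  -- standard deviation of the additive Gaussian noise n
    """
    blurred = gaussian_filter(hr, sigma=kernel_sigma)       # x convolved with k
    lr = blurred[::scale, ::scale]                          # downsample by s
    lr = lr + np.random.normal(0.0, noise_level, lr.shape)  # add noise n
    return np.clip(lr, 0, 255)
```

In blind SR, neither `kernel_sigma` nor `noise_level` is known to the method; only the output of such a process is observed.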
We resort to a recent single image generation framework, SinGAN, to tackle the blind SR problem. In this section, we first revisit the basic SinGAN model. We then analyze its limitations in several SR cases: natural images, face images, and noisy images. To overcome these limitations, we introduce a global contextual prior and a local contextual prior.
3.2 SinGAN Revisited
3.2.1 Framework
SinGAN is an unconditional generative model that can be learned from a single natural image. As Fig. 2 shows, it consists of a pyramid of generators $\{G_{0},\ldots,G_{N}\}$, trained against a corresponding pyramid of downsampled versions of the image, where scale $N$ is the coarsest. Generation starts at the coarsest scale from a Gaussian noise map $z_{N}$ alone:
\begin{equation*}\hat{x}_{N}=G_{N}(z_{N})\tag{1}\end{equation*}
The goal of each finer-scale generator $G_{n}$ is to add the details missed by coarser scales; it takes a noise map $z_{n}$ together with an upsampled version of the previous output:
\begin{equation*}\hat{x}_{n}=G_{n}(z_{n},(\hat{x}_{n+1})\uparrow^{r}),\qquad n < N\tag{2}\end{equation*}
where $\uparrow^{r}$ denotes upsampling by the pyramid scale factor $r$.
All generators share a similar network structure: they are fully convolutional networks with 5 conv-blocks of the form Conv($3\times3$)-BatchNorm-LeakyReLU.
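For concreteness, a minimal PyTorch sketch of one pyramid level, following our reading of this design (the channel width and the residual formulation are illustrative assumptions):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # The basic SinGAN unit: Conv(3x3)-BatchNorm-LeakyReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ScaleGenerator(nn.Module):
    """One level G_n: five convolutional stages (four conv-blocks plus a
    final Conv-Tanh), generating residual detail over the coarser output."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            conv_block(3, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, noise, prev_up):
        # G_n(z_n, upsampled x_hat_{n+1}): add detail to the upsampled image
        return prev_up + self.body(noise + prev_up)
```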
3.2.2 Training
The multi-scale architecture is trained from the coarsest scale to the finest one. Once each GAN is trained, it is kept fixed. The training loss for the $n$th GAN combines an adversarial term and a reconstruction term:
\begin{equation*}\min\limits_{G_{n}}\max\limits_{D_{n}} L_{\text{adv}}(G_{n},D_{n})+\alpha L_{\text{rec}}(G_{n})\tag{3}\end{equation*}
The adversarial loss $L_{\text{adv}}$ penalizes the difference between the patch distribution of generated samples and that of the real image at scale $n$, while the reconstruction loss $L_{\text{rec}}$ requires that a specific, fixed set of input noise maps reproduces the real image exactly; $\alpha$ balances the two terms.
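Schematically, one generator update at scale $n$ might look as follows (a sketch, not the authors' code; the critic loss shown is the WGAN-style term, and the default `alpha` of 10 follows SinGAN's public configuration as an assumption):

```python
import torch.nn.functional as F

def generator_loss(G, D, z_n, prev_up, x_n, z_rec, prev_rec, alpha=10.0):
    """Adversarial term plus alpha times the reconstruction term (Eq. (3))."""
    fake = G(z_n, prev_up)
    adv_loss = -D(fake).mean()   # generator tries to fool the critic D_n
    # Reconstruction: a fixed noise map z_rec must reproduce the real image x_n
    rec = G(z_rec, prev_rec)
    rec_loss = F.mse_loss(rec, x_n)
    return adv_loss + alpha * rec_loss
```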
3.2.3 Testing
To increase the resolution of the input image by a factor of $s$, SinGAN adopts a pyramid scale factor of $r=s^{1/k}$ for some integer $k$. At test time, the LR image is upsampled by $r$ and fed, together with noise, into the finest-scale generator $G_{0}$; this step is repeated $k$ times to reach the full factor $s$.
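In code, the test-time loop can be sketched as follows (assuming the finest generator `G0` takes a noise map and an image, as in the earlier sketch):

```python
import torch
import torch.nn.functional as F

def singan_sr(G0, lr_image, scale=4, k=3):
    """Upsample by r = scale**(1/k) and refine with G0, repeated k times."""
    r = scale ** (1.0 / k)
    x = lr_image
    for _ in range(k):
        h, w = x.shape[-2:]
        x = F.interpolate(x, size=(round(h * r), round(w * r)),
                          mode='bicubic', align_corners=False)
        noise = torch.randn_like(x)
        x = G0(noise, x)   # G0 adds high-frequency detail to the upsampled image
    return x
```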
3.3 Motivation
The original SinGAN described above can be applied to various image super-resolution problems, but we experimentally find that this basic framework fails to achieve satisfactory results for blind SR. We now analyze its limitations in three particular SR scenarios.
3.3.1 Natural Image Super-Resolution
Figure 3(a) shows SR results of SinGAN for natural images. First, consider the overall structure. Comparing the output to the original input image, we observe that the structure of the fence is severely distorted, and the arrangement and outlines of the bricks have significantly changed. Such wrong structures and textures are unacceptable in image SR. Newly-introduced textures in smooth regions and around edges are unrealistic. Furthermore, we can see distorted textures: e.g., the generated cloud has a similar texture to the reference patch in the input image. Such distorted details are artifacts that reduce image quality. Similar issues can also be seen later, in Fig. 11.
3.3.2 Noisy Image Super-Resolution
We next consider the robustness of SinGAN when presented with noisy images. As can be seen in Fig. 3(b), when the LR image contains slight noise (Gaussian noise with noise level 10), SinGAN amplifies this noise, generating an unnatural, low-quality image. This indicates that the basic SinGAN cannot distinguish noise from natural image content, which largely restricts its application to real-world images.
4× SR results generated by SinGAN, and the reference (Ref) patch cropped from the input. The incorrectly generated patches are similar to the reference patch.
3.3.3 Face Image Super-Resolution
We further evaluate the application of SinGAN to face image SR in Fig. 3(c). The generated result contains severe artifacts in face attributes such as eyes, nose, and mouth. When similar artifacts appear in the hair or on a shirt, they may be perceived as textures, but when they appear on the face, they are recognized as undesirable artifacts. We further see that generated artifacts in the eye are similar to textures of the reference patch.
3.3.4 Analysis
The above observations show that SinGAN is incapable of generating satisfactory results for SR tasks, and that the distorted textures have similar patterns to the reference patch. Why does SinGAN suffer from these problems? Beyond the intrinsic limitations of GANs, we find that the key reason is that SinGAN lacks global contextual information. This problem is seldom seen in conventional SR methods, but is severe in SinGAN. As a generative model, the aim of SinGAN is to capture patch distributions and generate similar image content; if the global context were predetermined, the output image would have limited variation. To address the problem, we introduce a global contextual prior.
A further observation is that different regions require different amounts of texture or detail: smoother areas like the sky need less texture, while sharper areas like hair should have more detail. Our experiments show that the amount of generated texture is related to the level of the injected input noise: larger noise yields richer texture but more artifacts, while smaller noise yields smoother results (Fig. 4).
SR results for different levels of injected noise: (a) large noise; (b) small noise.
In the SR problem, content fidelity is the basic requirement: the output image should have the same content as the input image. Thus we need to modify the SinGAN structure to make it suitable for the SR task. The proposed global contextual prior and local contextual prior not only help preserve structures, but also improve the quality of generated detail.
3.4 Global Contextual Prior
The global contextual prior is added as an extra layer outside the main branch of each GAN model, as shown in Fig. 2. For the $n$th GAN, the LR image downsampled to the working resolution of that scale is fed into this extra layer, so that every scale sees the global layout of the target image throughout training.
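A minimal sketch of how such a prior can be attached at one scale (our interpretation; the exact fusion layer is an assumption, shown here as a channel-wise concatenation that the generator's first convolution must be widened to accept):

```python
import torch
import torch.nn.functional as F

def input_with_global_prior(noise, prev_up, lr_image):
    """Build the generator input at one scale: the LR image is resized to the
    current working resolution and appended as extra channels, so every scale
    sees the global layout of the target image."""
    h, w = prev_up.shape[-2:]
    lr_ctx = F.interpolate(lr_image, size=(h, w), mode='bicubic',
                           align_corners=False)
    return torch.cat([noise + prev_up, lr_ctx], dim=1)
```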
An example is shown in Fig. 5(b). With the global contextual prior, the reconstructed outline of the improved SinGAN remains stable, demonstrating that the prior constrains variation in the output. Although the global contextual prior reduces the diversity of generated images, it helps to preserve the global structure for image restoration, as shown in Fig. 5. Thus, global context information plays different roles in different tasks. The roof reconstructed using the global contextual prior is more accurate and realistic compared to the one generated by the original SinGAN, verifying that the global contextual prior can provide semantic information.
Output of original SinGAN and refined SinGAN with global contextual prior. (a) Random sample of SinGAN. (b) Random sample of SinGAN with global contextual prior. (c) Edited image. (d) Image reconstructed by SinGAN. (e) Image generated by SinGAN with global contextual prior.
3.5 Local Contextual Prior
As shown in Fig. 4, the level of injected noise is related to the amount of generated texture. To introduce the local contextual prior, we first revisit the training procedure for image SR. During training, the model injects a noise map $z$ at each scale; we redefine this noise by modulating it with a gradient map of the LR image:
\begin{equation*}\hat{z}=z\,LR_{\mathrm{G}}\tag{4}\end{equation*}
where the product is taken pixel-wise and $LR_{\mathrm{G}}$ is the gradient magnitude map of the LR image, computed with central differences:
\begin{equation*}\begin{cases}LR_{x}=LR(x+1,y)-LR(x-1,y)\\ LR_{y}=LR(x,y+1)-LR(x,y-1)\\ \nabla LR_{\mathrm{G}}=(LR_{x},LR_{y})\\ LR_{\mathrm{G}}=\Vert \nabla LR_{\mathrm{G}}\Vert_{2}\end{cases}\tag{5}\end{equation*}
The gradient map is calculated following Ref. [35], and the amount of generated detail is governed by the gradient values. As shown in Fig. 4, this simple operation significantly reduces artifacts, especially in smooth regions. We call it a contextual prior because it uses contextual information from the input LR image.
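Equations (4) and (5) translate directly into a few lines (a sketch; it assumes an NCHW tensor and adds a normalization of the gradient map to [0, 1], which is our assumption):

```python
import torch
import torch.nn.functional as F

def gradient_map(lr):
    """LR_G: per-pixel gradient magnitude via central differences (Eq. (5))."""
    p = F.pad(lr, (1, 1, 1, 1), mode='replicate')
    gx = p[..., 1:-1, 2:] - p[..., 1:-1, :-2]   # LR(x+1, y) - LR(x-1, y)
    gy = p[..., 2:, 1:-1] - p[..., :-2, 1:-1]   # LR(x, y+1) - LR(x, y-1)
    return torch.sqrt(gx ** 2 + gy ** 2)        # || (LR_x, LR_y) ||_2

def refined_noise(z, lr):
    """Eq. (4): modulate the injected noise by the gradient map, so smooth
    regions receive little noise (less spurious texture) and edges more."""
    g = gradient_map(lr)
    g = g / (g.max() + 1e-8)
    return z * g
```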
(a) Input image; (b) local entropy; (c) variance; (d) gradient; (e) noise; (f-h) re-defined input noise.
3.6 Initial Contextual Prior
3.6.1 Real Initial Input
In the original SinGAN, the input to the initial stage $G_{N}$ is a pure Gaussian noise map. We instead use a downsampled version of the real LR image as the initial input: this anchors the coarsest scale to real image content, so the subsequent finer scales reconstruct realistic information about the target rather than hallucinating it from noise.
3.6.2 Training Size of Initial Input
We thus now need to determine the training input size for the downscaled LR image at scale $N$:
\begin{equation*}x_{N}^{\text{size}}=\begin{cases}xr^{N}, & xr^{N} < a\\ a, & xr^{N}\geqslant a\\ 12, & \min(xr^{N},a) < 12\end{cases}\tag{6}\end{equation*}
where $x$ is the size of the LR image, $r$ is the pyramid scale factor, $a$ is an upper bound on the initial training size, and a minimum of 12 pixels is enforced.
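Equation (6) can be read as the following rule (a sketch; the rounding and the interpretation of `x` as the LR image size are assumptions):

```python
def initial_input_size(x, r, n, a):
    """Training size of the coarsest-scale input (Eq. (6)): shrink the LR size
    x by r**n, cap it at a, and never let it fall below 12 pixels."""
    size = min(x * r ** n, a)
    return max(round(size), 12)
```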
(a) Results using noise in the initial stage; (b) results using the downsampled LR image as the initial input.
Results of SinGAN using different training sizes at the initial scale $N$.
We name the new framework using the above contextual priors SR-SinGAN. These priors can also be applied individually, according to requirements.
For image super-resolution, we use the last (finest-scale) generator $G_{0}$ at test time, repeatedly feeding it the upsampled image together with the refined noise, as described in Section 3.2.3, until the target resolution is reached.
In summary, we make two changes in the initial training stage: (i) using the LR image instead of Gaussian noise, and (ii) using the defined training input size.
Experiments
In this section, we first conduct an ablation study to validate the effectiveness of the proposed contextual priors. We then present SR-SinGAN results for three representative blind image SR tasks: natural image SR, face image SR, and noisy image SR.
4.1 Datasets and Evaluation Metrics
To evaluate the SR performance of our proposed method, we utilize four commonly used synthetic benchmarks for testing: Set5, BSD100, Helen [36], CelebA [37]. We also select two real SR datasets for testing: the RealSR dataset [21] and the NTIRE 2020 Real-World Image Super-Resolution validation dataset [38]. The RealSR dataset was used by BSRGAN [21], and contains 20 real images either downloaded from the Internet or directly chosen from existing testing datasets. The NTIRE 2020 Real-World Image Super-Resolution validation dataset contains 100 real images with complex degradation types.
For metrics, we chose learned perceptual image patch similarity (LPIPS) [39], PSNR, and structural similarity (SSIM) [40] for synthetic test data, and the perceptual index (PI) [41], NIQE [42], and NRQM [43] for real test data. Lower PI, NIQE, and LPIPS values indicate higher perceptual quality.
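For reference, the full-reference metrics can be computed with standard packages (a sketch using the `lpips` and `scikit-image` libraries):

```python
import lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # learned perceptual metric, Ref. [39]

def full_reference_metrics(sr, hr):
    """sr, hr: uint8 RGB numpy arrays of identical shape."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
    # lpips expects NCHW float tensors scaled to [-1, 1]
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1
    lp = lpips_fn(to_t(sr), to_t(hr)).item()
    return psnr, ssim, lp
```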
4.2 Comparator Methods
We compared the proposed SR-SinGAN to various single image SR methods: DIP [10], ZSSR [9], SinGAN [1], and FKP [29]; these exploit a single image and do not rely on prior training. FKP is a normalizing flow-based kernel prior for kernel modeling, and can easily replace the kernel modeling modules of Double-DIP and KernelGAN. We also consider state-of-the-art perception-driven SR methods, including BSRGAN [21] and Real-ESRGAN [22], trained on synthetic data with supervised learning. Our experiments use the pretrained models provided by BSRGAN and Real-ESRGAN for testing.
4.3 Training Details
Following SinGAN, we use the same training settings. Each scale is trained for 2000 iterations. We use the Adam optimizer [44]; the learning rate is set to 0.0005 and decreased by a factor of 0.1 after 1600 iterations. We set the momentum parameters to $\beta_{1}=0.5$ and $\beta_{2}=0.999$.
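In PyTorch terms, these settings correspond to the following (a sketch; the β values follow SinGAN's public configuration and are stated here as an assumption):

```python
import torch

def make_optimizer(generator):
    opt = torch.optim.Adam(generator.parameters(), lr=5e-4,
                           betas=(0.5, 0.999))  # momentum parameters
    # decay the learning rate by 0.1x after 1600 of the 2000 iterations per scale
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[1600], gamma=0.1)
    return opt, sched
```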
4.4 Ablation Study
In order to study the effect of each component, we gradually modified the baseline SinGAN model and compared the results. As the initial input prior plays an important role during training, we first applied an appropriate initial contextual prior. Then, we introduced the global contextual prior and the local contextual prior sequentially.
4.4.1 Initial Contextual Prior
As Fig. 8 shows, SinGAN cannot generate satisfactory details in the initial stage (the $N$th stage; see also Fig. 7). Furthermore, the model cannot reconstruct acceptable textures and details when using an inappropriate initial training size. Therefore, it is not feasible to use noise as the input to the $N$th stage.
During the experiment, the initial size of the downsampled input image to the $N$th stage was set according to Eq. (6).
4.4.2 Global Contextual Prior
Figure 9(2) shows that the SR image contains incorrect information for the baby's eyes, and the wrong texture is similar to that of the hat. To handle this problem, we introduce global context to provide global information during the training process.
To train the contextual module, we enlarge the feature size from
Overall qualitative and quantitative comparisons for each component of SR-SinGAN. Each column represents a model with configurations given above. Methods from left to right gradually increase the degree of modification over the baseline SinGAN model.
During the training process,
4.4.3 Local Contextual Prior
After introducing the local contextual prior, we can observe that the PSNR of reconstruction is improved. The generated SR image contains fewer artifacts around the eyes and more natural textures on the hat: see Fig. 9(4). This demonstrates that the local contextual prior can not only increase reconstruction accuracy, but also improve visual quality.
From the above analysis, we can conclude that our careful design benefits blind image reconstruction for the single image generation framework.
4.5 Contextual Prior Evaluation Using LAM
To show the effectiveness of the contextual priors intuitively, we use an attribution method, the local attribution map (LAM) [45]. LAM aims to find which input pixels influence the network output, revealing how the architecture uses information to produce the final output. LAM interprets SR networks locally, for specific locations and their surroundings. The input to the LAM method is the LR input, and LAM highlights the pixels which have the greatest impact on the SR results.
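As a rough intuition for what LAM computes, a heavily simplified stand-in (not the path-integrated-gradient method of Ref. [45]; `sr_model` is assumed to wrap the whole LR-to-SR pipeline) attributes a local output patch back to input pixels via plain gradients:

```python
import torch

def local_attribution(sr_model, lr, y0, x0, h, w):
    """Gradient of a local SR output patch w.r.t. the LR input: a crude proxy
    for LAM that highlights which input pixels influence that patch."""
    lr = lr.detach().clone().requires_grad_(True)
    sr = sr_model(lr)
    sr[..., y0:y0 + h, x0:x0 + w].sum().backward()
    return lr.grad.abs().sum(dim=1)  # per-pixel attribution magnitude
```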
Figure 10 shows LAM and SR results for SinGAN with different prior modules. In Figs. 10(b) and 10(c), the output of SinGAN utilizes more input pixels, and the influencing input pixels are much more scattered. Compared to the original SinGAN, the introduced global contextual prior increases the weight of the target patch, which helps to remove distracting information and preserve the fidelity of the input. With the help of the global contextual prior, the generated image has better fidelity: e.g., the cropped patches contain fewer artifacts.
Local attribution map (LAM) and SR results. (a) LR input. (b) SinGAN results. (c) LAM and SR image for SinGAN with global contextual prior. (d) LAM and SR image for SinGAN with the local contextual prior.
Figure 10(d) shows the attribution map and SR result of SinGAN with the local contextual prior. The modified network pays more attention to information near the target pixels compared to the original SinGAN framework. Compared to the baseline, the smooth area (the penguin's abdomen) is less textured, with fewer artifacts, while the rocks contain much more detail. This demonstrates that the introduced local contextual prior provides more relevant local input information to the network.
4.6 Applications
The above section studied the effect of each component of SR-SinGAN. We next apply SR-SinGAN in three SR scenarios: natural images, face images, and noisy images.
4.6.1 Natural Image Super-Resolution
We tested SR-SinGAN on natural images, bicubically downsampled by a factor of 4. We compared SR-SinGAN to several state-of-the-art blind single image methods: ZSSR [9], DIP [10], DIP-FKP [29], and SinGAN [1]. ZSSR, DIP, and DIP-FKP are three representative MSE-based methods, while SinGAN is based on image generation. Table 2 shows PSNR results on the validation datasets Set5 and BSD100. Firstly, SR-SinGAN surpasses DIP by a large margin, and outperforms SinGAN by about 1.5 dB on Set5 and 0.5 dB on BSD100. Secondly, although ZSSR and DIP-FKP achieve better PSNR figures, SR-SinGAN produces better perceptual quality, as shown in Fig. 11. Table 3 gives quantitative results for the RealSR dataset, showing that the proposed SR-SinGAN achieves comparable results to FKP, BSRGAN, and Real-ESRGAN. We also compared the SR results of the proposed SR-SinGAN with supervised BSRGAN [21] and Real-ESRGAN, which are trained on large amounts of synthetic paired data. As BSRGAN takes denoising into consideration, its SR output tends to be smooth, and it achieves the best PSNR, SSIM, and LPIPS scores.
4× image super-resolution for bicubic downsampled images. Compared to state-of-the-art blind single super-resolution methods (ZSSR [9], DIP [10], and DIP-FKP [29]), SinGAN and SR-SinGAN generate sharper textures and outlines. Compared to supervised BSRGAN and Real-ESRGAN, SR-SinGAN contains unpleasant artifacts at edges.
Figure 11 provides a visual comparison of the methods. Compared to the other self-image based methods, SR-SinGAN recovers accurate information while generating sharper textures and details; ZSSR, DIP, and DIP-FKP fail to generate sharp edges and structures. Compared to SinGAN [1], SR-SinGAN generates fewer artifacts and reconstructs images with better visual quality. Compared to supervised BSRGAN and Real-ESRGAN, SR-SinGAN produces unpleasant artifacts at edges on synthetic data. This demonstrates that an external dataset can provide effective information for the trained model.
Figure 12 shows super-resolution results for real-world data. Since both the ground-truth clean image and the degradation are unknown for real LR images, quantitative evaluation is difficult, so we present visual results. Compared to ZSSR, DIP, and DIP-FKP, our method reconstructs sharper and cleaner details and textures. The generated results show that SR-SinGAN can recover clearer information from the degraded LR input, and also demonstrate that SR-SinGAN better handles blurred data. Comparing with BSRGAN and Real-ESRGAN, we observe that the supervised methods handle artificial structures very well, but cannot reconstruct the faces or flowers. This demonstrates that the input information cannot guide or constrain the supervised model very well.
4.6.2 Face Image Super-Resolution
To further evaluate the robustness of the proposed method, we also conducted experiments on face image super-resolution. We trained our model on RGB channels without external datasets, using 100 face images from Helen and 100 face images from CelebA as validation datasets. In our experiments, the size of the 100 CelebA face images (smaller than
4× image super-resolution for real-world data. The supervised methods BSRGAN and Real-ESRGAN handle artificial structures very well, but fail for faces and flowers.
Figure 13 presents examples of face image super-resolution at different resolutions. Compared to the existing single image based methods ZSSR [9], DIP [10], and DIP-FKP [29], which reconstruct smoother images, SR-SinGAN generates more information, including details and textures. Compared to SinGAN [1], SR-SinGAN generates faithful details in images of different resolutions. Compared to BSRGAN and Real-ESRGAN, the proposed SR-SinGAN achieves comparable quantitative results on CelebA, as shown in Table 2. Figure 13 (bottom) shows SR results for CelebA face images. The images generated by SinGAN lack rich details and important structural outlines, and contain severe fake face attributes, resulting in poor visual quality. The results from SR-SinGAN contain fewer artifacts. Note that a few such artifacts, when perceived as texture, can actually improve visual quality.
4.6.3 Noisy Image Super-Resolution
Real noisy images were used to further assess the practicability of SR-SinGAN. For noisy image super-resolution, we conducted experiments on the real noisy dataset RNI15 [46], which contains 15 real noisy images. In the experiments, we downsampled the data by a scale of 4 and used the downsampled versions as LR input.
In the noisy image super-resolution task, the LR images contain noise, which is amplified during the generation process. To alleviate this problem, the injected noise is refined as in Eq. (4), using the gradient map to suppress noise injection in smooth regions.
4× image super-resolution for face image data. SR-SinGAN avoids artifacts on the face while preserving important structures in the target input. The refined SR-SinGAN can flexibly handle face images of different sizes.
Figure 14 shows super-resolution results for real noisy images. Compared to the MSE-based methods ZSSR [9], DIP [10], and DIP-FKP [29], the image generation based methods SinGAN [1] and SR-SinGAN generate sharper edges and details. The SR images of SinGAN contain much more noise and many more artifacts, due to the noise in the LR input. Figure 14 shows that SR-SinGAN eliminates the annoying noise while generating sharp outlines and textures in densely textured regions: SR-SinGAN better handles noisy images and provides noise-free super-resolution results. As BSRGAN takes denoising into consideration, its SR results have the cleanest appearance.
4.7 Limitations
The experiments show that the reconstruction capacity of the learned model is limited for real images. To further analyse the scope of single image generation based methods, we selected two types of textures with different patterns, as shown in Fig. 15. The trained model reconstructs better details and structures for input with repeating patterns. This phenomenon can also be observed in ZSSR [9] and ZSSR+KernelGAN [30], because multi-scale learning architectures take advantage of information at different scales; the textures of many real-world images do not contain similar content at different scales, so this group of learned models achieves only limited results on them.
The above description and analysis lead us to conclude that single image generation based methods are more appropriate for images with repeating textures compared to those with general textures.
4.8 Computation Cost
We conducted our experiments using the same configurations as SinGAN [1]. During training, we only add one image-based global context module, which has little additional computation cost; the other strategies do not introduce any additional cost. As Table 4 shows, SR-SinGAN incurs only a little additional computation cost during testing.
Conclusions
We have tackled the problem of blind image restoration using unsupervised learning, where the only available data is provided by the low-resolution image itself. To improve SR results, we introduced SR-SinGAN for blind single image super-resolution, which includes an initial contextual prior during the initial training stage, a training contextual prior during the intermediate training stages, and a testing contextual prior at test time. We have conducted extensive experiments on natural images, face images, and noisy images; the results demonstrate that the proposed SR-SinGAN achieves stable and superior performance compared to state-of-the-art blind single image super-resolution methods.
Limitations of SR-SinGAN. Compared with its results for repeated textures (a), the SR results of SR-SinGAN cannot achieve satisfactory visual quality for general images without repeated textures (b).
Declaration of Competing Interest
The authors have no competing interests to declare that are relevant to the content of this article.