Introduction
Single image super-resolution (SISR) [1], [2], [3], which aims to generate a high-quality image from a low-quality counterpart, is a widely studied low-level vision task. Since the introduction of Convolutional Neural Networks (CNNs) [4], [5], [6], numerous deep learning approaches [7], [8], [9] with distinct network architectures and training strategies have been proposed to improve SISR performance. However, most advanced SR approaches [10], [11] presume a pre-defined degradation kernel, whereas the true kernel in practical scenarios is often complicated and inaccessible. Consequently, the output images may contain unwanted artifacts. This issue, referred to as “kernel mismatch,” hampers the applicability of deep learning-based SR methods in real-life situations. Blind super-resolution, which addresses the case of unknown blur kernels, remains a challenge for many deep learning-based techniques [12], [13].
Deep learning-based methods [1] have achieved good results in SISR. Some non-blind super-resolution (SR) techniques handle multiple degradations by using the corresponding known kernels. SRMD [10] is the first method to concatenate a low-resolution (LR) image with a stretched blur kernel to generate super-resolution images under varied degradations. Zhang et al. [14], [15] further enhance this approach by including deblurring algorithms, which extend the degradation to arbitrary blur kernels. Hussein et al. [16] propose a correction filter that transforms blurry LR images to match the degradation assumed by SR models designed for bicubic downsampling. Additionally, zero-shot methods [11], [17] are exploited for non-blind SR to investigate the multiple degradation problem. However, these methods [8], [9], [18] all assume fixed, predefined degradation settings. Because the assumed kernels differ from the degradations found in real scenarios, their performance drops considerably when they are applied to real images.
At present, blind SR methods [19], [20], [21], [22] are mainly divided into explicit modeling and implicit modeling. Unlike explicit modeling, implicit modeling does not rely on explicit degradation parameters; instead, it uses additional data to learn the underlying super-resolution model implicitly from the data distribution. Existing methods [23], [24], [25] often use the GAN [26] framework to explore the data distribution; representative methods include CinCGAN [27] and FSSR [28].
The basic idea of explicit modeling [26], [29], [30] is to train a model with extra data covering a wide range of degradations, which often requires parameterizing the blur kernel and the noise. Most available blind SR methods [19], [20], [21], [22] estimate the blur kernel, which involves complex optimization procedures. Typically, these approaches use a two-stage framework: kernel estimation, followed by super-resolution image reconstruction using the estimated kernel. The method in [21] performs blind super-resolution through unsupervised degradation representation learning. It assumes that different images undergo different degradations and therefore uses contrastive learning to learn a degradation representation for each image. The work in [25] introduces a real-world super-resolution method based on kernel estimation and noise injection; it proposes a degradation framework that estimates various blur kernels together with real noise distributions. Some methods [22], [26], [29] utilize generative adversarial networks to generate high-quality super-resolution images. The approach in [26] trains an internal generative adversarial network (GAN) on a single image to estimate the degradation kernel, which is then passed to a non-blind SR algorithm [31] to obtain the super-resolution image. IKC [19] designs a module that exploits spatial features while correcting the blur kernel iteratively. To obtain better kernel estimates and super-resolution results, DAN [20] adopts an end-to-end network architecture that reconstructs super-resolution features and estimates kernels iteratively. Building on DAN [20], DANv2 [32] designs a parallel network to exchange information and extract features, which better supervises the optimization of the network.
These methods rely on the self-similarity of natural images to predict the underlying blur kernel. However, their results are easily affected by noise, which leads to inaccurately estimated kernels. While some deep learning-based methods, such as CAB [33], SRMD [10], IKC [19], and DAN [20], have made progress in blind SR, they require the blur kernel as additional input, and their results vary with the predefined kernel. Although they perform well when the input kernel is close to the ground truth, they are not suitable for real-world applications because they cannot predict an accurate kernel for each image. Despite the dominance of deep learning in SISR, blind SR remains a challenge, with limited progress made so far.
These blind SR techniques typically involve a two-step process: estimating the kernel from the low-resolution (LR) image and reconstructing the high-resolution (HR) image using that kernel. While this approach has yielded satisfactory results, it presents two significant challenges. First, accurately estimating blur kernels from LR images is difficult because of the ambiguity introduced by the downsampling step, and mismatched kernels can lead to a significant decline in performance and to unsightly artifacts. Second, fully utilizing the information in the estimated kernel and the LR image is also difficult.
To address these limitations, we develop a unified framework that prioritizes effective feature representations. The proposed framework, a Progressive Decoupled Network for Blind Super-Resolution (PDNet), is designed to estimate super-resolution images while progressively eliminating the degradation blur kernel information. PDNet takes a fresh approach to blind super-resolution by separating the mutual relationships between the super-resolution branch and the degradation kernel branch. The network first extracts super-resolution features and degradation kernel information using a feature extraction block, and then employs a decoupled representation module (DRM) to encode the mutual relationships between the two combined features. The DRM is equipped with a dual-branch cross-attention mechanism that enables each branch to adaptively learn complementary and redundant components from the other. Notably, the proposed method does not explicitly estimate degradation kernels. Instead, effective super-resolution features and invalid degradation information are gradually separated from the degraded features. Although the proposed algorithm does not predict the degradation kernel, the degradation information [19], [20] undeniably has a positive effect on image reconstruction. Therefore, a Feature Re-degradation Module (FDM) is constructed to combine the decoupled super-resolution features and degradation information and regenerate the input low-resolution image, which in turn promotes the optimization of the network.
In summary, our major contributions are highlighted as follows:
We solve the blind super-resolution problem from a new perspective. The proposed scheme no longer predicts the degradation kernel but decouples the super-resolution information from the degradation information.
The network investigates the imaging model between the super-resolution content and the unknown blur kernel layer in an image and proposes a DRM to decouple the combined features as well as their mutual relations, which improves the accuracy of the representation.
Several DRMs are cascaded to form a novel progressive decoupled network that progressively separates the super-resolution feature and degradation content along the network, significantly alleviating the learning difficulty.
Comprehensive experiments verify that the proposed decoupled representation mechanism achieves promising performance in detail recovery for blind super-resolution.
Method
This section formally introduces the proposed method. Given a reformulation of the degradation, it consists of three main components: a shallow feature extraction module that extracts coarse information from the LR images, a parallel-path network that decouples the HR features, and a feature re-degradation module that optimizes the network in reverse. The flowchart is shown in Fig. 1.
Fig. 1. The framework of our proposed progressive decoupled network.
A. Problem Formulation
The relationship between the high-resolution image $I_{HR}$ and the low-resolution image $I_{LR}$ is commonly modeled as:\begin{equation*} I_{LR} = (k \otimes I_{HR}){\downarrow }_{s} + n \tag{1}\end{equation*}
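Eq. (1) involves three primary components: the blur kernel $k$, the downsampling operation ${\downarrow }_{s}$ with scale factor $s$, and the additive noise $n$. Conventional two-stage blind SR methods first estimate a kernel $k_{e}$ from the LR image with an estimator $F_{E}$, supervise it with the ground-truth kernel $k_{t}$, and then reconstruct the HR image with an SR network $f$ conditioned on $k_{e}$:\begin{align*} k_{e} &= F_{E}(I_{LR}) \tag{2}\\ {\mathcal {L}}_{k} &= argmin||k_{t} - k_{e}||_{1} \tag{3}\\ {\mathcal {L}}_{sr}& = argmin||I_{HR} - f(I_{LR}; k_{e})||_{1} \tag{4}\end{align*}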
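The degradation model in Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' data pipeline: the isotropic Gaussian kernel, its size and width, direct subsampling for the downsampling operator, Gaussian noise, and the helper names `gaussian_kernel`, `blur`, and `degrade` are all assumptions made here for illustration.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Isotropic Gaussian blur kernel k, normalized to sum to 1 (illustrative)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur(img: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Filter a 2-D image with an odd-sized kernel using reflect padding.

    This is a correlation; for a symmetric kernel such as a Gaussian it
    coincides with convolution.
    """
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="reflect")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * padded[i:i + h, j:j + w]
    return out


def degrade(hr, k, s, noise_std=0.0, seed=0):
    """Eq. (1): blur, subsample by s, then add noise n.

    Assumptions: the downsampling is direct subsampling and n is Gaussian.
    """
    rng = np.random.default_rng(seed)
    lr = blur(hr, k)[::s, ::s]
    return lr + noise_std * rng.standard_normal(lr.shape)


# Toy usage on a random single-channel "HR" image.
hr = np.random.default_rng(1).random((32, 32))
lr = degrade(hr, gaussian_kernel(5, 1.2), s=4, noise_std=0.01)
print(lr.shape)  # (8, 8)
```

A non-blind method is given `k` at test time, whereas a blind method only observes `lr`, which is why a wrong kernel assumption leads to the kernel mismatch discussed in the Introduction.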
In this paper, we propose a different formulation, which divides the degraded LR image into super-resolution content and irrelevant degradation information. The degradation is then modeled as follows:\begin{equation*} I_{LR} = I_{HR}{\downarrow } + I_{d} \tag{5}\end{equation*}
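As a hedged aside, Eq. (5) can be related to Eq. (1). Assuming that the downsampling operator in Eq. (5) is the same linear operator ${\downarrow }_{s}$ as in Eq. (1) (an assumption made here, not a statement of the method), equating the two expressions for $I_{LR}$ gives\begin{equation*} I_{d} = (k \otimes I_{HR}){\downarrow }_{s} - I_{HR}{\downarrow }_{s} + n = ((k - \delta ) \otimes I_{HR}){\downarrow }_{s} + n \end{equation*}where $\delta$ denotes the unit-impulse (identity) kernel, a symbol introduced only for this illustration. Under this reading, $I_{d}$ absorbs both the blur residual and the noise, and it vanishes when $k = \delta$ and $n = 0$.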
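Accordingly, the network is optimized with a reconstruction objective and a re-degradation objective, where $f$ denotes the SR network and $FDM$ the feature re-degradation module:\begin{align*} \mathcal {L}_{1} &=argmin||I_{HR} - f(I_{LR})||_{1} \tag{6}\\ \mathcal {L}_{2} &=argmin||I_{LR} - FDM(f(I_{LR}); I_{d})||_{1} \tag{7}\end{align*}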
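The two objectives in Eqs. (6) and (7) can be sketched as follows. This is only an illustration under stated assumptions: the network outputs are passed in as plain arrays, `toy_fdm` is a hypothetical stand-in for the FDM that simply follows the form of Eq. (5) (subsample, then add the degradation term), and the L1 norm is averaged over pixels.

```python
import numpy as np


def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error, i.e. an L1 norm averaged over pixels."""
    return float(np.mean(np.abs(a - b)))


def toy_fdm(i_sr: np.ndarray, i_d: np.ndarray, s: int) -> np.ndarray:
    """Hypothetical stand-in for the FDM, following Eq. (5):
    subsample the SR image by s and add the degradation term back."""
    return i_sr[::s, ::s] + i_d


def dual_losses(i_lr, i_hr, i_sr, i_d, s):
    """Return (reconstruction loss, re-degradation loss) as in Eqs. (6)-(7).

    i_sr and i_d play the role of the network's predicted SR image and
    decoupled degradation information.
    """
    loss_sr = l1(i_hr, i_sr)                  # Eq. (6)
    loss_re = l1(i_lr, toy_fdm(i_sr, i_d, s))  # Eq. (7)
    return loss_sr, loss_re


# Toy data built to satisfy Eq. (5) exactly.
rng = np.random.default_rng(0)
i_hr = rng.random((8, 8))
i_d_true = 0.05 * rng.standard_normal((4, 4))
i_lr = i_hr[::2, ::2] + i_d_true

perfect = dual_losses(i_lr, i_hr, i_hr, i_d_true, s=2)
biased = dual_losses(i_lr, i_hr, i_hr, i_d_true + 0.1, s=2)
```

In this toy setting, a perfect SR prediction with a biased degradation estimate leaves the first loss at zero while the second loss becomes nonzero, which hints at how a re-degradation term can constrain the degradation branch without an explicit kernel loss.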
The proposed idea is to learn the combined features of the super-resolution and degradation information and to use their mutual relationships to distill the valuable features through decoupled representation. More specifically, we construct the architecture from several decoupled representation modules.
B. Overall Framework
The architecture is illustrated in Fig. 1, and it consists of three main components: a shallow feature extraction module that extracts coarse information for the decoupled components, cascaded residual groups made up of decoupled representation modules, and a feature re-degradation module.
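Given a degraded LR image $I_{LR}$, the shallow feature extraction module $F_{ini}$ produces the initial super-resolution features $F_{0,r}$ and degradation features $F_{0,d}$:\begin{equation*} F_{0,r}, F_{0,d} = F_{ini}(I_{LR}) \tag{8}\end{equation*}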
Next, we aim to optimize the relationship between the two components by implementing mutual decoupled representation modules. These modules facilitate various forms of structure information interaction and generate more comprehensive representations by relying on each other. The network obtains a progressive decoupled representation for both components through a series of cascaded decoupled representation modules and long-range connections that aid information propagation. Assuming there are N DRMs, for the n-th DRM, the dual-component learning process can be expressed as follows:\begin{equation*} F_{n,r}, F_{n,d} = D_{n}(F_{n-1,r}, F_{n-1,d}) = D_{n}({\dots }(D_{1}(F_{0,r}, F_{0,d}))) \tag{9}\end{equation*}
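The recursion in Eq. (9) is a plain composition of two-input, two-output modules, which can be sketched as below. The module used here, `make_toy_drm`, is a deliberately simple hypothetical stand-in (it moves a fixed fraction of the mean of one branch into the other) and is not the DRM of this paper; only the cascading pattern is the point.

```python
import numpy as np


def make_toy_drm(alpha: float):
    """Hypothetical stand-in for one DRM D_n.

    It shifts a fraction alpha of the mean of the r-branch into the
    d-branch, so that f_r + f_d stays constant across the cascade.
    """
    def drm(f_r: np.ndarray, f_d: np.ndarray):
        shift = alpha * f_r.mean()
        return f_r - shift, f_d + shift
    return drm


def cascade(drms, f_r, f_d):
    """Eq. (9): apply D_1, ..., D_N in order to the pair (F_r, F_d)."""
    for drm in drms:
        f_r, f_d = drm(f_r, f_d)
    return f_r, f_d


# Toy usage with N = 5 modules.
f0_r = np.array([[1.0, 3.0], [5.0, 7.0]])  # mean is 4.0
f0_d = np.zeros_like(f0_r)
drms = [make_toy_drm(0.5) for _ in range(5)]
fN_r, fN_d = cascade(drms, f0_r, f0_d)
```

Each toy module halves the mean left in the r-branch, so the unwanted component shrinks geometrically with depth. This mirrors, in a very simplified way, the idea that each stage only has to remove part of the degradation.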
Furthermore, the dual-branch network employs the feature re-degradation module to generate accurate predictions. Because the network only decouples the degradation information from the super-resolution content, it does not estimate the degradation kernel, and no separate loss constraint on degradation kernels is designed. However, to fully exploit the degradation information, an extra feature re-degradation module is designed. This module fuses the decoupled super-resolution features and degradation information, with the goal of regenerating the LR image fed to the network. This loop implicitly drives the optimization of the network.
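This involves two reconstruction paths that project the final features into the super-resolution image $I_{SR}$ and the degradation information $I_{d}$, which the re-degradation module $F_{re}$ fuses to regenerate the LR input:\begin{equation*} I_{LR}^{*} = F_{re}(I_{SR}, I_{d}) \tag{10}\end{equation*}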
C. Decoupled Representation Module
The proposed Decoupled Representation Module (DRM) jointly represents super-resolution information and degradation content through decoupled learning, as illustrated in Fig. 2. The DRM consists of two parts: a feature extraction component and a decoupled cross-attention module (DCAM). The feature extraction component generates coarse representations of the super-resolution and degradation information, which are fed into the DCAM. The DCAM encodes the self-relations and mutual relationships and obtains a refined, combined representation of the two contents.
The basic feature extraction part is constructed with several convolution layers to extract different features
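The DCAM takes the coarse features $g_{i,d}$ and $g_{i,r}$ and, with the attention maps $M_{d}$ and $M_{r}$, updates the two branches as:\begin{equation*} F_{i,d}, F_{i,r} = F_{i-1,d} + M_{d} \otimes g_{i, d}, F_{i-1,r} + M_{r} \otimes g_{i, r} \tag{11}\end{equation*}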
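The residual update in Eq. (11) can be sketched as follows. Two assumptions are made loudly here because the text does not fix them: the operator in Eq. (11) is read as element-wise multiplication, and each attention map is produced by a sigmoid over the opposite branch's coarse feature, which is only a placeholder for the actual cross-attention computation.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def drm_update(f_prev_d, f_prev_r, g_d, g_r):
    """Eq. (11) with element-wise gating (assumed reading of the operator).

    ASSUMPTION: M_d and M_r are sigmoid gates computed from the opposite
    branch. The real DCAM attention is not reproduced here.
    """
    m_d = sigmoid(g_r)  # gate for the degradation branch
    m_r = sigmoid(g_d)  # gate for the super-resolution branch
    f_d = f_prev_d + m_d * g_d
    f_r = f_prev_r + m_r * g_r
    return f_d, f_r


# Toy usage with 2x2 single-channel features.
f_prev_d = np.zeros((2, 2))
f_prev_r = np.ones((2, 2))
g_d = np.ones((2, 2))
g_r = np.zeros((2, 2))
f_d, f_r = drm_update(f_prev_d, f_prev_r, g_d, g_r)
```

Because the update is residual, a branch whose coarse feature is zero passes through unchanged, so each module only adds a gated correction to the features it receives.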
Experiments
A. Datasets and Implementation Details
Following previous works [19], [20], both DIV2K [34] and Flickr2K [35] are utilized, which together contain 3450 2K images. To improve I/O speed, we crop the images into sub-images. During training, isotropic and anisotropic Gaussian kernels are employed as degradation kernels to synthesize the corresponding LR images. For testing, the experiments use five widely recognized benchmarks: Set5 [36], Set14 [37], BSD100 [38], Urban100 [39], and Manga109 [40]. The RGB images are converted to the YCbCr color space, and only the Y channel is used to calculate the PSNR and SSIM metrics [41].
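A minimal sketch of Y-channel PSNR is given below. It assumes 8-bit RGB inputs in the range [0, 255] and the ITU-R BT.601 luma conversion that is common in SR evaluation code; whether borders are cropped before the metric is computed is not stated in the text, so no cropping is applied here. The function names are illustrative.

```python
import numpy as np


def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """BT.601 luma from an RGB image in [0, 255]; output lies in [16, 235]."""
    img = img.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR in dB between the Y channels of two RGB images (peak = 255)."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(255.0 ** 2 / mse))


# Toy usage: compare an all-white and an all-black 4x4 image.
white = np.full((4, 4, 3), 255, dtype=np.uint8)
black = np.zeros((4, 4, 3), dtype=np.uint8)
value = psnr_y(white, black)
```

Evaluating on luma only means that color errors are ignored, so PSNR values computed this way are not directly comparable with PSNR computed on all RGB channels.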
1) Isotropic Gaussian Kernels
To begin with, the experiments are performed on isotropic Gaussian kernels following the method [19]. Specifically, we maintain a fixed kernel size of
2) Anisotropic Gaussian Kernels
As part of our experiments, we test anisotropic Gaussian kernels using the method described in [26]. For scale factors 2 and 4, we set the kernel size to
3) Implementation Details
In each experiment, the model maintains a batch size of 64 and utilizes LR patch sizes of
B. Comparison With State-of-the-Arts
1) Evaluation With Isotropic Gaussian Kernels
Following [19], we assess the effectiveness of our approach on different datasets generated by Gaussian8 kernels. To evaluate its performance, we compare our method against a number of state-of-the-art blind SR techniques, including ZSSR [31] (utilizing the bicubic kernel), RCAN [43], DRSR [12], IKC [19], DANv1 [20], DANv2 [32], DSSR [22], MRDA [47], DASR [21], and AdaTarget [46]. Additionally, we follow the same evaluation protocol as [19] and compare the proposed method against CARN [44]. In most cases, we use the official implementations and pre-trained models of the respective methods.
Table 1 presents the quantitative results, which demonstrate the outstanding performance of our method across all benchmarks. Notably, the performance of CARN decreases dramatically on Gaussian8 because these kernels deviate from the predefined bicubic kernel. While ZSSR outperforms the non-blind SR methods, its image-specific network design restricts its ability to leverage abundant training data. Although AdaTarget can enhance image quality, its results are still below those of the other algorithms. IKC and DAN employ a two-stage strategy that yields significant improvements. However, because they feed inaccurately predicted degradation kernels directly into the network, they are less effective than the proposed algorithm.
The visual results are shown in Fig. 3. RCAN and ZSSR cannot reconstruct the detailed textures in the image, and their deblurring performance is poor. Although blind SR methods such as IKC, DANv1, DANv2, and DSSR generate better results, our algorithm achieves better performance in terms of texture, edges, and brightness.
2) Evaluation With Anisotropic Gaussian Kernels
Compared to isotropic Gaussian kernels, the degradation with anisotropic Gaussian kernels is more challenging. We compare our method with blind SR methods such as ZSSR [31], IKC [19], DASR [21], DANv1 [20], DANv2 [32], AdaTarget [46], and KOALAnet [24]. We also compare the proposed method with methods designed for bicubic degradation, such as EDSR [7], RCAN [43], and DBPN [48].
Table 2 reports the quantitative results on DIV2KRK, where the proposed method shows a significant improvement over other blind SR approaches. It is worth noting that ZSSR performs better when combined with KernelGAN, indicating the significant role of kernel estimation. SOTA blind SR methods such as IKC, DAN, and DASR achieve remarkable accuracy in PSNR and SSIM. AdaTarget performs comparably with SOTA blind methods by applying an adaptive target to finetune the network. However, the proposed method outperforms all these methods. The visual results on DIV2KRK are provided in Fig. 5 and Fig. 6, demonstrating that the results obtained by the proposed method are much cleaner and lighter and contain more textures.
C. Ablation Experiments
This paper proposes a progressive architecture to decouple the super-resolution content from the degradation feature without estimating the unknown blur kernels. To verify the effectiveness of the network, we introduce an extra loss function that explicitly constrains the learning of the degradation kernel (DL). In addition, we discuss the influence of the feature re-degradation module on the network.
1) Effectiveness of Different Parts
The results of the ablation experiments are reported in Table 3. When DL is used, the performance decreases compared with our algorithm. This indicates that explicit constraints on degradation kernels increase the difficulty of network learning and thereby reduce the effectiveness of super-resolution. Likewise, when FDM is replaced, the performance of the network drops considerably. This shows two points. First, in blind super-resolution, the degradation content is not insignificant: it affects the super-resolution performance, so merely decoupling the degradation content is not optimal. Second, the designed FDM effectively drives the optimization of the proposed network by exploiting the degradation feature and the super-resolution feature in the reverse direction, so that the network achieves better performance.
2) Effectiveness of Different Numbers of DRM
In addition, experiments are conducted to explore the influence of the number of modules on the algorithm. The proposed algorithm serves as the baseline; it groups several DRMs into a block, and information is exchanged between different blocks. We denote the number of DRMs as N and the number of blocks as M, and build different networks with different values of N and M. The model employed in the paper is the baseline, where N and M are set to 5 and 8, respectively, denoted
Table 4 shows that the proposed method achieves the best super-resolution performance when N and M are set to 5 and 8, respectively. Besides, the evaluation performance declined with
3) Inference Efficiency
To demonstrate the efficiency of the proposed method, we compare the parameters, inference times, and PSNR performance with other blind methods on Set5 dataset with Gaussian8 kernels for
4) Real-World Degradations
In addition to conducting experiments on a synthetic data set that utilized both isotropic and anisotropic Gaussian kernels, we also compare the proposed method with other existing methods using real-world images. The visual results of this comparison can be seen in Fig. 7. The results of the other methods produced distorted images with noticeable artifacts and blurry edges. In contrast, the proposed method produced SR images with sharper edges, clearer content, and fewer artifacts. These results further demonstrate the promising performance of our proposed method.
The visual results on real-world images of different methods. RD indicates real-world.
Conclusion
In this paper, we design an architecture to tackle the blind super-resolution task from a new perspective. Instead of estimating the kernel, we propose a decoupled representation module to separate the super-resolution feature and degradation information. Based on the decoupled representation module, we propose a progressive decoupled network to restore the super-resolution image while removing the degradation information progressively. Comprehensive experiments on different benchmarks demonstrate promising performance compared with the state-of-the-art approaches on blind super-resolution. However, the network has a large number of parameters, so a more lightweight structure is needed. Additionally, more efficient attention mechanisms could be designed to aid feature decoupling and improve performance. Furthermore, better strategies for exploiting degradation features can be developed in the future.