Introduction
In recent years, computer vision-based autonomous systems, such as autonomous driving, underwater robotics, video surveillance, and medical imaging, have been widely used [1]. The clarity of the images captured by cameras directly affects the performance of these autonomous systems. However, the acquired images in the real world are not always clear and may suffer from various kinds of degradation, as they are usually taken in complicated situations: bad weather conditions, underwater environments, uneven illumination, moving cameras, etc. For example, images taken by surveillance cameras and medical imaging equipment usually exhibit low resolution [2]; images taken by moving cameras tend to have motion blur [3], [4]; underwater images suffer from color distortion and noise [5], [6], [7]; images taken in hazy, rainy, or foggy weather contain varying levels of blur and noise [8]. Such image degradations cause a severe performance drop of visual systems in segmentation, detection, and target tracking [9], [10], [11], [12]. Therefore, it is critical to develop efficient image restoration (IR) algorithms to enhance the environmental adaptability of visual systems. However, as a typical ill-posed inverse problem, real-world image restoration remains extremely challenging.
In general, image restoration is the process of recovering a high-quality image with good visibility and clean content from a degraded image. As presented in Fig. 1, IR techniques in the low-level vision domain are conventionally grouped into six main categories according to the type of degradation involved in the image or video to be processed, i.e., super-resolution (SR), denoising, deraining, dehazing, deblurring, and color correction. Specifically, SR aims to reconstruct a high-resolution image from one or more low-resolution images [13]. Deraining (dehazing, deblurring, denoising) is the task of raindrop (haze, blur, noise) removal [11], [14], [15]. Color correction deals with color distortion, especially for underwater images [16]. While there are typically many different branches or classes of IR to handle different types of degradation, many newly emerging models can handle multiple tasks after being retrained on different datasets, owing to the powerful prior-learning capability of deep neural networks and publicly available benchmark datasets for image processing.
Considering the remarkable progress in IR made by deep learning models, several researchers [6], [9], [17], [18], [19] have reviewed recent deep learning-based IR techniques, and the primary difference among them is the categorization. In other words, each of these works examined the IR problem from a single perspective. For example, Chen et al. [17] reviewed the image restoration methods, datasets, and assessment metrics for real-world single image super-resolution (SISR). Xu et al. [9] surveyed video and image defogging algorithms and image quality assessment methods for the defogged images. Thakur et al. [20] focused on image denoising analysis and compared different types of denoisers on benchmark datasets. Jian et al. [18] and Wang et al. [6] summarized underwater image processing approaches for problems with underwater turbulence, distortion, spectral absorption, and attenuation. Su et al. [19] discussed image restoration techniques for deblurring, denoising, dehazing, and super-resolution as four independent tasks. In summary, these works reviewed image restoration models from completely different perspectives or topics and ignored the generality of deep learning-based methods. However, it is worth noting that many newly emerging Transformer-based IR models, e.g., Restormer [12], SwinIR [21], and Uformer [22], have the ability to generalize across multiple tasks, and as cutting-edge research works they are rarely discussed in existing image processing reviews.
Targeting the shortcomings of the existing IR review works, this work provides a comprehensive and systematic study of the advances in IR, which, to the best of our knowledge, is the first attempt to provide such an overview of IR techniques. The main contributions of this work are summarized below:
Benchmark datasets are categorized based on their degradation types.
Both conventional and new emerging assessment metrics for comparing recovered image qualities are summarized.
The existing deep learning-based IR methods and their achievements are comprehensively reviewed in general, followed by detailed reviews on CNN-, GAN-, Transformer-, and MLP-based networks in four categories based on the network architectures in particular.
The reconstruction qualities and efficiencies of the representative IR algorithms are compared on benchmark datasets.
Potential challenges and future research directions of IR are discussed and analyzed.
The remainder of this review is organized as follows. The background of IR is briefly introduced in Section II. Benchmark datasets for each degradation type are presented in Section III. Section IV introduces image quality assessment metrics. Section V reviews state-of-the-art IR technologies and methods by categories. The comparisons among representative IR algorithms are presented in Section VI. In Section VII, current challenges and future research directions of IR are analyzed. Finally, concluding remarks are given in Section VIII.
Problem Formulation
Conventional methods for restoring degraded images employ degradation modeling to solve inverse problems, mostly based on maximum likelihood or Bayesian approaches that iteratively correct the estimated degradations [13], [19]. The general degradation model of a low-quality image $I_{L}$ is \begin{equation*} I_{L} = I_{H} \otimes k + n, \tag{1}\end{equation*} where $I_{H}$ denotes the latent high-quality image, $k$ is the degradation (blur) kernel, $\otimes$ denotes the convolution operation, and $n$ is additive noise.
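To make the degradation model concrete, the following minimal sketch simulates Eq. (1) with a Gaussian blur kernel and additive Gaussian noise; the function name, parameter values, and the choice of a Gaussian kernel are illustrative assumptions rather than part of any specific method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(clean: np.ndarray, blur_sigma: float = 1.5, noise_sigma: float = 0.02) -> np.ndarray:
    """Simulate Eq. (1): I_L = I_H (*) k + n, for a float image in [0, 1] with shape (H, W, C)."""
    # I_H (*) k: blur each channel with a Gaussian kernel (a common, simple choice of k)
    blurred = gaussian_filter(clean, sigma=(blur_sigma, blur_sigma, 0))
    # + n: additive white Gaussian noise
    noisy = blurred + np.random.normal(0.0, noise_sigma, size=clean.shape)
    return np.clip(noisy, 0.0, 1.0)
```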
From the Bayesian perspective, the high-quality image can be estimated via maximum a posteriori (MAP) estimation: \begin{equation*} \widehat {I_{H}}=\underset {I_{H}}{\arg \max }\,[\log (P(I_{L}|I_{H})) +\log (P(I_{H}))], \tag{2}\end{equation*} where $P(I_{L}|I_{H})$ is the likelihood of observing the degraded image and $P(I_{H})$ is the prior on the high-quality image.
It is worth noting that degradations in the real world are more complex than the assumed or predefined degradations in conventional methods, because degradation from the physical world is affected by various unknown factors. Deep learning-based methods (e.g., blind image restoration) have more advantages over traditional modeling-based methods in handling complex unknown degradations due to their powerful feature learning capability, and have gradually become the dominant methods for IR. Before reviewing the deep learning-based methods in detail, the benchmark datasets for training or testing deep learning models are presented in the next section, followed by a section introducing existing image quality metrics.
Datasets
Training (testing) datasets are the cornerstones of the IR algorithms, as deep learning-based models are highly dependent on datasets to learn various degradations. In this section, six categories of related benchmark datasets for six different tasks are briefly introduced.
A. Datasets For Super-Resolution
For the training and testing of SR models, the widely used datasets, including DIV2K [24], BSDS500 [25], T91 [26], Set5 [27], Set14 [28], Urban100 [29], and Manga109 [30], are summarized in Table 1. Datasets vary in image amount, quality, scene, diversity of contents, and resolution. Several datasets comprise paired data, while some contain only HR images; the corresponding LR images usually need to be generated by bicubic downsampling combined with a set of degradations (e.g., combinations of different levels of Gaussian blur and noise) [23], [31], as sketched below. Table 1 summarizes the main characteristics of the key datasets, including total image count, HR image resolution, dataset type, and image classes.
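As an illustration only, a minimal LR-synthesis pipeline of the kind described above might look as follows; the function name, scale factor, and noise level are illustrative assumptions, and real pipelines often add blur kernels and compression artifacts as well.

```python
import numpy as np
from PIL import Image

def make_lr(hr_path: str, scale: int = 4, noise_sigma: float = 2.0) -> Image.Image:
    """Synthesize an LR training image from an HR image: bicubic downsampling plus mild noise."""
    hr = Image.open(hr_path).convert("RGB")
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    arr = np.asarray(lr, dtype=np.float32)
    arr += np.random.normal(0.0, noise_sigma, size=arr.shape)  # mild additive noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```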
B. Datasets For Denoising
A variety of datasets with noisy-clean pairs for image denoising have been collected under different conditions, some of which are dedicated to specific applications (e.g., smartphone cameras [40] and fluorescence microscopy), and most are provided for real-world image noise removal. The datasets’ details are listed in Table 2.
C. Datasets For Dehazing
Existing widely used dehazing datasets containing pairs of real/synthetic hazy and corresponding haze-free images are summarized in Table 3. Few hazy image datasets are collected from real-world scenes while most hazy images are synthetic or artificially generated by a haze machine.
D. Datasets For Deblurring
Several datasets with blurry-sharp pairs for image or video deblurring have been collected covering a wide range of scenes, motions, etc. Dataset details are listed in Table 4.
E. Underwater Image Datasets
Underwater image processing is a newly emerging research field in recent years. Due to the complexity of the underwater circumstances and the high workload, only a few publicly available datasets are collected. Table 5 summarizes publicly available typical databases for underwater image processing and analysis.
F. Datasets For Deraining
Image deraining aims to restore a clean image from a degraded image taken on a rainy day. Numerous single-image deraining datasets have been constructed recently. Most datasets are synthesized in two different ways: (1) with the photorealistic rendering techniques proposed by [65], or (2) by adding simulated rain streaks slanted in a certain direction. Table 6 summarizes publicly available databases for deraining tasks.
Image Quality Assessment
Image quality assessment (IQA) plays a vital role in effective model comparison in the field of image processing. The goal of IQA is to accurately predict the quality perceived by human viewers and thereby guide image processing algorithms toward improving image quality to a level acceptable to human viewers.
In general, IQA can be broadly grouped into two categories, i.e., human perception-based subjective assessment and quality metrics-based objective assessment. Overall, human evaluation is more direct and better aligned with practical needs. It is typically reported as the mean opinion score (MOS), the average rating that human raters assign to images. However, the disadvantage of subjective evaluation is two-fold: (i) the evaluation result is easily affected by personal preferences, and (ii) as a non-automated process, subjective assessment is often costly and time-consuming. While several pre-trained CNN or Transformer models [71], [72] based on a large number of human preference scores have been proposed to alleviate the labor cost, the predicted quality scores are not always accurate and the model training process still requires extensive collection of human-judged scores.
By contrast, objective evaluation is more convenient, although the results of different assessment metrics may not necessarily be consistent with each other or with subjective evaluation. The existing image quality metrics can be grouped into two categories, no-reference (NR) and full-reference (FR) metrics, depending on whether ground-truth images are required. Table 7 reports widely used metrics for image quality assessment, including no-reference and full-reference metrics.
A. MS-SSIM
As an FR IQA, multi-scale structural similarity (MS-SSIM) [73] first performs contrast comparison, structure comparison, and luminance comparison on multi-scale images, and then combines the measurement at different scales. Further, an image synthesis approach is adopted to calibrate the parameters of cross-scale image quality models to define the relative importance between scales.
B. PSNR
A representative, widely used FR quality metric is the peak signal-to-noise ratio (PSNR), which focuses primarily on the proximity between pixels and assumes pixel-wise independence, resulting in low consistency with perceptual quality in some cases.
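For reference, given images with maximum pixel value $L$ (e.g., 255 for 8-bit images), PSNR is computed from the mean squared error (MSE) between the restored image $\widehat {I}$ and the ground truth $I$ with $N$ pixels:\begin{equation*} \mathrm {PSNR} = 10\log _{10}\frac {L^{2}}{\mathrm {MSE}},\quad \mathrm {MSE}=\frac {1}{N}\sum _{i=1}^{N}\left ({\widehat {I}(i)-I(i)}\right)^{2}. \end{equation*}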
C. SSIM
The structural similarity index (SSIM) [74] is an FR image quality metric that measures structural similarity and also performs luminance and contrast comparisons. SSIM reflects visual quality better than PSNR [74]. Generally, PSNR and SSIM are used jointly to evaluate the quality of the restored image.
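As an illustration of how these two metrics are typically computed in practice, the snippet below uses scikit-image; note that published results are sometimes computed on the luminance (Y) channel only, so exact numbers may differ from paper tables.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored: np.ndarray, reference: np.ndarray) -> tuple[float, float]:
    """PSNR and SSIM between a restored image and its ground truth (uint8 RGB arrays)."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, data_range=255, channel_axis=-1)
    return psnr, ssim
```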
D. IFC
The information fidelity criterion (IFC) [75] is an FR objective quality assessment criterion based on natural scene statistics. IFC considers natural images as signals with certain statistical properties and quantifies the mutual information between the test and reference images via the signal source and distortion models.
E. VSNR
Visual signal-to-noise ratio (VSNR) [76] is an FR image quality metric based on near-threshold and supra-threshold properties of human vision. First, contrast thresholds for the detection of distortions are computed via a wavelet-based model. If the perceived contrast of the distortion is below the threshold, the image is deemed to be of perfect visual quality; otherwise, perceived contrast and global precedence are taken into account as an alternative measure of structural degradation.
F. FSIM
FSIM [77], as an FR IQA, is based on the assumption that the human visual system (HVS) perceives an image mainly through its salient low-level features. Phase congruency (PC) is used as the primary feature in FSIM, and the image gradient magnitude (GM) is employed as the secondary feature to encode contrast information; together they represent complementary aspects of image visual quality.
G. NIQE
The natural image quality evaluator (NIQE) [78] is a completely blind IQA model without any prior knowledge of distortions. The quality of the distorted image is expressed as the distance between the quality-aware natural scene statistic (NSS) feature model and the distorted image’s multivariate Gaussian (MVG) model.
H. SR-SIM
Spectral residual based similarity (SR-SIM) [79], as an FR real-time IQA, is based on the spectral residual visual saliency model (SRVS) [80]. The feature map obtained from SRVS characterizes the local quality of an image, and the bottom-up visual saliency model provides a weighting function that reflects the importance of a local region in the quality map when pooling the final quality score.
I. PIQUE
The perception-based image quality evaluator (PIQUE) [81] is a blind NR quality evaluation metric. PIQUE first divides the test image into non-overlapping blocks and block-level analysis is performed to identify distortion and grade quality. Further, the overall quality of the test image can be obtained by pooling the block-level scores.
J. CCF
Since most images taken in underwater environments have no reference images, NR metrics are the best choice for evaluating the quality of underwater color images. Wang et al. [82] proposed CCF as a linear combination of a colorfulness index, a contrast index, and a fog density index to predict the color loss caused by absorption, the blurring caused by forward scattering, and the fogging caused by backward scattering, respectively.
K. UCIQE
The underwater color image quality evaluation metric (UCIQE) [83] quantifies non-uniform color cast, blurring, and low contrast, and then linearly combines these three components.
L. UIQM
Another NR criterion for evaluating the quality of underwater images, UIQM [84], is a linear combination of three measures: a colorfulness measure (UICM), a sharpness measure (UISM), and a contrast measure (UIConM), and each attribute metric can be used separately for a specific underwater image processing task. Specifically, UICM utilizes asymmetric alpha-trimmed mean to measure the colorfulness; UISM uses enhancement measure estimation (EME) to measure the sharpness of the grayscale edge map obtained by multiplying the original image with the edge map from the Sobel edge detector; the contrast is measured by applying the logAMEE measure [85] on the intensity image.
M. NRQM
NRQM [86] is a learned NR IQA metric for assessing super-resolved images. Statistical properties including local frequency features, global frequency features, and spatial features are modeled with three independent regression forests. The perceptual scores from linear regression are used to predict the quality of reconstructed SR images.
N. LPIPS
The learned perceptual image patch similarity (LPIPS) [87] is a reference-based assessment metric focused on perceptual similarity. It relies on deep features from networks trained on a large-scale, highly varied perceptual similarity dataset to predict the perceptual similarity between two images.
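In practice, LPIPS is commonly computed with the authors' lpips Python package; the snippet below is a minimal usage sketch with placeholder tensors (the random inputs and image size are illustrative only).

```python
import torch
import lpips  # pip install lpips

# LPIPS expects RGB tensors of shape (N, 3, H, W), scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')              # AlexNet backbone; 'vgg' is also available
restored = torch.rand(1, 3, 256, 256) * 2 - 1   # placeholder restored image
reference = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder ground-truth image
distance = loss_fn(restored, reference)         # lower means perceptually more similar
print(distance.item())
```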
O. IQT
Image quality transformer (IQT) [72] is a perceptual FR image quality assessment method. It uses a CNN backbone to extract feature representations from paired clean/distorted images, and the extracted feature maps are fed into a Transformer to predict the reconstructed SR image quality score. IQT ranked 1st among 13 participants in the NTIRE 2021 perceptual IQA challenge.
P. NeuralSBS
Neural side-by-side (NeuralSBS) [88] is an NR image quality measure. It adopts a CNN model trained on a dataset of image pairs with human preference labels to predict the probability that one image is preferable to its counterpart.
Review of Image Restoration Models
Researchers have been studying IR methods for many decades. Significant progress has been achieved in IR thanks to the advances of deep neural networks. Thus in this review, we focus more on deep learning-based methods. Fig. 2 presents the overall taxonomy of existing IR techniques. According to the type of architecture, they are grouped into four categories: CNN-, GAN-, Transformer-, and MLP-based methods. Table 8 summarizes existing state-of-the-art (SOTA) IR methods. To have a better presentation of existing studies on IR, some necessary diagrams are provided. The following subsections present these methods in detail.
A. CNN-Based Methods
CNN has dominated computer vision for nearly 10 years. Recently, CNN architectures [89], [93], [115], [116], [117], [118] have made significant advancements for IR and have shown vast superiority over conventional restoration approaches [119], [120], [121], [122], as they can learn generalizable priors from large-scale datasets. Driven by recent enormous efforts on building vision benchmarks, numerous CNNs have been developed and achieved promising performance on a wide variety of image restoration and enhancement tasks. These increased performance gains can be mainly attributed to novel architecture designs, invented or borrowed modules and units, including residual learning [123], [124], [125], dilated convolutions [68], [126], dense connections [125], hierarchical structures [103], [127], [128], encoder-decoder [41], [93], [129], [130], [131], [132], [133], [134], multi-stage frameworks [14], [55], [135], [136], [137], and attention mechanisms [138], [139].
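As a concrete illustration of the residual-learning element mentioned above, the following minimal PyTorch block is a generic sketch (not taken from any particular cited model); the channel width is an illustrative choice.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block of the kind widely reused in IR backbones:
    the convolutions learn a residual (e.g., noise or missing detail),
    which is added back to the input via an identity skip connection."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)
```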
Among CNN designs, encoder-decoder architectures [41], [129], [130], [131], [133], [134] have been extensively studied for IR due to their hierarchical multi-scale representation, achieved by progressively mapping the input to low-resolution representations and applying a corresponding inverse mapping, while remaining computationally efficient.
Although the encoder-decoder architectures are effective in capturing broad context by spatial-resolution reduction, these approaches are unreliable in preserving fine spatial details. High-resolution single-scale networks [118], [140], [124], [141] produce images with spatially more accurate details, but are less effective in encoding contextual information due to their limited receptive field. To address this problem, MIRNet [89] employs parallel multi-scale residual blocks while maintaining the original high-resolution features to preserve precise spatial details. Selective kernel feature fusion and dual attention allow for feature aggregation and feature recalibration along the spatial and channel dimensions, respectively.
To provide a balanced design of spatial details and high-level contextualized information while recovering images, the encoder-decoder-based MPRNet [93] employs a multi-stage approach, which progressively restores images by decomposing the challenging IR task into smaller, easier subtasks. Unlike previous multi-stage approaches [14], [55], [135], [136], [137] that simply cascade stages, MPRNet introduces a supervised attention module (SAM, as shown in Fig. 3) between stages to provide ground-truth supervisory signals useful for progressive IR, which facilitates a significant performance gain.
To enhance the transformation modeling capability of CNNs, a deformable convolutional network [142] was developed with deformable convolution and deformable RoI pooling, which augment the spatial sampling locations with additional learned offsets. Inspired by the projective motion path blur (PMPB) model and deformable convolution, a constrained deformable convolutional network (CDCN) [143] was designed for blind deblurring, combining a blur-kernel estimation approach with a PMPB-based deblurring loss function.
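The core idea of deformable convolution, i.e., sampling the input at learned offsets, can be sketched with torchvision's deformable convolution operator as follows; the block structure and channel width are illustrative assumptions, not the design of [142] or [143].

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A regular convolution predicts per-location sampling offsets, which the
    deformable convolution uses to sample the input at learned, spatially
    varying positions."""

    def __init__(self, channels: int = 64, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) for each of the k*k kernel positions
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)
        return self.deform_conv(x, offsets)
```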
The self-supervised learning model Self2Self [90] targeted the absence of noisy-clean image pairs (i.e., only noisy images are available) for training. It was trained with dropout on pairs of Bernoulli-sampled instances of the input images, and the Bernoulli dropout scheme was adopted in both training and testing for variance reduction. Neighbor2Neighbor [91] went one step further and adopted a random neighbor sub-sampler to generate training image pairs.
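The neighbor sub-sampling idea can be sketched as follows: each 2x2 cell of a noisy image contributes two different pixels to two half-resolution sub-images, which then serve as the input/target training pair. This is a rough illustration under that assumption, not the authors' implementation.

```python
import torch

def neighbor_subsample(noisy: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a noisy image batch (B, C, H, W), with even H and W, into two
    half-resolution sub-images by picking two different pixels per 2x2 cell."""
    b, c, h, w = noisy.shape
    cells = noisy.view(b, c, h // 2, 2, w // 2, 2).permute(0, 1, 2, 4, 3, 5)
    cells = cells.reshape(b, c, h // 2, w // 2, 4)            # 4 pixels per cell
    idx1 = torch.randint(0, 4, (b, 1, h // 2, w // 2, 1), device=noisy.device)
    idx2 = (idx1 + torch.randint(1, 4, idx1.shape, device=noisy.device)) % 4  # a different neighbor
    sub1 = torch.gather(cells, -1, idx1.expand(b, c, h // 2, w // 2, 1)).squeeze(-1)
    sub2 = torch.gather(cells, -1, idx2.expand(b, c, h // 2, w // 2, 1)).squeeze(-1)
    return sub1, sub2
```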
The model-driven CNN-based IR methods, e.g., plug-and-play IR [144], [145], have shown that a denoiser can implicitly serve as the image prior for model-based methods to solve the inverse problem. The representative work DPIR plugs the deep denoiser prior as a modular part into a half quadratic splitting-based iterative algorithm to solve the IR problem.
Instance normalization (IN) is widely used in high-level computer vision tasks, but its performance degrades severely in low-level tasks. To address this problem, the half instance normalization (HIN) block (Fig. 4) was introduced as a building block in HINet [92], which significantly boosts the performance of IR.
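The central idea, normalizing only half of the feature channels, can be sketched as below; this is a simplified illustration of the concept, omitting the convolutions and residual branch of the full HIN block.

```python
import torch
import torch.nn as nn

class HalfInstanceNorm(nn.Module):
    """Apply instance normalization to half of the channels and keep the
    other half untouched, preserving content-related statistics."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels // 2, affine=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_norm, x_id = torch.chunk(x, 2, dim=1)   # split along the channel axis
        return torch.cat([self.norm(x_norm), x_id], dim=1)
```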
NBNet [94] achieves denoising from a new perspective of subspace projection, based on the observation that projection can naturally preserve the local structure of the input signal. NBNet learns to generate a set of reconstruction basis for the signal subspace with a subspace attention (SSA) module. By projecting the input into signal subspace, the signal can be enhanced after separation from noise (Fig. 5).
As illustrated in Fig. 5, NBNet denoises via subspace projection in two steps: 1) basis generation, which generates subspace basis vectors from feature representations, and 2) projection, which maps the feature representations into the signal subspace.
DGUNet [96] employs a novel interpretable deep unfolding network for single image restoration (SIR), which integrates a gradient estimation into the proximal gradient descent (PGD) algorithm. Inter-stage information pathways broadcast multi-scale features in a spatial-adaptive normalization way, which rectifies the intrinsic information loss.
Degradation-specific CNNs have achieved promising results on benchmark datasets. However, these methods suffer a severe performance drop when the degradation differs in practical applications. To address this challenge, SPAIR [95] exploits distortion-localization information and uses distortion guidance to perform spatially-varying modulation on degraded pixels. By employing multiple encoder-decoder pairs to cope with each type of corruption, AirNet [97] can be used as an all-in-one solution capable of handling multiple types of degradations (see Fig. 6). Specifically, contrastive learning is used to extract the degradation representation from the input, and the subsequent image restoration network is guided by this degradation representation.
Overall, ‘convolution’ in CNNs provides local connectivity and translation equivariance. These properties bring efficiency and generalization to CNNs, but they also cause two main issues: (1) the limited receptive field of convolution makes it difficult to model long-range pixel dependencies, and (2) the convolution filters have static weights at inference and thereby cannot flexibly adapt to the input content.
B. GAN-Based Methods
GAN is another class of deep generative models, which has recently gained significant attention [146]. It adopts an architecture in which two opposing networks compete with each other to generate the desired data: the discriminator and the generator play a two-player minimax game. The generator learns to produce new samples with the same distribution as the target domain; it is trained to fool the discriminator and thereby capture the real data distribution. The minimax game with value function $V(D,G)$ is formulated as \begin{align*} \underset {G}{\mathrm {min}}\,\underset {D}{\mathrm {max}}\,V(D,G) &= \mathbb {E}_{x\sim p_{data}(x)}[\log D(x)] \\ &\quad + \mathbb {E}_{z\sim p_{z}(z)}[\log (1- D(G(z)))]. \tag{3}\end{align*}
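The alternating optimization implied by Eq. (3) can be sketched as a single training step; the generator, discriminator, optimizers, and data below are placeholders, the discriminator is assumed to output probabilities, and the non-saturating generator loss is used as is common in practice.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z):
    """One alternating update of the minimax game in Eq. (3)."""
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    fake = G(z).detach()
    d_real, d_fake = D(real), D(fake)
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()
    # Generator (non-saturating form): maximize log D(G(z))
    opt_g.zero_grad()
    d_fake = D(G(z))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```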
As a popular GAN variant, the conditional GAN [147], taking the class label and the latent code as inputs, has been widely applied to image-to-image translation problems, including image restoration and enhancement as special cases. To recover finer texture details when upscaling an image, SRGAN, a GAN-based image SR model, employs a deep residual network (ResNet [148]) with skip connections and an optimized perceptual loss computed on feature maps of the VGG network [149]. To deal with motion blur, DeblurGAN [101], an end-to-end blind IR model, exploits Wasserstein GAN [150] with the gradient penalty [151] and the perceptual loss [152]. DeblurGAN-v2 [103], featuring a lightweight and fast design, introduces a feature pyramid network (FPN) as a core building block of its generator. KMSR [102] improves blind SR performance by integrating blur-kernel estimation into the GAN.
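A VGG-feature perceptual loss of the kind used by SRGAN and DeblurGAN can be sketched as follows; the chosen feature layer index and the use of an MSE distance are illustrative assumptions, and the exact layers and weighting differ between the cited models.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """Compare restored and ground-truth images in the feature space of a
    frozen, ImageNet-pretrained VGG-19 instead of pixel space."""

    def __init__(self, layer: int = 35):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(vgg.features.children())[:layer]).eval()
        for p in self.features.parameters():
            p.requires_grad = False   # the loss network is not updated

    def forward(self, restored: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(self.features(restored), self.features(target))
```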
CycleGAN [153] performs unpaired image-to-image translation from a source domain to a target domain by enforcing cycle consistency, which makes it attractive for IR tasks where aligned degraded/clean image pairs are unavailable.
Directly applying such a domain transfer approach to IR would lead to domain-shift problems in the translated images due to the lack of effective supervision. Instead, LIR [105] learns invariant representations from noisy data and reconstructs clean observations by introducing discrete disentangled representations and adversarial domain adaptation into a general domain transfer framework.
DBGAN [107] is a distribution-induced bidirectional GAN, which introduces graph representation learning and estimates a structure-aware prior distribution for the latent representation based on DPP [156] and prototype learning. The diagram of DBGAN is depicted in Fig. 8.
SinGAN [157] shows that a randomly-initialized GAN model is able to capture rich patch statistics when trained on a single image. DGP [110] goes one step further, exploiting a deep generative prior for image restoration by training on large-scale natural images and adopting a progressive reconstruction strategy that fine-tunes the generator gradually (Fig. 9). DGP achieves a more precise and faithful reconstruction of real images on a range of different tasks (e.g., colorization, inpainting, SR, image morphing, etc.). GANs have achieved impressive performance even in extremely complex applications (e.g., underwater) [158], [159], [160]. A representative work, AquaGAN [161], proposes a weighted combination of content and style loss for the first time and generates clean underwater images. It is worth noting that the attenuation coefficient in AquaGAN is very sensitive, and the recovery results vary significantly with different coefficient values.
Overall, GANs can generate data that look similar to the original data. However, there exist major challenges in the training of GANs, i.e., mode collapse, non-convergence, and instability. The trained models may vary a lot between adjacent iterations, training can easily get trapped in a bad local minimum, and the generalization ability of the final trained model cannot be guaranteed.
C. Transformer-Based Methods
More recently, another class of neural architectures, Transformer, has achieved great success in natural language processing (NLP) and high-level vision tasks. Subsequent research explorations on Vision Transformers (e.g., ViT [162]) have exemplified their great potential as alternatives to the go-to CNN models.
By leveraging the self-attention (SA) mechanism [163], Transformers mitigate the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content) with the capability to capture long-range dependencies between image patch sequences and to adapt to the given input content. The SA mechanism plays a key role in modeling global connectivity. It calculates the response at a given pixel by a weighted sum of all other positions, which can be described as mapping a query and a set of key-value pairs to an output. The SA scoring function is defined as \begin{equation*} \mathrm {Attention}(Q,K,V) =\mathrm {softmax}\left({\frac {QK^{T}}{\sqrt {d_{k}}}}\right)V, \tag{4}\end{equation*} where $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, and $d_{k}$ is the dimension of the keys.
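A minimal PyTorch sketch of Eq. (4) is given below; the tensor shapes are illustrative, and the quadratic cost in the number of tokens is visible in the shape of the attention map.

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, num_tokens, d_k). Implements Eq. (4)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, tokens, tokens): quadratic in tokens
    weights = torch.softmax(scores, dim=-1)            # attention weights over all token positions
    return weights @ v                                 # weighted sum of the values
```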
By exploiting the powerful feature representation capability of the Transformer and the strong computing power of modern hardware, an image processing transformer (IPT [111]) pre-trained on the ImageNet dataset was developed, which can be used for super-resolution, denoising, or deraining tasks after fine-tuning on a task-specific dataset. The IPT model consists of multiple pairs of heads and tails for different tasks and a shared Transformer body including an encoder and a decoder, as shown in Fig. 10. Specifically, the multi-head takes the degraded images as input and converts them into feature maps, which are then split into patches as “visual words” for subsequent processing in the Transformer. The clean images are reconstructed by assembling the output patches. Notably, IPT requires the degradation information a priori in order to select the associated head.
Pioneering vision Transformer works for low-level vision [111], [164] accept relatively small patches (tokens) split from the input image, which inevitably causes patch boundary artifacts when they are applied to larger images. To tackle this problem, Swin Transformer [165] adopts a shifted windowing scheme that limits self-attention computation to non-overlapping local windows while also allowing cross-window connection. SwinIR [21] takes advantage of the Swin Transformer and utilizes several Swin Transformer layers for local attention and cross-window interaction. In addition, SwinIR uses an MLP (two layers with GELU [166]) for feature transformation.
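The window partition underlying such window-based self-attention can be sketched as follows; this is a generic illustration (the shapes and the absence of window shifting are simplifying assumptions), not code from SwinIR.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Partition feature maps (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C); self-attention is then
    computed independently inside each window. Assumes H and W are divisible
    by window_size."""
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, c)
```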
Through attribution analysis, it has been observed that Transformers often utilize only a limited spatial range of the input information. To address this problem, the hybrid attention Transformer (HAT [167]) combines channel attention and self-attention schemes and makes use of their complementary advantages. To enhance the interaction between neighboring window features, an overlapping cross-attention module is employed in HAT.
While the SA mechanism in Transformers has shown its superiority over CNNs in capturing long-range pixel interactions for low-level vision tasks, its computational complexity grows quadratically with the spatial resolution, making it infeasible to apply SA directly to HR images. By computing transposed attention across channels rather than across spatial positions, the attention map becomes of size $C\times C$ instead of $HW\times HW$, and the computational load is significantly reduced:\begin{equation*} \mathrm {Attention}(Q,K,V) =V \,\mathrm {softmax}\left({\frac {KQ}{\alpha }}\right), \tag{5}\end{equation*} where $\alpha$ is a scaling parameter (learnable in practice) controlling the magnitude of the dot product before the softmax.
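A sketch of such channel-wise (transposed) attention is given below; the normalization of queries and keys and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def channel_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """q, k, v: (batch, channels, height * width). The attention map has shape
    (channels, channels), so the cost grows linearly, not quadratically,
    with the number of pixels."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = torch.softmax((q @ k.transpose(-2, -1)) / alpha, dim=-1)  # (B, C, C)
    return attn @ v                                                  # (B, C, H*W)
```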
Uformer [22] employs a U-shaped hierarchical encoder-decoder architecture and leverages a locally-enhanced window Transformer with two core designs: (1) non-overlapping window-based self-attention (W-MSA), and (2) a locally-enhanced feed-forward network (LeFF). Given a degraded image, the hierarchical encoder progressively extracts multi-scale features, and the decoder reconstructs the restored image with the help of skip connections.
Overall, SA is highly effective in capturing long-range pixel interactions, but its complexity grows quadratically with spatial resolution. How to reduce the computational complexity of SA and maintain efficiency in modeling global connectivity will be the focus of future research.
D. MLP-Based Methods
Most recently, deep multilayer perceptron (MLP) models have aroused great interest in the vision community. MLPs are considered the “classic” form of a neural network, consisting of a series of simple fully-connected layers or perceptrons, which were first developed in 1958. MLP-Mixer [112], an architecture based exclusively on multi-layer perceptrons, presented by Google Brain in May 2021, led the revival of MLPs. By leveraging two types of layers, one with MLPs applied independently to image patches for mixing local features and the other with MLPs applied across patches for mixing spatial information, MLP-Mixer demonstrates that while convolutions and attention are both sufficient for good performance, neither of them is necessary.
The Google Brain team further studied the necessity of self-attention modules in Transformers and proposed the gMLP [113] model, based on MLPs with gating. The spatial gating unit (SGU) is the key design element of gMLP used for cross-token interactions (Fig. 11). The gating function is defined as \begin{equation*} s(Z) =Z_{1} \odot f_{W,b}(Z_{2}), \tag{6}\end{equation*} where the input $Z$ is split into two parts $Z_{1}$ and $Z_{2}$ along the channel dimension, $f_{W,b}(\cdot)$ is a linear projection applied across the token (spatial) dimension, and $\odot$ denotes element-wise multiplication.
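A minimal PyTorch sketch of the spatial gating unit in Eq. (6) is shown below; layer normalization of $Z_{2}$ follows the gMLP design, while the fixed token count passed to the constructor is an illustrative simplification.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Split the input along channels; project one half across the token
    dimension (f_{W,b}) and use it to gate the other half element-wise."""

    def __init__(self, dim: int, num_tokens: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.proj = nn.Linear(num_tokens, num_tokens)   # acts across tokens, not channels

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z1, z2 = z.chunk(2, dim=-1)                     # (B, N, dim/2) each
        z2 = self.norm(z2)
        z2 = self.proj(z2.transpose(-1, -2)).transpose(-1, -2)
        return z1 * z2                                  # element-wise gating (Eq. (6))
```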
Zhao et al. [168] proposed a multi-axis self-attention embedded in the HiT model. They first split the input image patches into two groups along the channel dimension, one of which performs regional attention operation within fixed windows and the other performs dilated attention across windows. Therefore, self-attention gets enhanced by considering local (within windows) as well as global (across windows) relations.
MAXIM [114], multi-axis MLP for image processing, takes advantage of multi-axis self-attention and gating mechanisms in gMLP. The proposed multi-axis gated MLP block (MAB) (Fig. 12) can enjoy a global receptive field, with linear complexity. The cross-gating block (CGB) as an extension of MAB built on gMLP allows global contextual features to gate the skip-connections. With multi-stage multi-scale architecture, MAXIM achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
Notably, MLPs have an advantage in capturing global attention, but applying gMLP-style models to low-level vision tasks requires compensating for the local inductive biases (e.g., locality and translation equivariance) that basic CNNs naturally provide.
Overall, the above four different types of IR solutions have their unique characteristics, and thus it is necessary to weigh their pros and cons according to the application requirements. More specifically, CNNs, serving as backbone networks, have achieved impressive results with dense connections [169] and more complex forms of convolution [142], [170], [171], but are limited in modeling long-range pixel interactions and in processing large images. Confronted with the lack of well-aligned image pairs, GAN-based domain translation methods are more flexible, but the generalization issue still exists. Transformers have a significant advantage in capturing long-range dependencies, at the cost of quadratic computational complexity due to self-attention. Although attention-free MLPs can achieve IR performance comparable to Transformers, further exploration is required to make breakthroughs in reducing complexity and in capturing both global and local connectivity.
Comparisons Among State-of-the-Arts
In this section, representative SOTA IR methods are visually and numerically compared on benchmark datasets. More specifically, the selected competitors include DeblurGAN-v2 [103], MPRNet [93], MIRNet [89], DGUNet [96], MAXIM [114], Restormer [12], Uformer-B [22], IPT [111], SwinIR [21], HAT-L [48], and BSRGAN [108], covering multiple kinds of approaches mentioned in Section V. Experiments are conducted on six benchmark datasets for different IR tasks, GoPro [55] for motion deblurring, SIDD [40] for denoising, Rain100L [40] for deraining, SOTS [48] for outdoor dehazing, Set14 [28] for SR, and UIEB [61] for underwater image restoration. Several different IQAs are used to measure the quality of the restored images. The numerical results are presented in Tables 9 to 14 and the visual comparisons are shown in Fig. 13.
Fig. 13 provides visual comparisons of representative SOTA IR algorithms for the six different tasks, from top to bottom: motion deblurring on GoPro [55], denoising on SIDD [40], deraining on Rain100L [40], outdoor dehazing on SOTS [48], SR on Set14 [28], and underwater image restoration on UIEB [61].
Experiment Results Analysis: (1) Benefiting from the design of a multi-scale restoration modulator, Uformer-B [22] shows its advantage in the motion blur removal task; the image restored by Uformer-B is clearer than those of other methods (see Table 9 and the 1st row in Fig. 13). (2) For the denoising task, the image reproduction quality of Restormer [12] is more faithful to the ground truth than other methods (see Table 10 and the 2nd row in Fig. 13), owing to the multi-scale hierarchical design incorporating a gating mechanism in the feed-forward network. (3) IPT [111] shows significant superiority over other methods in deraining (Table 11 and the 3rd row in Fig. 13), which can be attributed to its pre-training strategy, as the pre-trained model benefits from self-generated training instances based on the original real images. (4) DehazeFormer-B [11], based on Swin Transformer [165], improves dehazing performance by modifying the normalization layer, adopting SoftReLU, and aggregating spatial information; it achieves the best dehazing results in terms of detail restoration (see Table 12 and the 4th row in Fig. 13). (5) The combination of channel attention and self-attention, together with the overlapping cross-attention module, significantly enhances the performance of HAT-L [48] in recovering finer details in the SR task; HAT-L outperforms other SOTA SR models, as shown in Table 13 and the 5th row of Fig. 13. (6) Two existing SOTA underwater IR methods, Water-Net [61] and AquaGAN [110], are compared; from the last row of Fig. 13, we can observe that the image restored by AquaGAN has better quality in terms of color correction and texture detail restoration, and the numerical results listed in Table 14 match the visual observation.
Challenges and Future Suggestions
In this section, current challenges faced by IR are analyzed from different perspectives and directions are discussed for future research.
A. Image Datasets
Most benchmark datasets for IR are synthesized or simply handcrafted, while only a few datasets with real degradations have been collected, typically in limited scenes, due to the infeasibility of collecting degraded/non-degraded image pairs in most real scenes. However, the lack of diverse real degradations in the training samples fails to provide strong priors for learning-based models, which causes severe performance degradation for trained models in real applications. How to generate more realistic degradations in training samples could be a research direction in the future.
B. Image Quality Assessment
IQA plays a critical role in image processing tasks. Although it is easy for human beings to distinguish perceptually better images, it has proved difficult for algorithms. GAN-based image processing algorithms pose particular challenges, as they bring completely new characteristics to the output images. The growing discrepancy between quantitative evaluation results and perceptual quality will hinder the development of image processing algorithms if IQA methods cannot objectively compare their perceptual quality. Therefore, it is desirable to develop more suitable IQA methods that adapt to the emerging image processing algorithms.
C. Image Restoration Algorithms
Although promising results have been achieved in specific areas, such as denoising, deblurring, deraining, dehazing, and super-resolution, image restoration still encounters the following obstacles in practice. 1) The type of degradation must be known in advance to select a suitable model, since most existing methods can only handle one specific degradation at the inference stage, even though they can handle different IR tasks after retraining. 2) In real applications, the degradation usually changes in complex environments, and the images may suffer from various degradations consecutively or even simultaneously (e.g., rainy and hazy weather). Confronted with these challenges, it is necessary to build a generic model that can handle various degradations in one solution.
Despite the promising performance gains achieved with novel architecture designs and newly invented modules or units, the increase in computational complexity of the SOTA frameworks limits their real-time applications. Lightweight models are desired for practical needs. Optimizing the IR models to have a better trade-off between the restoration performance and running time would be an unavoidable topic.
Conclusion
With the rapid development of consumer and industrial cameras and smartphones, the demand for high-quality images is constantly growing. Real-world image restoration plays a crucial role in recovering clean images and has been receiving increasing attention. Due to its ill-posed nature, IR remains a challenging problem. In this paper, the commonly used datasets and assessment metrics for IR models are first summarized. Then, recent IR methods for reproducing realistic images, including CNN-, GAN-, Transformer-, and MLP-based algorithms, are comprehensively reviewed, and the pros and cons of each type of architecture are presented. Finally, the challenges of IR are analyzed; although progress has been made on IR in the past few years, the remaining unsolved problems indicate promising directions for future exploration.