Introduction
Light field (LF) cameras record both intensity and direction of light rays, and enable many applications such as refocusing [1], depth estimation [2], [3], [4], and view rendering [5], [6], [7]. Since high-resolution (HR) LF images are beneficial to various applications but are generally obtained at an expensive cost, it is necessary to reconstruct HR LF images from low-resolution (LR) LF images, i.e., to achieve LF image super-resolution (SR).
In the past decade, deep neural networks (DNNs) have been successfully applied to LF image SR and achieved significant progress [8], [9], [10], [11], [12]. In the area of LF image SR, many networks [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30] were developed to improve SR accuracy. However, real-world LF image SR has remained underinvestigated for the following two reasons. First, it is challenging to develop an LF image SR model that can handle real-world degradation. Real-world LF images suffer from diverse degradation which varies with both imaging devices (e.g., Lytro or RayTrix cameras) and shooting conditions (e.g., scene depth, focal length, and illuminance). However, existing LF image SR methods focus on the design of network architectures and develop models on the simple bicubic downsampling degradation. Consequently, these methods suffer a notable performance drop when applied to real LF images. Second, it is challenging to utilize the degradation information while simultaneously incorporating the complementary angular information. Existing methods generally achieve real-world SR on single images (i.e., they ignore the view-wise correlation), and thus cannot achieve satisfactory performance on LF image SR.
In this article, we propose a simple yet effective method for real-world LF image SR. In our method, we first formulate a practical degradation model to approximate the degradation process of real LF images, and then develop a convolutional neural network to super-resolve LF images under diverse and real degradation. To incorporate the degradation prior into the SR process, we design a degradation-modulating convolution (DM-Conv) whose weights are dynamically generated according to the degradation representation. By integrating the proposed DM-Conv with the disentangling mechanism [27], our network (namely, LF-DMnet) can well incorporate spatial and angular information under diverse degradation. As shown in Fig. 1, compared with DistgSSR [27] and DASR [31], our method achieves better performance on real LF images and generates images with clearer details and fewer artifacts.
The contributions of this work are summarized as follows.
We propose a practical LF degradation model to handle the real-world LF image SR problem. Different from existing works which focus on advanced network designs, we are the first to address the importance of degradation formulation and modulation in LF image SR.
We propose a degradation-modulating network (i.e., LF-DMnet) to incorporate the degradation prior into the SR process. Extensive ablation studies and model analyses validate the effectiveness of our degradation modulation mechanism.
Our method achieves state-of-the-art SR performance on both synthetic and real-world degradation, which not only provides a simple yet strong baseline, but also takes a step toward practical real-world LF image SR.
The rest of this article is organized as follows. In Section II, we briefly review the related works. In Section III, we describe our degradation model for LF image SR. In Section IV, we introduce the details and design thoughts of our LF-DMnet. Experimental results are presented in Section V. Finally, we conclude this article in Section VI.
Related Work
In this section, we briefly review several major works for DNN-based single image SR and LF image SR.
A. Single Image Super-Resolution
The goal of single image SR is to reconstruct an HR image from its LR version. According to the degradation settings, existing single image SR methods can be roughly categorized into single-degradation-based methods and multidegradation-based methods.
Early works on DNN-based single image SR were generally developed on a single and fixed degradation (e.g., bicubic downsampling). Dong et al. [34] first applied convolutional neural networks to image SR and developed a three-layer network named SRCNN. Although SRCNN is shallow and lightweight, it outperforms many traditional SR methods [35], [36], [37], [38]. Since then, deep networks have dominated the SR area and achieved continuously improved accuracy with large models and complex architectures. Kim et al. [39] applied a global residual learning strategy to image SR and developed a 20-layer network called VDSR. Lim et al. [40] proposed an enhanced deep SR (EDSR) network by using both global and local residual connections. Zhang et al. [41] combined residual learning with dense connections to build a residual dense network with more than 100 layers. Subsequently, Zhang et al. [42] developed a very deep network with a residual-in-residual architecture to achieve competitive SR accuracy. More recently, attention mechanisms [43], [44], [45] and Transformer architectures [46], [47] have been extensively studied to achieve state-of-the-art SR performance.
Although the aforementioned methods have achieved continuously improved SR performance, they are designed for a single fixed degradation (e.g., bicubic downsampling) and suffer from a significant performance drop when the degradation differs from the assumed one. Consequently, many methods have been proposed to achieve image SR under multiple and varied degradation [48]. Zhang et al. [49] proposed an SRMD network in which the degradation map was concatenated with the LR image as the input of the DNN. Subsequently, Xu et al. [50] applied dynamic convolutions to achieve better SR performance than SRMD. In [51], an unfolding SR network was developed to handle different degradation by alternately solving a data subproblem and a prior subproblem. Gu et al. [52] proposed an iterative kernel correction method (namely, IKC) to correct the estimated degradation by observing previous SR results. More recently, Wang et al. [31] achieved degradation representation learning in a contrastive manner and developed a degradation-aware SR network named DASR for real-world single image SR.
B. LF Image Super-Resolution
The goal of LF image SR is to super-resolve each subaperture image (SAI) of an LF. A straightforward scheme for LF image SR is to apply single image SR methods to each SAI independently. However, this scheme cannot achieve good performance since the complementary angular information among different views is not considered. Consequently, existing LF image SR methods focus on designing advanced network architectures to fully use both spatial and angular information.
Yoon et al. [13] proposed the first DNN-based method, called LFCNN, to enhance both the spatial and angular resolution of an LF. In their method, SAIs are first super-resolved using SRCNN [34] and then fine-tuned in pairs or quads to incorporate angular information. Wang et al. [15] proposed a bidirectional recurrent network for LF image SR, in which the angular information in adjacent horizontal and vertical views was incorporated in a recurrent manner. Zhang et al. [19] proposed a multibranch residual network to incorporate the multidirectional epipolar geometry prior for LF image SR. In their subsequent work MEG-Net [23], the SR performance was further improved by applying 3-D convolutions to SAI stacks along different angular directions. Jin et al. [20] developed an all-to-one method for LF image SR and performed structural consistency regularization to preserve the LF parallax structure. Wang et al. [21] developed LF-InterNet to repeatedly interact spatial and angular information for LF image SR, and then generalized the spatial-angular interaction mechanism into the disentangling mechanism [27] to achieve state-of-the-art SR accuracy.
More recently, Wang et al. [22] used deformable convolutions [53], [54] to address the disparity problem in LF image SR. Cheng et al. [24] proposed a zero-shot learning scheme to handle the domain gap among different LF datasets. Liang et al. [29] proposed a Transformer-based LF image SR network, in which a spatial Transformer and an angular Transformer were designed to model long-range spatial dependencies and angular correlations, respectively. Wang et al. [28] proposed a detail-preserving Transformer to exploit nonlocal context information and preserve details for LF image SR. Heber et al. [30] investigated the nonlocal spatial-angular correlations in LF image SR and developed a Transformer-based network called EPIT to achieve state-of-the-art SR performance.
Although remarkable progress has been achieved in LF image SR, existing methods focus on advanced network design but ignore the generalization capability to real-world degradation. In this article, we handle the real-world LF image SR problem by formulating a practical LF degradation model and designing a degradation-modulating network.
LF Image Degradation Formulation
In this section, we formulate a general and practical degradation model for real-world LF image SR. In Section III-A, we analyze the camera imaging process and derive the image degradation model. In Section III-B, we extend the degradation model to 4-D LF images to build the LF image degradation model, and discuss its key components. In Section III-C, we compare the differences between our method and existing SR methods.
A. Degradation Formulation
In this section, we first formulate the camera imaging process considering three key factors including point spread function (PSF), sensor sampling, and additional noise. Then, we derive the image degradation model based on the formulated camera imaging process.
Fig. 2 shows a toy example of the camera imaging process, in which the light rays are first projected onto the sensor plane [as shown in Fig. 2(a)] and then sampled by the sensor units [as shown in Fig. 2(b)]. Let ${\mathcal {I}}_{\text {ideal}}(x, y)$ denote the ideal sharp image function projected onto the sensor plane and $k_{\text {psf}}(u, v)$ denote the PSF of the camera. The real image function on the sensor plane can be formulated as \begin{align*}&\hspace {-.5pc} {\mathcal {I}}_{\text {real}}(x, y) = \int _{-\infty }^{+\infty } \int _{-\infty }^{+\infty } k_{\text {psf}}(u,v) \\&\cdot \, {\mathcal {I}}_{\text {ideal}}(x-u, y-v)\, du\, dv \tag{1}\end{align*}
which can be written compactly as a 2-D convolution \begin{equation*} {\mathcal {I}}_{\text {real}} = {\mathcal {I}}_{\text {ideal}} \otimes k_{\text {psf}}. \tag{2}\end{equation*}
The continuous image function ${\mathcal {I}}_{\text {real}}$ is then integrated by the sensor units. Let $\epsilon$ denote the sampling interval (i.e., the size of a sensor unit) and $\mathcal {N}$ denote the additive noise. The recorded LR image can be formulated as \begin{equation*} {\mathcal {I}}_{\text {LR}}(h, w) = \int _{h-\frac {\epsilon }{2}}^{h+\frac {\epsilon }{2}} \int _{w-\frac {\epsilon }{2}}^{w+\frac {\epsilon }{2}} {\mathcal {I}}_{\text {real}}(x, y)\, dx\, dy + \mathcal {N}(h, w) \tag{3}\end{equation*}
which we abbreviate as \begin{equation*} {\mathcal {I}}_{\text {LR}} = \left [{ {\mathcal {I}}_{\text {real}} }\right]_{\epsilon } + \mathcal {N} \tag{4}\end{equation*}
where $\left [{ \cdot }\right]_{\epsilon }$ denotes the sampling operator with interval $\epsilon$.
Illustration of the camera imaging process. (a) Camera imaging model. (b) Image on the sensors.
In the image SR task, it is expected to reconstruct (or estimate) the ideal image function at a finer sampling interval. For an upscaling factor $\alpha$, the HR image to be reconstructed can be formulated as \begin{equation*} {\mathcal {I}}_{\text {HR}} = \left [{ {\mathcal {I}}_{\text {ideal}} }\right]_{\frac {\epsilon }{\alpha }}. \tag{5}\end{equation*}
Since sampling with interval $\epsilon$ is equivalent to sampling with the finer interval $\frac {\epsilon }{\alpha }$ followed by an $\alpha \times$ downsampling, we have \begin{equation*} \left [{ \mathcal {I} }\right]_{\epsilon } = \left ({\left [{ \mathcal {I} }\right]_{\frac {\epsilon }{\alpha }}}\right)_{\downarrow _{\alpha }},\quad \mathcal {I}={\mathcal {I}}_{\text {ideal}} \text { or } {\mathcal {I}}_{\text {real}}. \tag{6}\end{equation*}
Substituting (2) and (6) into (4), we obtain \begin{equation*} {\mathcal {I}}_{\text {LR}} = \left ({\left [{ {\mathcal {I}}_{\text {ideal}} \otimes k_{\text {psf}} }\right]_{\frac {\epsilon }{\alpha }}}\right)_{\downarrow _{\alpha }} + \mathcal {N}. \tag{7}\end{equation*}
When the sampling interval is sufficiently small, the sampling and convolution operators can be assumed to commute, i.e., \begin{equation*} \left [{ {\mathcal {I}}_{\text {ideal}} \otimes k_{\text {psf}} }\right]_{\frac {\epsilon }{\alpha }} = \left [{ {\mathcal {I}}_{\text {ideal}} }\right]_{\frac {\epsilon }{\alpha }} \otimes k_{\text {psf}} \tag{8}\end{equation*}
where $k_{\text {psf}}$ on the right-hand side is understood as the discretized blur kernel. Combining (5), (7), and (8) yields the image degradation model \begin{equation*} {\mathcal {I}}_{\text {LR}} = \left ({{\mathcal {I}}_{\text {HR}} \otimes k }\right)_{\downarrow _{\alpha }} + \mathcal {N} \tag{9}\end{equation*}
where $k$ denotes the blur kernel, $(\cdot)_{\downarrow _{\alpha }}$ denotes $\alpha \times$ downsampling, and $\mathcal {N}$ denotes the additive noise.
B. LF Image Degradation Model
We use the two-plane model [55] to parameterize a 4-D LF as $\mathcal {L}(u, v, h, w)$, where $(u, v)$ denote the angular coordinates of a view (i.e., an SAI) and $(h, w)$ denote the spatial coordinates of a pixel within that view.
Here, we extend our degradation model [i.e., (9)] to 4-D LFs and build the LF image degradation model as \begin{equation*} \mathcal {I}^{\textit {lr}}_{u,v} = \left ({\mathcal {I}^{\textit {hr}}_{u,v} \otimes {k}_{u,v}}\right)_{\downarrow _{\alpha }} + \mathcal {N}_{u,v} \tag{10}\end{equation*}
where $\mathcal {I}^{\textit {lr}}_{u,v}$ and $\mathcal {I}^{\textit {hr}}_{u,v}$ denote the LR and HR SAIs at angular coordinate $(u, v)$, and $k_{u,v}$ and $\mathcal {N}_{u,v}$ denote the blur kernel and additive noise of that view, respectively.
1) Blur Kernel:
We follow existing works [49], [52] and use the isotropic Gaussian kernel, parameterized by its kernel width, to synthesize blurred LF images. Note that, although anisotropic kernels (e.g., anisotropic Gaussian blur and motion blur) are also used in recent single image SR methods [31], [51], [56], [57], [58] for degradation modeling, we do not consider these blur kernels in our method because, under LF structures, the rotation angle of the anisotropic Gaussian kernel and the trajectory of the motion blur should differ across SAIs in a correlated manner. The formulation of these anisotropic blur kernels depends on the 6-D pose changes of LF cameras and belongs to the LF deblurring task [59], [60], [61]. As demonstrated in Section V-B2, based on the isotropic Gaussian blur assumption, our method can achieve promising SR performance on real LF images.
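For concreteness, the following NumPy sketch shows one common way to construct such a kernel from its width parameter. The 21×21 kernel size and the delta-kernel convention for a zero width are our assumptions for illustration, not values taken from this article.

```python
import numpy as np

def isotropic_gaussian_kernel(width: float, size: int = 21) -> np.ndarray:
    """Build a normalized isotropic Gaussian kernel with standard deviation `width`.

    A width of 0 is treated as a delta kernel (no blur), which matches the
    bicubic-only case of the degradation model.
    """
    if width <= 0:
        kernel = np.zeros((size, size))
        kernel[size // 2, size // 2] = 1.0
        return kernel
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * width ** 2))
    return kernel / kernel.sum()
```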
2) Noise:
Real-world LF images (especially those captured by Lytro cameras) generally contain considerable noise. Directly super-resolving noisy LF images without performing noise reduction can result in visually unpleasant artifacts (see Section V-D2). In this article, we consider simple channel-independent additive white Gaussian noise in our degradation process. Each element in the noise tensor is independently sampled from a zero-mean Gaussian distribution, whose standard deviation (i.e., the noise level) controls the noise intensity.
3) Downsampling:
We adopt the widely used bicubic downsampling approach in our method. In this way, our degradation model reduces to the standard bicubic downsampling degradation when the kernel width and noise level equal zero. Note that, unlike the blur kernel and noise level, which can vary in the training phase, the downsampling approach is assumed to be fixed.
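To make the full degradation pipeline concrete, the snippet below sketches how one SAI of a synthetically degraded LR LF could be produced from its HR counterpart according to (10). The use of OpenCV for blurring and bicubic downsampling, the kernel size, and the intensity range are our assumptions for illustration.

```python
import cv2
import numpy as np

def degrade_sai(hr: np.ndarray, kernel_width: float, noise_level: float,
                scale: int = 4) -> np.ndarray:
    """Apply (10) to one SAI: Gaussian blur -> bicubic downsampling -> AWGN.

    `hr` is a float32 image in [0, 255]; `noise_level` is the standard
    deviation of the additive white Gaussian noise on the same scale.
    """
    # Isotropic Gaussian blur (a zero kernel width means no blur).
    if kernel_width > 0:
        hr = cv2.GaussianBlur(hr, (21, 21), kernel_width)
    # Bicubic downsampling by the upscaling factor `scale`.
    h, w = hr.shape[:2]
    lr = cv2.resize(hr, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    # Channel-independent additive white Gaussian noise.
    lr = lr + np.random.normal(0.0, noise_level, lr.shape)
    return np.clip(lr, 0.0, 255.0).astype(np.float32)
```

During training, the kernel width and noise level would be randomly sampled within predefined ranges so that the network is exposed to the whole degradation space.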
C. Comparison to Existing Works
1) Compared to Existing LF Image SR Methods:
Compared to existing LF image SR methods [19], [20], [21], [22], [23], [27], [28], [29] which use the bicubic downsampling approach to produce LR LF images, our method adopts a more practical degradation model [i.e., (10)] since the blur kernel and noise level in our model can be adjusted in the training phase to enlarge the degradation space. It is shown in Section V-B2 that our LF-DMnet trained with this degradation model can achieve promising SR performance on real LF images, which demonstrates that our proposed degradation model can well cover the real-world degradation of LF images.
2) Compared to More Complex Synthetic Degradation:
It is also worth noting that several recent works on single image SR [62], [63] designed very complex degradation models to train deep networks for real-world SR. In these methods, various kinds of blur, noise, and downsampling schemes were considered, and the order of these degradation elements (also including JPEG compression) was randomly shuffled to cover as much real-world degradation as possible. Although these methods [62], [63] achieve favorable visual performance on real-world images, we do not design such a complex degradation model in this article for the following three reasons. First, single images are generally captured by various cameras and transmitted multiple times over the Internet, and thus go through complex and high-order degradation [63]. In contrast, LF images are captured by a few kinds of imaging devices (e.g., Lytro or RayTrix) and saved in specific file formats that do not go through JPEG compression. Consequently, the degradation space of LF images is smaller than that of single images. Second, abundant high-quality HR images and diverse scenarios are required to train a network to fit such complex degradation. The networks in [62] and [63] were trained on multiple large-scale single image datasets [64], [65], [66], [67] with thousands of high-quality HR images. In contrast, publicly available high-quality LF datasets are limited in quantity, spatial resolution, and scene diversity. Consequently, it is difficult for an LF image SR network to learn such complex degradation from insufficient training samples. Third, as the first work to address LF image SR with multiple degradation, we aim to demonstrate the importance of degradation modulation to LF image SR and propose a simple yet effective solution to this problem. Consequently, we do not make our degradation model overly complex.
Network Architecture
A. Overview
Based on the degradation model in (10), we develop a degradation-modulating network (LF-DMnet) that can super-resolve LF images under various degradation. An overview of our LF-DMnet is shown in Fig. 3(a). Given an array of LR SAIs and their corresponding degradation (i.e., the kernel width and noise level of each view), our LF-DMnet sequentially performs kernel prior embedding (KPE), degradation-modulated feature extraction, and upsampling. Following [27], we build our network by cascading four residual groups. In each residual group, a degradation-modulating block (DM-Block) is designed to process features according to the degradation, and four disentangling blocks (Distg-Blocks) are used to achieve spatial-angular information incorporation. The final output of our network is an array of HR SAIs. Note that, since most LF image SR methods [16], [19], [20], [21], [22], [27], [29] use SAIs distributed in a square array as their inputs, in this article, we follow these methods and take a square array of SAIs as the input of our network.
B. Kernel Prior Embedding
Handling image SR with multiple degradation is more challenging than handling bicubic downsampling only, since the solution space of the former is much larger than that of the latter. In this case, incorporating kernel priors into the SR process can constrain the solution space to a manifold and thus reduce the ill-posedness of the SR process [56]. Since only the isotropic Gaussian kernel (with different kernel widths) is considered in our method, we design a KPE module to fully incorporate the kernel prior into the SR process.
In the KPE module, the isotropic Gaussian kernel is first reconstructed from the input kernel width, and is then encoded into a blur degradation representation that is fed to the subsequent DM-Blocks. In this way, the shape prior of the isotropic Gaussian kernel is explicitly embedded into the SR process, rather than letting the network infer it from the scalar kernel width alone (cf. Model 4 in Section V-C).
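A minimal PyTorch sketch of this idea is given below, assuming a 21×21 kernel, a two-layer encoder, and a 64-D representation; the exact architecture of the KPE module may differ, and these sizes are illustrative only.

```python
import torch
import torch.nn as nn

class KernelPriorEmbedding(nn.Module):
    """Reconstruct the isotropic Gaussian kernel from its width, then encode it."""

    def __init__(self, kernel_size: int = 21, embed_dim: int = 64):
        super().__init__()
        ax = torch.arange(kernel_size) - kernel_size // 2
        yy, xx = torch.meshgrid(ax, ax, indexing="ij")
        self.register_buffer("dist_sq", (xx ** 2 + yy ** 2).float())
        self.encoder = nn.Sequential(
            nn.Linear(kernel_size ** 2, embed_dim),
            nn.LeakyReLU(0.1),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, width: torch.Tensor) -> torch.Tensor:
        # width: (B,) blur kernel widths, one per LF in the batch.
        sigma = width.clamp(min=1e-6).view(-1, 1, 1)
        kernel = torch.exp(-self.dist_sq / (2.0 * sigma ** 2))
        kernel = kernel / kernel.sum(dim=(-2, -1), keepdim=True)
        # Flatten the reconstructed kernel and map it to a blur representation.
        return self.encoder(kernel.flatten(1))
```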
C. Degradation-Modulating Block
The DM-Block is designed to process image features based on the given degradation. A simple and straightforward scheme to achieve this goal is to concatenate the degradation representation with the image features and fuse them via convolutions [49], [50]. However, as demonstrated in several recent works [31], [52], directly convolving image features with degradation representations can cause interference since there is a domain gap between these two kinds of representations. Motivated by the fact that images with different degradation are generated by convolving the original high-quality image with isotropic Gaussian kernels of different widths, in this article, we design a DM-Conv whose kernels are dynamically generated according to the input degradation representation.
Specifically, in each DM-Block, the degradation representation is fed to a kernel-generation branch that produces depth-wise convolution kernels, which are then applied to the image features of each view. In addition, a degradation-modulating channel attention (DM-CA) layer generates channel-wise weights from the degradation representation to rescale the modulated features (see Section V-C4).
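The sketch below illustrates this design under our own assumptions about layer sizes: a small generator maps the degradation representation to per-channel depth-wise 3×3 kernels that filter the view features, and a DM-CA branch produces channel-wise weights. The article's actual DM-Block may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMConv(nn.Module):
    """Depth-wise convolution with kernels generated from the degradation."""

    def __init__(self, channels: int = 64, ksize: int = 3, deg_dim: int = 64):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.kernel_gen = nn.Linear(deg_dim, channels * ksize * ksize)
        self.ca_gen = nn.Sequential(nn.Linear(deg_dim, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, deg: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) view features; deg: (B, deg_dim) representation.
        b, c, h, w = feat.shape
        kernels = self.kernel_gen(deg).view(b * c, 1, self.ksize, self.ksize)
        # Fold the batch into groups so each sample gets its own kernels.
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=self.ksize // 2, groups=b * c)
        out = out.view(b, c, h, w)
        # Degradation-modulating channel attention (DM-CA).
        return out * self.ca_gen(deg).view(b, c, 1, 1)
```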
D. Disentangling Block
Although the proposed DM-Block can handle input images with various degradation, it processes the image features of different views separately without considering the inter-view correlation. Since information both within a single view and among different views is beneficial to LF image SR, in this article, we modify the Distg-Block [27] to incorporate multidimensional information for LF image SR.
Different from the Distg-Block in [27], where a series of specifically designed convolutions (i.e., spatial, angular, and epipolar feature extractors) are applied to a single macropixel image (MacPI) feature, in this article, we organize LF features into different shapes and apply plain convolutions to the reshaped features. Our modified approach is equivalent to the original design but is simpler and more generic. Specifically, considering both batch and channel dimensions, the input feature of our Distg-Block can be denoted by a 6-D tensor of size $B \times C \times U \times V \times H \times W$. For spatial feature extraction, the angular dimensions are merged into the batch dimension so that plain 2-D convolutions operate within each SAI; for angular feature extraction, the spatial dimensions are merged into the batch dimension so that the convolutions operate across views; and for epipolar feature extraction, analogous reshapings are performed along the horizontal and vertical epipolar planes.
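The following sketch illustrates this reshape-then-convolve scheme using einops-style rearranges; the channel and resolution values in the usage example are our assumptions.

```python
import torch
import torch.nn as nn
from einops import rearrange

def spatial_conv(feat: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """Apply a plain 2-D convolution within each SAI (spatial branch)."""
    b, c, u, v, h, w = feat.shape
    x = rearrange(feat, 'b c u v h w -> (b u v) c h w')
    return rearrange(conv(x), '(b u v) c h w -> b c u v h w', b=b, u=u, v=v)

def angular_conv(feat: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """Apply a plain 2-D convolution across views at each spatial location."""
    b, c, u, v, h, w = feat.shape
    x = rearrange(feat, 'b c u v h w -> (b h w) c u v')
    return rearrange(conv(x), '(b h w) c u v -> b c u v h w', b=b, h=h, w=w)

# Usage: a 5x5 LF feature with 64 channels on 32x32 spatial patches.
feat = torch.randn(1, 64, 5, 5, 32, 32)
feat = spatial_conv(feat, nn.Conv2d(64, 64, 3, padding=1))
feat = angular_conv(feat, nn.Conv2d(64, 64, 3, padding=1))
```

The EPI branches follow the same pattern with `(b u h) c v w` and `(b v w) c u h` reshapings, so that the convolutions operate on the horizontal and vertical epipolar planes, respectively.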
By adopting Distg-Blocks, our method can incorporate the beneficial spatial and angular information from the input LF to achieve state-of-the-art SR performance. The effectiveness of the Distg-Block for multidegraded LF image SR is validated in Section V-C.
E. Discussion on the Nonblind SR Setting
Recent single image SR methods [31], [52], [56], [57] generally adopt the blind SR setting, i.e., the groundtruth degradation is unknown to the SR network. Compared to nonblind SR methods [49], [50], [51], [58], which also require the degradation as input, blind SR is more practical since the real-world degradation is generally difficult to obtain.
However, in this article, we adopt the nonblind SR setting as in [49] and take both the degraded LF images and their degradation (blur kernel width and noise level) as inputs of our network. The reasons are threefold. First, performing nonblind SR helps us to better investigate the impact of the input degradation on the SR performance, which has not been studied in LF image SR. Since the kernel width and noise level are independently fed to our network, performing nonblind SR allows us to decouple different degradation elements and investigate their influence separately, as demonstrated in Section V-D. Second, performing nonblind SR helps us to explore the upper bound of blind SR, because the groundtruth degradation information can be used as an accurate prior in nonblind SR. As the first work to achieve LF image SR with multiple degradation, one of the major contributions of this article is to break the limitation of a single fixed degradation and show the great potential and practical value of multidegraded LF image SR. To this end, nonblind SR is purer and more suitable than blind SR. Third, since the proposed degradation model has only two undetermined coefficients, we can easily find a proper input degradation by observing the super-resolved images and correcting the input degradation, or by adopting a grid search strategy [49] to traverse kernel widths and noise levels in a reasonable range, as described in Section V-D2.
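As a sketch of the grid search strategy mentioned above, one could traverse candidate kernel widths and noise levels and collect the SR outputs for visual inspection. The candidate ranges and the `lf_dmnet(lr_lf, width, noise)` call here are hypothetical.

```python
import numpy as np

def grid_search_sr(lf_dmnet, lr_lf,
                   widths=np.arange(0.0, 4.5, 0.5),
                   noise_levels=range(0, 75, 15)):
    """Traverse candidate degradations and collect SR results for inspection.

    `lf_dmnet` is a hypothetical callable that super-resolves `lr_lf` given
    an input kernel width and noise level.
    """
    results = {}
    for width in widths:
        for noise in noise_levels:
            results[(float(width), noise)] = lf_dmnet(lr_lf, width, noise)
    # The best (width, noise) pair is then picked by visually comparing the
    # super-resolved LFs, as done in Section V-D2.
    return results
```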
Experiments
In this section, we first introduce the datasets and implementation details, then compare our network to several state-of-the-art SR methods. Finally, we conduct ablation studies to investigate our design choices and further analyze the impact of the input kernel widths and noise levels.
A. Datasets and Implementation Details
Our method was trained and validated on synthetically degraded LFs generated according to (10), and further tested on real LFs captured by Lytro Illum and Raytrix cameras. For training and validation, three public LF datasets, i.e., HCInew [68], HCIold [69], and STFgantry [70], were adopted. The division of the training and validation sets was kept identical to that in [22], [27], [28], [29]. To test the generalization capability of our method to real-world degradation, three public LF datasets (i.e., EPFL [32], INRIA [71], and STFlytro [33]) developed with Lytro cameras and a dataset [72] developed with a Raytrix camera were used as our test sets. In total, 39, 8, and 26 scenes were used for training, validation, and testing, respectively.
The LFs in the HCInew [68], HCIold [69], STFgantry [70], EPFL [32], INRIA [71], and STFlytro [33] datasets have an angular resolution of $9\times 9$.
Our network was trained using the $L_1$ loss between the super-resolved SAIs and their groundtruth HR counterparts, and was optimized using the Adam method.
Following [31], [49], [52], we used PSNR and SSIM calculated on the RGB channels as quantitative metrics for validation. To obtain the metric score (e.g., PSNR) for a dataset with $M$ scenes, we first calculated the score of each scene by averaging the scores of all its SAIs, and then obtained the score of the dataset by averaging the scores of all $M$ scenes.
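In other words, the dataset score is a two-level mean, as the short sketch below illustrates (the function name and input layout are illustrative):

```python
import numpy as np

def dataset_score(scores_per_scene):
    """Average per-SAI scores within each scene, then across scenes.

    `scores_per_scene[m]` holds the PSNR (or SSIM) of every SAI of the
    m-th scene, e.g., a U*V-element array for a UxV LF.
    """
    scene_means = [float(np.mean(s)) for s in scores_per_scene]
    return float(np.mean(scene_means))
```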
B. Comparisons With State-of-the-Art Methods
In this section, we compare our method to the following state-of-the-art SR methods.
DistgSSR [27] and LFT [29]: Two top-performing LF image SR methods developed on the bicubic downsampling degradation.
SRMD [49]: A popular nonblind single image SR method developed on isotropic Gaussian blur and Gaussian noise degradation.
DASR [31]: A state-of-the-art blind single image SR method developed on anisotropic Gaussian blur and Gaussian noise degradation.
BSRGAN [62] and Real-ESRGAN [63]: Two recent real-world single image SR methods developed on the complex synthetic degradation.
Besides the aforementioned methods, we also include the bicubic upsampling method to produce baseline results.
1) Results on Synthetically Degraded LFs:
Table I shows the quantitative PSNR and SSIM results achieved by different methods under synthetic degradation with different blur and noise levels. It can be observed that DistgSSR and LFT produce the two highest PSNR and SSIM results under the bicubic downsampling degradation (i.e., with both the kernel width and noise level equal to zero), but suffer a severe performance drop once blur or noise is introduced, since they are trained on the bicubic downsampling degradation only.
SRMD and DASR achieve much better performance than DistgSSR and LFT on blurry and noisy scenes since these two methods are designed for multidegraded image SR. Note that, SRMD benefits from the groundtruth degradation provided as input and thus slightly outperforms DASR. It can also be observed that the PSNR and SSIM values produced by BSRNet and Real-ESRNet are lower than those of SRMD and DASR. That is because the degradation space of BSRNet and Real-ESRNet is much larger, so the capability of these two methods in handling a specific degradation is weaker. It is worth noting that these single image SR methods only use the spatial context within single views and overlook the correlations among different views, resulting in inferior SR performance and angular inconsistency (see Section V-B3).
Compared to these state-of-the-art single and LF image SR methods, our LF-DMnet can simultaneously incorporate the complementary angular information and adapt to different degradation, and thus achieves the best PSNR and SSIM results on both in-distribution degradation and out-of-distribution (e.g., kernel width = 4.5 or noise level = 90) degradation except for the noise-free bicubic downsampling one. The benefits of angular information and degradation adaption are further analyzed in Section V-C. Fig. 5 shows the visual results produced by different methods with blur kernel width and noise level being set to 1.5 and 15, respectively. It can be observed that our LF-DMnet can recover faithful details from the blurry and noisy input LFs.
Visual results achieved by different methods on synthetically degraded LFs (kernel width = 1.5, noise level = 15).
2) Results on Real LFs:
We test the practical value of different SR methods by directly applying them to LFs captured by Lytro and Raytrix cameras. Since the groundtruth HR images of the input LFs are unavailable, we compare the visual results produced by different methods in Figs. 6 and 7. It can be observed that the image quality of the input LFs is low, since the bicubically upsampled images are blurry and noisy. DistgSSR and LFT amplify the input noise and produce results with artifacts (see Fig. 6) or blurred details (see Fig. 7). This demonstrates that methods developed on the fixed bicubic downsampling degradation cannot handle real-world degradation and thus have limited practical value.
Visual results achieved by different methods on real LFs captured by Lytro Illum cameras for
Visual results achieved by different methods on real LFs captured by a Raytrix camera for
Although SRMD, DASR, BSRGAN, and Real-ESRGAN are specifically designed to handle image SR with multiple degradation, these methods do not consider the inter-view correlation and ignore the beneficial angular information. Consequently, these single image SR methods suffer from residual noise (e.g., see the results of DASR in Figs. 6 and 7), oversmoothing (e.g., see the results of SRMD in Figs. 6 and 7), and angular inconsistency (e.g., see the results of BSRGAN and Real-ESRGAN in Fig. 7).
Compared to existing methods, our method achieves the best SR performance on real LFs, i.e., the results produced by our method have finer details (e.g., the words and characters in scene general_11) and fewer artifacts. This demonstrates that our network, trained on the proposed degradation model, can effectively handle the real LF image SR problem. Readers are referred to the videos4 to view more visual SR results on real LFs.
3) Angular Consistency:
Since LF image SR methods are required to preserve the LF parallax structure and generate angular-consistent HR LF images, we evaluate the angular consistency of different SR methods by visualizing their EPI slices. As shown below the zoom-in regions in Figs. 5–7, our LF-DMnet can generate more straight and clear line patterns than other SR methods on both synthetic and real-world degradation, which demonstrates that the LF parallax structure is well preserved by our method. Readers can refer to this video5 for a visual comparison of angular consistency.
4) Efficiency:
We compare our LF-DMnet to existing SR methods in terms of the number of parameters, FLOPs, and running time. As shown in Table II, our LF-DMnet has a moderate model size that is slightly larger than DistgSSR due to the additional KPE branch and the DM-Blocks. Note that, these additional 0.27 M parameters only result in a 0.52 G increase in FLOPs and a 0.002 s increase in running time. Compared to DASR, BSRNet (i.e., BSRGAN), and Real-ESRNet (i.e., Real-ESRGAN), our method has a significantly smaller model size, lower FLOPs, and shorter running time. These results demonstrate the efficiency of our method.
C. Ablation Study
In this section, we investigate the effectiveness of our proposed modules and design choices by comparing our LF-DMnet with the following variants.
Model 1: We introduce a baseline model by removing the DM-Block and the angular and EPI branches in the Distg-Block. Consequently, this variant is equivalent to a plain single image SR network that neither performs degradation modulation nor incorporates angular information. Note that, we increase the number of convolution layers in this variant to make its model size not smaller than our LF-DMnet.
Model 2: We investigate the effectiveness of our DM-Conv by replacing it with a depth-wise $3\times 3$ convolution and a vanilla $3\times 3$ convolution. The Distg-Block is maintained in this variant to incorporate angular information. Note that, the KPE module is also removed since vanilla convolutions do not take degradation as their input. This variant can be considered as an LF image SR method without degradation modulation (e.g., DistgSSR) retrained on our proposed degradation model.
Model 3: In this variant, we remove the angular and EPI branches in the Distg-Block and adopt the same strategy as in Model 1 to make the model size of this variant not smaller than our LF-DMnet. Since this model only incorporates intraview information to achieve degradation-modulated SR, it can be considered as a nonblind single image SR method, which allows us to validate the benefits of angular information to real-world LF image SR.
Model 4: We modify the KPE module in this variant to investigate the effectiveness of KPE. Specifically, we do not perform isotropic Gaussian kernel reconstruction but directly feed the blur kernel width to a five-layer MLP to generate the blur degradation representation. Consequently, the isotropic Gaussian kernel prior cannot be incorporated by this variant.
Model 5: In this variant, we remove the degradation-modulating channel attention layer (i.e., DM-CA) from the DM-Block to investigate the benefits of channel-wise degradation modulation.
1) Degradation-Modulating Convolution:
As the core component of our LF-DMnet, DM-Conv adapts image features to the given degradation and thus enhances the capability to handle different degradation. As shown in Table III, without DM-Conv, Model 2 suffers a 1.61 dB decrease in average PSNR as compared to LF-DMnet. This is because different degradation has different spatial characteristics (as analyzed in Section IV-C) and cannot be well handled via fixed convolution kernels. In contrast, our DM-Conv dynamically generates convolution kernels conditioned on the input degradation to recover the degraded image features, and thus achieves higher PSNR values over a wide range of synthetic degradation. Moreover, we visualize the kernels of our DM-Convs (averaged along the channel dimension) with different input blur and noise levels. As shown in Fig. 8, all four DM-Convs learn different kernel patterns for different input degradation, and the kernel intensity also varies at different network stages. The above quantitative and visualization results demonstrate the effectiveness of our DM-Conv.
Kernel visualization of our DM-Convs with different input blur and noise levels. (a) DM-Conv 1. (b) DM-Conv 2. (c) DM-Conv 3. (d) DM-Conv 4.
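The visualization in Fig. 8 can be reproduced along these lines, assuming `kernels` holds the dynamically generated per-channel kernels of one DM-Conv with shape (C, k, k):

```python
import matplotlib.pyplot as plt

def show_mean_kernel(kernels, title):
    """Plot a DM-Conv kernel averaged along the channel dimension."""
    plt.imshow(kernels.mean(axis=0), cmap="viridis")
    plt.title(title)
    plt.colorbar()
    plt.show()
```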
2) Angular Information:
The major difference between our LF-DMnet and nonblind single image SR methods (e.g., SRMD) is the incorporation of angular information. As shown in Table III, when the angular information is not used (i.e., Model 3), the average PSNR value suffers a 2.25 dB drop. This performance gap is also consistent with the gap between SRMD and our method in Table I. This clearly demonstrates that the complementary inter-view correlation is crucial for real-world LF image SR.
3) Kernel Prior Embedding:
It can be observed in Table III that Model 4, which lacks KPE, suffers a 0.22 dB decrease in PSNR as compared to our LF-DMnet, and the PSNR drop is more significant on noise-free scenes. That is because, without KPE, our network has to search for the best degradation kernel to recover the degraded image features. Since we adopt the isotropic Gaussian kernel as the blur kernel for synthetic degradation, KPE helps our network to reduce the search space and thus facilitates learning more accurate kernel representations.
4) Degradation-Modulating Channel Attention:
As shown in Table III, when the degradation-modulating channel attention is removed, Model 5 suffers a 0.20 dB decrease in average PSNR as compared to LF-DMnet. This demonstrates the effectiveness of channel-wise degradation modulation. Since our DM-Conv can only adapt to different degradation in the spatial dimension, DM-CA serves as a complementary component that enhances the degradation adaptation capability along the channel dimension. It is also worth noting that our DM-CA only introduces a 0.01 M increase in model size, a 2 ms increase in running time, and a negligible increase in FLOPs. These results demonstrate the high efficiency of our model design.
D. Degradation Mismatch Analyses
In this section, we first analyze the performance variation of our method with mismatched input and groundtruth synthetic degradation. Then, we apply our LF-DMnet to real LF images and analyze its SR performance with various input blur kernel widths and noise levels.
1) Synthetic Degradation:
Since our LF-DMnet is a nonblind SR method, it takes the blur kernel width and noise level as inputs. In the aforementioned experiments with synthetic degradation, we directly used the groundtruth degradation as the input degradation of our network. To investigate the performance of our method when the input degradation mismatches the groundtruth one, we conducted the following experiments. First, we investigated the performance variation of our LF-DMnet with mismatched blur kernel widths by traversing the groundtruth kernel width $B_{\mathrm{gt}}$ and the input kernel width $B_{\mathrm{in}}$ under different noise levels. Then, we investigated the performance variation with mismatched noise levels in a similar manner. Finally, we studied the joint influence of simultaneously mismatched blurs and noise levels. The results are visualized in Fig. 9.
Visualization of the performance variation of our method with mismatched degradation. (a)–(d) PSNR values achieved with mismatched blur kernel widths under different noise levels. (e)–(h) PSNR values achieved with mismatched noise level under different blurs. (i)–(l) PSNR values achieved with simultaneously mismatched blurs and noise levels under four representative degradation settings (marked by white cross).
From Fig. 9, we can draw the following conclusions.
Best SR performance can be achieved when the input degradation matches the groundtruth one.
The performance variation caused by blur mismatch is more significant than that caused by noise mismatch.
When $B_{\mathrm{in}} \neq B_{\mathrm{gt}}$, $B_{\mathrm{in}} > B_{\mathrm{gt}}$ leads to much more significant performance degradation than $B_{\mathrm{in}} < B_{\mathrm{gt}}$.
As the noise level increases, the PSNR variation caused by the blur kernel mismatch is reduced.
2) Real-World Degradation:
To investigate the influence of the input kernel widths and noise levels on the SR performance under real-world degradation, we directly apply our LF-DMnet to the LFs captured by Lytro and Raytrix cameras, and traverse the input blur kernel width (from 0 to 3 with a step of 1) and the input noise level (from 0 to 60 with a step of 15). Since both the groundtruth HR images and their degradation are unavailable, we evaluate the performance of our method by visually comparing its SR results. Fig. 10 shows the SR results obtained with different combinations of input kernel widths and noise levels, from which we can draw the following conclusions.
A large input kernel width can enhance the local contrast and sharpen edges and textures, but an over-large kernel width introduces ringing artifacts into the resulting images.
A large input noise level can enhance the local smoothness and helps to alleviate artifacts, but an over-large input noise level makes the resulting images blurry.
Our LF-DMnet can achieve better SR performance on Lytro LFs by setting kernel width and noise level to 2 and 30, respectively, and can achieve better SR performance on Raytrix LFs by setting kernel width and noise level to 4 and 60, respectively.
Readers can further refer to our interactive online demo6 to view the influence of input degradation to the SR results.
Conclusion and Discussion
In this article, we achieved real-world LF image SR via degradation modulation. We developed an LF degradation model based on the camera imaging process, and proposed an LF-DMnet that can modulate degradation priors into the SR process. Experimental results show that our method can produce visually pleasant and angular-consistent SR results on real-world LF images. Through extensive ablation studies and model analyses, we validated the effectiveness of our designs and obtained a series of insightful observations.
It is worth noting that, although our LF-DMnet achieves significantly better performance than existing methods on real-world LF image SR, it is sensitive to the input degradation and requires accurate degradation estimation. When the input blur kernel widths and noise levels mismatch the real ones, our method produces images with artifacts or oversmoothing. Moreover, due to the nonblind setting of our method, applying it to a novel LF camera with unknown degradation requires first measuring the PSF and noise level of that camera, which is user-unfriendly and not practical enough. In the future, we will study the more challenging blind LF image SR problem and try to design a more practical method for real-world LF image SR. We believe that our LF-DMnet will serve as a fundamental work and inspire more researchers to focus on real-world LF image SR.