Introduction
Light field (LF) cameras can simultaneously record the spatial and angular information of a scene in a single snapshot, making them valuable for various applications such as depth estimation [1]–[4], view synthesis [5], 3D reconstruction [6], and virtual reality [7]. However, due to the inherent trade-off between spatial and angular resolution in LF cameras, the spatial resolution of LF images is limited, which constrains the accuracy of subsequent applications. Hence, LF image super-resolution (SR) is an important research topic, and numerous methods [8]–[21] have been developed to enhance the resolution of LF images, leading to significant advancements.
Diffusion models [22] are built upon Markov chains that progressively add noise to the data. By learning the inverse process of iterative denoising, diffusion models can transform latent variables drawn from simple distributions, such as Gaussians, into samples following complex data distributions. Consequently, diffusion models have achieved remarkable results in various image generation tasks, especially in 2D image SR [23]–[25]. Compared with 2D images, LF images contain two additional angular dimensions, which poses extra challenges when applying diffusion models to LF image SR. A straightforward approach is to apply a diffusion-based image SR model to each LF sub-aperture image (SAI). However, performing SAI SR independently ignores the relationships between different SAIs, leading to angular inconsistencies in the reconstructed high-resolution LF. Furthermore, the inherently iterative denoising process of diffusion models typically requires hundreds to thousands of steps to achieve satisfactory visual quality, significantly increasing computational complexity and inference time.
In general, the challenges of exploring diffusion models for LF image SR stem from two aspects: 1) how to guarantee angular consistency between different SAIs of the LF, and 2) how to accelerate the inference of the diffusion model for LF images while preserving high visual quality. To address these issues, we propose the first diffusion-based method for LF image SR, LFSRDiff, built upon the LF disentanglement mechanism and residual modeling. Specifically, we incorporate the disentanglement mechanism [12] into the diffusion model and propose a disentangled U-Net (Distg U-Net) for noise learning that preserves angular consistency. Further, we adopt residual modeling [23], [26], [27], in which the diffusion model learns the residual between the upsampled low-resolution (LR) image and the ground-truth high-resolution (HR) image. Residual modeling expedites model training and yields better results. Figure 1 illustrates the overall diagram of LFSRDiff.
In summary, this paper makes the following contributions: 1) We propose the first diffusion-based model for LF image SR, which maintains high-quality spatial appearance and angular consistency. 2) We integrate the LF disentanglement mechanism into the diffusion model and propose a Distg U-Net for noise learning in the reverse process. We also adopt residual learning, which speeds up model training and enables more efficient sampling than standard direct learning. 3) Extensive experiments on five datasets verify the superiority of LFSRDiff in terms of both visual quality and perceptual metrics.
Method
We adopt the two-plane LF parameterization model [28] to represent an LF image, which can be formulated as a 4D function $\mathcal{L}(u, v, h, w) \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ denote the angular dimensions and $H$ and $W$ denote the spatial dimensions. Our method takes the LR LF $\mathcal{L}_{LR} \in \mathbb{R}^{U \times V \times H \times W}$ as input and generates the SR LF $\mathcal{L}_{SR} \in \mathbb{R}^{U \times V \times \alpha H \times \alpha W}$ as output, where $\alpha$ is the SR scale factor. In the following, we introduce the residual conditional diffusion models, the LF disentanglement mechanism, and the network architecture of Distg U-Net in detail.
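For concreteness, the following is a minimal shape sketch of this representation in PyTorch. The helper names (`sai_to_macpi`, `macpi_to_sai`) and the example sizes are our own illustrative assumptions, not part of the LFSRDiff implementation; they merely show how the 4D LF relates to the macro-pixel image (MacPI) arrangement used in Sec. II-B.

```python
import torch

U = V = 5          # angular resolution (A = 5, an assumed example value)
H = W = 32         # spatial resolution of the LR input (assumed)
alpha = 4          # SR scale factor

lf_lr = torch.rand(U, V, H, W)               # L_LR in R^{U x V x H x W}
sr_shape = (U, V, alpha * H, alpha * W)      # target shape of L_SR

def sai_to_macpi(lf):
    """Rearrange the sub-aperture image (SAI) array into a macro-pixel image
    (MacPI): the A x A angular samples of each spatial location form a block."""
    u, v, h, w = lf.shape
    return lf.permute(2, 0, 3, 1).reshape(h * u, w * v)

def macpi_to_sai(macpi, u, v):
    """Inverse rearrangement from the MacPI back to the SAI array."""
    h, w = macpi.shape[0] // u, macpi.shape[1] // v
    return macpi.reshape(h, u, w, v).permute(1, 3, 0, 2)

macpi = sai_to_macpi(lf_lr)                  # shape (U*H, V*W)
assert torch.equal(macpi_to_sai(macpi, U, V), lf_lr)
```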
Fig. 1. The diagram of LFSRDiff. Given an LR LF image, LFSRDiff learns the candidate SR distribution and generates the residual between the upsampled LR image and the HR image.
A. Residual Conditional Diffusion Models
For conditional diffusion models, the condition $\mathcal{L}_{LR}$ is introduced into the denoising network $f_\theta$ to control the model output. Similar to [23], [25], [27], the training objective $L_{\text{direct}}(\theta)$ can be written as:
\begin{equation*}
L_{\text{direct}}(\theta) = \left\| \epsilon - f_\theta\!\left( \sqrt{\bar{\alpha}_t}\,\mathcal{L}_{HR}^{0} + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; \mathcal{L}_{LR} \right) \right\|_1 \tag{1}
\end{equation*}
Residual Modeling. Training a diffusion model to directly generate the final HR LF image is difficult to optimize. We therefore let the model learn the residual between the HR LF image and the upsampled LR LF image, and the training objective $L_{\text{res}}(\theta)$ becomes:
\begin{equation*}
L_{\text{res}}(\theta) = \left\| \epsilon - f_\theta\!\left( \sqrt{\bar{\alpha}_t}\left( \mathcal{L}_{HR}^{0} - \operatorname{up}\!\left(\mathcal{L}_{LR}\right) \right) + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; \mathcal{L}_{LR} \right) \right\|_1 \tag{2}
\end{equation*}
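A minimal sketch of this objective as a training step is given below. It assumes a PyTorch denoiser with signature `f_theta(x_t, t, lf_lr)`, bilinear upsampling for `up(·)`, and LF views folded into the channel dimension; these choices are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def residual_diffusion_loss(f_theta, lf_hr, lf_lr, alpha_bar, scale=4):
    """One training step of Eq. (2): f_theta predicts the noise added to the
    residual between the HR LF and the upsampled LR LF (a sketch under the
    assumptions stated above)."""
    b = lf_hr.shape[0]
    # Residual target x_0 = L_HR - up(L_LR); bilinear upsampling is assumed.
    lf_up = F.interpolate(lf_lr, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    x0 = lf_hr - lf_up
    # Sample a timestep t and Gaussian noise, then diffuse x_0 to x_t.
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=lf_hr.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bar.to(lf_hr.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # L1 loss between the true and predicted noise, conditioned on L_LR.
    return F.l1_loss(f_theta(x_t, t, lf_lr), eps)
```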
B. Disentanglement Mechanism in LF
Since LF images entangle spatial and angular information, the original U-Nets in existing diffusion models [23], [25], [27], which are designed for general images, cannot adeptly handle LF images. To address this limitation, we incorporate the LF disentanglement mechanism [12] into the network design of the diffusion model. Next, we introduce the LF disentanglement mechanism.
As shown in Fig. 3, the LF disentanglement mechanism [12] obtains spatial, angular, and epipolar plane image (EPI) features according to different combinations of LF image pixels. By setting different convolution kernel sizes, strides, and dilation rates, the corresponding features of the LF image can be extracted. Here, we employ three types of feature extractors, with the angular resolution denoted as A (U = V = A); a code sketch of these extractors follows the list below.
Spatial Feature Extractor (SFE) is a convolution with a kernel size of 3×3, a stride of 1, and a dilation of A.
Angular Feature Extractor (AFE) is a convolution with a kernel size of A×A, a stride of A, and a dilation of 1.
EPI Feature Extractors (EFEs) extract horizontal and vertical EPI features. For horizontal EPI (EPI-H) features, we design EFE-H as a convolution with a kernel size of 1×A², a vertical stride of 1, and a horizontal stride of A. EFE-V adopts a symmetric design for vertical EPI features.
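The sketch below instantiates these extractors as convolutions on a MacPI feature map of size (A·H)×(A·W), following our reading of the disentanglement mechanism in [12]; the channel size and the exact padding choices are our assumptions.

```python
import torch.nn as nn

A, C = 5, 32  # angular resolution and feature channels (assumed values)

# Spatial Feature Extractor: 3x3 conv with dilation A, so each tap falls on
# the same view of neighbouring macro-pixels (spatial context only).
sfe = nn.Conv2d(C, C, kernel_size=3, stride=1, padding=A, dilation=A)

# Angular Feature Extractor: A x A conv with stride A, covering exactly one
# macro-pixel per output location (angular context only).
afe = nn.Conv2d(C, C, kernel_size=A, stride=A)

# EPI Feature Extractors: 1 x A^2 (EFE-H) and A^2 x 1 (EFE-V) convs with a
# stride of A along the EPI direction and 1 along the other.
efe_h = nn.Conv2d(C, C, kernel_size=(1, A * A), stride=(1, A),
                  padding=(0, A * (A - 1) // 2))
efe_v = nn.Conv2d(C, C, kernel_size=(A * A, 1), stride=(A, 1),
                  padding=(A * (A - 1) // 2, 0))
```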
C. Distg U-Net
An overview of our Distg U-Net is shown in Fig. 2(a). According to Eq. 2, the inputs to the Distg U-Net $f_\theta$ are the timestep t, the LR LF image $\mathcal{L}_{LR}$, and the noisy residual between the HR LF image and the upsampled LR LF image at timestep t.
Firstly, we use the SFE designed in Sec. II-B to extract initial spatial features from the noisy input.
Distg-Block. The Distg-Block is the basic module of our Distg U-Net and is built upon the LF disentanglement mechanism described in Sec. II-B (as shown in Fig. 2(c)). Specifically, we employ an SFE, an AFE, an EFE-V, and an EFE-H to disentangle spatial, angular, and (vertical and horizontal) EPI features. The angular and EPI features are then upsampled and concatenated with the spatial features. Finally, another SFE fuses the concatenated features and outputs the fused result. Note that the Distg-Block in the LF Encoder contains a residual connection, whereas the DistgRes-Group does not, since its feature map is downsampled.
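A hedged sketch of a Distg-Block is shown below. It reuses the extractor convolutions defined earlier; the activation functions, upsampling mode, and fusion layout follow our reading of Fig. 2(c) and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class DistgBlock(nn.Module):
    """Sketch of a Distg-Block operating on MacPI features (B, C, A*H, A*W)."""
    def __init__(self, channels, ang_res, residual=True):
        super().__init__()
        A = ang_res
        self.residual = residual
        self.sfe = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, padding=A, dilation=A),
            nn.LeakyReLU(0.1, inplace=True))
        self.afe = nn.Sequential(
            nn.Conv2d(channels, channels, A, stride=A),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=A, mode='nearest'))       # back to MacPI size
        self.efe_h = nn.Sequential(
            nn.Conv2d(channels, channels, (1, A * A), stride=(1, A),
                      padding=(0, A * (A - 1) // 2)),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=(1, A), mode='nearest'))
        self.efe_v = nn.Sequential(
            nn.Conv2d(channels, channels, (A * A, 1), stride=(A, 1),
                      padding=(A * (A - 1) // 2, 0)),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=(A, 1), mode='nearest'))
        # Fuse the concatenated spatial, angular, and EPI features with an SFE.
        self.fuse = nn.Conv2d(4 * channels, channels, 3, 1,
                              padding=A, dilation=A)

    def forward(self, x):
        feats = torch.cat([self.sfe(x), self.afe(x),
                           self.efe_h(x), self.efe_v(x)], dim=1)
        out = self.fuse(feats)
        return x + out if self.residual else out
```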
LF Encoder. The LR LF feature is encoded by the LF Encoder and is added at each reverse step to guide the generation toward the corresponding HR LF image. In this paper, we choose EPIT [30] as the LF Encoder, since it is robust to disparity variations.
Experiments
A. Experimental Settings
Following previous methods [11], [13], [33]–[35], we use five mainstream LF image datasets, i.e., EPFL [36], HCINew [37], HCIold [38], INRIA [39], and STFgantry [40], for our LF image SR experiments. For a fair comparison with SRDiff [23], we follow the same diffusion settings: the number of timesteps T is set to 100 and a cosine noise schedule is used. We adopt a two-stage training strategy: we first pre-train the LF Encoder with an L1 loss for efficiency, and then fix the LF Encoder and train the Distg U-Net with the loss in Eq. 2. We use the Adam optimizer with a batch size of 4 and a learning rate of 2×10⁻⁴, which is halved every 100k iterations. We adopt the well-known distortion-based metric PSNR and the perceptual metric LPIPS [41]. Following [27], we also evaluate sampling averaging (SA) results.
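The noise-schedule and optimizer settings above can be sketched as follows; the cosine schedule is written in the common Nichol and Dhariwal formulation, which we assume here, and `distg_unet` is a placeholder module name.

```python
import math
import torch

T = 100  # diffusion timesteps, as stated above

def cosine_alpha_bar(T, s=0.008):
    """Cumulative alpha_bar_t under a cosine noise schedule (assumed formulation)."""
    t = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:].float()          # alpha_bar_1 ... alpha_bar_T

alpha_bar = cosine_alpha_bar(T)

# Optimizer setup as stated in the text: Adam, lr 2e-4, halved every 100k
# iterations (scheduler stepped once per iteration).
# optimizer = torch.optim.Adam(distg_unet.parameters(), lr=2e-4)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)
```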
B. Comparisons with State-of-the-Art Methods
We compare our method with 9 state-of-the-art SR methods, including 2 single image SR (SISR) methods [23], [31] and 7 LF image SR methods [10], [12], [13], [30], [32]–[34]. Table I presents a quantitative comparison among LF image SR methods. Our LFSRDiff achieves the best perceptual score (i.e., an LPIPS of 0.1392), nearly a 16% reduction compared to EPIT [30], and maintains a competitive distortion score (i.e., a PSNR of 32.15) on the five datasets for 4× SR. Moreover, our method with sampling averaging (Ours-SA) achieves the state-of-the-art distortion score (i.e., a PSNR of 32.42). Figure 4 shows the qualitative results achieved by different methods for 4× SR. As can be seen from the zoomed-in areas, the diffusion-based SISR method (i.e., SRDiff [23]) cannot reliably recover missing details, such as the Arabic numerals. Other LF image SR methods tend to obtain higher PSNR but produce blurry results, such as the lines in the scene ISO Chart. In contrast, our LFSRDiff learns the SR LF image distribution through iterative noise learning and achieves the best visual quality. An EPI contains patterns of oriented lines whose slopes reflect the disparity values and angular consistency. As shown in Fig. 4, the vertical EPI of the SISR method exhibits unclear lines, indicating that the SISR method does not consider the LF structure. LF image SR methods generate better EPIs than the SISR method. Compared with the other SR methods, the EPIs of our method maintain sharper and clearer lines with fewer artifacts, demonstrating that our method preserves the LF disparity structure and angular consistency well.
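To make the EPI visualization concrete, the following illustrative helpers (our own, not from the paper's code) slice EPIs from a 4D LF tensor of shape (U, V, H, W) as defined in Sec. II; index conventions vary across papers, so the vertical/horizontal assignment below is one common choice.

```python
def vertical_epi(lf, v_idx, w_idx):
    """Vertical EPI: fix the horizontal view index v and a column w,
    keeping all (u, h) samples -> a 2D slice of shape (U, H)."""
    return lf[:, v_idx, :, w_idx]

def horizontal_epi(lf, u_idx, h_idx):
    """Horizontal EPI: fix the vertical view index u and a row h,
    keeping all (v, w) samples -> a 2D slice of shape (V, W)."""
    return lf[u_idx, :, h_idx, :]
```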
Fig. 3. Illustration of spatial, angular, and EPI features based on a toy LF macro-pixel image (MacPI) [12] representation with U = V = 3 and H = W = 3. The same color represents the same angular-domain information, and the same letter represents the same spatial-domain information.
Fig. 4. Qualitative comparison of different SR methods for 4× SR. The super-resolved center-view images and vertical EPIs are shown. Best viewed by zooming in electronically.
C. Ablation Study
1) Distg U-Net Variants
We conduct experiments with different variants of Distg U-Net and compare them with the original U-Net [23]. The numbers of parameters of the different variants are kept approximately equal to ensure a fair comparison. As shown in Table II, using the disentanglement mechanism to extract only spatial features already yields better metrics than the original U-Net. Furthermore, by exploring different combinations of spatial, angular, and EPI features, we find that combining all three achieves the best results. This Distg U-Net variant (2.13M parameters) outperforms the original U-Net (2.91M parameters) by about 0.5 dB in PSNR, which demonstrates the effectiveness of the disentanglement mechanism. When the feature channel dimension is further increased, the results improve accordingly.
Fig. 5. Training curves for direct learning and residual modeling. The right column compares cropped regions of the scene Cards from STFgantry [40]. Residual modeling exhibits a more stable training curve and achieves better visual quality.
2) Residual Modeling
We compare the performance of residual modeling and direct learning. As shown in Fig. 5, direct learning exhibits an unstable training process and produces unrealistic, noisy outputs. In contrast, residual modeling achieves a higher PSNR (i.e., 27.757 vs. 23.147) and better visual quality with a more stable training process. Another benefit of residual modeling is the reduced computational cost of sampling. Due to the iterative nature of diffusion sampling, the U-Net is typically evaluated hundreds to thousands of times for each generated sample. In contrast, our method takes only 100 steps to produce realistic results, which is an effective way to reduce the computational burden.
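A hedged sketch of the reverse (sampling) process under residual modeling is shown below: a plain DDPM update over the 100 steps denoises the residual, which is finally added back to the upsampled LR LF. Variable names, the bilinear upsampling, and the use of the standard DDPM posterior update are our assumptions; the actual sampler may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_sr(f_theta, lf_lr, alpha_bar, scale=4):
    """Residual-modeling sampling sketch: L_SR = up(L_LR) + denoised residual."""
    alpha_bar = alpha_bar.to(lf_lr.device)
    alpha = torch.cat([alpha_bar[:1], alpha_bar[1:] / alpha_bar[:-1]])
    lf_up = F.interpolate(lf_lr, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    x = torch.randn_like(lf_up)                      # x_T ~ N(0, I)
    for t in reversed(range(alpha_bar.shape[0])):    # T = 100 steps
        t_b = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = f_theta(x, t_b, lf_lr)                 # predicted noise
        x = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        if t > 0:                                    # add posterior noise except at t = 0
            var = (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * (1 - alpha[t])
            x = x + var.sqrt() * torch.randn_like(x)
    return lf_up + x
```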
Conclusion
In this paper, we propose a novel diffusion-based method, LFSRDiff, for LF image SR. We introduce the Distg U-Net by integrating the LF disentanglement mechanism into the diffusion model, and leverage residual learning to accelerate model training and inference. Extensive experimental results consistently demonstrate the superiority of our method in generating more realistic results than existing LF image SR methods.