Introduction
Light field (LF) cameras can simultaneously record the spatial and angular information of a scene in a single snapshot, making them valuable for various applications such as depth estimation [1]–[4], view synthesis [5], 3D reconstruction [6], and virtual reality [7]. However, due to the inherent trade-off between spatial and angular resolution in LF cameras, the spatial resolution of LF images is limited, which constrains the accuracy of subsequent applications. Hence, LF image super-resolution (SR) is an important research topic, and numerous methods [8]–[21] have been developed to enhance the resolution of LF images, leading to significant advancements.
Diffusion models [22] are built upon Markov chains that progressively add noise to the data. By learning the inverse process of iterative denoising, diffusion models can transform latent variables drawn from simple distributions, such as Gaussians, into samples following complex data distributions. Consequently, diffusion models have achieved remarkable results in various image generation tasks, especially in 2D image SR [23]–[25]. Compared with 2D images, LF images contain two additional angular dimensions, which poses extra challenges when applying diffusion models to LF image SR. A straightforward approach is to apply a diffusion-based image SR model to each LF sub-aperture image (SAI). However, performing SAI SR independently ignores the relationships between different SAIs, leading to angular inconsistencies in the reconstructed high-resolution LF. Furthermore, the inherently iterative denoising process of diffusion models typically requires hundreds to thousands of steps to achieve satisfactory visual quality, significantly increasing computational complexity and inference time.
In general, the challenges of exploring diffusion models for LF image SR stem from two aspects: 1) how to guarantee angular consistency between different SAIs of the LF, and 2) how to accelerate the inference of the diffusion model for LF images while preserving high visual quality. To address these issues, we propose the first diffusion-based method for LF image SR, LFSRDiff, built upon the LF disentanglement mechanism and residual modeling. Specifically, we incorporate the disentanglement mechanism [12] into the diffusion model and propose a disentangled U-Net (Distg U-Net) for noise learning that preserves angular consistency. Further, we adopt residual modeling [23], [26], [27], in which the diffusion model learns the residual between the upsampled low-resolution (LR) image and the ground-truth high-resolution (HR) image. Residual modeling expedites model training and yields better results. Figure 1 illustrates the overall diagram of LFSRDiff.
In summary, this paper makes the following contributions: 1) We propose the first diffusion-based model for LF image SR, which maintains high-quality spatial appearance and angular consistency. 2) We integrate the LF disentanglement mechanism into the diffusion model and propose a Distg U-Net for noise learning in the reverse process. We also adopt residual learning, which speeds up model training and enables more efficient sampling than standard direct learning. 3) Extensive experiments on five datasets verify the superiority of LFSRDiff in terms of both visual quality and perceptual metrics.
Method
We adopt the two-plane LF parameterization model [28] to represent an LF image, which can be formulated as a 4D function $\mathcal{L}(u, v, h, w) \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ denote the angular dimensions and $H$ and $W$ denote the spatial dimensions. Our method takes the LR LF $\mathcal{L}_{LR} \in \mathbb{R}^{U \times V \times H \times W}$ as input and generates the SR LF $\mathcal{L}_{SR} \in \mathbb{R}^{U \times V \times \alpha H \times \alpha W}$ as output, where $\alpha$ is the SR scale factor. In the following, we introduce the residual conditional diffusion models, the LF disentanglement mechanism, and the network architecture of Distg U-Net in detail.
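For concreteness, the following is a minimal shape sketch of this representation in PyTorch. The helper names (`sai_to_macpi`, `macpi_to_sai`) and the example sizes are our own illustrative assumptions, not part of the LFSRDiff implementation; they merely show how the 4D LF relates to the macro-pixel image (MacPI) arrangement used in Sec. II-B.

```python
import torch

U = V = 5          # angular resolution (A = 5, an assumed example value)
H = W = 32         # spatial resolution of the LR input (assumed)
alpha = 4          # SR scale factor

lf_lr = torch.rand(U, V, H, W)               # L_LR in R^{U x V x H x W}
sr_shape = (U, V, alpha * H, alpha * W)      # target shape of L_SR

def sai_to_macpi(lf):
    """Rearrange the sub-aperture image (SAI) array into a macro-pixel image
    (MacPI): the A x A angular samples of each spatial location form a block."""
    u, v, h, w = lf.shape
    return lf.permute(2, 0, 3, 1).reshape(h * u, w * v)

def macpi_to_sai(macpi, u, v):
    """Inverse rearrangement from the MacPI back to the SAI array."""
    h, w = macpi.shape[0] // u, macpi.shape[1] // v
    return macpi.reshape(h, u, w, v).permute(1, 3, 0, 2)

macpi = sai_to_macpi(lf_lr)                  # shape (U*H, V*W)
assert torch.equal(macpi_to_sai(macpi, U, V), lf_lr)
```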
Fig. 1. The diagram of LFSRDiff. Given an LR LF image, LFSRDiff learns the candidate SR distribution and generates the residual between the upsampled LR image and the HR image.
A. Residual Conditional Diffusion Models
For conditional diffusion models, the condition $\mathcal{L}_{LR}$ is introduced into the denoising network $f_\theta$ to control the model output. Similar to [23], [25], [27], the training objective $L_{\text{direct}}(\theta)$ can be written as:
\begin{equation*}
L_{\text{direct}}(\theta) = \left\| \epsilon - f_\theta\!\left( \sqrt{\bar{\alpha}_t}\,\mathcal{L}_{HR}^{0} + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; \mathcal{L}_{LR} \right) \right\|_1 \tag{1}
\end{equation*}
Residual Modeling. Training a diffusion model to directly generate the final HR LF image is difficult to optimize. We therefore let the model learn the residual between the HR LF image and the upsampled LR LF image, and the training objective $L_{\text{res}}(\theta)$ becomes:
\begin{equation*}
L_{\text{res}}(\theta) = \left\| \epsilon - f_\theta\!\left( \sqrt{\bar{\alpha}_t}\left( \mathcal{L}_{HR}^{0} - \operatorname{up}\!\left(\mathcal{L}_{LR}\right) \right) + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; \mathcal{L}_{LR} \right) \right\|_1 \tag{2}
\end{equation*}
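A minimal sketch of this objective as a training step is given below. It assumes a PyTorch denoiser with signature `f_theta(x_t, t, lf_lr)`, bilinear upsampling for `up(·)`, and LF views folded into the channel dimension; these choices are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def residual_diffusion_loss(f_theta, lf_hr, lf_lr, alpha_bar, scale=4):
    """One training step of Eq. (2): f_theta predicts the noise added to the
    residual between the HR LF and the upsampled LR LF (a sketch under the
    assumptions stated above)."""
    b = lf_hr.shape[0]
    # Residual target x_0 = L_HR - up(L_LR); bilinear upsampling is assumed.
    lf_up = F.interpolate(lf_lr, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    x0 = lf_hr - lf_up
    # Sample a timestep t and Gaussian noise, then diffuse x_0 to x_t.
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=lf_hr.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bar.to(lf_hr.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # L1 loss between the true and predicted noise, conditioned on L_LR.
    return F.l1_loss(f_theta(x_t, t, lf_lr), eps)
```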
B. Disentanglement Mechanism in LF
Since LF images entangle spatial and angular information, the original U-Nets in existing diffusion models [23], [25], [27], which are designed for general images, cannot adeptly handle LF images. To address this limitation, we incorporate the LF disentanglement mechanism [12] into the network design of the diffusion model. Next, we introduce the LF disentanglement mechanism.
As shown in Fig. 3, the LF disentanglement mechanism [12] obtains spatial, angular, and epipolar plane image (EPI) features according to different combinations of LF image pixels. By setting different convolution kernel sizes, strides, and dilation rates, the corresponding features of the LF image can be extracted. Here, we employ three types of feature extractors, with the angular resolution denoted as A (U = V = A); a code sketch of these extractors follows the list below.
Spatial Feature Extractor (SFE) is a convolution with a kernel size of 3×3, a stride of 1, and a dilation of A.
Angular Feature Extractor (AFE) is a convolution with a kernel size of A×A, a stride of A, and a dilation of 1.
EPI Feature Extractors (EFEs) extract horizontal and vertical EPI features. For horizontal EPI (EPI-H) features, we design EFE-H as a convolution with a kernel size of 1×A², a vertical stride of 1, and a horizontal stride of A. EFE-V adopts a symmetric design for vertical EPI features.
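The sketch below instantiates these extractors as convolutions on a MacPI feature map of size (A·H)×(A·W), following our reading of the disentanglement mechanism in [12]; the channel size and the exact padding choices are our assumptions.

```python
import torch.nn as nn

A, C = 5, 32  # angular resolution and feature channels (assumed values)

# Spatial Feature Extractor: 3x3 conv with dilation A, so each tap falls on
# the same view of neighbouring macro-pixels (spatial context only).
sfe = nn.Conv2d(C, C, kernel_size=3, stride=1, padding=A, dilation=A)

# Angular Feature Extractor: A x A conv with stride A, covering exactly one
# macro-pixel per output location (angular context only).
afe = nn.Conv2d(C, C, kernel_size=A, stride=A)

# EPI Feature Extractors: 1 x A^2 (EFE-H) and A^2 x 1 (EFE-V) convs with a
# stride of A along the EPI direction and 1 along the other.
efe_h = nn.Conv2d(C, C, kernel_size=(1, A * A), stride=(1, A),
                  padding=(0, A * (A - 1) // 2))
efe_v = nn.Conv2d(C, C, kernel_size=(A * A, 1), stride=(A, 1),
                  padding=(A * (A - 1) // 2, 0))
```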
C. Distg U-Net
An overview of our Distg U-Net is shown in Fig. 2(a). According to Eq. 2, the inputs to the Distg U-Net $f_\theta$ are the timestep t, the LR LF image $\mathcal{L}_{LR}$, and the noisy residual between the HR LF image and the upsampled LR LF image at timestep t.
Firstly, we use the SFE designed in Sec. II-B to extract initial spatial features from the noisy input.
Distg-Block. The Distg-Block is the basic module of our Distg U-Net and is built upon the LF disentanglement mechanism described in Sec. II-B (as shown in Fig. 2(c)). Specifically, we employ an SFE, an AFE, an EFE-V, and an EFE-H to disentangle spatial, angular, and (vertical and horizontal) EPI features. The angular and EPI features are then upsampled and concatenated with the spatial features. Finally, another SFE fuses the concatenated features and outputs the fused result. Note that the Distg-Block in the LF Encoder contains a residual connection, whereas the DistgRes-Group does not, since its feature map is downsampled.
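A hedged sketch of a Distg-Block is shown below. It reuses the extractor convolutions defined earlier; the activation functions, upsampling mode, and fusion layout follow our reading of Fig. 2(c) and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class DistgBlock(nn.Module):
    """Sketch of a Distg-Block operating on MacPI features (B, C, A*H, A*W)."""
    def __init__(self, channels, ang_res, residual=True):
        super().__init__()
        A = ang_res
        self.residual = residual
        self.sfe = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, padding=A, dilation=A),
            nn.LeakyReLU(0.1, inplace=True))
        self.afe = nn.Sequential(
            nn.Conv2d(channels, channels, A, stride=A),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=A, mode='nearest'))       # back to MacPI size
        self.efe_h = nn.Sequential(
            nn.Conv2d(channels, channels, (1, A * A), stride=(1, A),
                      padding=(0, A * (A - 1) // 2)),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=(1, A), mode='nearest'))
        self.efe_v = nn.Sequential(
            nn.Conv2d(channels, channels, (A * A, 1), stride=(A, 1),
                      padding=(A * (A - 1) // 2, 0)),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Upsample(scale_factor=(A, 1), mode='nearest'))
        # Fuse the concatenated spatial, angular, and EPI features with an SFE.
        self.fuse = nn.Conv2d(4 * channels, channels, 3, 1,
                              padding=A, dilation=A)

    def forward(self, x):
        feats = torch.cat([self.sfe(x), self.afe(x),
                           self.efe_h(x), self.efe_v(x)], dim=1)
        out = self.fuse(feats)
        return x + out if self.residual else out
```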
LF Encoder. The LR LF feature is encoded by the LF Encoder and is added at each reverse step to guide the generation toward the corresponding HR LF image. In this paper, we choose EPIT [30] as the LF Encoder, since it is robust to disparity variations.
Experiments
A. Experimental Settings
Following previous methods [11], [13], [33]–[35], we use five mainstream LF image datasets, i.e., EPFL [36], HCINew [37], HCIold [38], INRIA [39], and STFgantry [40], for our LF image SR experiments. For a fair comparison with SRDiff [23], we follow the same diffusion settings: the number of timesteps T is set to 100 and a cosine noise schedule is used. We adopt a two-stage training strategy: we first pre-train the LF Encoder with an L1 loss for efficiency, and then fix the LF Encoder and train the Distg U-Net with the loss in Eq. 2. We use the Adam optimizer with a batch size of 4 and a learning rate of 2×10⁻⁴, which is halved every 100k iterations. We adopt the well-known distortion-based metric PSNR and the perceptual metric LPIPS [41]. Following [27], we also evaluate sampling averaging (SA) results.
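The noise-schedule and optimizer settings above can be sketched as follows; the cosine schedule is written in the common Nichol and Dhariwal formulation, which we assume here, and `distg_unet` is a placeholder module name.

```python
import math
import torch

T = 100  # diffusion timesteps, as stated above

def cosine_alpha_bar(T, s=0.008):
    """Cumulative alpha_bar_t under a cosine noise schedule (assumed formulation)."""
    t = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:].float()          # alpha_bar_1 ... alpha_bar_T

alpha_bar = cosine_alpha_bar(T)

# Optimizer setup as stated in the text: Adam, lr 2e-4, halved every 100k
# iterations (scheduler stepped once per iteration).
# optimizer = torch.optim.Adam(distg_unet.parameters(), lr=2e-4)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)
```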
B. Comparisons with State-of-the-Art Methods
We compare our method with 9 state-of-the-art SR methods, including 2 single image SR (SISR) methods [23], [31] and 7 LF image SR methods [10], [12], [13], [30], [32]–[34]. Table I presents a quantitative comparison among LF image SR methods. Our LFSRDiff achieves the best perceptual score (i.e., an LPIPS of 0.1392), nearly a 16% reduction compared to EPIT [30], and maintains a competitive distortion score (i.e., a PSNR of 32.15) on the five datasets for 4× SR. Moreover, our method with sampling averaging (Ours-SA) achieves the state-of-the-art distortion score (i.e., a PSNR of 32.42). Figure 4 shows the qualitative results achieved by different methods for 4× SR. As can be seen from the zoomed-in areas, the diffusion-based SISR method (i.e., SRDiff [23]) cannot reliably recover missing details, such as the Arabic numerals. Other LF image SR methods tend to obtain higher PSNR but produce blurry results, such as the lines in the scene ISO Chart. In contrast, our LFSRDiff learns the SR LF image distribution through iterative noise learning and achieves the best visual quality. An EPI contains patterns of oriented lines whose slopes reflect the disparity values and angular consistency. As shown in Fig. 4, the vertical EPI of the SISR method exhibits unclear lines, indicating that the SISR method does not consider the LF structure. LF image SR methods generate better EPIs than the SISR method. Compared with the other SR methods, the EPIs of our method maintain sharper and clearer lines with fewer artifacts, demonstrating that our method preserves the LF disparity structure and angular consistency well.
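To make the EPI visualization concrete, the following illustrative helpers (our own, not from the paper's code) slice EPIs from a 4D LF tensor of shape (U, V, H, W) as defined in Sec. II; index conventions vary across papers, so the vertical/horizontal assignment below is one common choice.

```python
def vertical_epi(lf, v_idx, w_idx):
    """Vertical EPI: fix the horizontal view index v and a column w,
    keeping all (u, h) samples -> a 2D slice of shape (U, H)."""
    return lf[:, v_idx, :, w_idx]

def horizontal_epi(lf, u_idx, h_idx):
    """Horizontal EPI: fix the vertical view index u and a row h,
    keeping all (v, w) samples -> a 2D slice of shape (V, W)."""
    return lf[u_idx, :, h_idx, :]
```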
Fig. 3. Illustration of spatial, angular, and EPI features based on a toy LF macro-pixel image (MacPI) [12] representation with U = V = 3 and H = W = 3. The same color represents the same angular-domain information, and the same letter represents the same spatial-domain information.
Fig. 4. Qualitative comparison of different SR methods for 4× SR. The super-resolved center-view images and vertical EPIs are shown. Best viewed by zooming in electronically.
C. Ablation Study
1) Distg U-Net Variants
We conduct experiments with different variants of Distg U-Net and compare them with the original U-Net [23]. The numbers of parameters of the different variants are kept approximately equal to ensure a fair comparison. As shown in Table II, using the disentanglement mechanism to extract only spatial features already yields better metrics than the original U-Net. Furthermore, by exploring different combinations of spatial, angular, and EPI features, we find that combining all three achieves the best results. This Distg U-Net variant (2.13M parameters) outperforms the original U-Net (2.91M parameters) by about 0.5 dB in PSNR, which demonstrates the effectiveness of the disentanglement mechanism. When the feature channel dimension is further increased, the results improve accordingly.
Fig. 5. Training curves for direct learning and residual modeling. The right column compares cropped regions of the scene Cards from STFgantry [40]. Residual modeling exhibits a more stable training curve and achieves better visual quality.
2) Residual Modeling
We compare the performance of residual modeling and direct learning. As shown in Fig. 5, direct learning exhibits an unstable training process and produces unrealistic, noisy outputs. In contrast, residual modeling achieves a higher PSNR (i.e., 27.757 vs. 23.147) and better visual quality with a more stable training process. Another benefit of residual modeling is the reduced computational cost of sampling. Due to the iterative nature of diffusion sampling, the U-Net is typically evaluated hundreds to thousands of times for each generated sample. In contrast, our method takes only 100 steps to produce realistic results, which is an effective way to reduce the computational burden.
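A hedged sketch of the reverse (sampling) process under residual modeling is shown below: a plain DDPM update over the 100 steps denoises the residual, which is finally added back to the upsampled LR LF. Variable names, the bilinear upsampling, and the use of the standard DDPM posterior update are our assumptions; the actual sampler may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_sr(f_theta, lf_lr, alpha_bar, scale=4):
    """Residual-modeling sampling sketch: L_SR = up(L_LR) + denoised residual."""
    alpha_bar = alpha_bar.to(lf_lr.device)
    alpha = torch.cat([alpha_bar[:1], alpha_bar[1:] / alpha_bar[:-1]])
    lf_up = F.interpolate(lf_lr, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    x = torch.randn_like(lf_up)                      # x_T ~ N(0, I)
    for t in reversed(range(alpha_bar.shape[0])):    # T = 100 steps
        t_b = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = f_theta(x, t_b, lf_lr)                 # predicted noise
        x = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        if t > 0:                                    # add posterior noise except at t = 0
            var = (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * (1 - alpha[t])
            x = x + var.sqrt() * torch.randn_like(x)
    return lf_up + x
```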
Conclusion
In this paper, we propose a novel diffusion-based method, LFSRDiff, for LF image SR. We introduce the Distg U-Net by integrating the LF disentanglement mechanism into the diffusion model, and leverage residual learning to accelerate model training and inference. Extensive experimental results consistently demonstrate the superiority of our method in generating more realistic results than existing LF image SR methods.