Introduction
A light field is a vector function that describes the amount of light flowing in every direction through every point in three-dimensional space. The space of all possible light rays is represented by the five-dimensional plenoptic function, with the magnitude of each ray given by its radiance. The dimensions of this function comprise three spatial coordinates that define a point in space and two angular coordinates that specify the direction from which light arrives at that point. This concept finds application in diverse areas, including refocusing [1] and depth estimation [2]. A microlens array-based light field camera captures both the intensity and direction of light reflected from a subject using a single sensor. This simplifies calibration and rectification, making light field capture more straightforward. However, it imposes a trade-off between spatial and angular resolution [1], [3].
Because of this trade-off in microlens-array cameras, light field super-resolution (SR) has emerged as a prominent area in computer vision. Light field SR can be categorized into angular and spatial SR. Angular SR generates dense sub-aperture images (SAIs) from sparse ones, whereas spatial SR enhances the resolution of the SAIs. Angular SR has been explored through correlation modeling [4], [5], [6] and even through generating light fields from a single image [7], [8]. For the spatial SR task, various approaches have been introduced, including CNN-based [4], [9], [10], [11], [12], [13], [14], transformer-based [15], [16], [17], [18], [19], and non-learning-based methods [20], [21]. In this study, we focus on spatial light field SR.
Deep learning-based SR models provide high performance but require large amounts of training data. However, acquiring ground-truth light field images is time-consuming and labor-intensive because they must be captured with multiple cameras arranged in horizontal and vertical arrays. To address this, we explore data augmentation (DA) to enhance model performance without additional training data. DA is recognized as an effective way to improve the performance of deep neural networks [22], [23]. Unfortunately, it has rarely been applied to light field SR.
In light field SR, CutMIB [24] is the first notable DA method. It is based on CutBlur [25], which was originally proposed for single-image SR. CutBlur cuts a patch from one resolution (low or high) and pastes it into the corresponding location of the other resolution. CutMIB extends this idea to light field SR by cutting patches from the same location in the SAIs, blending them, and pasting the blended patches into images of a different resolution. However, CutMIB does not consider the characteristics of the light field when extracting patches; it cuts patches at the same location across all SAIs without accounting for the pixel shift between different SAIs. In general, light field image processing exploits this pixel shift to combine similar pixel information from different SAIs, respecting the spatial-angular correlation [26]. CutMIB does not model such correlation, which limits its performance improvement. Another DA method for light field SR, CutDEM [27], has been proposed to address these limitations by incorporating depth estimation into patch extraction. However, because depth must be estimated for each SAI individually and the corresponding weights calculated, the computational complexity increases significantly.
To create a DA method specialized for light field SR, we propose a novel motion-aware data augmentation pipeline that takes into account the unique characteristics of light fields. The proposed DA method, CutMAA, incorporates motion information to respect the spatial-angular correlation of the light field. Instead of cutting patches at fixed, identical locations [24], [25], CutMAA aligns the pixels of all SAIs using the motion information computed between the central SAI and the others. It then cuts and pastes from the aligned SAIs to generate augmented images. Our DA method is inspired by recent light field image processing frameworks, which also utilize motion information to better capture the spatial-angular correlation between SAIs [28], [29]. By leveraging motion information in DA for light fields, our method better preserves and enhances the spatial-angular correlation, making it more suitable for light field-related tasks. This results in improved performance and robustness in light field SR scenarios (Figure 1). Our main contributions are listed below:
We propose a novel DA framework, CutMAA, to enhance the performance of light field SR networks. Our straightforward yet effective strategy is specifically tailored for light fields by incorporating motion-awareness.
Extensive experiments show that CutMAA significantly improves existing light field SR methods.
CutMAA attains the highest scores on most of the tested light field SR methods, demonstrating superior performance compared with the other DA methods evaluated.
Figure 1. Comparison of PSNR.
Related Work
A. Light Field Spatial Super-Resolution
Light field spatial SR has evolved over time. It initially relied on mathematical modeling [20], [36], followed by other non-learning-based methods such as 4D geometric approaches [21], [37], which leverage projection and optimization of the intrinsic 4D light field structure. More recently, the focus has shifted toward learning-based approaches employing CNNs and transformers.
The pioneering CNN-based method, LFCNN [38], extended SRCNN [39], originally designed for single images, by using CNNs to learn the correlations between SAIs within light fields. This integration of information from adjacent views has become foundational in light field spatial SR. Yeung et al. [40] introduced spatial-angular separable convolutions to model sub-pixel relationships in light field structures. Wang et al. [41] proposed a two-way recurrent network to capture spatial-angular correlations between views. Meng et al. [42] advanced the field with densely connected networks using 4D convolutions to explicitly learn spatial-angular correlations. The aforementioned methods feed only a subset of SAIs to the SR network, resulting in suboptimal integration of spatial-angular relationships within the light field. To address this, Jin et al. [12] proposed an all-to-one framework that combines information from all SAIs to reconstruct each individual SAI. Liu et al. [35] introduced a 3D convolution-based multi-shared context block to exploit relationships across all SAIs. Most recent light field spatial SR approaches are based on transformer models [15], [16], [17], [18], [19]. LFT [15], LF-DET [18], and M2MT-Net [19] perform spatial-angular encoding, for example through sub-sampled spatial modeling and multi-scale angular modeling. EPIT [16] leverages epipolar plane images (EPIs) to model spatial-angular correlations. DPT [17] proposed a transformer that reconstructs the light field as a sequence guided by a gradient map. Additionally, the HLFSR method [14] introduced a hybrid CNN that incorporates spatial, angular, and epipolar information.
In contrast to these previous studies, we focus on inflating the training dataset with a novel DA technique. With our proposed CutMAA, the performance of these light field SR models is improved without altering their architectural design or adding inference latency.
B. Data Augmentation
DA is a technique within machine learning and computer vision, enabling the expansion of the training dataset. The process involves applying diverse transformations to training samples that modify their visual characteristics while preserving their original labels and intrinsic information. The primary objective of DA is to make the model more robust and mitigate overfitting by introducing a wide variety of potential inputs.
DA has been primarily applied to high-level vision tasks, such as image classification and object detection. Various image transformation methods have been introduced, ranging from traditional geometric transformations such as horizontal flips, vertical flips, and rotations to advanced techniques such as Cutout [43], Mixup [23], CutMix [22], and Puzzle Mix [44]. Although low-level vision has fewer DA techniques than high-level vision, several methods have emerged. Timofte et al. [45] proposed seven techniques to improve single image SR performance, including a DA method. Basic DA operations such as rotation and flipping, applied across entire datasets, have yielded performance improvements, but they are employed mainly in traditional SR models [46], [47] and SRCNN [39]. Feng et al. [48] analyzed Mixup [23] for SR to reduce model overfitting. Modern DA methods for low-level vision tasks, such as CutBlur and mixture-of-augmentation [25], [49], are specifically designed to exploit the characteristics of SR. In particular, CutBlur exchanges patches at the same location between low-resolution (or low-quality) and high-resolution (high-quality) images, guiding the model on how and where to restore and thereby improving performance.
In light field SR tasks, CutMIB [24] is the first DA method. It extends CutBlur [25] to suit light field SR. Given that light field SR involves multiple SAIs, CutMIB incorporates a blending operation that integrates SAIs into its augmentation pipeline. Our proposed CutMAA is also based on CutBlur. However, unlike CutBlur and CutMIB, we design the DA operation to be better aligned with the nature of light fields by incorporating motion-awareness across SAIs.
C. Optical Flow
Optical flow can be categorized into sparse and dense approaches. Sparse optical flow methods [50] track the movement of specific features or points within an image, allowing them to capture essential motion information at reduced computational cost. By contrast, dense optical flow methods [51], [52], [53], [54], [55], [56], [57] estimate motion for every pixel in an image, yielding more accurate optical flow at the cost of more computational resources.
The Farneback algorithm [52] is one of the most widely used classical optical flow methods because it quickly approximates the motion between two consecutive images with polynomial expansion. FlowSF [54] expresses the motion of each pixel, derived from differences in pixel values, probabilistically without iterative optimization. DeepFlow [55] proposed a descriptor-matching algorithm that improves performance for large displacements using a deep-matching scheme. RLOF [53] is an iterative Lucas-Kanade method with an M-estimator framework that increases accuracy at motion boundaries and for pixels that appear and disappear. Dual TV-L1 [51], [57] solves the TV-L1 formulation using point-wise thresholding derived from the dual formulation of the TV energy. PCA-Flow [56] introduced PCA layers to increase accuracy at boundaries and extended the approach to a hierarchical model. These non-learning-based methods do not require additional information such as disparity and have the advantage of estimating optical flow from only two consecutive images. More recently, deep learning-based optical flow methods have been introduced and achieve state-of-the-art results. FlowNet [58] was the first deep learning approach, introducing a correlation layer for end-to-end learning and performing pixel-level localization. RAFT [59] extracts features for each pixel, builds a 4D correlation volume over all pairs of pixels, and iteratively refines the flow estimate; it forms the basis of many subsequent deep optical flow methods.
In our study, the proposed CutMAA is designed to be motion-aware to reflect the unique characteristics of light fields. To incorporate motion information into our DA pipeline, we adopt the pre-trained optical flow network RAFT [59] to capture correlations between SAIs.
Method
A. Problem Formulation
Let a low-resolution (LR) light field be denoted as $\mathcal{L}^{LR} = \{\mathcal{L}_{i}^{LR}\}_{i=1}^{K}$, where $K$ is the number of SAIs and each SAI $\mathcal{L}_{i}^{LR}$ has spatial resolution $W \times H$. The corresponding high-resolution (HR) light field is $\mathcal{L}^{HR} = \{\mathcal{L}_{i}^{HR}\}_{i=1}^{K}$, whose SAIs have spatial resolution $rW \times rH$ for an upscaling factor $r$.
The SR network learns a mapping function $f$ that takes $\mathcal{L}^{LR}$ as input and produces a super-resolved light field $f(\mathcal{L}^{LR})$ that approximates $\mathcal{L}^{HR}$.
B. Motion-Aware Data Augmentation
The proposed CutMAA generates new training samples through four processes: warping, cutting, blending, and pasting (Figure 2). The warping operation calculates the motion information between the central SAI and the others and then aligns all the pixels in each SAI using the computed motion. Note that warping is performed for both the LR and HR light fields. The cutting, blending, and pasting operations randomly extract patches, blend the patches across SAIs, and paste the blended patches back into the light field at the other resolution (HR patches into the LR light field and vice versa). We present the details of each procedure below.
Overall framework of CutMAA. The proposed method comprises four procedures: warping, cutting, blending, and pasting. Orange dashed rectangles with different patterns represent patches in various views (from multiple microlens cameras). The warping operation adjusts all the SAIs according to the nonuniform motion information from the center SAI, aligning the SAIs with each other. CutMAA then crops random patches from all SAIs with the cut operation and integrates and pastes these patches at different resolutions through the blend and paste procedures.
1) Warping
This process aligns the SAIs by warping them to match the center SAI. Without this alignment, subsequent augmentations may mix unaligned and irrelevant patches, potentially hindering performance improvement. To address this, we first compute the per-pixel motion vector $\overrightarrow{v_{ci}} = (u_{ci}, v_{ci})$ between the center SAI $\mathcal{L}_{c}^{LR}$ and every other SAI $\mathcal{L}_{i}^{LR}$ using optical flow.
After calculating the motion vector, each pixel in the SAI is warped to a new position obtained by adding the motion vector to its coordinates, so that pixels with similar content are aligned across the SAIs. Because $u$ and $v$ are typically floating-point values, pixels at non-integer positions are interpolated. If the warped position falls outside the image, the original pixel value is kept. Formally, the warped SAI is given by \begin{align*} \mathcal{L}_{i}'^{LR}(x, y) = \begin{cases} \mathcal{L}_{i}^{LR}(x + u_{ci}, y + v_{ci}) & \text{if}~ 0 \leq x + u_{ci} \leq W ~\text{and}~ 0 \leq y + v_{ci} \leq H, \\ \mathcal{L}_{i}^{LR}(x, y) & \text{otherwise.} \end{cases} \tag{1}\end{align*}
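As a concrete illustration, the sketch below performs the warping of Eq. (1) with off-the-shelf tools: a per-pixel flow from the center SAI to another SAI is estimated (here with the classical Farneback method as a lightweight stand-in for RAFT) and used to backward-warp that SAI, keeping the original value wherever the sampling position leaves the image. The function names and flow parameters are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def warp_sai(sai, flow):
    """Warp one SAI toward the center view following Eq. (1).

    sai:  (H, W) or (H, W, C) float array, a single sub-aperture image.
    flow: (H, W, 2) per-pixel motion (u_ci, v_ci) from the center SAI to `sai`.
    Positions (x + u, y + v) outside the image keep the original pixel value,
    matching the 'otherwise' branch of Eq. (1).
    """
    h, w = sai.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)  # x + u_ci
    map_y = (grid_y + flow[..., 1]).astype(np.float32)  # y + v_ci

    # Bilinear interpolation handles the non-integer (sub-pixel) positions.
    warped = cv2.remap(sai, map_x, map_y, cv2.INTER_LINEAR)

    # Fall back to the original pixel where the sampling position is out of range.
    inside = (map_x >= 0) & (map_x <= w - 1) & (map_y >= 0) & (map_y <= h - 1)
    if sai.ndim == 3:
        inside = inside[..., None]
    return np.where(inside, warped, sai)

# Toy usage with a classical flow estimator as a stand-in for RAFT.
center = np.random.rand(64, 64).astype(np.float32)
view_i = np.random.rand(64, 64).astype(np.float32)
flow_ci = cv2.calcOpticalFlowFarneback(
    (center * 255).astype(np.uint8), (view_i * 255).astype(np.uint8),
    None, 0.5, 3, 15, 3, 5, 1.2, 0)
aligned_i = warp_sai(view_i, flow_ci)
```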
2) Cutting
This operation extracts patches from random regions, ensuring that patches are obtained from the same location across all warped SAIs. Let $\mathbf{M}$ denote a binary mask whose entries are one inside the randomly selected region and zero elsewhere. The patch extracted from the $i$-th warped SAI is \begin{equation*} p_{i}^{LR} = \mathbf{M} \odot \mathcal{L}_{i}'^{LR}, \quad i \in [1, K], \tag{2}\end{equation*} where $\odot$ denotes element-wise multiplication and $K$ is the number of SAIs.
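A minimal sketch of the cutting step is given below: a single random rectangular binary mask $\mathbf{M}$ is drawn once and applied to every warped SAI, as in Eq. (2). The patch size and helper names are assumptions for illustration.

```python
import numpy as np

def random_mask(h, w, patch_h, patch_w, rng=np.random):
    """Binary mask M with one randomly placed rectangle of ones."""
    top = rng.randint(0, h - patch_h + 1)
    left = rng.randint(0, w - patch_w + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + patch_h, left:left + patch_w] = 1.0
    return mask

def cut_patches(warped_sais, mask):
    """Apply the same mask to every warped SAI: p_i = M * L'_i (Eq. (2))."""
    return [mask * sai for sai in warped_sais]
```

The HR mask is assumed to be the LR mask scaled by the SR factor, so that the LR and HR patches cover corresponding regions.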
3) Blending
This process integrates the $K$ patches $\{p_{i}^{LR}\}_{i=1}^{K}$ into a single blended patch by averaging: \begin{equation*} P_{blend}^{LR} = \frac{1}{K}\sum_{i=1}^{K}p_{i}^{LR}. \tag{3}\end{equation*} The HR patches are blended in the same way to obtain $P_{blend}^{HR}$.
4) Pasting
Inspired by CutBlur [25], we paste the blended patches across resolutions: the blended HR patch, downscaled by the factor $r$ ($P_{blend}^{HR_{r}^{\downarrow}}$), is pasted into each LR SAI, and the blended LR patch, upscaled by the factor $r$ ($P_{blend}^{LR_{r}^{\uparrow}}$), is pasted into each HR SAI: \begin{align*} \hat{\mathcal{L}}_{i}^{LR} & = \mathbf{M} \odot P_{blend}^{HR_{r}^{\downarrow}} + (1 - \mathbf{M}) \odot \mathcal{L}_{i}^{LR}, \tag{4}\\ \hat{\mathcal{L}}_{i}^{HR} & = \mathbf{M} \odot P_{blend}^{LR_{r}^{\uparrow}} + (1 - \mathbf{M}) \odot \mathcal{L}_{i}^{HR}, \tag{5}\end{align*} where $\hat{\mathcal{L}}_{i}^{LR}$ and $\hat{\mathcal{L}}_{i}^{HR}$ are the augmented LR and HR SAIs used for training.
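The blending and pasting steps of Eqs. (3)-(5) can be sketched as follows. Bicubic resampling is assumed for the rescaling by the factor $r$ (the equations only specify a rescale), and the helper names are ours.

```python
import cv2
import numpy as np

def blend_patches(patches):
    """Average the K masked patches into one blended patch (Eq. (3))."""
    return np.mean(np.stack(patches, axis=0), axis=0)

def paste(blended_lr, blended_hr, lr_sais, hr_sais, mask_lr, mask_hr, r):
    """Cross-resolution paste following Eqs. (4) and (5).

    The blended HR patch is downscaled by r and pasted into each LR SAI,
    and the blended LR patch is upscaled by r and pasted into each HR SAI.
    """
    h, w = blended_lr.shape[:2]
    hr_down = cv2.resize(blended_hr, (w, h), interpolation=cv2.INTER_CUBIC)
    lr_up = cv2.resize(blended_lr, (w * r, h * r), interpolation=cv2.INTER_CUBIC)

    aug_lr = [mask_lr * hr_down + (1 - mask_lr) * sai for sai in lr_sais]
    aug_hr = [mask_hr * lr_up + (1 - mask_hr) * sai for sai in hr_sais]
    return aug_lr, aug_hr
```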
5) Discussion
The most significant difference between CutMIB and ours is how the blended patch is obtained. In CutMIB, patches are cut from the same spatial location of every SAI without any alignment, so the blended patch mixes pixels that are slightly shifted across views, which blurs the spatial-angular correlation. In CutMAA, the SAIs are first warped toward the center view, so the blended patch integrates corresponding pixels and preserves the spatial-angular correlation of the light field (Figure 3).
Difference between CutMIB [24] and the proposed CutMAA. CutMIB extracts patches from multiple SAIs and blends these to make an integrated view for each SAI. However, it lacks consideration of the light field’s characteristics; since SAIs have slight pixel shifts, they need correction. CutMAA addresses this by incorporating motion-awareness into the augmentation pipeline. To this end, before extracting patches in the cutting operation, CutMAA performs a warping process to align all SAIs, ensuring accurate patch extraction and integration.
C. Motion Vector in Light Field
Given a low-resolution light field $\mathcal{L}^{LR}$, the motion vector between two of its SAIs, $\mathcal{L}_{i}^{LR}$ and $\mathcal{L}_{j}^{LR}$, is defined as \begin{equation*} \overrightarrow{v^{LR}} = (u_{ij}^{LR}, v_{ij}^{LR}). \tag{6}\end{equation*}
Likewise, for the high-resolution light field $\mathcal{L}^{HR}$, the motion vector between $\mathcal{L}_{i}^{HR}$ and $\mathcal{L}_{j}^{HR}$ is \begin{equation*} \overrightarrow{v^{HR}} = (u_{ij}^{HR}, v_{ij}^{HR}). \tag{7}\end{equation*}
To obtain precise motion information between SAIs, we utilize the optical flow of the light field. Optical flow describes the pattern of object motion in an image sequence, capturing the direction and magnitude of pixel displacements between successive frames. Let an image be denoted as $I(x, y, t)$, where $(x, y)$ is a pixel position and $t$ is time. Under the brightness constancy assumption, a pixel keeps its intensity as it moves, i.e., $I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)$, where $(\Delta x, \Delta y)$ is the displacement of the pixel between the two frames.
When extending optical flow to light fields, given that a light field is captured using parallel cameras or lens arrays viewing the same scene with different camera parameters, two SAIs (e.g., $\mathcal{L}_{i}$ and $\mathcal{L}_{j}$) can be treated as two consecutive frames of that scene. Optical flow can therefore be estimated directly between SAIs, yielding the per-pixel motion vectors defined in Eqs. (6) and (7).
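As an illustration, the sketch below estimates the motion vector of Eq. (6) between two SAIs with the pre-trained RAFT model shipped in torchvision; the API shown (available since torchvision 0.12) and the preprocessing are assumptions that may differ from the exact setup used in our experiments. RAFT expects 3-channel inputs with spatial sizes divisible by 8, so a Y-channel SAI can be replicated across channels and padded if necessary.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # converts to float and normalizes the inputs

@torch.no_grad()
def flow_between_sais(sai_i, sai_j):
    """Per-pixel motion (u_ij, v_ij) from SAI i to SAI j, as in Eq. (6).

    Inputs are (3, H, W) tensors in [0, 1]; the two SAIs are treated as two
    consecutive frames of the same scene. Returns a (2, H, W) flow tensor
    from RAFT's final refinement iteration.
    """
    img1, img2 = preprocess(sai_i.unsqueeze(0), sai_j.unsqueeze(0))
    flow_predictions = model(img1, img2)  # one flow estimate per RAFT iteration
    return flow_predictions[-1][0]
```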
1) Discussion
We predict motion between SAIs using optical flow because it has been widely utilized in light field-based studies [28], [29]. For example, optical flow is employed between consecutive SAIs to estimate depth, demonstrating good performance in terms of speed and accuracy [28]. It has also been used in light field SR tasks; Farrugia et al. [29] used optical flow to align SAIs relative to the central view, aiming to generalize the spatial and angular relationships across different datasets. This effectively aligns the SAIs, resulting in improved SR performance. Therefore, we focus on the suitability of optical flow for light fields, using it to create a motion-aware DA pipeline.
Other approaches, such as stereo matching, can be used to perform motion-aware alignment. However, the baseline between adjacent views in a light field is narrow, making it difficult to utilize stereo-matching methods effectively. For light fields captured with a Lytro camera, the disparity range between adjacent views is less than one pixel. Due to this narrow baseline, pixel shifts in the spatial domain are accompanied by interpolation with blurring, which degrades the performance of stereo matching [26].
Experimental Result
A. Implementation Details
CutMAA is implemented on top of the light field SR benchmark baseline [61], with additional preprocessing for DA integrated into the data-loading process. Other procedures follow the conventional light field SR frameworks: the light fields are converted to the YCbCr color space, and only the Y channel is used for training. We use all the light field datasets provided by the light field SR benchmark, totaling five datasets [30], [31], [32], [33], [34]. The angular size of all light fields is fixed at $5 \times 5$.
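As a concrete note on this preprocessing, the sketch below extracts the Y channel from an RGB SAI using the ITU-R BT.601 conversion that is conventional in SR benchmarks; whether the benchmark code [61] uses exactly these constants is an assumption, and the helper name is ours.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luminance (Y) channel of an RGB image with values in [0, 1].

    Uses the ITU-R BT.601 'studio swing' conversion commonly adopted in SR
    evaluation; only this channel is fed to the SR network during training.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0
    return y.astype(np.float32)
```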
B. Experimental Setting
To validate the performance of the proposed method, we use light field SR baselines including ATO [12], LF-InterNet [13], IINet [35], and DistgSSR [4]. All methods are trained with an L1 loss and the Adam optimizer. The learning rate and batch size are adjusted per method to reproduce the reported baseline performance. We utilize five light field datasets [30], [31], [32], [33], [34] from the light field SR benchmark [61]. The angular size is fixed at 5, and the scale factor is set to 4. As in CutMIB, the training data are cropped into patches.
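For clarity, a minimal sketch of the training step described above (L1 loss, Adam) is shown below; the model, data loader, learning rate, and the augment hook that applies CutMAA are placeholders rather than the benchmark's actual code.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device, augment=None):
    """One epoch of light field SR training with an L1 loss."""
    criterion = nn.L1Loss()
    model.train()
    for lr_lf, hr_lf in loader:                 # cropped LR/HR light field patches
        lr_lf, hr_lf = lr_lf.to(device), hr_lf.to(device)
        if augment is not None:                 # e.g., CutMAA applied on the fly
            lr_lf, hr_lf = augment(lr_lf, hr_lf)
        optimizer.zero_grad()
        loss = criterion(model(lr_lf), hr_lf)
        loss.backward()
        optimizer.step()

# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # lr is a placeholder
```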
C. Model Comparison
Table 1 presents the comparison results of no DA, CutMIB, and CutMAA across various light field SR baselines. Applying our method generally increases PSNR and yields the best SSIM scores on most datasets. Although some light field SR methods suffer a PSNR drop when CutMIB is applied, CutMAA improves the PSNR of all methods except LF-InterNet on the EPFL dataset and IINet on the STFgantry dataset. In those cases where PSNR does not increase, the SSIM remains stable, indicating that the structural integrity of the images is preserved. Moreover, CutMAA consistently achieves the highest SSIM scores across most light field SR methods and datasets, highlighting its robustness and effectiveness in enhancing both resolution and image quality in light field SR tasks.
Figure 4 presents the qualitative results for cases without DA, with CutMIB, and with CutMAA. The qualitative results are presented by comparing the ground truth and difference images. Similar to the quantitative results, we compare four baselines and provide ground truth and zoomed-in images for a more detailed comparison. Our proposed method, CutMAA, shows superior performance in the residual intensity maps, effectively reducing the difference between the predicted and ground truth images.
Qualitative comparison for DA methods using various light field SR as baselines. Absolute residual intensity map between the network output and the ground-truth HR image, showing the qualitative differences. Best viewed on the electronic screen.
Table 2 presents a comparison of the mean squared error (MSE) and bad pixel ratio (BP) at a threshold of 0.7 between the extracted depth images and the ground truth depth images. The Lego knights dataset from STFgantry is used for our experimental comparison. Since the dataset does not provide depth ground truth, we use depth images derived from the HR images as the GT for comparison. The baseline methods include ATO [12], LF-InterNet [13], and DistgSSR [4], and CutMAA demonstrates superior performance across all metrics compared to these methods.
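For reference, the sketch below shows one way to compute the two depth metrics reported in Table 2, assuming the estimated and reference depth maps are arrays on a common scale; the function name and conventions are ours, not the evaluation code used in the paper.

```python
import numpy as np

def depth_metrics(est_depth, gt_depth, bp_threshold=0.7):
    """MSE and bad-pixel (BP) ratio between estimated and reference depth.

    BP is the fraction of pixels whose absolute depth error exceeds the
    threshold (0.7 in Table 2). The boolean error map also serves as the
    binary map visualized later (1 where the error exceeds the threshold).
    """
    err = est_depth.astype(np.float64) - gt_depth.astype(np.float64)
    mse = float(np.mean(err ** 2))
    bad = np.abs(err) > bp_threshold
    return mse, float(np.mean(bad)), bad.astype(np.uint8)
```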
D. Model Analysis and Ablation Study
In this section, we present experimental results based on the accuracy of motion information. In the proposed method, the quality of the warped SAIs affects SR performance because pixels are displaced according to the estimated motion. Therefore, we obtain warped SAIs with several motion estimation methods and compare the results. The two methods used for comparison are DeepFlow [55] and RAFT [59]. DeepFlow is included because it has shown the best performance among dense optical flow estimation methods that do not use deep learning. RAFT is the baseline model for state-of-the-art deep learning-based optical flow methods and is currently the most widely used approach for motion estimation. The optical flow results of each method on the Grove2 sequence of the dataset [62], which provides ground-truth flow, are presented in Figure 5. Grove2 is a synthetic sequence well suited to comparing optical flow methods because it features detailed structures such as leaves and a background that gradually recedes.
DeepFlow and RAFT, which are used in the comparison, compute dense optical flow for all pixels. Figure 5(b) shows that the approximate motion in the grove is extracted, but the saturation representing the motion magnitude is incorrect. By contrast, Figure 5(c) not only captures the detailed motion in the grove but also provides relatively accurate magnitudes. Whereas the optical flow of the background is inaccurately extracted in Figure 5(b), Figure 5(c) shows uniform results from the background to the foreground. We then compare how well the correlation of the light field is maintained according to the accuracy of motion-awareness based on optical flow. DistgSSR [4] is used as the light field SR baseline for this experiment. Figure 6 compares the results according to the accuracy of motion-awareness. In the qualitative results, applying motion-awareness eliminates noise and reduces blurriness compared to the baseline. In the baseline, we observe significant noise and blurring at the edges of the spear. When CutMIB is applied, the blur is substantially reduced, but the noise around the spear remains. Applying CutMAA eliminates this noise. Moreover, as the motion estimation accuracy improves, not only the noise around the spear but also the noise in the background is removed.
Visual comparisons of various motion-awareness upon the DistgSSR [4] baseline. Best viewed on the electronic screen.
Alongside qualitative comparisons, Figure 7 compares the results of depth estimation using CAE [2]. The accuracy improves as the motion-awareness becomes more precise. At the edges of the Knight, the baseline shows inaccurate depth estimation with noticeable noise. In the case of CutMIB, which does not consider motion, this noise is hardly removed. In CutMAA, as the accuracy of motion-awareness increases, the noise is reduced, and ultimately, the method using RAFT achieves the highest accuracy.
Depth estimation comparisons of various optical flow methods. Best viewed on the electronic screen.
Table 3 presents the quantitative results according to motion-awareness accuracy, using PSNR and SSIM for comparison. CutMIB, which does not account for motion, shows only a slight PSNR improvement. In contrast, CutMAA, which incorporates motion-awareness, achieves an average increase of about 0.5 dB, and the gain is even larger when using RAFT, whose motion estimation is more accurate than DeepFlow's. For SSIM, CutMIB shows either a decrease or minimal change, whereas CutMAA consistently outperforms it, with no dataset showing decreased performance when motion-awareness is considered. As motion-awareness accuracy increases, both PSNR and SSIM improve, and the highest SSIM scores are achieved. Thus, higher motion-awareness accuracy leads to better performance.
E. Depth Estimation
To compare depth estimation accuracy after SR is applied, we utilize ATO [12], LF-InterNet [13], IINet [35], and DistgSSR [4] as light field SR baselines, employing a scale factor of 4. For the experiment, we use the Lego knights scene from the STFgantry dataset.
Light field imaging offers a significant advantage in depth estimation due to its inherent multi-view nature. Unlike traditional stereo vision-based methods, light field imaging eliminates the need for camera calibration, thereby streamlining the depth estimation process. To verify the preservation of light field characteristics, we compare the estimated depth images across different scenarios. Our depth estimation methodology employs the CAE [2] approach, enhanced with graph-cut techniques and edge-preserving filtering. The baselines used for this experiment include ATO [12], LF-InterNet [13], and DistgSSR [4].
Figure 8 presents the qualitative results comparing depth estimation across various baselines, as well as the light field results from CutMIB and CutMAA. The application of DA leads to noticeable improvements in depth estimation accuracy compared to the baselines without augmentation. Additionally, our motion-aware approach, which considers the spatial-angular correlation of the light field, shows improved depth estimation accuracy across all baselines. This improvement is evident in the results of CutMAA, which offer more accurate depth estimation compared to the other methods. This result is most clearly seen at the edges of the Lego arm. In the results using the baseline or CutMIB, the depth at the edges is not accurately estimated, leading to noise. However, with CutMAA, there is a significant reduction in this noise.
Figure 9 shows the binary maps generated from the computed BP errors. These maps are created by thresholding the per-pixel depth error at 0.7, with pixels whose error exceeds this threshold marked as 1 in the binary map. For this comparison, we use ATO [12], LF-InterNet [13], and DistgSSR [4] as baseline methods, and only the methods that apply DA are included in the binary map comparisons.
CutMAA demonstrates a reduction in BP, reflected in the decreased number of white pixels in the corresponding binary map. This indicates that the depth estimated from CutMAA outputs is closer to the ground truth, suggesting that CutMAA better preserves the inherent characteristics of light fields and enables more precise depth reconstruction with fewer errors. Through the comparison of depth estimation results, we demonstrate that our motion-aware DA method is effective for light field processing.
Conclusion
In this work, we proposed CutMAA, a novel DA method designed specifically for light field SR. Unlike the previous DA strategy [24], CutMAA is motion-aware, taking into account the spatial-angular correlation of SAIs in the light field. To achieve this, we incorporated a warping operation into the DA pipeline, adjusting each SAI according to its motion before applying the cut, blend, and paste operations. This approach preserves the spatial-angular correlation, resulting in significant improvements in performance and robustness over CutMIB [24]. However, CutMAA relies on accurate optical flow during warping, limiting its effectiveness if precise estimation is unavailable. Through extensive evaluations, we demonstrate that CutMAA outperforms previous approaches across various SR networks and light field datasets, highlighting its effectiveness and adaptability.