Introduction
In recent decades, both industry and academia have devoted considerable research to autostereoscopic display. A Light Field Image (LFI) can provide a more immersive perception by increasing the spatial resolution and the number of views. Hand-held LF cameras, such as the Lytro-Illum camera, have entered the market rapidly, so LFI has attracted much more attention in recent years [1], [2]. Such an LF camera captures an LFI by placing a microlens array in front of the sensor. With this architecture, light field data are preserved in a large matrix of sub-images that records not only the light intensity at different spatial positions but also the light rays in various directions; it is thus a full representation of a real 3D scene [3]. Because the sensor resolution is limited, there is a trade-off between spatial resolution and angular resolution, i.e., increasing the angular resolution sacrifices spatial resolution [4]. LFI has different representations, and the two primary ones are shown in Fig. 1. The raw format, as illustrated in Fig. 1(a), consists of many Micro-Images (MIs) (as shown in Fig. 1(b)). The raw image data are produced by the camera sensor with a resolution of
The sub-aperture LFI not only contains information about various light directions but, owing to its dense views, can also be applied directly to autostereoscopic display. Nevertheless, the huge amount of LFI data limits storage and transmission, so reducing the LF data volume is a major challenge for LFI compression. Based on our previous work [7], this paper presents an efficient LFI compression approach. It makes three main improvements: 1) the depth value expression is amended; 2) the cost function is modified; 3) the weighting factor in depth optimization is computed from the characteristics of the corresponding viewpoint, which refines the depth map more thoroughly. Meanwhile, to avoid observable geometric distortion and blurring, only a subset of the light field viewpoint images is selected for the experiments in this paper.
The remainder of this article is organized as follows. Section II reviews related work on LFI compression. Section III presents the proposed LFI compression method. The simulated experimental results are reported and analyzed in Section IV. Finally, Section V concludes the paper.
Related Works About LFI Compression
Standard encoders cannot compress LFI efficiently when applied directly, so researchers have proposed modified algorithms that avoid direct application. To this end, a still image compression codec is used in [8] to compress the raw image with tile segmentation, which confirms that segmentation pre-processing yields a higher compression rate for still image codecs. The Discrete Wavelet Transform (DWT) is also adopted to derive a finer content representation, but its compression gain is limited compared with current video codecs [9]. These compression methods are reportedly more efficient than JPEG or JPEG2000, but not as efficient as the HEVC (High Efficiency Video Coding) encoder with still picture coding. The probable reason is that they cannot exploit the strong correlation that exists among the many MIs.
HEVC, the state-of-the-art video encoder, is good at eliminating redundancy in traditional two-dimensional images using intra prediction tools. Nevertheless, it was not developed for the type of correlation that exists in LFIs [10], which has motivated novel coding methods. Moreover, different data formats can strongly affect compression performance [11]. For the raw LFI format, a common approach is to organize the MIs into a sequence and encode this sequence with the HEVC codec. A high-order prediction model for LFI compression supported by HEVC is proposed in [12], in which the geometric transformations have up to eight degrees of freedom. A novel LFI compression strategy is presented in [13]; it makes the raw image compatible with a video codec through linear transformation and interpolation procedures. However, the large number of MIs hinders efficient compression by a video encoder. C. Perra et al. propose a series of low-bitrate algorithms in [8], [14], and [15] designed to fit the original video codec. Although these algorithms analyze many kinds of segmentation, they divide a raw LFI into several slices only simply and coarsely, and therefore fail to fully remove the correlation within the LFI.
Compared with the raw format, the sub-aperture LFI exhibits more correlation, and its compression has become the focus of investigation. For this format, several studies take advantage of a video encoder and transform the LFI into a single-viewpoint video sequence. An efficient LFI compression framework is proposed in [16]; by iterating the Rate Distortion Optimization (RDO) process, the algorithm finds the optimal configuration and achieves the expected compression performance. A predictive scheme is described in [17] that fully exploits the strong correlation between the current view and its neighbors, in which a sparse predictor performs least squares interpolation and predicts side views. Reference [18] demonstrates that compression with a video codec improves when the viewpoint images are treated as a video sequence, and that the spiral scan order outperforms other mapping orders. In this way, the motion estimation and compensation tools in the video codec can be applied to this flexible coding order. In [19], all viewpoint images are divided into seven layers and organized into a pseudo sequence according to the position of each viewpoint so that the inter-view correlation can be exploited. Building on [19], a pseudo sequence based on a 2-D hierarchical coding structure is proposed in [20]; this structure achieves a high compression ratio as well as precise motion vector scaling. Transforming viewpoint images into a pseudo sequence benefits compression, but the compression efficiency is negatively influenced by an increasing number of viewpoint images.
Within the Multiview Video Coding (MVC) structure, the Multiview extension of High Efficiency Video Coding (MV-HEVC) is well suited to multi-view coding [21]–[23]. Therefore, a collection of MVC modifications has been proposed for LFI compression. In [24], the viewpoint images in an LFI are rearranged into a Multi-View Video (MVV), so that the LFI can be compressed using a standard MVC codec. Additionally, a modified MVC prediction structure for LF data is proposed in [25], which adds vertical disparity to the inter-view prediction strategy and adjusts the coding order of MVC. However, this multiview structure increases the computational complexity of MVC.
If each image in a sub-aperture LFI is regarded as one viewpoint, the multiview structure contains a massive number of viewpoints, which increases the computational burden and adds substantial header information. Simply organizing the viewpoint images as a single-view video sequence reduces the number of views; however, it dramatically sacrifices the relationship among viewpoints. The Multiview Video plus Depth (MVD) coding structure can synthesize intermediate views at the decoder side using a small number of depth maps and their corresponding texture videos, resulting in dramatically lower bitrates for multiview video [26]. The MVD structure is supported in the 3D extension of HEVC (3D-HEVC), the state-of-the-art 3D video coding standard specified by the Moving Picture Experts Group (MPEG) and the Joint Collaborative Team on 3D Video Coding (JCT-3V) [27]. Learning-based view synthesis methods have been proposed in [4] and [28] and obtain reliable results. However, the learning process consumes substantial computation time, which prevents LF coding from real-time application; we therefore synthesize intermediate viewpoints after decoding using the scene geometry information, i.e., the depth map. No ready-made depth map exists for a sub-aperture LFI, so few LFI compression approaches adopt the MVD structure. Even though a light field device itself cannot capture the depth information of a natural scene, it is possible to estimate the depth map from the multiview characteristics of the LFI. An accurate depth map estimation algorithm is proposed in [29], where a disparity interpolation approach is utilized to improve the precision of depth estimation. A similar depth map rendering method based on a stereo pair is presented in [30], which overcomes the interference problem of depth cameras. Additionally, a novel 4D LF depth estimation method is presented in [31] based on the linear structure of the Epipolar Plane Image (EPI) and Locally Linear Embedding (LLE).
Unfortunately, only a few depth estimation algorithms have been proposed for LFIs with small disparity and dense views.
Researchers have made great efforts to compress LFI on a variety of coding platforms, and MVD is expected to be a good alternative coding approach for removing the redundancy of LF data. An earlier version of our work appeared in [7], in which the depth map was estimated and the MVD strategy was adopted for LF compression. However, the simple depth optimization in [7] limited the synthesis quality. To address this issue, this article improves the depth optimization based on the characteristics of the associated viewpoint image. Owing to the high-quality depth estimation, the proposed approach can synthesize the intermediate viewpoint images. Additionally, since only a subset of the viewpoint images is selected and coded, the bitrate is reduced significantly.
Proposed Sub-Aperture LFI Compression Method
As discussed in the previous sections, because the Lytro-Illum camera is currently the mainstream LF camera on the market, the sub-aperture format has become a widely investigated subject, and numerous compression algorithms have been presented to exploit the correlation existing in the sub-aperture LFI. Here the MVD structure is used to compress the LFI, since it not only requires a small number of views to be coded but also compresses a sub-aperture LFI efficiently by making use of both inter-view and inter-frame relationships.
A. Depth Map Estimation Using EPI
As described in [1], we can calculate the depth value of any point based on the expression
Based on this observation, we design a decision model to determine the best slope from a candidate angle set for each pixel in EPI, as expressed in (1). For a certain point
After the optimal slope of each point in the EPI has been found, the depth values of all points in any viewpoint can be obtained. Because this paper adopts the MVD strategy to compress LFI, the values of the depth map are limited to the range from 0 to 255 using the Min-Max Normalization (MMN) function, which preserves the linear structure of the depth values.
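The normalization step can be illustrated with a minimal sketch. It assumes the depth map is stored as a 2-D list of real-valued estimates and that the result is rounded to 8-bit integers; the function name `min_max_normalize` and the handling of a constant map are illustrative choices, not details taken from the paper.

```python
def min_max_normalize(depth, lo=0, hi=255):
    """Linearly rescale a 2-D list of depth values to integers in [lo, hi]."""
    flat = [v for row in depth for v in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == d_min:
        # Assumption: a constant map has no range to stretch, so map it to lo.
        return [[lo for _ in row] for row in depth]
    scale = (hi - lo) / (d_max - d_min)
    # The mapping is affine, so ratios of depth differences are preserved
    # up to rounding.
    return [[int(round(lo + (v - d_min) * scale)) for v in row] for row in depth]


raw_depth = [[0.0, 1.0], [3.0, 5.0]]
normalized = min_max_normalize(raw_depth)
print(normalized)  # [[0, 51], [153, 255]]
```

Because the mapping is affine, equal steps in the estimated depth remain equal steps in the 8-bit map, which is the linearity property mentioned above.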
B. Depth Map Optimization Using Reference Image
An appropriate depth map is characterized by several large homogeneous regions in which the pixels have almost the same depth value. However, pixels in flat areas of the EPI may have two or more best angles, which leads to an erroneous slope. Because the optimal slope cannot be determined in such a region, these pixels are marked as error pixels. To improve the precision of the estimated depth map, we perform depth enhancement for the error pixels based on the similarity between the depth map and its corresponding texture image. We observe that edges in the depth map should correspond to pixels located at edges in the texture image. Additionally, for pixels located in homogeneous regions of the texture image, the associated pixels in the depth map tend to have a constant depth value. Based on this observation, the texture image can be regarded as a reference for eliminating the error pixels in the associated depth map.
According to the above analysis, all error pixels lie in homogeneous regions, so their depth values should approximate those of their neighboring correct pixels. Thus, we design a weighted mean filter, formulated in (3), to smooth the error pixels located in homogeneous regions.
Fig. 3 shows the results before and after depth optimization. As Fig. 3 shows, the two texture images are dominated by homogeneous background, so their associated depth maps are full of error pixels. In particular, it is hard to distinguish the foremost magnet in the initial depth map of Magnets_1. Similarly, although the objects in Ankylosaurus_&_Diplodocus_1 can be distinguished, the studio background contains numerous error pixels. In contrast, since our depth optimization method takes the texture image as the reference, the error pixels in homogeneous regions are removed to a large extent while the foreground objects are preserved.
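The idea of a texture-guided weighted mean filter can be sketched as follows. This is only an illustration under stated assumptions, not the filter of (3): the Gaussian weight on the texture intensity difference, the window `radius`, the parameter `sigma`, and the function name `repair_error_pixels` are all hypothetical choices.

```python
import math


def repair_error_pixels(depth, texture, error_mask, radius=1, sigma=10.0):
    """Replace each flagged depth value by a weighted mean of reliable neighbors.

    Assumption: a neighbor's weight decays with its texture intensity
    difference from the error pixel (Gaussian kernel with width sigma).
    """
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            if not error_mask[y][x]:
                continue
            num = den = 0.0
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if error_mask[ny][nx]:
                        continue  # only neighbors with a reliable depth contribute
                    diff = texture[ny][nx] - texture[y][x]
                    weight = math.exp(-(diff * diff) / (2.0 * sigma * sigma))
                    num += weight * depth[ny][nx]
                    den += weight
            if den > 0.0:
                out[y][x] = num / den  # left unchanged if no reliable neighbor exists
    return out


# Toy example: the center pixel is an error pixel inside a flat texture region
# (intensity 100) that borders an edge (intensity 200).
depth = [[10, 10, 50], [10, 99, 50], [10, 10, 50]]
texture = [[100, 100, 200], [100, 100, 200], [100, 100, 200]]
error_mask = [[False, False, False], [False, True, False], [False, False, False]]
repaired = repair_error_pixels(depth, texture, error_mask)
```

In this toy case the repaired center value is pulled toward the depth of the flat region it belongs to in the texture image, while neighbors across the texture edge receive a negligible weight.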
C. Depth Maps Based LFI Compression
As discussed above, the MVD architecture is more suitable than other coding architectures because it can encode multiview 3D video in such a way that only a small number of texture videos and their corresponding depth maps are encoded, and the resulting bitstreams are multiplexed. Virtual intermediate views located between two neighboring real views can be synthesized using the Depth Image-Based Rendering (DIBR) technique [33].
The quality of the intermediate views is related to the disparity between the actual views. More specifically, if the disparity is too large or too small, occlusion holes are generated, which drastically degrades the quality of the intermediate views [34]. Fortunately, an LFI captured by the Lytro-Illum camera has a rather narrow baseline, which helps avoid artifacts in the synthesized viewpoint images.
The depth maps of all viewpoint images are estimated in the horizontal direction by our proposed depth estimation algorithm, and the viewpoint images are then transformed into a multi-view video sequence by columns. Considering the geometric distortion, blurring, and vignetting effects, the 56 border viewpoint images (marked in gray) are omitted without coding and the remaining
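To make the pixel displacement behind DIBR concrete, the following sketch forward-warps a single image row. It is a simplified illustration, not the rendering procedure of [33]: it assumes per-pixel horizontal disparities are already available, uses integer shifts, resolves conflicts in favor of the larger disparity (assumed to be nearer to the camera), and leaves holes unfilled. The name `warp_row` and the parameter `alpha` (position of the virtual view along the baseline) are hypothetical.

```python
def warp_row(texture_row, disparity_row, alpha=0.5):
    """Forward-warp one row toward a virtual view at fraction alpha of the baseline."""
    width = len(texture_row)
    out = [None] * width              # None marks a hole (disocclusion)
    best = [float("-inf")] * width    # disparity of the pixel currently stored
    for x in range(width):
        d = disparity_row[x]
        tx = x + int(round(alpha * d))  # assumed sign convention: shift to the right
        if 0 <= tx < width and d > best[tx]:
            out[tx] = texture_row[x]    # the pixel with larger disparity wins
            best[tx] = d
    return out


# One foreground pixel (disparity 2) moves two positions to the right,
# leaves a hole behind, and occludes the background pixel at its target.
warped = warp_row([10, 20, 30, 40, 50], [0, 0, 2, 0, 0], alpha=1.0)
print(warped)  # [10, 20, None, 40, 30]
```

The `None` entry shows how a hole appears where no source pixel lands; a complete renderer would fill such holes, for example from the second reference view or by inpainting.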
Experimental Results and Discussion
In this section, we verify the feasibility of the proposed LFI compression method. We first compare the proposed depth map estimation approach with other state-of-the-art depth estimation methods [36]–[38]. We then compare the coding results, with the typical LFI compression structures in [18] and [19] serving as benchmarks.
A. Test Condition
To evaluate the compression performance of the presented approach, we test 10 LFIs (listed in Table 1) from the MMSPG-EPFL LFI Dataset, as in the recent grand challenge at ICME 2016. These LFIs have various features, which makes them suitable for benchmarking. Two LFIs in the MMSPG-EPFL database, named Color_Chart_1 and ISO_Chart_12, are mainly used for calibration, so they are not included in the simulated experiments. Meanwhile, to be compatible with the 3D-HEVC codec, each viewpoint is cropped to a resolution of
In this paper, the 3D-HEVC test model (HTM) reference software HTM 16.0 has been utilized, by which the multiview images plus depth maps are encoded. The encoder is basically configured as follows: maximum CU size is
B. Visual Assessment for Estimated Depth Map
In this section, we take the Accurate Depth Map Estimation (ADME) [36], the Efficient Large Scale Stereo (ELSS) [37], and the Line-Assisted Graph Cut (LAGC) [38] methods as anchors for comparing the estimated depth maps. Fig. 5 illustrates the depth estimation results. For I01, I02, I03, I04, I05, I09, and I10, which have highly complex texture, ELSS and LAGC only outline the corresponding depth map, whereas our estimation algorithm can still generate an accurate depth map. I06, I07, and I08 contain large homogeneous regions, so matching-based methods such as ELSS and LAGC fail to produce usable depth maps for these images. As expected, the error pixels in flat areas are effectively repaired by our depth map optimization. Although ADME produces a relatively detailed and continuous depth map, especially for objects near the camera, it cannot compute the depth values of objects far from the camera, and the resulting loss of depth information negatively affects the performance of view synthesis. Compared with the other approaches, our proposed depth estimation method performs well, as shown in Fig. 5. However, in the MVD scheme the depth map serves to synthesize the intermediate viewpoint images. The reconstruction performance therefore reflects the quality of the estimated depth map, which is discussed in the next subsection.
C. Key Viewpoint Images Quality Evaluation
Since the depth map is mainly used to synthesize the missing viewpoint images rather than for visualization, the average SSIM (structural similarity index measure) and PSNR (peak signal-to-noise ratio) of the synthesized viewpoint images are measured to evaluate the estimated depth map, as shown in Fig. 6 and Fig. 7, respectively. Because ELSS and LAGC perform very poorly, only ADME is compared with our method. In Fig. 6 and Fig. 7, the red bars denote our proposed depth estimation solution and the blue bars denote ADME. As can be seen, although the depth map generated by ADME is smooth, its overall synthesis quality is lower than ours because of the considerable information loss. For our proposed method, the average SSIM is above 0.8 and the maximum SSIM is close to 0.97; the average PSNR is over 33 dB and the maximum PSNR is approximately 36 dB. Both approaches achieve high synthesis performance for I06 and I08, whose content consists of very sparse foreground objects in front of a large flat background. I03 contains extremely complex texture, which is sensitive to the depth map, so even a small flaw in the estimated depth map makes the synthesized image less satisfactory. For a more intuitive assessment, Fig. 8 shows the subjective quality of several synthesized viewpoint images. Fig. 8 shows that, owing to the very smooth texture, the synthesized viewpoint images of I06 and I08 look almost identical to the original viewpoint images. In addition, only the edges of the flowers in the front of I03 are blurred, which is negligible. Similarly, the wheel in I01 shows slight distortion. We therefore plan to further improve the proposed depth estimation solution and to design a more robust view synthesis algorithm in future work.
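For reference, the PSNR of a synthesized view against its original can be computed as in the sketch below. It assumes 8-bit single-channel images stored as 2-D lists (peak value 255); the function name `psnr` is illustrative, and the paper's exact averaging over channels and viewpoints is not reproduced here.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    total = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            total += (r - t) ** 2
            count += 1
    mse = total / count
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)


original = [[100, 100], [100, 100]]
synthesized = [[100, 102], [98, 100]]
value = psnr(original, synthesized)  # mean squared error is 2 in this toy case
```

An average PSNR over several synthesized viewpoints would then be the mean of the per-view values, assuming each view is weighted equally.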
D. Evaluation of Our Proposed Compression Scheme
To evaluate the compression efficiency of our proposed scheme, we use the efficient conventional encoder HEVC as the benchmark. For the comparison test using HEVC, we transform the 169 viewpoints into a single-viewpoint sequence following the spiral scan mapping, similar to [18]. Throughout this paper, the coding performance is jointly measured by the average PSNR over the 169 viewpoint images (including 91 coded viewpoint images and 78 synthesized viewpoint images, as shown in (5)) and the total bitrate (viewpoints plus depth maps). Their variations are calculated by the Bjontegaard Delta PSNR (BD-PSNR) and the Bjontegaard Delta Bitrate (BD-Bitrate), respectively [39].
Table 2 shows the comparison results for both the Y and YUV channels. On average, our proposed compression scheme achieves over 2 dB PSNR gain and 63% bitrate reduction in both the YUV and Y channels. I05, I06, I07, and I08 contain a large homogeneous region, which limits the depth estimation performance, so the PSNR improvement for these images is slightly lower than for the others. We believe that a more robust depth optimization would improve the reconstruction performance significantly, which we leave as future work. Since our scheme adopts the advanced 3D-HEVC codec, which fully exploits the correlation among viewpoints, considerable spatial redundancy can be removed. With the precisely estimated depth maps, the proposed method requires only a small number of viewpoint images to be coded together with their corresponding depth maps. Therefore, the simulated experimental results demonstrate that the proposed LFI compression scheme can reduce the bitrate substantially while maintaining a satisfactory reconstruction PSNR.
In this work, the simulations run on an Intel Core i7-4790 3.6 GHz processor with 8 GB of DDR3 random access memory. The decoding time, including the bitstream decoding time and the new viewpoint synthesis time, is shown in Table 3. As can be seen, since we use the original codec, the bitstream decoding is fast. By contrast, the new viewpoint synthesis is time-consuming because it involves a complex pixel displacement computation.
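A spiral scan over the 13 x 13 grid of 169 viewpoints can be generated as in the following sketch. The starting point (grid center), the turning direction, and the function name `spiral_order` are assumptions made for illustration; the exact order used in [18] may differ.

```python
def spiral_order(rows, cols):
    """Return (row, col) positions visited in an outward spiral from the grid center."""
    r, c = rows // 2, cols // 2
    order = [(r, c)]
    # Directions: right, down, left, up. Run lengths grow as 1, 1, 2, 2, 3, 3, ...
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    step, turn = 1, 0
    while len(order) < rows * cols:
        for _ in range(2):
            dr, dc = directions[turn % 4]
            for _move in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < rows and 0 <= c < cols:
                    order.append((r, c))  # positions outside the grid are skipped
            turn += 1
        step += 1
    return order


scan = spiral_order(13, 13)  # frame index -> viewpoint position
print(len(scan), scan[0])    # 169 (6, 6)
```

Placing the viewpoint images in this order yields a pseudo video sequence in which consecutive frames come from adjacent viewpoints, which is the property that motion estimation in a video codec can exploit.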
E. Comparison With Other LFI Compression Methods
In this section, we compare the proposed solution with three typical sub-aperture LFI compression approaches: Line Scan Mapping (LSM) and Rotation Scan Mapping (RSM) from [18], and Multiple View Structure (MVS) from [19]. Their RD curves are depicted in Fig. 7. LSM and RSM first reorganize the viewpoint images into a pseudo sequence in linear and spiral order, respectively, and then code the sequence using a video codec. As can be seen, RSM outperforms LSM, as described in [18]. MVS is specifically designed for LF data: it assigns each viewpoint image to a layer, and the layers use different QPs and prediction relationships. In this way, MVS greatly reduces the residual in LFIs and is therefore superior to both LSM and RSM. Especially at low bitrates, MVS maintains high coding performance compared with the other solutions. Note that our proposed algorithm adopts the MVD architecture, which avoids coding all viewpoint images. With the estimated depth map, our proposed method therefore saves considerably more bitrate at high PSNR. However, at low bitrates, the coding difference between our method and the algorithm in [18] is not evident, and our proposed method is considerably inferior to MVS in that case, which motivates further improvement in future work.
Conclusion
This paper applies the concept of virtual view synthesis to improve LFI compression efficiency. To synthesize some of the viewpoint images instead of coding all viewpoints, we propose an accurate depth map estimation method that precisely describes the scene geometry. We reorder a subset of the viewpoint images into a multiview video sequence and use the MVD structure to code the selected viewpoints and their associated depth maps. At the decoder side, the missing viewpoint images are synthesized by the DIBR technique using the estimated depth maps, and the sub-aperture LFI is then reconstructed. The experimental results show that our proposed scheme achieves a substantial improvement in LFI compression compared with the conventional and state-of-the-art LFI compression algorithms.