
Efficient Light Field Images Compression Method Based on Depth Estimation and Optimization


This paper achieves sparse coding and reconstruction for light field dense viewpoint images via efficient depth estimation as well as optimization.


Abstract:

Considerable recent investigation from both industry and academia has been devoted to autostereoscopic display. One of the emerging techniques, the light field image (LFI), can provide more immersive perception by increasing the number of views and the spatial resolution. However, such dense-view imagery places heavy demands on storage and transmission. To solve this problem, we propose to compress the LFI using the multiview video plus depth (MVD) coding architecture. In this paper, we first estimate depth based on the concept of the epipolar plane image. To obtain each depth value, we design an optimal slope decision algorithm that determines the best slope with the minimal cost. Since this rough estimation produces some error points within the initial depth map, we then present a depth optimization algorithm that exploits the characteristics of the associated texture image. Ultimately, a small number of selected viewpoint images are encoded together with their corresponding depth maps using the MVD framework, and the unselected viewpoint images are synthesized by the depth image-based rendering technique. To verify the validity of the proposed LFI compression scheme, extensive experiments are conducted. The results demonstrate that the proposed depth map estimation algorithm is superior to other state-of-the-art methods for the LFI, and that our LFI compression method significantly outperforms other LFI compression algorithms.
Published in: IEEE Access (Volume: 6)
Page(s): 48984 - 48993
Date of Publication: 30 August 2018
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent decades, considerable investigation from both industry and academia has been devoted to autostereoscopic display. The Light Field Image (LFI) can provide more immersive perception by increasing the spatial resolution and the number of views. Hand-held LF cameras, such as the Lytro Illum, have been entering the market at an incredible speed, so LFI has attracted much more attention in recent years [1], [2]. An LFI can be captured by such an LF camera by placing a microlens array in front of the sensor. With this architecture, the light field data are preserved as a large matrix of sub-images that records not only the light intensity at different spatial positions but also the light rays in various directions; it is therefore a full representation of a real 3D scene [3]. However, due to the limited sensor resolution, there is a trade-off between spatial resolution and angular resolution, i.e., increasing angular resolution sacrifices spatial resolution [4]. LFI has different representations, and the primary two of them are shown in Fig. 1. The raw format, as illustrated in Fig. 1(a), consists of numerous Micro-Images (MIs) (as shown in Fig. 1(b)). The raw image data are produced by the camera sensor at a resolution of 7728×5368 pixels and overlaid with a "GRBG" color filter pattern, which is not directly visible. The raw image needs to be transformed into the sub-aperture format (as shown in Fig. 1(c)) for visualization. Any two adjacent sub-aperture images have almost the same content and quality but a slight disparity. In general, the amount of light that arrives at the camera sensor decreases from the image center toward the edge, an effect known as vignetting [5]. As a result, the farther a viewpoint image is from the center, the more severely its quality declines; the corner views in particular are nearly full of invalid pixels, as shown in Fig. 1(c). In the sub-aperture format, each viewpoint is composed of the associated pixels within the MIs of the raw image [6], and a slight disparity exists between any two viewpoint images, as shown in Fig. 1(d).

FIGURE 1. The two types of LFI; (a) raw format; (b) close-up of (a); (c) sub-aperture format; (d) close-up of (c).
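To make the raw-to-sub-aperture relationship concrete, the following is a minimal sketch of the rearrangement under the idealized assumption that the micro-images are square, perfectly aligned with the pixel grid, and already demosaiced (real Lytro Illum data require calibration and resampling first); the function name and array layout are illustrative, not taken from the paper.

```python
import numpy as np

def raw_to_subaperture(raw, mi_size):
    """Rearrange an aligned lenslet (raw) image into sub-aperture views.

    raw     : 2-D luminance array tiled by square micro-images of mi_size pixels.
    Returns : array of shape (mi_size, mi_size, H // mi_size, W // mi_size),
              i.e. one view per angular position (u, v), each view built from
              the co-located pixel of every micro-image.
    """
    H, W = raw.shape
    mis = raw.reshape(H // mi_size, mi_size, W // mi_size, mi_size)
    # (MI row, u, MI column, v) -> (u, v, spatial row, spatial column)
    return mis.transpose(1, 3, 0, 2)

# Toy example: a 15x15 lenslet grid over a small synthetic sensor patch.
raw = np.random.rand(27 * 15, 36 * 15)
views = raw_to_subaperture(raw, 15)
print(views.shape)  # (15, 15, 27, 36)
```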

Owing to its dense-view nature, the sub-aperture LFI not only contains information about various light directions but can also be applied directly to autostereoscopic display. Nevertheless, the huge amount of LFI data limits storage and transmission, meaning that LFI compression faces a major challenge in LF data volume reduction. Based on our previous work [7], an efficient LFI compression approach is presented in this paper. It mainly makes three improvements: 1) the depth value expression has been amended; 2) the cost function has been modified; 3) the weighting factor in depth optimization is now computed from the characteristics of the corresponding viewpoint image, which fully refines the depth map. Meanwhile, to avoid observable geometric distortion and blurring, only a subset of light field viewpoint images is selected as the experimental objects in this paper.

The remainder of this article is organized as follows. Section II describes related achievements in LFI compression. Section III presents our proposed LFI compression method. The experimental results are exhibited and analyzed in Section IV. Finally, this paper is concluded in Section V.

SECTION II.

Related Work on LFI Compression

Standard encoders cannot compress LFI directly and efficiently, so researchers have proposed modified algorithms rather than applying them as-is. To this end, a static image compression codec is utilized in [8] to compress the raw image with tile segmentation, which confirms that segmentation pre-processing yields a higher compression rate for static image codecs. The Discrete Wavelet Transform (DWT) is also adopted to derive finer content, but it restricts the compression gain compared with current video codecs [9]. These compression methods are reportedly more efficient than JPEG or JPEG2000, but not as efficient as the HEVC (High Efficiency Video Coding) encoder with still-picture coding. The reason is probably that they are unable to exploit the strong correlation that exists among the numerous MIs.

HEVC, the state-of-the-art video encoder, is good at eliminating redundancy in traditional two-dimensional images using intra-prediction tools. Nevertheless, it was not developed for the type of correlation existing in LFIs [10], which has motivated novel coding methods. Moreover, different data formats can strongly impact compression performance [11]. For the raw LFI format, several approaches organize the MIs into a sequence and encode this sequence using an HEVC codec. A high-order prediction model for LFI compression, supported by HEVC, is proposed in [12], in which geometric transformations have up to eight degrees of freedom. A novel LFI compression strategy is presented in [13]; this method makes the raw image compatible with a video codec through linear transformation and interpolation procedures, but the large number of MIs still obstructs efficient compression by the video encoder. C. Perra et al. propose a series of low-bitrate algorithms in [8], [14], and [15] to fit the original video codec. Although these algorithms analyze all kinds of segmentation, they divide a raw LFI into several slices simply and coarsely, which fails to fully remove the correlation within the LFI.

Compared with the raw format, the sub-aperture LFI exhibits more correlations, and its compression has become a research focus. Aiming at this format, several studies take advantage of video encoders and transform such an LFI into a single-viewpoint video sequence. An efficient LFI compression framework is proposed in [16]; through iteration of the Rate-Distortion Optimization (RDO) process, this algorithm finds the optimal configuration and achieves the expected compression performance. A predictive scheme is described in [17] that fully exploits the strong correlation between the current view and its neighbors, in which a sparse predictor performs least-squares interpolation to predict side views. Reference [18] demonstrates that compression is improved with a video codec when viewpoint images are treated as a video sequence, and that the spiral scan order outperforms other mapping orders. In this way, the motion estimation and compensation tools of the video codec can be applied to this flexible coding order. In [19], all viewpoint images are sliced into seven layers and organized into a pseudo sequence according to the position of each viewpoint so that the inter-view correlation can be exploited. On the basis of [19], a pseudo sequence based on a 2-D hierarchical coding structure is proposed in [20]; this structure achieves a high compression ratio as well as precise motion vector scaling. Transforming viewpoint images into a pseudo sequence benefits compression, but the compression efficiency is negatively influenced by the increasing number of viewpoint images.

In the case of the Multiview Video Coding (MVC) structure, the Multiview extension of High Efficiency Video Coding (MV-HEVC) is quite feasible for multi-view coding [21]–[23]. Therefore, a collection of MVC modifications has been proposed for LFI compression. In [24], the viewpoint images in the LFI are rearranged into a Multi-View Video (MVV), so that the LFI can be compressed using a standard MVC codec. Additionally, a new modified MVC prediction structure for LF data is proposed in [25], which adds vertical disparity to the inter-view prediction strategy and adjusts the coding order of MVC. Doubtlessly, this multiview structure increases the computational complexity of MVC.

If each image in a sub-aperture LFI is regarded as one viewpoint, the multiview structure contains a massive number of viewpoints, which increases the computational burden and the amount of header information. Simply organizing the viewpoint images as a single-view video sequence decreases the number of views adequately; however, it dramatically sacrifices the relationships among viewpoints. The MVD coding structure can synthesize intermediate views at the decoder side using a small number of depth maps and their corresponding texture videos, resulting in dramatically lower bitrates for multiview video [26]. The MVD structure is supported in the 3D extension of HEVC (3D-HEVC), the state-of-the-art 3D video coding standard specified by the Moving Picture Experts Group (MPEG) and the Joint Collaborative Team on 3D Video Coding (JCT-3V) [27]. Learning-based view synthesis methods have been proposed in [4] and [28] and obtain reliable results, but the learning process consumes substantial computation time, which prevents LF coding from real-time application; as a result, we decide to synthesize intermediate viewpoints after decoding using the scene geometry information, i.e., the depth map. There is no proper depth map for a sub-aperture LFI, so few LFI compression approaches adopt the MVD structure. Even though the light field device itself cannot capture the depth information of a natural scene, it is possible to estimate the depth map based on the multiview characteristics of the LFI. An accurate depth map estimation algorithm is proposed in [29], where a disparity interpolation approach is utilized to improve the precision of depth estimation. Another related depth map rendering method based on a stereo pair is presented in [30], which overcomes the interference problem of depth cameras. Additionally, a novel 4D LF depth estimation method based on the linear scheme of the Epipolar Plane Image (EPI) and Locally Linear Embedding (LLE) is presented in [31]. Unfortunately, only a few depth estimation algorithms have been proposed for the LFI with small disparity and dense views.

Admittedly, researchers have made great efforts to compress LFI on all kinds of coding platforms. It is, in fact, expected that MVD is a good alternative coding approach for removing the redundancy of LF data. An earlier version of our work appeared in [7], in which the depth map was estimated and the MVD strategy was adopted for LF compression. However, the simple depth optimization in [7] limited the quality of synthesis. To deal with this issue, we improve the depth optimization in this article based on the characteristics of the associated viewpoint image. Owing to the high-quality depth estimation, the proposed approach can synthesize the intermediate viewpoint images. Additionally, since only a subset of viewpoint images is selected and coded, the bitrate can be reduced significantly.

SECTION III.

Proposed Sub-Aperture LFI Compression Method

As demonstrated in the previous sections, because the Lytro Illum camera is currently the mainstream LF camera on the market, the sub-aperture format has become a prevalent subject of investigation, and numerous compression algorithms have been presented to eliminate the correlation existing in sub-aperture LFIs. Here, the MVD structure is used to compress the LFI, since this structure not only requires a small number of views to be coded but also sufficiently compresses a sub-aperture LFI by exploiting both inter-view and inter-frame relationships.

A. Depth Map Estimation Using EPI

As described in [1], we can calculate the depth value of any point from the expression
\begin{equation*} Z=\frac{f}{\Delta u/\Delta s}\end{equation*}
where \Delta s is the geometrical distance between the two virtual cameras, \Delta u is the distance by which the corresponding scene point moves in the image, and f is the focal length. Therefore, constructing the EPI allows the ratio \Delta u/\Delta s to be computed so that the angular information of the LFI can be obtained. Fig. 2 depicts the process of EPI construction. Viewpoint images in the sub-aperture format are stacked as a cube, shown on the left of Fig. 2. One slice cut by a horizontal plane is one EPI of the viewpoint cube, and part of this EPI is shown on the right of Fig. 2. It can be observed that the EPI contains many lines with different slopes. Each scene point is projected onto an epipolar line whose slope reflects the depth of that point; the slope is calculated as the ratio \Delta u/\Delta s . Different slopes thus imply different depths, i.e., the greater the slope, the larger the depth value.

FIGURE 2. Diagram of the EPI construction. The stack of one row of sub-aperture views is shown on the left and a specific EPI is shown on the right. The lines with different colors represent different slopes.
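As a small illustration of this construction, the sketch below slices a horizontal EPI out of a stack of views from one angular row, assuming the views are available as an (S, H, W) luminance array; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def horizontal_epi(row_views, y):
    """Build one epipolar plane image (EPI) from a row of viewpoint images.

    row_views : array of shape (S, H, W) -- the S views along one horizontal
                angular row, stacked as on the left of Fig. 2 (luminance only).
    y         : the image row at which the stack is sliced.
    Returns an (S, W) slice; a scene point traces a line in it whose slope
    encodes the point's depth.
    """
    return row_views[:, y, :]

# Toy example with a stack of 13 views of size 432x624.
row_views = np.random.rand(13, 432, 624)
epi = horizontal_epi(row_views, y=200)
print(epi.shape)  # (13, 624)
```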

Based on this observation, we design a decision model to determine the best slope from a candidate angle set for each pixel in the EPI, as expressed in (1). For a given point p=(u,s) located in the EPI, the angle with the minimal cost is chosen as the best one:
\begin{equation*} \alpha^{\ast}(p)=\mathop{\arg\min}\limits_{\alpha\in\left({-45^{\circ},45^{\circ}}\right)} C(\alpha,p)\tag{1}\end{equation*}
where C(\alpha,p) is our designed cost function of the candidate angle \alpha for the point p=(u,s) , and \alpha^{\ast}(p) represents the determined best angle for this point. Since the LFI captured by a hand-held light field camera is characterized by tiny disparity, the candidate angle set ranges from −45° to 45° rather than covering all angles. The cost function C(\alpha,p) is defined as follows:
\begin{equation*} C(\alpha,p)={\sum\limits_{i=1}^{S}{\omega_{i}\left(V_{u,s}-V_{\Delta u,\Delta s}^{(\alpha)}\right)^{2}}}\bigg/{\sum\limits_{i=1}^{S}{\omega_{i}}}\tag{2}\end{equation*}
where S stands for the height of the EPI, which equals the number of viewpoints, V_{u,s} denotes the luminance value of the point being tested, and V_{\Delta u,\Delta s}^{(\alpha)} is the luminance value of a candidate point on the epipolar line with angle \alpha . During the cost computation, the points on the epipolar line are considered exhaustively to accumulate the cost. Moreover, \omega_{i} is a flag that is set to 0 if the candidate pixel is unavailable and to 1 otherwise.
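A minimal sketch of this slope decision is given below, assuming the EPI is stored as an (S, U) luminance array and that the candidate epipolar line is sampled by rounding to the nearest pixel at each viewpoint row; the function name, the 1° angle step, and the du = Δs·tan(α) sampling rule are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def best_slope(epi, u, s, angles_deg=np.arange(-45, 46, 1)):
    """Pick the angle (Eq. 1) minimizing the cost of Eq. 2 for EPI pixel (u, s).

    epi : (S, U) luminance EPI; rows are viewpoints, columns are image positions.
    The candidate epipolar line through (u, s) with angle alpha is sampled at
    every viewpoint row; samples falling outside the EPI get weight 0.
    """
    S, U = epi.shape
    v_ref = epi[s, u]
    best_alpha, best_cost = None, np.inf
    for alpha in np.deg2rad(angles_deg):
        num, den = 0.0, 0.0
        for i in range(S):
            ds = i - s                          # viewpoint offset along the line
            du = int(round(ds * np.tan(alpha)))
            uu = u + du
            w = 1.0 if 0 <= uu < U else 0.0     # omega_i: availability flag
            if w:
                num += w * (v_ref - epi[i, uu]) ** 2
                den += w
        cost = num / den if den > 0 else np.inf
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return np.rad2deg(best_alpha), best_cost
```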

After searching for the optimal slope of each point in the EPI, the depth values of all points in every viewpoint can be obtained. Because this paper adopts the MVD strategy to compress the LFI, the depth map values are limited to the range 0 to 255 using the Min-Max Normalization (MMN) function, which maintains the linear structure of the depth values.
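For completeness, a small sketch of this min-max normalization step, under the assumption that the raw depth (slope) estimates are held in a floating-point array:

```python
import numpy as np

def min_max_normalize(depth):
    """Linearly map raw depth estimates into the 8-bit range [0, 255]
    while preserving their relative (linear) structure."""
    d_min, d_max = depth.min(), depth.max()
    if d_max == d_min:                       # flat map: avoid division by zero
        return np.zeros_like(depth, dtype=np.uint8)
    return np.round(255.0 * (depth - d_min) / (d_max - d_min)).astype(np.uint8)
```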

B. Depth Map Optimization Using Reference Image

An appropriate depth map is characterized by several large homogeneous regions in which the pixels have almost the same depth value. However, a pixel in a flat area of the EPI may have two or more candidate angles with the same minimal cost, which leads to an erroneous slope. Since the optimal slope cannot be decided in such a region, these pixels are marked as error pixels. To improve the precision of the estimated depth map, we perform depth enhancement for the error pixels based on the similarity between the depth map and its corresponding texture image. It is observed that edges in the depth map should correspond to pixels located at edges in the texture image. Additionally, for pixels located in homogeneous regions of the texture image, the associated pixels in the depth map usually tend to have a constant depth value. Based on this observation, the texture image can therefore be regarded as a reference to eliminate the error pixels in the associated depth map.

According to the aforementioned analysis, all the error pixels lie in homogeneous regions, so their depth values should approximate those of their neighboring correct pixels. Thus, we design a weighted mean filter, formulated in (3), to smooth each error pixel located in a homogeneous region:
\begin{equation*} W_{e}=\frac{1}{\left|{N(e)}\right|}\sum\limits_{n\in N(e)}{\gamma_{n}D_{n}}\tag{3}\end{equation*}
where W_{e} is the filtered depth value of the error pixel, computed over a 3\times 3 window centered at that pixel, and D_{n} signifies the depth values of its neighboring pixels. N(e) represents the set of available neighboring pixels around this error pixel, and \left|{N(e)}\right| denotes the cardinality of N(e) . Furthermore, \gamma_{n} denotes the weighting factor obtained from the associated texture image. As one of the fastest edge-preserving filters, Guided Image Filtering (GIF) can make the structure of the filtered output image follow that of the guidance image [32]. Inspired by the concept of GIF, we compute the weighting factor \gamma_{n} as follows:
\begin{align*} 0=&~V_{e}-\gamma_{n}V_{n},\quad\forall n\in N(e) \\ s.t.~1-\varepsilon\le&~\frac{1}{\left|{N(e)}\right|}\sum\limits_{n\in N(e)}{\left|{\gamma_{n}}\right|}\le 1+\varepsilon\tag{4}\end{align*}
where V_{e} indicates the luminance value of the pixel in the texture image that corresponds to the error pixel in the depth map, and V_{n} is the luminance value of its neighboring pixel. \varepsilon denotes the offset that controls the smoothness: the larger this offset, the stronger the smoothing, and vice versa. To balance the smoothness, the offset \varepsilon is set to 0.012 empirically in this paper. Since the texture image is treated as the reference image, the error pixel can be smoothed efficiently and kept consistent with its neighboring pixels.
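The following sketch illustrates one way to realize (3)–(4) for a single error pixel, assuming 2-D NumPy arrays for the depth map and the luminance of its texture view. Equation (4) is underdetermined as stated, so here each γ_n is read off directly as V_e/V_n and the set is rescaled only when the mean magnitude leaves 1 ± ε; this interpretation, like the function and parameter names, is an assumption rather than the authors' exact procedure.

```python
import numpy as np

def refine_error_pixel(depth, texture, ey, ex, valid, eps=0.012):
    """Correct one error pixel (ey, ex) of the depth map following Eq. (3)-(4).

    depth, texture : 2-D arrays (depth map and luminance of its texture view).
    valid          : boolean mask marking depth pixels considered correct.
    Each gamma_n is read off Eq. (4) as V_e / V_n; the set is rescaled only if
    its mean magnitude leaves [1 - eps, 1 + eps] (an assumed reading).
    """
    H, W = depth.shape
    v_e = float(texture[ey, ex])
    gammas, depths = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = ey + dy, ex + dx
            if (dy, dx) == (0, 0) or not (0 <= ny < H and 0 <= nx < W):
                continue
            if not valid[ny, nx]:
                continue                     # N(e): only available neighbours
            v_n = float(texture[ny, nx])
            gammas.append(v_e / v_n if v_n != 0 else 1.0)
            depths.append(float(depth[ny, nx]))
    if not gammas:
        return depth[ey, ex]                 # no reliable neighbour: keep as is
    gammas = np.asarray(gammas)
    mean_abs = np.abs(gammas).mean()
    if mean_abs > 1 + eps:                   # enforce the smoothness constraint
        gammas *= (1 + eps) / mean_abs
    elif 0 < mean_abs < 1 - eps:
        gammas *= (1 - eps) / mean_abs
    return float(np.dot(gammas, depths)) / len(gammas)   # W_e of Eq. (3)
```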

Fig. 3 shows comparative results before and after depth optimization. It is obvious from Fig. 3 that these two texture images are mostly occupied by a homogeneous background, so their associated initial depth maps are full of error pixels. In particular, the foremost magnet is hard to distinguish in the initial depth map of Magnets_1. Similarly, although the objects in Ankylosaurus_&_Diplodocus_1 can be distinguished, the studio background consists of numerous error pixels. In contrast, since our depth optimization method uses the texture image as a reference, the error pixels in homogeneous regions are removed significantly while the foreground objects are preserved.

FIGURE 3. Depth map illustration before and after optimization. Ankylosaurus_&_Diplodocus_1 and Magnets_1 are shown from top row to bottom row. (a) center view; (b) the initial estimated depth map; (c) depth map after optimization.

C. Depth-Map-Based LFI Compression

As discussed above, the MVD architecture is more suitable than other coding architectures, because the MVD structure encodes multiview 3D video in which only a small number of texture videos and their corresponding depth maps are encoded, and the resulting bitstreams are multiplexed. Virtual intermediate views located between two neighboring real views can be synthesized using the Depth Image-Based Rendering (DIBR) technique [33].

The quality of the intermediate views is related to the disparity between the actual views. More specifically, if the disparity is too large or too small, occlusion holes will be generated, which drastically affects the quality of the intermediate views [34]. Fortunately, LFIs captured by the Lytro Illum camera have a rather narrow baseline, so artifacts in the synthesized viewpoint images can be largely avoided.
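To make the DIBR step concrete, here is a heavily simplified forward-warping sketch: the 8-bit depth map is mapped to a per-pixel horizontal disparity and pixels are shifted with a z-buffer, leaving a hole mask. The disparity scaling, the assumption that larger depth values are nearer, and the omission of rounding, blending, and inpainting are all simplifications; the paper itself relies on the standard DIBR renderer rather than this code.

```python
import numpy as np

def dibr_horizontal_warp(texture, depth8, d_min, d_max):
    """Forward-warp `texture` to a neighbouring view using its 8-bit depth map.

    depth8       : quantized depth in [0, 255]; 255 is assumed nearest here.
    d_min, d_max : disparity (in pixels) of the farthest / nearest scene point
                   between the reference and the target view.
    Returns the warped view and a hole mask.
    """
    H, W = texture.shape[:2]
    disparity = d_min + (depth8.astype(np.float32) / 255.0) * (d_max - d_min)
    warped = np.zeros_like(texture)
    filled = np.zeros((H, W), dtype=bool)
    order = np.argsort(depth8, axis=None)    # far first, near last wins (z-buffer)
    ys, xs = np.unravel_index(order, (H, W))
    xt = np.clip(np.round(xs + disparity[ys, xs]).astype(int), 0, W - 1)
    warped[ys, xt] = texture[ys, xs]
    filled[ys, xt] = True
    return warped, ~filled                   # holes are the unfilled pixels
```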

The depth maps of all viewpoint images are rendered along the horizontal direction by our proposed depth estimation algorithm, and the viewpoint images are then transformed into multi-view video sequences by columns. Considering the geometric distortion, blurring, and vignetting effect, the 56 border viewpoint images (marked as gray) are omitted without coding, and the remaining 13×13 = 169 viewpoint images are selected for the experiments. As mentioned in [35], the coding result is closely related to the sparseness; more specifically, when the sampling parameter is set to 2, the encoder obtains the best performance. Although the light field images tested in this paper differ from those in [35] in content and structure, this still provides inspiration for selecting the coding viewpoints. Obviously, a small number of coding viewpoints saves overhead bits but limits the synthesis of the remaining views; conversely, encoding all the light field viewpoint images maintains a satisfactory reconstruction performance, but the coding bitrate increases dramatically. Consequently, we will fully investigate the relationship among sparseness, bitrate, and reconstruction performance in future work, so as to find an optimal sampling solution. In this paper we select only about half of the viewpoint images as input video sequences, as illustrated in Fig. 4 (marked as blue and red; see also the sketch following Fig. 4). The red viewpoints are regarded as center views, and their left and right neighbors in blue are considered dependent viewpoints. Therefore, only 13×7 = 91 viewpoint images and their associated depth maps are coded. The center view is also referred to as the independent viewpoint, so its texture image and associated depth map are coded independently using a conventional HEVC codec. Apart from the independent and dependent viewpoints, the remaining white viewpoints are synthesized by the DIBR technique instead of being coded.

FIGURE 4. Diagram of viewpoint selection and transformation. The blocks in red represent the base viewpoints and those in blue represent the non-base viewpoints. All of them are transformed in the same way.
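As a small illustration of the selection pattern just described, the snippet below marks 7 alternating columns of the 13×13 retained grid as coded (13 × 7 = 91 views) and the rest as synthesized. The specific column indices, and which coded columns serve as independent versus dependent views, are assumptions consistent with the counts above; the exact arrangement is the one shown in Fig. 4.

```python
import numpy as np

# One plausible layout of the 13x13 retained viewpoint grid: 7 alternating
# columns are coded (13 * 7 = 91 views), the remaining 78 views are synthesized.
grid = np.zeros((13, 13), dtype=bool)      # True = coded, False = synthesized
grid[:, ::2] = True
print(grid.sum(), "coded views,", (~grid).sum(), "synthesized views")  # 91, 78
```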

SECTION IV.

Experimental Results and Discussion

In this section, we verify the feasibility of the proposed LFI compression method. We first compare the proposed depth map estimation approach with other state-of-the-art depth estimation methods [36]–[38]. Then the coding results are compared, with the typical LFI compression structures in [18] and [19] used as benchmarks.

A. Test Condition

To evaluate the compression performance of the presented approach, we choose 10 LFIs (tabulated in Table 1) from the MMSPG-EPFL LFI dataset, the same dataset used in the recent grand challenge at ICME 2016. These LFIs have various features so that they can be utilized in the benchmark investigation. Two LFIs in the MMSPG-EPFL database, named Color_Chart_1 and ISO_Chart_12, are mainly used for calibration, so they are not included in the experiments. Meanwhile, to be compatible with the 3D-HEVC codec, each viewpoint is cropped to a resolution of 624×432, resulting in a 3D LF structure of dimension 624×432×169.

TABLE 1. Tested Light Field Images

In this paper, the 3D-HEVC test model (HTM) reference software HTM 16.0 is utilized to encode the multiview images plus depth maps. The encoder is configured as follows: the maximum CU size is 64×64 pixels and the maximum CU depth is 4, resulting in a minimum CU size of 8×8; the motion search range is set to 64; the Quantization Parameters (QPs) are set to 22, 27, 32, and 37.

B. Visual Assessment for Estimated Depth Map

In this section, we take the Accurate Depth Map Estimation (ADME) [36], the Efficient Large-Scale Stereo (ELSS) [37], and the Line-Assisted Graph Cut (LAGC) [38] methods as anchors against which the estimated depth maps are compared. Fig. 5 illustrates the depth estimation results. For I01, I02, I03, I04, I05, I09, and I10, which contain rather complex texture, ELSS and LAGC only outline the corresponding depth map, whereas our estimation algorithm can still generate an accurate depth map. I06, I07, and I08 contain large flat regions, so matching-based methods such as ELSS and LAGC fail to produce usable depth maps for these images. As expected, our designed depth map optimization repairs the error pixels in flat areas thoroughly. Although ADME can draw a relatively detailed and continuous depth map, especially for objects near the camera, it is unable to compute the depth of objects far from the camera, and the resulting loss of depth information negatively impacts the performance of view synthesis. Compared with the other approaches, our proposed depth estimation method obtains good performance, as shown in Fig. 5. Moreover, since the depth map is used to synthesize intermediate viewpoint images in the MVD scheme, the reconstruction performance also reflects the quality of the estimated depth map, which is discussed in the next subsection.

FIGURE 5. Experiments on the partial LFIs from Table 1. I01 to I10 are shown from top row to bottom row. (a) center view; (b) depth map by ADME; (c) depth map by ELSS; (d) depth map by LAGC; (e) depth map by our proposed method.

C. Key Viewpoint Images Quality Evaluation

Since the depth map is mainly used to synthesize the missing viewpoint images rather than for visualization, the average SSIM (structural similarity index measurement) and PSNR (peak signal-to-noise ratio) of the synthesized viewpoint images are measured to evaluate the estimated depth map, as shown in Fig. 6 and Fig. 7, respectively. Due to the very poor performance of ELSS and LAGC, only ADME is compared with our method. In Fig. 6 and Fig. 7, the red bars denote our proposed depth estimation solution and the blue bars denote ADME. As can be seen, although the depth map generated by ADME is smooth, its overall synthesis quality is lower than ours because of the loss of depth information. With regard to our proposed method, the average SSIM is above 0.8 and the maximum SSIM is close to 0.97; the average PSNR is over 33 dB and the maximum PSNR is approximately 36 dB. Both approaches obtain high synthesis performance for I06 and I08, whose content consists of very sparse foreground objects in front of a large flat background. I03 contains extremely complex texture, which is sensitive to the depth map, so even a small flaw in the produced depth map makes the synthesized image less satisfactory. To assess the results intuitively, Fig. 8 shows the subjective quality of several synthesized viewpoint images. It is observed from Fig. 8 that, thanks to their smooth texture, the synthesized viewpoint images of I06 and I08 look nearly identical to the original viewpoint images. In addition, only the edge of the flowers in the front of I03 is blurred, which is negligible, and the wheel in I01 shows small distortion. We therefore intend to further improve the proposed depth estimation solution and design a more robust view synthesis algorithm in future work.

FIGURE 6. The synthesis quality for the LFIs I01 to I10 in SSIM. The horizontal axis represents the image index and the vertical axis indicates the synthesized viewpoint quality in SSIM. (QP = 37).

FIGURE 7. The synthesis quality for the LFIs I01 to I10 in PSNR. The horizontal axis represents the image index and the vertical axis indicates the synthesized viewpoint quality in PSNR. (QP = 37).

FIGURE 8. The perceptual quality for four light field images. Top to bottom: I01, I03, I06 and I08. (a) the original viewpoint image; (b) the synthesized viewpoint image using our produced depth map.

FIGURE 9. RD curves of various LFI compression solutions. The RD curves of I01 to I10 are illustrated from (a) to (j). PSNRY denotes the reconstruction PSNR in the Y channel and bpp indicates the bitrate in bits per pixel.

D. Evaluation of Our Proposed Compression Scheme

To evaluate the compression efficiency of our proposed scheme, we use the efficient conventional HEVC encoder as the benchmark. For this comparison, the 169 viewpoints are transformed into a single-viewpoint sequence along the spiral scan mapping, similar to [18]. The coding performance throughout this paper is jointly measured by the average PSNR over the 169 viewpoint images (including 91 coded viewpoint images and 78 synthesized viewpoint images, as shown in (5)) and the total bitrate (viewpoints plus depth maps). Their variations are calculated by the Bjontegaard Delta PSNR (BD-PSNR) and Bjontegaard Delta Bitrate (BD-Bitrate), respectively [39].
\begin{equation*} PSNR_{avg}=\frac{1}{169}\left({\sum\limits_{91}{PSNR_{coded}^{(u,v)}}+\sum\limits_{78}{PSNR_{syn}^{(u,v)}}}\right)\tag{5}\end{equation*}
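A minimal sketch of how Eq. (5) is evaluated, assuming each view is available as a NumPy array and using a standard 8-bit PSNR definition; the function names are illustrative assumptions:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio between a reference and a reconstructed view."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def average_lf_psnr(coded_pairs, synthesized_pairs):
    """Eq. (5): average PSNR over all 169 views (91 coded + 78 synthesized).

    Each argument is a list of (reference, reconstructed) view pairs.
    """
    total = sum(psnr(r, x) for r, x in coded_pairs) + \
            sum(psnr(r, x) for r, x in synthesized_pairs)
    return total / (len(coded_pairs) + len(synthesized_pairs))
```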

Table 2 shows the comparison results in both the Y and YUV channels. It is observed that our proposed compression scheme achieves, on average, over 2 dB PSNR gain and 63% bitrate reduction in both the YUV and Y channels. I05, I06, I07, and I08 contain large homogeneous regions that limit the depth estimation performance, so the gain for these images is slightly lower than for the others. We believe that a more robust depth optimization would improve the reconstruction performance significantly, which we leave as future work. Since our scheme adopts the advanced 3D-HEVC codec, which fully exploits the correlation among viewpoints, considerable spatial redundancy can be removed. With the precisely estimated depth maps, the proposed method requires only a small number of viewpoint images to be coded with their corresponding depth maps. Therefore, the experimental results demonstrate that the proposed LFI compression scheme can reduce the bitrate substantially while maintaining a satisfactory reconstruction PSNR.

TABLE 2. Coding Results Comparison With HEVC Codec in Low-Delay Case

In this work, simulations run on an Intel Core i7-4790 3.6 GHz processor with 8 GB of DDR3 memory. The decoding time, including bitstream decoding time and new viewpoint synthesis time, is shown in Table 3. As can be seen, since we utilize the original codec, the decoding can be performed quickly. By contrast, the new viewpoint synthesis is a time-consuming process, because it involves complex per-pixel displacement computation.

TABLE 3. Decoding Time Including Decoding Bitstream Time and New Viewpoint Synthesis Time in Seconds

E. Comparison With Other LFI Compression Methods

In this section, we compare the proposed solution with three typical sub-aperture LFI compression approaches: Line Scan Mapping (LSM) and Rotation Scan Mapping (RSM) from [18], and the Multiple View Structure (MVS) from [19]. Their RD curves are depicted in Fig. 9. LSM and RSM first reorganize the viewpoint images into a pseudo sequence with linear and spiral orders, respectively, and then code the sequence using a video codec. As can be seen, RSM outperforms LSM, as described in [18]. MVS is specifically designed for LF data: it assigns each viewpoint image to a layer with its own QP and prediction dependencies, and in this way MVS can greatly reduce the residual in LFIs. As a result, MVS is superior to both LSM and RSM; at low bitrates in particular, MVS maintains high coding performance compared with the other solutions. Note that our proposed algorithm adopts the MVD architecture, which avoids coding all viewpoint images exhaustively. With the estimated depth maps, our proposed method can therefore save considerably more bitrate while achieving high PSNR. However, at low bitrates the coding difference between our method and the algorithm in [18] is not evident, and our proposed method is clearly inferior to MVS in that case, which motivates further improvement in the future.

SECTION V.

Conclusion

The virtual view synthesis concept is applied in this paper to improve LFI compression efficiency. To synthesize some of the viewpoint images instead of coding all of them, we propose an accurate depth map estimation method that precisely describes the scene geometry. We reorder a few viewpoint images into a multiview video sequence and utilize the MVD structure to code the selected viewpoints and their associated depth maps. At the decoder side, the missing viewpoint images are synthesized by the DIBR technique using the estimated depth maps, and the sub-aperture LFI is then reconstructed. The experimental results show that our proposed scheme obtains a great improvement in LFI compression compared with conventional and state-of-the-art LFI compression algorithms.
