Introduction
3D applications have attracted wide interest in recent years. While much of the emphasis has been on stereoscopic displays, which require glasses to enable depth perception, a new generation of autostereoscopic displays, which emit different pictures depending on the position of the observer’s eyes and do not require glasses for viewing, is starting to emerge and become commercially available [1], [2]. The latter often employ depth-based image rendering techniques to generate a dense set of views of the scene [3]. In order to render these views with acceptable quality, it is desirable to use high-quality depth maps, which need to be represented and coded along with the texture. Depth maps can be estimated from a stereo or multicamera setup using stereo correspondence techniques [4]. They can also be acquired by a special depth camera; this particular area has seen notable advances in recent years with designs based on structured light [5] or time-of-flight imaging [6]. Finally, depth information is an integral part of computer-generated imagery, which is popular in many cinema productions.
To address the above needs and to leverage the state-of-the-art compression capabilities offered by the High Efficiency Video Coding (HEVC) standard [7], [8], a vision for the next-generation 3D video format was published by the Moving Picture Experts Group (MPEG) [9], with the aim to develop a 3D video format that could facilitate the generation of intermediate views with high compression capability in order to support advanced stereoscopic display functionality and emerging autostereoscopic displays. Following this, a reference framework that utilized depth-based image rendering was prepared so that candidate technology could be evaluated. A key challenge was generating high-quality depth maps for the available multiview video sequences and preparing anchor material of sufficiently high quality. It was also critically important to define an appropriate evaluation procedure, as no well-defined process for evaluating the impact of depth coding and rendering results existed. It was ultimately decided to measure the PSNR of both coded and synthesized views as well as to subjectively assess the quality on stereoscopic and autostereoscopic multiview displays.
In 2011, a call for proposals (CfP) was issued based on a specified set of requirements and the defined evaluation procedure [10], which solicited technology contributions for both the Advanced Video Coding (AVC) and HEVC frameworks. The responses demonstrated that substantial benefit over existing standards could be achieved. As a result, the ISO/IEC MPEG and ITU-T Video Coding Experts Group (VCEG) standardization bodies established the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) in July 2012, mandated to develop next-generation 3D coding standards with more advanced compression capability and support for synthesis of additional perspective views, covering both AVC- and HEVC-based extensions. For the HEVC development, a first version of the reference software was contributed by proponents of top-performing responses to the CfP [11]. Based on this software platform, which also included tools for view synthesis and synthesized view distortion-based rate-distortion optimization [12], a range of core experiments were conducted over a period of three years in order to develop all major aspects of the specifications that are described in this paper. As part of this, the JCT-3V has developed two extensions for HEVC, namely, Multiview HEVC (MV-HEVC) [13], which is integrated in the second edition of the standard [14], and 3D-HEVC [15], which was completed in February 2015 and will be part of the third edition.
MV-HEVC comprises only high-level syntax (HLS) additions and can thus be implemented using existing single-layer decoding cores. Higher compression (compared with simulcast) is achieved by exploiting redundancy between different camera views of the same scene. 3D-HEVC aims to compress the video-plus-depth format more efficiently by introducing new compression tools that perform the following:
explicitly address the unique characteristics of depth maps;
exploit dependencies between multiple views as well as between video texture and depth.
It is noted that MV-HEVC follows the same design principle as Multiview Video Coding (MVC), the multiview extension of H.264/MPEG-4 AVC [16], [17]. Moreover, since MV-HEVC and 3D-HEVC were developed in parallel with the scalable extension of HEVC (SHVC [18]), all extensions share a basic inter-layer prediction design utilizing almost the same HLS. The common design enables a single texture base view to be extracted from MV-HEVC, SHVC, and 3D-HEVC bitstreams, which is decodable by a main profile compliant HEVC decoder. Also, a 3D-HEVC encoder can generate a bitstream that allows the stereoscopic texture views to be decoded by an MV-HEVC decoder. Further aspects of these designs will be explained in the following sections.
The rest of this paper is organized as follows. In the following section, basic concepts of multilayer coding in HEVC are explained. Section III outlines the specific aspects of the HLS design for MV-HEVC and 3D-HEVC. Section IV describes the new coding tools that are specified in 3D-HEVC. Section V provides definitions of conformance points, i.e., profiles defined for MV-HEVC and 3D-HEVC. Section VI reports the compression performance of the two extensions. Conclusions and outlook are given in Section VII. Note that only the MV-HEVC and 3D-HEVC extension parts of HEVC are discussed in this paper, while a description of the first edition of HEVC [7] can be found in [8].
Multilayer Coding Design
MV- and 3D-HEVC, as well as SHVC, employ a multilayer approach where different HEVC-coded representations of video sequences, called layers, are multiplexed into one bitstream and can depend on each other. Dependencies are created by inter-layer prediction to achieve increased compression performance by exploiting similarities among different layers.
In MV- and 3D-HEVC, a layer can represent texture, depth, or other auxiliary information of a scene related to a particular camera perspective. All layers belonging to the same camera perspective are denoted as a view, whereas layers carrying the same type of information (e.g., texture or depth) are usually called components in the scope of 3D video (and should not be confused in the following with the color components composing a picture as defined in HEVC [7]).
Fig. 1 shows a typical coding structure for pictures, including four layers of two views and two components (texture and depth) for each of the shown two time instances: by design choice, all pictures associated with the same capturing or display time instance are contained in one access unit (AU) and have the same picture order count (POC). The layer of the first picture within an AU is usually denoted as the base layer. Unless the base layer is external (e.g., when using hybrid codec scalability as described in Section III-A7), it is required to conform to an HEVC single-layer profile, and hence to be the texture component of the base view. The layers of the pictures following the base layer picture in an AU are denoted as enhancement layers or non-base layers, and the views other than the base view are denoted as enhancement views or non-base views. In an AU, the order of views is required to be the same for all components. To facilitate combined coding, it is further required in 3D-HEVC that the depth component of a particular view immediately follows its texture component. An overview of dependencies between pictures in different layers and AUs is depicted in Fig. 1 and further discussed below. Note that enhancement-layer random access point pictures are usually coded using inter-layer prediction and thus are not necessarily only intra-picture predicted.
A. MV-HEVC Inter-Layer Prediction
A key benefit of the MV-HEVC architecture is that it does not change the syntax or decoding process required for HEVC single-layer coding below the slice level. This allows reuse of existing implementations without major changes for building MV-HEVC decoders.
Beyond conventional temporal inter-picture prediction [marked ($T$) in Fig. 1], MV-HEVC enables inter-view (IV) prediction [marked ($V$) in Fig. 1] by including decoded pictures of other views of the same AU in the reference picture lists of the current picture.
This way, the motion vectors (MVs) may be actual temporal MVs (subsequently denoted as TMVs) when related to temporal reference pictures of the same view, or may be disparity MVs (DMVs) when related to IV reference pictures. Existing block-level HEVC motion compensation modules can be used, which operate the same way regardless of whether an MV is a TMV or a DMV.
In HEVC single-layer coding, motion information (MV and reference index) for a current prediction block (PB) can be coded in merge mode or using advanced MV prediction (AMVP). In both modes, a list of candidates is created from the motion information of spatially or temporally neighboring PBs. In this process, MVs from neighboring blocks may be temporally scaled by:
multiplying by the POC difference between the picture of the current PB and its reference picture;
dividing by the POC difference between the picture of the neighboring PB and its reference picture.
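Expressed as a formula (a simplified form; the normative process approximates this division with clipped fixed-point arithmetic [8]):

$$\mathrm{MV}_{\mathrm{scaled}} = \mathrm{MV}_{\mathrm{nb}} \cdot \frac{\mathrm{POC}_{\mathrm{cur}} - \mathrm{POC}_{\mathrm{ref,cur}}}{\mathrm{POC}_{\mathrm{nb}} - \mathrm{POC}_{\mathrm{ref,nb}}}$$

where $\mathrm{POC}_{\mathrm{cur}}$ and $\mathrm{POC}_{\mathrm{nb}}$ denote the POC values of the pictures containing the current and the neighboring PB, and $\mathrm{POC}_{\mathrm{ref,cur}}$ and $\mathrm{POC}_{\mathrm{ref,nb}}$ those of their respective reference pictures.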
B. 3D-HEVC Inter-Layer Prediction
For increased compression performance, 3D-HEVC extends MV-HEVC by allowing new types of inter-layer prediction. As indicated in Fig. 1, the new prediction types are the following:
combined temporal and IV prediction ($A+V$), referring to a picture in the same component but in a different AU and a different view;
inter-component prediction ($C$), referring to pictures in the same AU and view but in a different component;
combined inter-component and IV prediction ($C+V$), referring to pictures in the same AU but in a different view and component.
C. Limitations of Inter-Layer Prediction
Since IV prediction in both MV- and 3D-HEVC is achieved through block-based disparity compensation (in contrast to full epipolar geometric transformations), the coding tools described in this paper are most efficient when the view signals are aligned in a 1D linear and coplanar arrangement. This can be achieved through camera setup or preprocessing of the sequences through a rectification process. While the standard does not impose any limitations regarding the arrangement of multiview video sequences, the coding efficiency can be expected to decrease when there is significant misalignment, similar to MVC [20].
A second assumption is that texture and depth pictures in the same AU and view are spatially aligned (or have been appropriately registered), such that samples at equal positions represent the same point of the depicted scene. If they are not aligned, the effectiveness of coding tools with a dependency between texture and depth components decreases.
MV- and 3D-HEVC High-Level Syntax
HLS is an integral part of a video codec. An important part of it is the network abstraction layer (NAL), providing a (generic) interface of a video codec to (various) networks/systems. HEVC (single-layer coding) HLS was designed with significant consideration of extensibility mechanisms. These are also referred to as hooks, which basically allow future extensions to be backward compatible to earlier versions of the standard. Important HLS hooks in HEVC include the following.
Inclusion of a layer identifier (ID) in the NAL unit header, whereby the same NAL unit header syntax applies to both HEVC single-layer coding and its multilayer extensions.
Introduction of the video parameter set (VPS), which was mainly introduced for use with multilayer extensions, as VPS contains cross-layer information.
Introduction of the layer set concept and the associated signaling of multilayer hypothetical reference decoder (HRD) parameters.
Addition of extensibility for all types of parameter sets and slice header, which allows the same syntax structures to be used for both the base layer and enhancement layers without defining new NAL unit types and to be further extended in the future when needed.
A. Common HLS for Layered HEVC Extensions
1) Parameter Set and Slice Segment Header Extensions:
The VPS has been extended by adding the VPS extension structure to the end, which mainly includes information on: 1) scalability type and mapping of NAL unit header layer ID to scalability IDs; 2) layer dependency, dependency type, and independent layers; 3) layer sets and output layer sets (OLSs); 4) sub-layers and inter-layer dependency of sub-layers; 5) profile, tier, and level (PTL); 6) representation format (resolution, bit depth, color format, etc.); 7) decoded picture buffer (DPB) size; and 8) cross-layer video usability information, which includes information on cross-layer picture type alignment, cross-layer intra random access point (IRAP) picture alignment, bit rate and picture rate of layer sets, video signal format (color primaries, transfer characteristics, etc.), usage of tiles and wavefronts and other enabled parallel processing capabilities, and additional HRD parameters.
It should be noted that the VPS applies to all layers, while in the AU decoding order dimension it applies from the first AU where it is activated up to the AU where it is deactivated. Different layers (including the base layer and a non-base layer) may either share the same sequence parameter set (SPS) or use different SPSs. Pictures of different layers or AUs can also share the same picture parameter set (PPS) or use different PPSs. To enable sharing of SPS and PPS, all SPSs share the same value space of their SPS IDs, regardless of the layer ID values in their NAL unit headers; the same is true for PPSs.
Among other smaller extensions, the slice segment header has been extended in a backward compatible manner by adding the following information:
the discardable flag, which indicates whether the picture is used for temporal inter-picture prediction or inter-layer prediction (when neither applies, the picture can be discarded without affecting the decoding of any other pictures in the same layer or in other layers);
a flag that indicates whether an instantaneous decoder refresh (IDR) picture is a bitstream splicing point (if yes, then pictures from earlier AUs would be unavailable as references for pictures of any layer starting from the current AU);
information on lower layer pictures used by the current picture for inter-layer prediction;
POC resetting and POC most significant bits (MSBs) information.
2) Layer and Scalability Identification:
Each layer is associated with a unique layer ID, which must be increasing across pictures of different layers in decoding order within an AU. In addition, a layer is associated with scalability IDs specifying its content, which are derived from the VPS extension and denoted as view order index (VOI) and auxiliary ID.
All layers of a view have the same VOI. The VOI is required to be increasing in decoding order of views. Furthermore, a view ID value is signaled for each VOI, which can be chosen without constraints, but should indicate the view’s camera position (e.g., in a linear setup).
The auxiliary ID signals whether a layer is an auxiliary picture layer carrying depth, alpha, or other user defined auxiliary data. By design choice, auxiliary picture layers have no normative impact on the decoding of nonauxiliary picture layers (denoted as primary picture layers).
3) Layer Sets:
The concept of layer sets was already introduced in HEVC version 1. A layer set is a set of independently decodable layers that conventionally contains the base layer. Layer sets are signaled in the base part of the VPS. During the development of the common multilayer HLS, two related concepts, namely, OLSs and additional layer sets, were further introduced. An OLS is a layer set or an additional layer set for which the target output layers are specified. Nontarget-output layers are, for example, those layers that are used only for inter-layer prediction but not for output/display. An OLS can have two layers for output (e.g., stereoscopic viewing) but contain three layers. An HEVC single-layer decoder would only process one target output layer, i.e., the base layer, regardless of how many layers the layer set contains. This is the reason why the concept of OLSs was not needed in HEVC version 1.
An additional layer set is a set of independently decodable layers that does not contain the base layer. For example, if a bitstream contains two simulcast (i.e., independently coded) layers, then the non-base layer itself can be included in an additional layer set. This concept can also be used for signaling the PTL for auxiliary picture layers, which are usually coded independently from the primary picture layers. For example, a depth or alpha (i.e., transparency) auxiliary picture layer can be included in an additional layer set and indicated to conform to the monochrome (8 bit) profile, regardless of which single-layer profile the base (primary picture) layer conforms to. Without such a design, many more profiles would need to be defined to handle all combinations of auxiliary picture layers with single-layer profiles. To realize the benefits of this design, a hypothetical independent non-base layer rewriting process was specified, which transcodes independent non-base layers to a bitstream that conforms to a single-layer profile.
By design choice, an additional layer set is allowed to contain more than one layer, e.g., three layers with layer ID values equal to 3, 4, and 5, where the layer with layer ID equal to 3 is an independent non-base layer. Along with this, a bitstream extraction process for additional layer sets was specified. While the extracted subbitstream does not contain a base layer, it is still a conforming bitstream, i.e., the multilayer extensions of HEVC allow for a conforming multilayer bitstream to not contain the base layer, and compliant decoding of the bitstream may not involve the base layer at all.
4) Profile, Tier, and Level:
Compared with earlier multilayer video coding standards, a fundamentally different approach was taken for MV-HEVC and SHVC for the specification and signaling of interoperability points (i.e., PTL in the context of HEVC and its extensions). Rather than specifying PTL for an operation point that contains a set of layers, PTL is specified and signaled in a layer-specific manner in MV-HEVC and SHVC. Consequently, a decoder that is able to decode two-layer bitstreams with 1080p at 30 frames/s at the base layer and 1080p at 60 frames/s at the enhancement layer would express its capability as a list of two PTLs: {Main profile, Main tier, Level 4} and {Multiview Main profile, Main tier, Level 4.1}. A key advantage of this design is that it facilitates decoding of multiple layers by reusing single-layer decoders. If PTL were specified for the two layers together, the decoder would need to be able to decode two-layer bitstreams with both the base and enhancement layers at 1080p and 60 frames/s, causing overprovisioning of resources.
5) RPS and Reference Picture List Construction:
In addition to the five RPS lists (RefPicSetStCurrBefore, RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr, and RefPicSetLtFoll) defined in HEVC version 1, two more RPS lists, RefPicSetInterLayer0 and RefPicSetInterLayer1 (denoted as RpsIL0 and RpsIL1, respectively), were introduced to contain inter-layer reference pictures. Given a current picture, those inter-layer reference pictures are included into two sets depending on whether they have view ID values greater or smaller than the current picture. If the base view has a greater view ID than the current picture, then those with greater view IDs are included into RpsIL0 and those with smaller view IDs into RpsIL1, and vice versa. The derivation of RpsIL0 and RpsIL1 is based on VPS extension signaling (of layer dependency and inter-layer dependency of sub-layers) as well as slice header signaling (of lower-layer pictures used by the current picture for inter-layer prediction).
When constructing the initial reference picture list 0 (i.e., RefPicListTemp0), pictures in RpsIL0 are immediately inserted after pictures in RefPicSetStCurrBefore, and pictures in RpsIL1 are inserted last, after pictures in RefPicSetLtCurr. When constructing the initial reference picture list 1 (i.e., RefPicListTemp1), pictures in RpsIL1 are immediately inserted after pictures in RefPicSetStCurrAfter, and pictures in RpsIL0 are inserted last, after pictures in RefPicSetLtCurr. Otherwise the reference picture list construction process stays the same as for HEVC single-layer coding.
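To illustrate the insertion order, the following sketch uses placeholder types and omits the normative list modification commands and the truncation to the signaled list size; for list 0, the arguments correspond to RefPicSetStCurrBefore, RpsIL0, RefPicSetStCurrAfter, RefPicSetLtCurr, and RpsIL1, while for list 1 the roles of the "before"/"after" subsets and of RpsIL0/RpsIL1 are swapped.

#include <vector>

struct Picture;  // placeholder for a decoded picture

// Simplified sketch of initial reference picture list construction in
// multilayer HEVC: inter-layer pictures of the first inter-layer RPS are
// inserted immediately after the first short-term subset, and those of
// the second inter-layer RPS are appended last.
std::vector<Picture*> buildInitialList(const std::vector<Picture*>& stFirst,
                                       const std::vector<Picture*>& ilFirst,
                                       const std::vector<Picture*>& stSecond,
                                       const std::vector<Picture*>& ltCurr,
                                       const std::vector<Picture*>& ilSecond)
{
  std::vector<Picture*> list;
  list.insert(list.end(), stFirst.begin(), stFirst.end());    // short-term subset
  list.insert(list.end(), ilFirst.begin(), ilFirst.end());    // inter-layer set 1
  list.insert(list.end(), stSecond.begin(), stSecond.end());  // short-term subset
  list.insert(list.end(), ltCurr.begin(), ltCurr.end());      // long-term subset
  list.insert(list.end(), ilSecond.begin(), ilSecond.end());  // inter-layer set 2
  return list;
}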
6) Random Access, Layer Switching, and Bitstream Splicing:
Compared with AVC, HEVC provides more flexible and convenient random access and splicing operations, by allowing conforming bitstreams to start with a clean random access (CRA) or broken link access (BLA) picture. In addition, MV-HEVC and SHVC support the following.
Non-cross-layer aligned IRAP pictures, i.e., it is allowed in an AU to have IRAP pictures at some layers and non-IRAP pictures at other layers.
A conforming bitstream can start with any type of IRAP AU, including an IRAP AU where the base layer picture is an IRAP picture while (some of) the enhancement layer pictures are non-IRAP pictures.
To support non-cross-layer aligned IRAP pictures, the multilayer POC design needs to ensure that all pictures in an AU have the same POC value. The design principle is referred to as cross-layer POC alignment and is required to enable a correct in-layer RPS derivation and a correct output order of pictures of target output layers.
The multilayer HEVC design allows extremely flexible layering structures. Basically, a picture of any layer may be absent from any AU. For example, the highest layer ID value can vary from AU to AU, which was disallowed in SVC and MVC. Such flexibility posed a great challenge for the multilayer POC design. In addition, although a bitstream after layer or sub-layer switching is not required to be conforming, the design should still enable a conforming decoding behavior to work with layer and sub-layer switching, including cascaded switching. This is achieved by a POC resetting approach.
The basic idea of POC resetting is to reset the POC value when decoding a non-IRAP picture (as determined by the POC derivation process in HEVC version 1), such that the final POC values of pictures of all layers of the AU are identical. In addition, to ensure that POC values of pictures in earlier AUs are also cross-layer aligned and that POC delta values of pictures within each layer remain proportional to the associated presentation time delta values, POC values of pictures in earlier AUs are reduced by a specified amount [22].
To work with all possible layering structures as well as some picture loss situations, the POC resetting period is specified based on a POC resetting period ID that is optionally signaled in the slice header [23]. Each non-IRAP picture that belongs to an AU containing at least one IRAP picture must be the start of a POC resetting period in its layer; consequently, in such an AU, each picture is the start of a POC resetting period in the layer containing it. POC resetting and the decreasing of POC values of same-layer pictures are applied only for the first picture within each POC resetting period, such that these operations are not performed more often than necessary; otherwise, POC values would be corrupted.
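Conceptually, and leaving aside the normative POC MSB/LSB signaling and loss resilience details, the resetting at the first picture of a POC resetting period behaves as in the following sketch (all names are illustrative):

#include <vector>

// Conceptual sketch of POC resetting: the POC of the current picture is
// reset, and the stored POC values of earlier pictures in the same layer
// are reduced by the same amount, keeping POC deltas proportional to
// presentation time deltas.
void startPocResettingPeriod(int& currPoc, std::vector<int>& storedLayerPocs)
{
  const int delta = currPoc;           // amount by which POC is reset
  for (int& poc : storedLayerPocs)
    poc -= delta;                      // keep earlier AUs cross-layer aligned
  currPoc = 0;                         // final POC of the current picture
}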
7) Hybrid Codec Scalability and Multiview Support:
The HEVC multilayer extensions support the base layer being coded by other codecs, e.g., AVC. A simple approach was taken for this functionality by specifying the necessary elements of a conceptual interface by which the base layer may be provided by the system environment in some manner that is not specified within the HEVC standard. Basically, except for information on the representation format and whether the base layer is a target output layer as signaled in the VPS extension, no other information about the base layer is included in the bitstream (as input to the enhancement-layer decoder).
8) Hypothetical Reference Decoder:
The main new developments of the HRD compared with HEVC version 1 include the following three aspects relevant for MV- and 3D-HEVC. First, the bitstream conformance tests specified for HEVC version 1 are classified into two sets and a third set is additionally specified. The first set of tests is for testing the conformance of the entire bitstream and its temporal subsets. The second set of bitstream conformance tests is for testing the conformance of the layer sets specified by the active VPS and their temporal subsets. For the first and second sets of tests, only the base layer pictures are decoded and other pictures are ignored by the decoder. The third set of tests is for testing the conformance of the OLSs specified in the VPS extension and their temporal subsets.
The second aspect is the introduction of bitstream partition (BP) specific coded picture buffer (CPB) operations, wherein each BP contains one or more layers, and CPB parameters for each BP can be signaled and applied. These parameters can be utilized by transport systems that transmit different sets of layers in different physical or logical channels; one extreme example is one channel for each layer. The layer specific CPB parameters are also a basis for defining the semantics of layer specific PTL. The third aspect is the layer specific DPB management operations, where each layer exclusively uses its own sub-DPB. To ensure the design works with (cascaded) layer switching behavior, sharing of a particular memory unit across layers is disallowed.
9) SEI Messages:
Supplemental enhancement information (SEI) messages in HEVC version 1 have been adapted to be applicable in the multilayer contexts, in a backward compatible fashion, some of them with significant semantics changes. In addition, some new SEI messages are specified that apply to all multilayer HEVC extensions.
Furthermore, the following new SEI messages are specified for MV-HEVC and 3D-HEVC: 1) the 3D reference displays information SEI message; 2) the depth representation information SEI message; 3) the multiview scene information SEI message; 4) the multiview acquisition information SEI message; and 5) the multiview view position SEI message. The latter three correspond to the SEI messages of the same name in MVC.
B. 3D-HEVC Specific HLS
The MV-HEVC HLS provides generic support for multilayer extensions, and therefore, only a few additional HLS features have been introduced in 3D-HEVC to support the signaling of depth layers, additional reference layers, tool parameters, and a new SEI message, as described in the following.
In MV-HEVC, the auxiliary ID can be used to signal that a layer is carrying depth. In 3D-HEVC a new scalability ID element called depth flag has been introduced. In contrast to layers indicating depth by the auxiliary ID, layers enabling the depth flag can use the new 3D-HEVC coding tools.
Reference layers additionally required for new inter-layer prediction methods are signaled in the VPS as in MV-HEVC. However, when a reference picture list is constructed, only pictures from the current component are included, such that inter-component sample prediction is avoided.
Enabling flags for several of the tools shown in Table I are signaled in an additional SPS extension. Moreover, camera parameters can be present in a VPS extension (when constant) or the slice header (when varying over time). Camera parameters allow the conversion of values of a depth picture to disparities by scaling and offsetting and are required by view synthesis prediction (VSP) (Section IV-C6) and depth refinement (Section IV-A2). A depth lookup table (DLT), utilized as described in Section IV-F2, can be signaled in a PPS extension.
Finally, the alternative depth information SEI message provides information required for alternative rendering techniques, based on global depth maps or warping.
3D-HEVC Techniques
An overview of the 3D-HEVC texture and depth coding tools is provided in Table I. Texture coding tools provide increased compression performance by applying new IV prediction techniques, or enhancing existing ones. Some of the texture coding tools derive disparity for IV prediction, or segmentation information from samples of an already decoded depth layer. These depth-dependent techniques can be disabled when texture-only coding is performed.
Improved coding of depth maps has also been introduced into 3D-HEVC. Since depth maps typically contain homogeneous areas separated by sharp edges, new intra-picture prediction and residual coding methods have been specified to account for these unique signal characteristics. In addition, new depth coding tools that allow for IV prediction of motion or the prediction of motion and partitioning information from texture layers have also been specified.
Some of the new prediction techniques allow prediction with higher accuracy by introducing sub-block partitions (SBPs), which in some cases can also subdivide a PB into two parts with a nonrectangular shape.
In the remainder of this section, the 3D-HEVC decoding processes as listed in Table I are discussed in detail. A new module, which forms the basis for several 3D-HEVC tools, is disparity derivation (Section IV-A). Further techniques modify or extend existing HEVC single-layer coding processes for block partitioning (Section IV-B), motion prediction (Section IV-C), inter-picture sample prediction (Section IV-D), intra-picture sample prediction (Section IV-E), and residual coding (Section IV-F). By design choice, several core elements of HEVC such as entropy coding, deblocking, sample adaptive offset (SAO), coding of quantized transform coefficients, the transform tree (except conditions for its presence), and AMVP have not been modified.
A. Disparity Derivation
The majority of 3D-HEVC coding techniques are based on IV prediction, wherein sample values, prediction residuals, subpartitioning, or motion information of a block in a picture of the current view are predicted from a reference block in a picture of a different view. To find a reference block, the disparity derivation process is invoked at the coding unit (CU) level to provide the VOI of a reference view (RV), to be used for IV prediction, and a predicted disparity vector (PDV). The PDV indicates the spatial displacement of the reference block in the RV relative to the position of the coding block (CB) in the current picture. In the following, the reference VOI (RVOI) and the associated PDV are referred to as predicted disparity information (PDI). The PDI for texture layers is derived as described in the following sections. For simplicity, the PDI for depth layers is constant within a slice and corresponds to an available RV and a disparity vector derived from a depth value of 128 (for 8-bit sample precision).
1) Neighboring Block Disparity Vector:
Neighboring block disparity vector (NBDV) operates without referring to depth layers to allow the prediction of PDI for applications in which only the texture information is of interest (referred to as texture-only coding). As such, the PDI is derived from motion information of temporally and spatially neighboring blocks [24].
The temporally neighboring blocks are located in two different pictures. The first picture is the collocated reference picture, which is the picture signaled for temporal MV prediction (TMVP) [8]. The second picture is chosen among the temporal reference pictures as the one whose blocks are more likely to be coded using DMVs [22]. The two temporally neighboring blocks cover the center position of the current block in these two pictures.
To determine the PDI for the current CU, the neighboring blocks are searched in a fixed order until a block coded with a DMV is found; the PDV is then set equal to this DMV, and the RVOI is set to the VOI of the view of the associated IV reference picture.
When the PDI cannot be derived from motion information of neighboring blocks as described above, a second search is applied, in which the PDI values stored for the spatially neighboring CUs are evaluated: when such a CU has derived a PDI, this PDI is inherited for the current CU.
If derivation from neighboring blocks or CUs fails, the VOI of an IV reference picture of the current picture is used as RVOI and the PDV is set equal to the zero vector.
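The structure of this derivation can be summarized by the following sketch, in which the types, member names, and candidate containers are illustrative stand-ins for the normative positions and ordering specified in [24]:

#include <vector>

struct Pdi   { int refViewIdx = 0; int dvX = 0, dvY = 0; };  // RVOI and PDV
struct Block { bool hasDmv = false; Pdi dmv; };              // neighbor motion info
struct Cu    { bool hasStoredPdi = false; Pdi pdi; };        // PDI stored per CU

Pdi deriveNbdv(const std::vector<const Block*>& nbBlocks,  // temporal and spatial
               const std::vector<const Cu*>& nbCus,        // spatially neighboring CUs
               int defaultRefViewIdx)
{
  for (const Block* nb : nbBlocks)       // first search: DMVs of neighboring blocks
    if (nb && nb->hasDmv) return nb->dmv;
  for (const Cu* cu : nbCus)             // second search: stored PDI of neighboring CUs
    if (cu && cu->hasStoredPdi) return cu->pdi;
  return { defaultRefViewIdx, 0, 0 };    // fallback: zero disparity vector
}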
2) Depth Refinement:
Although NBDV can operate without referring to depth layers, the accuracy of the PDI can be increased by exploiting the additional depth information. Since the texture of a view is coded before its depth in 3D-HEVC, the depth map of the current view is not available when coding texture. Instead, the PDI derived by NBDV is used to identify the block that corresponds to the current CB in the already decoded depth map of the RV. The PDV of the CU is then replaced by a refined disparity vector, derived from the maximum of the four corner sample values of this depth block [25].
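The refinement can be sketched as follows; the linear fixed-point mapping abstracts the normative depth-to-disparity conversion derived from the camera parameters (Section III-B), and scale, offset, and shift stand for those derived values:

#include <algorithm>
#include <cstdint>

// Convert a depth value to a horizontal disparity (abstraction of the
// camera-parameter-based conversion; parameters are placeholders).
int depthToDisparity(int depth, int scale, int offset, int shift)
{
  return (scale * depth + offset) >> shift;
}

// Refined disparity: the maximum of the four corner samples of the depth
// block identified by the NBDV result is converted to a disparity.
int refinedDisparityX(const uint8_t* depthBlock, int stride, int w, int h,
                      int scale, int offset, int shift)
{
  const int maxCorner = std::max(
      std::max(depthBlock[0], depthBlock[w - 1]),
      std::max(depthBlock[(h - 1) * stride], depthBlock[(h - 1) * stride + w - 1]));
  return depthToDisparity(maxCorner, scale, offset, shift);
}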
B. Partitioning Syntax Prediction
A new tool in 3D-HEVC, called quadtree limitation (QTL), predicts the partitioning of a depth CU from syntax elements of a collocated texture CU. By design choice, QTL is not available in I slices and IRAP pictures.
When a region of a texture picture includes only low frequencies, such that a coarse split into CBs is applied by an encoder, it can be assumed that high-frequency signal parts in the collocated region of the associated depth picture are either also not present or irrelevant for view synthesis. Therefore, when a depth block has the same position and size as a CB in the corresponding texture picture, the flag indicating a further split is not present in the depth coding quadtree and the depth block cannot be split into smaller CBs. Although this could have been implemented as an encoder-only restriction, the additional bit rate saving was considered beneficial.
For the same reasons, the PB partitioning of a depth CB having limited size is restricted by the texture CB. When the texture CB consists of a single PB, the same applies for the depth CB without additional signaling. When the texture CB is split horizontally or vertically, only splits in the same direction can be signaled. In the case that the texture CB is associated with four PBs, no restrictions apply for the depth CB.
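The resulting restriction on the PB partitioning can be summarized by the following sketch; treating the non-split case as permitted for two-part texture CBs is an assumption of this illustration:

enum class Part { Single, Horizontal, Vertical, FourParts };

// Sketch of the QTL prediction-partitioning restriction for depth CBs in
// P and B slices, given the partitioning of the texture CB of equal
// position and size.
bool depthPartAllowed(Part texture, Part depth)
{
  switch (texture) {
    case Part::Single:     return depth == Part::Single;  // inherited, not signaled
    case Part::Horizontal: return depth == Part::Single || depth == Part::Horizontal;
    case Part::Vertical:   return depth == Part::Single || depth == Part::Vertical;
    case Part::FourParts:  return true;                   // no restriction
  }
  return true;
}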
C. Motion Prediction
3D-HEVC specifies an extended candidate list for the merge mode. The list includes the conventional HEVC single-layer coding candidates (although the temporal MV candidate is predicted by a modified process), and several new candidates. Some of the new candidates are based on inter-layer prediction of motion information in SBP granularity, such that the granularity in the reference layer can be taken into account.
The derivation of merge candidates is performed in two separate steps. In the first step, an initial merge candidate list is derived as specified for HEVC single-layer coding, including the removal of redundant entries, but using the modified TMVP derivation as described below. In the second step, an extended merge candidate list is constructed from the initial list and additional candidates. To limit worst-case complexity, the second step is not applied for PUs with small luma PB sizes ($8\times 4$ and $4\times 8$), for which the initial list is used directly.
When constructing the extended list, candidates of the initial list from the spatially neighboring positions to the left of and above the current PB are compared with the T and the IV candidates, and redundant entries are removed.
Additional candidates are the texture (T) candidate, the IV candidate, the shifted IV (IVS) candidate, the VSP candidate, the disparity information (DI) candidate, and the shifted DI (DIS) candidate.
1) Extended Temporal Motion Vector Prediction for Merge:
In MV-HEVC (and single-layer HEVC), the reference index is always zero for a TMVP merge candidate [also denoted as collocated (Col) candidate]. Therefore, the reference indexes of the collocated block and the TMVP candidate may indicate reference pictures of different types (one temporal, the other IV), such that the TMVP candidate is not available. However, when this case occurs in 3D-HEVC, the reference picture index of the TMVP candidate is changed to an alternative value, which indicates an available reference picture having the same type as the reference picture of the collocated block [19]. Hence, the same type of prediction (either temporal or IV) is indicated by both the candidate and the collocated block, such that the MV of the TMVP candidate can be predicted from the MV of the collocated block. For MV prediction, scaling might be applied. If the collocated block refers to a temporal reference picture, the MV is scaled based on POC values, as described in Section II-A. Otherwise, when the collocated block uses IV prediction, scaling based on the view ID values, which correspond to spatial camera positions, is performed.
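Conceptually, the view-ID-based scaling mirrors the POC-based scaling of Section II-A (again, the normative process uses a clipped fixed-point approximation of the division):

$$\mathrm{MV}_{\mathrm{scaled}} = \mathrm{MV}_{\mathrm{col}} \cdot \frac{\mathrm{ViewId}_{\mathrm{cur}} - \mathrm{ViewId}_{\mathrm{ref,cur}}}{\mathrm{ViewId}_{\mathrm{col}} - \mathrm{ViewId}_{\mathrm{ref,col}}}$$

where the subscripts denote the pictures containing the current PB and the collocated block and their respective reference pictures.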
2) Disparity Information Candidates:
The PDI derived for the current CU can be used to identify a reference block for IV sample prediction. When a reference picture list of the current picture includes a picture from the RV, the DI candidate is available and its motion information related to this list is given by the PDV (with vertical component set to zero) and the reference index to the picture [26].
The DIS candidate is derived from the DI candidate by adding an offset of 1 in units of luma samples to the horizontal MV components of the DI candidate. The primary motivation for the DIS candidate is that the PDV (predicted as described in Section IV-A) used to derive the DI candidate may not always match the actual disparity, such that offering an additional choice can improve performance.
3) Sub-Block Motion Prediction:
Conventionally, a merge candidate (e.g., an initial list candidate) consists of a single set of motion information (up to two MVs and their reference picture indexes, and a reference picture list indication), which might be used for inter-picture prediction of the entire current PB. In 3D-HEVC, a PB can be further subdivided into rectangular SBPs (e.g., as depicted in Fig. 3), when the T, the IV, or the VSP candidate is selected. For this purpose, these candidates consist of multiple sets of motion information each for inter-picture prediction of one SBP of the current PB. Thus, motion information can be inherited with a finer granularity.
3D-HEVC supports a smallest bipredictive SBP size of $8\times 8$; the SBP sizes used for the T and the IV candidates are signaled in the SPS extension. When the current PB is smaller than the signaled SBP size, a single SBP equal to the PB is used.
4) Inter-View Candidates:
Prediction of MVs from other views has been proposed, e.g., in [27]. Based on this idea, the IV candidate inherits motion information from a picture included in the same AU and the RV. An example of the derivation, with the current PB divided into several SBPs, is depicted in Fig. 3.
First, the position of the corresponding block in the picture of the RV is identified by adding the PDV to the center position of the current SBP. When the corresponding block is coded using inter-picture prediction, its MVs might be inherited for the current SBP. The condition for inheritance is that a reference picture list of the current picture includes a picture with the same POC as the reference picture of the corresponding block. When such a picture is available, the motion information for the current SBP is set equal to the MV of the corresponding block and the index of the picture in the reference picture list of the current picture. Since the condition ensures that the temporal distance between the picture containing the corresponding block and its reference picture equals the temporal distance between the picture containing the current SBP and its reference picture, POC-based MV scaling is not necessary. When the corresponding MV cannot be inherited for the current SBP, it inherits the motion information of the default SBP, which is the SBP whose top-left corner sample is closest to the center of the current PB. In the case that motion information is also not available for the default SBP, the IV candidate is not available.
In some cases, the PDV can be inaccurate, such that the prediction error of the IV candidate is high. Here, the IVS candidate can be an alternative, as it is based on another disparity assumption: to derive the IVS candidate, the same method as for the IV candidate is applied, but with a single SBP (equal to the PB) and with additional horizontal and vertical offsets (equal to half the width and half the height of the PB, respectively) added to the PDV.
5) Texture Candidate:
The derivation process of the T candidate is similar to that of the IV candidate. However, whereas the IV candidate is inherited from blocks of the same component in the RV, the T candidate is derived from the texture component in the same view. Hence, instead of a corresponding block, a collocated block is identified at the center position (marked in Fig. 2) of each SBP in the texture picture of the same view and AU. When the collocated block is coded using inter-picture prediction and a reference picture list of the current picture includes a picture with the same POC as its reference picture, the motion information of the collocated block is inherited for the SBP; since depth layers use full-sample motion accuracy (Section IV-D4), the inherited MVs are rounded accordingly.
6) View Synthesis Prediction:
In view synthesis, a picture is conventionally rendered by shifting samples of a texture picture by disparities obtained from a depth map. For 3D-HEVC sample prediction, the same principle can be applied in coarser granularity by disparity compensation of a reference block in a texture picture. More specifically, when the VSP candidate is selected, IV sample prediction is performed for the SBPs of the current PB using MVs obtained from its corresponding depth block [30]. As for disparity refinement, the corresponding depth block is identified in the depth picture of the RV using the PDV derived by NBDV.
The VSP candidate can be chosen by signaling its index in the merge candidate list or, in case that a spatially neighboring PB uses VSP, by selecting the corresponding spatial candidate. When the VSP candidate is selected, the current PB is partitioned into SBPs of size $8\times 4$ or $4\times 8$, where the partitioning direction is derived from the corner sample values of the corresponding depth block.
After partitioning, the motion information of the SBPs is derived. For complexity reduction, all SBPs use uniprediction from the texture picture of the RV. Moreover, the vertical component of MVs is always set equal to zero. The horizontal component is derived for each SBP by converting the maximum value of the four corner samples (marked with # in Fig. 2) of its corresponding depth SBP within the corresponding depth block.
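The per-SBP motion derivation can be sketched as follows, reusing the depth-to-disparity abstraction of Section IV-A2 (again with placeholder conversion parameters):

#include <algorithm>
#include <cstdint>

int depthToDisparity(int depth, int scale, int offset, int shift);  // as in Section IV-A2

struct Mv { int x, y; };

// VSP motion for one SBP: uni-directional IV prediction whose horizontal
// component is converted from the maximum of the four corner samples of
// the corresponding depth SBP; the vertical component is zero.
Mv vspSbpMv(const uint8_t* depthSbp, int stride, int w, int h,
            int scale, int offset, int shift)
{
  const int maxCorner = std::max(
      std::max(depthSbp[0], depthSbp[w - 1]),
      std::max(depthSbp[(h - 1) * stride], depthSbp[(h - 1) * stride + w - 1]));
  return { depthToDisparity(maxCorner, scale, offset, shift), 0 };
}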
D. Inter-Picture Sample Prediction
3D-HEVC extends inter-picture sample prediction in texture layers by three techniques: residual prediction (RP) exploits the correlation of sample prediction errors in different views or AUs; illumination compensation (IC) enables adaptive weighting of IV sample predictions; and depth-based block partitioning (DBBP) combines two predictions for a texture CB according to a subpartitioning derived from a corresponding depth block. For depth coding, motion compensation has been simplified.
1) Residual Prediction:
In texture layers, the energy of the residual of the current PB may be reduced by performing additional motion compensation either in the RV or in a different AU. The concept of RP is to reuse the MVs of the current PB to predict the residual signal. The predicted residual is calculated and added on top of the motion-compensated signal derived with the MVs of the current PB for each used prediction direction. To accommodate possible quantization differences, a CU-level weighting factor $w \in \{0, 1/2, 1\}$ is signaled, which scales the predicted residual; a weighting factor of 0 indicates that RP is not applied.
Since the additional motion compensation to derive the residual signal would require a significant increase in memory bandwidth and calculations, the following design tradeoffs were made.
Only CUs associated with a single PU may have RP enabled, and thus, the smallest luma PB size with RP enabled is $8\times 8$.
RP is not used for chroma components of PBs of luma size $8\times 8$.
Bi-linear interpolation is used in RP enabled CUs, both for the motion compensation that calculates the residual signal and for the motion compensation between the current PB and the reference block identified by its MV.
a) Inter-view residual prediction:
As illustrated in Fig. 4(a), IV RP can be applied when the current PB uses temporal prediction with a TMV. The block corresponding to the current PB in the RV is identified by the PDV, and the predicted residual is calculated as the difference between this corresponding block and its motion-compensated reference block, which is identified by reusing the TMV of the current PB.
b) Temporal residual prediction:
As illustrated in Fig. 4(b), temporal RP can be applied when the current PB uses IV prediction with a DMV. In this case, a TMV is obtained from the block in the RV corresponding to the current PB, and the predicted residual is calculated in a different AU as the difference between the block of the current view identified by the TMV and the block of the RV identified by applying both the TMV and the DMV.
c) Further constraints:
Which RP variant is applied follows from the type of the signaled (or merged) MV of the current PB for each prediction direction: IV RP is used for prediction with a TMV, and temporal RP for prediction with a DMV.
A bidirectionally predicted PB, with the two MVs denoted as $\mathrm{MV}_0$ and $\mathrm{MV}_1$, applies the RP process separately for each of its two prediction directions.
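Putting the pieces together, the prediction with RP for one used prediction direction can be written as

$$\mathrm{pred} = P + w \cdot \left( B - B_{\mathrm{ref}} \right),$$

where $P$ is the motion-compensated prediction of the current PB, and $B$ and $B_{\mathrm{ref}}$ are the two blocks whose difference forms the predicted residual (the corresponding block and its reference block, as described above).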
2) Illumination Compensation:
The purpose of IC is to improve IV prediction when there are illumination mismatches between views. This is done by applying a scale factor and offset to the prediction samples. IC is separately applied for each unidirectional IV sample prediction of a PB (hence twice for biprediction). For complexity reduction, IC can be signaled only in CUs associated with a single PU not using RP.
The scale and offset values are calculated by matching a set of samples in the reference picture to a set of samples in the current picture [33]. The set in the current picture is given by the samples spatially adjacent to the top and left of the current PB, where only every second sample is used to keep complexity low. Accordingly, the set of the reference picture includes every second sample adjacent to the top and left of the block used as reference for IV prediction. A linear least square solution is approximated using a lookup table to avoid a division operation. For chroma PBs, IC is even simpler since it only derives and applies the offset.
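The parameter derivation corresponds to a linear least squares fit over the two subsampled neighbor sets; the following sketch uses floating point for clarity, whereas the normative process uses integer arithmetic and replaces the division by a lookup table:

#include <vector>

struct IcParams { double a; double b; };  // prediction becomes a * refSample + b

// Fit a and b such that a * refNb[i] + b approximates curNb[i]; the two
// vectors hold the subsampled boundary samples of the reference block and
// the current PB, and are assumed to be non-empty and of equal size.
IcParams deriveIc(const std::vector<int>& refNb, const std::vector<int>& curNb)
{
  const int n = static_cast<int>(refNb.size());
  long long sumR = 0, sumC = 0, sumRR = 0, sumRC = 0;
  for (int i = 0; i < n; ++i) {
    sumR  += refNb[i];            sumC  += curNb[i];
    sumRR += refNb[i] * refNb[i]; sumRC += refNb[i] * curNb[i];
  }
  const double denom = 1.0 * n * sumRR - 1.0 * sumR * sumR;
  const double a = (denom != 0.0) ? (1.0 * n * sumRC - 1.0 * sumR * sumC) / denom : 1.0;
  const double b = (sumC - a * sumR) / n;
  return { a, b };
}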
3) Depth-Based Block Partitioning:
DBBP predicts segmentation information from an already decoded depth map to improve the compression of a dependent texture [34]. It is invoked by a flag that can be present when a texture CB uses a two-part ($2N\times N$ or $N\times 2N$) partitioning. When DBBP is used, a binary segmentation mask is derived by thresholding the corresponding depth block (identified using the PDV), two full-size motion-compensated predictions of the CB are generated from the two signaled sets of motion information, and the two predictions are merged sample-wise according to the mask.
This way, if the depth map bears reliable information about the position of an object boundary, different motion compensation modes could be used at both sides of that boundary, and the prediction can be improved without using small PB sizes.
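A sample-wise sketch of the prediction merge is given below; using the mean of the depth block as the threshold is an illustrative choice of this sketch, and boundary filtering of the merged prediction is omitted:

#include <cstdint>

// Merge two full-size motion-compensated predictions of a texture CB
// according to a binary mask derived by thresholding the corresponding
// depth block.
void dbbpMerge(const uint8_t* depth, const int16_t* pred0, const int16_t* pred1,
               int16_t* out, int w, int h)
{
  long sum = 0;
  for (int i = 0; i < w * h; ++i) sum += depth[i];
  const int thresh = static_cast<int>(sum / (w * h));
  for (int i = 0; i < w * h; ++i)
    out[i] = (depth[i] >= thresh) ? pred0[i] : pred1[i];  // per-sample selection
}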
4) Full Sample Motion:
Fractional sample interpolation at sharp edges in depth maps can create ringing artifacts. For this reason, 3D-HEVC supports only full-sample motion accuracy for depth layers. Further benefits of this approach are a reduced bit rate for MV signaling and a reduced complexity of motion compensation.
E. Depth Intra-Picture Prediction
In addition to the intra-picture prediction modes provided by HEVC single-layer coding, which are unchanged in 3D-HEVC, a new skip mode and three new prediction modes, called Intra_Single, Intra_Wedge, and Intra_Contour, are available for intra-picture coding in depth layers. The new skip mode allows an early signaling of frequently used intra-picture prediction modes. The three prediction modes have been introduced for efficient representation of sharp edges and homogeneous areas, which are typical in depth maps. The Intra_Single mode signals a single boundary sample value as prediction for the entire PB and can only be applied together with the intra-picture skip mode. In the Intra_Wedge and Intra_Contour modes, a subdivision of the PB into two SBPs is signaled or derived. The SBPs are not required to have a rectangular shape (as depicted in Fig. 3). For each SBP, a DC value is predicted from decoded boundary samples of adjacent blocks. The new intra-picture prediction modes differ from the conventional intra-picture prediction modes in two common aspects: first, boundary value smoothing is not applied; and second, PBs using the new modes are unavailable in the derivation of most probable modes.
1) Intra_Wedge and Intra_Contour:
The Intra_Wedge and the Intra_Contour modes [35] are signaled at the PU level by two flags. The first flag indicates that one of the two new modes is used instead of a conventional intra-picture prediction mode. The other flag indicates which mode is used.
When a PU uses the Intra_Wedge mode, the subpartitioning is explicitly signaled by an index value referring to a set of binary patterns denoted as wedgelets. The set of wedgelets contains patterns resulting from a segmentation of the PB with straight lines. The number of wedgelets in the set depends on the PB size. The wedgelets can be either created on-the-fly when decoding the PU, or be created and stored in advance.
In the Intra_Contour mode, the subpartitioning of the PB is inter-component predicted from a collocated block in the texture picture of the same view and AU. For segmentation, first, a threshold is derived by averaging the four corner sample values of the collocated texture block, and then a binary pattern representing the SBPs is derived by comparing the sample values of the texture block with the threshold. Due to the thresholding, the area of an SBP can also be noncontiguous.
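A sketch of the pattern derivation (with illustrative rounding) is given below:

#include <cstdint>
#include <vector>

// Derive the binary Intra_Contour pattern: the threshold is the average of
// the four corner samples of the collocated texture block, and each sample
// position is assigned to one of the two SBPs by comparison.
std::vector<uint8_t> contourPattern(const uint8_t* tex, int stride, int w, int h)
{
  const int thresh = (tex[0] + tex[w - 1] + tex[(h - 1) * stride] +
                      tex[(h - 1) * stride + w - 1] + 2) >> 2;  // rounded average
  std::vector<uint8_t> pattern(w * h);
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      pattern[y * w + x] = (tex[y * stride + x] > thresh) ? 1 : 0;
  return pattern;
}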
After generation of the binary pattern defining the two SBPs, the same sample prediction process is applied in the Intra_Wedge and Intra_Contour modes. The process is invoked for each of the two SBPs and provides a DC prediction from a subset of the decoded boundary samples [depicted in Fig. 5(a)] of blocks adjacent to the current PB. How the prediction DC is calculated depends on which of the neighboring samples of this subset are adjacent to the current SBP: the DC value is derived by averaging the adjacent samples of the subset.
A special case occurs when none of the subset samples is a neighbor of the current SBP [as for the SBP marked in Fig. 5(b)]; its DC value is then derived from other decoded boundary samples adjacent to the PB.
2) Intra-Picture Skip and Intra_Single:
The intra-picture skip mode can be applied in depth layers for CUs that do not use the conventional (inter-picture) skip mode [7]. When the intra-picture skip mode is used, the CU contains only three syntax elements: 1) the skip flag, equal to 0; 2) a flag indicating the intra-picture skip mode, equal to 1; and 3) an index that selects the prediction mode. Similar to the (inter-picture) skip mode, other syntax elements are not present and the CU is associated with a single PB.
Depending on the signaled index [37], the prediction is derived by the horizontal or vertical Intra_Angular mode [7] or by the Intra_Single mode. In the Intra_Single mode [38], [37], the prediction for the whole PB is a single value which is, again depending on the signaled index, the value of a decoded boundary sample adjacent to the left of or above the PB.
F. Depth Residual Coding
As described in Section IV-B, high-frequency components of blocks in a depth map can be irrelevant for view synthesis, such that the depth DC becomes more important. To efficiently preserve the DC component of the prediction residual (denoted as DC offset) in 3D-HEVC, it can be explicitly signaled in addition to, or as an alternative to, quantized transform coefficients, where the latter is referred to as DC-only mode. DC offset coding has furthermore been extended by a DLT technique, which exploits the fact that the value range of depth samples is often only sparsely used.
1) DC(-Only) Coding:
The DC-only mode can be enabled by a flag present in CUs associated with a single PU and not coded in skip or intra-picture skip mode. When enabled together with intra- or inter-picture prediction modes as already specified in single-layer HEVC, one DC offset is signaled and added on top of the intra-picture predicted [39] or motion compensated signal. In the Intra_Contour or the Intra_Wedge modes, one DC offset is present for each of the two SBPs.
Moreover, when a PU is coded in Intra_Wedge or Intra_Contour mode and does not use DC-only coding, it can be assumed to contain an edge that is relevant for view synthesis. For better preservation, such PUs can signal both DC offsets and quantized transform coefficients [40].
A flag in the PPS indicates if the DC offsets are directly added to each sample of the prediction, or if they are scaled before in a nonlinear process using the DLT.
2) Depth Lookup Table:
Samples of a depth map are typically represented with 8 bits of precision, although only a small set of distinct depth values, potentially nonuniformly distributed over the value range, might be used. To map a compressed range of consecutive index values to such a set of distinct depth values, a DLT can be transmitted in the PPS [41]. When the DLT is present, coding performance is increased by signaling DC offsets in the compressed index range instead of DC offsets with higher magnitude in the depth sample range. For this, an encoder derives the index offset to be signaled as the difference between the DLT index of the desired depth value and the DLT index of the predicted depth value; a decoder reconstructs the depth value by adding the signaled index offset to the index of the predicted value and mapping the result back through the DLT.
In Intra_Wedge or Intra_Contour mode, the predicted depth value used for this mapping is the DC value predicted for the respective SBP; for the other prediction modes, it is derived from the prediction signal of the PB.
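The mapping can be sketched as follows; clipping of the resulting index and the handling of predicted values that are not contained in the DLT are omitted, and the inverse table is an implementation-side construct:

#include <cstdint>
#include <vector>

// Reconstruct a depth value from a DC offset signaled in the index domain:
// the predicted value is mapped to its DLT index, the signaled offset is
// added, and the result is mapped back to the depth domain.
int reconstructDepth(const std::vector<uint8_t>& idx2depth,  // signaled DLT
                     const std::vector<int>& depth2idx,      // inverse of the DLT
                     int predDepth, int signaledIdxOffset)
{
  const int predIdx = depth2idx[predDepth];
  const int recIdx  = predIdx + signaledIdxOffset;
  return idx2depth[recIdx];
}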
Profiles
The second edition of HEVC specifies one profile for MV-HEVC, which is the Multiview Main profile, or simply MV Main profile. For backward compatibility, the MV Main profile requires the base layer to conform to the Main profile. Moreover, block-level coding tools of enhancement layers are similarly constrained to enable reusing legacy decoder hardware below the slice level. Hence, only 4:2:0 chroma and a sample precision of 8 bits are supported. A constraint introduced to limit complexity of IV prediction is that the number of reference layers (including indirect reference layers) used by a layer must not be greater than 4.
For 3D-HEVC the 3D Main profile specifies a superset of the capabilities of the MV Main profile, such that a 3D Main profile conforming decoder is able to decode MV Main profile conforming bitstreams. The base layer is required to conform to the Main profile. New low-level coding techniques of 3D-HEVC are supported only by enhancement layers that are not auxiliary picture layers. Texture layers support only 4:2:0 chroma and depth layers support only monochrome. For both components, the sample precision is restricted to 8 bits.
Furthermore, both profiles disallow inter-layer prediction between layers that use different picture sizes.
Compression Performance
To evaluate the compression efficiency of the different extensions, simulations were conducted using the reference software HTM [42] and the experimental evaluation methodology that has been developed and is being used by the standardization community [43], [44]. In that framework, multiview texture video and the corresponding depth can be provided as input, while the decoded views and additional views synthesized at selected positions can be generated as output. For evaluation, two setups have been used, as shown in Table III.
The first setup evaluates the typical use case foreseen for MV-HEVC, which is the coding of stereo video without depth (hence of two texture layers, one base view and one enhancement view), comparing MV-HEVC with HEVC simulcast coding of the two views.
Although not shown in Table III, it is worth mentioning that only a modest bit rate saving of about 6% on average (26% for the enhancement layers only) is achieved by 3D-HEVC compared with MV-HEVC for the stereo case. However, the target application of 3D-HEVC is the coding of data suitable for view synthesis at autostereoscopic displays. For this, a set of eight sequences, each comprising texture and depth of three views, was coded, and additional views were synthesized at selected positions from the decoded texture and depth data for evaluation.
Conclusion
Experts of ITU-T VCEG and ISO/IEC MPEG have jointly developed the multiview and 3D extensions of HEVC. Both extensions allow the transmission of texture, depth, and auxiliary data for advanced 3D displays. Increased compression performance compared with simulcast HEVC is achieved by inter-layer prediction. In contrast to 3D-HEVC, MV-HEVC can be implemented without block-level changes. However, thanks to its advanced coding techniques, 3D-HEVC provides higher coding efficiency, in particular when depth maps have to be coded.
As both extensions have been developed to support stereoscopic and autostereoscopic displays, they have not been specifically designed to handle arrangements with a very large number of views or arbitrary view positioning. Accordingly, improved capabilities for supporting such configurations may be a subject for future standardization.
Acknowledgment
The authors would like to thank all the experts who have contributed to the development of MV-HEVC and 3D-HEVC.