Introduction
Versatile Video Coding (VVC, a.k.a. ITU-T Rec. H.266 and ISO/IEC 23090-3) is the latest international video-compression standard, jointly finalized by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) in July 2020 [1], [2]. The project for this new standard was announced by the so-called “Joint Video Exploration Team (JVET)” of VCEG and MPEG in October 2017 with a joint call for proposals on video compression with capability beyond High Efficiency Video Coding (HEVC) [3], and the project was officially launched in April 2018. At the same time, JVET was renamed the “Joint Video Experts Team”. The main goal of VVC is to address two aspects of industry needs for a future video coding standard: (a) higher coding efficiency than HEVC, with over 50% bitrate reduction at the same subjective quality for SDR/HDR content, with picture sizes at least covering from VGA to
Although the VVC standard inherits the framework of block-based hybrid coding, similar to HEVC, it adopts several highly adaptive and sophisticated coding tools. In general, VVC follows a multi-type tree structure (i.e., quaternary, binary and/or ternary tree) to split a picture into a variety of block shapes (i.e., square or non-square). Each block is a basic unit for signaling prediction information. Then, intra prediction and/or inter prediction operates on a block-by-block or subblock-by-subblock basis [5] within the basic unit, followed by transform and quantization processes with switchable bases for residual coding, a chain of in-loop filters (i.e., deblocking, sample adaptive offset, and adaptive loop filtering [6]) for subjective quality improvement, and syntax coding/parsing for transmission.
In a video signal, high temporal redundancy exists between successive pictures. Therefore, inter prediction, which targets reducing this temporal redundancy, makes a major contribution to video compression capability and plays a key role in the hybrid video coding scheme. In VVC, many novel coding tools were developed to further improve inter prediction. In general, those tools can be classified into two major groups, depending on whether the whole block shares the same set of motion information: “whole block-based inter prediction”, wherein only one set of motion information is used for the block, and “subblock-based inter prediction”, wherein each subblock can have its own set of motion information.
This paper covers both algorithm descriptions and performance analysis of the whole block-based inter-prediction coding tools, while subblock-based inter prediction is overviewed in another paper of this special issue [7]. Basically, the whole block-based inter-prediction coding tools include the extended adaptive motion vector prediction (AMVP) mode and block merging, which are also employed in the HEVC inter-prediction scheme, and multiple other coding tools introduced during the VVC standardization work. Therefore, following the inter-prediction scheme, the coding tools are categorized into three topics, AMVP, merge, and others, as follows:
Motion vector (MV) coding: an MV predictor (MVP) candidate list with two candidates generated from spatial/temporal MV predictors or from MVs in history-based motion vector prediction (HMVP) tables [8], [9]; symmetric MV difference (SMVD), which signals a pair of symmetric bi-prediction MV differences (MVDs) with slice-level indicated reference pictures [10], [11]; adaptive MV resolution (AMVR) for MV predictors and MVDs at 1) quarter- to four-luma-sample or 2) 1/16- to one-luma-sample precision, depending on the selected motion model, i.e., 1) for the translational motion model and 2) for the affine motion model [13]–[18];
Block merging: a block-merging candidate list with at most six candidates generated from spatial/temporal candidates, candidates from HMVP tables, and a synthetic pairwise motion vector predictor; geometric partitioning mode (GPM), which geometrically splits a block at an angle with a power-of-two tangent and a boundary-shifting offset [22]–[24]; merge mode with MV difference (MMVD), which allows adding power-of-two 1-D offsets to either the horizontal or vertical component of a selected merge candidate [22], [23]; combined inter-intra prediction (CIIP), which generates a prediction block as a weighted combination of a planar intra predictor and the motion-compensated temporal predictor of a selected merge candidate [25], [26]; merge estimation region (MER), which allows independent/parallel derivation of the MV prediction list and merge candidate list for coding units (CUs) inside the region [27];
Others: bidirectional prediction with coding unit weights (BCW) to introduce non-equal weights at the CU level for bi-prediction [28]–[30]; motion vector compression and range to encode motion fields on reference pictures by using a 10-bit mantissa-exponent representation at every $8\times 8$ grid [31].
These coding tools offer more encoding options for VVC to efficiently represent the motion fields of coded blocks with complex object motion. For example, the VVC standard offers several motion models, including the translational model and the 4- and 6-parameter affine models, to accommodate more video content types with complex structures while ensuring coding efficiency. The effectiveness of each individual coding tool, as reported during the development of VVC, has justified the increase in encoding/decoding complexity [43].
This paper is organized as follows. Section II presents the motion data coding in the HEVC standard, including the AMVP mode and merge mode. Multiple aspects of inter prediction in the VVC standard are overviewed in Section III. Sections IV, V and VI give detailed descriptions of the individual coding tools and the interactions among them. Tool-by-tool testing results are given in Section VII to explore the coding performance and analyze the implementation complexity, and Section VIII concludes this paper.
Motion Data Coding in HEVC
In HEVC, inter prediction is represented by two modes: the AMVP mode and merge mode, wherein reference picture indices and MVDs are signaled in the former mode but not signaled in the latter one. Skip mode is a special merge mode, in which residuals are inferred to be zero and thus not signaled. In this section, we briefly review the two modes in HEVC.
The AMVP mode originates from MV competition [32], wherein the best MVP can be selected according to rate-distortion cost. For the AMVP mode in HEVC, motion vector predictors are used to exploit spatio-temporal correlations of MVs among prediction units (PUs). The encoder selects the best MVP from an MVP candidate list and transmits the corresponding index together with the reference picture index and MVD. An MVP candidate list with up to two candidates is constructed for each reference picture list, following the flow depicted in Fig. 1. More specifically, the MVPs from spatial neighboring PUs to the left of and above the current PU, an MVP derived from temporal motion vector prediction (TMVP), and zero MVPs are added in order to fill the candidate list. The locations of the five neighboring PUs are denoted A0, A1, B0, B1 and B2 in Fig. 2. The left spatial neighboring MVP candidate can be derived from A0 and A1 when at least one of them is inter predicted, and the above spatial neighboring MVP candidate can be derived from B0, B1 and B2 when at least one of them is inter predicted. If no MV in blocks A0 and A1 refers to the target reference picture, an MV scaled according to the temporal distance between the reference picture associated with the left MVP and the current reference picture is output as the left MVP. Similarly, a scaled MV is output as the above MVP if no MV in blocks B0, B1 and B2 refers to the target reference picture. The above spatial neighboring MVP candidate is discarded if it is identical to the left neighboring MVP candidate. The TMVP candidate is derived by scaling an MV stored at location H or C (shown in Fig. 2) in the collocated picture to the target reference picture. Location H is checked first; if no MV is available at location H, location C is checked.
Five neighboring spatial locations (left blocks: A0, A1; above blocks: B0, B1 and B2) and locations of collocated blocks for TMVP (H and C) of the current block.
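The two-candidate list construction described above can be sketched as follows. This is an illustrative simplification, not the normative process: the availability and scaling rules for deriving `left_mvp`, `above_mvp` and `tmvp` are assumed to have been applied already, and all names are hypothetical.

```python
# Sketch of HEVC AMVP list construction: left/above spatial MVPs,
# then the TMVP, then zero MVPs, capped at two candidates.
def build_amvp_list(left_mvp, above_mvp, tmvp, max_cands=2):
    """Each argument is an (mvx, mvy) tuple, or None if unavailable."""
    cands = []
    if left_mvp is not None:
        cands.append(left_mvp)
    # The above candidate is discarded if identical to the left one.
    if above_mvp is not None and above_mvp != left_mvp:
        cands.append(above_mvp)
    if len(cands) < max_cands and tmvp is not None:
        cands.append(tmvp)
    while len(cands) < max_cands:
        cands.append((0, 0))  # pad with zero MVPs
    return cands[:max_cands]
```

For example, when the above candidate duplicates the left one, the TMVP fills the second slot.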
With the merge mode in HEVC, motion information of the current PU can be directly inherited from spatial or temporal neighboring blocks [33]. A merge candidate list with five candidates is constructed as demonstrated in Fig. 3. Like in the AMVP mode, the encoder selects the best merge candidate from the candidate list and transmits the corresponding index, but without any reference index or MVDs.
To derive spatial merge candidates, a maximum of four merge candidates are selected among candidates located at the positions depicted in Fig. 2. The order of derivation is A1, B1, B0, A0 and B2. Position B2 is considered only when any PU at positions A1, B1, B0 and A0 is not available (e.g., because it belongs to another slice or tile) or is not inter predicted. After the candidate at position A1 is added, the addition of the remaining candidates is subject to a redundancy check, which ensures that candidates with the same motion information are excluded from the list to improve coding efficiency. To reduce computational complexity, only the pairs linked with an arrow in Fig. 4 are compared, and a candidate is added to the list only if it passes the redundancy check. In HEVC, a CU may be partitioned into PUs, which may introduce redundancy in the merge mode. Fig. 5 depicts the “second PU” partitioned from a CU by
The second PU partitioned by (a)
In the derivation of the temporal merge candidate, the TMVP candidate is derived from MVs stored at location H or C as shown in Fig. 2 in the collocated picture, similar to the TMVP candidate for AMVP mode. For a TMVP candidate in the merge candidate list, the MVs will be scaled to the reference picture with reference index 0 in the corresponding reference picture list.
Besides spatio-temporal merge candidates, there are two additional types of merge candidates: combined bi-predictive merge candidates and zero-motion candidates with a (0, 0) motion vector. Combined bi-predictive merge candidates are generated from spatio-temporal merge candidates, for B-slices only. A combined bi-predictive candidate is generated by combining a first MV, referring to reference list 0, of a first merge candidate and a second MV, referring to reference list 1, of a second merge candidate, where the first and second merge candidates are selected from the available candidates in the merge candidate list according to a pre-defined order. The two MVs form a new bi-predictive candidate. If the merge candidate list is still not full, zero merge candidates are appended to fill it up.
Overview of Motion Vector Coding and Block Merging in VVC
As aforementioned, VVC supports both whole block-based and subblock-based inter-prediction tools. Whole block-based inter prediction has been widely used for decades in earlier video coding standards such as HEVC. With whole block-based inter prediction, a set of motion information, comprising MVs and reference pictures, is assigned to a block, and motion compensation (MC) is performed on the whole block with that set of motion information. Unlike whole block-based inter prediction, subblock-based inter prediction first divides a block into subblocks, e.g.
In the following subsections, the whole block-based inter coding tools newly adopted in VVC are described.
A. Motion Vector Predictor From HMVP Tables
In HEVC, there are two types of MVPs, i.e., spatial MVPs and temporal MVPs, which utilize motion information from spatially adjacent or temporal blocks. In VVC, a new type of MVP, the HMVP, is introduced. The basic idea of HMVP is to further use previously coded MVs as MVPs; these MVs may be associated with adjacent or non-adjacent blocks relative to the current block. To track available HMVP candidates, a table of HMVP candidates is maintained at both the encoder and decoder and updated on the fly. Whenever a new CTU row starts, the table is reset to ease parallel coding. There are up to five candidates in the HMVP table. After coding an inter-predicted block that does not use a subblock mode (including affine mode) or GPM, the table is selectively updated by appending the associated motion information to the end of the table as a new HMVP candidate. A restricted first-in-first-out (FIFO) rule is applied to manage the table: when the new candidate duplicates an existing entry, the redundant entry is removed first, instead of the oldest one. With HMVP, the motion information of previously coded blocks can be utilized for more efficient motion vector prediction even when those blocks are not spatially adjacent to the current block.
The HMVP candidates can be added to the AMVP candidate list as well as the merge candidate list.
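The restricted-FIFO table update described above can be sketched as follows. The function name and the tuple representation of motion information are illustrative assumptions, not the standard's syntax.

```python
# Sketch of the restricted-FIFO HMVP table update: a duplicate entry is
# removed before appending the new motion at the end; otherwise, when the
# table is full, the oldest (first-in) entry is dropped.
MAX_HMVP = 5

def update_hmvp(table, motion):
    if motion in table:
        table.remove(motion)   # remove the redundant candidate first
    elif len(table) == MAX_HMVP:
        table.pop(0)           # drop the oldest entry
    table.append(motion)       # newest candidate goes to the end
    return table
```

Note how a re-occurring candidate is moved to the most-recent position rather than duplicated.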
B. Motion Vector Coding
In addition to the HMVP candidate added to the AMVP mode, there are other new features for the AMVP mode in VVC. The SMVD technology sets the MVD for reference list 1 as a mirror of the MVD for reference list 0 to save the overhead of coding MVDs. The AMVR technology allows the MVDs of a block to be signaled at quarter-, half-, one- or four-luma-sample resolution for the translational motion model, which also saves MVD bits. These two coding tools can be applied together for one block. Besides, to provide more precise motion compensation, the MV precision is 1/16 luma sample in VVC instead of 1/4 luma sample in HEVC.
C. Block Merging
In VVC, merge/skip mode is more sophisticated than that in HEVC. First, besides the neighboring block-based merge candidates similar to those in HEVC, two new types of merge candidates are added into the merge candidate list, namely HMVP merge candidates and the pairwise average merge candidate. Second, besides the regular merge mode which is similar to merge mode in HEVC, VVC adopts three additional merge modes known as MMVD mode, CIIP mode and GPM.
The newly introduced HMVP and pairwise average merge candidates are put into the merge candidate list after the spatial or temporal neighboring block-based merge candidates. The pairwise average merge candidate in VVC replaces the combined bi-predictive merge candidates in HEVC. The pairwise average merge candidate is put after HMVP candidates and is generated by averaging the MVs of the first two available merge candidates in the merge candidate list. As in HEVC, a merge estimation region is adopted by VVC to facilitate the hardware design for the encoder.
The three additional merge/skip modes help VVC adapt better to a variety of video content. MMVD serves as an intermediate motion representation between merge/skip mode and AMVP mode. An MVD index is signaled for a merge candidate and represents an MVD, or a pair of MVDs, limited to four directions and eight distances. The combination of inter prediction and intra prediction has been studied for many years [34]–[36]. With CIIP as adopted in VVC, only the planar mode is used to generate the intra-prediction block, and a merge candidate is used to generate the inter-prediction block. The final prediction is a weighted sum of the inter-prediction and intra-prediction blocks. MC with non-rectangular partitions has also attracted considerable research attention over the years [37], [38]. With GPM in VVC, a coding block is partitioned into two parts, which may be non-rectangular or asymmetric rectangular. Two inter-prediction blocks are generated with two MVs derived from two merge candidates for the two parts individually. The final prediction is a weighted sum of the two inter-prediction blocks, with weighting values specifically designed for the partitioning shapes.
D. Weighting of Motion-Compensated Prediction and Motion Data Storage
In VVC, some coding tools such as BCW can be applied to both merge mode and AMVP mode. With BCW, a set of weighting value candidates can be selected for bidirectional inter prediction. The index of the selected weighting values is signaled for AMVP mode and inherited for merge mode, if allowed.
Besides coding efficiency, computational complexity and storage requirements are also intensively evaluated during video coding standardization. To limit the storage required for MVs used in temporal prediction, VVC employs a new algorithm to compress stored MVs, using a 10-bit mantissa-exponent representation.
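As a rough illustration of mantissa-exponent rounding, the sketch below quantizes one MV component so that its magnitude fits a small signed mantissa and returns the lossily reconstructed value. The 6-bit mantissa width is an assumption for illustration; the normative VVC rounding differs in detail and should be checked against the specification.

```python
# Hedged sketch of mantissa-exponent MV compression: find the smallest
# shift so the magnitude fits the mantissa, round once at that shift,
# and reconstruct the (lossy) stored value.
def compress_mv_component(v, mantissa_bits=6):
    s = -1 if v < 0 else 1
    m = abs(v)
    shift = 0
    while (m >> shift) >= (1 << (mantissa_bits - 1)):
        shift += 1
    if shift == 0:
        rounded = m                     # small values are kept exactly
    else:
        rounded = (m + (1 << (shift - 1))) >> shift   # round to nearest
    # rounding may push the mantissa past its range; bump the exponent
    if rounded >= (1 << (mantissa_bits - 1)):
        rounded >>= 1
        shift += 1
    return s * (rounded << shift)
```

Small MVs survive exactly, while large MVs lose low-order precision, which is the intended trade-off of the compressed motion field.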
Motion Vector Coding in VVC
Inter prediction is a key part of every video coding standard, and the compression ratio relies heavily on an efficient representation of motion data. Motion vector coding in the new VVC standard is based on the well-established concepts of HEVC. However, there are several refinements, including a revised AMVP candidate list construction process, SMVD and AMVR, which lead to improved coding efficiency. Detailed descriptions of, and the rationale behind, those tools are given in the following subsections.
A. Motion Vector Predictor List Generation Algorithm
The motion vector prediction algorithm of VVC is based on the AMVP of HEVC. It is applied in case of explicit motion vector signaling, i.e., for inter predicted blocks that do not use merge mode. For each motion vector, a list of exactly two motion vector predictor candidates is generated and a flag indicates which of the two is used. The following candidates are checked for availability:
up to two spatial candidates (similar to that in HEVC),
up to one temporal candidate (similar to that in HEVC),
up to four HMVP candidates (new in VVC),
zero motion vectors, if not enough other candidates are available (similar to that in HEVC).
For the spatial and the temporal (co-located) candidates, the same spatial locations as in HEVC are used. Unlike in HEVC, when the picture order count (POC) of a spatial candidate’s reference picture does not match the POC of the current reference picture, the candidate is considered unavailable; the scaling of spatial candidates is removed in VVC to save computational complexity. As in HEVC, in order to limit memory bandwidth requirements, the temporal motion vector predictor (TMVP) is restricted to co-located candidates that belong to the same coding tree unit (CTU) row as the current block. Also as in HEVC, the TMVP candidate can be disabled at the sequence level or at the picture level. The TMVP is also not available for coding blocks of size
B. Symmetric Motion Vector Difference
In bi-prediction mode, many bits are used to code the motion information, which includes the reference picture indices, MVP indices and MVDs for reference picture lists 0 and 1. To code this motion information more efficiently, the SMVD technology is adopted. The SMVD mode is an inter bi-prediction mode in which part of the motion information is derived under an assumption of linear motion, so the number of bits for motion information coding can be reduced. For a CU coded with a non-affine bi-prediction mode, an SMVD flag is signaled to indicate whether the SMVD mode is selected. When the flag is true, only the MVP indices of list 0 and list 1 and the MVD of list 0 are signaled; the other motion information (i.e., the reference pictures and the MVD of list 1) is not signaled but derived at the decoder side.
First, the MVD of list 1 is symmetrically derived from the MVD of list 0: \begin{equation*} \left ({{mvdx}_{L1},{mvdy}_{L1} }\right)=\left ({{-mvdx}_{L0},-{mvdy}_{L0} }\right).\tag{1}\end{equation*}
Second, the reference pictures of list 0 and list 1 are determined at the slice level. These two reference pictures can only be short-term reference pictures and are determined as in either case 1 or case 2, as follows.
Case 1: The reference picture of list 0 is the nearest picture among all pictures preceding the current picture in output order in list 0. The reference picture of list 1 is the nearest picture among all pictures following the current picture in output order in list 1.
Case 2: The reference picture of list 0 is the nearest picture among all pictures following the current picture in output order in list 0. The reference picture of list 1 is the nearest picture among all pictures preceding the current picture in output order in list 1.
If none of the reference pictures can be found as in case 1 or case 2, the SMVD mode is marked as unavailable and the SMVD flag is not sent.
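Combining Eq. (1) with the two reference-picture cases above, the decoder-side SMVD derivation can be sketched as below. The POC-list arguments and the return convention are illustrative assumptions.

```python
# Sketch of SMVD derivation: the list-1 MVD mirrors the list-0 MVD
# (Eq. (1)), and the two reference pictures are the nearest pictures on
# opposite temporal sides of the current picture (cases 1 and 2).
def derive_smvd(mvd_l0, ref_pocs_l0, ref_pocs_l1, cur_poc):
    mvd_l1 = (-mvd_l0[0], -mvd_l0[1])          # Eq. (1)
    past0 = [p for p in ref_pocs_l0 if p < cur_poc]
    future1 = [p for p in ref_pocs_l1 if p > cur_poc]
    if past0 and future1:                       # case 1
        return mvd_l1, max(past0), min(future1)
    future0 = [p for p in ref_pocs_l0 if p > cur_poc]
    past1 = [p for p in ref_pocs_l1 if p < cur_poc]
    if future0 and past1:                       # case 2
        return mvd_l1, min(future0), max(past1)
    return None                                 # SMVD unavailable
```

Returning `None` models the situation where neither case applies and the SMVD flag is not sent.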
C. Adaptive Motion Vector Resolution
The video coding standards HEVC and AVC/H.264 use a fixed motion vector resolution of one-quarter luma sample. However, it is well known that, to achieve overall rate-distortion optimality, an optimum trade-off between displacement vector rate (
Note that when the half-luma-sample resolution is selected for a coding block, an alternative luma interpolation filter is also used for the half-sample positions in this block. This aspect of AMVR is also known as the switchable interpolation filter (SIF). The frequency responses of the regular and alternative interpolation filters are shown in Fig. 7. It can be seen that the alternative filter has a strong low-pass characteristic, which can be beneficial for attenuating high-frequency noise components. More details about SIF can be found in [16]. The application of the alternative interpolation filter is also propagated in merge mode, i.e., if a coding block in merge mode references a neighboring block that uses the alternative interpolation filter, the referencing block will use it as well.
Block Merging in VVC
Block merging is an efficient coding tool in HEVC. In VVC, more merge modes are introduced for higher coding efficiency. In this section, the regular merge mode is introduced first. Subsequently, the new merge modes, including the geometric partitioning mode, merge mode with motion vector difference, and combined inter-intra prediction, are discussed in order. Finally, the merge estimation region, which lowers encoding complexity, is described.
A. Block Merging Candidate List Generation Algorithm
There are five types of motion vector (MV) predictor candidates in the regular merge mode: spatial candidates, temporal candidates, HMVP candidates, the pairwise average candidate, and zero-MV candidates. The spatial and temporal candidates are the same as those in HEVC, except that the order of the first two spatial candidates is swapped for higher coding efficiency. HMVP candidates, derived from an HMVP table, are inserted into the merge list after the spatial and temporal candidates until the merge list reaches the maximum allowed size minus one. To avoid duplicated candidates while keeping complexity relatively low, a restricted redundancy check is applied: an HMVP candidate is inserted into the merge list only when either of the following conditions is met.
The HMVP candidate is not one of the last two entries in the HMVP table;
The HMVP candidate is not the same as either of the spatial candidates derived from A1/B1, as depicted in Fig. 2.
The pairwise average candidate is generated by averaging a pre-defined candidate pair in the existing merge candidate list. There is at most one pairwise candidate, which averages the first two existing candidates in the merge candidate list. The averaged motion vectors are calculated separately for each reference list. If both motion vectors are available in one list, they are averaged even when they point to different reference pictures; if only one motion vector is available, it is used directly; if no motion vector is available, the list is kept invalid. If the merge candidate list is not full after inserting the pairwise average candidate, zero-MV candidates are added until the list is full, as in HEVC.
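A sketch of the pairwise average derivation just described; the dictionary representation of a candidate is illustrative, and the plain right-shift ignores the exact rounding rule of the standard.

```python
# Sketch of the pairwise average candidate: the MVs of the first two
# merge candidates are averaged per reference list, even when they point
# to different reference pictures; a list with a single MV uses it as-is.
def pairwise_average(cand0, cand1):
    """cand* maps list index (0/1) to an (mvx, mvy) tuple; missing keys
    mean the list is unused for that candidate."""
    avg = {}
    for lx in (0, 1):
        mv0, mv1 = cand0.get(lx), cand1.get(lx)
        if mv0 is not None and mv1 is not None:
            # approximate averaging; the normative rounding may differ
            avg[lx] = ((mv0[0] + mv1[0]) >> 1, (mv0[1] + mv1[1]) >> 1)
        elif mv0 is not None:
            avg[lx] = mv0
        elif mv1 is not None:
            avg[lx] = mv1
        else:
            avg[lx] = None   # list stays invalid
    return avg
```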
B. Merge Mode With Motion Vector Difference
In addition to merge mode, wherein motion information is directly derived from neighboring, historical, or zero motion information, the MMVD technology [19], [20] allows further encoding of an MVD as a refinement of the derived motion information. Roughly speaking, MMVD sits between AMVP mode and merge mode, providing a new trade-off between motion-information accuracy and bitrate. In MMVD, one of the first two candidates in the merge candidate list is selected as the base motion, and an MVD represented by a direction and a distance is encoded as a refinement of the base motion. Four directions, {0, 90, 180, 270} degrees, are allowed for the MVD, and a direction index is signaled to indicate the selected direction. Meanwhile, two distance tables [20], [21], each with eight distance entries as illustrated in Table I, are defined for the MVD. The encoder can select a distance table at the picture level, and a distance index is signaled to indicate the selected entry of the distance table.
Only one MVD is signaled for both unidirectional and bidirectional base motion. When the base motion is bidirectional, the signaled MVD is used directly for one reference picture list and is scaled, according to the POC distances from the current picture to the two reference pictures, before being used for the other list. When the absolute POC distance from the current picture to the reference picture in list 1 is shorter than that to the reference picture in list 0, the signaled MVD is used directly for list 1; otherwise, it is used directly for list 0.
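The direction/distance decoding and the POC-based scaling described above can be sketched as follows. The direction-index mapping and the integer scaling formula are illustrative assumptions, not the normative derivation.

```python
# Sketch of MMVD refinement: a direction index picks one of four
# axis-aligned signs, a distance index picks an entry from the distance
# table, and for bidirectional base motion the MVD is scaled by the
# ratio of POC distances before use in the farther reference list.
DIRECTIONS = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}

def mmvd_offset(dir_idx, dist_idx, distance_table):
    dx, dy = DIRECTIONS[dir_idx]
    d = distance_table[dist_idx]
    return (dx * d, dy * d)

def scale_mvd(mvd, poc_diff_near, poc_diff_far):
    # illustrative linear scaling by the POC-distance ratio
    return (mvd[0] * poc_diff_far // poc_diff_near,
            mvd[1] * poc_diff_far // poc_diff_near)
```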
C. Geometric Partitioning Mode
The GPM [22]–[24] aims to increase partitioning precision and to better fit moving-object boundaries by using a geometric partition of a coding-tree leaf-node CU. Because of its more flexible partitioning and the blending process, GPM is beneficial for video content that includes rigid objects moving relative to a static background or to other moving objects. Furthermore, the newly designed GPM algorithm significantly reduces encoder and decoder complexity while preserving the coding gain of the prior methods proposed for HEVC; therefore, it was adopted in VVC. This section briefly presents the GPM algorithm; for further background, a comprehensive description, and statistical analysis, readers can refer to [24].
Fig. 8 shows an example of the prediction process of GPM. In VVC, GPM is designed for CUs within a restricted range of sizes, and the final GPM prediction $P_{\mathrm{G}}$ is computed by blending the two partition predictions $P_{0}$ and $P_{1}$ with the weight matrices $W_{0}$ and $W_{1}$: \begin{equation*} P_{\mathrm {G}}=\left ({W_{0}\circ P_{0}+W_{1}\circ P_{1}+4 }\right)\gg 3\tag{2}\end{equation*}
where the blending matrices satisfy \begin{equation*} W_{0}+W_{1}=8J_{w,h},\tag{3}\end{equation*} with $J_{w,h}$ denoting the all-ones matrix of size $w\times h$.
Example of the prediction process of GPM; note that both predictions may originate from pictures in a same reference picture list, which is not shown in this example.
The weights in the blending matrices of GPM are derived from the displacement between a sample location and the partitioning boundary, as shown in Fig. 9. The displacement from an arbitrary location $\left({x_{C},y_{C}}\right)$ to the partitioning boundary is computed as \begin{equation*} d\left ({x_{C},y_{C} }\right)=x_{C}\cos \left ({\varphi }\right)-y_{C}\sin \left ({\varphi }\right)+\rho,\tag{4}\end{equation*} where $\varphi$ and $\rho$ denote the angle and the offset of the partitioning boundary, respectively.
In the integer implementation, the displacement at sample position $(m,n)$ is computed as \begin{align*}&\hspace {-0.5pc}d\left ({m,n }\right)=\left ({\left ({\left ({m+\rho _{x,j} }\right)\ast 2-w+1 }\right)\cdot \mathrm {cosLut}[i] }\right) \\&+\,\left ({\left ({\left ({n+\rho _{y,j} }\right)\ast 2-h+1 }\right)\cdot \mathrm {cosLut}[\left ({i+8 }\right)\% 32] }\right),\tag{5}\end{align*}
where the offsets $\rho _{x,j}$ and $\rho _{y,j}$ are given by \begin{align*} \rho _{x,j}=\begin{cases} 0,& i\% 16=8~\text {or}~(i\% 16=0~\text {and}~h\ge w) \\ \pm j\cdot w /8,& \mathrm {otherwise} \\ \end{cases}\tag{6}\end{align*}
\begin{align*} \rho _{y,j}=\begin{cases} 0,& i\% 16=8~\text {or}(i\% 16=0~\text {and}~h\ge w) \\ \pm j\cdot h/8,& \mathrm {otherwise,} \\ \end{cases}\tag{7}\end{align*}
The blending weight at $(m,n)$ is then obtained by offsetting and clipping the displacement: \begin{equation*} \gamma _{m,n}=\mathrm {Clip3}\left ({0,8, \left ({d\left ({m,n }\right)+32+4 }\right)\gg 3 }\right),\tag{8}\end{equation*}
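Equations (4) and (8) can be sketched directly; the weight of the other partition is obtained as $8-\gamma_{m,n}$, consistent with Eq. (3). The function names are illustrative.

```python
import math

def displacement(x_c, y_c, phi, rho):
    # Eq. (4): signed distance from (x_C, y_C) to the partitioning
    # boundary with angle phi and offset rho (continuous form)
    return x_c * math.cos(phi) - y_c * math.sin(phi) + rho

def gpm_weight(d_mn):
    # Eq. (8): offset the integer displacement and clip to [0, 8];
    # the complementary partition weight is 8 - gpm_weight(d_mn)
    return max(0, min(8, (d_mn + 32 + 4) >> 3))
```

Samples far on one side of the boundary get weight 8 (full prediction from one partition), samples far on the other side get 0, and samples near the boundary are blended.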
Unlike regular bi-prediction, GPM-coded CUs contain three types of MVs: each partition has its own unidirectional MV, and the blending area is physically predicted by bidirectional MVs from both partitions. Therefore, the MVs of GPM, which are stored for the MV prediction of succeeding CUs, are adapted to the partitioning boundary. The displacement from the integer central position of a
D. Combined Inter-Intra Prediction
Inter prediction uses signaled motion data to reference temporal information from other pictures for motion compensation, which provides a significant benefit in video compression. Among inter-prediction modes, merge mode is a special one that uses a simpler signaling scheme to derive motion data from a previously coded CU. On the other hand, intra prediction tends to provide more accurate spatial prediction when a sample is closer to the reference samples. To take advantage of both the inter-prediction merge mode and intra prediction, a new merge mode, called CIIP mode, is designed for CUs that contain at least 64 luma samples and whose width and height are both smaller than 128 [25], [26]. In CIIP mode, a weighted combination of the inter-prediction merge mode and intra prediction is used as follows. The merge prediction is derived with the inter-prediction process of the regular merge mode, and the intra prediction is derived with the intra-prediction process of planar mode. Then, a weighted-averaging process combines both predictions. The sum of the prediction weights is equal to 4, and a right-shift operation is applied after adding the two weighted predictions. The final prediction for CIIP, denoted as $P_{\mathrm{CIIP}}$, is computed as \begin{equation*} P_{\mathrm {CIIP}}=\left ({W_{merge}\ast P_{merge}+W_{intra}\ast P_{intra}+2 }\right)\gg 2,\tag{9}\end{equation*}
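Eq. (9) amounts to a fixed-point weighted average per sample. A minimal sketch, with the intra weight (chosen by an encoder-side rule not shown here) passed in as `w_intra`:

```python
# Sketch of the CIIP combination of Eq. (9) for one sample; the weight
# pair (W_merge, W_intra) sums to 4, and +2 implements round-to-nearest
# before the right shift by 2.
def ciip_sample(p_merge, p_intra, w_intra):
    w_merge = 4 - w_intra
    return (w_merge * p_merge + w_intra * p_intra + 2) >> 2
```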
E. Merge Estimation Region
Merge estimation region (MER) was first introduced in HEVC as an implementation-friendly feature for encoders. It is used to enable parallel cost estimation of merging candidates for different CUs. MER divides a picture into equally sized and non-overlapping square regions and allows a spatial merging candidate to be added into merging candidate list only when the current CU and the neighboring CU are in different MERs, as shown in Fig. 10.
Two CUs are treated as being in the same MER when the following conditions are both met:\begin{align*} \begin{cases} (\text {xCb} \gg \text {Log2ParMrgLevel})=\text {(xNb} \gg \text {Log2ParMrgLevel)} \\ (\text {yCb} \gg \text {Log2ParMrgLevel})=\text {(yNb} \gg \text {Log2ParMrgLevel)} \\ \end{cases}\end{align*}
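The two shift-and-compare conditions above reduce to a single predicate; the coordinates are luma sample positions of the current block (xCb, yCb) and the neighboring block (xNb, yNb), and the function name is illustrative.

```python
# Sketch of the same-MER test: two positions fall in the same MER when
# their coordinates match after dropping Log2ParMrgLevel low bits,
# i.e., when they land in the same MER-sized grid cell.
def same_mer(x_cb, y_cb, x_nb, y_nb, log2_par_mrg_level):
    return ((x_cb >> log2_par_mrg_level) == (x_nb >> log2_par_mrg_level)
            and (y_cb >> log2_par_mrg_level) == (y_nb >> log2_par_mrg_level))
```

A spatial merging candidate is usable only when this predicate is false for the current and neighboring blocks.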
In addition to spatial merging candidates, subblock-based merging candidates follow the same rules as those for spatial merging candidates.
When updating the HMVP table, a constraint is imposed to break the dependency between different CUs within one MER in order to allow parallel processing. That is, the HMVP table is not updated until the last CU, located at the bottom-right of an MER, satisfies the following conditions, \begin{align*} \begin{cases} \displaystyle \left ({\left ({\text {xCb} + \text {cbWidth} }\right)\gg \text {Log2ParMrgLevel} }\right)\\ \displaystyle \quad >(\text {xCb} \gg \text {Log2ParMrgLevel}) \\ \displaystyle \left ({\left ({\text {yCb} + \text {cbHeight} }\right)\gg \text {Log2ParMrgLevel} }\right) \\ \displaystyle \quad >(\text {yCb} \gg \text {Log2ParMrgLevel}) \\ \displaystyle \end{cases}\tag{10}\end{align*}
Moreover, encoder-only binary tree (BT) and ternary tree (TT) split constraints are needed. The general rule for MER is that any CU not smaller than the MER size should contain one or multiple complete MERs, and any CU smaller than the MER size should be located entirely within one MER. In detail, when either the width or the height of the current CU is larger than the MER size (denoted R below), the following applies:
If cbHeight <= R, then disallow horizontal BT split for current CU;
If cbWidth <= R, then disallow vertical BT split for current CU;
If cbHeight <= 2 * R, then disallow horizontal TT split for current CU;
If cbWidth <= 2 * R, then disallow vertical TT split for current CU;
The MER size is selected by the encoder and signaled as sps_log2_parallel_merge_level_minus2 in the sequence parameter set, where Log2ParMrgLevel is equal to sps_log2_parallel_merge_level_minus2 + 2.
Others
A. Bidirectional Prediction With Coding Unit Weights
The BCW technology (a.k.a. generalized bi-prediction) is a syntax shortcut of weighted prediction (WP) to predict a block by weighted-averaging two motion-compensated prediction blocks. Unlike WP, which signals weights at the slice level for each reference picture, BCW signals the weight at the CU level by using an index (denoted as wIdx) pointing to the position of the selected weight in a list of pre-defined candidate weights. In general, this list pre-defines 5 candidate weights (i.e., {−2, 3, 4, 5, 10}/8) for reference pictures in reference list 1, where −2/8 and 10/8 are used to reduce negatively correlated noise between the two prediction blocks of bi-prediction. The list may be reduced to {3, 4, 5}/8 when there are forward and backward reference pictures in both reference lists, to achieve a better trade-off between performance and complexity [45]. Since the unit-gain constraint is applied, once the weight (denoted as W) applied to the list-1 prediction block P1 is determined, the weight applied to the list-0 prediction block P0 is inferred as 1 − W, and the final prediction is computed as \begin{equation*} P_{BCW}=(8(1-W)\ast P0+8W\ast P1+4)\gg 3.\tag{11}\end{equation*}
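In integer arithmetic, Eq. (11) amounts to the following sketch (names are illustrative):

```python
BCW_WEIGHTS = [-2, 3, 4, 5, 10]  # candidate numerators for W, denominator 8

def bcw_blend(p0, p1, w_idx):
    """Blend two motion-compensated samples per Eq. (11).

    p0/p1 are co-located prediction samples from reference lists 0 and 1;
    w1 = 8*W and w0 = 8*(1 - W), so w0 + w1 = 8 (unit-gain constraint).
    Clipping to the valid sample range, done in a real codec, is omitted.
    """
    w1 = BCW_WEIGHTS[w_idx]
    w0 = 8 - w1
    return (w0 * p0 + w1 * p1 + 4) >> 3

# Equal weights (w_idx = 2, W = 4/8) reduce to a plain rounded average:
# bcw_blend(100, 108, 2) == 104
```

With w_idx 0 or 4, one of the two predictions receives a negative weight, which can cancel noise that is negatively correlated between the two prediction blocks.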
The notion of BCW is also extended to affine AMVP modes. As before, BCW signals one wIdx for the whole affine CU, and the corresponding weights are applied to the motion compensation of every subblock [29].
The wIdx is buffered for subsequent CUs in the same frame to perform spatial motion merging, either for regular or for affine merge mode. When a spatial neighboring merge candidate is bi-predicted and the current CU selects this candidate, all the reference indices and motion vectors (or control-point motion vectors in the case of inherited affine merge mode), including its wIdx, are inherited by the current CU. The only exception, where the wIdx is not inherited, occurs when the current CU has the CIIP flag enabled. In the case of constructed affine merge mode, the wIdx is simply inherited from the one associated with the above-left control-point motion vector (or the above-right one when the above-left is not used) [30]. Note that when the inherited wIdx points to a non-0.5 weight, decoder-side motion vector refinement and bidirectional optical flow are both turned off.
B. Motion Vector Compression and Range
In VVC, the MV precision is increased from the quarter-luma-sample precision of HEVC to 1/16-luma-sample to provide more precise motion compensation. With the increase of MV precision, the bit depth of an MV is also extended by 2 bits in order to keep the same MV range as in HEVC. Thus, in VVC, 18 bits are used for one MV component, with the range from $-2^{17}$ to $2^{17}-1$.
In terms of temporal motion storage, motion field compression is performed in two aspects. First, the temporal motion vectors are stored on an 8×8 luma sample grid, coarser than the 4×4 grid used for spatial motion vector prediction. Second, before being written to the temporal motion buffer, each 18-bit MV component mv is rounded to a mantissa-exponent representation, consisting of a 6-bit signed mantissa mant and a 4-bit exponent exp, computed as\begin{align*} sign=&mv\gg 17,\quad scale=\lfloor \log _{2}((mv\oplus sign)\vert 31)\rfloor -5, \\ val=&(mv+((1\ll scale)\gg 1))\gg scale, \\ exp=&\begin{cases} scale+((val\oplus sign)\gg 5) & scale\geq 0\\ 0 & \text {others,} \end{cases} \\ mant=&\begin{cases} (val\,\&\,31)\vert (sign\ll 5) & scale\geq 0\\ mv & \text {others,} \end{cases}\tag{12}\end{align*} where $\oplus$, $\vert$, and $\&$ denote bitwise XOR, OR, and AND, respectively.
When an MV is fetched from the temporal motion buffer, it is reconstructed from the stored (exp, mant) pair as\begin{align*} mv^{\prime }=\begin{cases} mant & exp=0\\ \left ({mant\oplus 32 }\right)\ll (exp-1) & \text {others.}\\ \end{cases}\tag{13}\end{align*}
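A minimal Python sketch of Eqs. (12) and (13) (function names are illustrative); it relies on Python integers behaving as two's-complement values of unbounded width:

```python
def compress_mv(mv):
    """Round an 18-bit MV component to (exp, mant) per Eq. (12)."""
    assert -(1 << 17) <= mv < (1 << 17)
    sign = mv >> 17                               # 0 for mv >= 0, -1 otherwise
    # floor(log2(x)) == x.bit_length() - 1 for x > 0
    scale = ((mv ^ sign) | 31).bit_length() - 6
    if scale >= 0:
        val = (mv + ((1 << scale) >> 1)) >> scale  # round to 6 significant bits
        exp = scale + ((val ^ sign) >> 5)
        mant = (val & 31) | (sign << 5)            # 5 magnitude bits + sign bit
    else:
        exp, mant = 0, mv                          # small MVs are kept exactly
    return exp, mant

def decompress_mv(exp, mant):
    """Reconstruct the MV component per Eq. (13)."""
    return mant if exp == 0 else (mant ^ 32) << (exp - 1)
```

Components in [−32, 31] survive the round trip exactly; larger ones are reproduced to about 6 significant bits, e.g. 1000 comes back as 1008.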
Experimental Results
In this section, the inter-prediction coding regarding MV coding and block merging in VVC is evaluated. The JVET common test conditions [40] were used to measure the coding performance with bit-rate savings in terms of the Bjøntegaard Delta (BD) rate [41]. To calculate the BD-rates, four rate points were generated by using quantization parameters (QPs) 22, 27, 32, and 37, and piece-wise cubic interpolation [40] was used. Weighted combined PSNRs of the YUV components (using a weighting factor of 6 for the luma component and 1 for each of the two chroma components) were used to account for chroma fidelity in the BD-rate calculation. Note that a positive BD-rate indicates a coding loss relative to the anchor, whereas a negative BD-rate indicates a coding gain.
The video sequences in the common test conditions (CTC) are categorized into six classes (Class A–F), covering resolutions from UHD (Class A1/A2, 3840×2160) down to WQVGA (Class D, 416×240), along with screen-content sequences (Class F).
The VTM-9.0 reference software [42] with the Main10 profile settings was used to evaluate the VVC inter-prediction coding performance. Two configurations, random access (RA) and low delay with B slices (LDB), were utilized to simulate common video applications. In the RA configuration, which targets video broadcasting applications, an Intra Random Access Picture (IRAP) frame was inserted approximately every second. In the LDB configuration, which targets real-time video applications, the decoding order and the output picture order are required to be the same.
A. Performance Analysis of VVC Inter-Prediction Coding Tools in VTM
The MV coding and block merging techniques described in Section III were tested to demonstrate their overall performance impact relative to the VTM anchors. Table II summarizes the coding performance obtained by disabling all the coding tools described in Section III, in terms of BD-rate and encoding/decoding time.
As reported in Table II, the overall BD-rate loss caused by disabling these tools is significant for both RA and LDB configurations, confirming their combined contribution to the compression capability of VVC.
The encoding time is approximately doubled, mainly due to the extra rate-distortion evaluations for mode selection (e.g., AMVR, BCW, MMVD) at the encoder [43]. The variation of the decoding time is relatively minor, since the extra computational complexity introduced (e.g., blending for GPM, BCW, and CIIP) is nearly negligible compared with the interpolation process of motion compensation.
Among all, AMVR and GPM are the two best-performing tools, each of which can deliver up to 1.6% BD-rate reduction.
B. Luma Samples Coverage of VVC Inter-Prediction Coding Tools in VTM
The effectiveness of the MV coding and block merging techniques was further justified by the luma sample coverage of each coding tool. Fig. 11 (a) and (b) illustrate the average percentage of inter-predicted samples in different coding modes for RA and LDB, respectively. Over 40% (and up to 50% for RA) of the samples were coded with at least one of the VVC inter coding tools (not including HMVP and pairwise merge candidates). Note that one luma sample may be coded with multiple tools (e.g., with BCW and another inter coding tool), so a luma sample may be counted multiple times accordingly. Since SMVD is disallowed in the LDB configuration, it is not depicted in Fig. 11 (b). These numbers indicate that VVC achieves a more effective trade-off between the accuracy of the motion field representation and the required signaling overhead. Thus, almost half of the luma samples in the test sequences that would have been coded using HEVC-based methods (e.g., quarter-luma-sample MVD without precision adaptivity, rectangular-only prediction units with equal weights, regular merge without offsetting) are now coded using the VVC-based MV coding and block-merging techniques.
C. AMVR
In Table V, the various AMVR modes are analyzed based on luma sample usage. The numbers give the percentage of all luma samples encoded using the given AMVR mode with VTM-9.0 under the CTC; the column headings denote the corresponding MV resolutions of the translational and affine AMVR variants.
In Table VI, the AMVR modes are analyzed using tool-off tests, again under the CTC. For the results in the first column, AMVR has been disabled completely, i.e., both the translational and the affine variant. The resulting BD-rate values are positive, indicating a coding loss caused by disabling AMVR. In Test 1, only translational AMVR with a motion vector resolution of one luma sample (“full-sample AMVR”) has been enabled. A smaller, but still positive, BD-rate number indicates a reduced coding loss; in other words, the difference between the values in the first and the second column shows the coding gain of full-sample AMVR. In Test 2, both full-sample and 4-sample AMVR are enabled. In particular for the high-resolution sequences in Class A1, the coding gain is further improved. In Test 3, translational AMVR with all supported motion vector resolutions (i.e., full-sample, 4-sample, and half-sample), including SIF, has been enabled; only affine AMVR is still disabled. Again, the positive BD-rate values are further reduced, especially for the high-resolution classes of sequences, indicating a higher coding gain. In the last test (“SIF off”), only the switchable interpolation filter has been disabled, showing an average coding loss of 0.3%.
D. GPM
As shown in Table III (fourth column) and Table IV (third column), average BD-rate reductions of 0.8% and 1.6% for the RA and LDB configurations, respectively, can be achieved by GPM. In some sequences (e.g., BQMall, RaceHorses, BasketballDrill, and KristenAndSara), the coding performance was notably better than the average for both RA and LDB configurations. The particularly favorable results for these sequences are mainly attributed to their clearly distinctive motion fields and moving-object boundaries. Another observation is that the performance of GPM for LDB was generally better than that for RA. The reason is that the reference pictures are typically closer to the current picture in LDB than in RA, which yields shorter MVs that are easier to code using merge mode in both partitions of GPM. Therefore, GPM is selected more often in LDB than in RA.
In addition to the improved coding performance at low complexity, GPM also enhances the visual quality. For sequences containing rigid moving objects, GPM was used predominantly for coding the moving-object boundaries, as shown in the example in Fig. 12(a). Therefore, sharp and clear edges of moving objects are visible in the sequences coded with GPM, instead of the serrated and blurred moving-object boundaries caused by rectangular partitions in the sequences coded without GPM. This is illustrated with still-picture examples in Fig. 12(b) and (c).
E. MER
MER is an important feature for a commercial hardware encoder because it can effectively reduce the pipeline latency introduced by merge candidate derivation, which depends on the MVs of the spatial neighbors of the current CU. With MER, the merge mode decisions of all CUs inside one MER can be performed at the same time without waiting for each other. When MER is exerted at the encoder, the performance on VTM-9.0 for different MER sizes is shown in Table VII.
Conclusion
This paper introduced the motion vector coding and block merging tools in VVC, which enhance and extend the main concepts of inter coding in HEVC. The technical details and design philosophy were presented and illustrated. Thanks to these advanced coding tools, significant objective and subjective improvements have been demonstrated when compressing various video contents under different use cases. In addition, the complexity of the tools was investigated and carefully optimized during the standardization process in JVET, which will certainly be helpful for the success of VVC.
Acknowledgment
The authors would like to sincerely thank all the JVET experts who have contributed to VVC inter mode coding.