Introduction
The amount of video data has increased rapidly, especially with the growing use of Internet-based streaming services and devices that receive video broadcasts. Because the bandwidth and storage capacity available to video applications are limited, efficient video compression techniques are required. This need for video compression will further increase with the higher resolutions of emerging content such as 360-degree and high dynamic range (HDR) videos. Considering this diverse and growing demand for more powerful compression, a new video coding standardization project called versatile video coding (VVC) was recently launched by the Joint Video Exploration Team (JVET) of two expert groups: the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). The JVET published the initial draft of VVC in 2018 [1] and released the VVC test model (VTM). VTM has a structure similar to that of the High Efficiency Video Coding (HEVC) test model (HM), but it uses advanced tools that provide better compression performance.
A key concept among these tools is the multiple-type tree (MTT) partitioning structure [2]. While the HEVC standard supports only a quad-tree (QT) structure to split a block into multiple coding units (CUs), the MTT structure in VVC allows a binary tree (BT) or ternary tree (TT) as an additional sub-tree structure within a QT. MTT can therefore support more diverse CU block shapes than QT alone, contributing to more effective coding performance.
The flexibility of MTT, however, leads to high computational complexity in encoding. It was reported at a JVET meeting that the QT plus BT (QTBT) structure increases the encoding complexity of the joint exploration test model (JEM), the predecessor of VTM, by a factor of 1.8 for the random-access case [3]. Several researchers have targeted the QTBT structure to reduce the encoding complexity of JEM [4], [5]. It is no surprise that MTT, comprising QT, BT, and TT, increases the complexity of video encoders even further. For example, the software implementation of VTM (version 1.0), which used the MTT structure, ran at approximately half the speed of the software implementation of HM (version 16.18), even with SIMD instructions enabled [6]. This high encoding complexity of MTT must be overcome to enable real-time applications and multimedia services, particularly on battery-constrained devices.
Among the tools applied to each block of MTT, motion estimation (ME) exhibits the highest encoding complexity in VVC. The computational complexity of ME increases beyond that of HEVC due to more advanced inter-prediction schemes, which are conducted recursively over the fine-grained partitioning blocks of MTT. Affine motion estimation (AME) [7], [8], which characterizes non-translational motions such as rotation and zooming, turns out to be efficient in rate-distortion (RD) performance at the expense of high encoding complexity. Reducing the complexity of VTM encoders thus requires speeding up the AME process and the associated affine motion compensation (AMC).
AME accounts for a significant portion of the overall ME processing time, and it is therefore important to reduce its complexity. Fig. 1 shows the computational complexity of unidirectional prediction, bi-prediction, and affine prediction in the ME process when encoding several video sequences with VTM 3.0. Each percentage represents the portion of the total ME time. We observe that AME accounts for a significant share of the computation, around 54.75% on average. Therefore, we focus on developing a fast AME technique rather than accelerating the conventional unidirectional prediction and bi-prediction techniques. Indeed, many researchers have attempted to alleviate the complexity of conventional unidirectional prediction and bi-prediction in previous video coding standards [9]–[13], [24], [25] based on QT partitioning structures. However, only a few works address the complexity of AME in VVC. For VTM, there is considerable room to further reduce the ME complexity, particularly that of AME, in an MTT structure.
Time complexity of unidirectional prediction, bi-prediction, and affine prediction of ME in VTM 3.0. Each percentage represents the portion of the total ME time. The quantization parameter is set to 37.
In this paper, we propose a fast encoding method to efficiently reduce the encoding complexity of AME in VTM when MTT is used. The proposed method consists of two procedures. The first eliminates redundant AME and AMC processes through an early termination scheme that exploits parent CUs; specifically, motion information from the parent CU previously encoded in MTT is reused. The second reduces the number of reference frames used for AME. To the best of our knowledge, these approaches are the first attempts to reduce AME complexity in the VVC literature. To demonstrate the efficiency of the proposed method, the associated ME time in VTM is measured under the random-access (RA) configuration. Experimental results show that the AME time of VTM 3.0 is reduced to 64% on average with negligible coding loss.
This paper is organized into five sections. Section II reviews fast ME methods and provides an overview of AME in VTM (hereafter, VTM refers to VTM 3.0). Section III analyzes the characteristics of MTT and the affine model and presents the proposed fast encoding method. Section IV reports the experimental results in comparison with VTM. Finally, Section V presents the conclusions.
Related Work
A. Overview of Motion Estimation in VVC
Conventional ME (CME) uses a block-matching algorithm with a rectangular block shape to compute a translational motion vector (MV). The HEVC standard adopts a QT structure so that a CU size for ME can vary from 64×64 down to 8×8.
To achieve even greater compression efficiency, the recently introduced MTT structure [2] enables more flexible partitioning for prediction. MTT is a tree structure in which a block can be split into a QT, BT, or TT from the root: a coding tree unit (CTU). A CTU can first be split by any of the three tree types, but only the BT and TT types have an interchangeable structure: a BT can have a sub-BT or sub-TT, and likewise for a TT. In contrast, the QT structure can only appear at the start, from the CTU. Accordingly, from a leaf node of a QT, BTs and TTs may be tested. One example of MTT in a CTU is shown in Fig. 2: the CTU is first split into a QT with four sub-CUs, and subsequently the third sub-CU is split by either a BT or TT in the horizontal or vertical direction. Furthermore, each BT- or TT-partitioned sub-CU can be split further until the pre-defined maximum depth of each tree structure is reached. These variations allow the encoder to find the block shapes that best match the underlying motion, improving the compression performance of VTM.
An example of possible BT/TT splitting (dotted blue line) within a CTU after QT split.
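To make the recursive splitting concrete, the following Python sketch enumerates the CU shapes reachable under a simplified MTT. The depth limit and minimum block size are illustrative assumptions; VTM's actual constraints are configurable and more involved.

```python
# A minimal sketch of MTT CU-shape enumeration (illustrative depth
# limit and minimum size; VTM's actual constraints are configurable).

def mtt_shapes(w, h, qt_allowed=True, depth=0, max_depth=3, min_size=4):
    """Collect the (w, h) CU shapes reachable from a block via QT/BT/TT."""
    shapes = {(w, h)}
    if depth >= max_depth:
        return shapes
    # QT: allowed only while no BT/TT split has occurred above this node.
    if qt_allowed and min(w, h) >= 2 * min_size:
        shapes |= mtt_shapes(w // 2, h // 2, True, depth + 1, max_depth, min_size)
    # BT: halve the block horizontally or vertically; QT is disallowed below.
    if w >= 2 * min_size:
        shapes |= mtt_shapes(w // 2, h, False, depth + 1, max_depth, min_size)
    if h >= 2 * min_size:
        shapes |= mtt_shapes(w, h // 2, False, depth + 1, max_depth, min_size)
    # TT: split 1:2:1; the outer parts are quarters and the center a half.
    if w >= 4 * min_size:
        for part in (w // 4, w // 2):
            shapes |= mtt_shapes(part, h, False, depth + 1, max_depth, min_size)
    if h >= 4 * min_size:
        for part in (h // 4, h // 2):
            shapes |= mtt_shapes(w, part, False, depth + 1, max_depth, min_size)
    return shapes

# A QT alone yields square shapes only; MTT yields many more.
print(sorted(mtt_shapes(128, 128)))
```

Even this toy enumeration makes the complexity pressure visible: every additional shape is another candidate block on which ME, including AME, may be run.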
Compared with block-shape flexibility, CME for the prediction of translational motion has not been studied as extensively in the JVET community. Rather, AME has attracted video coding experts because it enlarges the variety of motions that can be estimated. For CME, existing video coding standards such as AVC/H.264 and HEVC use an MV that covers translational motion only. AME, in contrast, enables the prediction of not only translational motion but also linearly transformed motion such as scaling and rotation. If a camera zooms or rotates while capturing a video, AME can predict the motion more accurately than translation-based CME. Li et al. [7] reported that the coding gain from AME is approximately 10% under the RA case on top of HM for affine test sequences. In the recent JEM implementation, AME provides a meaningful coding gain as well [8].
To generate the MV for AME, VTM can choose one of two affine models depending on the control-point parameters: the four-parameter affine model and the six-parameter model [2]. As shown in Fig. 3, two or three control-point vectors generate an affine-transformed block. Let the affine MV to be predicted at position $(x, y)$ be $\mathbf{mv} = (mv^{h}, mv^{v})$ in the two-dimensional Cartesian coordinate system. For the four-parameter model with control-point vectors $\mathbf{a} = (a^{h}, a^{v})$ and $\mathbf{b} = (b^{h}, b^{v})$ at the top-left and top-right corners of a block of width $w$, the MV is given by \begin{equation*} \begin{cases} {mv}^{h}=\dfrac {b^{h}-a^{h}}{w}x-\dfrac {b^{v}-a^{v}}{w}y+a^{h} \\ {mv}^{v}=\dfrac {b^{v}-a^{v}}{w}x+\dfrac {b^{h}-a^{h}}{w}y+a^{v}. \\ \end{cases}\tag{1}\end{equation*}
Similarly, the six-parameter model adds a third control-point vector $\mathbf{c} = (c^{h}, c^{v})$ at the bottom-left corner of a block of height $h$, yielding \begin{equation*} \begin{cases} {mv}^{h}=\dfrac {b^{h}-a^{h}}{w}x+\dfrac {c^{h}-a^{h}}{h}y+a^{h} \\ {mv}^{v}=\dfrac {b^{v}-a^{v}}{w}x+\dfrac {c^{v}-a^{v}}{h}y+a^{v}. \\ \end{cases}\tag{2}\end{equation*}
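For illustration, the following Python sketch evaluates the two models at a pixel position $(x, y)$. The variable names mirror the control-point vectors above; VTM's fixed-point arithmetic and 4×4 sub-block granularity are omitted.

```python
# A minimal sketch of the affine MV fields in (1) and (2); a, b, c are
# control-point MVs at the top-left, top-right, and bottom-left corners.
# VTM's fixed-point arithmetic and sub-block granularity are omitted.

def affine_mv_4param(a, b, w, x, y):
    """Four-parameter model (1): rotation/scaling plus translation."""
    dh = (b[0] - a[0]) / w  # change of the horizontal component along x
    dv = (b[1] - a[1]) / w  # change of the vertical component along x
    return (dh * x - dv * y + a[0],
            dv * x + dh * y + a[1])

def affine_mv_6param(a, b, c, w, h, x, y):
    """Six-parameter model (2): a general linear map plus translation."""
    return ((b[0] - a[0]) / w * x + (c[0] - a[0]) / h * y + a[0],
            (b[1] - a[1]) / w * x + (c[1] - a[1]) / h * y + a[1])

# Example: a 16x16 block whose content rotates slightly around its
# top-left corner; the MV varies with the position inside the block.
print(affine_mv_4param((0.0, 0.0), (0.25, 1.0), 16, 8, 8))
```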
In VTM, a block for affine motion can be predicted in the same two ways as in CME: unidirectional prediction and bi-prediction. Both the four-parameter and six-parameter affine models can employ the two predictions, as shown in Fig. 4. Either unidirectional prediction or bi-prediction for an AME process requires the associated reference frames, thereby increasing the encoding complexity of VTM. Counting only the number of required reference frames per ME process, the AME process requires twice as many as the CME process. Since AME has high complexity, it is beneficial to use a threshold for deciding whether AME should be conducted; thus, VTM uses a threshold-based decision scheme for fast AME encoding. Let us denote
To obtain the best MV for each CU, VTM conducts the ME process at integer-pixel precision first and subsequently performs sub-pixel ME around the best integer-pixel MV, in the same order as HM. However, because VTM has no PU partitions, it cannot reuse the best result of a square-shaped PU for other partitions, so the ME process in VTM differs slightly from that in HM [15]. It is noteworthy that the detailed process can be configured differently depending on the parameters set for ME.
In the RA configuration, an encoder searches multiple available reference frames to obtain the best MV and the best reference frame that minimize the RD cost of a block. As shown in (3), the method of Lagrange multipliers is used to compute the RD cost $J$: \begin{equation*} J=D+ \lambda \cdot R,\tag{3}\end{equation*} where $D$ is the distortion, $R$ is the rate, and $\lambda$ is the Lagrange multiplier.
For unidirectional prediction, the best reference frame is selected by minimizing the RD cost over the reference frames $i$ in the set $\Gamma(\varphi)$ of each reference picture list $\varphi$: \begin{equation*}\mathop {\text {argmin}}\limits _{i\in \Gamma (\varphi)|\varphi =L0,L1}{\{J\left ({\varphi \left ({i }\right) }\right)\}}.\tag{4}\end{equation*}
For bi-prediction, the costs from the two lists are averaged, and the best combination of reference frames is selected as \begin{equation*}\mathop {\text {argmin}}\limits _{i\in \Gamma (\varphi)|\varphi =L0,L1}\left\{{\frac {J\left ({\varphi \left ({i }\right)\vert \varphi =L_{0} }\right)}{2}+ \frac {J\left ({\varphi \left ({i }\right)\vert \varphi =L_{1} }\right)}{2}}\right\}.\tag{5}\end{equation*}
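The selection rules (4) and (5) can be summarized in the following Python sketch. The rd_cost routine is a hypothetical stand-in that returns the Lagrangian cost of (3) for the best MV found against one reference frame, and the exhaustive pairing in best_bidir is a simplification of the iterative refinement actually used in the encoder.

```python
# A minimal sketch of reference-frame selection per (3)-(5).
# rd_cost(block, ref) is a hypothetical routine returning J = D + lambda*R
# for the best MV found against a single reference frame.

def best_unidir(block, list0, list1, rd_cost):
    """(4): the reference frame minimizing J over both lists L0 and L1."""
    candidates = [ref for lst in (list0, list1) for ref in lst]
    return min(candidates, key=lambda ref: rd_cost(block, ref))

def best_bidir(block, list0, list1, rd_cost):
    """(5): the L0/L1 pair minimizing the averaged cost (an exhaustive
    simplification; the encoder refines the pair iteratively)."""
    best = None
    for r0 in list0:
        for r1 in list1:
            j = rd_cost(block, r0) / 2 + rd_cost(block, r1) / 2
            if best is None or j < best[0]:
                best = (j, r0, r1)
    return best
```

The sketch also makes the cost of bi-prediction explicit: every candidate pair touches two reference frames, which is why reducing the number of searched frames pays off directly in encoding time.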
The overall process of ME in VTM is depicted in Fig. 4. A CU at an MTT node starts with CME, followed by AME. In both processes, the same reference frame sets are used.
B. Fast Affine Motion Estimation
A primary complexity problem in CU encoding involves the ME and motion compensation (MC) processes. In particular, as the number of reference frames used for ME increases, the associated complexity increases along with the accuracy of the result, leading to more coding gain, and vice versa. Because videos in the real world contain a wide variety of motions, covering this range incurs computational and memory complexity that fast encoders must overcome.
The VTM encoder adopts several ME strategies from HM, such as the diamond search, raster search, and refinement processes [15]. Thus, related work on HM is also useful for determining whether VTM can be easily accelerated. Research on fast ME can be classified under three strategies: simpler search patterns, adjusted search ranges, and early termination. The first two strategies have been studied for many years. Two diamond search patterns [16] were presented and are currently the core of ME in both HM and VTM. More recently, to reduce the encoding complexity of HM, a directional search pattern method [13], a rotating hexagonal pattern [12] with some skipping methods, and an adaptive search range [17] showed reasonable results with a much smaller search range (i.e., 64) than that of VTM.
Another approach to reducing the complexity of ME is to reduce the number of reference frames to be searched. Pan et al. [26] presented a reference frame selection method to reduce the encoding complexity of ME in HM. The number of initial reference frames has been studied for fast ME encoding, since adjacent reference frames with high similarity tend to be selected in general [27]. As discussed in [28], [29], coding efficiency and computational complexity vary substantially depending on which reference frames are employed. Thus, reference frames should be chosen carefully in the context of other prediction tools.
A simpler yet effective method is an early termination strategy that skips redundant ME processes (either for unidirectional prediction or bi-prediction) per block. Skipping such ME processes can significantly reduce the associated encoding time but may degrade quality, thereby requiring a highly accurate decision process for early termination. It has been found that the recursive block partitioning process in HEVC encoding provides a strong correlation of motion information, so that the ME process can be terminated with high accuracy. For example, previously encoded PU information in the QT structure is exploited in [9] and extended with some modifications of the motion search in [10], showing almost no coding loss. The bi-prediction ME process can also be terminated early given PU information in the QT structure [18], [19]. However, these early termination methods cannot be directly applied to the MTT structure because PU partitioning is no longer available in VTM. Thus, a new statistical analysis of MTT is required for low-complexity VTM encoders.
VTM 3.0, used in our experiments, includes the state-of-the-art algorithms for lightweight AME in VVC. Since the first VVC test model (VTM 1.0) was released in 2018, several works [30], [31] have been applied to the test model during VVC standardization, seeking a better trade-off between coding efficiency and computational complexity. Specifically, in [30], Zhou proposed reusing the affine motion parameters of neighboring blocks instead of deriving the parameters again. In [31], Zhang et al. proposed avoiding the parameter derivation process for small chroma blocks. Both methods reduce the total memory bandwidth and the encoding complexity of VTM 2.0, and they were integrated into VTM 3.0. However, there is still much room to further reduce the AME complexity of VTM by using the MTT structure, as shown in the next section.
Proposed Method
In this section, we propose a fast AME algorithm to tackle the computational complexity of ME in VTM. To this end, we develop two features reflecting motion characteristics of the current CU from previously coded information and use them in a two-step fast AME algorithm. In the first stage, we conduct early termination of the AME process based on the parent CU in MTT: the best prediction mode of the parent CU is examined, and it is determined whether to skip the entire AME process based on that mode. In the second stage, the prediction direction of the best reference frame in CME is examined to reduce the number of reference frames searched for the current CU.
In the following subsections, the two proposed features, (a) the best inter-prediction mode of the parent CU and (b) the prediction direction of the CME in the current CU, are analyzed statistically to determine whether they can be used effectively for the decision. Then, the two-step fast coding scheme using these features is presented in detail.
A. Statistical Analysis of Feature Selection
Following Bayes' theorem, we compute the posterior probability when the features are observed. For the first feature, let $A$ denote the event that the affine mode is selected as the best mode of the current CU, and let $S_{par}$ denote the observed best prediction mode of the parent CU. The posterior probability is \begin{equation*}p(A \vert S_{par}) = \frac {p(S_{par} \vert A)\,p(A)}{p(S_{par})},\tag{6}\end{equation*}
Similarly, for the second feature, let $U_{CME}$ denote the event that the best reference frame of CME in the current CU is obtained by unidirectional prediction. The posterior probability is then \begin{equation*}p(A \vert U_{CME}) = \frac {p(U_{CME} \vert A)\,p(A)}{p(U_{CME})}.\tag{7}\end{equation*}
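Both posteriors can be estimated by simple counting over training encodes, as the following Python sketch shows; the counts in the example are hypothetical and merely stand in for the statistics collected in Table 1.

```python
# A minimal sketch of estimating the posteriors in (6) and (7) by
# counting over training encodes; the counts below are hypothetical.

def posterior(n_joint, n_feature):
    """p(A | F) = #(A and F) / #(F): the empirical estimate obtained
    when Bayes' theorem is applied with empirical probabilities."""
    return n_joint / n_feature if n_feature else 0.0

# e.g., among 10,000 CUs for which the feature S_par was observed,
# 412 selected the affine mode as their best mode:
print(f"p(A | S_par) = {posterior(412, 10_000):.3f}")
```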
Table 1 shows these probabilities, obtained from four UHD video sequences [22] encoded by VTM 3.0 under the RA configuration. For simplicity, 200 frames per sequence were encoded with two quantization parameters (QPs), 25 and 35. The sequences and QP values used for this statistic differ from those in Section IV so as to separate the training data from the test data. By separating the material, we believe that the statistics presented in this section are not biased toward the test data presented in Section IV.
In Table 1,
B. Fast Affine Motion Estimation Method
The proposed method consists of two parts: one extracts features within an MTT structure, and the other applies the fast AME encoding algorithm. We first describe the feature extraction, shown in Fig. 5 with partitioning examples in an MTT structure. Fig. 5(a) shows the proposed encoding framework equipped with an MTT structure to filter out redundant AME processing. In VVC, there are more CU shapes, so a parent node can have a large number of sub-CUs. Therefore, the proposed method is designed to allow sub-CUs in the recursive block partitioning to use the previously encoded information when building QT, BT, and TT partitions.
The proposed encoding framework in MTT structure for passing motion information and associated examples.
Fig. 6 shows a block diagram of how the proposed method changes the conventional ME in the original VTM software. The steps added to the VTM software algorithm are highlighted. As shown, the best prediction mode of a parent CU is examined first to determine whether the entire AME process of the current CU can be terminated early.
In the second stage, the best mode of CME in the current CU is examined: if its best reference frame is obtained by unidirectional prediction, the number of reference frames searched in the subsequent AME is reduced accordingly.
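The two stages can be sketched as follows in Python; the CU fields, the SKIP_MODES set, and the run_cme/run_ame callables are hypothetical stand-ins for the corresponding VTM structures and routines.

```python
# A minimal sketch of the proposed two-stage fast AME decision.
# The CU fields, SKIP_MODES, and run_cme/run_ame are hypothetical
# stand-ins for the corresponding VTM structures and routines.

SKIP_MODES = {"SKIP"}  # parent modes for which AME is terminated (illustrative)

def encode_inter(cu, run_cme, run_ame):
    cme = run_cme(cu)  # conventional (translational) ME runs first

    # Stage I: early termination -- if the parent CU's best prediction
    # mode suggests affine motion is unlikely, skip the AME process.
    if cu["parent_best_mode"] in SKIP_MODES:
        return cme

    # Stage II: reference-frame reduction -- if the best CME reference
    # was obtained by unidirectional prediction, search only that list.
    lists = [cme["best_list"]] if cme["unidirectional"] else ["L0", "L1"]

    ame = run_ame(cu, lists)
    return cme if cme["rd_cost"] <= ame["rd_cost"] else ame
```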
Experiments and Results
The compression performance was measured by the Bjøntegaard-Delta bitrate (BDBR) method [20], using bitstreams encoded at four QP values: 22, 27, 32, and 37. The test materials were chosen from the common test condition (CTC) for standard dynamic range (SDR) video [21]. The tested sequences are categorized into five classes as in [21]; Classes B, C, and D represent 1920×1080, 832×480, and 416×240 resolutions, respectively.
The time complexity of AME was measured by its running time. The AME time ratio (ATR) is reported in comparison with the anchor by measuring the entire time of AME and the associated AMC functions. Since the AME time (AT) may vary with QP values, the ATR per sequence is calculated as the geometric mean of the four AT ratios, as shown in (8): \begin{equation*} ATR=\left ({\prod \limits _{i=1}^{4} \frac {AT_{\psi }({QP}_{i})}{AT_{o}({QP}_{i})} }\right)^{\frac {1}{4}},\tag{8}\end{equation*} where $AT_{\psi}$ and $AT_{o}$ denote the AME time of the proposed method and the anchor, respectively, at the $i$-th QP value.
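For reference, (8) amounts to the following Python computation; the AT values in the example are placeholders, not measured results.

```python
# A minimal sketch of the ATR computation in (8); the AT values below
# are placeholders, not measured results.
import math

def atr(at_proposed, at_anchor):
    """Geometric mean of the per-QP AME-time ratios over the four QPs."""
    ratios = [p / o for p, o in zip(at_proposed, at_anchor)]
    return math.prod(ratios) ** (1 / len(ratios))

print(atr([50, 48, 45, 40], [80, 78, 75, 73]))  # ~0.60
```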
Table 2 compares the proposed method and the anchor in terms of both coding efficiency and encoding complexity. The proposed method reduces the AME time of VTM to 63% on average, with a coding loss of 0.1%, in comparison with the anchor. At best, the ATR of the BQSquare sequence reaches 39% with a 0.47% loss; at worst, the ATR of the RitualDance sequence reaches 79%, which is still a noticeable improvement. As shown in Table 2, the difference in coding performance between the anchor and the proposed method is minimal for all tested sequences. The BQSquare sequence is the worst case in terms of the BDBR of the Y (luma) component, yet the loss is still within 0.5%. Moreover, when the chrominance components (i.e., U and V) are considered, the average BDBR loss over Y, U, and V is smaller than the BDBR loss of Y alone. In particular, for several sequences, BDBR-U and BDBR-V performed better than the anchor; for example, the BQMall sequence showed a 0.20% decrease in both BDBR-U and BDBR-V, and the BasketballPass sequence showed a coding gain in both the U and V components. It is noteworthy that, in general, the proposed method sustained coding efficiency robustly, especially at higher resolutions (i.e., Classes A1, A2, and B). The total encoding time ratio (ETR) is around 95% with slight coding loss, and the best ETR is observed for BQSquare, as with the ATR. Note that AME is implemented with Single Instruction Multiple Data (SIMD) instructions as an optimized parallel processing technique; this implies that the relative complexity of AME could increase further on an encoder that does not support such architecture-dependent optimizations.
The actual running time of the AME process was measured per QP value, as shown in Fig. 7. At QP 22, the total running time of the AME process over all sequences is 36 hours for the anchor, and the proposed method reduces it by approximately 8 hours. A similar trend is observed for the other QP values. Moreover, Fig. 7 shows that the AME time is reduced to about 55% at QP 37. Considering that complexity reduction is more challenging at higher QP values, this property can be exploited by encoders on low-end devices.
Sum of actual running time (hours) per QP value for the proposed method and the anchor.
As the proposed method contains two stages to accelerate the AME process, an additional experiment was conducted to evaluate the implementation of each stage on top of the anchor. For simplicity, the sequences were encoded with fewer frames: 100 frames for Class A1 and A2 sequences and 3 seconds of video for the other sequences. As shown in Fig. 6, Stage I is the early termination scheme that terminates the entire AME process, whereas Stage II is the range reduction scheme that reduces the number of reference frames for AME.
Table 3 shows the results of the additional experiment in comparison with the same anchor. Stage I reduced the ATR to 69%, as this stage can skip the entire AME process when applicable. The best performance of Stage I was achieved with the BQSquare sequence, as in Table 2 (results of the full integration of the proposed method). Stage II saved 4% of the ATR while sustaining most of the original compression performance, showing a BDBR-Y change of only 0.02%. It is noteworthy that Stage II sometimes even improved compression performance, with gains of 0.13%, 0.08%, and 0.06% for the RaceHorsesC, BQSquare, and RitualDance sequences, respectively. In conclusion, both stages of the proposed method contributed to reducing encoding complexity while sustaining coding efficiency.
Conclusion
VTM achieves far better compression performance than its predecessor, the HEVC standard, according to the literature. Among the technologies newly adopted in VTM, affine prediction contributes substantially to compression performance by capturing a greater variety of motions found in nature. However, the complexity of AME is a bottleneck for low-complexity encoder applications. In this paper, the encoding complexity of AME was investigated, and the associated key observations were derived statistically. Based on these observations, a fast AME method was proposed for a VVC encoder. The proposed method reduced the AME time of VTM to 64% on average with negligible coding loss. We believe that these contributions can help promote future research on the complexity of VVC encoders. In future work, we will develop a machine learning-based fast algorithm to automatically learn the features, as practiced in [23].