Introduction
The past decade has witnessed significant advancements in video capture and display technologies, leading to an explosion of high-definition (HD) and ultra high-definition (UHD) videos. Despite substantial improvements in video coding technologies, transmitting high-definition videos remains a challenge, particularly in bandwidth-constrained environments.
To address this issue, a well-established strategy is resampling-based video coding, in which videos are down-sampled prior to encoding and the decoded video is subsequently up-sampled to the original resolution [1]. For example, AOMedia Video 1 (AV1) [2] incorporates a mode where frames are encoded at a lower resolution and then up-sampled to the original resolution using bilinear or bicubic interpolation at the decoder [3]. Similarly, Versatile Video Coding (VVC) [4] supports a resampling-based coding scheme known as Reference Picture Resampling (RPR), which facilitates temporal prediction across different resolutions. However, these methods, which rely on handcrafted interpolation filters, struggle to handle the complex characteristics of natural videos effectively.
With the rapid development of deep learning, neural network (NN)-based super-resolution (SR) has demonstrated superior performance compared to traditional up-sampling methods [5], [6]. Consequently, it is natural to leverage these advanced SR models in resampling-based video coding. For example, Li et al. [7] introduced a CNN-based block up-sampling scheme for intra-frame coding in HEVC/H.265 [8], allowing each coding tree unit to choose between full-resolution and low-resolution coding modes based on rate-distortion costs. Additionally, Lin et al. [9] proposed utilizing decoded information generated during video coding to enhance the performance of the SR model, incorporating prediction frames and QP values as additional network inputs. However, a resampling-based strategy can degrade compression performance for the chroma components. To address this issue, a luma-only resampling strategy was proposed in [10], where only the luma component is down-sampled and the video sequence is encoded in YUV444 format. On the decoder side, the reconstructed luma component is up-sampled to restore the YUV420 format.
However, these prior works neglect the complexity of SR models. Although they achieve promising rate-distortion improvements, the heavy computational cost of neural networks has been a major roadblock, making them impractical for real-world applications. Toward a simplified model, a study of a neural network-based in-loop filter for video coding [11] applied CP decomposition to vanilla convolution layers to reduce complexity, achieving a good trade-off between coding performance and complexity.
In this paper, we propose a low-complexity SR model for resampling-based video coding that requires only 20 kMAC/pixel. To reduce complexity, the standard 3 × 3 convolution layer is decomposed into a 1 × 3 convolution followed by a 3 × 1 convolution, inspired by CP decomposition; we apply this decomposition to most of the vanilla 3 × 3 convolution layers. Additionally, we utilize multiple pieces of decoded information generated during video coding as inputs to the SR model, which further improves coding performance. Moreover, to reduce the number of network parameters, we employ a single model with split branches to process the luma and chroma components, respectively. To the best of our knowledge, this is the first time a super-resolution model with split branches has been used in resampling-based video coding. Experimental results demonstrate that the proposed SR model achieves a promising trade-off between coding performance and complexity.
Fig. 1. Network structure of the proposed super-resolution model. Red numbers indicate the number of channels. The input video sequence is assumed to be in YUV420 format.
Fig. 2. Network structure of the low-complexity backbone block. sepCONV denotes depth-wise convolution; red numbers indicate the number of channels. In this paper, C1 and C are 64 and 16, respectively.
Proposed Method
The network structure of the proposed SR model is illustrated in Fig. 1. This paper makes three key contributions compared to existing resampling-based video coding approaches: (1) multiple pieces of decoded information are used as auxiliary inputs; (2) a single model with dual branches; (3) a low-complexity backbone block. The combination of these designs yields an SR model with low complexity (20 kMAC/pixel) and reduced storage requirements (0.1M parameters), while still achieving promising coding performance.
A. Network Structure
Network input. The model input comprises six components: the low-resolution reconstruction rec, the prediction information pred, the boundary strength BS, the base QP QPbase, the slice QP QPslice, and the slice type information IPB. All of these inputs except the reconstruction rec are derived from the decoded information, which provides the SR model with additional cues about the compressed low-resolution frames. For example, the prediction samples pred convey texture details of the compressed frames [9], and the boundary strength BS conveys the location and intensity of compression artifacts [11]. The slice type IPB indicates whether intra or inter prediction is used. The base QP reflects the overall quality of the reconstructed video, while the slice QP relates to the quality of individual frames. To process these inputs, several convolution layers are employed to extract features, followed by a 1 × 1 convolution that fuses the concatenated features.
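As an illustration, the following PyTorch sketch shows one plausible realization of this input stage; the class name, channel counts, and the constant-plane treatment of the scalar inputs (QPbase, QPslice, IPB) are our assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Minimal sketch of the input stage: per-input feature extraction
    followed by a 1x1 convolution fusing the concatenated features."""
    def __init__(self, c_feat=16):
        super().__init__()
        # spatial inputs: low-res reconstruction, prediction, boundary strength
        self.ext_rec = nn.Conv2d(1, c_feat, 3, padding=1)
        self.ext_pred = nn.Conv2d(1, c_feat, 3, padding=1)
        self.ext_bs = nn.Conv2d(1, c_feat, 3, padding=1)
        # 1x1 fusion over concatenated features plus three scalar planes
        self.fuse = nn.Conv2d(3 * c_feat + 3, c_feat, 1)

    def forward(self, rec, pred, bs, qp_base, qp_slice, ipb):
        b, _, h, w = rec.shape
        # expand scalar side information (QPs, slice type) to constant planes
        planes = [v.view(b, 1, 1, 1).expand(b, 1, h, w)
                  for v in (qp_base, qp_slice, ipb)]
        feats = [self.ext_rec(rec), self.ext_pred(pred), self.ext_bs(bs)]
        return self.fuse(torch.cat(feats + planes, dim=1))
```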
Single model with dual branches. Existing approaches typically employ separate models for the luma and chroma components. In contrast, this paper proposes a single model that processes both components, thereby reducing the complexity and the number of parameters of the neural network. In the body of the network, the luma and chroma branches are split and assigned different complexities: the chroma branch uses fewer channels and fewer backbone blocks than the luma branch. Additionally, a pixel shuffle layer is incorporated in the luma branch, considering that the input video sequence is in YUV420 format. In this paper, the parameters are set as follows: CY and CUV are set to 16, and NY = 20 and NC = 10 denote the numbers of backbone blocks used in the luma branch and the chroma branch, respectively.
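The sketch below illustrates the dual-branch design under the stated settings (CY = CUV = 16, NY = 20, NC = 10). It is a simplified assumption of the structure in Fig. 1: plain 3 × 3 convolutions stand in for the backbone blocks (detailed in the next paragraph), and since the paper places pixel shuffle only in the luma branch, the chroma branch here is assumed to already operate at its output resolution.

```python
import torch.nn as nn

def stack(c, n):
    # stand-in for n backbone blocks; a plain 3x3 conv keeps this
    # example self-contained (the real block is sketched further below)
    return nn.Sequential(*[nn.Conv2d(c, c, 3, padding=1) for _ in range(n)])

class DualBranchBody(nn.Module):
    """Sketch of the single model with split luma/chroma branches."""
    def __init__(self, c_y=16, c_uv=16, n_y=20, n_c=10):
        super().__init__()
        self.luma = stack(c_y, n_y)      # heavier luma branch
        self.chroma = stack(c_uv, n_c)   # lighter chroma branch
        # pixel shuffle up-samples luma by 2x; in YUV420 the chroma planes
        # are already at half the luma resolution
        self.luma_out = nn.Sequential(nn.Conv2d(c_y, 4, 3, padding=1),
                                      nn.PixelShuffle(2))
        self.chroma_out = nn.Conv2d(c_uv, 2, 3, padding=1)

    def forward(self, f_y, f_uv):
        return (self.luma_out(self.luma(f_y)),
                self.chroma_out(self.chroma(f_uv)))
```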
Low complexity backbone block. Each branch of the proposed SR model stacks backbone blocks, whose structure is illustrated in Fig. 2. The first two 1 × 1 convolutions extract information from the input features, following the classic bottleneck design. Then, the vanilla 3 × 3 convolution layer is decomposed into a combination of several layers for lower complexity via CP decomposition [11]: specifically, a 1 × 3 depth-wise convolution, followed by a 3 × 1 depth-wise convolution and a 1 × 1 point-wise convolution. This design maintains a receptive field similar to that of the vanilla 3 × 3 convolution while significantly decreasing complexity. Given an input of shape [B, C, H, W], the Multiply-Accumulate Operations per pixel (MAC/pixel) of the proposed backbone block is 2CC1 + 6C + C². In contrast, without CP decomposition (i.e., with a vanilla 3 × 3 convolution in the backbone block), the complexity is 2CC1 + 9C². As shown in Fig. 3, where we set C1 = 64, employing CP decomposition reduces complexity significantly as C increases.
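To make the saving concrete: with C1 = 64 and C = 16 as in this paper, the decomposed block costs 2·16·64 + 6·16 + 16² = 2400 MAC/pixel, versus 2·16·64 + 9·16² = 4352 MAC/pixel for the vanilla variant, roughly a 45% reduction per block. The PyTorch sketch below shows one plausible realization of the block; the activation placement and the residual connection are our assumptions.

```python
import torch.nn as nn

class BackboneBlock(nn.Module):
    """Sketch of the low-complexity backbone block (Fig. 2): a 1x1
    bottleneck pair (C -> C1 -> C), then 1x3 and 3x1 depth-wise
    convolutions plus a 1x1 point-wise convolution replacing the
    vanilla 3x3 convolution."""
    def __init__(self, c=16, c1=64):
        super().__init__()
        self.expand = nn.Conv2d(c, c1, 1)   # 1x1: C  -> C1 (C*C1 MAC/pixel)
        self.reduce = nn.Conv2d(c1, c, 1)   # 1x1: C1 -> C  (C*C1 MAC/pixel)
        self.dw_1x3 = nn.Conv2d(c, c, (1, 3), padding=(0, 1), groups=c)  # 3C
        self.dw_3x1 = nn.Conv2d(c, c, (3, 1), padding=(1, 0), groups=c)  # 3C
        self.pw = nn.Conv2d(c, c, 1)        # 1x1 point-wise (C^2 MAC/pixel)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.reduce(self.act(self.expand(x)))
        y = self.pw(self.dw_3x1(self.dw_1x3(y)))
        return x + y  # residual connection (assumed)
```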
B. Implementation
The proposed method is implemented in the PyTorch framework [12], and training is performed on a single Tesla V100 GPU. During training, we set the mini-batch size to 64 and use the Adam optimizer with an initial learning rate of 4e-4; training runs for 45 epochs with the L1 loss. After training, the model is quantized to int16 and integrated into the VTM-11.0 software [13].
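The following minimal sketch mirrors these training settings (Adam, initial learning rate 4e-4, L1 loss, mini-batch size 64); random tensors stand in for the actual training patches, and the one-layer model is a placeholder for the full SR network.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)  # placeholder for the SR model
opt = torch.optim.Adam(model.parameters(), lr=4e-4)
loss_fn = torch.nn.L1Loss()

for step in range(2):  # the actual training runs for 45 epochs
    lowres = torch.randn(64, 1, 32, 32)   # hypothetical input batch
    target = torch.randn(64, 1, 32, 32)   # corresponding original frames
    loss = loss_fn(model(lowres), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```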
Experimental Results
A. Datasets
We use the DIV2K dataset [14] and BVI-DVC [15] for training. All 800 images from DIV2K and the 1080p and 2160p videos from BVI-DVC are selected to generate training data. VTM-11.0 [13], the reference software for VVC, is used to compress the training data with QPs {22, 27, 32, 37, 42}. The training data are down-sampled by 2× in VTM before encoding; after encoding, we extract the decoded low-resolution reconstructed frames and the necessary decoded information for training, with the corresponding original frames used as labels.
To evaluate the proposed method, we follow the JVET common test conditions for neural network-based video coding technology [16]. Since resampling-based coding is most useful for compressing high-resolution videos, the common test sequences from Classes A1 and A2, which are of 4K resolution, are chosen for testing. The all-intra (AI) and random-access (RA) configurations are tested with QP values {22, 27, 32, 37, 42}. Compression performance is measured with the BD-rate [17], using PSNR as the quality metric.
B. Results and Analysis
Quantitative results. Tables II and I present the BD-rate results of the proposed method compared with the VTM-11.0 anchor for the AI and RA configurations, respectively. The proposed SR model shows average BD-rate changes of {-3.72%, -1.50%, 0.34%} and {-3.65%, -0.66%, 1.89%} for {Y, U, V} under the RA and AI configurations, respectively. It can be observed that the proposed method significantly improves coding performance for the luma (Y) component. A slight performance drop is observed for the chroma (U and V) components, which is a common phenomenon in resampling-based video coding [10]. Considering the complexity of the proposed model, the compression performance is promising.
Performance comparison. We compare the proposed method with the RPR up-sampling filter, which is a non-CNN filter, and with the SR models from [18], [19], which are well-known SR models for resampling-based video coding in the exploration of next-generation video coding standards. The BD-rate results for the luma component under the RA and AI configurations are presented in Tab. III. Compared to [18], the proposed method performs better under the AI configuration but worse under the RA configuration. However, considering that the complexity is reduced by 95% and the number of parameters by 98%, the trade-off between coding performance and complexity is promising. Proposed-L is a higher-complexity variant of the proposed model that uses more backbone blocks and larger channel numbers; when the complexity is aligned, proposed-L outperforms [18]. These results demonstrate the effectiveness of the proposed model.
Rate-distortion curves. The rate-distortion curves for the test sequences are illustrated in Fig. 4. The proposed method achieves compression gains across these sequences, with particularly significant improvements in the low-bitrate region. This demonstrates that the proposed resampling-based video coding is effective, especially under limited bandwidth conditions.
Subjective quality. The subjective results are shown in Fig. 5, where all images are compressed with QP 42. The red box highlights the difference between the proposed method and the RPR filter. In the figure, the artifacts around the letter "C" are clearly visible in the RPR results; in contrast, the proposed method produces much clearer and sharper results with the artifacts largely removed. This demonstrates that our method provides both visually pleasing results and high PSNR values for objective quality.
C. Ablation Studies
To validate the effectiveness of each individual input, we perform ablation studies in this section. The BD-rate performance under the all-intra configuration is reported in Table IV. First, it can be observed that each input is effective. Second, performance drops significantly when the IPB input is removed. Since we use a single model to handle both the all-intra and random-access configurations, the IPB information is crucial for informing the SR model of the prediction mode.
Conclusion
In this paper, we propose a low-complexity CNN-based SR model for resampling-based video coding. The vanilla convolution is decomposed via CP decomposition to achieve lower complexity. Besides, multiple pieces of decoded information are used as auxiliary inputs to the SR model, providing texture and compression information that helps improve coding performance. Moreover, to reduce network parameters, split branches within a single model are designed to process the luma and chroma components, respectively. Experimental results show that the proposed method achieves promising performance improvements even though the complexity is reduced significantly.