Introduction
The advancements in Neural Network (NN) technology and the hardware improvements that made it possible to train models of increasing sizes have been promoting an increasing research effort on learned solutions for image and video processing tasks [1]. This is because NNs have interesting properties like loss-function choice flexibility and off-the-shelf libraries capable of maximizing parallelism on CPU/GPU platforms.
At first, spatial Super-Resolution (SR) [2], [3], [4] and Video Frame Interpolation (VFI) [5], [6], [7] tasks explored NN technology to improve the state-of-the-art results. However, after Ballé, Laparra, and Simoncelli [8] presented a solution to the non-differentiable quantization problem, further research has been conducted to improve learned image and video compression with more sophisticated models [9], [10], [11], [12]. These solutions follow the single-layer model from Fig. 1: an encoder receives the original signal and produces a standardized compressed bitstream, then a decoder parses that bitstream and runs the necessary steps to reconstruct the signal.
Single-layer image compression model. Both the input and output images have three channels, width W and height H.
Recent works are expanding the learned solutions to employ multi-layer compression [13], [14] (Fig. 2) using concepts and techniques inherited from scalable, handcrafted codecs like the Scalable HEVC (SHVC) [15]. As shown in Fig. 2, two or more layers perform the compression in this approach. The original media is first downsampled (
Multi-layer image compression model. In addition to the EL output, the BL is capable of reproducing the image at half its original resolution.
Multi-layer codecs are designed to provide functionalities that single-layer codecs lack [14]. The most straightforward one is the capability of efficiently generating bitstreams for different resolutions and/or quality levels, so that the reconstructed media achieves a level that satisfies the transmission or device requirements. Although fully learned multi-layer solutions can be found in the literature [13], [14], they might bring a prohibitive overhead in pipelines that employ handcrafted codecs, since there is limited hardware support for this technology on user devices such as smartphones.
A few works propose to combine handcrafted- and learned-based compression solutions, employing the former as the BL and the latter as the EL [16], [17], [18]. Such an approach allows for reproducing the BL media on well-established platforms at a low complexity cost. By isolating the NN solutions on the EL, it is possible to explore cutting-edge compression technology while avoiding the challenges in the previously mentioned approaches.
In this work, we explore the hybrid handcrafted- and learned-based combination. However, in contrast to existing works that adopt a multi-layer approach such as the one in Fig. 2, we propose an asymmetric multi-layer solution, Asymmetric Multi-layer Compression (AMLC), as depicted in Fig. 3. Notice that no data goes from the BL to the EL at the encoder side. Therefore, the steps necessary for BL encoding (
We propose a novel solution that simplifies the multi-layer coding process. To demonstrate the benefits of our AMLC model, we compare its results with those of a fully handcrafted and a fully learned multi-layer model, employing three datasets and six quality metrics for coding efficiency evaluation.
The AMLC model significantly enhances coding efficiency, outperforming the SHVC reference software by up to 62.41%.
Considering the complexity, the proposed AMLC is $1.28\times $ faster to encode than the fully learned model. We demonstrate that such a simplification does not compromise our model’s overall coding efficiency.
We reinforce the benefits of combining handcrafted BL and learned EL. The handcrafted BL is faster to encode and decode than the learned implementation, and it is compatible with existing software and hardware platforms. Meanwhile, the learned EL delivers high-quality results and allows for simplifications such as the proposed one.
Proposed AMLC model. We adopt a handcrafted-based BL and a learned-based EL, removing the ILP dependency between those layers at the encoding side.
The remainder of this paper is organized as follows. Section II brings related works using NN for image and video processing and compression. Then, Section III presents the motivations for the AMLC model, supported by the literature review. Section IV introduces the AMLC model and its components. Section V details the training and evaluation methods. The experiment results are presented and further analyzed in Section VI. Finally, Section VII draws conclusions and proposes future work.
Related Work
In this section, we review the works that have proposed and advanced single-layer learned compression, as well as those encompassing multi-layer compression with handcrafted and learned solutions.
A. Single-Layer Learned Compression
In image compression, the quantization step is essential to allow lossy compression, thus achieving high compression rates. However, the fact that such a step is not differentiable poses a great challenge to NN-based compression solutions. To mitigate this problem, Ballé et al. [8] and Ballé et al. [9] proposed a framework that approximates the quantization effects using uniform noise addition during training. After those works were published, subsequent works have shown remarkable results, further advancing the state-of-the-art in handcrafted encoders. Ballé et al. [19] proposed applying hyperprior coding to predict the
The works that followed have proposed further contributions to hyperprior coding. Minnen et al. [10] and Lee et al. [11] proposed methods to predict the mean used in the Gaussian distribution, relying on the context of previously decoded data (eq. 1 with \begin{equation*} p(\hat {y})\sim \sum _{k=1}^{K} w_{k}N(\mu _{k}, \sigma ^{2}_{k}) \tag {1}\end{equation*}
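To make eq. 1 concrete, the following sketch evaluates a Gaussian mixture density for a single latent value. It is only an illustration: the function names `gaussian_pdf` and `gmm_pdf` and the example parameters are assumptions, not taken from any of the cited implementations. With a single component (K = 1), the mixture reduces to one Gaussian.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gmm_pdf(y, weights, means, sigmas):
    """Eq. 1: weighted sum of K Gaussian densities (weights must sum to 1)."""
    if not math.isclose(sum(weights), 1.0, rel_tol=1e-9):
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(y, m, s)
               for w, m, s in zip(weights, means, sigmas))


# Hypothetical parameters for a K = 3 mixture (illustrative values only).
p = gmm_pdf(0.5, weights=[0.5, 0.3, 0.2],
            means=[0.0, 1.0, -1.0], sigmas=[1.0, 0.5, 2.0])
```

In a learned codec these parameters would be predicted per latent element by the network; here they are fixed constants purely to show the arithmetic.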
Cheng et al. [12] learned image codec proposal. Module
Several early applications of NN for video compression targeted specific handcrafted video encoder modules while still relying on the well-known quantization step from those encoders [20]. Eventually, end-to-end solutions for video compression appeared, such as the DVC [21], a framework that maps handcrafted video encoder modules to equivalent learned models. Instead of using pre-trained models for optical-flow prediction as proposed in DVC, the SSF model [22] has motion estimation and compensation modules designed for video compression. The VCT model [23], in turn, explores the Transformer’s temporal capabilities to simplify the learned video compression modules.
B. Multi-Layer Compression
Currently, there are three major handcrafted multi-layer video compression standards: SHVC [15], Versatile Video Coding (VVC) [24], and Low Complexity Enhancement Video Coding (LCEVC) [25]. SHVC is a set of scalable tools based on the High Efficiency Video Coding (HEVC) standard, requiring the reconstructed frames from the BL and, if available, the motion information. The possibility of using only the reconstructed frames makes SHVC agnostic to the BL standard. In VVC, the successor of HEVC, the support for multi-layer compression was included as one of its features. LCEVC, in turn, is also agnostic to the BL standard, but there is a major difference between this and the aforementioned standards: LCEVC does not use any tools associated with existing coding standards. The LCEVC lossy compression was designed with low-complexity tools to encode the residues between the upsampled frames and the source frames.
End-to-end learned models are also being explored for multi-layer compression. Su et al. [13] proposed a model to achieve quality scalability, where each layer has a model inspired by [19], but allowing different layers to share parameters with recurrent structures. While the first layer compresses an image, the other layers compress residues to improve the previous layer quality, distributing the Rate-Distortion (RD) trade-offs throughout the multiple layers.
Mei et al. [14] proposed a model for learned multi-layer image compression, exploring quality and spatial scalability. Their model has a learned module to process the decoded latent space features from a given layer, generating a predictor for the EL above. Those same features, upsampled when using spatial scalability, are concatenated to the results for all the ELs above to improve the reconstruction. Both Su et al. [13] and Mei et al. [14] use a fully learned end-to-end model, adapting the loss function for multi-layer compression. These works still use RD optimization, but the rate and distortion are the sum of each layer’s estimated rate and distortion, weighted by a given factor.
Multi-layer compression has also been proposed recently as a solution to produce a standard-compliant bitstream using a handcrafted solution in the BL while isolating learned solutions to the EL. Lee et al. [16] proposed a solution that employs a learned EL to transmit residual features used to enhance the BL reconstruction quality. The decoder synthesizes the enhanced images by combining the BL reconstruction and the EL compressed residues. The overall approach of Travers, Bonnineau et al. [17] is similar, but their contribution is centered on spatial scalability. Benjak et al. [18] address video compression, presenting a solution for both spatial and temporal scalability. Their work combines learned models for super-resolution, motion estimation, and residual compression. Those models are more complex than the ones in the aforementioned works, requiring several steps and presenting restrictions for training, such as a reduced training dataset for fine-tuning the whole model.
Motivation for a Combined Handcrafted and Learned Asymmetric Compression
Table 1 summarizes the characteristics of the works presented in the previous section. As mentioned, the NN advancements are promoting the rise of new and different applications. In a short time span, learned compression achieved state-of-the-art coding efficiency results compared to handcrafted codecs, which have been optimized for decades. On the one hand, adopting learned solutions is challenging because it is necessary to train several model instances, that is, one for each RD constraint. Moreover, efficient encoding and decoding demand specific hardware architectures. On the other hand, multi-layer learned compression might achieve multiple RD constraints, but the hardware requirement issue remains unsolved for the works that apply learned solutions on the BL.
As also presented in Table 1, instead of a fully learned or a fully handcrafted multi-layer solution, we combine the benefits of both by assigning each to the appropriate layer: 1) the BL bitstream produced by a handcrafted codec is compliant with a standard already in use, while 2) on top of that, the EL becomes available for exploring learned latent spaces to enhance the media reconstruction. Furthermore, we take advantage of the end-to-end training to remove dependencies between the EL and the ILP extracted from the BL at the encoding side, an approach not explored by similar multi-layer models [16], [17], [18]. Introducing such an asymmetric approach to multi-layer compression allows for a higher level of parallelism at the encoder side. To the best of our knowledge, this is the first work to employ such a strategy.
Proposed Model: AMLC
In this section, we detail the proposed AMLC model components, as well as the loss function adaptation and training process.
A. Model Implementation
Fig. 3 shows the proposed AMLC model. Similarly to other multi-layer models, AMLC has a BL, an EL, and an inter-layer processing block. As argued in Section III, the BL uses a handcrafted codec, and we chose the HEVC reference software in our case study. Our decision to use HEVC instead of the latest standard (VVC [24]) allowed us to conduct a greater number of experiments due to the difference in execution time between the two reference software packages. However, the AMLC model abstracts the BL implementation, allowing for future experiments to be conducted with different handcrafted implementations under the same method described in this work.
The inter-layer processing depends on the scalability type adopted in the system. In this work, we consider only spatial scalability, which applies to both image and video compression, implementing the inter-layer processing with the bicubic upsampling method. Despite advances in learned SR [2], [3], [4], we chose to employ a simpler solution for inter-layer processing. This approach avoids using an additional model that could require re-training and would increase the overall complexity, such as observed in [18].
Fig. 5 provides more details on the EL decoding implementation. The learned encoder and decoder correspond to the implementation presented by [12] that uses attention blocks, the state-of-the-art solution for image compression. The concatenation of the ILP and the learned decoder output feeds the Merge block. Our Merge block implementation, depicted in Fig. 6, is similar to the U-net [26]. Although the U-net was proposed for an image segmentation problem, a similar module, commonly called feature extraction or feature merging, appears in research on other image processing tasks such as VFI [5], [6]. To avoid checkerboard visual artifacts in the images, we implement Down and Up blocks (Fig. 7) using average pooling and bilinear upsampling layers, respectively.
Merge block. On the left side, the Down blocks decrease the width (W) and height (H) dimensions and increase the number of feature maps by a factor of 2. The Up blocks on the right side perform the opposite process. The Conv2d parameters are the kernel size k, stride s, and
B. Two Stage Training and Loss Functions
The AMLC EL has two different learned components: the learned compressor and the Merge block. While there are pre-trained models available for the learned compressor, we had to train the Merge block from scratch. In order to reduce the training effort, we relied on the transfer learning technique, splitting the training into two stages and adapting the loss functions accordingly.
In the first stage, we transferred the weights of a learned image compressor into our compression block, froze them, and updated only the Merge block weights. We defined the training loss as follows:\begin{equation*} Loss = D(Ori, Rec), \tag {2}\end{equation*}
The second stage consists in fine-tuning the whole model, which now includes the EL compressor training. In this stage, the EL compressor is no longer responsible for coding images, but for coding learned features that improve the final image reconstruction. Because training affects the compression, our loss function has to consider the RD trade-off as follows:\begin{equation*} Loss = \lambda D(Ori, Rec) + R_{\text {Total}}, \tag {3}\end{equation*}
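The two training objectives (eq. 2 and eq. 3) can be sketched as plain functions. This is a minimal illustration under stated assumptions: the distortion D is taken to be the mean squared error (the paper refers to MSE-based training in Section VI), the total rate is supplied as a single precomputed number, and all function names and example values are hypothetical.

```python
def mse(original, reconstructed):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared) / len(original)


def stage1_loss(original, reconstructed):
    """Eq. 2: distortion only (stage 1 updates only the Merge block)."""
    return mse(original, reconstructed)


def stage2_loss(original, reconstructed, rate_total, lam):
    """Eq. 3: rate-distortion trade-off, lam * D + R_total."""
    return lam * mse(original, reconstructed) + rate_total


# Illustrative values only: four normalized pixels and an assumed rate.
ori = [0.2, 0.4, 0.6, 0.8]
rec = [0.25, 0.35, 0.6, 0.9]
loss_s1 = stage1_loss(ori, rec)
loss_s2 = stage2_loss(ori, rec, rate_total=0.5, lam=100.0)
```

A larger `lam` penalizes distortion more heavily relative to rate, which pushes the optimizer toward higher-quality, higher-bitrate operating points.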
Related works set lambda (\begin{equation*} |D_{S1} - D_{S2}| \lt threshold, \tag {4}\end{equation*}
Experimental Setup
In this section, we detail the experimental configurations, which include baseline implementations, model parameters, datasets, and evaluation metrics.
A. Experimental Configuration
1) Infrastructure
We performed training, inference, and runtime measurements on a machine equipped with 16 GB of RAM, a Ryzen 7 5700X CPU, and a single RTX 3060 GPU with 12 GB of VRAM.
2) Baselines
We use the SHVC reference software (SHM12.41) in the All-Intra configuration as a handcrafted multi-layer baseline codec. For the fully learned multi-layer codec (FL-codec), we use the Cheng2020 [12] in simulcast mode, following the method described by [14]. To generate the RD curves, we set the BL and EL operating at the same quality levels in both cases, using the Quantization Parameters (QPs) 27, 32, 37, and 42 for SHVC and levels 1, 3, 4, and 6 for FL-codec.
3) Dataset
We used the Vimeo90k – with samples at
In addition to the Vimeo90k, we assess the performance of the AMLC model using the validation split from CLIC-Professional and CLIC-Mobile datasets [31]. CLIC-Professional has professionally captured images, ranging from
4) Base Layer
The converted datasets have the original images to train the EL compressor, but we still need to pair them with the BL reconstruction for model training. We downsampled the YUV420 frames with FFmpeg using the bicubic method, scaling down each dimension to
5) Model Instances
We pre-loaded the learned compressor weights using the CompressAI checkpoints [27]. Although the model of Cheng et al. [12] has six compression levels available, we limited our research to four, selecting two with the highest compression results (levels 1 and 3), and two with the highest quality results (levels 4 and 6). The models for levels 1 and 3 set the kernel number to 128, whereas those for levels 4 and 6 set it to 192 [12], [27]. We set
6) Training Parameters
We used center crop patches of
B. Evaluation Metrics
1) Bitrate
We can obtain the bitrate by adding the total file size from each layer’s bitstream. We also use a bitrate metric that is relative to image resolution, called bits per pixel (bpp).
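As a minimal sketch of this computation, the snippet below sums the bitstream sizes of all layers and normalizes by the pixel count. The function name, the byte sizes, and the resolution are hypothetical, and the pixel count is assumed to be that of the full-resolution image.

```python
def bits_per_pixel(layer_sizes_bytes, width, height):
    """Total bits across all layer bitstreams divided by the pixel count."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    total_bits = 8 * sum(layer_sizes_bytes)
    return total_bits / (width * height)


# Hypothetical example: a 2048-byte BL and a 6144-byte EL bitstream
# for an image of 512 x 256 pixels.
bpp = bits_per_pixel([2048, 6144], width=512, height=256)
```

Normalizing by resolution makes rates comparable across datasets whose images differ in size.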
2) Quality
In this work, we employed six objective quality metrics to assess the AMLC results: the Peak Signal-to-Noise Ratio (PSNR)-YUV, with the luma channel weighted by 6 and each chroma channel weighted by 1; its luma (Y) and chroma (UV, with U and V averaged) components; the PSNR-RGB, which is commonly used to assess learned compression models [27]; the PSNR-HVS-M [32], which is still based on PSNR but takes Human Visual System (HVS) characteristics (frequency and contrast) into consideration; and the MS-SSIM metric [33], which also tries to improve the correlation with the HVS by considering the image structural information. We compute all the metrics using the JPEG AI Quality Assessment Framework [29].
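The weighted PSNR-YUV can be sketched as follows. This assumes the common 6:1:1 convention in which the weighted sum is normalized by 8, and that the chroma component is the plain average of the U and V PSNR values; both are assumptions, as are the function names and sample values. The actual measurements in this work come from the JPEG AI Quality Assessment Framework [29].

```python
import math


def psnr(mse_value, max_value=255.0):
    """PSNR in dB for a given mean squared error (8-bit range by default)."""
    if mse_value <= 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse_value)


def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Luma weighted by 6, each chroma by 1; normalized by 8 (assumed)."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0


def psnr_uv(psnr_u, psnr_v):
    """Chroma component as the average of U and V PSNR (assumed)."""
    return (psnr_u + psnr_v) / 2.0


# Illustrative per-channel PSNR values in dB.
combined = psnr_yuv(36.0, 40.0, 42.0)
chroma = psnr_uv(40.0, 42.0)
```

The heavy luma weight reflects that, in YUV420 content, luma carries four times as many samples as each chroma plane and dominates perceived quality.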
3) Coding Efficiency
After measuring the rate and quality, we can compare the coding efficiency of different encoders using the Bjøntegaard Delta (BD) with polynomial fitting [34], [35]. For BD computation purposes, we generate the convex hull RD curve from the different coding configurations of our solution. Although the original BD formulation is based on the PSNR metric, it can be applied to MS-SSIM after converting it as follows [35]:\begin{equation*} \text { MS-SSIM}_{\text {dB}} = -10 \log _{10}{(1-\text {MS-SSIM})} \tag {5}\end{equation*}
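Eq. 5 maps MS-SSIM onto a logarithmic scale so that it can be used like a PSNR value in the BD computation. A minimal sketch follows; the function name and the input validation are assumptions added for illustration.

```python
import math


def ms_ssim_to_db(ms_ssim):
    """Eq. 5: convert an MS-SSIM score in [0, 1) to decibels."""
    if not 0.0 <= ms_ssim < 1.0:
        raise ValueError("MS-SSIM must be in [0, 1) for the dB conversion")
    return -10.0 * math.log10(1.0 - ms_ssim)


# Scores close to 1 are spread out on the dB scale.
db_090 = ms_ssim_to_db(0.90)
db_099 = ms_ssim_to_db(0.99)
```

The conversion matters because MS-SSIM values of good codecs cluster near 1, where differences are hard to compare on a linear scale.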
Experimental Results
In this section, we evaluate the experimental results by analyzing RD curves, coding efficiency results, complexity, and a subjective evaluation.
A. RD Evaluation
Fig. 8 shows the average RD results on Vimeo90k. The AMLC curves are generally above the SHVC curves for all quality metrics: the BL QP 37 curve shows the best results, while the BL QP 22 curve is the closest one to the SHVC curve. At the highest EL quality level (level 6), the AMLC model with BL QP 37 has higher compression rates without losing quality compared to lower QPs. Although counter-intuitive, this happens because the smaller the BL bitstream size, the larger the space available for EL optimizations, which is the only result evaluated by the quality metric. As the EL rate decreases, lower BL QPs sustain better quality results than the others. The BL QP 22 curve, for instance, does not present the best RD results, but it has the best quality results at lower EL levels, as clearly, but not exclusively, observed in Fig. 8(f).
Average RD results of the SHVC, FL-codec and the proposed AMLC model on the Vimeo90k dataset for each quality metric. The QP is used to specify an anchor BL coding parameter for AMLC.
When compared to the FL-codec, the AMLC achieves higher compression rates while maintaining similar quality results when combining BL QPs 32 and 37, and EL levels 4 and 6. However, to fairly compare both implementations, it is first necessary to assess their BL implementation performance. In the following subsection, we present the coding efficiency comparisons to objectively demonstrate the AMLC advantages.
B. Coding Efficiency Comparisons
Table 3 reports the AMLC coding efficiency results, comparing it to the SHVC. As one could expect from the Vimeo90k RD curves, the AMLC solution has significant coding efficiency gains compared to SHVC, demonstrating the benefits of using a learned EL solution. It is noticeable that the highest gains in all datasets are for the UV components. Nevertheless, the gains over the SHVC appear not only on PSNR-UV, PSNR-RGB, and MS-SSIM, but also on PSNR-Y and PSNR-HVS.
The ablation study of Mei et al. [14] has shown that training and evaluating a learned model with YUV420 media is not as promising as conducting the research in the RGB domain: their proposal has slightly worse results than the SHVC codec, particularly for PSNR-Y. In this paper, the YUV420 domain is relevant because handcrafted video codecs typically operate and are optimized in that domain. Although we do not train with YUV420 as suggested by Mei et al. [14], we guarantee a lossless YUV420 to RGB conversion, and vice versa. Therefore, we train on RGB without using data removed by the YUV420 chroma subsampling. Our coding efficiency gains over SHVC for both chroma (UV) and luma (Y) indicate that our training method is a viable alternative to directly training on YUV420.
HEVC and SHVC introduced tools to improve the coding efficiency of samples with higher resolution than that of the Vimeo90k samples [36]. Our model, in turn, was trained only with the low-resolution images from the Vimeo90k dataset. Therefore, SHVC is more likely to achieve better results on the CLIC datasets. Relative to the results on the Vimeo90k dataset, AMLC has reduced coding efficiency gains compared to SHVC on the CLIC datasets. Nevertheless, the AMLC still performs better in all these cases, with a BD-rate of −6.47% in the worst case.
Before comparing the AMLC results with the FL-codec, it is necessary to first understand the coding efficiency results between their respective BL codecs. Table 4 summarizes the BD results comparing the handcrafted and learned BLs for the PSNR-based metrics. The handcrafted BL is less efficient than the learned BL by 33.68% and −1.5 dB in the worst case. However, from Table 5, we observe that AMLC delivers similar or slightly better coding efficiency results than FL-codec across different quality metrics and dataset combinations. Taking the PSNR-RGB evaluation on CLIC-Professional, for instance, our method loses by 0.6% and 0.03 dB compared to the FL-codec. However, this is still a remarkable achievement when considering the handcrafted BL’s coding efficiency disadvantage.
Finally, we made an equivalent implementation of the AMLC model, but adding residual connections on both the encoding and decoding sides, similarly to a usual multi-layer implementation as depicted in Fig. 2. Table 6 summarizes the coding efficiency results comparing the AMLC model to the modified one. Both present similar coding efficiency in terms of quality for all evaluated metrics. Considering the coding efficiency comparisons in terms of bitrate, the AMLC is slightly better for PSNR-YUV and MS-SSIM, while losing for PSNR-RGB and PSNR-HVS. Nevertheless, the close results between those models are evidence that the elimination of the encoding-side inter-layer dependency, adopted in AMLC, does not negatively affect the overall coding efficiency compared to a usual multi-layer compression implementation.
One can optimize learned models for other metrics through loss function tuning, training them to maximize the MS-SSIM quality instead of minimizing the MSE, for instance. By training our model with MS-SSIM, we might improve MS-SSIM results. However, the coding efficiency gains on that metric, even considering the MSE-based training, are evidence of our proposed model’s robustness. Therefore, we did not conduct the MS-SSIM-based training.
C. Complexity
Table 7 reports the encoding and decoding time for Vimeo90k, in seconds, of different codecs and components. The Handcrafted-BL is the BL codec used in both SHVC and AMLC. The FL-Codec uses both Learned-BL and Learned-EL, which are the Cheng2020 model operating at different layers. Meanwhile, the AMLC EL combines both Learned-EL and the Merge Block. Compared to the learned model, the handcrafted BL’s faster execution times compensate for its coding efficiency disadvantage: the handcrafted codec is at least
Table 8 has a similar report as Table 7 but considers the compression of images at 1080p resolution. When compressing at a higher resolution, the handcrafted BL is
Considering the whole multi-layer coding implementation, the handcrafted codec also overcomes – as expected – both the AMLC and the FL-codec in terms of complexity. Compared to the AMLC, the SHVC is, in the worst case,
The proposed asymmetric approach and the use of a handcrafted BL give AMLC an advantage over the FL-codec: our solution is at least
D. Visual Comparisons
Fig. 9 shows random samples from the Vimeo90k validation dataset. At the highest quality configuration, the AMLC produces similar visual quality results compared to the other methods, but with a higher compression rate. These samples and Fig. 10 also demonstrate that the AMLC is not producing visual artifacts such as checkerboard patterns and color shifting.
Random samples from Vimeo90k with each codec configured to produce the highest quality.
Patches from CLIC-Professional images, demonstrating that AMLC is not producing color shifting or checkerboard pattern artifacts.
Conclusion
In this work, we propose the AMLC model, exploring the concept of multi-layer compression to combine handcrafted and learned compressors within an asymmetric approach. AMLC successfully reduces complexity compared to a fully learned multi-layer solution and is faster for both coding and decoding. Moreover, the handcrafted BL solution is not only at least
Regarding coding efficiency, the AMLC model outperforms the SHVC reference software, achieving a BD-Rate gain of −62.42%. Furthermore, AMLC overcomes the challenges posed by the use of a less coding-efficient BL and the lack of inter-layer dependencies on the encoding side, producing results with similar coding efficiency as the fully learned multi-layer model and its equivalent implementation with residual connections on both the encoding and decoding sides. This emphasizes the advantage of using an asymmetric learned compression solution on the EL. The reported gains are corroborated by the analysis considering multiple quality metrics and datasets.
Although we brought an asymmetric multi-layer solution to the discussion, we recognize that there is room for future improvements. We implemented a model and validated the AMLC solution against the relevant codecs, but the asymmetric approach and the model presented can serve as a framework. Therefore, we envisage experiments with other components, for instance aiming to reduce complexity by adopting a less complex learned compression solution for the EL and Merge Block.
In this work, we performed experiments on single images. Expanding the proposed solution with frame interpolation for video technology, for instance, requires careful consideration of the learned coding complexity, which is already significant for single-image coding. Moreover, video content may vary along different scenes, affecting the bitrate. This paper and related works demonstrate that learned compression can achieve higher coding efficiency compared to traditional handcrafted codecs. However, extending AMLC to videos requires more research on rate control algorithms to address bitrate and bandwidth variations when transmitting the media.
Finally, experiments reproducing the method with VVC reference software might help compare our solution to the ones proposed in [16], [17], and [18]. However, we highlight that the AMLC model was fairly compared against relevant and openly available implementations [14], [15], [27], leading to a wider assessment coverage when compared to related works.