
Asymmetric Multi-Layer Compression: Decoupling Inter-Layer Coding Dependencies With a Learned Model

Proposed AMLC model. We adopt a handcrafted-based BL and a learned-based EL, removing the ILP dependency between those layers at the encoding side.

Abstract:

Multi-layer compression is employed to achieve scalability by providing a base bitstream and additional enhancement layers to be used when extra bandwidth is available. In this work, we combine handcrafted and learned solutions, isolating the former in the base layer and using the latter in the enhancement layer. Moreover, the proposed Asymmetric Multi-layer Compression (AMLC) model decouples the base and enhancement layers at the encoding side, leveraging the end-to-end enhancement layer model to simplify the coding process. AMLC encodes an image $1.28\times$ faster than a fully learned multi-layer codec while maintaining similar quality and coding efficiency despite the lower quality of the handcrafted base layer used in AMLC. While the proposed asymmetric encoder together with the handcrafted base layer reduces the overall complexity, the learned enhancement layer brings benefits in terms of coding efficiency. AMLC outperforms the coding efficiency of the Scalable HEVC (SHVC) reference software, a handcrafted multi-layer codec, achieving BD-Rate gains of -27.86%, -34.19%, and -26.57% for PSNR-YUV, PSNR-RGB, and MS-SSIM quality metrics, respectively. Finally, an equivalent implementation of the proposed model adding the inter-layer dependency on the encoding side produces results that are similar to those of AMLC, confirming its potential to simplify multi-layer compression without significant losses in coding efficiency.


Published in: IEEE Access (Volume: 13)

Page(s): 31671 - 31682

Date of Publication: 17 February 2025

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2025.3542979


CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

The advancements in Neural Network (NN) technology and the hardware improvements that made it possible to train models of increasing sizes have been promoting an increasing research effort on learned solutions for image and video processing tasks [1]. This is because NNs have interesting properties like loss-function choice flexibility and off-the-shelf libraries capable of maximizing parallelism on CPU/GPU platforms.

At first, spatial Super-Resolution (SR) [2], [3], [4] and Video Frame Interpolation (VFI) [5], [6], [7] tasks explored NN technology to improve the state-of-the-art results. However, after Ballé, Laparra, and Simoncelli [8] presented a solution to the non-differentiable quantization problem, further research has been conducted to improve learned image and video compression with more sophisticated models [9], [10], [11], [12]. These solutions follow the single-layer model from Fig. 1: an encoder receives the original signal and produces a standardized compressed bitstream, then a decoder parses that bitstream and runs the necessary steps to reconstruct the signal.

FIGURE 1.

Single-layer image compression model. Both the input and output images have three channels, width W and height H.


Recent works are expanding the learned solutions to employ multi-layer compression [13], [14] (Fig. 2) using concepts and techniques inherited from scalable, handcrafted codecs like the Scalable HEVC (SHVC) [15]. As shown in Fig. 2, two or more layers perform the compression in this approach. The original media is first downsampled (1) and then compressed by a Base Layer (BL) (2), which is usually a single-layer codec. An inter-layer processing step transforms the BL reconstructed media into an Interlayer Prediction (ILP) (3a) to be used by the Enhancement Layer (EL) (4a). In the model shown in Fig. 2, an upsampling method produces the ILP. If allowed by the standard, additional BL signals (e.g., motion vectors) may be passed along to improve compression. On the decoding side, the BL decoder reconstructs the media in a lower resolution (3b). Reflecting the structure of the encoder side, the upsampling produces the ILP (4b), which the EL decoder uses to reconstruct the media at its original resolution (5).

FIGURE 2.

Multi-layer image compression model. In addition to the EL output, the BL is capable of reproducing the image at half its original resolution.


Multi-layer codecs are designed to have functionalities that are not provided by single-layer codecs [14]. The most straightforward one is the capability of efficiently generating bitstreams for different resolutions and/or quality levels, so that the reconstructed media achieves a level that satisfies the transmission or device requirements. Although it is possible to find fully learned multi-layer solutions in the literature [13], [14], they might bring a prohibitive overhead in pipelines that employ handcrafted codecs, since there is limited hardware support for this technology on user devices such as smartphones.

A few works propose to combine handcrafted- and learned-based compression solutions, employing the former as the BL and the latter as the EL [16], [17], [18]. Such an approach allows for reproducing the BL media on well-established platforms at a low complexity cost. By isolating the NN solutions on the EL, it is possible to explore cutting-edge compression technology while avoiding the challenges in the previously mentioned approaches.

In this work, we explore the hybrid handcrafted- and learned-based combination. But in contrast to the existing works that adopt a multi-layer approach as the one in Fig. 2, we propose an asymmetric multi-layer solution – Asymmetric Multi-layer Compression (AMLC) –, as depicted in Fig. 3. Notice that no data goes from the BL to the EL at the encoder side. Therefore, the steps necessary for BL encoding (1a and 2) and EL encoding (1b) can be executed in parallel. In such an approach, the EL codec is trained as an end-to-end model, optimizing the encoder to transmit relevant information for the final image reconstruction. Although we removed the ILP at the EL encoding side, that prediction is still available at the decoding side, where all the steps follow the same structure as the ones depicted in Fig. 2. Therefore, our main contributions are:

We propose a novel solution simplifying the multi-layer coding process. To demonstrate the benefits of our AMLC model, we compare its results with those of fully handcrafted and fully learned multi-layer models, employing three datasets and six quality metrics for coding efficiency evaluation.
The AMLC model significantly enhances coding efficiency, outperforming the SHVC reference software by up to 62.41%.
Considering the complexity, the proposed AMLC is $1.28\times $ faster to encode than the fully learned model. We demonstrate that such a simplification does not compromise our model’s overall coding efficiency.
We reinforce the benefits of combining handcrafted BL and learned EL. The handcrafted BL is faster to encode and decode than the learned implementation, and it is compatible with existing software and hardware platforms. Meanwhile, the learned EL delivers high-quality results and allows for simplifications such as the proposed one.

FIGURE 3.

Proposed AMLC model. We adopt a handcrafted-based BL and a learned-based EL, removing the ILP dependency between those layers at the encoding side.
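The encoder-side independence shown in Fig. 3 can be illustrated with a minimal sketch. The three functions below are hypothetical placeholders (they call neither an HEVC encoder nor a neural network); the only point is that the EL encoder takes the original image alone, so both layers can be submitted for encoding at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def downsample(image):
    # Placeholder for step 1a: keep every second sample in both dimensions.
    return [row[::2] for row in image[::2]]


def encode_base_layer(image):
    # Placeholder for steps 1a and 2: downsample, then "encode" the result.
    return ("BL", downsample(image))


def encode_enhancement_layer(image):
    # Placeholder for step 1b: the EL encoder sees only the original image,
    # never the BL reconstruction or an inter-layer prediction.
    return ("EL", image)


def encode_amlc(image):
    # Neither task consumes the other's output, so both can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        bl_future = pool.submit(encode_base_layer, image)
        el_future = pool.submit(encode_enhancement_layer, image)
        return bl_future.result(), el_future.result()
```

In a conventional multi-layer encoder (Fig. 2), the second task would need the decoded output of the first, which forces the two calls to run one after the other.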


The remainder of this paper is organized as follows. Section II brings related works using NN for image and video processing and compression. Then, Section III presents the motivations for the AMLC model, supported by the literature review. Section IV introduces the AMLC model and its components. Section V details the training and evaluation methods. The experiment results are presented and further analyzed in Section VI. Finally, Section VII draws conclusions and proposes future work.

SECTION II.

Related Work

In this section, we review the works that have proposed and advanced single-layer learned compression, as well as those ones encompassing multi-layer compression with handcrafted and learned solutions.

A. Single-Layer Learned Compression

In image compression, the quantization step is essential to allow lossy compression, thus achieving high compression rates. However, the fact that such a step is not differentiable poses a great challenge to NN-based compression solutions. To mitigate this problem, Ballé et al. [8] and Ballé et al. [9] proposed a framework that approximates the quantization effects using uniform noise addition during training. After those works were published, others have shown remarkable results, further advancing the state-of-the-art in handcrafted encoders. Ballé et al. [19] proposed applying a hyperprior coding to predict the $\sigma $ values of a zero-mean Gaussian distribution estimating the arithmetic coding probabilities (eq. 1 with $K = 1$ , $w_{1} = 1$ , and $\mu _{1} = 0$ ).
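The uniform-noise approximation mentioned above can be sketched as follows. This is an illustration only, assuming unit-step quantization and noise drawn from an interval of width one centered at zero; the function names are hypothetical and the code operates on plain Python lists rather than tensors.

```python
import math
import random


def quantize_inference(latents):
    # Hard rounding to the nearest integer, as applied at inference time.
    # floor(v + 0.5) avoids the round-half-to-even behavior of round().
    return [math.floor(v + 0.5) for v in latents]


def quantize_training(latents, rng):
    # Training-time proxy: additive uniform noise keeps the operation
    # differentiable with respect to the latents.
    return [v + rng.uniform(-0.5, 0.5) for v in latents]
```

Both outputs stay within half a quantization step of the input, which is why the noisy signal can stand in for the rounded one during training.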

The works that followed have proposed further contributions to hyperprior coding. Minnen et al. [10] and Lee et al. [11] proposed methods to predict the mean used in the Gaussian distribution, relying on the context of previously decoded data (eq. 1 with $K = 1$ and $w_{1} = 1$ ). Cheng et al. [12] proposed the model depicted in Fig. 4, generalizing the probability estimation considering the weighted sum of different Gaussian distributions as follows:\begin{equation*} p(\hat{y}) \sim \sum_{k=1}^{K} w_{k} N(\mu_{k}, \sigma^{2}_{k}) \tag{1}\end{equation*}
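As a rough illustration of how eq. 1 can yield a probability for arithmetic coding, the sketch below evaluates the mixture on an integer symbol. It assumes that each Gaussian is integrated over a unit-width bin centered on the symbol, a common choice in learned compression frameworks that is not spelled out in the text above; the function names are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    # Cumulative distribution function of N(mu, sigma^2).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_likelihood(y_hat, weights, means, sigmas):
    # Probability mass of the integer symbol y_hat under eq. 1, assuming
    # each component is integrated over [y_hat - 0.5, y_hat + 0.5].
    total = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        upper = gaussian_cdf(y_hat + 0.5, mu, sigma)
        lower = gaussian_cdf(y_hat - 0.5, mu, sigma)
        total += w * (upper - lower)
    return total


def symbol_bits(y_hat, weights, means, sigmas):
    # Ideal code length, in bits, for the symbol under the mixture model.
    return -math.log2(mixture_likelihood(y_hat, weights, means, sigmas))
```

With `weights=[1.0]` and `means=[0.0]` this reduces to the zero-mean hyperprior case of [19]; letting the mean vary corresponds to the setting of [10] and [11].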


FIGURE 4.

Cheng et al. [12] learned image codec proposal. Module $g_{a}$ transforms the image to the latent space y. On inference, the quantization Q produces a signal $\hat {y}$ . In the training stage, the uniform noise (U) approximates Q and produces a signal $\tilde {y}$ . The $g_{s}$ then transforms the latent space back into an image. The hyperprior analysis ($h_{a}$ ) and synthesis ($h_{s}$ ) provide side information to predict the probabilities for arithmetic coding given a context model ($C_{m}$ ).


Several early applications of NN for video compression targeted specific handcrafted video encoder modules while still relying on the well-known quantization step from those encoders [20]. Eventually, end-to-end solutions for video compression appeared, such as the DVC [21], a framework that maps handcrafted video encoder modules to equivalent learned models. Instead of using pre-trained models for optical-flow prediction as proposed in DVC, the SSF model [22] has motion estimation and compensation modules designed for video compression. The VCT model [23], in turn, explores the Transformer’s temporal capabilities to simplify the learned video compression modules.

B. Multi-Layer Compression

Currently, there are three major handcrafted multi-layer video compression standards: SHVC [15], Versatile Video Coding (VVC) [24], and Low Complexity Enhancement Video Coding (LCEVC) [25]. SHVC is a set of scalable tools based on the High Efficiency Video Coding (HEVC) standard, requiring the reconstructed frames from the BL and – if available – the motion information. The possibility of using only the reconstructed frames makes SHVC agnostic to the BL standard. In VVC, the successor of HEVC, the support for multi-layer compression was included as one of its features. LCEVC, in turn, is also agnostic to the BL standard, but there is a major difference between this and the aforementioned standards: LCEVC does not use any tools associated with existing coding standards. The LCEVC lossy compression was designed with low-complexity tools to encode the residues between the upsampled frames and the source frames.

End-to-end learned models are also being explored for multi-layer compression. Su et al. [13] proposed a model to achieve quality scalability, where each layer has a model inspired by [19], but allowing different layers to share parameters with recurrent structures. While the first layer compresses an image, the other layers compress residues to improve the previous layer quality, distributing the Rate-Distortion (RD) trade-offs throughout the multiple layers.

Mei et al. [14] proposed a model for learned multi-layer image compression, exploring quality and spatial scalability. Their model has a learned module to process the decoded latent space features from a given layer, generating a predictor for the EL above. Those same features – upsampled when using spatial scalability – are concatenated to the results for all the ELs above to improve the reconstruction. Both Su et al. [13] and Mei et al. [14] present fully learned end-to-end models, adapting the loss function for multi-layer compression. These works still use the RD optimization, but the rate and distortion are the sum of each layer’s estimated rate and distortion, weighted by a given factor.
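One possible reading of this weighted multi-layer RD objective is sketched below. The exact weighting used in [13] and [14] is not reproduced here: `layer_weights` and `lam` are illustrative parameters, and applying the same weight to a layer's rate and distortion is an assumption.

```python
def multilayer_rd_loss(rates, distortions, layer_weights, lam):
    # rates[i] and distortions[i] are the estimated rate and distortion of
    # layer i; layer_weights[i] scales the contribution of that layer.
    total_rate = sum(w * r for w, r in zip(layer_weights, rates))
    total_distortion = sum(w * d for w, d in zip(layer_weights, distortions))
    # Usual Lagrangian form: distortion scaled by lam, plus rate.
    return lam * total_distortion + total_rate
```

For example, `multilayer_rd_loss([0.2, 0.5], [10.0, 4.0], [1.0, 1.0], 0.01)` combines a total rate of 0.7 with a total distortion of 14.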

Multi-layer compression has also been proposed recently as a solution to produce a standard-compliant bitstream using a handcrafted solution in the BL while isolating learned solutions to the EL. Lee et al. [16] proposed a solution that employs a learned EL to transmit residual features used to enhance the BL reconstruction quality. The decoder synthesizes the enhanced images by combining the BL reconstruction and the EL compressed residues. The overall approach of Travers, Bonnineau et al. [17] is similar, but their contribution is centered on spatial scalability. Benjak et al. [18] address video compression, presenting a solution for both spatial and temporal scalability. Their work combines learned models for super-resolution, motion estimation, and residual compression. Those models are more complex than the ones in the aforementioned works, requiring several steps and presenting restrictions for training, such as a reduced training dataset for fine-tuning the whole model.

SECTION III.

Motivation for a Combined Handcrafted and Learned Asymmetric Compression

Table 1 summarizes the characteristics of the works presented in the previous section. As mentioned, the NN advancements are promoting the rise of new and different applications. In a short time span, learned compression achieved state-of-the-art coding efficiency results compared to handcrafted codecs, which have been optimized for decades. On the one hand, adopting learned solutions is challenging because it is necessary to train several model instances, that is, one for each RD constraint. Moreover, efficient encoding and decoding demand specific hardware architectures. On the other hand, multi-layer learned compression might achieve multiple RD constraints, but the hardware requirement issue remains unsolved for the works that apply learned solutions on the BL.

TABLE 1 Main Characteristics of Reviewed Works and Ours

As also presented in Table 1, instead of a fully learned or a handcrafted multi-layer solution, we combine the benefits of both learned and handcrafted solutions by assigning each to the appropriate layer: 1) the BL bitstream produced by a handcrafted codec is compliant with a standard already in use, while 2) on top of that, the EL becomes available for exploring learned latent spaces to enhance the media reconstruction. Furthermore, we take advantage of the end-to-end training to remove dependencies between the EL and the ILP extracted from the BL at the encoding side, an approach not explored by similar multi-layer models [16], [17], [18]. Introducing such an asymmetric approach to multi-layer compression allows for a higher level of parallelism at the encoder side. To the best of our knowledge, this is the first work to employ such a strategy.

SECTION IV.

Proposed Model: AMLC

In this section, we detail the proposed AMLC model components, as well as the loss function adaptation and training process.

A. Model Implementation

Fig. 3 shows the proposed AMLC model. Similarly to other multi-layer models, AMLC has a BL, an EL, and an inter-layer processing block. As argued in Section III, the BL uses a handcrafted codec, and we chose the HEVC reference software in our case study. Our decision to use HEVC instead of the latest standard (VVC [24]) allowed us to conduct a greater number of experiments due to the difference in execution time between the two reference software implementations. However, the AMLC model abstracts the BL implementation, allowing for future experiments to be conducted with different handcrafted implementations under the same method described in this work.

The inter-layer processing depends on the scalability type adopted in the system. In this work, we consider only spatial scalability, which applies to both image and video compression, implementing the inter-layer processing with the bicubic upsampling method. Despite advances in learned SR [2], [3], [4], we chose to employ a simpler solution for inter-layer processing. This approach avoids using an additional model that could require re-training and would increase the overall complexity, such as observed in [18].

Fig. 5 provides more details on the EL decoding implementation. The learned encoder and decoder correspond to the implementation presented by [12] that uses attention blocks, the state-of-the-art solution for image compression. The concatenation of the ILP and the learned decoder output feeds the Merge block. Our Merge block implementation – depicted in Fig. 6 – is similar to the U-net [26]. Although the U-net was proposed for an image segmentation problem, a similar module, commonly called feature extraction or feature merging, appears in research on other image processing tasks such as VFI [5], [6]. To avoid checkerboard visual artifacts in the images, we implement Down and Up blocks (Fig. 7) using average pooling and bilinear upsampling layers, respectively.

FIGURE 5.

EL decoding solution for the AMLC model depicted in Fig. 3.


FIGURE 6.

Merge block. On the left side, the Down blocks decrease the width (W) and height (H) dimensions and increase the number of feature maps by a factor of 2. The Up blocks on the right side perform the opposite process. The Conv2d parameters are the kernel size k, stride s, and $pad=1$ for padding. The number of feature maps depends on the parameter N.



FIGURE 7.

Merge block component details. The number of feature maps $C_{in}$ and c can be inferred from the dimensions presented in Fig. 6.


B. Two Stage Training and Loss Functions

The AMLC EL has two different learned components: the learned compressor and the Merge block. While there are pre-trained models available for the learned compressor, we had to train the Merge block from scratch. In order to reduce the training effort, we relied on the transfer learning technique, splitting the training into two stages and adapting the loss functions accordingly.

In the first stage, we transferred the weights of a learned image compressor into our compression block, froze them, and updated only the Merge block weights. We defined the training loss as follows:\begin{equation*} Loss = D(Ori, Rec), \tag{2}\end{equation*} where D can be computed from either the Mean Squared Error (MSE) or the Multiscale Structural Similarity Index Measure (MS-SSIM), with the respective functions $255^{2}\times \text{MSE}(.)$ – assuming 8-bit samples – or $1-\text{MS-SSIM}(.)$ [27]. Because the compressor is frozen, the compression rate is constant and has no role in this optimization.

The second stage consists of fine-tuning the whole model, which now includes the EL compressor training. In this stage, the EL compressor is no longer responsible for coding images; instead, it codes learned features that improve the final image reconstruction. Because training affects the compression, our loss function has to consider the RD trade-off as follows:\begin{equation*} Loss = \lambda D(Ori, Rec) + R_{\text{Total}}, \tag{3}\end{equation*} where $\lambda $ is the Lagrangian multiplier. Considering the multi-layer compression, we formulate the bitrate component ($R_{\text{Total}}$) as the sum of $R_{EL}$ and $R_{BL}$. $R_{EL}$ is the bitrate from the EL compressor, which we estimate during training following the formulations from [12] and [27]. We obtain the $R_{BL}$ bitrate directly from the BL compressed file, which is constant for any given sample.
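The two losses in eq. 2 and eq. 3 can be written out as a minimal sketch for the MSE case. It assumes samples normalized to the range [0, 1] (which is what the $255^{2}$ scaling suggests for 8-bit data) and rates expressed in bits per pixel; both are assumptions, and the function names are hypothetical.

```python
def mse(original, reconstruction):
    # Mean squared error over two flat sequences of normalized samples.
    squared = [(o - r) ** 2 for o, r in zip(original, reconstruction)]
    return sum(squared) / len(squared)


def distortion_mse(original, reconstruction):
    # D in eq. 2 and eq. 3 for the MSE case: 255^2 x MSE.
    return 255.0 ** 2 * mse(original, reconstruction)


def stage1_loss(original, reconstruction):
    # Eq. 2: the compressor is frozen, so only distortion is optimized.
    return distortion_mse(original, reconstruction)


def stage2_loss(original, reconstruction, lam, rate_el, rate_bl):
    # Eq. 3: R_Total is the sum of the estimated EL rate and the BL rate,
    # the latter being constant for a given sample.
    return lam * distortion_mse(original, reconstruction) + (rate_el + rate_bl)
```

Because `rate_bl` is constant per sample, it shifts the loss value without changing the gradient with respect to the EL parameters.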

Related works set lambda ($\lambda $) values to achieve different RD trade-off levels on each training of their learned compressor. Although we use a learned compressor implementation from the literature, we cannot directly use the $\lambda $ values reported by the authors [12] and [27] because we have changed the Loss equation. To maintain continuity between the training stages, we use the evaluation data distortion results from the end of the first training stage ($D_{S1}$) and the beginning of the second one ($D_{S2}$), selecting the $\lambda $ value that guarantees the following assertion:\begin{equation*} |D_{S1} - D_{S2}| \lt threshold, \tag{4}\end{equation*} empirically setting $threshold=1$ when training with the MSE.
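The selection rule in eq. 4 can be sketched as a simple filter over candidate values. How $D_{S2}$ is measured for each candidate is not detailed above; the sketch assumes it comes from short trial runs whose results are passed in as a dictionary, and picking the candidate with the smallest gap is an additional assumption.

```python
def select_lambda(d_s1, d_s2_by_lambda, threshold=1.0):
    # d_s1: evaluation distortion at the end of the first training stage.
    # d_s2_by_lambda: hypothetical mapping from a candidate lambda to the
    # evaluation distortion observed at the start of the second stage.
    gaps = {lam: abs(d_s1 - d_s2) for lam, d_s2 in d_s2_by_lambda.items()}
    admissible = {lam: gap for lam, gap in gaps.items() if gap < threshold}
    if not admissible:
        return None  # no candidate satisfies eq. 4
    # Among admissible candidates, keep the one closest to D_S1.
    return min(admissible, key=admissible.get)
```

A call such as `select_lambda(30.0, {0.001: 34.2, 0.005: 30.6, 0.01: 29.1})` keeps the two candidates within the threshold and returns the closer one.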

SECTION V.

Experimental Setup

In this section, we detail the experimental configurations, which include baseline implementations, model parameters, datasets, and evaluation metrics.

A. Experimental Configuration

1) Infrastructure

We performed training, inference, and runtime results gathering in a machine equipped with 16GB of RAM, a Ryzen 7 5700X CPU, and a single RTX3060 GPU with 12GB of VRAM.

2) Baselines

We use the SHVC reference software (SHM12.41) in the All-Intra configuration as a handcrafted multi-layer baseline codec. For the fully learned multi-layer codec (FL-codec), we use the Cheng2020 [12] in simulcast mode, following the method described by [14]. To generate the RD curves, we set the BL and EL operating at the same quality levels in both cases, using the Quantization Parameters (QPs) 27, 32, 37, and 42 for SHVC and levels 1, 3, 4, and 6 for FL-codec.

3) Dataset

We used the Vimeo90k – with samples at $448\times 256$ resolution – for training and evaluation, processing the data as follows. From each frame triplet from the dataset, we select the second frame only. Using FFmpeg [28], we converted each sample from RGB to YUV420. We converted them back to RGB by copying the chroma samples – without any interpolation procedure – and then using the color conversions adopted in the JPEG AI Quality Assessment Framework [29]. Notice that the final RGB image reflects the degradation from the chroma sub-sampling necessary for YUV420 conversion. However, the adopted conversion guarantees two properties: 1) subsequent YUV420 and RGB conversions will not accumulate any error; 2) because the learned compression model training uses RGB data, we ensure that our model cannot take advantage of any color information from RGB that was not available in the YUV420 representation. The described dataset manipulation is relevant for our work because the development and assessment of handcrafted compression methods use YUV420 data [30] – which is the case of our adopted BL encoder and the baseline reference.
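The first property can be illustrated on a single chroma plane. This is a minimal sketch, assuming chroma is upsampled by copying each sample into a 2x2 block and subsampled again by averaging each 2x2 block; the color-matrix step of [29] is not modeled, and the function names are hypothetical.

```python
def replicate_chroma(plane):
    # Upsample a chroma plane by copying each sample into a 2x2 block,
    # that is, without any interpolation.
    full = []
    for row in plane:
        wide = [value for value in row for _ in range(2)]
        full.append(wide)
        full.append(list(wide))
    return full


def subsample_chroma(plane):
    # Return to the 4:2:0 grid by averaging each 2x2 block.
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[r][c] + plane[r][c + 1]
             + plane[r + 1][c] + plane[r + 1][c + 1]) // 4
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]
```

Since every 2x2 block holds four identical values, the average returns the original sample exactly, so repeating the round trip does not accumulate error on the chroma plane.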

In addition to the Vimeo90k, we assess the performance of the AMLC model using the validation split from CLIC-Professional and CLIC-Mobile datasets [31]. CLIC-Professional has professionally captured images, ranging from $512\times 384$ up to $2048\times 1366$ resolution. CLIC-Mobile has smartphone images captured by regular users, ranging from $996\times 756$ up to $1520\times 2048$ resolution. We apply the center crop to CLIC dataset samples, forcing each of them to have both width and height multiples of 64, which ensures all codecs can operate without requiring padding.
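The center crop described above can be sketched as follows. The (left, top, right, bottom) box convention and the function name are illustrative choices, not taken from the original implementation.

```python
def center_crop_box(width, height, multiple=64):
    # Largest centered window whose width and height are multiples of
    # `multiple`, returned as (left, top, right, bottom).
    new_width = (width // multiple) * multiple
    new_height = (height // multiple) * multiple
    left = (width - new_width) // 2
    top = (height - new_height) // 2
    return left, top, left + new_width, top + new_height
```

For a $2048\times 1366$ sample, for instance, the width is kept and the height is reduced to 1344, the nearest multiple of 64 below 1366.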

4) Base Layer

The converted datasets have the original images to train the EL compressor, but we still need to pair them with the BL reconstruction for model training. We downsampled the YUV420 frames with FFmpeg using the bicubic method, scaling down each dimension to $0.5\times $ and then compressed each frame using the HEVC Test Model HM-16.42 on All-Intra configuration setting the BL QPs with the values of 22, 27, 32, and 37 [30]. We trained our model considering the results from each QP as an individual dataset to observe how the EL handles the BL encoded at different RD trade-offs.

5) Model Instances

We pre-loaded the learned compressor weights using the CompressAI checkpoints [27]. Although the model of Cheng et al. [12] has six compression levels available, we limited our research to four, selecting two with the highest compression results (levels 1 and 3), and two with the highest quality results (levels 4 and 6). The level 1 and 3 models set the kernel number to 128, whereas levels 4 and 6 set it to 192 [12], [27]. We set $N=8$ in the Merge block for all model instances. Combining the BL QPs and EL compression levels results in 16 model instances for training.
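The resulting training grid can be enumerated directly from the values stated above. The dictionary keys are hypothetical names chosen for this sketch.

```python
from itertools import product

BL_QPS = (22, 27, 32, 37)   # BL quantization parameters
EL_LEVELS = (1, 3, 4, 6)    # EL compression levels
MERGE_N = 8                 # N used in the Merge block for all instances


def kernels_for_level(level):
    # Levels 1 and 3 use 128 kernels; levels 4 and 6 use 192.
    return 128 if level in (1, 3) else 192


def model_instances():
    # One instance per (BL QP, EL level) pair.
    return [
        {
            "bl_qp": qp,
            "el_level": level,
            "kernels": kernels_for_level(level),
            "merge_n": MERGE_N,
        }
        for qp, level in product(BL_QPS, EL_LEVELS)
    ]
```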

6) Training Parameters

We used center crop patches of $256\times 256$ resolution, batch size 12, and the Adam optimizer. We set the learning rate of the first training stage to 1e-4, reducing it to 1e-5 in the second stage. We made our experiments using the MSE as the distortion metric in eq. 2 and 3. Table 2 presents the $\lambda $ values empirically found for each model configuration.

TABLE 2 $\lambda $ Training Parameters for Each Combination of BL QP and EL Level


B. Evaluation Metrics

1) Bitrate

We can obtain the bitrate by adding the total file size from each layer’s bitstream. We also use a bitrate metric that is relative to image resolution, called bits per pixel (bpp).
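The bpp computation can be sketched as below, assuming file sizes in bytes and normalization by the pixel count of the full-resolution (EL) image; the latter is an assumption, since the reference resolution is not stated explicitly.

```python
def bits_per_pixel(bl_bytes, el_bytes, width, height):
    # Total bits across both layers divided by the number of pixels.
    total_bits = 8 * (bl_bytes + el_bytes)
    return total_bits / (width * height)
```

For a $448\times 256$ sample, BL and EL files of 1792 and 5376 bytes would give 0.5 bpp.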

2) Quality

In this work, we employed six different objective quality metrics to assess the AMLC results: the Peak Signal-to-Noise Ratio (PSNR)-YUV (with the luma channel weighted by 6 and the chroma channels each weighted by 1) and its components luma (Y) and chroma (U and V averaged); the PSNR-RGB, which is commonly used to assess learned compression models [27]; the PSNR-HVS-M [32], which is still based on PSNR but takes the Human Visual System (HVS) characteristics into consideration (frequency and contrast); and the MS-SSIM metric [33], which also tries to improve the correlation with the HVS by considering the image structural information. We compute all the metrics using the JPEG AI Quality Assessment Framework [29].
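The weighted PSNR-YUV and the averaged chroma PSNR can be sketched as follows. Normalizing the 6:1:1 weights by their sum (8) is an assumption based on common practice; the actual computation in this work is delegated to the framework in [29].

```python
import math


def psnr_from_mse(mse, peak=255.0):
    # PSNR in dB for a given mean squared error and peak sample value.
    return 10.0 * math.log10(peak ** 2 / mse)


def psnr_yuv(psnr_y, psnr_u, psnr_v):
    # Luma weighted by 6, each chroma channel by 1, normalized by 8.
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0


def psnr_uv(psnr_u, psnr_v):
    # Chroma component reported as the average of U and V.
    return (psnr_u + psnr_v) / 2.0
```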

3) Coding Efficiency

After measuring the rate and quality, we can compare the coding efficiency of different encoders using the Bjøntegaard Delta (BD) with polynomial fitting [34], [35]. For BD computation purposes, we generate the convex hull RD curve from the different coding configurations of our solution. Although the original BD formulation is based on the PSNR metric, it can be applied to MS-SSIM after converting it as follows [35]:\begin{equation*} \text{MS-SSIM}_{\text{dB}} = -10 \log_{10}{(1-\text{MS-SSIM})} \tag{5}\end{equation*}
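Eq. 5 translates directly into a one-line helper; the function name is hypothetical.

```python
import math


def ms_ssim_to_db(ms_ssim):
    # Eq. 5: map MS-SSIM in [0, 1) to a logarithmic (dB) scale.
    return -10.0 * math.log10(1.0 - ms_ssim)
```

An MS-SSIM of 0.9 maps to 10 dB and 0.99 to roughly 20 dB, so each additional "nine" adds about 10 dB. The conversion is undefined for a value of exactly 1.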

SECTION VI.

Experimental Results

In this section, we evaluate the experimental results by analyzing RD curves, coding efficiency results, complexity, and a subjective evaluation.

A. RD Evaluation

Fig. 8 shows the average RD results on Vimeo90k. The AMLC curves are generally above the SHVC curves for all quality metrics: the BL QP 37 curve shows the best results, while the BL QP 22 curve is the closest one to the SHVC curve. At the highest EL quality level (level 6), the AMLC model with BL QP 37 achieves higher compression rates without losing quality compared to lower QPs. Although counter-intuitive, this happens because the smaller the BL bitstream size, the larger the space available for EL optimizations, which is the only result evaluated by the quality metric. As the EL rate decreases, lower BL QPs sustain better quality results than the others. The BL QP 22 curve, for instance, does not present the best RD results, but it has the best quality results at lower EL levels, as clearly – but not exclusively – observed in Fig. 8(f).

FIGURE 8.

Average RD results of the SHVC, FL-codec and the proposed AMLC model on the Vimeo90k dataset for each quality metric. The QP is used to specify an anchor BL coding parameter for AMLC.


When compared to the FL-codec, the AMLC achieves higher compression rates while maintaining similar quality results when combining BL QPs 32 and 37, and EL levels 4 and 6. However, to fairly compare both implementations, it is first necessary to assess their BL implementation performance. In the following subsection, we present the coding efficiency comparisons to objectively demonstrate the AMLC advantages.

B. Coding Efficiency Comparisons

Table 3 reports the AMLC coding efficiency results, comparing it to the SHVC. As one could expect from the Vimeo90k RD curves, the AMLC solution has significant coding efficiency gains compared to SHVC, demonstrating the benefits of using a learned EL solution. It is noticeable that the highest gains in all datasets are for the UV components. Nevertheless, the gains over the SHVC are not only better on PSNR-UV, PSNR-RGB, and MS-SSIM, but also on PSNR-Y and PSNR-HVS.

TABLE 3 Average BD Results Comparing the AMLC Against the SHVC

Mei et al. [14]’s ablation study has shown that training and evaluating a learned model with YUV420 media is not as promising as conducting the research in RGB domain: their proposal has slightly worse results than the SHVC codec, particularly for PSNR-Y. In this paper, the YUV420 domain is relevant because handcrafted video codecs typically operate and are optimized on that domain. Although we do not train with YUV420 as suggested by Mei et al. [14], we guarantee a lossless YUV420 to RGB conversion, and vice versa. Therefore, we train on RGB without using data removed with the YUV420 chroma subsampling. Our coding efficiency gains over SHVC for both chroma (UV) and luma (Y) indicate that our training method is a viable alternative to directly training on YUV420.

HEVC and SHVC introduced tools to improve the coding efficiency of samples with higher resolution than the ones of Vimeo90k samples [36]. Our model, in turn, was trained only with the low resolution images from Vimeo90k dataset. Therefore, SHVC has higher odds to achieve better results in the CLIC datasets. Relative to the results on the Vimeo90k dataset, AMLC has reduced coding efficiency gains compared to SHVC on CLIC datasets. Nevertheless, the AMLC still performs better in all these cases, with an improvement of −6.47% on BD-rate in the worst case.

Before comparing the AMLC results with the FL-codec, it is necessary to first understand the coding efficiency results between their respective BL codecs. Table 4 summarizes the BD results comparing the handcrafted and learned BLs for the PSNR-based metrics. The handcrafted BL is less efficient than the learned BL by 33.68% and −1.5dB in the worst case. However, from Table 5, we observe that AMLC delivers similar or slightly better coding efficiency results than FL-codec across different quality metrics and dataset combinations. Taking the PSNR-RGB evaluation on CLIC-Professional, for instance, our method loses by 0.6% and 0.03dB compared to the FL-codec. However, this is still a remarkable achievement when considering the handcrafted BL's coding efficiency disadvantage.

TABLE 4 Average Handcrafted BL Coding Efficiency Loss Compared to the Learned One

TABLE 5 Average BD Results Comparing the AMLC Against the FL-codec

Finally, we made an equivalent implementation of the AMLC model, but adding residual connections on both the encoding and decoding sides, similar to a usual multi-layer implementation as depicted in Fig. 2. Table 6 summarizes the coding efficiency results comparing the AMLC model to the modified one. Both present similar coding efficiency in terms of quality for all evaluated metrics. In terms of bitrate, AMLC is slightly better for PSNR-YUV and MS-SSIM, while losing for PSNR-RGB and PSNR-HVS. Nevertheless, the close results between those models are evidence that eliminating the encoding-side inter-layer dependency, as adopted in AMLC, does not negatively affect the overall coding efficiency compared to a usual multi-layer compression implementation.

TABLE 6 Average AMLC Coding Efficiency Results on Vimeo90k Compared to an Equivalent Implementation Using Residual Connections at the Encoding-side

One can optimize learned models for other metrics by tuning the loss function, for instance by training the model to maximize MS-SSIM instead of minimizing the MSE. By training our model with MS-SSIM, we might improve the MS-SSIM results. However, the coding efficiency gains on that metric, even with MSE-based training, are evidence of our proposed model's robustness. Therefore, we did not conduct the MS-SSIM-based training.
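As an illustration of what tuning the loss function means here, a common rate-distortion objective in learned image compression is sketched below. The symbols are assumptions introduced for this sketch and may differ from the exact loss used to train AMLC: $R$ is the estimated bitrate, $\lambda$ is the trade-off weight, $x$ is the input image, $\hat{x}$ is the reconstruction, and $N$ is the number of samples.

$$
\mathcal{L} = R + \lambda \, D(x, \hat{x}), \qquad
D_{\mathrm{MSE}}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2, \qquad
D_{\mathrm{MS\text{-}SSIM}}(x, \hat{x}) = 1 - \mathrm{MS\text{-}SSIM}(x, \hat{x})
$$

Switching the training target from MSE to MS-SSIM then amounts to replacing $D_{\mathrm{MSE}}$ with $D_{\mathrm{MS\text{-}SSIM}}$ and retuning $\lambda$.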

C. Complexity

Table 7 reports the average encoding and decoding times on Vimeo90k, in seconds, for different codecs and components. The Handcrafted-BL is the BL codec used in both SHVC and AMLC. The FL-codec uses both Learned-BL and Learned-EL, which are the Cheng2020 model operating at different layers. Meanwhile, the AMLC EL combines the Learned-EL and the Merge Block. The handcrafted BL's faster execution times compensate for its lower coding efficiency relative to the learned model: the handcrafted codec is at least $2\times$ and $257\times$ faster to encode and decode, respectively. These results, together with the existence of well-established standards and fast implementations, motivated us to choose a handcrafted codec as the BL.

TABLE 7 Average Encoding and Decoding Times on Vimeo90k in Seconds

Table 8 presents a report similar to Table 7 but considers the compression of images at 1080p resolution. At this higher resolution, the handcrafted BL is $2.34\times$ faster to encode than the learned BL. Note, however, that the handcrafted codec also cannot output more than one image per second. This is because we generate the results with the HEVC reference software, which was developed as the golden model for coding efficiency and is not the fastest implementation of that standard. Even so, the reference software targets low-complexity decoding: at this resolution, the handcrafted decoder is $357.64\times$ faster than the learned decoder.

TABLE 8 Average Encoding and Decoding Times on $1920\times 1080$ Resolution Images in Seconds


Considering the whole multi-layer coding implementation, the handcrafted codec (SHVC) also outperforms, as expected, both the AMLC and the FL-codec in terms of complexity. Compared to the AMLC, the SHVC is, in the worst case, $1.13\times$ and $205.6\times$ faster to encode and decode on Vimeo90k, respectively. Once again, the encoding speed-up ratio does not change significantly at higher resolutions, but the SHVC is at least $328\times$ faster to decode than the learned solutions.

The proposed asymmetric approach and the use of a handcrafted BL give AMLC an advantage over the FL-codec: our solution is at least $1.28\times$ and $1.27\times$ faster to encode and decode, respectively. Although the asymmetric approach helps reduce execution times, the learned compressor, which is the same implementation for both AMLC and FL-codec, is still the most time-consuming module. Meanwhile, the Merge Block has a negligible impact on execution time: it represents less than 1% of the total decoding time. Therefore, the EL learned compressor is a candidate for future work focused on complexity optimization.
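The speed-up ratios and the Merge Block share quoted above are simple ratios of measured times. A minimal sketch of that arithmetic follows; the helper names are hypothetical and the timings are placeholders, not values from Table 7 or Table 8.

```python
def speedup(reference_seconds, candidate_seconds):
    """How many times faster the candidate is than the reference."""
    if reference_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("times must be positive")
    return reference_seconds / candidate_seconds


def time_share(component_seconds, total_seconds):
    """Percentage of the total time spent in one component."""
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return 100.0 * component_seconds / total_seconds


# Placeholder timings in seconds (NOT values from Table 7 or Table 8).
fl_codec_encode, amlc_encode = 3.0, 2.0
merge_block_decode, amlc_decode = 0.01, 2.0

print(f"encode speed-up: {speedup(fl_codec_encode, amlc_encode):.2f}x")
print(f"Merge Block share: {time_share(merge_block_decode, amlc_decode):.2f}%")
```

A ratio above 1 means the candidate is faster than the reference, which is the convention used for the $\times$ figures in this section.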

D. Visual Comparisons

Fig. 9 shows random samples from the Vimeo90k validation dataset. At the highest quality configuration, AMLC produces visual quality similar to that of the other methods, but with a higher compression rate. These samples and Fig. 10 also demonstrate that AMLC does not produce visual artifacts such as checkerboard patterns and color shifting.

FIGURE 9.

Random samples from Vimeo90k with each codec configured to produce the highest quality.


FIGURE 10.

Patches from CLIC-Professional images, demonstrating that AMLC does not produce color shifting or checkerboard pattern artifacts.


SECTION VII.

Conclusion

In this work, we propose the AMLC model, exploring the concept of multi-layer compression to combine handcrafted and learned compressors within an asymmetric approach. AMLC successfully reduces complexity compared to a fully learned multi-layer solution and is faster for both encoding and decoding. Moreover, the handcrafted BL solution is not only at least $2\times$ and $257\times$ faster to encode and decode, respectively, than the learned BL, but also relies on well-established platforms and coding standards for which efficient software and hardware implementations exist.

Regarding coding efficiency, the AMLC model outperforms the SHVC reference software, achieving a BD-Rate gain of −62.42%. Furthermore, AMLC overcomes the challenges posed by the use of a less coding-efficient BL and the lack of inter-layer dependencies on the encoding side, producing results with coding efficiency similar to that of the fully learned multi-layer model and of its equivalent implementation with residual connections on both the encoding and decoding sides. This emphasizes the advantage of using an asymmetric learned compression solution on the EL. The reported gains are corroborated by the analysis across multiple quality metrics and datasets.

Although we brought an asymmetric multi-layer solution to the discussion, we recognize that there is room for future improvements. We implemented a model and validated the AMLC solution against the relevant codecs, but the asymmetric approach and the model presented can also serve as a framework. Therefore, we envisage experiments with other components, for instance aiming to reduce complexity by adopting a less complex learned compression solution for the EL and Merge Block.

In this work, we performed experiments on single images. Expanding the proposed solution to video, with frame interpolation for instance, requires careful consideration of the learned coding complexity, which is already significant for single-image coding. Moreover, video content may vary across scenes, affecting the bitrate. This paper and related works demonstrate that learned compression can achieve higher coding efficiency than traditional handcrafted codecs. However, extending AMLC to videos requires more research on rate control algorithms to address bitrate and bandwidth variations when transmitting the media.

Finally, experiments reproducing the method with the VVC reference software might help compare our solution to the ones proposed in [16], [17], and [18]. However, we highlight that the AMLC model was fairly compared against relevant and openly available implementations [14], [15], [27], providing broader assessment coverage than related works.
