
Lossless Image Compression by Joint Prediction of Pixel and Context Using Duplex Neural Networks




Abstract:

This paper presents a new lossless image compression method based on the learning of pixel values and contexts through multilayer perceptrons (MLPs). The prediction errors and contexts obtained by MLPs are forwarded to adaptive arithmetic encoders, like the conventional lossless compression schemes. The MLP-based prediction has long been attempted for lossless compression, and recently convolutional neural networks (CNNs) are also adopted for the lossy/lossless coding. While the existing MLP-based lossless compression schemes focused only on accurate pixel prediction, we jointly predict the pixel values and contexts. We also adopt and design channel-wise progressive learning, residual learning, and duplex network in this MLP-based framework, which leads to improved coding gain compared to the conventional methods. Experiments show that the proposed method performs better than the conventional non-learning algorithms and also recent learning-based compression methods with practical computation time.
Published in: IEEE Access ( Volume: 9)
Page(s): 86632 - 86645
Date of Publication: 14 June 2021
Electronic ISSN: 2169-3536

SECTION I.

Introduction

The need for high-quality images is ever-increasing, and thus the techniques for high-quality image acquisition, display devices, storage, and high-speed image data transmission are developing rapidly. As a result, we now have ultra-high-quality displays and broadcasting services, and cameras with ever-higher pixel counts. Hence, the need for image and video compression is also ever-increasing.

Multimedia data are encoded by lossy or lossless compression methods, where the aim of the lossy compression is to minimize distortion for the given bit budget. In recent image and video codecs, the quality degradation is not easily noticeable at high bit rates, and thus lossy compression is widely used for image communication and storage. However, for image data such as medical, scientific, and artistic images, we may wish to preserve the original contents without any degradation caused by the compression. Hence, in this case, lossless compression methods are required.

The main elements of classical (non-learning) methods are transformation, prediction, quantization, and entropy coding. In lossy compression, the transformation-followed-by-quantization scheme is popular since it achieves high energy compaction and sparse representation. In the case of lossless compression, predictive coding is the more widely used approach. Some examples of non-learning lossless codecs are BPG [1], PNG [2], JPEG-LS [3], JPEG2000 [4], LCIC [5], WebP [6], FLIF [7], and JPEG-XL [8].

Deep neural networks have achieved significant successes in computer vision and signal processing and also provided promising solutions in lossy compression. Specifically, researchers have developed CNN-based compression methods [9]–​[11], where they extract features from the input image and then encode features into a compressed bitstream. These methods can be considered the same as the transformation-and-quantization method, and thus they do not seem to be suitable for lossless compression.

There have also been learning-based lossless compression methods that learn the probability model of the given symbols (pixel values or residual signals) [12]–[15], and hybrid coding methods that replace the predictor of non-learning encoders with a deep network and improved entropy encoders [16]–[18]. These methods perform much better than the best non-learning codec FLIF [7] in terms of compressed bits per pixel (bpp). However, their computation times are much larger than that of FLIF, by at least 10^{2} times even with GPU computation, making them less practical. To address this, Mentzer et al. [19], [20] proposed practical CNN-based lossless codecs, but their performance is only on par with FLIF.

In this paper, to achieve accurate prediction while requiring a reasonable computation time, we propose an efficient encoding method based on the use of simple MLPs. For the accurate prediction, we adopt the ideas of channel-wise, residual, progressive encoding, and duplex network into our MLP-based framework. To achieve a practical computation time, we jointly estimate only the mean (pixel prediction) and variance (context) of the prediction error model for each pixel. In addition, using the MLP reduces system complexity and power consumption compared to the use of CNN or RNN.

To be precise, we propose three encoding schemes for achieving better compression performance. First, we proceed with the prediction in a channel-wise progressive manner to exploit the correlation between color channels in the case of color inputs. Instead of estimating the means and variances of the three channels (Y, U, V) all at once, we initially perform the estimation for the Y channel, then utilize the information of Y to estimate U, and so on. Second, the prediction is processed in a residual manner to achieve more stable and accurate performance: instead of directly predicting the value of the encoding pixel, the difference between the encoding pixel and a neighboring one is predicted. Finally, we use duplex networks, each of which is specialized for a different type of patch (smooth or textured). Input patches are first classified as smooth or textured and then fed to the corresponding network. As a result, the statistics of each patch type are learned more easily than when learning all the patches with a single network.

In summary, the main contributions of this paper are as follows:

  • We propose a lossless image compression method, where we use MLPs for predicting pixel values and contexts simultaneously.

  • The proposed training method achieves stable and accurate predictions through channel-wise progressive learning, residual learning, and duplex networks.

  • The proposed encoder shows comparable or better performance than non-learning codecs and a recent CNN-based encoder [19], while requiring practical computation time.

We published a preliminary version of this MLP-based lossless compression scheme in a conference paper [21]. The differences of this work from our conference version are that we (1) explain the algorithm more clearly and in detail, (2) introduce a duplex network scheme (Section III-I) that differentiates the input characteristics and uses two different MLPs depending on the input properties, (3) find more efficient and effective support pixels (Section III-E) to help the MLPs perform more accurate prediction, and (4) conduct ablation studies (Section IV-C) to show the effect of each idea in the proposed framework. As a result, the new method consistently saves more bits than the conference version on several datasets, achieving a maximum bpp reduction of 11.4%.

SECTION II.

Related Work

A. Non-Learning Codecs

JPEG-LS [3] is a lossless and near-lossless image compression method that achieves high compression rates and low complexity. It is based on the LOCO-I, which predicts a pixel value by using its causal neighborhood of three pixels. Conditional expectation of the prediction error is estimated based on the sample mean within each context. The low complexity comes from the assumption that the prediction residuals follow a two-sided geometric distribution and the use of Golomb-like codes.

JPEG2000 [4] provides a lossless image compression method where a reversible wavelet transform is adopted for lossless inversion. Only the integer coefficients are used in this process, implying that no quantization is required. In the coding procedure, all bit planes have to be encoded.

PNG [2] performs compression in two steps: pre-compression by filtering and compression using DEFLATE [22]. The input image is first transformed into another representation using a prediction method, where a single filter method is used for the entire image. For each scan line, a filter type is chosen to transform the data so that it can be compressed more efficiently. After the pre-compression of the image, PNG uses DEFLATE, a lossless data compression algorithm combining LZ77 and Huffman coding.

WebP [6] was introduced as an alternative to the PNG format, and it supports lossless compression and translucent image compression. While preserving the fundamental techniques used in PNG, such as Huffman coding and color indexing transform, WebP introduced separate entropy codes for different color channels, 2D locality of backward reference distances, and color cache of recently used colors.

CALIC [23] is a context-based, adaptive, lossless image codec. It exploits many modeling contexts to condition a nonlinear predictor and adjusts the predictor to varying source statistics. By estimating only the expectation of prediction errors conditioned on different contexts, a large number of modeling contexts can be computed with practicability. Also, CALIC employs Gradient Adjusted Predictor (GAP) to cope with the intensity changes near the predicted pixel.

FLIF [7] employs Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC) as the entropy coder, which is a variant of CABAC [24]. The context in this method corresponds to a node of a decision tree, which is dynamically learned at the encoding procedure. The FLIF shows the best performance among the non-learning methods.

JPEG-XL [8] is a next-generation image codec whose architecture is based on traditional block-transform coding with upgrades to each component. While JPEG is based on the 8\times 8 DCT-II, JPEG-XL extends this to two-dimensional DCT-II transforms whose sides are 8, 16, or 32. Asymmetric Numeral Systems [25] is adopted as the entropy coder to boost the decoding speed. For the lossless mode, a variety of adaptive predictors is utilized, where a weighted average of their predictions is used.

B. Learning-Based Codecs

The MLP has long been used for sample prediction in time-series data. Recently, IPFCN [26] employed the MLP to improve the intra prediction in High Efficiency Video Coding (HEVC) [27]. There have also been several methods that use CNNs or RNNs for pixel prediction, which generally perform better than MLPs for image processing. These methods achieve outstanding performance by employing auto-regressive models that learn explicit distributions governed by the model structure [12]–[16], [28]. For example, PixelRNN [13] and PixelCNN [12] model each pixel with conditional distributions determined by all previously encoded pixels. Specifically, the joint distribution over an image \textbf {x} is modeled as the product of conditional distributions p(\textbf {x})=\prod _{i}p(x_{i}\vert x_{1},\ldots,x_{i-1}) , where x_{i} is a single pixel. REP-CNN [17] applies a dual prediction method to the input image to produce the residual errors: the current encoding pixel is first predicted by a traditional method, such as LOCO-I or CALIC, and the prediction residual is then predicted again by a deep-learning-based method. CBPNN [16] uses CNN-based predictors to estimate the residuals and then a context-tree-based bit-plane codec to encode the prediction errors. IDF [29], a flow-based compression method, proposed discrete flows for ordinal discrete data to resolve the information loss caused by quantization.

Although the CNN-based auto-regressive prediction methods provide accurate pixel predictions and thus high coding efficiency, it is difficult to use them in practice because they have to perform a CNN computation for every pixel. That is, they repeat quite a large number of CNN computations, for example, more than 10M times to compress a color UHD image. Hence, it usually takes more than an hour to compress a single HD image on an ordinary GPU. To tackle this problem of CNN-based auto-regressive coders, L3C [19] proposed a practical learning-based lossless image compression system, which introduces a fully parallel hierarchical probabilistic model for learning the predictor and the feature extractor. In the case of [20], the difference between the input image and its lossy compression is modeled and encoded, where the coding efficiency is leveraged by the power of the classical lossy compression algorithm.

In this paper, we claim that a simple classical MLP is still useful for lossless image compression when trained to predict the pixel values and contexts jointly, along with residual and progressive learning schemes. The proposed method achieves comparable or better performance than the non-learning codec FLIF and the CNN-based L3C that requires practical GPU-computation time. Also, our encoder and decoder require reasonable CPU-computation times due to the lightweight MLP that we employed. Compared to the CNN-based auto-regressive methods that achieve the least bpp but very long GPU-computation times, our method’s compression rate (bpp) comes between the FLIF and CNN-based auto-regressive methods.

SECTION III.

Method

This section first summarizes several terms that frequently appear in this paper and in the general scan-order predictive coding literature. First, the encoding pixel is the pixel currently being encoded, using the information from the already-encoded pixels. Causal neighbors are the already-encoded pixels to the left of and above the encoding pixel. We predict the encoding pixel using the causal neighbors, and the resulting error is termed the prediction error. The magnitudes of the prediction errors are not evenly distributed over the image but tend to be large in texture and edge areas. In our method, the MLP generates two outputs using the causal neighbors as the input: one is the pixel prediction (from which the prediction error is obtained), and the other is the context. The context reflects the local activity: regions with large local activity (due to textures, edges, etc.) tend to have large contexts, whereas regions with small local activity tend to have small contexts. The structure of the MLP that produces these outputs and the corresponding loss functions are explained in the rest of this section.

A. Overview of the Proposed Method

The proposed method proceeds in raster scan order, like common predictive lossless compression schemes. For color inputs, we transform a given RGB image into a YUV format through a reversible color transform. Then, for each encoding pixel x_{i}^{c} , where i is the pixel index and c (c\in \{Y,U,V\} ) is the channel index, the MLP predicts the pixel value \hat {x}_{i}^{c} and the context C_{i}^{c} using the causal neighbors as the input. The prediction error x_{i}^{c} - \hat {x}_{i}^{c} and the context C_{i}^{c} are forwarded to an adaptive arithmetic coder (AAC), which produces the final compressed bitstream. The AAC uses the probability distribution of the prediction error, and thus the estimation of the probability density function (pdf) strongly affects the compression performance. To make the entropy coding more precise in our framework, we employ N AACs denoted as \{AAC_{j}\}_{j=1}^{N} , where AAC_{j} learns the statistics of the j -th level of local activity. That is, AAC_{1} processes prediction errors from the regions with the smallest local activity, while AAC_{N} is specialized for the regions with the largest local activity. As stated previously, the context represents the magnitude of the local activity, and we quantize it into N levels, each of which corresponds to one of the N AACs. Note that the overall process proceeds in raster scan order, and in the order of the Y , U , V channels. The overall workflow is illustrated in Fig. 1.
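The overall flow can be summarized in a few lines. Below is a schematic, runnable Python sketch of this loop with trivial stand-ins (left-pixel prediction, a local-gradient "context", and plain error buckets in place of the trained MLPs and the adaptive arithmetic coders); it only illustrates how prediction, context quantization, and coder selection fit together, not the authors' implementation.

```python
import numpy as np

def encode_channel(channel, N=24):
    """Raster-scan sketch of Fig. 1: predict, form the error, pick an 'AAC' by context."""
    H, W = channel.shape
    buckets = [[] for _ in range(N)]                      # stand-ins for AAC_1 .. AAC_N
    for y in range(H):
        for x in range(W):
            left = int(channel[y, x - 1]) if x > 0 else 0
            up = int(channel[y - 1, x]) if y > 0 else 0
            pred = left                                   # stand-in for the MLP pixel prediction
            context = abs(left - up)                      # stand-in for the MLP context output
            j = min(context // 2, N - 1)                  # stand-in for the context quantizer
            buckets[j].append(int(channel[y, x]) - pred)  # prediction error routed to "AAC_j"
    return buckets

img = (np.arange(64).reshape(8, 8) % 7).astype(np.uint8)
errors_per_aac = encode_channel(img)
print(sum(len(b) for b in errors_per_aac))                # 64 errors, one per pixel
```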

FIGURE 1. The signal flowgraph of the proposed method. For each pixel to be encoded, its value and context are estimated in a channel-wise progressive way. The process of residual prediction is not shown in the figure for clarity. The estimated contexts are quantized and used to select the corresponding AACs. For brevity, we illustrate the binarization of the Y channel only; the flows of the other channel signals can be described similarly. The solid blue lines indicate the positive correspondence between the context and AAC, and the dotted blue lines the negative. The prediction error is then fed to the corresponding AAC, where each AAC learns the pdf of the prediction errors it receives. AAC_{1} receives prediction errors from the contexts with the lowest values, and thus its learned pdf shows the lowest variance; the pdfs of the data fed to the higher-indexed AACs exhibit increasing variance up to AAC_{N}.

B. Reversible Color Transform

RGB images generally have a significant correlation between the color channels. In general, we can enhance the compression efficiency of color images by decorrelating the color channels through a transform. For the lossless compression, the color transformation must be itself lossless, i.e., the inverse of the YUV back to the RGB should be lossless in integer arithmetic. We adopt the reversible color transform proposed in [30] that well approximates the conventional YUV transformation (used in most standard image/video compression).
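As an illustration of what such a transform looks like, the sketch below implements the well-known JPEG2000 reversible color transform and its exact integer inverse. The transform adopted from [30] may differ in its exact coefficients, so this is only a familiar stand-in, not the transform actually used in our implementation.

```python
def rgb_to_yuv_reversible(r, g, b):
    """JPEG2000-style reversible color transform (integer arithmetic only)."""
    y = (r + 2 * g + b) >> 2          # integer approximation of luma
    u = b - g                          # chroma differences
    v = r - g
    return y, u, v

def yuv_to_rgb_reversible(y, u, v):
    """Exact inverse in integer arithmetic, so no information is lost."""
    g = y - ((u + v) >> 2)
    b = u + g
    r = v + g
    return r, g, b

# Round-trip check on one pixel.
assert yuv_to_rgb_reversible(*rgb_to_yuv_reversible(200, 31, 7)) == (200, 31, 7)
```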

C. Channel-Wise Progressive Prediction

For each encoding pixel, the MLP generates its prediction value and context conditioned on the support pixels. We define the term support pixels as the pixels among the causal neighbors that are used as the input to the MLP. Since pixels far from the encoding pixel contribute little to the estimation, we employ the pixels within a short distance as the support pixels, where the “distance” is defined as in Fig. 2. We set the distance proportional to the image resolution; specifically, for low-resolution, 2K, and 4K UHD images, we set the distance to 1, 2, and 4, respectively.
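For concreteness, the following sketch gathers the causal neighbors within a given distance d, interpreted here as the Chebyshev distance with simple zero padding at the borders; this interpretation and the helper name are assumptions for illustration, and Fig. 2 defines the exact support used in the paper.

```python
import numpy as np

def gather_support(channel, row, col, d):
    """Collect causal neighbors (above, or to the left in the same row) within distance d."""
    support = []
    for dy in range(-d, 1):
        for dx in range(-d, d + 1):
            if dy == 0 and dx >= 0:
                continue                      # skip the encoding pixel and non-causal pixels
            y, x = row + dy, col + dx
            if 0 <= y < channel.shape[0] and 0 <= x < channel.shape[1]:
                support.append(int(channel[y, x]))
            else:
                support.append(0)             # zero padding at image borders (assumption)
    return support

img = np.arange(25).reshape(5, 5)
print(gather_support(img, 2, 2, 1))           # the 4 causal neighbors for d = 1
```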

FIGURE 2. Illustration of the selection of support pixels among the causal neighbors. The red pixel denotes the encoding pixel, and the green pixels denote the corresponding support pixels. Among the causal neighbors, pixels within the distance d are set as support pixels.

Formally, we describe the prediction process as
\begin{equation*} [\hat {x}_{i}^{c}, \; C_{i}^{c}] = f_{c}(x_{sup,i}),\tag{1}\end{equation*}
where f_{c} denotes the network for channel c and x_{sup,i} denotes the support pixels for the encoding pixel x_{i} . Note that Eq. 1 describes the situation where the estimations for the Y, U, V channels are performed concurrently, using the same support pixels. However, since some correlation still remains between Y , U , and V , as addressed in [31], a better estimation can be achieved for the U channel if we use more information from Y . Specifically, for the encoding of x_{i}^{U} , using x_{i}^{Y} can enhance the prediction performance. Likewise, for the prediction of V , exploiting the information from the Y and U channels can boost the performance. We refer to this modified prediction scheme as channel-wise progressive prediction.

We now describe the details of the channel-wise progressive prediction. We first predict the pixel value and the context of channel Y by using an MLP. This prediction is conditioned only on the support pixels since those are the only available information. Formally, f_{Y} takes the support pixels as input and generates three outputs: the pixel prediction \hat {x}_{i}^{Y} , the context C_{i}^{Y} , and the intermediate feature z_{i}^{Y} , where the intermediate feature consists of the node values of the last hidden layer of the MLP. For the prediction of the encoding pixel in U , we use the information from Y along with the support pixels in U . Specifically, f_{U} takes x_{i}^{Y} , the intermediate feature z_{i}^{Y} , and the support pixels to predict the encoding pixel x_{i}^{U} . Similarly, information from both the Y and U channels is used to predict the encoding pixel in V . Formally, f_{V} takes as input x_{i}^{Y} , x_{i}^{U} , z_{i}^{U} , and the support pixels. In summary, we formulate the channel-wise progressive prediction as
\begin{align*} [\hat {x}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(x_{sup,i}), \tag{2}\\ [\hat {x}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(x_{i}^{Y}, z_{i}^{Y}, x_{sup,i}), \tag{3}\\ [\hat {x}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(x_{i}^{Y}, x_{i}^{U}, z_{i}^{U}, x_{sup,i}).\tag{4}\end{align*}
Experiments will show that the channel-wise progressive prediction scheme brings about a 3.3% gain over the baseline (separate channel training) when trained with the DIV2K super-resolution dataset.
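A minimal PyTorch sketch of Eqs. 2 to 4 is given below. The hidden-layer sizes (four hidden layers of 64 units) follow Section IV-A, but the module name, the assumed number of support pixels, and the input layouts are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ChannelMLP(nn.Module):
    """One predictor f_c: input -> (pixel prediction, context, intermediate feature z)."""
    def __init__(self, in_dim, hidden=64, n_hidden=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.body = nn.Sequential(*layers)      # last hidden activation serves as z
        self.head = nn.Linear(hidden, 2)        # [prediction, context]

    def forward(self, x):
        z = self.body(x)
        pred, context = self.head(z).unbind(dim=-1)
        return pred, context, z

n_sup = 4                                        # assumed number of support pixels per channel (d = 1)
f_Y = ChannelMLP(3 * n_sup)                      # Eq. (2): support pixels of all channels
f_U = ChannelMLP(3 * n_sup + 1 + 64)             # Eq. (3): + x_i^Y and z_i^Y
f_V = ChannelMLP(3 * n_sup + 2 + 64)             # Eq. (4): + x_i^Y, x_i^U and z_i^U

x_sup = torch.randn(8, 3 * n_sup)                # a batch of 8 support-pixel vectors
x_Y = torch.randn(8, 1)                          # the actual Y pixels (already encoded/decoded)
x_U = torch.randn(8, 1)                          # the actual U pixels

y_hat, C_Y, z_Y = f_Y(x_sup)
u_hat, C_U, z_U = f_U(torch.cat([x_Y, z_Y, x_sup], dim=-1))
v_hat, C_V, _   = f_V(torch.cat([x_Y, x_U, z_U, x_sup], dim=-1))
```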

D. Residual Prediction

From the experiments, we found that the prediction error of the above progressive scheme (Eqs. 2 to 4) is still too large to yield state-of-the-art compression performance. Hence, we attempt to further reduce the dynamic range of the prediction errors by employing the residual signal, motivated by ResNet [32]. We subtract one of the pixel values from all the support pixels and the encoding pixel. Specifically, we choose the pixel to the left of the encoding pixel, i.e., x_{i-1}^{c} , and subtract it from the support pixels and the encoding pixel. Since the encoding pixel and its left pixel tend to be highly correlated, their difference usually has low variance with a mean close to zero. Thus, the network generates more stable and accurate outputs, leading to better compression results. We call this modified prediction scheme residual prediction, which is formulated as
\begin{align*} [\hat {r}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(r_{sup,i}), \tag{5}\\ [\hat {r}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(r_{i}^{Y}, z_{i}^{Y}, r_{sup,i}), \tag{6}\\ [\hat {r}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(r_{i}^{Y}, r_{i}^{U}, z_{i}^{U}, r_{sup,i}),\tag{7}\end{align*}
where r_{i}^{c} denotes the encoding pixel of channel c with its left pixel subtracted, and r_{sup,i} denotes the support pixels with the left pixel subtracted. With the DIV2K dataset, residual training improves the performance by 2.0%.
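The residual formation itself is a one-line operation, sketched below with illustrative variable names; in a smooth neighborhood the residuals are small and centered near zero, which is exactly the property the networks benefit from.

```python
def to_residuals(encoding_pixel, support_pixels, left_pixel):
    """Subtract the pixel to the left of the encoding pixel from it and from every support pixel."""
    r = encoding_pixel - left_pixel
    r_sup = [s - left_pixel for s in support_pixels]
    return r, r_sup

# Example: a locally smooth region yields small, near-zero residuals.
r, r_sup = to_residuals(131, [128, 130, 129, 127], 130)
print(r, r_sup)        # 1 [-2, 0, -1, -3]
```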

E. Efficient Support Pixel

According to Eqs. 5 to 7, the support pixels of all channels {Y , U , V } are used as the input to each of {f_{Y} , f_{U} , f_{V} }. We note that using all of these support pixels is unnecessary because of the lowered correlation between the transformed color components. Specifically, the reversible color transform mentioned in Section III-B decorrelates the channels, and the Y channel usually carries more significant features than U and V . Hence, when predicting the Y channel, using information from the U and V support pixels is neither efficient nor effective. In this respect, we also test a network that uses fewer support pixels, which we call a network using efficient support pixels. To be precise, we use only the support pixels in Y to estimate the Y channel. For the U channel, unlike Y , we use the support pixels in both Y and U : despite the channel decorrelation, the Y channel still carries information worth using for the estimation of the other channels. Likewise, for the prediction of V , the support pixels in Y and V are utilized. In summary, we formulate the prediction scheme using efficient support pixels as
\begin{align*} [\hat {r}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(r_{sup,i}^{Y}), \tag{8}\\ [\hat {r}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(r_{i}^{Y}, z_{i}^{Y}, r_{sup,i}^{Y}, r_{sup,i}^{U}), \tag{9}\\ [\hat {r}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(r_{i}^{Y}, r_{i}^{U}, z_{i}^{U}, r_{sup,i}^{Y}, r_{sup,i}^{V}),\tag{10}\end{align*}
where r_{sup,i}^{c} denotes the residuals of the support pixels in channel c . Experiments show that using efficient support pixels improves the performance by 0.7%.

F. Loss

The loss function is designed to achieve two objectives: (1) to minimize the prediction error for accurate reconstruction of the encoding pixel, and (2) to find appropriate contexts that well represent the local activity.

The first objective can be achieved by minimizing the following loss, the absolute difference between the prediction and the ground truth of the encoding pixel (expressed in residual form):
\begin{equation*} L_{P}^{c} = \sum _{i}\vert r_{i}^{c} - \hat {r}_{i}^{c} \vert.\tag{11}\end{equation*}

Note that we employ the L_{1} loss instead of the L_{2} because most of the prediction errors tend to be small. Precisely, the L_{1} loss gives a relatively higher penalty to the majority of (small) errors than the L_{2} , which tends to make such losses too small. For the DIV2K dataset, training with the L_{1} loss gives 0.6% better performance than with the L_{2} .

The formulation of the second objective, i.e., the context loss, should make the contexts reflect the magnitude of the local activity. Motivated by the fact that prediction errors in regions of edges and textures tend to be large, whereas smooth areas produce small prediction errors, we model the contexts to be proportional to the prediction error. In other words, the network is trained to estimate the context as the magnitude of the prediction error through the following loss:
\begin{equation*} L_{C}^{c} = \sum _{i}\left\vert C_{i}^{c} - \vert r_{i}^{c} - \hat {r}_{i}^{c} \vert \right\vert.\tag{12}\end{equation*}

In summary, the overall objective function is the weighted sum of the two objectives:
\begin{equation*} L = \sum _{c=Y,U,V}\lambda _{c}\left(L_{P}^{c} + L_{C}^{c}\right),\tag{13}\end{equation*}
where \lambda _{c} is a hyperparameter that balances the contribution of each channel. Within each channel, equal weight is given to the reconstruction loss and the context loss.
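A compact PyTorch sketch of Eqs. 11 to 13 is given below, with the channel weights of Section IV-A; whether the prediction error is detached when computing the context loss is not specified, so the sketch keeps it attached.

```python
import torch

lambdas = {"Y": 3.0, "U": 1.0, "V": 1.0}          # channel weights from Sec. IV-A

def channel_loss(r, r_hat, C):
    err = torch.abs(r - r_hat)                    # |r - r_hat|
    L_P = err.sum()                               # Eq. (11)
    L_C = torch.abs(C - err).sum()                # Eq. (12): context tracks |prediction error|
    return L_P + L_C                              # equal weight on the two terms, as in Eq. (13)

def total_loss(per_channel):                      # per_channel: {"Y": (r, r_hat, C), ...}
    return sum(lambdas[c] * channel_loss(*per_channel[c]) for c in "YUV")   # Eq. (13)

dummy = {c: torch.randn(3, 16).unbind(0) for c in "YUV"}   # dummy (r, r_hat, C) per channel
print(total_loss(dummy).item())
```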

G. Progressive Training

Since the encoding proceeds sequentially in the order of Y , U , V , the performance on Y influences that on U , and the performance on V relies on both Y and U . Hence, when we optimize the networks for all of YUV together, we sometimes observe unstable convergence. Thus, we sequentially optimize the networks in the order of Y , U , V , and we refer to this modified training scheme as progressive training. Formally, the progressive training is carried out using the following losses:
\begin{align*} L^{Y} &= L_{P}^{Y} + L_{C}^{Y}, \tag{14}\\ L^{U} &= L_{P}^{U} + L_{C}^{U}, \tag{15}\\ L^{V} &= L_{P}^{V} + L_{C}^{V}.\tag{16}\end{align*}

We optimize our networks in the order of Eqs. 14, 15, and 16. Afterward, the networks are fine-tuned by optimizing Eq. 13.
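The schedule can be sketched as below, with dummy models, data, and step counts standing in for f_{Y} , f_{U} , f_{V} and the losses of Eqs. 11 to 16; only the ordering (Y, then U, then V, then joint fine-tuning) reflects the actual procedure.

```python
import torch
import torch.nn as nn

nets = {c: nn.Linear(12, 2) for c in "YUV"}            # stand-ins for f_Y, f_U, f_V
data = {c: (torch.randn(256, 12), torch.randn(256, 2)) for c in "YUV"}
mae = nn.L1Loss()
loss_c = lambda c: mae(nets[c](data[c][0]), data[c][1])   # stands for L^c = L_P^c + L_C^c

def run(params, loss_fn, steps):
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

for c in "YUV":                                         # Eqs. (14)-(16), optimized in order
    run(nets[c].parameters(), lambda c=c: loss_c(c), steps=100)

joint = lambda: sum(w * loss_c(c) for c, w in zip("YUV", (3.0, 1.0, 1.0)))   # Eq. (13)
run([p for n in nets.values() for p in n.parameters()], joint, steps=100)    # fine-tuning
```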

H. Binarization Through Adaptive Arithmetic Coding

To encode the prediction errors produced by the networks, we adopt the AAC as our entropy coder. The AAC learns the probability model of the prediction errors and binarizes them according to the learned model. As mentioned before, we use N AACs, where each AAC learns the statistics of a specific level of local activity. Precisely, each prediction error is forwarded to the corresponding AAC, where the correspondence is decided by the context. To map the continuous contexts to the N AACs, we quantize the contexts into N steps and denote the center of the n -th quantization step as Q_{n} . The context C_{i} is assigned to Q_{n} when q_{n-1} \leq C_{i} < q_{n} for n = 1,\ldots,N , with q_{0}=0 and q_{N}=\infty . We empirically found that setting N=24 yields the best performance. For the values of q_{n} , since most of the prediction errors are concentrated near zero, we make the quantized contexts dense near zero. Thus, we increase q_{n} in steps of 0.25 up to 1.5 and in steps of 0.5 afterward.
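The bin boundaries and the context-to-AAC mapping can be written down directly. The sketch below follows our reading of the rule above (steps of 0.25 up to 1.5, then steps of 0.5, with N = 24); the helper names are illustrative.

```python
import bisect

N = 24
q = [0.25 * n for n in range(1, 7)]              # q_1..q_6 = 0.25, 0.50, ..., 1.50
while len(q) < N - 1:
    q.append(q[-1] + 0.5)                        # q_7, q_8, ... in steps of 0.5
# q_N = infinity is implicit: any context >= q_{N-1} falls into the last bin.

def aac_index(context):
    """Return n in 1..N such that q_{n-1} <= context < q_n."""
    return bisect.bisect_right(q, context) + 1

print(q[:8], q[-1])                                        # [0.25, ..., 2.5] 10.0
print(aac_index(0.1), aac_index(1.49), aac_index(3.7), aac_index(50.0))   # 1 6 11 24
```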

I. Duplex Network

The pixel estimator generally shows good estimation performance in smooth areas and relatively poor performance in texture regions. In other words, the network behaves differently depending on the area, and thus it would be beneficial to have a different network for each type of region. A similar idea was also proposed in [18], where complicated regions are predicted by a CNN and smooth regions by conventional linear predictors. In this respect, we propose the duplex network illustrated in Fig. 3, where one network is specialized for smooth regions and the other for textured regions. For this scheme, we first need to discriminate whether the encoding pixel lies in a smooth or textured area. Precisely, a set of support pixels is considered a smooth patch if its pixel variation is small, and a texture patch otherwise. We use the mean absolute deviation (MAD) to discriminate smooth/texture areas, which is expressed as
\begin{equation*} MAD(x_{sup,i}^{c}) = \frac{1}{N}\sum _{j=1}^{N}\vert x_{sup,i,j}^{c} - \overline {x}_{sup,i}^{c}\vert,\tag{17}\end{equation*}
where x_{sup,i,j}^{c} denotes the j -th pixel in x_{sup,i}^{c} , and \overline {x}_{sup,i}^{c} denotes the mean of x_{sup,i}^{c} . For each set of support pixels, if the sum of the MADs over the channels is smaller than a threshold, it is determined to be a smooth patch, and a texture patch otherwise:
\begin{align*} x_{sup,i} = \begin{cases} \text {Smooth Patch}, &\text {if} \displaystyle \sum _{c=Y,U,V}{MAD(x_{sup,i}^{c})} \leq \alpha, \\ \text {Texture Patch}, &\text {otherwise}, \end{cases}\tag{18}\end{align*}
where \alpha denotes the threshold.
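A small runnable sketch of Eqs. 17 and 18 is given below, using \alpha = 50 as in Section IV-A; the dictionary layout of the support pixels is an assumption for illustration.

```python
import numpy as np

def mad(values):
    """Mean absolute deviation of one channel's support pixels, Eq. (17)."""
    values = np.asarray(values, dtype=np.float64)
    return np.abs(values - values.mean()).mean()

def patch_type(support_yuv, alpha=50.0):
    """Eq. (18): sum the per-channel MADs and compare with the threshold alpha."""
    total = sum(mad(support_yuv[c]) for c in "YUV")
    return "smooth" if total <= alpha else "texture"

support = {"Y": [120, 122, 119, 121], "U": [4, 5, 4, 4], "V": [-2, -1, -2, -2]}
print(patch_type(support))        # 'smooth' -> routed to the smooth network
```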

FIGURE 3. The signal flowgraph of the duplex network. The input support pixels are initially determined as a smooth patch if the MAD is below a threshold and as a texture patch otherwise. Afterward, smooth patches are fed to the smooth network, and texture patches are fed to the texture network. Hence, each network can concentrate on specific patches during training. Both networks follow the same architecture as in Fig. 1.

After determining the patch type, smooth patches are forwarded to the smooth network, and texture patches to the texture network. In this way, each network can focus on a specific patch type, which enhances the estimation accuracy. According to our experiments, the duplex network produces more accurate estimation results than the original single network in Fig. 1. Although the duplex network doubles the number of parameters, the overall complexity is still small because it is based on simple MLPs. Also, the duplex system does not increase the computation time much, because the time depends mostly on the number of patches, i.e., the number of pixels to be encoded. Although the discrimination process requires additional computation, it is minor compared to the overall time. In conclusion, by dividing one network into two specialized ones, we achieve better estimation performance at the cost of a marginal increase in computation time.

SECTION IV.

Experiments

A. Implementation Details

Our method consists of two networks employed for different types of regions, where each network consists of three MLPs, one for each color channel. They have the same internal setting, four hidden layers of 64 units. The size of the intermediate features of each network is also 64. For network training, we use Adam optimizer [33] with a learning rate of 0.001. The balancing parameters \lambda _{Y} , \lambda _{U} , \lambda _{V} are set as 3, 1, 1, respectively. Also, the threshold \alpha is set to 50 in all experiments. We set N=24 in all the experiments, where the threshold value q_{n} for each AAC is explained in Sec. III-H.

We train our model for 2,000 epochs: 500 epochs each for optimizing Y , U , V (Eqs. 14–16), and fine-tuning (Eq. 13) for the rest. Our network is trained on a GeForce GTX 1080 Ti for 2 hours and tested on an Intel Core i5-7600 3.50GHz CPU. The compared methods are tested on the same hardware, except for CBPNN [16], whose code is unavailable. The GPU time of CBPNN shown in Table 2 is the one reported in their paper, which was measured on an Nvidia Titan X GPU and an Intel Xeon E5-2620 v3 2.40GHz processor.

TABLE 1. Comparison of our method with other non-learning and learning-based codecs on 2K high-resolution datasets. Performance is measured in bits per pixel (bpp). The difference in percentage relative to our method is highlighted in green if ours outperforms and in red otherwise.
TABLE 2. Comparison of our method with other compression methods on the 4K UHD grayscale dataset.

B. Datasets and Results

We first validate our method on widely used classic images, which we denote as Classic, along with the 2K high-resolution datasets Flickr2K [34] and DIV2K [35]. The Classic dataset consists of 12 popular images such as Lena, Barbara, and Goldhill, which generally have a resolution of 512\times512 . The Flickr2K dataset consists of 2,650 test images, from which we randomly select 100 images for evaluation. DIV2K is also a high-resolution image dataset with diverse contents. Our network is trained with the DIV2K training dataset and tested on the above-stated datasets except Classic. Although the authors’ code of L3C [19] processes low-resolution images as a whole and shows comparable performance, it divides high-resolution images (2K or higher) into sub-images that are processed individually. This approach generally results in poorer performance since less information is available when processing each sub-image. Thus, we compare with L3C separately for high-resolution images by downscaling the images as in [19] so that each image can be processed as a whole. That is, we use the same experimental setting as L3C for a fair comparison. Specifically, the Flickr2K and DIV2K datasets are downscaled to 768 pixels on the longer side, and these versions are denoted as Flickr2K (downscaled) and DIV2K (downscaled). For the evaluation on the Classic dataset, we downscale DIV2K by half and train the network on it, which narrows the domain gap between the 2K resolution of DIV2K and the 512\times512 Classic dataset.

The comparisons are shown in Table 1, where we can see that our method generally outperforms the others on all test datasets. The proposed method outperforms BPG, PNG, and JPEG-LS by a large margin, especially on the Flickr2K and DIV2K datasets, achieving at least 40% better performance. It also outperforms JPEG2000 and WebP on all datasets, achieving up to a 12.6% gain. Our method typically shows better performance than FLIF and the state-of-the-art JPEG-XL. Compared to the learning-based codec L3C, the proposed method also performs better on all test datasets, by at least 10.0%.

We point out that the performance gap between ours and the other methods generally decreases on low-resolution datasets such as Classic, Flickr2K (downscaled), and DIV2K (downscaled), unlike on the high-resolution datasets such as the original Flickr2K and DIV2K. We believe that the gap between the training and test domains influences the performance: our network, trained on a 2K high-resolution dataset, shows higher performance when the test domain agrees with the training domain.

For comparison with CNN-based or flow-based auto-regressive models [12], [15], [16], [29], we perform two separate experiments. We could not compare with these methods on the above 2K color images because the authors’ code is not available in the case of CBPNN [16], and the other methods require impractical computation times for encoding HD images; specifically, [12], [15], [29] require estimated encoding times of 90 hours, 53 minutes, and 20 minutes, respectively, for 2K color images. Hence, for comparison with CBPNN [16], we test our code on the two datasets used in [16]: 4K grayscale images [36] and the EPFL Light Field dataset [37] available at [38]. For the comparison with the other auto-regressive models [12], [15], [29], we test our code on ImageNet64 [39], which has a very low resolution (64 \times 64 ).

Table 2 presents the comparisons on 4K UHD grayscale images. Note that L3C is excluded from the comparison since it does not work with high-resolution images. It can be seen that our method shows superior performance to the non-learning methods, whereas it yields lower performance than CBPNN. However, our method requires 108 seconds on a CPU, while CBPNN takes 720 seconds on a GPU. Table 3 exhibits the comparisons on the EPFL dataset, where we compare with FLIF and CBPNN. Since we were not able to reproduce the results reported in [16], we compare with CBPNN in terms of the Relative Compression (RC) to FLIF, which is computed as
\begin{equation*} RC_{MX} = \frac {bpp_{MX}}{bpp_{FLIF}},\tag{19}\end{equation*}
where bpp_{MX} denotes the bpp of method MX. CBPNN shows 9% better compression performance than FLIF, whereas our method shows a 1% improvement. Comparisons with other auto-regressive methods are listed in Table 4. As expected, CNN-based or flow-based auto-regressive models outperform the others by large margins while requiring long computation times. However, our method still performs better than the non-learning methods and is comparable to the practical-time CNN-based L3C.

TABLE 3. Comparison of our method with FLIF and CBPNN on the EPFL dataset. * denotes that we use the performance reported in [16]; otherwise, the performances are obtained from our experiments.
TABLE 4. Comparison of our method with other compression methods on ImageNet64 (64 \times 64 images).

Summarizing the comparisons, we find a trade-off between bpp and computation time, which is graphically shown in Fig. 4. The left figure plots computation time versus bpp for the Flickr2K (downscaled) results in Table 1, and the right one plots the results in Table 2. Since the compared times are measured on either CPU or GPU, we differentiate them by dots (CPU times) and stars (GPU times). It can be seen that the CNN-based auto-regressive models yield lower bpp at the cost of long GPU computation times, while the non-learning algorithms run very fast but produce high bpp. Our method comes in between these two approaches, with a computation time much closer to practicality than that of the auto-regressive models.

FIGURE 4. The trade-off between computation time and compression rate for various compression methods. (a) Illustration of the trade-off for the Flickr2K (downscaled) dataset. (b) Illustration for the 4K UHD grayscale dataset. Methods measured in CPU time are depicted as dots, whereas methods that can be run only on a GPU (and thus their GPU times) are denoted as stars.

C. Ablation Study

We conduct ablation experiments to investigate the contribution of each component of our method. We set the baseline model as a network that predicts pixel values and contexts without any of the proposed schemes, i.e., without the reversible color transform, channel-wise progressive prediction, residual prediction, efficient support pixels, progressive training, and duplex network. Table 5 shows how the compression performance and encoding/decoding times change as these elements are added one by one. It can be observed that each component contributes to the compression gain. Among them, the channel-wise progressive scheme shows the highest contribution, which implies that the transformed channels still have some correlation with each other.

TABLE 5. Ablation study of our method on the DIV2K dataset.

D. Pixel Prediction Analysis

This section presents the network’s prediction capability quantitatively and qualitatively compared to other scan order methods. We compare with the prediction methods in classic encoders such as LOCO-I [3] and CALIC [23].

In Fig. 5, we visualize the performance qualitatively by showing the magnitudes of the prediction errors in the Y channel for each method. It can be seen that the error magnitudes of our method are significantly smaller than those of LOCO-I and CALIC. We can also see that the difference is prominent in the edge and texture regions, implying that our method yields more accurate predictions in such areas.

FIGURE 5. Visualization of the magnitudes of prediction errors.

For quantitative evaluation, we measure the zero-order entropy of the prediction error images. The zero-order entropy is calculated as
\begin{equation*} H(X) = -\sum _{i=1}^{n} P(x_{i})\log P(x_{i}),\tag{20}\end{equation*}
where X denotes the discrete random variable for the prediction error image with possible outcomes x_{1},\ldots, x_{n} , which occur with probabilities P(x_{1}),\ldots, P(x_{n}) . Lower entropy indicates that the prediction errors have a smaller mean and variance, implying that accurate prediction has been made. In contrast, prediction errors with a large mean and variance result in high entropy. Table 6 shows the results for the Flickr2K and DIV2K datasets, where we can see that our method achieves the lowest entropy. Thus, we can conclude that our prediction method outperforms the other classical scan-order methods.
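The zero-order entropy of Eq. 20 is straightforward to compute from the empirical histogram of a prediction-error image, as sketched below with dummy values.

```python
import numpy as np

def zero_order_entropy(errors):
    """Zero-order entropy (bits per symbol) of the empirical error distribution, Eq. (20)."""
    errors = np.asarray(errors).ravel()
    _, counts = np.unique(errors, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

errors = np.array([0, 0, 0, 1, -1, 0, 2, 0])
print(zero_order_entropy(errors))                 # ~1.55 bits
```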

TABLE 6. Zero-order entropy measured for quantitative comparison of prediction errors.

E. Context Analysis

In this section, we validate whether the context model is well designed. One of our goals is to let the network generate contexts proportional to the magnitude of the local activity; in other words, regions with large activity should result in large contexts, and vice versa. We show an example of the contexts in Fig. 6b, where we can see that smooth regions have lower contexts than texture areas. Thus, we conclude that our context model is well designed.

FIGURE 6. An analysis of the predicted contexts. (a) Input image. (b) Visualization of the contexts obtained in the Y channel. Regions of edge and texture show high context values, whereas smooth areas produce lower values. Hence, the contexts are proportional to the magnitude of local activity, as expected. (c) Statistical properties of the prediction errors for the corresponding quantized contexts, shown for the quantized contexts Q_{1} , Q_{5} , and Q_{9} . The variance of the prediction error increases as the context increases, again implying that our contexts model the magnitude of local activity well.

Additionally, we quantitatively justify our context design. The learned probability model of prediction errors should have variance proportional to the contexts. That is, small contexts constitute a probability density with small variance, where prediction errors are concentrated near zero. On the other hand, large contexts form a probability density where prediction errors are less concentrated near zero. Fig. 6c shows that the learned probability models agree with our hypothesis, verifying that our model of context is well designed.

Finally, we conduct an experiment to demonstrate the effectiveness of our context model by applying the proposed entropy coding method to the predicted signal of JPEG-LS, which originally uses the Golomb-Rice coder [40] as its entropy coder. For a fair comparison, the prediction scheme of JPEG-LS is applied after the reversible color transform mentioned in Section III-B. In addition, we show that utilizing N quantized contexts, denoted as Multi Context, is more effective than using a single quantized context, denoted as Single Context. More specifically, we compare the combinations of two prediction methods (JPEG-LS and our MLP) and three context modeling methods (JPEG-LS, Single Context, and Multi Context), as shown in Table 7. We first observe that using the Single Context needs about 9% more bits than the Multi Context, implying that adopting N quantized contexts is effective. In addition, the difference due to the prediction is about 9%, obtained by comparing the 2nd and 4th rows (“JPEG-LS / Single Context” vs. “Ours / Single Context”). Comparing the 1st and 2nd rows (“JPEG-LS / JPEG-LS” vs. “JPEG-LS / Single Context”), we observe that the 1st requires about 20% more bits than the 2nd. This gap is due to the difference between the entropy coders, i.e., between Golomb-Rice coding and our context-adaptive arithmetic coding. In summary, JPEG-LS needs about 40% more bits than ours, where roughly 20 percentage points come from the use of arithmetic coding instead of Golomb-Rice coding, around 10 points from our prediction scheme, and the remaining 10 points from the N quantized contexts. It should also be noted that these performance gaps are not universal across datasets: as shown in the other tables, the gaps for other datasets can be smaller or larger than in these cases (DIV2K and Flickr2K), which means that the compression performance of the algorithms is often data-dependent.

TABLE 7. Results of JPEG-LS and our method on Flickr2K and DIV2K. Multi Context refers to the case where we use N quantized contexts for entropy coding, and Single Context to the case where only a single quantized context is used. Performance is measured in bpp.

F. Result Analysis

This section presents results on some individual images to analyze the algorithm’s performance depending on the input properties. We first analyze the compression performance on the DIV2K dataset in Fig. 7. The image index is sorted in the order of bpp, and we compare our method only with FLIF for clarity. We observe that images with a low index, i.e., low bpp, exhibit small performance gaps between ours and FLIF. For images that compress to roughly 8 bpp or more, our method consistently outperforms FLIF. In Fig. 8, we show the images with the largest and smallest performance gaps. As can be seen, images that contain a lot of texture regions show large performance gaps. In contrast, images with large smooth areas show smaller performance gaps, or FLIF even outperforms ours. The reason for this lower performance on smooth images is that our method does not fully utilize all the AACs in this case. Precisely, since the majority of prediction errors are very small when the smooth region is large, only the AACs for the lower prediction errors are used. In this scenario, our method becomes similar to the Single Context setting mentioned in Section IV-E.

FIGURE 7. Results on individual images of the DIV2K dataset for our method and FLIF.

FIGURE 8. Comparison of our method and FLIF on individual images from DIV2K. The results are reported in bpp. (a) to (d) are images where our method outperforms FLIF by large margins. (e) to (h) are images where the performance gaps are not significant.

In addition, we present the results on individual images of the Classic dataset in Table 8 for both color and gray images. For color images, although our method is outperformed by FLIF, it shows lower bit rates than the other methods. For gray images, our method delivers the best performance.

TABLE 8. Results on the individual images of the Classic dataset. The two sub-tables show the results for color and gray images, respectively.

SECTION V.

Conclusion

In this paper, we have proposed a simple MLP-based lossless image compression method. To achieve high compression performance with reasonable computation time, we process the compression in a raster scan order using simple MLPs. For each encoding pixel, our network predicts the pixel value and the context simultaneously, where the context is modeled based on the prediction error. To boost the estimation accuracy, we introduced a channel-wise progressive compression scheme, which exploits the correlation between color channels. Furthermore, the network is trained in a residual manner, in contrast to the direct prediction of previous works, obtaining stability and accuracy. The optimization is also done in a progressive manner, where the networks are optimized in the order of Y , U , V . Finally, we used duplex networks, where each network is specialized for a different type of pixel distribution.

Experiments show that our method obtains better results than the FLIF, which is the state-of-the-art non-learning codec, while requiring more computation time. Also, our method shows better or comparable performance compared to L3C, which is a practical learning-based method. We also demonstrated the effectiveness of our method in terms of pixel value prediction capability. Finally, we verified that our context design reflects the magnitude of the local activities well. Our codes and dataset will be made publicly available at https://github.com/myideaisgood/LCIC_duplex
