Introduction
The demand for high-quality images is ever-increasing, and hence the techniques for high-quality image acquisition, display, storage, and high-speed transmission of image data are developing rapidly. As a result, we now have ultra-high-definition displays and broadcasting services, and cameras with very high pixel counts. Consequently, the need for image and video compression is also ever-increasing.
Multimedia data are encoded by lossy or lossless compression methods, where the aim of lossy compression is to minimize distortion for a given bit budget. In recent image and video codecs, the quality degradation is hardly noticeable at high bit rates, and thus lossy compression is widely used for image communication and storage. However, for data such as medical, scientific, and artistic images, we may wish to preserve the original content without any degradation caused by compression. In this case, lossless compression methods are required.
The main elements of classical (non-learning) methods are transformation, prediction, quantization, and entropy coding. In lossy compression, transformation followed by quantization is the popular scheme since it achieves high energy compaction and sparse representation. In lossless compression, predictive coding is the more widely used approach. Examples of non-learning lossless codecs include BPG [1], PNG [2], JPEG-LS [3], JPEG2000 [4], LCIC [5], WebP [6], FLIF [7], and JPEG-XL [8].
Deep neural networks have achieved significant success in computer vision and signal processing, and have also provided promising solutions for lossy compression. Specifically, researchers have developed CNN-based compression methods [9]–[11] that extract features from the input image and encode them into a compressed bitstream. These methods can be regarded as instances of the transformation-and-quantization approach, and thus they are not well suited to lossless compression.
There have also been learning-based lossless compression methods that learn the probability model of given symbols (pixel values or residual signals) [12]–[15], and hybrid coding methods that replace the predictor of non-learning encoders with a deep network and improved entropy coders [16]–[18]. These methods perform much better than the best non-learning codec, FLIF [7], in terms of compressed bits per pixel (bpp). However, their computation times are much larger than that of FLIF, by at least $10^2$ times even with GPU computation, making them less practical. Hence, Mentzer et al. [19], [20] proposed practical CNN-based lossless codecs, but their performance is only on par with FLIF.
In this paper, to achieve accurate prediction at a reasonable computation time, we propose an efficient encoding method based on simple MLPs. For accurate prediction, we incorporate channel-wise progressive encoding, residual prediction, and a duplex network into our MLP-based framework. To achieve a practical computation time, we jointly estimate only the mean (pixel prediction) and variance (context) of the prediction-error model for each pixel. In addition, using MLPs reduces system complexity and power consumption compared to CNNs or RNNs.
To be precise, we propose three encoding schemes for achieving better compression performance. First, we proceed with the prediction in a channel-wise progressive manner to exploit the correlation between color channels in the case of color inputs. Instead of estimating the means and variances of the three channels (Y, U, V) all at once, we initially perform the estimation for the Y channel, and then proceed to the U and V channels, using the already-estimated channels as additional inputs.
In summary, the main contributions of this paper are as follows:
We propose a lossless image compression method, where we use MLPs for predicting pixel values and contexts simultaneously.
The proposed training method achieves stable and accurate predictions through channel-wise progressive learning, residual learning, and duplex networks.
The proposed encoder shows comparable or better performance than non-learning codecs and a recent CNN-based encoder [19], while requiring a practical computation time.
We published a preliminary version of this MLP-based lossless compression scheme at a conference [21]. This work differs from the conference version in that we (1) explain the algorithm more clearly and in detail, (2) introduce a duplex network scheme (Section III-I) that discriminates the input characteristics and uses two different MLPs depending on the input properties, (3) find more efficient and effective support pixels (Section III-E) that help the MLPs perform more accurate prediction, and (4) conduct ablation studies (Section IV-C) to show the effect of each idea in the proposed framework. As a result, we consistently save more bits than the conference version on several datasets, achieving a maximum bpp reduction of 11.4%.
Related Work
A. Non-Learning Codecs
JPEG-LS [3] is a lossless and near-lossless image compression method that achieves high compression rates at low complexity. It is based on LOCO-I, which predicts a pixel value using a causal neighborhood of three pixels. The conditional expectation of the prediction error is estimated by the sample mean within each context. The low complexity comes from the assumption that the prediction residuals follow a two-sided geometric distribution and from the use of Golomb-like codes.
JPEG2000 [4] provides a lossless image compression mode where a reversible wavelet transform is adopted so that the inversion is exact. Only integer coefficients are used in this process, implying that no quantization is required. In the coding procedure, all bit planes have to be encoded.
PNG [2] performs compression in two steps: pre-compression by filtering, and compression using DEFLATE [22]. The input image is first transformed by a prediction filter, where a single filter method is used for the entire image, and for each scan line a filter type is chosen so that the data can be compressed more efficiently. After this pre-compression, PNG applies DEFLATE, a lossless data compression algorithm combining LZ77 and Huffman coding.
WebP [6] was introduced as an alternative to the PNG format, supporting lossless compression and translucent images. While preserving the fundamental techniques used in PNG, such as Huffman coding and the color indexing transform, WebP introduced separate entropy codes for different color channels, 2D locality of backward reference distances, and a color cache of recently used colors.
CALIC [23] is a context-based, adaptive, lossless image codec. It exploits many modeling contexts to condition a nonlinear predictor and adapts the predictor to varying source statistics. By estimating only the expectation of prediction errors conditioned on different contexts, a large number of modeling contexts can be used at practical complexity. CALIC also employs the Gradient Adjusted Predictor (GAP) to cope with intensity changes near the predicted pixel.
FLIF [7] employs Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC), a variant of CABAC [24], as its entropy coder. A context in this method corresponds to a node of a decision tree, which is learned dynamically during encoding. FLIF shows the best performance among the non-learning methods.
JPEG-XL [8] is a next-generation image codec whose architecture is based on traditional block-transform coding with upgrades to each component. While JPEG is based on the fixed $8\times 8$ DCT, JPEG-XL adopts variable-size DCT blocks along with improved prediction and entropy coding.
B. Learning-Based Codecs
MLPs have long been used for sample prediction in time-series data. Recently, IPFCN [26] employed an MLP to improve the intra prediction in High Efficiency Video Coding (HEVC) [27]. There have also been several methods that use CNNs or RNNs for pixel prediction, which generally perform better than MLPs for image processing. These methods achieve outstanding performance by employing auto-regressive models that learn explicit distributions governed by the model structure [12]–[16], [28]. Specifically, PixelRNN [13] and PixelCNN [12] model the distribution of an image as a product of conditional distributions, each determined by all previously encoded pixels.
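Concretely, for an $n \times n$ image $\mathbf{x}$ with pixels $x_1, \ldots, x_{n^2}$ ordered in raster scan, these models use the standard chain-rule factorization (written here in the usual PixelRNN/PixelCNN notation):
\begin{equation*} p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}),\end{equation*}
so that each factor requires one forward pass of the network, which is why decoding must visit the pixels sequentially.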
Although CNN-based auto-regressive prediction methods provide accurate pixel predictions and thus high coding efficiency, they are difficult to use in practice because they must perform a CNN computation for every pixel. That is, they repeat quite a large number of CNN computations, for example, more than 10M times to compress a color UHD image. Hence, it usually takes more than an hour to compress a single HD image on a mid-range GPU. To tackle this problem of CNN-based auto-regressive coders, L3C [19] proposed a practical learning-based lossless image compression system, introducing a fully parallel hierarchical probabilistic model that jointly learns the predictor and feature extractor. In the case of [20], the difference between the input image and its lossy compression is modeled and encoded, so the coding efficiency is leveraged by the power of a classical lossy compression algorithm.
In this paper, we claim that a simple classical MLP is still useful for lossless image compression when trained to predict the pixel values and contexts jointly, along with residual and progressive learning schemes. The proposed method achieves comparable or better performance than the non-learning codec FLIF and the CNN-based L3C, which requires a practical GPU-computation time. Also, our encoder and decoder require reasonable CPU-computation times due to the lightweight MLPs we employ. Compared to CNN-based auto-regressive methods, which achieve the lowest bpp but very long GPU-computation times, our method's compression rate (bpp) lies between those of FLIF and the CNN-based auto-regressive methods.
Method
This section first summarizes several terms that appear frequently in this paper and in the general scan-order predictive coding literature. First, the encoding pixel is the pixel being encoded at the moment, using the information from the already-encoded pixels. The causal neighbors are the already-encoded pixels to the left of and above the encoding pixel. We predict the encoding pixel using the causal neighbors, and the error is termed the prediction error. The magnitudes of prediction errors are not evenly distributed over the image but tend to be large in texture and edge areas. In our method, the MLP generates two outputs from the causal neighbors: one is the prediction of the pixel value, and the other is the context. Regions with large local activity (due to textures, edges, etc.) tend to have large contexts, whereas regions with small local activity tend to have small contexts. The structure of the MLP producing these outputs and the corresponding loss functions are explained in the rest of this section.
A. Overview of the Proposed Method
The proposed method proceeds in the raster scan order, like common predictive lossless compression schemes. For color inputs, we transform a given RGB image into a YUV format through a reversible color transform. Then, for each encoding pixel, the MLPs estimate its value and context from the support pixels, and the prediction error is entropy-coded by an adaptive arithmetic coder (AAC) selected according to the quantized context.
The signal flowgraph of the proposed method. For each pixel to be encoded, its value and context are estimated in a channel-wise progressive way. The process of residual prediction is not shown in the figure for clarity. The estimated contexts are quantized and used to select the corresponding AACs. For brevity, we illustrate the binarization of the Y channel only.
B. Reversible Color Transform
RGB images generally have significant correlation between the color channels, and we can enhance the compression efficiency of color images by decorrelating the channels through a transform. For lossless compression, the color transform itself must be lossless, i.e., the inverse from YUV back to RGB must be exact in integer arithmetic. We adopt the reversible color transform proposed in [30], which closely approximates the conventional YUV transform used in most standard image/video compression.
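To make the requirement concrete, the following sketch shows a reversible integer color transform in the style of the JPEG2000 RCT; it is an illustrative stand-in, since the exact transform of [30] is not reproduced here:

```python
import numpy as np

def rgb_to_yuv_reversible(r, g, b):
    # Illustrative reversible transform (JPEG2000-style RCT);
    # the exact transform of [30] may differ.
    y = (r + 2 * g + b) >> 2   # floor((R + 2G + B) / 4)
    u = b - g                  # integer chroma differences
    v = r - g
    return y, u, v

def yuv_to_rgb_reversible(y, u, v):
    g = y - ((u + v) >> 2)     # recovers G exactly despite the floor
    b = u + g
    r = v + g
    return r, g, b

# Round-trip check on random 8-bit pixels (use a signed dtype for U, V).
rgb = np.random.randint(0, 256, (3, 1000), dtype=np.int32)
yuv = rgb_to_yuv_reversible(*rgb)
assert np.array_equal(np.stack(yuv_to_rgb_reversible(*yuv)), rgb)
```

The key point is that the floor division in the luminance is undone exactly by the inverse, so no information is lost even though the forward transform is not linear over the integers.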
C. Channel-Wise Progressive Prediction
For each encoding pixel, the MLP generates its prediction value and context conditioned on the support pixels. We define the support pixels as the pixels among the causal neighbors that are used as the input to the MLP. Since pixels far from the encoding pixel contribute little to the estimation, we employ the pixels within a short distance as the support pixels, where the "distance" is defined as in Fig. 2. We set the distance proportional to the image resolution; specifically, for low-resolution, 2K, and 4K UHD images, we set the distance to 1, 2, and 4, respectively.
Illustration of the selection of support pixels among the causal neighbors. The red pixel denotes the encoding pixel, and the green pixels denote the corresponding support pixels. Among the causal neighbors, pixels within the given distance are selected as support pixels.
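As an illustration, the following sketch collects causal neighbors within distance $d$ of the encoding pixel; the exact neighborhood shape follows Fig. 2, so the square window assumed here (and the omitted image-border handling) is a simplification:

```python
import numpy as np

def support_pixels(img, y, x, d):
    # Causal neighbors within "distance" d of the encoding pixel (y, x),
    # assuming a (2d+1) x (2d+1) window truncated at the scan position.
    sup = []
    for dy in range(-d, 1):
        for dx in range(-d, d + 1):
            if dy == 0 and dx >= 0:
                break  # reached the encoding pixel in raster-scan order
            sup.append(img[y + dy, x + dx])
    return np.array(sup)  # d = 1 yields 4 support pixels per channel
```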
Formally, we describe the prediction process as \begin{equation*} [\hat{x}_{i}^{c}, \; C_{i}^{c}] = f_{c}(x_{sup,i}),\tag{1}\end{equation*} where $\hat{x}_{i}^{c}$ and $C_{i}^{c}$ are the predicted value and context of pixel $i$ in channel $c$, $f_{c}$ is the MLP for channel $c$, and $x_{sup,i}$ denotes the support pixels.
We now describe the channel-wise progressive prediction in detail. We first predict the pixel value and the context of channel Y from the support pixels. The U channel is then predicted with the help of the already-decoded Y value and an intermediate feature $z_{i}^{Y}$, and the V channel likewise uses the decoded Y and U values: \begin{align*} [\hat{x}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(x_{sup,i}), \tag{2}\\ [\hat{x}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(x_{i}^{Y}, z_{i}^{Y}, x_{sup,i}), \tag{3}\\ [\hat{x}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(x_{i}^{Y}, x_{i}^{U}, z_{i}^{U}, x_{sup,i}).\tag{4}\end{align*}
D. Residual Prediction
From the experiments, we found that the prediction error of the above progressive scheme (Eqs. 2 to 4) is still too large to yield state-of-the-art compression performance. Hence, we further reduce the dynamic range of the prediction errors by operating on a residual signal, motivated by ResNet [32]. We subtract one of the pixel values from all the support pixels and the encoding pixel; specifically, we choose the pixel to the left of the encoding pixel, so that $r$ denotes the left-subtracted residual. The prediction of Eqs. 2 to 4 is then reformulated as \begin{align*} [\hat{r}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(r_{sup,i}), \tag{5}\\ [\hat{r}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(r_{i}^{Y}, z_{i}^{Y}, r_{sup,i}), \tag{6}\\ [\hat{r}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(r_{i}^{Y}, r_{i}^{U}, z_{i}^{U}, r_{sup,i}).\tag{7}\end{align*}
E. Efficient Support Pixel
According to Eqs. 5 to 7, the support pixels fed to each channel's network include all three channels of $r_{sup,i}$, which is partly redundant. We found it more efficient to feed each network only the support residuals of the Y channel together with those of its own channel: \begin{align*} [\hat{r}_{i}^{Y}, \; C_{i}^{Y}, \; z_{i}^{Y}] &= f_{Y}(r_{sup,i}^{Y}), \tag{8}\\ [\hat{r}_{i}^{U}, \; C_{i}^{U}, \; z_{i}^{U}] &= f_{U}(r_{i}^{Y}, z_{i}^{Y}, r_{sup,i}^{Y}, r_{sup,i}^{U}), \tag{9}\\ [\hat{r}_{i}^{V}, \; C_{i}^{V}] &= f_{V}(r_{i}^{Y}, r_{i}^{U}, z_{i}^{U}, r_{sup,i}^{Y}, r_{sup,i}^{V}),\tag{10}\end{align*} where $r_{sup,i}^{c}$ denotes the support residuals of channel $c$.
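A minimal PyTorch sketch of Eqs. 8 to 10 follows. The hidden width and depth match the implementation details in Section IV-A; the support count N (here, distance 1) and the exact output-head layout are our assumptions:

```python
import torch
import torch.nn as nn

N = 4    # support residuals per channel at distance 1 (assumed count)
F = 64   # size of the intermediate feature z (Section IV-A)

def mlp(in_dim, out_dim, hidden=64, depth=4):
    # Four hidden layers of 64 units, as in Section IV-A.
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

f_Y = mlp(N, 1 + 1 + F)              # Eq. (8):  -> (r_hat, C, z)
f_U = mlp(1 + F + 2 * N, 1 + 1 + F)  # Eq. (9):  -> (r_hat, C, z)
f_V = mlp(2 + F + 2 * N, 1 + 1)      # Eq. (10): -> (r_hat, C)

def predict(r_sup_Y, r_sup_U, r_sup_V, r_Y, r_U):
    # r_sup_*: (B, N) support residuals; r_Y, r_U: (B, 1) decoded residuals.
    out = f_Y(r_sup_Y)
    r_hat_Y, C_Y, z_Y = out[:, :1], out[:, 1:2], out[:, 2:]
    out = f_U(torch.cat([r_Y, z_Y, r_sup_Y, r_sup_U], dim=1))
    r_hat_U, C_U, z_U = out[:, :1], out[:, 1:2], out[:, 2:]
    out = f_V(torch.cat([r_Y, r_U, z_U, r_sup_Y, r_sup_V], dim=1))
    r_hat_V, C_V = out[:, :1], out[:, 1:2]
    return (r_hat_Y, C_Y), (r_hat_U, C_U), (r_hat_V, C_V)
```

Note that $f_U$ and $f_V$ receive the true (already decoded) residuals of the preceding channels, not their predictions, since the decoder has access to the same values.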
F. Loss
The loss function is designed to achieve two objectives: (1) minimize the prediction error for accurate reconstruction of the encoding pixel, and (2) find appropriate contexts that represent the local activity well.
The first objective is achieved by minimizing the following loss, which is the absolute difference between the true and predicted residuals: \begin{equation*} L_{P}^{c} = \sum_{i} \vert r_{i}^{c} - \hat{r}_{i}^{c} \vert.\tag{11}\end{equation*}
Note that we employ the $\ell_1$ norm for this loss.
The second objective, i.e., the context loss, should make the contexts reflect the magnitude of the local activity. Motivated by the fact that prediction errors tend to be large in edge and texture regions, whereas smooth areas produce small prediction errors, we train the context to follow the magnitude of the prediction error through the following loss: \begin{equation*} L_{C}^{c} = \sum_{i} \big\vert C_{i}^{c} - \vert r_{i}^{c} - \hat{r}_{i}^{c} \vert \big\vert.\tag{12}\end{equation*}
In summary, the overall objective is the weighted sum of the two losses over the color channels: \begin{equation*} L = \sum_{c=Y,U,V} \lambda_{c} \left(L_{P}^{c} + L_{C}^{c}\right),\tag{13}\end{equation*} where $\lambda_{c}$ balances the contribution of each channel.
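For concreteness, a sketch of Eqs. 11 to 13 in PyTorch follows; whether the error magnitude is detached from the graph in the context loss, and the values of $\lambda_c$, are our assumptions:

```python
import torch

def channel_loss(r, r_hat, C):
    # Eq. (11): L1 loss between true and predicted residuals.
    err = (r - r_hat).abs()
    L_P = err.sum()
    # Eq. (12): the context is regressed onto the error magnitude.
    # Detaching err is an assumption: it keeps the context loss from
    # also pushing the predictor toward larger errors.
    L_C = (C - err.detach()).abs().sum()
    return L_P + L_C

def total_loss(preds, targets, lambdas={"Y": 1.0, "U": 1.0, "V": 1.0}):
    # Eq. (13): weighted sum over channels (lambda values are placeholders).
    return sum(lambdas[c] * channel_loss(targets[c], *preds[c])
               for c in ("Y", "U", "V"))
```

where `preds[c]` holds $(\hat{r}^{c}, C^{c})$ and `targets[c]` the true residuals of channel $c$.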
G. Progressive Training
Since the encoding proceeds sequentially in the order of Y, U, and V, we first train the networks progressively by optimizing the per-channel losses \begin{align*} L^{Y} &= L_{P}^{Y} + L_{C}^{Y}, \tag{14}\\ L^{U} &= L_{P}^{U} + L_{C}^{U}, \tag{15}\\ L^{V} &= L_{P}^{V} + L_{C}^{V}.\tag{16}\end{align*}
We optimize our network in the order of Eqs. 14, 15, and 16. Afterward, the network is fine-tuned by optimizing Eq. 13, as sketched below.
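The following sketch shows this schedule with the epoch split described in Section IV-A; the stage losses, the data loader, and the 500-epoch split per stage are illustrative assumptions:

```python
import torch

# f_Y, f_U, f_V as defined in the sketch of Section III-E.
opt = torch.optim.Adam(list(f_Y.parameters()) + list(f_U.parameters())
                       + list(f_V.parameters()), lr=1e-3)

def run_stage(stage_loss, epochs, loader):
    # One progressive stage: optimize a single loss for a fixed budget.
    for _ in range(epochs):
        for batch in loader:          # batches of support/target residuals
            opt.zero_grad()
            stage_loss(batch).backward()
            opt.step()

# Eq. 14, 15, 16 in order, then joint fine-tuning with Eq. 13.
# loss_Y, loss_U, loss_V, loss_joint, and train_loader are hypothetical
# closures around channel_loss / total_loss from Section III-F.
for stage_loss in (loss_Y, loss_U, loss_V, loss_joint):
    run_stage(stage_loss, 500, train_loader)
```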
H. Binarization Through Adaptive Arithmetic Coding
For encoding the prediction errors passed from the networks, we adopt the AAC as our entropy coder. The AAC learns the probability model of the prediction errors and binarizes them according to the learned model. As mentioned before, we quantize the estimated context and use it to select the corresponding AAC, so that prediction errors with similar statistics share a probability model.
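The selection logic can be sketched as follows; the number of context bins and the bin boundaries are assumptions (in practice they would be set from training statistics):

```python
import numpy as np

K = 8  # number of quantized context bins, i.e., of AACs (assumed)
edges = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0])  # assumed bounds

def select_coder(context, coders):
    # Quantize the estimated context and pick the matching AAC, so that
    # errors with similar local activity share an adaptive model.
    return coders[int(np.searchsorted(edges, context))]
```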
I. Duplex Network
The pixel estimator generally shows good estimation performance in smooth areas and relatively poor performance in texture regions. In other words, the network behaves differently depending on the type of area, and thus it would be beneficial to have a separate network for each type of region. A similar idea was proposed in [18], where complicated regions are predicted by a CNN and smooth regions by conventional linear predictors. In this respect, we propose the duplex network illustrated in Fig. 3, where one network is specialized for smooth regions and the other for textured regions. For this scheme, we first need to discriminate whether the encoding pixel lies in a smooth or textured area. Precisely, a set of support pixels is classified as a smooth patch if its pixel variation is small, and as a texture patch otherwise. We use the mean absolute deviation (MAD) to discriminate smooth and texture areas: \begin{equation*} MAD(x_{sup,i}^{c}) = \frac{1}{N}\sum_{j=1}^{N} \vert x_{sup,i,j}^{c} - \overline{x}_{sup,i}^{c} \vert,\tag{17}\end{equation*} where $N$ is the number of support pixels and $\overline{x}_{sup,i}^{c}$ is their mean.
\begin{align*} x_{sup,i} = \begin{cases} \text{Smooth Patch}, & \text{if } \displaystyle\sum_{c=Y,U,V} MAD(x_{sup,i}^{c}) \leq \alpha, \\ \text{Texture Patch}, & \text{otherwise}, \end{cases}\tag{18}\end{align*} where $\alpha$ is a predefined threshold.
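A sketch of Eqs. 17 and 18 (the threshold value `alpha` is a placeholder):

```python
import numpy as np

def mad(x_sup):
    # Eq. (17): mean absolute deviation of the support pixels.
    return np.abs(x_sup - x_sup.mean()).mean()

def patch_type(x_sup_yuv, alpha):
    # Eq. (18): sum the per-channel MADs and threshold with alpha.
    total = sum(mad(x_sup_yuv[c]) for c in ("Y", "U", "V"))
    return "smooth" if total <= alpha else "texture"
```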
The signal flowgraph of the duplex network. The input support-pixel set is classified as a smooth patch if its MAD is below a threshold and as a texture patch otherwise. Smooth patches are then fed to the smooth network, and texture patches to the texture network, so each network can concentrate on a specific patch type during training. Both networks follow the same architecture as in Fig. 1.
After determining the patch type, smooth patches are forwarded to the smooth network, and texture patches to the texture network. In this way, each network can focus on a specific patch type, which enhances the estimation accuracy. According to our experiments, the duplex network produces more accurate estimates than the original network in Fig. 1. Although the duplex network doubles the number of parameters, the overall complexity remains small because it is based on simple MLPs. Also, the duplex scheme does not increase the computation time much, because the time depends mostly on the number of patches, i.e., the number of pixels to be encoded. The discrimination process requires some additional computation, but it is minor compared to the overall time. In conclusion, by dividing one network into two specialized ones, we achieve better estimation performance at the cost of a marginal increase in computation time.
Experiments
A. Implementation Details
Our method consists of two networks for the two types of regions, where each network consists of three MLPs, one for each color channel. They share the same internal configuration: four hidden layers of 64 units. The size of the intermediate feature of each network is also 64. For network training, we use the Adam optimizer [33] with a learning rate of 0.001. The balancing parameters
We train our model for 2,000 epochs: 500 epochs each for optimizing Eqs. 14, 15, and 16, and the remaining 500 epochs for fine-tuning with Eq. 13.
B. Datasets and Results
We first validate our method on widely used classic images, which we denote as Classic, along with the 2K high-resolution datasets Flickr2K [34] and DIV2K [35]. The Classic dataset consists of 12 popular images such as Lena, Barbara, and Goldhill, which generally have a resolution of $512\times512$.
The comparisons are shown in Table 1, where we can see that our method generally outperforms the others on all test datasets. It outperforms BPG, PNG, and JPEG-LS by a large margin, especially on the Flickr2K and DIV2K datasets, achieving at least 40% better compression. It also outperforms JPEG2000 and WebP on all datasets, with gains of up to 12.6%. Our method typically performs better than FLIF and the state-of-the-art JPEG-XL. Compared to the learning-based codec L3C, the proposed method performs better on all test datasets by at least 10.0%.
We point out that the performance gap between ours and the other methods generally decreases on low-resolution datasets such as Classic, Flickr2K (downscaled), and DIV2K (downscaled), unlike on the high-resolution original Flickr2K and DIV2K. We believe the domain gap between the training and test data influences the performance; specifically, our network, trained on 2K high-resolution images, performs better when the test domain matches.
For comparison with CNN-based or flow-based auto-regressive models [12], [15], [16], [29], we perform two separate experiments. We could not compare with these methods on the above 2K color images because the authors' code is not available in the case of CBPNN [16], and the other methods require impractical computation times for encoding HD images: [12], [15], [29] have estimated encoding times of 90 hours, 53 minutes, and 20 minutes, respectively, for a 2K color image. Hence, for comparison with CBPNN [16], we test our code on the two datasets used in [16]: 4K grayscale images [36] and the EPFL Light Field dataset [37] available at [38]. For comparison with the other auto-regressive models [12], [15], [29], we test our code on ImageNet64 [39], which has a very low resolution ($64\times64$).
Table 2 presents the comparisons on 4K UHD grayscale images. Note that L3C is excluded from this comparison since it does not work with high-resolution images. It can be seen that our method shows superior performance to the non-learning methods, whereas it yields lower performance than CBPNN. However, our method requires 108 seconds on a CPU, while CBPNN takes 720 seconds on a GPU. Table 3 shows the comparisons on the EPFL dataset, where we compare with FLIF and CBPNN. Since we were not able to reproduce the results reported in [16], we compare with CBPNN in terms of the Relative Compression (RC) with respect to FLIF, computed for a method $MX$ as \begin{equation*} RC_{MX} = \frac{bpp_{MX}}{bpp_{FLIF}}.\tag{19}\end{equation*}
Summarizing the comparisons, we find a trade-off between bpp and computation time, shown graphically in Fig. 4. The left plot shows computation time versus bpp for Flickr2K (downscaled) in Table 1, and the right plot shows the results in Table 2. Since the compared times are measured on CPU or GPU, we differentiate them by dots (CPU times) and stars (GPU times). The CNN-based auto-regressive models yield lower bpp at much longer GPU computation times, while non-learning algorithms run very fast but produce higher bpp. Our method comes in between these two approaches, with a computation time much closer to practicality than the other auto-regressive models.
The trade-off between computation time and compression rate for various compression methods. (a) Illustration of the trade-off for the Flickr2K (downscaled) dataset. (b) Illustration for the 4K UHD grayscale dataset. Methods measured in CPU time are depicted as dots, whereas methods that can only be run on a GPU (and thus their GPU times) are denoted as stars.
C. Ablation Study
We conduct ablation experiments to investigate the contribution of each component of our method. The baseline model is a network that predicts pixel values and contexts without any of the proposed schemes, i.e., without the reversible color transform, channel-wise progressive prediction, residual prediction, efficient support pixels, progressive training, and duplex network. Table 5 shows how the encoding performance and encoding/decoding times change as these elements are added one by one. Each component contributes to the compression gain; among them, the channel-wise progressive scheme contributes the most, which implies that the transformed channels still retain some correlation with each other.
D. Pixel Prediction Analysis
This section evaluates the network's prediction capability quantitatively and qualitatively against other scan-order methods. We compare with the prediction methods used in classic encoders, namely LOCO-I [3] and CALIC [23].
In Fig. 5, we visualize the performance qualitatively by showing the magnitudes of prediction errors in
For the quantitative performance, we measure the zero-order entropy of the prediction-error images, calculated as \begin{equation*} H(X) = -\sum_{i=1}^{n} P(x_{i})\log P(x_{i}),\tag{20}\end{equation*} where $P(x_i)$ is the empirical probability of symbol $x_i$.
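A direct way to compute Eq. 20 from a prediction-error image is sketched below; using the base-2 logarithm yields bits per symbol:

```python
import numpy as np

def zero_order_entropy(errors):
    # Eq. (20): empirical entropy of the prediction-error symbols.
    _, counts = np.unique(errors, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())  # bits per symbol
```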
E. Context Analysis
In this section, we validate whether the context model is well designed. One of our goals is for the network to generate contexts proportional to the magnitude of the local activity; in other words, regions with large activity should yield large contexts, and vice versa. We show an example of the contexts in Fig. 6b, where smooth regions indeed have lower contexts than texture areas. We thus conclude that our context model behaves as designed.
An analysis of the predicted contexts. (a) Input image. (b) Visualization of contexts obtained in the Y channel. (c) Learned probability models of the prediction errors for different contexts.
Additionally, we justify our context design quantitatively. The learned probability model of the prediction errors should have a variance proportional to the context: small contexts should correspond to a probability density with small variance, where prediction errors concentrate near zero, while large contexts should form a density where the errors are less concentrated near zero. Fig. 6c shows that the learned probability models agree with this hypothesis, verifying that our context model is well designed.
Finally, we demonstrate the effectiveness of our context model by applying the proposed entropy coding method to the prediction signal of JPEG-LS, which originally uses the Golomb-Rice coder [40] as its entropy coder. For a fair comparison, the prediction scheme of JPEG-LS is applied after the reversible color transform of Section III-B. In addition, we show that utilizing multiple context-dependent AACs improves the coding efficiency compared to using a single context.
F. Result Analysis
This section presents results on individual images to analyze the algorithm's performance depending on the input properties. We first analyze the compression performance on the DIV2K dataset in Fig. 7, where the image indices are sorted in order of bpp and, for clarity, we compare our method only with FLIF. We observe that images with a low index, i.e., low bpp, exhibit small performance gaps between ours and FLIF. For images that compress to roughly 8 bpp or more, our method consistently outperforms FLIF. In Fig. 8, we show the images with the largest and smallest performance gaps. Images that contain many texture regions show large performance gaps, whereas images with large smooth areas show smaller gaps, or FLIF even outperforms ours. The reason for this lower performance on smooth images is that our method does not fully utilize all the AACs in this case: since the majority of prediction errors are very small when the smooth region is large, only the AACs for small prediction errors are used. In this scenario, our method degenerates toward the Single Context setting mentioned in Section IV-E.
Comparison of our method and FLIF for individual images from DIV2K. The results are reported in bpp. (a) to (d) are images where our method outperforms FLIF by large margins. (e) to (h) are images where the performance gaps are not significant.
In addition, we present results on the individual images of the Classic dataset in Table 8 for both color and gray images. For the color images, although our method is outperformed by FLIF, it shows lower bit rates than the other methods. For the gray images, our method delivers the best performance.
Conclusion
In this paper, we have proposed a simple MLP-based lossless image compression method. To achieve high compression performance at a reasonable computation time, we perform the compression in raster scan order using simple MLPs. For each encoding pixel, our network predicts the pixel value and the context simultaneously, where the context is modeled on the prediction-error magnitude. To boost the estimation accuracy, we introduced a channel-wise progressive compression scheme that exploits the correlation between color channels. Furthermore, the network is trained in a residual manner, in contrast to the direct prediction of previous works, gaining stability and accuracy. The optimization also proceeds progressively: the networks are optimized in the order of the Y, U, and V channels and then fine-tuned jointly.
Experiments show that our method obtains better results than FLIF, the state-of-the-art non-learning codec, while requiring more computation time. Our method also shows better or comparable performance than L3C, a practical learning-based method. We further demonstrated the effectiveness of our method in terms of pixel-value prediction capability, and verified that our context design reflects the magnitude of the local activities well. Our code and dataset will be made publicly available at