I. Introduction
Single-image super-resolution is the process of obtaining a high-resolution image from a given low-resolution image. It is an ill-posed problem because it has to estimate image details under a lack of spatial information. Many researchers have proposed various approaches that can generate upscaled images having better quality than those obtained using simple interpolation methods such as nearest-neighbor, bilinear, and bicubic upscaling.
In recent times, the emergence of deep learning techniques has impacted the super-resolution field [1]. Dong et al. [2] proposed the super-resolution convolutional neural network (SRCNN) model, which shows considerably improved performance compared to previous approaches. Lim et al. [3] proposed the enhanced deep super-resolution (EDSR) model, which employs residual connections and various optimization techniques.
Many recent deep learning-based super-resolution methods tend to increase the number of layers to obtain better upscaled images. However, this dramatically increases the number of model parameters involved. For instance, the number of parameters in the EDSR model is approximately 43M, which is at least 400 times larger than that of the SRCNN model. To deal with this, recursive approaches that use some parameters repeatedly, such as deeply-recursive convolutional network (DRCN) [4] and deep recursive residual network (DRRN) [5], have been proposed.
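The saving obtained from parameter sharing can be made concrete with a small count. The sketch below is illustrative only: the helper names, the 64-channel width, the 3×3 kernels, and the depth of 16 are assumptions chosen for the example, not the actual configurations of EDSR, DRCN, or DRRN.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of learnable parameters in one 2-D convolutional layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def stacked_params(num_layers, channels, kernel=3):
    """Non-recursive design: every layer owns its own weights."""
    return num_layers * conv_params(channels, channels, kernel)


def recursive_params(num_recursions, channels, kernel=3):
    """Recursive design: one layer's weights are reused at every recursion,
    so the count does not depend on num_recursions."""
    return conv_params(channels, channels, kernel)


if __name__ == "__main__":
    channels, depth = 64, 16  # hypothetical values for illustration
    print("stacked:  ", stacked_params(depth, channels))
    print("recursive:", recursive_params(depth, channels))
```

Under these assumptions the stacked design needs exactly `depth` times as many parameters as the recursive one for the same number of convolution steps, which is the efficiency that recursive methods trade against the fixed-operation constraint.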
Despite efficiency in terms of model size, existing recursive super-resolution methods have a limitation compared to non-recursive methods. In the non-recursive methods, stacked convolutional layers gradually reconstruct image details. Through training, each of the layers learns a proper distinct operation to be performed upon the features at the stage where the layer is placed. By contrast, the recursive methods are based on parameter sharing, which means that each recursion implements the same operation. Thus, these methods rely on repeated application of the same operation, which is a significant constraint that can limit the performance of reconstruction of image details.
To overcome this limitation, we propose a novel parameter-efficient recursive super-resolution method using the Volatile-nonvolatile Memory Network (VMNet). The recursive portion of our model consists of two components: volatile memory and nonvolatile memory. The volatile memory component contains information that is useful for gradually enhancing the intermediate features but is not passed to the post-recursive upscaling part; it is called volatile in the sense that the information is retained only during the recursive operation. The nonvolatile memory component contains the nonvolatile features, i.e., the ones that are passed to and used in the upscaling part to obtain the final output image. Thanks to the interplay of the two distinct memory components through the recursive operation, our method overcomes the limitation mentioned above. As a result, it is superior to the existing recursive methods in terms of both parameter efficiency and image quality. As shown in Figure 1, our method achieves improved image quality while significantly reducing model complexity. In addition, our VMNet model can generate super-resolved images in a progressive manner, which is useful for real-world applications such as progressive image loading.
Figure 1. Number of parameters and peak signal-to-noise ratio (PSNR) values of the state-of-the-art and proposed methods for an upscaling factor of 4 on the Urban100 dataset [6].
The main contributions of this work are as follows:
We propose an efficient recursive deep learning model, VMNet, for image super-resolution. The key components of our VMNet are volatile and nonvolatile memory components for effective and stable recursive operations.
We demonstrate that the two memory components adopt distinct roles to perform stable progressive super-resolution.
We show that the proposed method yields improved image quality with reduced complexity in comparison to the existing state-of-the-art methods.
The rest of the paper is organized as follows. First, we briefly survey the related work in Section II. Then, the overall structure of the proposed method is explained in Section III and a comparison with other recursive methods is presented in Section IV. We describe several experiments for in-depth analysis of our method in Section V, including examining the effectiveness of the newly introduced recursive structure and a comparison with the other state-of-the-art methods. Finally, we conclude our work in Section VI.
II. Related Work
Recent trends of super-resolution have shifted from a feature extraction-based approach towards a deep learning-based approach. Dong et al. [2] first employed deep learning techniques for super-resolution by developing SRCNN, which enhances the interpolated image via three convolutional layers. Kim et al. [7] proposed very deep super-resolution (VDSR), which stacks 20 convolutional layers to improve the performance. Lim et al. [3] suggested the EDSR model, which employs more than 64 convolutional layers.
We observe the following three common characteristics in the aforementioned works. First, increasing the spatial resolution at a later stage of the network reduces the computational complexity compared with upscaling at the initial stage [8]–[10]. Second, employing multiple residual connections is beneficial for obtaining better upscaled images [5], [11]. Third, obtaining multiple upscaled images from the same super-resolution model and combining them into one provides better quality than acquiring a single image directly [3], [4], [12]. Along with the newly introduced volatile-nonvolatile memory-based architecture, our proposed method is built taking this empirical knowledge into consideration.
While the aforementioned methods share the basic empirical rule of deep learning, i.e., deeper and larger models can achieve better performance, parameter efficiency is also a crucial factor in various applications [1], [13]. Therefore, super-resolution methods with shared model parameters have been proposed. DRCN, introduced by Kim et al. [4], recursively applies the same feature extraction layer 16 times and demonstrated the effectiveness of parameter sharing. Tai et al. [5] proposed DRRN, which employs a residual network (ResNet) [14] with shared model parameters. They also proposed the memory network (MemNet) model [12], which contains groups of recursive parts called “memory blocks” with skip connections across them. As mentioned in the introduction, however, these methods have a constraint: they perform fixed operations repeatedly in their recursive parts, which is discussed in detail in Section IV.
Some researchers proposed super-resolution methods that do not rely on shared parameters but have fewer model parameters. For example, Lai et al. [15] introduced the Laplacian pyramid super-resolution network (LapSRN) method, which progressively upscales the input image by a factor of 2. Hui et al. [8] proposed the information distillation network (IDN) method, which employs long and short feature extraction paths to maximize the amount of extracted information from a given low-resolution image. Ahn et al. [9] developed the cascading residual network (CARN) model, which employs cascading residual blocks. Li et al. [16] proposed the super-resolution feedback network (SRFBN) model, which processes the intermediate features with a feedback mechanism. Along with the recursive super-resolution methods, the performance of these methods is also compared with that of our proposed method in Section V-E.
III. Volatile-Nonvolatile Memory Network
Figure 2 shows the overall structure of the proposed super-resolution model. The objective of the super-resolution task is to obtain an image
Algorithm: Processing an image via a trained VMNet model for an upscaling factor of 4. Input: image. Output: upscaled image. Steps: load the model parameters, run the loop, and return the result.
A. Initial Feature Extractor
The VMNet model takes a low-resolution input image \begin{equation*} {H}_{0} = {W}_{I} \ast {X} + {b}_{I},\tag{1}\end{equation*}
B. Volatile-Nonvolatile Memory Block
Starting from the initial features
A VMB consists of three convolutional layers and one rectified linear unit (ReLU) activation. To utilize the nonvolatile feature matrix and the volatile memory matrix simultaneously, the two matrices are concatenated along the last dimension before performing the first convolutional operation. The concatenated matrix is processed via two convolutional layers with one ReLU activation and then split into two output matrices having the same sizes as the input matrices. The same concatenation and splitting procedures are also applied before and after employing the last convolutional layer. In addition, two residual connections are adopted for improved performance [7], [11], [17]. To sum up, the VMB takes
C. Upscaling
Finally, the VMNet model upscales the processed nonvolatile feature matrix
D. Loss Function
We build our loss function based on L1 loss, which is widely used in various state-of-the-art super-resolution methods because of its superiority over other loss functions [1]. To enable our model to generate intermediate upscaled images, which have progressively improved quality as shown in Figure 6, we design the loss function for training our model as \begin{equation*} L \big (\widehat {Y}, Y \big) = \frac {1}{w' \times h'} \sum _{x=1}^{w'} \sum _{y=1}^{h'} \left |{ \widehat {Y_{L}}(x, y) - Y(x, y) }\right |.\tag{2}\end{equation*}
\begin{equation*} \widehat {Y_{L}} = \frac {\sum _{t=1}^{R}{2^{(t-1)} \widehat {Y_{t}} }}{\sum _{t=1}^{R}{2^{(t-1)} }}.\tag{3}\end{equation*}
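Equations (2) and (3) can be sketched in a few lines of code. The sketch assumes that each \widehat{Y_t} is the intermediate upscaled image obtained at the t-th recursion and that R is the number of recursions, as the equations suggest; images are represented as plain nested lists of pixel values, and the function names are hypothetical rather than taken from the authors' implementation.

```python
def combine_outputs(outputs):
    """Weighted average of R intermediate images, as in Eq. (3).

    outputs[t-1] is the image from the t-th recursion; its weight is 2**(t-1),
    so later (more refined) outputs contribute more.
    """
    weights = [2 ** t for t in range(len(outputs))]  # 1, 2, 4, ...
    total = sum(weights)
    height, width = len(outputs[0]), len(outputs[0][0])
    return [
        [sum(w * img[y][x] for w, img in zip(weights, outputs)) / total
         for x in range(width)]
        for y in range(height)
    ]


def l1_loss(pred, target):
    """Mean absolute pixel difference, as in Eq. (2)."""
    height, width = len(target), len(target[0])
    total = sum(abs(pred[y][x] - target[y][x])
                for y in range(height) for x in range(width))
    return total / (width * height)


if __name__ == "__main__":
    # Three hypothetical 1x2 intermediate outputs approaching the target.
    intermediates = [[[0.0, 0.0]], [[0.5, 0.5]], [[1.0, 1.0]]]
    target = [[1.0, 1.0]]
    combined = combine_outputs(intermediates)
    print(combined, l1_loss(combined, target))
```

Because the weights double with each recursion, the last output alone accounts for more than half of the combined image, while the earlier outputs still receive a training signal, which is consistent with the goal of producing usable intermediate images.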
Number of intermediate channels (
Number of parameters and PSNR values of the
Nonvolatile intermediate features
IV. Comparison to Other Recursive Methods
As mentioned in Section II, there exist a few super-resolution models that employ recursive operations, including DRCN [4], DRRN [5], MemNet [12], and SRFBN [16].
DRCN is one of the simplest recursive super-resolution methods, employing only a single convolutional layer for each recursive operation. Thus, it is largely limited by the repeated application of a simple fixed operation. DRRN has an improved recursive structure, which consists of two convolutional layers with a residual connection. Nevertheless, its computation is still constrained by the same limitation as DRCN. MemNet is based on multiple stacked recursive units, which have different weights and thus act as different operations. It can be considered a compromise between the recursive and non-recursive approaches, and thus does not fully exploit the advantage of the recursive approach in terms of efficiency. Furthermore, a fixed operation is performed during recursion in each recursive unit, and therefore the aforementioned limitation of the other recursive approaches still exists. SRFBN employs the original pre-processed input and the previous output of its recursive portion during the recursive operations. Hence, it still relies on a fixed operation and has the same limitation as the other recursive approaches.
In our model, employing a separate volatile memory component has important implications. First, from the standpoint of the nonvolatile feature matrix \begin{equation*} {H}_{t}=f({H}_{t-1}, {V}_{t-1}) = {g}_{t-1}({H}_{t-1})\tag{4}\end{equation*}
\begin{equation*} H_{t+1}=f(H_{t},V_{t}) = g_{t}(H_{t}).\tag{5}\end{equation*}
\begin{equation*} f({H}_{t}, {V}_{t}) = {g}_{t}({H}_{t}) \neq {g}_{t-1}({H}_{t}) = f({H}_{t}, {V}_{t-1}).\tag{6}\end{equation*}
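The idea behind Eqs. (4)–(6) can be illustrated with a scalar toy model. The sketch below is not the VMB: the function `toy_step`, its mixing weights, and the scalar states are hypothetical stand-ins chosen only to show how a shared operation with fixed weights acts as a different effective operation on the nonvolatile state whenever the volatile state has changed.

```python
def toy_step(h, v):
    """Hypothetical shared recursive step with fixed weights.

    Takes the nonvolatile state h and volatile state v, and returns the
    updated pair. The same function is applied at every recursion.
    """
    mixed = max(0.0, 0.5 * h + 0.5 * v)  # ReLU on a fixed linear mix
    return h + 0.1 * mixed, v + 0.2 * mixed  # residual updates


def volatile_trajectory(h0, v0, num_recursions):
    """Run the recursion and record the volatile state V_t at each step."""
    h, v = h0, v0
    volatile_states = [v]
    for _ in range(num_recursions):
        h, v = toy_step(h, v)
        volatile_states.append(v)
    return volatile_states


def effective_outputs(probe_h, volatile_states):
    """Evaluate g_t(probe_h) = f(probe_h, V_t) for each recorded V_t."""
    return [toy_step(probe_h, v_t)[0] for v_t in volatile_states]


if __name__ == "__main__":
    states = volatile_trajectory(h0=1.0, v0=0.0, num_recursions=3)
    # Same input, same weights, but a different result at every recursion.
    print(effective_outputs(1.0, states))
```

Feeding the same nonvolatile input at different recursions yields different outputs because V_t differs, which mirrors the inequality in Eq. (6); a recursion without the volatile state would return the same value every time.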
V. Experimental Results
We conduct experiments to investigate the advantages of the proposed VMNet model. We build single-scale (
A. Datasets
We train the VMNet models on the DIV2K dataset [20], which has been widely used to train recent super-resolution models [9], [11]. For evaluating the performance of our models, we use four benchmark datasets that are widely used in the literature: Set5 [21], Set14 [22], BSD100 [19], and Urban100 [6]. The datasets contain 5, 14, 100, and 100 images, respectively.
B. Implementation Details
We implement the training and evaluation code of the VMNet model using the TensorFlow framework. The training procedure for each training step is as follows. First, for the multi-scale VMNet model, one of the upscaling paths (i.e.,
C. Evaluation Metrics
To measure the quality of the upscaled images, we employ four popular evaluation metrics, namely, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [25], naturalness image quality evaluator (NIQE) [26], and perceptual index (PI) [27]. PSNR and SSIM quantify the quality degradation of an image in comparison to its reference image. They are calculated by using SRZoo v0.3.0 [28]. For these metrics, higher values mean better quality. NIQE and PI are no-reference perceptual quality metrics. PI is obtained by combining NIQE and a super-resolution (SR) score proposed by Ma et al. [29] as follows:\begin{equation*} \mathrm {PI}(\widehat {Y}) = \frac {1}{2} \Big (\mathrm {NIQE}(\widehat {Y}) + \big (10 - \mathrm {SR}(\widehat {Y}) \big) \Big),\tag{7}\end{equation*}
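As a minimal sketch of two of these metrics, the code below evaluates Eq. (7) from precomputed NIQE and SR scores and computes PSNR using its standard definition. The function names and sample values are hypothetical; computing NIQE and the SR score of Ma et al. themselves is outside the scope of the sketch, and the color-channel conversion and border handling applied by evaluation tools such as SRZoo are not reproduced here.

```python
import math


def perceptual_index(niqe, sr_score):
    """Perceptual index as defined in Eq. (7); lower values indicate
    better perceptual quality."""
    return 0.5 * (niqe + (10.0 - sr_score))


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given as nested lists of pixel values."""
    height, width = len(target), len(target[0])
    mse = sum((pred[y][x] - target[y][x]) ** 2
              for y in range(height) for x in range(width)) / (width * height)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(perceptual_index(niqe=4.0, sr_score=7.0))
    print(psnr([[0.0, 0.0]], [[0.0, 255.0]]))
```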
D. Benefits of VMNet
1) Parameter Efficiency
We first investigate the effectiveness of employing the volatile memory component by comparing the single-scale VMNet models having an upscaling factor of 4, which are trained with or without using the volatile memory component. For models with the volatile memory component, the number of convolutional channels (
Figure 4 compares the number of parameters with respect to the number of intermediate channels, which is the sum of
Furthermore, employing the volatile memory component is also beneficial to improve the quality of the output images for a fixed number of parameters. Figure 5 compares the performance of the trained VMNet models in terms of the number of parameters and the PSNR values measured for the BSD100 dataset. Overall, the models with and without the volatile memory component both tend to show better performance as the channel dimension (and consequently the number of parameters) increases. However, VMNet models having the volatile memory component outperform those without the volatile memory component, when the same number of parameters is used. This strongly supports our assumption that modulating the intermediate features by employing two memory components instead of a single memory component helps to improve the quality of the upscaled images, when the same number of model parameters is employed.
2) Effective Feature Handling
We further examine changes of the activation patterns over the recursive iterations in our model. Figure 6 shows
3) Progressive Upscaling
As our method employs a recursive operation, it is possible to generate the super-resolved images in a progressive manner, as depicted in Figure 6. Figure 7 shows the PSNR and SSIM values with respect to various numbers of the recursions for our model with
E. Comparison With Existing Methods
We compare the performance of our multi-scale VMNet model with existing state-of-the-art super-resolution methods whose numbers of parameters are similar to that of VMNet: VDSR [7], DRCN [4], LapSRN [15], DRRN [5], MemNet [12], IDN [8], SRFBN-S [16], and CARN [9]. DRCN, DRRN, MemNet, and SRFBN-S contain parameter-sharing parts. VDSR, LapSRN, IDN, and CARN are also included in the comparison because they have been recently proposed and have numbers of model parameters similar to ours.
Table 1 presents the performance of the state-of-the-art methods and ours in terms of PSNR, SSIM, NIQE, and PI on the four benchmark datasets. The number of model parameters required to obtain a super-resolved image for the given upscaling factor is also provided. First, the VMNet model mostly outperforms the other methods that do not employ recursive operations or parameter-sharing, including VDSR, LapSRN, and IDN. For example, our method achieves a quality gain of 1.67 dB for a scaling factor of 2 on the Urban100 dataset over the LapSRN model. This confirms that recursive processing helps to obtain better super-resolved images while keeping the number of model parameters small.
In addition, our model employs far fewer parameters than DRCN and CARN. The VMNet model uses up to 67% fewer model parameters than DRCN while significantly outperforming it in terms of PSNR. VMNet also performs slightly better than CARN despite its smaller model size: it has 38%, 32%, and 33% fewer parameters than the CARN model for scaling factors of 2, 3, and 4, respectively. This indicates that the proposed method handles image features more efficiently than these state-of-the-art methods.
Table 2 compares the performance of state-of-the-art methods and ours in terms of the number of model parameters, processing time, PSNR, and SSIM. To ensure a fair comparison, we measure processing times on the same computing device equipped with an NVIDIA GTX 1080 GPU. Given the nature of recursive operations (i.e., the need to process intermediate features multiple times using the same model parameters), our proposed method takes slightly longer than the fastest method, i.e., CARN. Nevertheless, thanks to the efficient structure of VMNet, our method takes significantly less processing time than the other methods, regardless of whether the compared models are recursive or not. This confirms that the VMNet model is efficient in terms of model complexity as well as computational complexity.
Figure 8 provides a showcase of the images reconstructed by our proposed model and the other methods. The figure shows that VMNet is highly reliable in recovering complex textures from low-resolution images. For example, our method successfully upscales fine details of the structures in the Urban100 dataset, while the other methods produce highly blurred images or images containing large amounts of artifacts.
Figure 8. Comparison of the images upscaled by a factor of 4, which are obtained by VMNet and the other state-of-the-art methods.
VI. Conclusion
In this paper, we proposed the VMNet model, which employs a new recursive operation using volatile and nonvolatile memories to perform super-resolution. Our VMNet model operates two distinct memory components through a recursive operation, which provides an effective solution to the limitation of previous super-resolution methods that employ a fixed operation in their recursive portions. By dividing the recursive portion into two components, our model showed multiple benefits, including the capability of progressive operation and better efficiency in terms of the number of model parameters and image quality. In addition, comparison with other state-of-the-art methods demonstrated that our method can generate better-quality upscaled images than the others.
We believe that the proposed recursive operation using two separate memory components can be useful in other image enhancement tasks such as denoising and deblurring. Therefore, extending the proposed method to such tasks will be worth investigating in the future.