
Volatile-Nonvolatile Memory Network for Progressive Image Super-Resolution




Abstract:

Single-image super-resolution, i.e., reconstructing a high-resolution image from a low-resolution image, is a critical concern in many computer vision applications. Recent deep learning-based image super-resolution methods employ massive numbers of model parameters to obtain quality gain. However, this leads to increased model size and high computational complexity. To mitigate this, some methods employ recursive parameter-sharing for better parameter efficiency. Nevertheless, their designs do not adequately exploit the potential of the recursive operation. In this paper, we propose a novel super-resolution method, called a volatile-nonvolatile memory network (VMNet), to maximize the usefulness of the recursive architecture. Specifically, we design two central components called volatile and nonvolatile memories. By means of these, the recursive feature extraction portion of our model performs effective recursive operations that gradually enhance image quality. Through extensive experiments on \times 2, \times 3, and \times 4 super-resolution tasks, we demonstrate that our method outperforms existing state-of-the-art methods in terms of image quality and complexity via stable progressive super-resolution.
Published in: IEEE Access ( Volume: 9)
Page(s): 37487 - 37496
Date of Publication: 04 March 2021
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Single-image super-resolution is the process of obtaining a high-resolution image from a given low-resolution image. It is an ill-posed problem because image details must be estimated from insufficient spatial information. Many researchers have proposed various approaches that can generate upscaled images of better quality than those obtained using simple interpolation methods such as nearest-neighbor, bilinear, and bicubic upscaling.

In recent times, the emergence of deep learning techniques has impacted the super-resolution field [1]. Dong et al. [2] proposed the super-resolution convolutional neural network (SRCNN) model, which shows considerably improved performance compared to previous approaches. Lim et al. [3] proposed the enhanced deep super-resolution (EDSR) model, which employs residual connections and various optimization techniques.

Many recent deep learning-based super-resolution methods tend to increase the number of layers to obtain better upscaled images. However, this dramatically increases the number of model parameters involved. For instance, the number of parameters in the EDSR model is approximately 43M, which is at least 400 times larger than that of the SRCNN model. To deal with this, recursive approaches that use some parameters repeatedly, such as deeply-recursive convolutional network (DRCN) [4] and deep recursive residual network (DRRN) [5], have been proposed.

Despite their efficiency in terms of model size, existing recursive super-resolution methods have a limitation compared to non-recursive methods. In the non-recursive methods, stacked convolutional layers gradually reconstruct image details. Through training, each layer learns a distinct operation appropriate for the features at the stage where it is placed. By contrast, the recursive methods are based on parameter sharing, which means that each recursion implements the same operation. Thus, these methods rely on repeated application of the same operation, which is a significant constraint that can limit the quality of reconstructed image details.

In order to overcome this limitation, we propose a novel parameter-efficient recursive super-resolution method using a Volatile-nonvolatile Memory Network (VMNet). The recursive portion of our model consists of two components: volatile memory and nonvolatile memory. The volatile memory component contains information that is useful for gradually enhancing the intermediate features but is not passed to the post-recursive upscaling part; it is called volatile in the sense that the information is retained only during the recursive operation. The nonvolatile memory component contains the nonvolatile features, i.e., the ones that are passed to and used in the upscaling part to obtain the final output image. Thanks to the interplay of the two distinct memory components through the recursive operation, our method overcomes the limitation mentioned above. As a result, it is superior to the existing recursive methods in terms of both parameter efficiency and image quality. As shown in Figure 1, our method achieves improved image quality while significantly reducing model complexity. In addition, our VMNet model can generate super-resolved images in a progressive manner, which is useful for real-world applications such as progressive image loading.

FIGURE 1. Number of parameters and peak signal-to-noise ratio (PSNR) values of the state-of-the-art and proposed methods for an upscaling factor of 4 on the Urban100 dataset [6].

The main contributions of this work are as follows:

  • We propose an efficient recursive deep learning model, VMNet, for image super-resolution. The key components of our VMNet are volatile and nonvolatile memory components for effective and stable recursive operations.

  • We demonstrate that the two memory components adopt distinct roles to perform stable progressive super-resolution.

  • We show that the proposed method yields improved image quality with reduced complexity in comparison to the existing state-of-the-art methods.

The rest of the paper is organized as follows. First, we briefly survey the related work in Section II. Then, the overall structure of the proposed method is explained in Section III and a comparison with other recursive methods is presented in Section IV. We describe several experiments for in-depth analysis of our method in Section V, including examining the effectiveness of the newly introduced recursive structure and a comparison with the other state-of-the-art methods. Finally, we conclude our work in Section VI.

SECTION II.

Related Work

Recent trends of super-resolution have shifted from a feature extraction-based approach towards a deep learning-based approach. Dong et al. [2] first employed deep learning techniques for super-resolution by developing SRCNN, which enhances the interpolated image via three convolutional layers. Kim et al. [7] proposed very deep super-resolution (VDSR), which stacks 20 convolutional layers to improve the performance. Lim et al. [3] suggested the EDSR model, which employs more than 64 convolutional layers.

We observe the following three common characteristics in the aforementioned works. First, increasing the spatial resolution at the latter stage can reduce the computational complexity more than upscaling at the initial stage [8]–​[10]. Second, employing multiple residual connections is beneficial for obtaining better upscaled images [5], [11]. Third, obtaining multiple upscaled images from the same super-resolution model and combining them into one provides better quality than acquiring a single image directly [3], [4], [12]. Along with the newly introduced volatile-nonvolatile memory-based architecture, our proposed method is built taking this empirical knowledge into consideration.

While the aforementioned methods share the basic empirical rule of deep learning, i.e., deeper and larger models can achieve better performance, parameter efficiency is also a crucial factor in various applications [1], [13]. Therefore, super-resolution methods with shared model parameters have been proposed. DRCN, introduced by Kim et al. [4], proved the effectiveness of parameter sharing by recursively applying the feature extraction layer 16 times. Tai et al. [5] proposed DRRN, which employs a residual network (ResNet) [14] structure while sharing model parameters. They also proposed the memory network (MemNet) model [12], which contains groups of recursive parts called “memory blocks” with skip connections across them. As mentioned in the introduction, however, these methods share a constraint: they perform fixed operations repeatedly in their recursive parts, which is discussed in detail in Section IV.

Some researchers proposed super-resolution methods that do not rely on shared parameters but have fewer model parameters. For example, Lai et al. [15] introduced the Laplacian pyramid super-resolution network (LapSRN) method, which progressively upscales the input image by a factor of 2. Hui et al. [8] proposed the information distillation network (IDN) method, which employs long and short feature extraction paths to maximize the amount of extracted information from a given low-resolution image. Ahn et al. [9] developed the cascading residual network (CARN) model, which employs cascading residual blocks. Li et al. [16] proposed the super-resolution feedback network (SRFBN) model, which processes the intermediate features with a feedback mechanism. Along with the recursive super-resolution methods, the performance of these methods is also compared with that of our proposed method in Section V-E.

SECTION III.

Volatile-Nonvolatile Memory Network

Figure 2 shows the overall structure of the proposed super-resolution model. The objective of the super-resolution task is to obtain an image \widehat {Y}, which is upscaled from a given low-resolution image X, where we want \widehat {Y} to be the same as the ground-truth high-resolution image Y. The initial features are extracted from the given input image. Then, the extracted features are further processed via a volatile-nonvolatile memory block (VMB, Figure 3), which is employed multiple times with the same parameters. The final image is obtained from the upscaling module. The entire procedure for an upscaling factor of 4 is summarized in Algorithm 1.

FIGURE 2. Overall structure of the proposed VMNet model.

FIGURE 3. Structure of a volatile-nonvolatile memory block (VMB).

Algorithm 1 Processing an Image via a Trained VMNet Model for an Upscaling Factor of 4

Input: Input image {X}\in \mathbb {R}^{w\times {h}\times {3}}

Output: Upscaled image \widehat {Y}\in \mathbb {R}^{4w\times {4h}\times {3}}

1: Load model parameters \{ {W}_{I}, {b}_{I}, {W}_{C_{1}}, {b}_{C_{1}}, {W}_{C_{2}}, {b}_{C_{2}}, {W}_{C_{3}}, {b}_{C_{3}}, {W}_{U_{1}}, {b}_{U_{1}}, {W}_{U_{2}}, {b}_{U_{2}}, {W}_{U_{3}}, {b}_{U_{3}} \}
2: {H}_{0} \gets {W}_{I} \ast {X} + {b}_{I}, \quad {V}_{0} \gets 0
3: for t = 0 to R-1 do
4:     {M}_{t} \gets \mathrm {Concatenate}({H}_{t}, {V}_{t})
5:     {M}_{t} \gets {W}_{C_{1}} \ast {M}_{t} + {b}_{C_{1}}
6:     {M}_{t} \gets \max ({M}_{t}, 0)
7:     {M}_{t} \gets {W}_{C_{2}} \ast {M}_{t} + {b}_{C_{2}}
8:     \{ {H}_{t+1}, {V}_{t+1} \} \gets \mathrm {Split}({M}_{t})
9:     {M}_{t} \gets \mathrm {Concatenate}({H}_{t+1} + {H}_{t}, {V}_{t})
10:    {M}_{t} \gets {W}_{C_{3}} \ast {M}_{t} + {b}_{C_{3}}
11:    \{ {H}_{t+1}, {V}_{t+1} \} \gets \mathrm {Split}({M}_{t})
12:    {H}_{t+1} \gets {H}_{t+1} + {H}_{t}
13: end for
14: \widehat {Y} \gets {W}_{U_{1}} \ast {H}_{R} + {b}_{U_{1}}
15: \widehat {Y} \gets \mathrm {Depth\_to\_space}_{2}(\widehat {Y})
16: \widehat {Y} \gets {W}_{U_{2}} \ast \widehat {Y} + {b}_{U_{2}}
17: \widehat {Y} \gets \mathrm {Depth\_to\_space}_{2}(\widehat {Y})
18: \widehat {Y} \gets {W}_{U_{3}} \ast \widehat {Y} + {b}_{U_{3}}
19: return \widehat {Y}

A. Initial Feature Extractor

The VMNet model takes a low-resolution input image {X}\in \mathbb {R}^{w\times {h}\times {3}} consisting of three channels corresponding to the RGB color space, where {w}\times {h} is the resolution of the image. Before our model recursively processes the image, a convolutional layer extracts the initial features, which can be represented as
\begin{equation*} {H}_{0} = {W}_{I} \ast {X} + {b}_{I},\tag{1}\end{equation*}
where {W}_{I}\in \mathbb {R}^{3\times {3}\times {3}\times {c}} and {b}_{I}\in \mathbb {R}^{c} are the weight and bias parameters, respectively, and the operator \ast denotes the convolution operation. The variable c determines the number of convolutional channels; thus, the last dimension of {H}_{0} is c.
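As a concrete illustration, the following is a minimal TensorFlow sketch of Eq. (1). The choice c = 64 and the dummy input shape are illustrative assumptions, not values mandated by this subsection.

```python
import tensorflow as tf

c = 64  # number of convolutional channels (illustrative value)

# Eq. (1): H_0 = W_I * X + b_I, a single 3x3 convolution on the RGB input,
# matching W_I in R^{3x3x3xc} and b_I in R^c.
initial_extractor = tf.keras.layers.Conv2D(filters=c, kernel_size=3, padding='same')

x = tf.random.uniform((1, 48, 48, 3))  # a dummy low-resolution input X
h0 = initial_extractor(x)              # H_0 with shape (1, 48, 48, c)
```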

B. Volatile-Nonvolatile Memory Block

Starting from the initial features {H}_{0}, our model performs the recursive operations in the shared portion named the volatile-nonvolatile memory block (VMB), which is shown in Figure 3. The VMB takes two matrices as inputs at a certain recursion t: the nonvolatile feature matrix {H}_{t}\in \mathbb {R}^{w\times {h}\times {c}} and the volatile memory matrix {V}_{t}\in \mathbb {R}^{w\times {h}\times {v}}. Here, v determines the channel dimension of {V}_{t}. Note that {H}_{t} and {V}_{t} have the same spatial dimensions but different channel dimensions. The volatile memory matrix is initialized with zeros (i.e., {V}_{0} = 0) and handled separately from the nonvolatile feature matrix.

A VMB consists of three convolutional layers and one rectified linear unit (ReLU) activation. To utilize the nonvolatile feature matrix and the volatile memory matrix simultaneously, the two matrices are concatenated along the last dimension before the first convolutional operation. The concatenated matrix is processed via two convolutional layers with one ReLU activation and then split into two output matrices having the same sizes as the input matrices. The same concatenation and splitting procedures are also applied before and after the last convolutional layer. In addition, two residual connections are adopted for improved performance [7], [11], [17]. To sum up, the VMB takes {H}_{t} and {V}_{t} as inputs and outputs {H}_{t+1} and {V}_{t+1}, which then serve as the inputs of the same VMB for the next recursion. This recursive process is performed R times, producing {H}_{R} and {V}_{R}. Then, the volatile memory {V}_{R} is discarded.
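The following TensorFlow sketch mirrors the VMB computation as written in Algorithm 1 (steps 4 to 12). The output width of the first convolution is an assumption; the paper fixes only the widths at the two split points (c + v channels) and the overall layout of three convolutions, one ReLU, and two residual connections.

```python
import tensorflow as tf

class VMB(tf.keras.layers.Layer):
    """Volatile-nonvolatile memory block, following Algorithm 1 (steps 4-12)."""
    def __init__(self, c=64, v=64):
        super().__init__()
        self.c, self.v = c, v
        # Widths at the split points must be c + v; conv1's width is assumed.
        self.conv1 = tf.keras.layers.Conv2D(c + v, 3, padding='same')
        self.conv2 = tf.keras.layers.Conv2D(c + v, 3, padding='same')
        self.conv3 = tf.keras.layers.Conv2D(c + v, 3, padding='same')

    def call(self, h, vmem):
        m = tf.concat([h, vmem], axis=-1)           # concatenate H_t and V_t
        m = tf.nn.relu(self.conv1(m))               # conv + ReLU
        m = self.conv2(m)
        h_new, v_new = tf.split(m, [self.c, self.v], axis=-1)
        m = tf.concat([h_new + h, vmem], axis=-1)   # first residual connection on H
        m = self.conv3(m)
        h_new, v_new = tf.split(m, [self.c, self.v], axis=-1)
        return h_new + h, v_new                     # second residual connection on H

# Recursive application with shared parameters (R = 16 in the paper's experiments).
vmb = VMB()
h = tf.random.uniform((1, 48, 48, 64))
vmem = tf.zeros((1, 48, 48, 64))  # V_0 = 0
for _ in range(16):
    h, vmem = vmb(h, vmem)        # the same layer (shared weights) every recursion
```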

C. Upscaling

Finally, the VMNet model upscales the processed nonvolatile feature matrix {H}_{R} to generate the final upscaled image \widehat {Y}. Specifically, we use the depth-to-space operation, which is also known as sub-pixel convolution [18]. For instance, to generate an image with an upscaling factor of 2, the first convolutional layer outputs a matrix having a size of {w}\times {h}\times {4c}, the depth-to-space operator modifies its shape to {2w}\times {2h}\times {c}, and the last convolutional layer outputs the final upscaled image having a shape of {2w}\times {2h}\times {3}. In addition, our model can generate upscaled images not only from the final processed nonvolatile feature matrix {H}_{R} but also from the intermediate nonvolatile feature matrices {H}_{t}, \forall {t}\in \{1, \ldots, R-1\}, by inputting {H}_{t} to the upscaling module.
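A sketch of the \times 4 upscaling path, mirroring steps 14 to 18 of Algorithm 1: two convolution-plus-depth-to-space stages followed by a final convolution to RGB. The 3x3 kernel sizes are assumptions; the channel widths follow the text above.

```python
import tensorflow as tf

c = 64

# Each conv before a depth-to-space stage outputs 4c channels, so that
# depth_to_space with block size 2 restores c channels at double resolution.
conv_u1 = tf.keras.layers.Conv2D(4 * c, 3, padding='same')
conv_u2 = tf.keras.layers.Conv2D(4 * c, 3, padding='same')
conv_u3 = tf.keras.layers.Conv2D(3, 3, padding='same')  # final RGB output

def upscale_x4(h_r):
    y = tf.nn.depth_to_space(conv_u1(h_r), 2)  # (w, h, 4c) -> (2w, 2h, c)
    y = tf.nn.depth_to_space(conv_u2(y), 2)    # (2w, 2h, 4c) -> (4w, 4h, c)
    return conv_u3(y)                          # (4w, 4h, 3)
```

The same module can be fed any intermediate {H}_{t} to obtain a progressively refined output, as described above.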

D. Loss Function

We build our loss function based on the L1 loss, which is widely used in various state-of-the-art super-resolution methods because of its superiority over other loss functions [1]. To enable our model to generate intermediate upscaled images with progressively improved quality, as shown in Figure 6, we design the loss function for training our model as
\begin{equation*} L \big (\widehat {Y}, Y \big) = \frac {1}{w' \times h'} \sum _{x=1}^{w'} \sum _{y=1}^{h'} \left |{ \widehat {Y_{L}}(x, y) - Y(x, y) }\right |.\tag{2}\end{equation*}
Here, w' \times h' is the spatial resolution of \widehat {Y} and Y, and \widehat {Y}(x, y) and Y(x, y) are the pixel values at (x, y) of the upscaled and ground-truth images, respectively. \widehat {Y_{L}} is an alternative version of \widehat {Y} obtained from the intermediate outputs via a weighted sum:
\begin{equation*} \widehat {Y_{L}} = \frac {\sum _{t=1}^{R}{2^{(t-1)} \widehat {Y_{t}}}}{\sum _{t=1}^{R}{2^{(t-1)}}}.\tag{3}\end{equation*}
The term {2}^{(t-1)} controls the contribution of each intermediate output such that the later outputs contribute more to \widehat {Y_{L}} than the earlier ones.
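A compact sketch of Eqs. (2) and (3), assuming the model returns the list of intermediate outputs \widehat {Y_{1}}, \ldots, \widehat {Y_{R}}; averaging over all tensor axes is an implementation shorthand for the per-pixel mean in Eq. (2).

```python
import tensorflow as tf

def vmnet_loss(intermediate_outputs, y):
    """L1 loss of Eqs. (2)-(3): intermediate outputs [Y_hat_1, ..., Y_hat_R]
    are combined with exponentially increasing weights 2^(t-1)."""
    r = len(intermediate_outputs)
    weights = [2.0 ** t for t in range(r)]  # 2^(t-1) for t = 1..R
    y_l = tf.add_n([w * y_t for w, y_t in zip(weights, intermediate_outputs)])
    y_l = y_l / sum(weights)                # weighted combination Y_hat_L, Eq. (3)
    return tf.reduce_mean(tf.abs(y_l - y))  # mean absolute error, Eq. (2)
```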

FIGURE 4. Number of intermediate channels (c+v) vs. number of parameters of the \times 4-scale VMNet models with or without the volatile memory component.

FIGURE 5. Number of parameters and PSNR values of the \times 4-scale VMNet models with or without the volatile memory component for the BSD100 dataset [19].

FIGURE 6. Nonvolatile intermediate features {H}_{t}, volatile memory matrix {V}_{t}, upscaled images \widehat {Y_{t}}, and PSNR values of the VMNet models for an image of the BSD100 dataset [19]. Enlarged versions of the regions marked with red rectangles are also shown. (a) Without the volatile memory component (c=64, v=0). (b) With a small amount of the volatile memory component (c=64, v=8). (c) With a large amount of the volatile memory component (c=64, v=64).

SECTION IV.

Comparison to Other Recursive Methods

As mentioned in Section II, there exist a few super-resolution models that employ recursive operations, including DRCN [4], DRRN [5], MemNet [12], and SRFBN [16].

DRCN is one of the simplest recursive super-resolution methods, employing only a convolutional layer for each recursive operation. Thus, it is largely limited owing to the repeated application of a simple fixed operation. DRRN has an improved recursive structure, which consists of two convolutional layers with a residual connection. Nevertheless, its computation is still constrained by the same limitation as DRCN. MemNet is based on multiple stacked recursive units, which have different weights and thus act as different operations. It can be considered a compromise between the recursive and non-recursive approaches, and thus does not fully exploit the advantage of the recursive approach in terms of efficiency. Furthermore, a fixed operation is performed during recursion in each recursive unit, and therefore the aforementioned limitation of the other recursive approaches still exists. SRFBN employs the original pre-processed input and the previous output of its recursive portion during the recursive operations. Hence, it still relies on a fixed operation and has the same limitation as the other recursive approaches.

In our model, employing a separate volatile memory component has important implications. First, from the standpoint of the nonvolatile feature matrix {H}_{t}, the operation in the VMB is not fixed at each recursion owing to the changing values of the volatile memory matrix {V}_{t}. Thus, the fundamental limitation of the existing recursive methods, i.e., repeated fixed operations, is resolved. To see this, let f({H}_{t-1}, {V}_{t-1}) = {H}_{t} be the operation of the VMB that produces {H}_{t} at recursion t. We can rewrite it by defining another function {g}_{t-1} that absorbs {V}_{t-1}:
\begin{equation*} {H}_{t}=f({H}_{t-1}, {V}_{t-1}) = {g}_{t-1}({H}_{t-1}).\tag{4}\end{equation*}
Similarly,
\begin{equation*} H_{t+1}=f(H_{t},V_{t}) = g_{t}(H_{t}).\tag{5}\end{equation*}
However, {g}_{t-1} and {g}_{t} are not the same from the perspective of {H}_{t} because
\begin{equation*} f({H}_{t}, {V}_{t}) = {g}_{t}({H}_{t}) \neq {g}_{t-1}({H}_{t}) = f({H}_{t}, {V}_{t-1}).\tag{6}\end{equation*}
Second, the volatile memory component serves as separate storage that keeps track of the current status during the recursive operation. This status information is not directly passed to the upscaling portion and can therefore be exploited with the sole objective of supporting the gradual enhancement of the (nonvolatile) features. The advantages of this characteristic are demonstrated experimentally in Section V-D.
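To make Eqs. (4) to (6) concrete, a tiny NumPy sketch (purely illustrative; the weights and the tanh nonlinearity stand in for the actual VMB) shows that a single shared mapping f yields different effective functions g_t of H when the volatile input V differs:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))  # one fixed weight, shared across recursions

def f(h, v):
    # a single shared operation on the concatenation of H and V, as in the VMB
    return np.tanh(np.concatenate([h, v]) @ w)[:4]

h = rng.standard_normal(4)
v_prev, v_curr = np.zeros(4), rng.standard_normal(4)
# g_{t-1}(H) = f(H, V_{t-1}) and g_t(H) = f(H, V_t) differ for the same H:
print(np.allclose(f(h, v_prev), f(h, v_curr)))  # False
```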

SECTION V.

Experimental Results

We conduct experiments to investigate the advantages of the proposed VMNet model. We build single-scale (\times 4) and multi-scale (\times 2, \times 3, and \times 4) VMNet models. The single-scale models are used to examine the benefits of VMNet, and the multi-scale model is used to evaluate the performance of our model in comparison to the other super-resolution methods across various scales. The number of recursive operations (R) is set to 16.

A. Datasets

We employ the DIV2K dataset [20], which has been widely used to train recent super-resolution models [9], [11], for training the VMNet models. For evaluating the performance of our models, we use four benchmark datasets that are widely used in literature: Set5 [21], Set14 [22], BSD100 [19], and Urban100 [6]. The datasets contain 5, 14, 100, and 100 images, respectively.

B. Implementation Details

We implement the training and evaluation code of the VMNet model using the TensorFlow framework. The training procedure for each training step is as follows. First, for the multi-scale VMNet model, one of the upscaling paths (i.e., \times 2, \times 3, and \times 4) is randomly selected. Then, eight training images are randomly selected. Each selected image is randomly cropped, where cropping sizes are set to 32\times 32 pixels for the single-scale VMNet models and 48\times 48 pixels for the multi-scale VMNet model. For data augmentation, the image patches are then randomly flipped and rotated. These patches are inputted to the model. The Adam optimization method [23] with \beta _{1}=0.9, \beta _{2}=0.999, and \hat {\epsilon }={10}^{-8} is used to update the model parameters. To prevent the vanishing or exploding gradients problem [24], we employ the L2 norm-based gradient clipping method, which clips each gradient so that its L2 norm fits within [-\theta, \theta]; in this study, we set \theta =5. The learning rate is initially set to {10}^{-4} and halved at every {2}\times {10}^{5} and {4}\times {10}^{5} training steps for training the single-scale and multi-scale VMNet models, respectively. Totals of {1}\times {10}^{6} and {2}\times {10}^{6} steps are executed for training the single-scale and multi-scale VMNet models, respectively.
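A sketch of the optimizer configuration and training step under the single-scale setting described above; `model` is a hypothetical callable returning the list of intermediate outputs, and `vmnet_loss` is the loss sketched in Section III-D.

```python
import tensorflow as tf

# Learning rate 1e-4, halved every 2e5 steps over 1e6 total steps (single-scale).
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[200_000, 400_000, 600_000, 800_000],
    values=[1e-4, 5e-5, 2.5e-5, 1.25e-5, 6.25e-6])

# Adam with beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        loss = vmnet_loss(model(x), y)
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [tf.clip_by_norm(g, 5.0) for g in grads]  # L2-norm clipping, theta = 5
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```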

C. Evaluation Metrics

To measure the quality of the upscaled images, we employ four popular evaluation metrics, namely, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [25], naturalness image quality evaluator (NIQE) [26], and perceptual index (PI) [27]. PSNR and SSIM measure the quantitative quality degradation of an image in comparison to its reference image. They are calculated by using SRZoo v0.3.0 [28]. For these metrics, higher values mean better quality. NIQE and PI are no-reference perceptual quality metrics. PI is obtained by combining NIQE and a super-resolution (SR) score proposed by Ma et al. [29] as follows:
\begin{equation*} \mathrm {PI}(\widehat {Y}) = \frac {1}{2} \Big (\mathrm {NIQE}(\widehat {Y}) + \big (10 - \mathrm {SR}(\widehat {Y}) \big) \Big),\tag{7}\end{equation*}
where \widehat {Y} is a given super-resolved image, \mathrm {NIQE}(\widehat {Y}) is the measured NIQE value, and \mathrm {SR}(\widehat {Y}) is the measured SR score. For NIQE and PI, lower values mean better quality. As in previous works [8], [9], all metrics are calculated on the Y channel of the YCbCr channels converted from the RGB channels.
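Eq. (7) reduces to a one-line computation once the NIQE value and the SR score of Ma et al. are available (those two measurements themselves come from external tools):

```python
def perceptual_index(niqe_score: float, sr_score: float) -> float:
    """Eq. (7): PI combines NIQE and the SR score of Ma et al.; lower is better."""
    return 0.5 * (niqe_score + (10.0 - sr_score))
```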

D. Benefits of VMNet

1) Parameter Efficiency

We first investigate the effectiveness of employing the volatile memory component by comparing the single-scale VMNet models having an upscaling factor of 4, which are trained with or without the volatile memory component. For models with the volatile memory component, the number of convolutional channels (c) is fixed at 64 and the number of volatile memory channels (v) varies from 1 to 64. For the models without the volatile memory component, v is fixed at 0 to disable the volatile memory component and c varies from 65 to 96.

Figure 4 compares the number of parameters with respect to the number of intermediate channels, which is the sum of c and v . As shown in the figure, the models having the volatile memory component have fewer parameters than the models without the volatile memory component for the same number of intermediate channels. When c increases, the number of model parameters increases across all parts of the model, i.e., the initial feature extraction, VMB, and upscaling parts. By contrast, increasing v affects the number of model parameters only in the VMB portion because the volatile memory component is involved only in VMB.

Furthermore, employing the volatile memory component is also beneficial for improving the quality of the output images for a fixed number of parameters. Figure 5 compares the performance of the trained VMNet models in terms of the number of parameters and the PSNR values measured on the BSD100 dataset. Overall, the models with and without the volatile memory component both tend to show better performance as the channel dimension (and consequently the number of parameters) increases. However, the VMNet models having the volatile memory component outperform those without it when the same number of parameters is used. This strongly supports our assumption that modulating the intermediate features with two memory components instead of one helps to improve the quality of the upscaled images for a given parameter budget.

2) Effective Feature Handling

We further examine changes of the activation patterns over the recursive iterations in our model. Figure 6 shows {H}_{t}, {V}_{t}, and \widehat {Y_{t}} of the VMNet models using c=64, v=0 (i.e., without the volatile memory component), c=64, v=8 (i.e., with a small amount of the volatile memory component), and c=64, v=64 (i.e., with a large amount of the volatile memory component), where the corresponding PSNR values are also reported. The values of the nonvolatile and volatile memory matrices are averaged along the last dimension. The models with and without the volatile memory component both generate upscaled images with gradually improved quality in terms of PSNR over the recursion t. However, the changes of the nonvolatile intermediate features are largely different. When the volatile memory component is not employed (Figure 6(a)), the activation pattern and the range of values drastically change over the recursion, even though the super-resolved image does not. Employing a small amount of the volatile memory component (Figure 6(b)) reduces such changes and improves the performance in terms of PSNR. However, the activation pattern of {H}_{t} still fluctuates over the recursion to some extent because the volatile memory component provides insufficient space to store information. Employing a sufficiently large amount of the volatile memory component (Figure 6(c)) yields much more stable activations of {H}_{t} than the case without the volatile memory component. Instead, the volatile memory matrix {V}_{t} undergoes major changes in its details (though these changes are still smaller than those of {H}_{t} in Figure 6(a)). At the early stage, {V}_{t} mainly represents major contours of the boat and building. As the recursion proceeds, these major contours become less clear and minute details become more prominent. This implies that {V}_{t} plays the roles of both a “tracker” and a “guide,” specifying where the image needs enhancement at the current recursion and which information should be used for gradual image quality enhancement. This confirms that the two memory components have distinct functions whose effective interplay over the recursions leads to improved performance.

3) Progressive Upscaling

As our method employs a recursive operation, it can generate super-resolved images in a progressive manner, as depicted in Figure 6. Figure 7 shows the PSNR and SSIM values with respect to various numbers of recursions for our model with c=64, v=64, and R=16. The PSNR and SSIM values both increase as more recursions are performed, which confirms that our method can generate gradually improved output images.

FIGURE 7. PSNR and SSIM of the \times 4-scale VMNet models with respect to the number of recursions for the BSD100 dataset [19].

E. Comparison With Existing Methods

We compare the performance of our multi-scale VMNet model with existing state-of-the-art super-resolution methods having numbers of parameters similar to that of VMNet: VDSR [7], DRCN [4], LapSRN [15], DRRN [5], MemNet [12], IDN [8], SRFBN-S [16], and CARN [9]. DRCN, DRRN, MemNet, and SRFBN-S contain parameter-sharing parts. VDSR, LapSRN, IDN, and CARN are also included in the comparison because they have been proposed recently and have numbers of model parameters similar to ours.

Table 1 presents the performance of the state-of-the-art methods and ours in terms of PSNR, SSIM, NIQE, and PI on the four benchmark datasets. The number of model parameters required to obtain a super-resolved image for the given upscaling factor is also provided. First, the VMNet model mostly outperforms the other methods that do not employ recursive operations or parameter-sharing, including VDSR, LapSRN, and IDN. For example, our method achieves a quality gain of 1.67 dB for a scaling factor of 2 on the Urban100 dataset over the LapSRN model. This confirms that recursive processing helps to obtain better super-resolved images while keeping the number of model parameters small.

TABLE 1. Performance comparison of the state-of-the-art methods and our model on the Set5 [21], Set14 [22], BSD100 [19], and Urban100 [6] datasets. Red and blue colors indicate the best and second-best performance, respectively.

In addition, our model employs much smaller numbers of parameters than DRCN and CARN. The VMNet model uses up to 67% fewer model parameters than DRCN, while significantly outperforming DRCN in terms of PSNR. In addition, VMNet shows slightly better performance than CARN despite its smaller model size: the VMNet model has 38%, 32%, and 33% fewer parameters than the CARN model for scaling factors of 2, 3, and 4, respectively. This demonstrates that the proposed method handles image features better than the other state-of-the-art methods.

Table 2 compares the performance of the state-of-the-art methods and ours in terms of the number of model parameters, processing time, PSNR, and SSIM. To ensure a fair comparison, we measure processing times on the same computing device with an NVIDIA GTX 1080 GPU. Given the nature of recursive operations (i.e., the need to process intermediate features multiple times using the same model parameters), our proposed method takes a slightly longer processing time than the fastest one, i.e., CARN. Nevertheless, thanks to the efficient structure of VMNet, our method is fast, taking significantly less processing time than the other methods, regardless of whether the compared models are recursive or not. This confirms that the VMNet model is efficient in terms of model complexity as well as computational complexity.

TABLE 2. Performance comparison of state-of-the-art methods and our method in terms of the number of parameters, processing time, PSNR, and SSIM on the BSD100 dataset [19] for the \times 4 scale.

Figure 8 provides a showcase of the images reconstructed by our proposed model and the other methods. The figure shows that VMNet is highly reliable in recovering complex textures from low-resolution images. For example, our method successfully upscales fine details of the structures in the Urban100 dataset, while the other methods produce highly blurred images or images containing large amounts of artifacts.

FIGURE 8. Comparison of the images upscaled by a factor of 4, obtained by VMNet and the other state-of-the-art methods.

SECTION VI.

Conclusion

In this paper, we proposed the VMNet model, which employs a new type of recursive operation using volatile and nonvolatile memories to perform super-resolution. Our VMNet model operates two distinct memory components through a recursive operation, which provides an effective solution to the limitation of previous super-resolution methods that employ a fixed operation in their recursive portions. By dividing its recursive portion into two components, our model exhibits multiple benefits, including the capability of progressive operation and better efficiency in terms of the number of model parameters and image quality. In addition, comparison with other state-of-the-art methods demonstrated that our method can generate upscaled images of better quality than the others.

We believe that the proposed recursive operation using two separate memory components can be useful in other image enhancement tasks such as denoising and deblurring. Therefore, extending the proposed method to such tasks will be worth investigating in the future.
