Introduction
Single-image super-resolution (SISR) is a long-standing computer vision task that aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts [1]. With the rapid development of deep learning, the SR community has witnessed a major shift from model-based solutions to deep learning approaches [2]–[4]. Following the pioneering SRCNN [5], convolutional neural network (CNN) based SR models have achieved remarkable performance [2], [6], [7]. Since these methods use standard 3 × 3 kernels to hierarchically extract local features, they require deeper and more complex network topologies to enlarge the receptive field for better performance [1]. The resulting computational costs and memory consumption limit their practical application on edge devices [8]. Recently, Transformer-based approaches have rapidly advanced the SR field, owing to the superior long-range modeling ability of the self-attention (SA) mechanism [9], [10]. However, the complexity of SA grows quadratically with the spatial resolution, which is unaffordable for low-level vision tasks. In addition, treating images as one-dimensional sequences ignores their two-dimensional structure, which hinders learning essential texture information from local structures [11]. These observations prompt us to develop a more efficient neural network for accurate image super-resolution.
Fig. 1. Comparison of our method with state-of-the-art methods in both performance and efficiency on Set5 for ×4 SR. The disk area indicates the number of parameters.
By revisiting previous successful image SR models, we summarize several key characteristics of different works, as shown in Table I. We argue that a successful SR model needs to satisfy the following properties: (i) Large receptive field. Since image super-resolution is a typical inverse problem, the network needs to remove undesirable degradation from the image content, a functionality that is better incorporated into the neural network with the help of large receptive fields [12]. In addition, similar patches within an image serve as mutual references, and their details can be better restored from each other when the receptive field is large [13]. (ii) Local structure. Unlike semantic-level vision tasks, image SR focuses on fine image details and therefore requires learning fundamental texture information from local structures. (iii) Attention mechanism. Attention can be regarded as an adaptive recalibration of the input signal. In image SR, spatial attention allows the network to focus on recovering complex texture regions, while channel attention effectively improves the feature representation capacity. (iv) Low computational complexity. This is critical for the SR task since it mainly handles high-resolution image reconstruction.
Based on these insights, we rethink efficient visual attention design for image SR and propose a novel lightweight network, dubbed EvaSR. Our approach adopts the ViT architecture [14]–[16], but introduces local-global convolutional attention (LGCA) and a locally enhanced feed-forward network (LEFN) to replace self-attention and the vanilla feed-forward network (FFN). 1) LGCA exploits large kernel decomposition [11] to achieve large receptive fields with greatly reduced parameters and limited computational overhead. In addition, inspired by classical edge detection operators, we design paired horizontal and vertical asymmetric convolutions to capture local texture information. Finally, we use the generated features as spatial and channel attention weights, triggering the attention mechanism via element-wise multiplication. 2) We strengthen the LEFN by incorporating a small depth-wise convolution with a self-residual strategy [17] to emphasize local structural information. With this design, the proposed EvaSR is able to capture both long-range dependencies and local information, perform adaptive recalibration in the spatial and channel dimensions, and ultimately direct the network to focus on fine image structures and textures. More importantly, EvaSR is a pure CNN architecture that circumvents the quadratic complexity of self-attention methods and facilitates practical deployment on edge devices. The proposed EvaSR achieves state-of-the-art performance, as shown in Fig. 1. The main contributions of this work include:
After comprehensively revisiting previous representative SR models, we identify the key characteristics shared by these successful networks.
Based on these insights, we carefully design a novel efficient SR framework, termed EvaSR, which simultaneously combines the advantages of cheap convolutions, local information cues, long-range modeling, and spatial-channel adaptivity.
Experimental results show that our EvaSR achieves state-of-the-art performance at lower computational cost.
Fig. 2. (a) The overall pipeline of our EvaSR for image SR. (b) The inner structure of EVAB.
Proposed Method
A. Network Architecture
As shown in Fig. 2(a), the overall architecture of our EvaSR mainly consists of three parts: shallow feature extraction, deep feature extraction, and upscaling reconstruction.
Fig. 3. (a) Local-Global Convolutional Attention (LGCA) extracts local and global features while evoking the attention mechanism via element-wise multiplication. (b) Locally Enhanced Feed-forward Network (LEFN) performs local structural enhancement to focus on fundamental detail content.
Specifically, we use a 3 × 3 convolution to extract shallow features f0 from the input LR image ILR. Subsequently, the shallow features are fed into the deep feature extraction stage, a stack of efficient visual attention blocks (EVABs). We employ a 3 × 3 convolution at the tail of the deep feature extraction and add the residual f0 to the output features. Finally, the SR image is reconstructed from the resulting deep features using a pixel-shuffle operator and a 3 × 3 convolution layer, as sketched below.
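To make the data flow concrete, the following PyTorch sketch mirrors this pipeline. The EVAB internals are stubbed here (LGCA and LEFN are sketched in Sec. B), and the ordering of the reconstruction convolution before the pixel shuffle, as well as the default hyperparameters, are our assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class EVAB(nn.Module):
    """Placeholder for the efficient visual attention block (see the LGCA/LEFN sketches below)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Identity()

    def forward(self, x):
        return self.body(x)


class EvaSR(nn.Module):
    def __init__(self, n_blocks=13, channels=48, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)            # shallow feature extraction
        self.blocks = nn.Sequential(*[EVAB(channels) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)       # conv at the end of deep extraction
        self.upsample = nn.Sequential(                                 # reconstruction (assumed ordering)
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        f0 = self.shallow(x)
        deep = self.tail(self.blocks(f0)) + f0                         # global residual over the EVAB stack
        return self.upsample(deep)


# Usage: a 48x48 LR patch is upscaled to 192x192 for x4 SR.
sr = EvaSR()(torch.randn(1, 3, 48, 48))
print(sr.shape)  # torch.Size([1, 3, 192, 192])
```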
B. Efficient Visual Attention Block
Local-Global Convolutional Attention
The proposed LGCA is shown in Fig. 3(a). Haase et al. [18] reveal that architectures based on depth-wise separable convolution (DSC) implicitly rely on cross-kernel correlations. Inspired by this, we design a novel multi-branch depth-wise separable structure. Specifically, the depth-wise convolutions on multiple branches share the point-wise convolution on the trunk, where the former obtain local and global information while the latter captures relationships along the channel dimension. Subsequently, we use the generated features as attention maps to recalibrate the inputs of LGCA. The computational process can be expressed as follows:
\begin{align*} \mathrm{Attention} &= \mathrm{Conv}_p\left(\sum_{i = 0}^{3} \mathrm{Branch}_i\left(\mathrm{Conv}_p(F)\right)\right), \tag{1}\\ \mathrm{Output} &= \mathrm{Attention} \odot F, \tag{2}\end{align*}
where $F$ denotes the input features, $\mathrm{Conv}_p$ is a point-wise convolution, $\mathrm{Branch}_i$ is the $i$-th depth-wise branch, and $\odot$ denotes element-wise multiplication.
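As a concrete illustration of Eqs. (1)–(2), the sketch below implements the multi-branch structure in PyTorch. The branch composition follows the text (a depth-wise convolution, a depth-wise dilated convolution, and two paired horizontal/vertical asymmetric convolutions), but the exact kernel sizes and dilation rate are assumptions.

```python
import torch
import torch.nn as nn


class LGCA(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.proj_in = nn.Conv2d(c, c, 1)      # point-wise conv on the trunk (channel mixing)
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, 5, padding=2, groups=c),              # depth-wise conv: local context
            nn.Conv2d(c, c, 5, padding=6, dilation=3, groups=c),  # DW dilated conv: large receptive field
            nn.Sequential(                                        # asymmetric pair (assumed 1x3 / 3x1)
                nn.Conv2d(c, c, (1, 3), padding=(0, 1), groups=c),
                nn.Conv2d(c, c, (3, 1), padding=(1, 0), groups=c)),
            nn.Sequential(                                        # asymmetric pair (assumed 1x5 / 5x1)
                nn.Conv2d(c, c, (1, 5), padding=(0, 2), groups=c),
                nn.Conv2d(c, c, (5, 1), padding=(2, 0), groups=c)),
        ])
        self.proj_out = nn.Conv2d(c, c, 1)     # shared point-wise conv over the summed branches

    def forward(self, f):
        x = self.proj_in(f)
        attn = self.proj_out(sum(b(x) for b in self.branches))    # Eq. (1)
        return attn * f                                           # Eq. (2): recalibrate the input
```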
Locally Enhanced Feed-forward Network
As shown in Fig. 3(b), our LEFN adopts two 1 × 1 convolutions as projection layers, one expanding the feature channels and the other shrinking them back to the original dimension. In addition, we incorporate a 3 × 3 depth-wise convolution to learn the local structure, while introducing the self-residual strategy [17] to overcome the drawbacks associated with depth-wise convolution. Mathematically, our LEFN can be formulated as:
\begin{align*} F_e &= \mathrm{Conv}_e(F), \tag{3}\\ \mathrm{Output} &= \mathrm{Conv}_c\left(\phi\left(\mathrm{Conv}_d(F_e) + F_e\right)\right), \tag{4}\end{align*}
where $\mathrm{Conv}_e$ and $\mathrm{Conv}_c$ denote the 1 × 1 expansion and compression convolutions, $\mathrm{Conv}_d$ is the 3 × 3 depth-wise convolution, and $\phi$ is the activation function.
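A corresponding sketch of Eqs. (3)–(4) follows; the expansion ratio and the GELU choice for $\phi$ are assumptions, not values stated in the paper.

```python
import torch.nn as nn


class LEFN(nn.Module):
    def __init__(self, c, expansion=2):        # expansion ratio is an assumption
        super().__init__()
        h = c * expansion
        self.expand = nn.Conv2d(c, h, 1)                        # Conv_e: 1x1 channel expansion
        self.dwconv = nn.Conv2d(h, h, 3, padding=1, groups=h)   # Conv_d: 3x3 depth-wise, local structure
        self.act = nn.GELU()                                    # phi (assumed GELU)
        self.shrink = nn.Conv2d(h, c, 1)                        # Conv_c: 1x1 back to c channels

    def forward(self, f):
        f_e = self.expand(f)                                    # Eq. (3)
        return self.shrink(self.act(self.dwconv(f_e) + f_e))    # Eq. (4): self-residual around Conv_d
```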
Experiments
A. Datasets and Evaluation Metrics
Following [10], [20], [21], our model is trained on the DIV2K [22] and Flickr2K [23] datasets. The evaluation is conducted on five public benchmark datasets: Set5 [24], Set14 [25], B100 [26], Urban100 [27], and Manga109 [28]. PSNR and SSIM are used as performance measures. Following standard practice, both metrics are computed on the Y channel of the transformed YCbCr space.
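For reference, Y-channel PSNR is typically computed as in the sketch below. The BT.601 RGB-to-Y conversion is standard; cropping a scale-sized border before evaluation is a common protocol we assume rather than a detail stated here.

```python
import numpy as np


def psnr_y(sr, hr, scale):
    """Y-channel PSNR sketch: sr and hr are HxWx3 uint8 RGB arrays."""
    def to_y(img):
        img = img.astype(np.float64)
        # ITU-R BT.601 luma transform for inputs in [0, 255]
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

    y_sr = to_y(sr)[scale:-scale, scale:-scale]   # crop a scale-sized border (assumed protocol)
    y_hr = to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```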
B. Implementation Details
The proposed EvaSR and its tiny variant EvaSR-S consist of 13 and 12 EVABs, respectively, with the number of channels set to 48 and 32. Data augmentation in our experiments included horizontal and vertical flips and random rotations. For model optimization, the Adam optimizer was used with β1 = 0.9 and β2 = 0.99. The initial learning rate of EvaSR was set to 1 × 10^−3 and halved after 2 × 10^5 iterations; a smaller initial learning rate of 5 × 10^−4 was used for EvaSR-S. Both models were trained with the L1 loss for 10^6 iterations. During training, the mini-batch size and input patch size were set to 64 and 48 × 48 for EvaSR, and to 128 and 64 × 64 for EvaSR-S.
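The setup above corresponds to the following training-loop sketch; `model` and `train_loader` are hypothetical placeholders, and we read the schedule as halving every 2 × 10^5 iterations, which is an assumption about the wording.

```python
import torch


def train_evasr(model, train_loader, total_iters=1_000_000):
    """Training sketch: Adam (beta1=0.9, beta2=0.99), initial lr 1e-3, L1 loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
    criterion = torch.nn.L1Loss()
    step = 0
    while step < total_iters:
        for lr_patch, hr_patch in train_loader:   # mini-batches of LR/HR patch pairs
            optimizer.zero_grad()
            loss = criterion(model(lr_patch), hr_patch)
            loss.backward()
            optimizer.step()
            scheduler.step()                      # stepped per iteration, not per epoch
            step += 1
            if step >= total_iters:
                break
```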
C. Ablation Study
Ablation Study of LGCA
We conducted ablation experiments to investigate the importance of the different components of LGCA. The experimental results in Table II show that all components of LGCA are essential for performance improvement. Specifically, the depth-wise convolution and the depth-wise dilated convolution cooperate to capture long-range dependencies, and removing either of them leads to severe performance deterioration. Moreover, the combination of asymmetric convolutions extracts local structural information at different scales; excluding either Branch1 or Branch2 results in a 0.04 dB performance drop. In addition, the two 1 × 1 convolutions perform channel mixing and cooperate with the attention mechanism to trigger adaptivity in the channel dimension, which yields further performance gains.
Ablation Study of LEFN
Table II also presents the effect of the different components of LEFN on model performance. We observe that the depth-wise convolution achieves a substantial performance improvement of 0.07 dB, which demonstrates that local structural information is critical for the SR task. The self-residual strategy provides a further 0.02 dB PSNR gain with barely any additional parameters or computational overhead. In addition, the inverted bottleneck structure prevalent in FFNs pushes performance forward by another 0.12 dB.
D. Comparison with State-of-the-art Methods
Quantitative Results
As shown in Table III, our EvaSR achieves the best results on almost all benchmarks across all upsampling scales. For instance, compared with RLFN [32], the champion of the NTIRE 2022 [21] efficient SR competition, the proposed EvaSR achieves an average 0.21 dB improvement on four datasets (×4 scale) with 38% fewer parameters. With comparable computational complexity and parameter counts, our EvaSR exhibits significant performance improvements over BSRN [33] and ShuffleMixer [8]. Moreover, our EvaSR outperforms the advanced NGswin [34] with 64% fewer parameters.
Fig. 4. Visual comparisons for ×4 SR on the Urban100 and B100 datasets. The images produced by our EvaSR are more faithful to the ground truth than those of other competing methods.
Comparison with Tiny SR Methods
We further compared our EvaSR-S with other state-of-the-art tiny image SR methods, as shown in Table V. The comparison shows that our EvaSR-S achieves the best performance among all algorithms. In particular, our method clearly outperforms BSRN-S [33], the champion of the model complexity sub-track in the NTIRE 2022 [21] efficient SR challenge. Note that our tiny EvaSR-S even surpasses several much larger models, which demonstrates the superiority of our design.
Comparison with Lite Transformer-based Methods
To further demonstrate the superiority of our method, we compare the proposed EvaSR with lightweight Transformer-based models. As shown in Table VI, our EvaSR surpasses the competing ESRT network with fewer than half of its parameters. Moreover, EvaSR achieves competitive performance compared to the prominent SwinIR-light at about one third of the latter's parameters and computational cost. Compared to the recent DiVANet, EvaSR achieves overall better performance with 64% fewer parameters. These results show that EvaSR, as a pure CNN architecture, can compete favorably with advanced Transformer-based approaches.
Run Time and Model Complexity
To fully demonstrate the efficiency of our method, we further evaluated it against representative CNN- and Transformer-based methods, including LAPAR-A [20], ShuffleMixer [8], SwinIR-light [10], and ESRT [13]. The average run time of each method is measured for processing an image from the Urban100 dataset at ×2 scale. All experiments were conducted on an NVIDIA Tesla V100 GPU. The experimental results are shown in Table IV. Owing to its pure CNN architecture, our EvaSR runs 10 and 5 times faster than SwinIR-light and ESRT, respectively. Our method is even more than twice as fast as the CNN-based LAPAR-A. Compared to ShuffleMixer, our method achieves a similar running speed with fewer parameters. These results show that our EvaSR achieves a promising trade-off between performance, model complexity, and inference speed.
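GPU run time is commonly averaged as in the sketch below; the warm-up passes and explicit synchronization are standard measurement practice rather than details reported here.

```python
import time
import torch


@torch.no_grad()
def average_runtime_ms(model, images, device="cuda"):
    """Average per-image inference time in milliseconds over a list of input tensors."""
    model = model.to(device).eval()
    for img in images[:5]:                      # warm-up to exclude one-time initialization costs
        model(img.to(device))
    torch.cuda.synchronize()                    # ensure warm-up kernels have finished
    start = time.perf_counter()
    for img in images:
        model(img.to(device))
    torch.cuda.synchronize()                    # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) * 1000 / len(images)
```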
Qualitative Results
In Fig. 4, we compare the reconstruction results of different methods on Urban100 and B100 (×4). The qualitative comparison demonstrates that our EvaSR reconstructs visually more convincing results. In particular, the competing algorithms tend to produce images with distorted structures and details, whereas our EvaSR produces accurate results that are more faithful to the ground truth.
Conclusion
Recently, vision Transformers have begun to overtake CNNs as the dominant architecture for various vision tasks. However, the quadratic complexity of computing self-attention is too expensive for low-level vision, which limits their practical application on edge devices. Our study shows that a well-designed CNN-based approach can compete favorably with Transformer-based models. After revisiting previous successful CNN-based and Transformer-based image SR models, we summarized several desirable properties that contribute to performance improvement. Based on these insights, we redesigned CNN-based visual attention and proposed a novel image SR network dubbed EvaSR. Experimental results show that EvaSR outperforms state-of-the-art algorithms with fewer parameters and less computational overhead. We hope this research will facilitate further exploration of CNNs in image SR.