EvaSR: Rethinking Efficient Visual Attention Design for Image Super-Resolution


Abstract:

Due to the advantages of long-range modeling via the self-attention mechanism, the Transformer has taken various vision tasks by storm, including image super-resolution (SR). In this study, we reveal that a convolutional neural network (CNN) with properly designed visual attention is a simpler and more effective paradigm than the Transformer for image SR. We reexamine successful SR models and identify several key characteristics that contribute to accurate image reconstruction. Building on this recipe, we propose a pure CNN-based SR network with efficient visual attention, dubbed EvaSR. Benefiting from the carefully designed visual attention, our EvaSR favorably captures both local structure and long-range dependencies, and achieves adaptivity in the spatial and channel dimensions while retaining the simplicity and efficiency of CNNs. Experimental results demonstrate that EvaSR achieves state-of-the-art performance among existing efficient SR methods. In particular, the tiny version of EvaSR requires only 21.4% and 15.2% of the parameters of IMDN and SMSR, respectively, while delivering better performance.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

SECTION I.

Introduction

Single-image super-resolution (SISR) is a long-standing computer vision task that aims to reconstruct high-resolution (HR) images from low-resolution (LR) counterparts [1]. With the rapid development of deep learning, the SR community has witnessed major shifts from model-based solutions to deep learning approaches [2]–​[4]. Following the pioneering SRCNN [5], convolutional neural network (CNN) based SR models have achieved remarkable performance [2], [6], [7]. Since these methods use the standard 3 × 3 kernel to hierarchically extract local features, they require deeper and more complex network topologies to enlarge the receptive field for performance improvement [1]. Thus, the required computational costs and memory consumption limit their practical applications on edge devices [8]. Recently, Transformer-based approaches have rapidly advanced the SR field, due to the superior ability of the self-attention (SA) mechanism in long-range modeling [9], [10]. However, the complexity of SA grows quadratically with the spatial resolution, which is unaffordable for low-level vision tasks. In addition, treating images as one-dimensional sequences ignores their two-dimensional properties, which is not conducive to learning the essential texture information from local structures [11]. This prompted us to develop a more efficient neural network for accurate image super-resolution.

Fig. 1. Comparison of our method with state-of-the-art methods in both performance and efficiency on Set5 for ×4 SR. The disk area indicates the number of parameters.

By revisiting previous successful image SR models, we summarize several key characteristics of different works, as shown in Table I. We argue that a successful SR model needs to satisfy the following properties: (i) Large receptive field. Since image super-resolution is a typical inverse problem, the network mainly needs to remove undesirable degradation from the image content, and this functionality is better incorporated into the network with the help of large receptive fields [12]. In addition, similar patches in an image serve as mutual references and can better restore each other's details when large receptive fields are available [13]. (ii) Local structure. Different from semantic-level vision tasks, image SR focuses more on fine image details, and therefore requires learning fundamental texture information from the local structure. (iii) Attention mechanism. Attention can be regarded as an adaptive recalibration of the input signal. In image SR, spatial attention allows the network to focus more on recovering complex texture regions, while channel attention effectively improves the feature representation capacity. (iv) Low computational complexity. This is critical for the SR task since it mainly handles high-resolution image reconstruction.

Based on these insights, we rethink efficient visual attention design for image SR and propose a novel lightweight network, dubbed EvaSR. Our approach adopts the ViT-style architecture [14]–[16], but introduces local-global convolutional attention (LGCA) and a locally enhanced feed-forward network (LEFN) to replace self-attention and the vanilla feed-forward network (FFN). 1) LGCA exploits large kernel decomposition [11] to achieve large receptive fields with greatly reduced parameters and limited computational overhead. In addition, inspired by classical edge detection operators, we design paired horizontal and vertical asymmetric convolutions to capture local texture information. Finally, we use the generated features as spatial and channel attention weights, triggering the attention mechanism via element-wise multiplication. 2) We strengthen the LEFN by incorporating a small depth-wise convolution with a self-residual strategy [17] to emphasize local structural information. With this design, the proposed EvaSR is able to capture both long-range dependencies and local information, perform adaptive recalibration in the spatial and channel dimensions, and ultimately direct the network to focus on fine image structure and textures. More importantly, EvaSR is a pure CNN architecture that circumvents the quadratic complexity of self-attention methods and facilitates practical deployment on edge devices. The proposed EvaSR achieves state-of-the-art performance, as shown in Fig. 1. The main contributions of this work include:

  • After comprehensively revisiting the previous representative SR models, we identify the key characteristics of these successful networks.

  • Based on these insights, we carefully design a novel efficient SR framework, termed EvaSR, which simultaneously absorbs the advantages of cheap convolutions, local information cues, long-range modeling, and spatial-channel adaptivity.

  • Experimental results show that our EvaSR achieves state-of-the-art performance with lower computational costs.

TABLE I. Desirable properties we summarized from different successful image SR methods. Our approach absorbs these superior characteristics.

Fig. 2. (a) The overall pipeline of our EvaSR for image SR. (b) The inner structure of EVAB.

SECTION II.

Proposed Method

A. Network Architecture

As shown in Fig. 2(a), the overall architecture of our EvaSR mainly consists of three parts: shallow feature extraction, deep feature extraction, and upscaling reconstruction.

Fig. 3. (a) Local-Global Convolutional Attention (LGCA) obtains local and global features while evoking the attention mechanism via element-wise multiplication. (b) Locally Enhanced Feed-forward Network (LEFN) performs local structural enhancement to focus on fundamental detail contents.

Specifically, we use a 3 × 3 convolution to extract shallow features f0 from the input LR image ILR. The shallow features are then fed into the deep feature extraction stage, a stack of efficient visual attention blocks (EVABs). We employ a 3 × 3 convolution at the tail of the deep feature extraction and add the residual f0 to the output features. Finally, the SR image is reconstructed from the resulting deep features using a pixel shuffle operator and a 3 × 3 convolution layer.
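To make the pipeline concrete, below is a minimal PyTorch sketch of this three-stage layout. The block count and channel width follow Sec. III-B, but the EVAB placeholders and their residual wiring are our assumptions for illustration; the actual LGCA and LEFN designs are described in Sec. II-B.

```python
import torch
import torch.nn as nn

class EVAB(nn.Module):
    """Efficient visual attention block. The real inner structure (Fig. 2(b))
    combines LGCA and LEFN (Sec. II-B); the plain 3x3 convs here are placeholders
    and the residual wiring is an assumption."""
    def __init__(self, n_feats, lgca=None, lefn=None):
        super().__init__()
        self.lgca = lgca if lgca is not None else nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.lefn = lefn if lefn is not None else nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, x):
        x = x + self.lgca(x)   # attention sub-block with residual (assumed)
        x = x + self.lefn(x)   # feed-forward sub-block with residual (assumed)
        return x

class EvaSR(nn.Module):
    def __init__(self, n_blocks=13, n_feats=48, scale=4):
        super().__init__()
        # Shallow feature extraction: a single 3x3 convolution on the LR input.
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        # Deep feature extraction: a stack of EVABs followed by a 3x3 convolution.
        self.body = nn.Sequential(
            *[EVAB(n_feats) for _ in range(n_blocks)],
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        # Upscaling reconstruction: 3x3 convolution + pixel shuffle (ordering assumed).
        self.tail = nn.Sequential(
            nn.Conv2d(n_feats, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        f0 = self.head(lr)            # shallow features
        deep = self.body(f0) + f0     # deep features with residual from f0
        return self.tail(deep)        # SR output

# e.g. EvaSR()(torch.randn(1, 3, 48, 48)).shape -> torch.Size([1, 3, 192, 192])
```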

B. Efficient Visual Attention Block

Local-Global Convolutional Attention

The proposed LGCA is shown in Fig. 3(a). Haase et al. [18] reveal that DSC-based architectures implicitly rely on cross-kernel correlations. Inspired by this, we design a novel multi-branch depth-wise separable structure. Specifically, the depth-wise convolutions on the multiple branches share the point-wise convolutions on the trunk, where the former obtain local and global information while the latter capture relationships along the channel dimension. Subsequently, we use the generated features as attention maps to recalibrate the inputs of LGCA. The computational process can be expressed as follows: \begin{align*} & \mathrm{Attention} = \mathrm{Conv}_{p}\left(\sum\limits_{i = 0}^{3} \mathrm{Branch}_{i}\left(\mathrm{Conv}_{p}(F)\right)\right) \tag{1} \\ & \mathrm{Output} = \mathrm{Attention} \odot F \tag{2}\end{align*}

where F denotes the input feature, while Attention and Output indicate the attention map and the output feature, respectively. ⊙ is element-wise multiplication. Convp represents the 1 × 1 point-wise convolution, and Branchi, i ∈ {0, 1, 2, 3}, denotes the i-th branch as shown in Fig. 3(a). Branch0 consists of a 5 × 5 depth-wise convolution and a 5 × 5 depth-wise dilated convolution with a dilation of 3 to capture long-range dependencies. This combination achieves a large receptive field of 17 × 17 while saving a large number of parameters. Branch1 and Branch2 consist of paired depth-wise asymmetric convolutions with kernel sizes of 3 and 5, respectively, which are used to extract local structure and texture information. This design is inspired by traditional edge detection operators, such as the Sobel filter, which uses horizontal and vertical filters to extract local structure information in different directions. In addition, this lightweight design is more effective than the standard convolution. Finally, Branch3 is an identity mapping that allows feature reuse across adjacent layers. With this setting, the proposed LGCA encodes both local structure information and long-range dependencies, while enjoying the flexibility of adapting to the input content. More importantly, it maintains the simplicity and efficiency of ConvNets.
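To illustrate Eqs. (1) and (2), a minimal PyTorch sketch of LGCA is given below. The branch composition follows the description above, while the orientation order of the paired asymmetric convolutions (1 × k followed by k × 1) and the padding choices are assumptions rather than the authors' released configuration.

```python
import torch
import torch.nn as nn

class LGCA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, 1)   # point-wise Conv_p (channel mixing)
        # Branch 0: 5x5 depth-wise conv + 5x5 depth-wise dilated conv (dilation 3),
        # giving an effective 17x17 receptive field for long-range dependencies.
        self.branch0 = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 5, padding=6, dilation=3, groups=dim),
        )
        # Branches 1 and 2: paired horizontal/vertical depth-wise asymmetric convs
        # with kernel sizes 3 and 5, echoing Sobel-style edge filters.
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim, dim, (1, 3), padding=(0, 1), groups=dim),
            nn.Conv2d(dim, dim, (3, 1), padding=(1, 0), groups=dim),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(dim, dim, (1, 5), padding=(0, 2), groups=dim),
            nn.Conv2d(dim, dim, (5, 1), padding=(2, 0), groups=dim),
        )
        # Branch 3 is an identity mapping (feature reuse), applied in forward().
        self.proj_out = nn.Conv2d(dim, dim, 1)  # second point-wise Conv_p

    def forward(self, f):
        x = self.proj_in(f)
        attn = self.branch0(x) + self.branch1(x) + self.branch2(x) + x  # sum over branches
        attn = self.proj_out(attn)   # attention map, Eq. (1)
        return attn * f              # element-wise recalibration of the input, Eq. (2)
```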

TABLE II. Ablation study of LGCA and LEFN with different settings on Manga109 for ×2 scale. 1×1 Conv (A / B) indicates the (first / second) 1×1 convolution, respectively.

Locally Enhanced Feed-forward Network

As shown in Fig. 3(b), our LEFN adopts two 1 × 1 convolutions as projection layers, with one expanding the feature channels and the other shrinking them back to the original dimension. In addition, we incorporate a 3 × 3 depth-wise convolution to learn the local structure, while introducing the self-residual strategy [17] to overcome the drawbacks associated with depth-wise convolution. Mathematically, our LEFN can be formulated as: \begin{align*} & F_{e} = \mathrm{Conv}_{e}\left(F\right) \tag{3} \\ & \mathrm{Output} = \mathrm{Conv}_{c}\left(\phi\left(\mathrm{Conv}_{d}\left(F_{e}\right) + F_{e}\right)\right) \tag{4}\end{align*}

where F denotes the input feature. Conve and Convc are two 1 × 1 convolutions carrying out channel expansion and contraction, respectively, and Convd is the 3 × 3 depth-wise convolution. ϕ indicates the GELU activation [19]. Thus, LEFN specializes in capturing local content to preserve favorable image structure and texture. It collaborates with LGCA, leading to high-quality image reconstruction.
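The following is a minimal PyTorch sketch of LEFN under Eqs. (3) and (4); the channel expansion ratio (set to 2 here) is an assumption, since it is not specified in this section.

```python
import torch
import torch.nn as nn

class LEFN(nn.Module):
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)                              # Conv_e: channel expansion
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden) # 3x3 depth-wise conv
        self.act = nn.GELU()                                                 # phi in Eq. (4)
        self.contract = nn.Conv2d(hidden, dim, 1)                            # Conv_c: channel contraction

    def forward(self, f):
        fe = self.expand(f)                                  # Eq. (3)
        # Self-residual: add the expanded feature back onto the depth-wise output.
        return self.contract(self.act(self.dwconv(fe) + fe)) # Eq. (4)
```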

SECTION III.

Experiments

A. Datasets and Evaluation Metrics

Following [10], [20], [21], our model is trained on the DIV2K [22] and Flickr2K [23] datasets. The evaluation is performed on five public benchmark datasets: Set5 [24], Set14 [25], B100 [26], Urban100 [27], and Manga109 [28]. PSNR and SSIM are used as performance measures. Following standard practice, both metrics are computed on the Y channel of the transformed YCbCr space.
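For reference, the snippet below shows one common way to compute PSNR on the Y channel using standard BT.601 conversion coefficients; the border-cropping convention (removing `scale` pixels on each side) is an assumption about the exact evaluation protocol used here.

```python
import numpy as np

def rgb_to_y(img):
    """img: float RGB array in [0, 255], shape (H, W, 3). Returns the BT.601 Y channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.738 * r + 129.057 * g + 25.064 * b) / 256.0

def psnr_y(sr, hr, scale=4):
    """PSNR between SR and HR images on the Y channel, cropping `scale` border pixels."""
    sr_y = rgb_to_y(sr)[scale:-scale, scale:-scale]
    hr_y = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```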

B. Implementation Details

The proposed EvaSR and its tiny version EvaSR-S consist of 13 and 12 EVABs, with the number of channels set to 48 and 32, respectively. Data augmentation strategies in our experiments included horizontal and vertical flips and random rotations. For model optimization, the Adam optimizer was used with β1 = 0.9 and β2 = 0.99. The initial learning rate of EvaSR was set to 1 × 10⁻³ and reduced by half after 2 × 10⁵ iterations; a smaller initial learning rate of 5 × 10⁻⁴ was used for EvaSR-S. Both models were trained with the L1 loss for 10⁶ iterations. During training, the mini-batch size and input patch size of EvaSR were set to 64 and 48 × 48, while the corresponding settings for EvaSR-S were 128 and 64 × 64.
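This optimization setup can be summarized in a short training-loop sketch, shown below. The names `model` and `train_loader` are placeholders, and whether the learning rate is halved once or repeatedly every 2 × 10⁵ iterations is our reading of the schedule; the sketch assumes a repeated step decay.

```python
import torch

def train(model, train_loader, device="cuda", total_iters=1_000_000):
    # Adam with beta1=0.9, beta2=0.99 and initial LR 1e-3, as stated above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99))
    # Halve the learning rate every 2e5 iterations (assumed repeated decay).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
    criterion = torch.nn.L1Loss()
    model.to(device).train()
    it = 0
    while it < total_iters:
        for lr_img, hr_img in train_loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lr_img), hr_img)   # L1 loss between SR and HR
            loss.backward()
            optimizer.step()
            scheduler.step()                          # per-iteration LR schedule
            it += 1
            if it >= total_iters:
                break
```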

TABLE III. Quantitative comparison of different state-of-the-art methods. FLOPs are calculated based on an HR image with a resolution of 1280 × 720. The best and second-best performances are highlighted and underlined, respectively.

C. Ablation Study

Ablation Study of LGCA

We conducted ablation experiments to investigate the importance of the different components of LGCA. The experimental results in Table II show that all components of LGCA are essential for performance improvement. Specifically, the collaboration of the depth-wise convolution and the depth-wise dilated convolution captures long-range dependencies, and removing either of them leads to severe performance deterioration. Moreover, the combination of asymmetric convolutions extracts local structural information at different scales; excluding either Branch1 or Branch2 results in a 0.04 dB performance degradation. In addition, the two 1 × 1 convolutions perform channel mixing and cooperate with the attention mechanism to provide adaptivity in the channel dimension, which yields further performance gains.

Ablation Study of LEFN

Table II presents the effect of different components of LEFN on model performance. We observe that the depth-wise convolution brings a substantial performance improvement of 0.07 dB, which demonstrates that local structural information is critical for the SR task. The self-residual strategy also provides a 0.02 dB PSNR gain with barely any additional parameters or computational overhead. In addition, the inverted bottleneck structure prevalent in FFNs further pushes performance forward by 0.12 dB.

D. Comparison with State-of-the-art Methods

Quantitative Results

As shown in Table III, our EvaSR achieves the best results on almost all benchmarks across all upsampling scales. For instance, compared with RLFN [32], the champion of the NTIRE 2022 [21] efficient SR competition, the proposed EvaSR achieves an average improvement of 0.21 dB on four datasets (×4 scale) with 38% fewer parameters. With comparable computational complexity and parameters, our EvaSR exhibits significant performance improvements over BSRN [33] and ShuffleMixer [8]. Moreover, our EvaSR outperforms the advanced NGswin [34] with 64% fewer parameters.

TABLE IV. Comparison of different methods regarding the number of parameters, the FLOPs, and the average runtime.

Fig. 4. Visual comparisons for ×4 SR on the Urban100 and B100 datasets. The images produced by our EvaSR are more faithful to the ground truth than those of other competing methods.

Comparison with Tiny SR Methods

We further compared our EvaSR-S with other state-of-the-art tiny image SR methods, as shown in Table V. The comparison results show that our EvaSR-S achieves the best performance among all the algorithms. In particular, our method clearly outperforms BSRN-S [33], the champion of the model complexity sub-track in the NTIRE 2022 [21] efficient SR challenge. Note that our tiny EvaSR even surpasses several much larger models, which demonstrates the superiority of our EvaSR.

Comparison with Lite Transformer-based Methods

To further demonstrate the superiority of our method, we compare the proposed EvaSR with lightweight Transformer-based models. As shown in Table VI, our EvaSR surpasses the competing ESRT network with less than half of its parameters. Moreover, EvaSR achieves competitive performance compared to the prominent SwinIR-light at only about one third of the latter's parameters and computational cost. Compared to the recent DiVANet, EvaSR achieves overall better performance with 64% fewer parameters. These results show that, as a pure CNN architecture, EvaSR can compete favorably with advanced Transformer-based approaches.

Run Time and Model Complexity

To fully demonstrate the efficiency of our method, we further evaluated it against representative CNN-based and Transformer-based methods, including LAPAR-A [20], ShuffleMixer [8], SwinIR-light [10], and ESRT [13]. The average runtime of each method is measured for processing an image from the Urban100 dataset at ×2 scale. All experiments were run on an NVIDIA Tesla V100. The results are shown in Table IV. By using a pure CNN architecture, our EvaSR runs 10 and 5 times faster than SwinIR-light and ESRT, respectively. Our method is even more than twice as fast as the CNN-based LAPAR-A. Compared to ShuffleMixer, our method achieves a similar running speed with fewer parameters. These results show that our EvaSR achieves a promising trade-off among performance, model complexity, and inference speed.
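For illustration, the snippet below shows how such an average GPU runtime might be measured in PyTorch; it is our own sketch of a typical timing protocol, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def average_runtime(model, images, device="cuda", warmup=10):
    """images: list of LR tensors of shape (1, 3, H, W). Returns seconds per image."""
    model.to(device).eval()
    # Warm-up iterations so lazy CUDA initialization does not skew the timing.
    for img in images[:warmup]:
        model(img.to(device))
    torch.cuda.synchronize()
    start = time.time()
    for img in images:
        model(img.to(device))
    torch.cuda.synchronize()   # wait for all kernels before stopping the clock
    return (time.time() - start) / len(images)
```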

TABLE V. Quantitative comparison with state-of-the-art approaches for tiny image SR on benchmark datasets (×4).

TABLE VI. Quantitative comparison with lightweight Transformer-based methods on benchmark datasets.

Qualitative Results

In Fig. 4, we compare the reconstruction results achieved by different methods on Urban100 and B100 (×4). The qualitative comparison demonstrates that our EvaSR reconstructs more visually promising results. In particular, the competing algorithms tend to produce images with distorted structures and details, whereas our EvaSR produces accurate results that are more faithful to the ground truth.

SECTION IV.

Conclusion

Recently, vision Transformers have begun to overtake CNNs as the dominant architecture for various vision tasks. However, the quadratic complexity of self-attention is too expensive for low-level vision, which limits its practical application on edge devices. Our study shows that a well-designed CNN-based approach can compete favorably with Transformer-based models. After revisiting previous successful CNN-based and Transformer-based image SR models, we summarized several desirable properties that contribute to performance improvement. Based on these insights, we redesigned CNN-based visual attention and proposed a novel image SR network dubbed EvaSR. Experimental results show that EvaSR outperforms state-of-the-art algorithms with fewer parameters and lower computational overhead. We hope this research will facilitate further exploration of CNNs in image SR.

References
1.
Z. Wang, J. Chen and S. C. Hoi, "Deep learning for image super-resolution: A survey", IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3365-3387, 2020.
2.
Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong and Y. Fu, "Image super-resolution using very deep residual channel attention networks", Proceedings of the European conference on computer vision (ECCV), pp. 286-301, 2018.
3.
Y. Zhang, K. Li, K. Li, B. Zhong and Y. Fu, "Residual non-local attention networks for image restoration", ICLR, 2019.
4.
Y. Mei, Y. Fan and Y. Zhou, "Image super-resolution with non-local sparse attention", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3517-3526, 2021.
5.
C. Dong, C. C. Loy, K. He and X. Tang, "Image super-resolution using deep convolutional networks", IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295-307, 2015.
6.
B. Lim, S. Son, H. Kim, S. Nah and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136-144, 2017.
7.
Y. Zhang, D. Wei, C. Qin, H. Wang, H. Pfister and Y. Fu, "Context reasoning attention network for image super-resolution", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4278-4287, 2021.
8.
L. Sun et al., "Shufflemixer: An efficient convnet for image super-resolution", Advances in Neural Information Processing Systems, 2022.
9.
H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, et al., "Pre-trained image processing transformer", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299-12310, 2021.
10.
J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool and R. Timofte, "Swinir: Image restoration using swin transformer", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833-1844, 2021.
11.
M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng and S.-M. Hu, "Visual attention network", Computational Visual Media, 2023.
12.
S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, et al., "Learning enriched features for fast image restoration and enhancement", IEEE transactions on pattern analysis and machine intelligence, 2022.
13.
Z. Lu, H. Liu, J. Li and L. Zhang, "Efficient transformer for single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, 2022.
14.
L. Chen, X. Chu, X. Zhang and J. Sun, "Simple baselines for image restoration", Proceedings of the European conference on computer vision (ECCV), 2022.
15.
S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan and M.-H. Yang, "Restormer: Efficient transformer for high-resolution image restoration", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5728-5739, 2022.
16.
M.-H. Guo, T.-X. Xu, J.-J. Liu, Z.-N. Liu, P.-T. Jiang, T.-J. Mu, et al., "Attention mechanisms in computer vision: A survey", Computational Visual Media, pp. 1-38, 2022.
17.
B. Sun, Y. Zhang, S. Jiang and Y. Fu, "Hybrid pixel-unshuffled network for lightweight image super-resolution", AAAI, 2023.
18.
D. Haase and M. Amthor, "Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14600-14609, 2020.
19.
D. Hendrycks and K. Gimpel, "Gaussian error linear units (gelus)", 2016.
20.
W. Li, K. Zhou, L. Qi, N. Jiang, J. Lu and J. Jia, "Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond", Advances in Neural Information Processing Systems, vol. 33, pp. 20343-20355, 2020.
21.
Y. Li, K. Zhang, R. Timofte, L. Van Gool, F. Kong, M. Li, S. Liu, Z. Du, D. Liu, C. Zhou et al., "Ntire 2022 challenge on efficient super-resolution: Methods and results", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1062-1102, 2022.
22.
E. Agustsson and R. Timofte, "Ntire 2017 challenge on single image super-resolution: Dataset and study", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 126-135, 2017.
23.
R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang and L. Zhang, "Ntire 2017 challenge on single image super-resolution: Methods and results", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 114-125, 2017.
24.
M. Bevilacqua, A. Roumy, C. Guillemot and M. L. Alberi-Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding", BMVC, 2012.
25.
J. Yang, J. Wright, T. S. Huang and Y. Ma, "Image super-resolution via sparse representation", IEEE transactions on image processing, vol. 19, no. 11, pp. 2861-2873, 2010.
26.
D. Martin, C. Fowlkes, D. Tal and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics", Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, pp. 416-423, 2001.
27.
J.-B. Huang, A. Singh and N. Ahuja, "Single image super-resolution from transformed self-exemplars", Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197-5206, 2015.
28.
Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, et al., "Sketch-based manga retrieval using manga109 dataset", Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811-21838, 2017.
29.
J. Liu, J. Tang and G. Wu, "Residual feature distillation network for lightweight image super-resolution", European Conference on Computer Vision, pp. 41-55, 2020.
30.
L. Wang, X. Dong, Y. Wang, X. Ying, Z. Lin, W. An, et al., "Exploring sparsity in image super-resolution for efficient inference", Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4917-4926, 2021.
