EvaSR: Rethinking Efficient Visual Attention Design for Image Super-Resolution


Abstract:

Due to the advantages of long-range modeling via the self-attention mechanism, the Transformer has taken various vision tasks by storm, including image super-resolution (SR). In this study, we reveal that a convolutional neural network (CNN) with properly designed visual attention is a simpler and more effective paradigm than the Transformer for image SR. We reexamine successful SR models and identify several key characteristics that contribute to accurate image reconstruction. Building on this recipe, we propose a pure CNN-based SR network with efficient visual attention, dubbed EvaSR. Benefiting from the carefully designed visual attention, our EvaSR favorably captures both local structure and long-range dependencies, and achieves adaptivity in the spatial and channel dimensions while retaining the simplicity and efficiency of CNNs. Experimental results demonstrate that EvaSR achieves state-of-the-art performance among existing efficient SR methods. In particular, the tiny version of EvaSR requires only 21.4% and 15.2% of the parameters of IMDN and SMSR, respectively, while delivering better performance.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

SECTION I.

Introduction

Single-image super-resolution (SISR) is a long-standing computer vision task that aims to reconstruct high-resolution (HR) images from low-resolution (LR) counterparts [1]. With the rapid development of deep learning, the SR community has witnessed major shifts from model-based solutions to deep learning approaches [2]–​[4]. Following the pioneering SRCNN [5], convolutional neural network (CNN) based SR models have achieved remarkable performance [2], [6], [7]. Since these methods use the standard 3 × 3 kernel to hierarchically extract local features, they require deeper and more complex network topologies to enlarge the receptive field for performance improvement [1]. Thus, the required computational costs and memory consumption limit their practical applications on edge devices [8]. Recently, Transformer-based approaches have rapidly advanced the SR field, due to the superior ability of the self-attention (SA) mechanism in long-range modeling [9], [10]. However, the complexity of SA grows quadratically with the spatial resolution, which is unaffordable for low-level vision tasks. In addition, treating images as one-dimensional sequences ignores their two-dimensional properties, which is not conducive to learning the essential texture information from local structures [11]. This prompted us to develop a more efficient neural network for accurate image super-resolution.

Fig. 1. Comparison of our method with state-of-the-art methods in both performance and efficiency on Set5 for ×4 SR. The disk area indicates the number of parameters.

By revisiting previous successful image SR models, we summarize several key characteristics of different works, as shown in Table I. We argue that a successful SR model needs to satisfy the following properties: (i) Large receptive field. Since image super-resolution is a typical inverse problem, the network mainly needs to remove undesirable degradation from the image content, and this functionality is better incorporated into the network with the help of large receptive fields [12]. In addition, similar patches in an image serve as mutual references and can better restore each other's details when large receptive fields are available [13]. (ii) Local structure. Different from semantic-level vision tasks, image SR focuses more on fine image details, and therefore requires learning fundamental texture information from the local structure. (iii) Attention mechanism. Attention can be regarded as an adaptive recalibration of the input signal. In image SR, spatial attention allows the network to focus more on recovering complex texture regions, while channel attention effectively improves the feature representation capacity. (iv) Low computational complexity. This is critical for the SR task since it mainly handles high-resolution image reconstruction.

Based on these insights, we rethink efficient visual attention design for image SR and propose a novel lightweight network, dubbed EvaSR. Our approach adopts the ViT-style architecture [14]–[16], but introduces local-global convolutional attention (LGCA) and a locally enhanced feed-forward network (LEFN) to replace self-attention and the vanilla feed-forward network (FFN). 1) LGCA exploits large kernel decomposition [11] to achieve large receptive fields with greatly reduced parameters and limited computational overhead. In addition, inspired by classical edge detection operators, we design paired horizontal and vertical asymmetric convolutions to capture local texture information. Finally, we use the generated features as spatial and channel attention weights, triggering the attention mechanism via element-wise multiplication. 2) We strengthen the LEFN by incorporating a small depth-wise convolution with a self-residual strategy [17] to emphasize local structural information. With this design, the proposed EvaSR is able to capture both long-range dependencies and local information, perform adaptive recalibration in the spatial and channel dimensions, and ultimately direct the network to focus on fine image structure and textures. More importantly, EvaSR is a pure CNN architecture that circumvents the quadratic complexity of self-attention methods and facilitates practical deployment on edge devices. The proposed EvaSR achieves state-of-the-art performance, as shown in Fig. 1. The main contributions of this work include:

  • After comprehensively revisiting the previous representative SR models, we identify the key characteristics of these successful networks.

  • Based on these insights, we carefully design a novel efficient SR framework, termed EvaSR, which simultaneously absorbs the advantages of cheap convolutions, local information cues, long-range modeling, and spatial-channel adaptivity.

  • Experimental results show that our EvaSR achieves state-of-the-art performance with lower computational costs.

TABLE I. Desirable properties we summarized from different successful image SR methods. Our approach absorbs these superior characteristics.

Fig. 2. (a) The overall pipeline of our EvaSR for image SR. (b) The inner structure of EVAB.

SECTION II.

Proposed Method

A. Network Architecture

As shown in Fig. 2(a), the overall architecture of our EvaSR mainly consists of three parts: shallow feature extraction, deep feature extraction, and upscaling reconstruction.

Fig. 3. (a) Local-Global Convolutional Attention (LGCA) obtains local and global features while evoking the attention mechanism via element-wise multiplication. (b) Locally Enhanced Feed-forward Network (LEFN) performs local structural enhancement to focus on fundamental detail contents.

Specifically, we use a 3 × 3 convolution to extract shallow features f0 from the input LR image ILR. The shallow features are then fed into the deep feature extraction stage, a stack of efficient visual attention blocks (EVABs). We employ a 3 × 3 convolution at the tail of the deep feature extraction and add the residual f0 to the output features. Finally, the SR image is reconstructed from the resulting deep features using a pixel shuffle operator and a 3 × 3 convolution layer.
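To make the pipeline concrete, below is a minimal PyTorch sketch of this three-stage layout. The block count and channel width follow Sec. III-B, but the EVAB placeholders and their residual wiring are our assumptions for illustration; the actual LGCA and LEFN designs are described in Sec. II-B.

```python
import torch
import torch.nn as nn

class EVAB(nn.Module):
    """Efficient visual attention block. The real inner structure (Fig. 2(b))
    combines LGCA and LEFN (Sec. II-B); the plain 3x3 convs here are placeholders
    and the residual wiring is an assumption."""
    def __init__(self, n_feats, lgca=None, lefn=None):
        super().__init__()
        self.lgca = lgca if lgca is not None else nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.lefn = lefn if lefn is not None else nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, x):
        x = x + self.lgca(x)   # attention sub-block with residual (assumed)
        x = x + self.lefn(x)   # feed-forward sub-block with residual (assumed)
        return x

class EvaSR(nn.Module):
    def __init__(self, n_blocks=13, n_feats=48, scale=4):
        super().__init__()
        # Shallow feature extraction: a single 3x3 convolution on the LR input.
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        # Deep feature extraction: a stack of EVABs followed by a 3x3 convolution.
        self.body = nn.Sequential(
            *[EVAB(n_feats) for _ in range(n_blocks)],
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        # Upscaling reconstruction: 3x3 convolution + pixel shuffle (ordering assumed).
        self.tail = nn.Sequential(
            nn.Conv2d(n_feats, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        f0 = self.head(lr)            # shallow features
        deep = self.body(f0) + f0     # deep features with residual from f0
        return self.tail(deep)        # SR output

# e.g. EvaSR()(torch.randn(1, 3, 48, 48)).shape -> torch.Size([1, 3, 192, 192])
```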

B. Efficient Visual Attention Block

Local-Global Convolutional Attention

The proposed LGCA is shown in Fig. 3(a). Haase et al. [18] reveal that DSC-based architectures implicitly rely on cross-kernel correlations. Inspired by this, we design a novel multi-branch depth-wise separable structure. Specifically, the depth-wise convolutions on the multiple branches share the point-wise convolutions on the trunk, where the former obtain local and global information while the latter capture relationships along the channel dimension. Subsequently, we use the generated features as attention maps to recalibrate the inputs of LGCA. The computational process can be expressed as follows: \begin{align*} & \mathrm{Attention} = \mathrm{Conv}_{p}\left(\sum\limits_{i = 0}^{3} \mathrm{Branch}_{i}\left(\mathrm{Conv}_{p}(F)\right)\right) \tag{1} \\ & \mathrm{Output} = \mathrm{Attention} \odot F \tag{2}\end{align*}

where F denotes the input feature, while Attention and Output indicate the attention map and the output feature, respectively. ⊙ is element-wise multiplication. Convp represents the 1 × 1 point-wise convolution, and Branchi, i ∈ {0, 1, 2, 3}, denotes the i-th branch as shown in Fig. 3(a). Branch0 consists of a 5 × 5 depth-wise convolution and a 5 × 5 depth-wise dilated convolution with a dilation of 3 to capture long-range dependencies. This combination achieves a large receptive field of 17 × 17 while saving a large number of parameters. Branch1 and Branch2 consist of paired depth-wise asymmetric convolutions with kernel sizes of 3 and 5, respectively, which are used to extract local structure and texture information. This design is inspired by traditional edge detection operators, such as the Sobel filter, which uses horizontal and vertical filters to extract local structure information in different directions. In addition, this lightweight design is more effective than the standard convolution. Finally, Branch3 is an identity mapping that allows feature reuse across adjacent layers. With this setting, the proposed LGCA encodes both local structure information and long-range dependencies, while enjoying the flexibility of adapting to the input content. More importantly, it maintains the simplicity and efficiency of ConvNets.
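To illustrate Eqs. (1) and (2), a minimal PyTorch sketch of LGCA is given below. The branch composition follows the description above, while the orientation order of the paired asymmetric convolutions (1 × k followed by k × 1) and the padding choices are assumptions rather than the authors' released configuration.

```python
import torch
import torch.nn as nn

class LGCA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, 1)   # point-wise Conv_p (channel mixing)
        # Branch 0: 5x5 depth-wise conv + 5x5 depth-wise dilated conv (dilation 3),
        # giving an effective 17x17 receptive field for long-range dependencies.
        self.branch0 = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 5, padding=6, dilation=3, groups=dim),
        )
        # Branches 1 and 2: paired horizontal/vertical depth-wise asymmetric convs
        # with kernel sizes 3 and 5, echoing Sobel-style edge filters.
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim, dim, (1, 3), padding=(0, 1), groups=dim),
            nn.Conv2d(dim, dim, (3, 1), padding=(1, 0), groups=dim),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(dim, dim, (1, 5), padding=(0, 2), groups=dim),
            nn.Conv2d(dim, dim, (5, 1), padding=(2, 0), groups=dim),
        )
        # Branch 3 is an identity mapping (feature reuse), applied in forward().
        self.proj_out = nn.Conv2d(dim, dim, 1)  # second point-wise Conv_p

    def forward(self, f):
        x = self.proj_in(f)
        attn = self.branch0(x) + self.branch1(x) + self.branch2(x) + x  # sum over branches
        attn = self.proj_out(attn)   # attention map, Eq. (1)
        return attn * f              # element-wise recalibration of the input, Eq. (2)
```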

TABLE II. Ablation study of LGCA and LEFN with different settings on Manga109 for ×2 scale. 1×1 Conv (A / B) indicates the (first / second) 1×1 convolution, respectively.

Locally Enhanced Feed-forward Network

As shown in Fig. 3(b), our LEFN adopts two 1 × 1 convolutions as projection layers, with one expanding the feature channels and the other shrinking them back to the original dimension. In addition, we incorporate a 3 × 3 depth-wise convolution to learn the local structure, while introducing the self-residual strategy [17] to overcome the drawbacks associated with depth-wise convolution. Mathematically, our LEFN can be formulated as: \begin{align*} & F_{e} = \mathrm{Conv}_{e}\left(F\right) \tag{3} \\ & \mathrm{Output} = \mathrm{Conv}_{c}\left(\phi\left(\mathrm{Conv}_{d}\left(F_{e}\right) + F_{e}\right)\right) \tag{4}\end{align*}

where F denotes the input feature. Conve and Convc are two 1 × 1 convolutions carrying out channel expansion and contraction, respectively, and Convd is the 3 × 3 depth-wise convolution. ϕ indicates the GELU activation [19]. Thus, LEFN specializes in capturing local content to preserve favorable image structure and texture. It collaborates with LGCA, leading to high-quality image reconstruction.
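The following is a minimal PyTorch sketch of LEFN under Eqs. (3) and (4); the channel expansion ratio (set to 2 here) is an assumption, since it is not specified in this section.

```python
import torch
import torch.nn as nn

class LEFN(nn.Module):
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)                              # Conv_e: channel expansion
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden) # 3x3 depth-wise conv
        self.act = nn.GELU()                                                 # phi in Eq. (4)
        self.contract = nn.Conv2d(hidden, dim, 1)                            # Conv_c: channel contraction

    def forward(self, f):
        fe = self.expand(f)                                  # Eq. (3)
        # Self-residual: add the expanded feature back onto the depth-wise output.
        return self.contract(self.act(self.dwconv(fe) + fe)) # Eq. (4)
```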

SECTION III.

Experiments

A. Datasets and Evaluation Metrics

Following [10], [20], [21], our model is trained on the DIV2K [22] and Flickr2K [23] datasets. The evaluation is performed on five public benchmark datasets: Set5 [24], Set14 [25], B100 [26], Urban100 [27], and Manga109 [28]. PSNR and SSIM are used as performance measures. Following standard practice, both metrics are computed on the Y channel of the transformed YCbCr space.
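For reference, the snippet below shows one common way to compute PSNR on the Y channel using standard BT.601 conversion coefficients; the border-cropping convention (removing `scale` pixels on each side) is an assumption about the exact evaluation protocol used here.

```python
import numpy as np

def rgb_to_y(img):
    """img: float RGB array in [0, 255], shape (H, W, 3). Returns the BT.601 Y channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.738 * r + 129.057 * g + 25.064 * b) / 256.0

def psnr_y(sr, hr, scale=4):
    """PSNR between SR and HR images on the Y channel, cropping `scale` border pixels."""
    sr_y = rgb_to_y(sr)[scale:-scale, scale:-scale]
    hr_y = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```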

B. Implementation Details

The proposed EvaSR and its tiny version EvaSR-S consist of 13 and 12 EVABs, with the number of channels set to 48 and 32, respectively. Data augmentation strategies in our experiments included horizontal and vertical flips and random rotations. For model optimization, the Adam optimizer was used with β1 = 0.9 and β2 = 0.99. The initial learning rate of EvaSR was set to 1 × 10⁻³ and reduced by half after 2 × 10⁵ iterations; a smaller initial learning rate of 5 × 10⁻⁴ was used for EvaSR-S. Both models were trained with the L1 loss for 10⁶ iterations. During training, the mini-batch size and input patch size of EvaSR were set to 64 and 48 × 48, while the corresponding settings for EvaSR-S were 128 and 64 × 64.
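This optimization setup can be summarized in a short training-loop sketch, shown below. The names `model` and `train_loader` are placeholders, and whether the learning rate is halved once or repeatedly every 2 × 10⁵ iterations is our reading of the schedule; the sketch assumes a repeated step decay.

```python
import torch

def train(model, train_loader, device="cuda", total_iters=1_000_000):
    # Adam with beta1=0.9, beta2=0.99 and initial LR 1e-3, as stated above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99))
    # Halve the learning rate every 2e5 iterations (assumed repeated decay).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
    criterion = torch.nn.L1Loss()
    model.to(device).train()
    it = 0
    while it < total_iters:
        for lr_img, hr_img in train_loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lr_img), hr_img)   # L1 loss between SR and HR
            loss.backward()
            optimizer.step()
            scheduler.step()                          # per-iteration LR schedule
            it += 1
            if it >= total_iters:
                break
```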

TABLE III. Quantitative comparison of different state-of-the-art methods. FLOPs are calculated based on an HR image with a resolution of 1280 × 720. The best and second-best performances are highlighted and underlined, respectively.

C. Ablation Study

Ablation Study of LGCA

We conducted ablation experiments to investigate the importance of the different components of LGCA. The experimental results in Table II show that all components of LGCA are essential for performance improvement. Specifically, the collaboration of the depth-wise convolution and the depth-wise dilated convolution captures long-range dependencies, and removing either of them leads to severe performance deterioration. Moreover, the combination of asymmetric convolutions extracts local structural information at different scales; excluding either Branch1 or Branch2 results in a 0.04 dB performance degradation. In addition, the two 1 × 1 convolutions perform channel mixing and cooperate with the attention mechanism to provide adaptivity in the channel dimension, which yields further performance gains.

Ablation Study of LEFN

Table II presents the effect of different components of LEFN on model performance. We observe that the depth-wise convolution brings a substantial performance improvement of 0.07 dB, which demonstrates that local structural information is critical for the SR task. The self-residual strategy also provides a 0.02 dB PSNR gain with barely any additional parameters or computational overhead. In addition, the inverted bottleneck structure prevalent in FFNs further pushes performance forward by 0.12 dB.

D. Comparison with State-of-the-art Methods

Quantitative Results

As shown in Table III, our EvaSR achieves the best results on almost all benchmarks across all upsampling scales. For instance, compared with RLFN [32], the champion of the NTIRE 2022 [21] efficient SR competition, the proposed EvaSR achieves an average improvement of 0.21 dB on four datasets (×4 scale) with 38% fewer parameters. With comparable computational complexity and parameters, our EvaSR exhibits significant performance improvements over BSRN [33] and ShuffleMixer [8]. Moreover, our EvaSR outperforms the advanced NGswin [34] with 64% fewer parameters.

TABLE IV. Comparison of different methods regarding the number of parameters, the FLOPs, and the average runtime.

Fig. 4. Visual comparisons for ×4 SR on the Urban100 and B100 datasets. The images produced by our EvaSR are more faithful to the ground truth than those of other competing methods.

Comparison with Tiny SR Methods

We further compared our EvaSR-S with other state-of-the-art tiny image SR methods, as shown in Table V. The comparison results show that our EvaSR-S achieves the best performance among all the algorithms. In particular, our method clearly outperforms BSRN-S [33], the champion of the model complexity sub-track in the NTIRE 2022 [21] efficient SR challenge. Note that our tiny EvaSR even surpasses several much larger models, which demonstrates the superiority of our EvaSR.

Comparison with Lite Transformer-based Methods

To further demonstrate the superiority of our method, we compare the proposed EvaSR with lightweight Transformer-based models. As shown in Table VI, our EvaSR surpasses the competing ESRT network with less than half of its parameters. Moreover, EvaSR achieves competitive performance compared to the prominent SwinIR-light at only about one third of the latter's parameters and computational cost. Compared to the recent DiVANet, EvaSR achieves overall better performance with 64% fewer parameters. These results show that, as a pure CNN architecture, EvaSR can compete favorably with advanced Transformer-based approaches.

Run Time and Model Complexity

To fully demonstrate the efficiency of our method, we further evaluated it against representative CNN-based and Transformer-based methods, including LAPAR-A [20], ShuffleMixer [8], SwinIR-light [10], and ESRT [13]. The average runtime of each method is measured for processing an image from the Urban100 dataset at ×2 scale. All experiments were run on an NVIDIA Tesla V100. The results are shown in Table IV. By using a pure CNN architecture, our EvaSR runs 10 and 5 times faster than SwinIR-light and ESRT, respectively. Our method is even more than twice as fast as the CNN-based LAPAR-A. Compared to ShuffleMixer, our method achieves a similar running speed with fewer parameters. These results show that our EvaSR achieves a promising trade-off among performance, model complexity, and inference speed.
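For illustration, the snippet below shows how such an average GPU runtime might be measured in PyTorch; it is our own sketch of a typical timing protocol, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def average_runtime(model, images, device="cuda", warmup=10):
    """images: list of LR tensors of shape (1, 3, H, W). Returns seconds per image."""
    model.to(device).eval()
    # Warm-up iterations so lazy CUDA initialization does not skew the timing.
    for img in images[:warmup]:
        model(img.to(device))
    torch.cuda.synchronize()
    start = time.time()
    for img in images:
        model(img.to(device))
    torch.cuda.synchronize()   # wait for all kernels before stopping the clock
    return (time.time() - start) / len(images)
```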

TABLE V. Quantitative comparison with state-of-the-art approaches for tiny image SR on benchmark datasets (×4).

TABLE VI. Quantitative comparison with lightweight Transformer-based methods on benchmark datasets.

Qualitative Results

In Fig. 4, we compare the reconstruction results achieved by different methods on Urban100 and B100 (×4). The qualitative comparison demonstrates that our EvaSR reconstructs more visually promising results. In particular, the competing algorithms tend to produce images with distorted structures and details, whereas our EvaSR produces accurate results that are more faithful to the ground truth.

SECTION IV.

Conclusion

Recently, vision Transformers have begun to overtake CNNs as the dominant architecture for various vision tasks. However, the quadratic complexity of self-attention is too expensive for low-level vision, which limits its practical application on edge devices. Our study shows that a well-designed CNN-based approach can compete favorably with Transformer-based models. After revisiting previous successful CNN-based and Transformer-based image SR models, we summarized several desirable properties that contribute to performance improvement. Based on these insights, we redesigned CNN-based visual attention and proposed a novel image SR network dubbed EvaSR. Experimental results show that EvaSR outperforms state-of-the-art algorithms with fewer parameters and lower computational overhead. We hope this research will facilitate further exploration of CNNs in image SR.

References
1.
Z. Wang, J. Chen and S. C. Hoi, "Deep learning for image super-resolution: A survey", IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3365-3387, 2020.
2.
Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong and Y. Fu, "Image super-resolution using very deep residual channel attention networks", Proceedings of the European conference on computer vision (ECCV), pp. 286-301, 2018.
3.
Y. Zhang, K. Li, K. Li, B. Zhong and Y. Fu, "Residual non-local attention networks for image restoration", ICLR, 2019.
4.
Y. Mei, Y. Fan and Y. Zhou, "Image super-resolution with non-local sparse attention", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3517-3526, 2021.
5.
C. Dong, C. C. Loy, K. He and X. Tang, "Image super-resolution using deep convolutional networks", IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295-307, 2015.
6.
B. Lim, S. Son, H. Kim, S. Nah and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136-144, 2017.
7.
Y. Zhang, D. Wei, C. Qin, H. Wang, H. Pfister and Y. Fu, "Context reasoning attention network for image super-resolution", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4278-4287, 2021.
8.
L. Sun et al., "Shufflemixer: An efficient convnet for image super-resolution", Advances in Neural Information Processing Systems, 2022.
9.
H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, et al., "Pre-trained image processing transformer", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299-12310, 2021.
10.
J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool and R. Timofte, "Swinir: Image restoration using swin transformer", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833-1844, 2021.
11.
M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng and S.-M. Hu, "Visual attention network", Computational Visual Media, 2023.
12.
S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, et al., "Learning enriched features for fast image restoration and enhancement", IEEE transactions on pattern analysis and machine intelligence, 2022.
13.
Z. Lu, H. Liu, J. Li and L. Zhang, "Efficient transformer for single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, 2022.
14.
L. Chen, X. Chu, X. Zhang and J. Sun, "Simple baselines for image restoration", Proceedings of the European conference on computer vision (ECCV), 2022.
15.
S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan and M.-H. Yang, "Restormer: Efficient transformer for high-resolution image restoration", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5728-5739, 2022.
16.
M.-H. Guo, T.-X. Xu, J.-J. Liu, Z.-N. Liu, P.-T. Jiang, T.-J. Mu, et al., "Attention mechanisms in computer vision: A survey", Computational Visual Media, pp. 1-38, 2022.
17.
B. Sun, Y. Zhang, S. Jiang and Y. Fu, "Hybrid pixel-unshuffled network for lightweight image super-resolution", AAAI, 2023.
18.
D. Haase and M. Amthor, "Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14600-14609, 2020.
19.
D. Hendrycks and K. Gimpel, "Gaussian error linear units (gelus)", 2016.
20.
W. Li, K. Zhou, L. Qi, N. Jiang, J. Lu and J. Jia, "Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond", Advances in Neural Information Processing Systems, vol. 33, pp. 20343-20355, 2020.
21.
Y. Li, K. Zhang, R. Timofte, L. Van Gool, F. Kong, M. Li, S. Liu, Z. Du, D. Liu, C. Zhou et al., "Ntire 2022 challenge on efficient super-resolution: Methods and results", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1062-1102, 2022.
22.
E. Agustsson and R. Timofte, "Ntire 2017 challenge on single image super-resolution: Dataset and study", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 126-135, 2017.
23.
R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang and L. Zhang, "Ntire 2017 challenge on single image super-resolution: Methods and results", Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 114-125, 2017.
24.
M. Bevilacqua, A. Roumy, C. Guillemot and M. L. Alberi-Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding", BMVC, 2012.
25.
J. Yang, J. Wright, T. S. Huang and Y. Ma, "Image super-resolution via sparse representation", IEEE transactions on image processing, vol. 19, no. 11, pp. 2861-2873, 2010.
26.
D. Martin, C. Fowlkes, D. Tal and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics", Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, pp. 416-423, 2001.
27.
J.-B. Huang, A. Singh and N. Ahuja, "Single image super-resolution from transformed self-exemplars", Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197-5206, 2015.
28.
Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, et al., "Sketch-based manga retrieval using manga109 dataset", Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811-21838, 2017.
29.
J. Liu, J. Tang and G. Wu, "Residual feature distillation network for lightweight image super-resolution", European Conference on Computer Vision, pp. 41-55, 2020.
30.
L. Wang, X. Dong, Y. Wang, X. Ying, Z. Lin, W. An, et al., "Exploring sparsity in image super-resolution for efficient inference", Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4917-4926, 2021.
