
Long-Range Multi-Scale Fusion for Efficient Single Image Super-Resolution



Abstract:

Improving the performance of single image super-resolution (SISR) by extending the effective receptive field (ERF) of the model has become a popular paradigm in the field due to the universal self-similarity prior of natural images. However, solely increasing the ERF to capture long-range dependencies cannot fully exploit the model's capability, as non-local self-similarity is typically multi-scale and cross-scale. To this end, a Long-range Multi-scale Fusion Network (LMFN) is devised in this work to simultaneously exploit both long-range and multi-scale priors in images, as well as the interaction between the two. Within the same scale, our model employs large kernel attention (LKA) and multi-scale modulation (MSM) to learn long-range and multi-scale features. To exploit the interaction between long-range and multi-scale dependencies within a single scale and across scales, we design an Interactive Fusion Modulation (IFM) module for the effective fusion of non-local and multi-scale features. Extensive experiments on benchmark datasets illustrate the significant superiority of the proposed LMFN over advanced SISR models.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India


SECTION 1.

Introduction

Single image super-resolution (SISR) is a long-standing task in low-level computer vision, which aims at reconstructing a high-resolution (HR) image from a single low-resolution (LR) counterpart. A straightforward and intuitive way to promote the performance of deep SISR models is to increase network depth and model scale. However, this strategy restricts their applicability in resource-constrained scenarios, and increasing network depth or model scale indiscriminately may result in issues like gradient vanishing. Therefore, an important branch in this field is the exploration of lightweight and efficient super-resolution (SR) models tailored for practical scenarios and application deployment, striking a good balance between model performance and overhead.

One promising approach to improve SR performance with constrained model capacity is to incorporate image priors into model inference. In deep learning-based SISR models, self-similarity [2] is a common prior explored through non-local or global attention. In general, making full use of the self-similarity of an image helps to expand the effective receptive field (ERF) of the model, thereby facilitating the extraction of long-range dependencies [3]. This has also established a popular paradigm for boosting SISR performance by increasing the ERF [4], [5]. However, image self-similarity is typically sparse and multi-scale [2], [6], [7], so solely and exhaustively modeling long-range dependencies cannot fully exploit the model's capability.

Fig. 1. The trade-off comparison between our LMFN and other lightweight SISR models w.r.t. model performance and overhead on Manga109 [1] for SR×4. The size of the circles indicates the scale of model parameters.

Fig. 2. The network topology of the proposed LMFN. We take LKDN [8] as the backbone, and the framework of the building module (b) is instantiated by different modulation cells (c) ~ (e) via replacing the placeholder, i.e., (b) ↢ (c) ≜ MSM-M and (b) ↢ (e) ≜ IFM-M. It is worth noting that LKA-M is obtained by instantiating (b) with LKA [9], and IFF (d) is a structural component of IFM (e).

Fig. 3. Our intuition is that fusing long-range and multi-scale features could be expected to compensate and promote each other. (a) and (b): Diagrammatic sketches of long-range and multi-scale features. (c) Multi-scale prior makes up for long-range feature learning. (d) Multi-scale feature learning could be enabled in larger ranges by long-range modeling.

In view of the multi-scale nature and sparsity of non-local self-similarity, this work seeks to provide efficient SISR models with the ability to learn multi-scale and non-local dependencies. Specifically, we present a Long-range Multi-scale Fusion Network (LMFN) to improve the trade-off between performance and overhead of efficient SISR models. We use large kernel attention (LKA) [9] to capture sparse non-local features, which is primarily achieved by the dilated depth-wise convolution. Meanwhile, multi-scale information is acquired via a Multi-Scale Modulation (MSM) module equipped with dynamic snake convolution [10], which adapts well to multi-scale nonlinear features in images. To fuse non-local and multi-scale features effectively, we also design an Interactive Fusion Modulation (IFM) module based on the operations of channel splitting and merge-and-run. These structural parts are integrated into a unified building-module framework and periodically stacked to elevate the synergy between multi-scale and sparse non-local features, helping our model compromise gracefully between SR results and overhead, as shown in Fig. 1.

SECTION 2.

METHODOLOGY

2.1. Motivation

The overall structure of our LMFN is shown in Fig. 2. Using LKA alone to model long-range dependencies not only prevents the model from learning multi-scale features, but also fails to effectively extract local features, as shown in Fig. 3(a). Similarly, solely learning multi-scale features leaves the model incapable of modeling non-local self-similarity.

Fig. 4. Several schemes for the core component of our IFF.

To learn long-range features across scales and perform multi-scale feature learning over longer ranges, the proposed LMFN is endowed with the capability to combine long-range and multi-scale dependencies from two aspects. First, in the backbone of the model, LKA-M and MSM-M are periodically stacked, allowing for the sequential extraction of long-range and multi-scale features. Second, within the IFM-M, we explicitly integrate the two kinds of features with a simple merge-and-run. Both of these strategies contribute to effective feature fusion.

2.2. Long-Range and Multi-Scale Modulation

Given a temporary feature $\mathbf{x}_t$, the process to generate the output feature $\mathbf{y}_t$ with LKA [9] can be formulated as:
\begin{equation*}{\mathbf{y}}_t = \operatorname{DWDConv}\left(\operatorname{DWConv}\left(\operatorname{PWConv}\left({\mathbf{x}}_t\right)\right)\right) \odot {\mathbf{x}}_t,\tag{1}\end{equation*}
where PWConv(·) denotes the point-wise convolution with kernel size 1×1, DWConv(·) stands for the depth-wise convolution with kernel size 5×5, and DWDConv(·) is a dilated depth-wise convolution with kernel size 5×5. The symbol ⊙ indicates the element-wise product.
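For concreteness, a minimal PyTorch sketch of Eq. (1) is given below. The class name, the channel argument `dim`, and the dilation rate of the dilated depth-wise convolution are illustrative assumptions on our part, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Minimal sketch of large kernel attention following Eq. (1).
    Kernel sizes (5x5 DW, 5x5 dilated DW, 1x1 PW) follow the text;
    the dilation rate is an assumption."""
    def __init__(self, dim: int, dilation: int = 3):
        super().__init__()
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)                         # PWConv
        self.dw = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)  # DWConv 5x5
        self.dwd = nn.Conv2d(dim, dim, kernel_size=5, padding=2 * dilation,
                             dilation=dilation, groups=dim)                  # dilated DWConv 5x5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dwd(self.dw(self.pw(x)))  # attention map over a large receptive field
        return attn * x                       # element-wise modulation (the Hadamard product in Eq. (1))
```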

For multi-scale feature learning, we first split a temporary feature $\mathbf{x}_t \in \mathbb{R}^{h \times w \times c}$ into four parts, where $h$, $w$ and $c$ are the height, width and number of channels of $\mathbf{x}_t$, respectively:
\begin{equation*}\left[{\mathbf{x}}_0, {\mathbf{x}}_1, {\mathbf{x}}_2, {\mathbf{x}}_3\right] = \operatorname{Split}\left({\mathbf{x}}_t\right),\tag{2}\end{equation*}
where Split(·) stands for the operation of evenly splitting $\mathbf{x}_t$ along the channel direction, and $\mathbf{x}_i \in \mathbb{R}^{h \times w \times c/4}$ denotes the $i$-th sub-feature ($i = 0,\ldots,3$). Then we extract multi-scale features within down-sampled feature spaces using dynamic snake convolutions (DSConv) [10], as exhibited in Fig. 2(c). Formally, the procedure can be formulated as:
\begin{equation*}{\tilde{\mathbf{x}}}_i = \mathrm{U}_{\uparrow}\left(\operatorname{DSConv}\left(\mathrm{D}_{\downarrow}\left({\mathbf{x}}_i, 2^i\right)\right), 2^i\right), \quad i = 0,\ldots,3,\tag{3}\end{equation*}
where the DSConv has a kernel size of 3×3, and $\mathrm{U}_{\uparrow}(\cdot,\cdot)$ and $\mathrm{D}_{\downarrow}(\cdot,\cdot)$ represent the up-sampling and down-sampling operations, respectively, with $2^i$ being the scale factor of the $i$-th branch. Next, these sub-features ${\tilde{\mathbf{x}}}_i$ are concatenated to generate a single intermediate feature ${\tilde{\mathbf{x}}}_t$:
\begin{equation*}{\tilde{\mathbf{x}}}_t = \operatorname{Concat}\left({\tilde{\mathbf{x}}}_0, {\tilde{\mathbf{x}}}_1, {\tilde{\mathbf{x}}}_2, {\tilde{\mathbf{x}}}_3\right).\tag{4}\end{equation*}

To obtain the final output of MSM, we fuse these separately processed sub-features with a 1×1 point-wise convolution followed by a GELU activation and a Hadamard product with the original input:
\begin{equation*}{\mathbf{y}}_t = \operatorname{GeLU}\left(\operatorname{PWConv}\left({\tilde{\mathbf{x}}}_t\right)\right) \odot {\mathbf{x}}_t.\tag{5}\end{equation*}
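A minimal sketch of the MSM branch structure of Eqs. (2)-(5) is shown below, under stated assumptions: a plain 3×3 depth-wise convolution stands in for the dynamic snake convolution of [10] (which needs its own implementation), and bilinear resizing stands in for the unspecified up- and down-sampling operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSM(nn.Module):
    """Minimal sketch of multi-scale modulation (Eqs. (2)-(5)).
    The per-branch conv is a depth-wise 3x3 stand-in for DSConv [10]."""
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 4 == 0, "channels are split evenly into four branches"
        self.convs = nn.ModuleList(
            nn.Conv2d(dim // 4, dim // 4, 3, padding=1, groups=dim // 4) for _ in range(4)
        )
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        parts = torch.chunk(x, 4, dim=1)                       # Eq. (2): channel split
        outs = []
        for i, (xi, conv) in enumerate(zip(parts, self.convs)):
            s = 2 ** i                                         # scale factor of the i-th branch
            yi = F.interpolate(xi, scale_factor=1 / s, mode="bilinear") if s > 1 else xi
            yi = conv(yi)                                      # Eq. (3): conv in the down-sampled space
            yi = F.interpolate(yi, size=(h, w), mode="bilinear") if s > 1 else yi
            outs.append(yi)
        xt = torch.cat(outs, dim=1)                            # Eq. (4): concatenation
        return self.act(self.pw(xt)) * x                       # Eq. (5): modulation of the input
```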

Incorporating MSM with LKA can enable multi-scale feature learning over wider ranges, as shown in Fig. 3(d).

2.3. Interactive Fusion Modulation

To further integrate long-range and multi-scale features and adequately explore the synergies between them, we design a block cell for interactive fusion modulation (IFM) equipped with an interactive feature fusion (IFF), as shown in Fig. 2(d) and (e).

Given a temporary feature $\mathbf{x}_t$, our IFF first splits it evenly into two sub-features $\mathbf{x}_l$ and $\mathbf{x}_m$, i.e., $[\mathbf{x}_l, \mathbf{x}_m] = \operatorname{Split}(\mathbf{x}_t)$, and then fuses them with the merge-and-run [11]:
\begin{align*} & \left[\begin{array}{c} {\mathbf{z}}_l \\ {\mathbf{z}}_m \end{array}\right] = \left[\begin{array}{c} \operatorname{LKA}\left({\mathbf{x}}_l\right) \\ \operatorname{LKA}\left({\mathbf{x}}_m\right) \end{array}\right] + \frac{1}{2}\left[\begin{array}{cc} {\mathbf{I}} & {\mathbf{I}} \\ {\mathbf{I}} & {\mathbf{I}} \end{array}\right]\left[\begin{array}{c} {{\mathbf{x}}_l} \\ {{\mathbf{x}}_m} \end{array}\right],\tag{6} \\ & \left[\begin{array}{c} {\tilde{\mathbf{x}}}_l \\ {\tilde{\mathbf{x}}}_m \end{array}\right] = \left[\begin{array}{c} \operatorname{MSM}\left({\mathbf{z}}_l\right) \\ \operatorname{MSM}\left({\mathbf{z}}_m\right) \end{array}\right] + \frac{1}{2}\left[\begin{array}{cc} {\mathbf{I}} & {\mathbf{I}} \\ {\mathbf{I}} & {\mathbf{I}} \end{array}\right]\left[\begin{array}{c} {{\mathbf{z}}_l} \\ {{\mathbf{z}}_m} \end{array}\right],\tag{7}\end{align*}
where $\mathbf{I}$ denotes the identity matrix. The final output of our IFF is obtained by:
\begin{equation*}{\mathbf{y}}_t = \operatorname{PWConv}\left(\operatorname{Concat}\left({\tilde{\mathbf{x}}}_l, {\tilde{\mathbf{x}}}_m\right)\right) + {\mathbf{x}}_t.\tag{8}\end{equation*}
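The merge-and-run fusion of Eqs. (6)-(8) can be sketched as follows, reusing the LKA and MSM sketches above; we assume the channel count is divisible by 8 so that each half can be split again into four branches inside MSM.

```python
import torch
import torch.nn as nn

class IFF(nn.Module):
    """Minimal sketch of interactive feature fusion (Eqs. (6)-(8)).
    The merge-and-run shortcut adds the average of both branch inputs
    back to each branch output."""
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 8 == 0
        half = dim // 2
        self.lka_l, self.lka_m = LKA(half), LKA(half)
        self.msm_l, self.msm_m = MSM(half), MSM(half)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xl, xm = torch.chunk(x, 2, dim=1)                           # even channel split
        merge = 0.5 * (xl + xm)                                     # merge-and-run shortcut
        zl, zm = self.lka_l(xl) + merge, self.lka_m(xm) + merge     # Eq. (6): long-range stage
        merge = 0.5 * (zl + zm)
        tl, tm = self.msm_l(zl) + merge, self.msm_m(zm) + merge     # Eq. (7): multi-scale stage
        return self.pw(torch.cat([tl, tm], dim=1)) + x              # Eq. (8): fusion + residual
```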

Table 1. Ablation study of model backbone configurations on Manga109 [1] with SR×4.
Table 2. The comparison between IFF variants on Manga109 [1] with SR×4, corresponding to Fig. 4.

Within the IFM, our IFF is used to generate the weights for fusing long-range and multi-scale dependencies. Assuming the input feature of IFM is still $\mathbf{x}_t$, IFM generates the sub-features for IFF with the following procedure:
\begin{equation*}\left[{\mathbf{x}}_t^1, {\mathbf{x}}_t^2\right] = \operatorname{Split}\left(\operatorname{PWConv}\left(\operatorname{LayerNorm}\left({\mathbf{x}}_t\right)\right)\right),\tag{9}\end{equation*}
where ${\mathbf{x}}_t^2$ is fed into IFF for weight generation, so the output of our IFM is produced by:
\begin{equation*}{\mathbf{y}}_t = \operatorname{PWConv}\left({\mathbf{x}}_t^1 \odot \operatorname{IFF}\left({\mathbf{x}}_t^2\right)\right) + {\mathbf{x}}_t.\tag{10}\end{equation*}

We use the merge-and-run to fuse long-range and multi-scale features, which is expected to promote the synergy between the self-similarity and multi-scale priors. Moreover, we use the integrated features primarily for weight generation, rather than for direct feature learning as in CSN [11].
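A sketch of the modulation path of Eqs. (9)-(10) is given below. The channel expansion of the first point-wise convolution and the channel-last LayerNorm layout are assumptions on our part; the text does not specify them.

```python
import torch
import torch.nn as nn

class IFM(nn.Module):
    """Minimal sketch of interactive fusion modulation (Eqs. (9)-(10)):
    the IFF output modulates the other split as a weight map, instead of
    being used directly as a learned feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                     # applied channel-last (assumption)
        self.pw_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)  # expansion ratio is an assumption
        self.iff = IFF(dim)
        self.pw_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LayerNorm over channels
        x1, x2 = torch.chunk(self.pw_in(y), 2, dim=1)             # Eq. (9): two sub-features
        return self.pw_out(x1 * self.iff(x2)) + x                 # Eq. (10): modulation + residual
```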

SECTION 3.

EXPERIMENTAL RESULTS

3.1. Datasets and Metrics

Our models are trained with the DF2K dataset (DIV2K [12] + Flickr2K [13]). We also use five benchmark datasets for testing: Set5 [14], Set14 [15], B100 [16], Urban100 [17] and Manga109 [1]. SR results are evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [18] on the Y channel of the YCbCr color space.
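For reference, the sketch below computes PSNR on the Y channel of 8-bit images using the BT.601 RGB-to-Y conversion commonly used in SISR benchmarks; cropping the image border by the scale factor is our assumption about the evaluation protocol.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luma (Y) channel of an RGB image with values in [0, 255] (BT.601)."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, crop: int = 4) -> float:
    """PSNR on the Y channel; `crop` pixels are removed from each border
    (typically equal to the SR scale factor)."""
    y_sr, y_hr = rgb_to_y(sr.astype(np.float64)), rgb_to_y(hr.astype(np.float64))
    if crop > 0:
        y_sr, y_hr = y_sr[crop:-crop, crop:-crop], y_hr[crop:-crop, crop:-crop]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```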

3.2. Implementation Details

We randomly crop 64 patches of 48×48 pixels from LR images as the basic training inputs. During model training, data augmentation is performed on the input patches with random horizontal flips and rotations. We train our models using the common ℒ1 loss and the Adan optimizer, and set the exponential moving average (EMA) decay to 0.999 to stabilize training. The learning rate is kept constant at 5 × 10−3 for 2 × 106 training iterations. All experiments are conducted with the PyTorch framework on an NVIDIA A100 GPU.
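The weight EMA mentioned above can be maintained with a few lines of PyTorch; this is a generic sketch with decay 0.999, and calling `update` once per optimizer step is our assumption about the schedule (the Adan optimizer itself comes from an external package and is not shown).

```python
import copy
import torch

class EMA:
    """Generic exponential moving average of model weights (decay 0.999)."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # averaged copy used for evaluation
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow = decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```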

Table 3. Comparisons between lightweight SISR models on benchmark datasets. FLOPs are measured for an HR image of 1280×720 pixels.

3.3. Ablation Studies

We will illustrate the effectiveness of fusing long-range and multi-scale features from two aspects, i.e., the modules in the backbone and the IFF in the IFM.

Model Backbone: We perform an ablation study by replacing the placeholder in LMFN. The results in Table 1 show that the combination used in LMFN achieves the best trade-off between model performance and overhead.

IFF in IFM: We conduct an ablation study by changing how the long-range and multi-scale features are integrated. The results in Table 2 illustrate that the strategies shown in Fig. 4(c) and Fig. 4(d) achieve the most competitive results.

3.4. Comparison with Advanced Models

To evaluate the performance of our approach, we compare it with state-of-the-art lightweight SISR methods, including SAFMN [6], VapSR [4], HPUN [7], OSFFNet [19], DiVANet [20], and NGswin [5], among others.

Quantitative Comparison: Benefiting from its simple yet efficient structure, the proposed LMFN obtains SR results comparable to state-of-the-art models with significantly fewer parameters, as shown in Table 3. It can be seen that our LMFN achieves the best overall performance while maintaining moderate parameters and FLOPs for all SR scales.

Qualitative Comparison: Fig. 5 provides visual comparisons on several test images from the Urban100 [17] dataset for SR×4. Our approach reconstructs parallel straight lines and grid patterns more accurately than other methods. These results also demonstrate the effectiveness of our method in efficiently integrating multi-scale information through channel interaction.

Fig. 5. Visual comparison on Urban100 for SR×4.

Fig. 6. The LAM attribution indicates the significance of each pixel in the input LR image during the reconstruction of the patch highlighted in the red rectangle. The diffusion indices (DI) are shown beneath the LAM outcomes. A larger DI value means a wider range of attention.

LAM Comparison: As depicted in Fig. 6, our model reconstructs the target area using surrounding pixels from a wider range. This broader effective receptive field reflects the effectiveness of long-range multi-scale feature fusion.

SECTION 4.

CONCLUSION

In this work, we present a simple and efficient LMFN model for lightweight SISR tasks, motivated by the incorporation of long-range sparse dependencies with the multi-scale priors of images. By virtue of large kernel attention and multi-scale feature modulation, our LMFN can extract long-range and multi-scale features in a simple and flexible way. To explore the synergy between the different features, we introduce a novel feature fusion strategy, i.e., IFM built upon the merge-and-run of sub-features. Unlike previous methods that directly use the merge-and-run for feature learning, we adopt it to learn weights for feature modulation based on the universal self-similarity prior. The experiments demonstrate the benefits of combining long-range and multi-scale dependencies for SISR tasks, and illustrate the superiority of our LMFN over other models in compromising between performance and overhead.

References
1.
Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, et al., "Sketch-based manga retrieval using manga109 dataset", Multimedia Tools and Applications, vol. 76, pp. 21811-21838, 2017.
2.
Daniel Glasner, Shai Bagon and Michal Irani, "Super-resolution from a single image", IEEE International Conference on Computer Vision, pp. 349-356, 2009.
3.
Jing Luo, Lin Zhao, Li Zhu and Wenbing Tao, "Multi-scale receptive field fusion network for lightweight image super-resolution", Neurocomputing, vol. 493, pp. 314-326, 2022.
4.
Lin Zhou, Haoming Cai, Jinjin Gu, Zheyuan Li, Yingqi Liu, Xiangyu Chen, et al., "Efficient image super-resolution using vast-receptive-field attention", European Conference on Computer Vision, pp. 256-272, 2022.
5.
Haram Choi, Jeongmin Lee and Jihoon Yang, "N-gram in swin transformers for efficient lightweight image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2071-2081, 2023.
6.
Long Sun, Jiangxin Dong, Jinhui Tang and Jinshan Pan, "Spatially-adaptive feature modulation for efficient image super-resolution", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13190-13199, 2023.
7.
Bin Sun, Yulun Zhang, Songyao Jiang and Yun Fu, "Hybrid pixel-unshuffled network for lightweight image super-resolution", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2375-2383, 2023.
8.
Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, et al., "Large kernel distillation network for efficient single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1283-1292, 2023.
9.
Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng and Shi-Min Hu, "Visual attention network", Computational Visual Media, vol. 9, no. 4, pp. 733-752, 2023.
10.
Yaolei Qi, Yuting He, Xiaoming Qi, Yuan Zhang and Guanyu Yang, "Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6070-6079, 2023.
11.
Xiaole Zhao, Yulun Zhang, Tao Zhang and Xueming Zou, "Channel splitting network for single MR image super-resolution", IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5649-5662, 2019.
12.
Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang and Lei Zhang, "Ntire 2017 challenge on single image super-resolution: Methods and results", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114-125, 2017.
13.
Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah and Kyoung Mu Lee, "Enhanced deep residual networks for single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1132-1140, 2017.
14.
Marco Bevilacqua, Aline Roumy, Christine Guillemot and Marie Line Alberi-Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding", BMVC, pp. 1-10, 2012.
15.
Roman Zeyde, Michael Elad and Matan Protter, "On single image scale-up using sparse-representations", Proc. 7th Int. Conf. Curves Surf, pp. 711-730, 2010.
16.
David R. Martin, Charless C. Fowlkes, Doron Tal and Jitendra Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics", ICCV, pp. 416-425, 2001.
17.
Jia-Bin Huang, Abhishek Singh and Narendra Ahuja, "Single image super-resolution from transformed self-exemplars", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197-5206, 2015.
18.
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh and Eero P. Simoncelli, "Image quality assessment: from error visibility to structural similarity", IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
19.
Yang Wang and Tao Zhang, "OSFFNet: Omni-stage feature fusion network for lightweight image super-resolution", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 5660-5668, 2024.
20.
Parichehr Behjati, Pau Rodriguez, Carles Fernández, Isabelle Hupont, Armin Mehri and Jordi González, "Single image super-resolution based on directional variance attention network", Pattern Recognition, vol. 133, pp. 108997, 2023.
