
Long-Range Multi-Scale Fusion for Efficient Single Image Super-Resolution



Abstract:

Improving the performance of single image super-resolution (SISR) by extending the effective receptive field (ERF) of the model has become a popular paradigm in the field due to the universal self-similarity prior of natural images. However, solely increasing the ERF to capture long-range dependencies cannot fully exploit the model's capability, as non-local self-similarity is typically multi-scale and cross-scale. To this end, a Long-range Multi-scale Fusion Network (LMFN) is devised in this work to simultaneously exploit both long-range and multi-scale priors in images, as well as the interaction between the two. Within the same scale, our model employs large kernel attention (LKA) and multi-scale modulation (MSM) to learn long-range and multi-scale features. To exploit the interaction between long-range and multi-scale dependencies within a single scale and across scales, we design an Interactive Fusion Modulation (IFM) module for the effective fusion of non-local and multi-scale features. Extensive experiments on benchmark datasets illustrate the significant superiority of the proposed LMFN over advanced SISR models.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India


SECTION 1.

Introduction

Single image super-resolution (SISR) is a long-standing task in low-level computer vision, which aims at reconstructing a high-resolution (HR) image from a single low-resolution (LR) counterpart. A straightforward and intuitive way to promote the performance of deep SISR models is to increase network depth and model scale. However, this strategy restricts their applicability in resource-constrained scenarios, and increasing network depth or model scale indiscriminately may result in issues like gradient vanishing. Therefore, an important branch in this field is the exploration of lightweight and efficient super-resolution (SR) models tailored for practical scenarios and application deployment, striking a good balance between model performance and overhead.

One promising approach to improve SR performance with constrained model capacity is to incorporate image priors into model inference. In deep learning-based SISR models, self-similarity [2] is a common prior explored through non-local or global attention. In general, making full use of the self-similarity of an image helps to expand the effective receptive field (ERF) of the model, thereby facilitating the extraction of long-range dependencies [3]. This has also established a popular paradigm for boosting SISR performance by increasing the ERF [4], [5]. However, image self-similarity is typically sparse and multi-scale [2], [6], [7], so solely and exhaustively modeling long-range dependencies cannot fully exploit the model's capability.

Fig. 1. The trade-off comparison between our LMFN and other lightweight SISR models w.r.t. model performance and overhead on Manga109 [1] for SR×4. The size of the circles indicates the scale of model parameters.

Fig. 2. The network topology of the proposed LMFN. We take LKDN [8] as the backbone, and the framework of the building module (b) is instantiated by different modulation cells (c) ~ (e) via replacing the placeholder, i.e., (b) ↢ (c) ≜ MSM-M and (b) ↢ (e) ≜ IFM-M. It is worth noting that LKA-M is obtained by instantiating (b) with LKA [9], and IFF (d) is a structural component of IFM (e).

Fig. 3. Our intuition is that fusing long-range and multi-scale features could be expected to compensate and promote each other. (a) and (b): Diagrammatic sketches of long-range and multi-scale features. (c) Multi-scale prior makes up for long-range feature learning. (d) Multi-scale feature learning could be enabled in larger ranges by long-range modeling.

In view of the multi-scale nature and sparsity of non-local self-similarity, this work seeks to provide efficient SISR models with the ability to learn multi-scale and non-local dependencies. Specifically, we present a Long-range Multi-scale Fusion Network (LMFN) to improve the trade-off between performance and overhead of efficient SISR models. We use large kernel attention (LKA) [9] to capture sparse non-local features, which is primarily achieved by the dilated depth-wise convolution. Meanwhile, multi-scale information is acquired via a Multi-Scale Modulation (MSM) module equipped with dynamic snake convolution [10], which adapts well to multi-scale nonlinear features in images. To fuse non-local and multi-scale features effectively, we also design an Interactive Fusion Modulation (IFM) module based on the operations of channel splitting and merge-and-run. These structural parts are integrated into a unified building-module framework and periodically stacked to elevate the synergy between multi-scale and sparse non-local features, helping our model compromise gracefully between SR results and overhead, as shown in Fig. 1.

SECTION 2.

METHODOLOGY

2.1. Motivation

The overall structure of our LMFN is shown in Fig. 2. Using LKA alone to model long-range dependencies not only prevents the model from learning multi-scale features, but also fails to effectively extract local features, as shown in Fig. 3(a). Similarly, solely learning multi-scale features leaves the model incapable of modeling non-local self-similarity.

Fig. 4. Several schemes for the core component of our IFF.

To learn long-range features across scales and perform multi-scale feature learning over longer ranges, the proposed LMFN is endowed with the capability to combine long-range and multi-scale dependencies from two aspects. First, in the backbone of the model, LKA-M and MSM-M are periodically stacked, allowing for the sequential extraction of long-range and multi-scale features. Second, within the IFM-M, we explicitly integrate the two kinds of features with a simple merge-and-run. Both of these strategies contribute to effective feature fusion.

2.2. Long-Range and Multi-Scale Modulation

Given a temporary feature $\mathbf{x}_t$, the process to generate the output feature $\mathbf{y}_t$ with LKA [9] can be formulated as:
\begin{equation*}{\mathbf{y}}_t = \operatorname{DWDConv}\left(\operatorname{DWConv}\left(\operatorname{PWConv}\left({\mathbf{x}}_t\right)\right)\right) \odot {\mathbf{x}}_t,\tag{1}\end{equation*}
where PWConv(·) denotes the point-wise convolution with kernel size 1×1, DWConv(·) stands for the depth-wise convolution with kernel size 5×5, and DWDConv(·) is a dilated depth-wise convolution with kernel size 5×5. The symbol ⊙ indicates the element-wise product.
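For concreteness, a minimal PyTorch sketch of Eq. (1) is given below. The class name, the channel argument `dim`, and the dilation rate of the dilated depth-wise convolution are illustrative assumptions on our part, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Minimal sketch of large kernel attention following Eq. (1).
    Kernel sizes (5x5 DW, 5x5 dilated DW, 1x1 PW) follow the text;
    the dilation rate is an assumption."""
    def __init__(self, dim: int, dilation: int = 3):
        super().__init__()
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)                         # PWConv
        self.dw = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)  # DWConv 5x5
        self.dwd = nn.Conv2d(dim, dim, kernel_size=5, padding=2 * dilation,
                             dilation=dilation, groups=dim)                  # dilated DWConv 5x5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dwd(self.dw(self.pw(x)))  # attention map over a large receptive field
        return attn * x                       # element-wise modulation (the Hadamard product in Eq. (1))
```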

For multi-scale feature learning, we first split a temporary feature $\mathbf{x}_t \in \mathbb{R}^{h \times w \times c}$ into four parts, where $h$, $w$ and $c$ are the height, width and number of channels of $\mathbf{x}_t$, respectively:
\begin{equation*}\left[{\mathbf{x}}_0, {\mathbf{x}}_1, {\mathbf{x}}_2, {\mathbf{x}}_3\right] = \operatorname{Split}\left({\mathbf{x}}_t\right),\tag{2}\end{equation*}
where Split(·) stands for the operation of evenly splitting $\mathbf{x}_t$ along the channel direction, and $\mathbf{x}_i \in \mathbb{R}^{h \times w \times c/4}$ denotes the $i$-th sub-feature ($i = 0,\ldots,3$). Then we extract multi-scale features within down-sampled feature spaces using dynamic snake convolutions (DSConv) [10], as exhibited in Fig. 2(c). Formally, the procedure can be formulated as:
\begin{equation*}{\tilde{\mathbf{x}}}_i = \mathrm{U}_{\uparrow}\left(\operatorname{DSConv}\left(\mathrm{D}_{\downarrow}\left({\mathbf{x}}_i, 2^i\right)\right), 2^i\right), \quad i = 0,\ldots,3,\tag{3}\end{equation*}
where the DSConv has a kernel size of 3×3, and $\mathrm{U}_{\uparrow}(\cdot,\cdot)$ and $\mathrm{D}_{\downarrow}(\cdot,\cdot)$ represent the up-sampling and down-sampling operations, respectively, with $2^i$ being the scale factor of the $i$-th branch. Next, these sub-features ${\tilde{\mathbf{x}}}_i$ are concatenated to generate a single intermediate feature ${\tilde{\mathbf{x}}}_t$:
\begin{equation*}{\tilde{\mathbf{x}}}_t = \operatorname{Concat}\left({\tilde{\mathbf{x}}}_0, {\tilde{\mathbf{x}}}_1, {\tilde{\mathbf{x}}}_2, {\tilde{\mathbf{x}}}_3\right).\tag{4}\end{equation*}

To obtain the final output of MSM, we fuse these separately processed sub-features with a 1×1 point-wise convolution followed by a GELU activation and a Hadamard product with the original input:
\begin{equation*}{\mathbf{y}}_t = \operatorname{GeLU}\left(\operatorname{PWConv}\left({\tilde{\mathbf{x}}}_t\right)\right) \odot {\mathbf{x}}_t.\tag{5}\end{equation*}
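A minimal sketch of the MSM branch structure of Eqs. (2)-(5) is shown below, under stated assumptions: a plain 3×3 depth-wise convolution stands in for the dynamic snake convolution of [10] (which needs its own implementation), and bilinear resizing stands in for the unspecified up- and down-sampling operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSM(nn.Module):
    """Minimal sketch of multi-scale modulation (Eqs. (2)-(5)).
    The per-branch conv is a depth-wise 3x3 stand-in for DSConv [10]."""
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 4 == 0, "channels are split evenly into four branches"
        self.convs = nn.ModuleList(
            nn.Conv2d(dim // 4, dim // 4, 3, padding=1, groups=dim // 4) for _ in range(4)
        )
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        parts = torch.chunk(x, 4, dim=1)                       # Eq. (2): channel split
        outs = []
        for i, (xi, conv) in enumerate(zip(parts, self.convs)):
            s = 2 ** i                                         # scale factor of the i-th branch
            yi = F.interpolate(xi, scale_factor=1 / s, mode="bilinear") if s > 1 else xi
            yi = conv(yi)                                      # Eq. (3): conv in the down-sampled space
            yi = F.interpolate(yi, size=(h, w), mode="bilinear") if s > 1 else yi
            outs.append(yi)
        xt = torch.cat(outs, dim=1)                            # Eq. (4): concatenation
        return self.act(self.pw(xt)) * x                       # Eq. (5): modulation of the input
```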

Incorporating MSM with LKA can enable multi-scale feature learning over wider ranges, as shown in Fig. 3(d).

2.3. Interactive Fusion Modulation

To further integrate long-range and multi-scale features and adequately explore the synergies between them, we design a block cell for interactive fusion modulation (IFM) equipped with an interactive feature fusion (IFF), as shown in Fig. 2(d) and (e).

Given a temporary feature $\mathbf{x}_t$, our IFF first splits it evenly into two sub-features $\mathbf{x}_l$ and $\mathbf{x}_m$, i.e., $[\mathbf{x}_l, \mathbf{x}_m] = \operatorname{Split}(\mathbf{x}_t)$, and then fuses them with the merge-and-run [11]:
\begin{align*} & \left[\begin{array}{c} {\mathbf{z}}_l \\ {\mathbf{z}}_m \end{array}\right] = \left[\begin{array}{c} \operatorname{LKA}\left({\mathbf{x}}_l\right) \\ \operatorname{LKA}\left({\mathbf{x}}_m\right) \end{array}\right] + \frac{1}{2}\left[\begin{array}{cc} {\mathbf{I}} & {\mathbf{I}} \\ {\mathbf{I}} & {\mathbf{I}} \end{array}\right]\left[\begin{array}{c} {{\mathbf{x}}_l} \\ {{\mathbf{x}}_m} \end{array}\right],\tag{6} \\ & \left[\begin{array}{c} {\tilde{\mathbf{x}}}_l \\ {\tilde{\mathbf{x}}}_m \end{array}\right] = \left[\begin{array}{c} \operatorname{MSM}\left({\mathbf{z}}_l\right) \\ \operatorname{MSM}\left({\mathbf{z}}_m\right) \end{array}\right] + \frac{1}{2}\left[\begin{array}{cc} {\mathbf{I}} & {\mathbf{I}} \\ {\mathbf{I}} & {\mathbf{I}} \end{array}\right]\left[\begin{array}{c} {{\mathbf{z}}_l} \\ {{\mathbf{z}}_m} \end{array}\right],\tag{7}\end{align*}
where $\mathbf{I}$ denotes the identity matrix. The final output of our IFF is obtained by:
\begin{equation*}{\mathbf{y}}_t = \operatorname{PWConv}\left(\operatorname{Concat}\left({\tilde{\mathbf{x}}}_l, {\tilde{\mathbf{x}}}_m\right)\right) + {\mathbf{x}}_t.\tag{8}\end{equation*}
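The merge-and-run fusion of Eqs. (6)-(8) can be sketched as follows, reusing the LKA and MSM sketches above; we assume the channel count is divisible by 8 so that each half can be split again into four branches inside MSM.

```python
import torch
import torch.nn as nn

class IFF(nn.Module):
    """Minimal sketch of interactive feature fusion (Eqs. (6)-(8)).
    The merge-and-run shortcut adds the average of both branch inputs
    back to each branch output."""
    def __init__(self, dim: int):
        super().__init__()
        assert dim % 8 == 0
        half = dim // 2
        self.lka_l, self.lka_m = LKA(half), LKA(half)
        self.msm_l, self.msm_m = MSM(half), MSM(half)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xl, xm = torch.chunk(x, 2, dim=1)                           # even channel split
        merge = 0.5 * (xl + xm)                                     # merge-and-run shortcut
        zl, zm = self.lka_l(xl) + merge, self.lka_m(xm) + merge     # Eq. (6): long-range stage
        merge = 0.5 * (zl + zm)
        tl, tm = self.msm_l(zl) + merge, self.msm_m(zm) + merge     # Eq. (7): multi-scale stage
        return self.pw(torch.cat([tl, tm], dim=1)) + x              # Eq. (8): fusion + residual
```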

Table 1. Ablation study of model backbone configurations on Manga109 [1] with SR×4.
Table 2. The comparison between IFF variants on Manga109 [1] with SR×4, corresponding to Fig. 4.

Within the IFM, our IFF is used to generate the weights for fusing long-range and multi-scale dependencies. Assuming the input feature of IFM is still $\mathbf{x}_t$, IFM generates the sub-features for IFF with the following procedure:
\begin{equation*}\left[{\mathbf{x}}_t^1, {\mathbf{x}}_t^2\right] = \operatorname{Split}\left(\operatorname{PWConv}\left(\operatorname{LayerNorm}\left({\mathbf{x}}_t\right)\right)\right),\tag{9}\end{equation*}
where ${\mathbf{x}}_t^2$ is fed into IFF for weight generation, so the output of our IFM is produced by:
\begin{equation*}{\mathbf{y}}_t = \operatorname{PWConv}\left({\mathbf{x}}_t^1 \odot \operatorname{IFF}\left({\mathbf{x}}_t^2\right)\right) + {\mathbf{x}}_t.\tag{10}\end{equation*}

We use the merge-and-run to fuse long-range and multi-scale features, which is expected to promote the synergy between the self-similarity and multi-scale priors. Moreover, we use the integrated features primarily for weight generation, rather than for direct feature learning as in CSN [11].
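A sketch of the modulation path of Eqs. (9)-(10) is given below. The channel expansion of the first point-wise convolution and the channel-last LayerNorm layout are assumptions on our part; the text does not specify them.

```python
import torch
import torch.nn as nn

class IFM(nn.Module):
    """Minimal sketch of interactive fusion modulation (Eqs. (9)-(10)):
    the IFF output modulates the other split as a weight map, instead of
    being used directly as a learned feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                     # applied channel-last (assumption)
        self.pw_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)  # expansion ratio is an assumption
        self.iff = IFF(dim)
        self.pw_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LayerNorm over channels
        x1, x2 = torch.chunk(self.pw_in(y), 2, dim=1)             # Eq. (9): two sub-features
        return self.pw_out(x1 * self.iff(x2)) + x                 # Eq. (10): modulation + residual
```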

SECTION 3.

EXPERIMENTAL RESULTS

3.1. Datasets and Metrics

Our models are trained with the DF2K dataset (DIV2K [12] + Flickr2K [13]). We also use five benchmark datasets for testing: Set5 [14], Set14 [15], B100 [16], Urban100 [17] and Manga109 [1]. SR results are evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [18] on the Y channel of the YCbCr color space.
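For reference, the sketch below computes PSNR on the Y channel of 8-bit images using the BT.601 RGB-to-Y conversion commonly used in SISR benchmarks; cropping the image border by the scale factor is our assumption about the evaluation protocol.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luma (Y) channel of an RGB image with values in [0, 255] (BT.601)."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, crop: int = 4) -> float:
    """PSNR on the Y channel; `crop` pixels are removed from each border
    (typically equal to the SR scale factor)."""
    y_sr, y_hr = rgb_to_y(sr.astype(np.float64)), rgb_to_y(hr.astype(np.float64))
    if crop > 0:
        y_sr, y_hr = y_sr[crop:-crop, crop:-crop], y_hr[crop:-crop, crop:-crop]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```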

3.2. Implementation Details

We randomly crop 64 patches of 48×48 pixels from LR images as the basic training inputs. During model training, data augmentation is performed on the input patches with random horizontal flips and rotations. We train our models using the common ℒ1 loss and the Adan optimizer, and set the exponential moving average (EMA) decay to 0.999 to stabilize training. The learning rate is kept constant at 5 × 10−3 for 2 × 106 training iterations. All experiments are conducted with the PyTorch framework on an NVIDIA A100 GPU.
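The weight EMA mentioned above can be maintained with a few lines of PyTorch; this is a generic sketch with decay 0.999, and calling `update` once per optimizer step is our assumption about the schedule (the Adan optimizer itself comes from an external package and is not shown).

```python
import copy
import torch

class EMA:
    """Generic exponential moving average of model weights (decay 0.999)."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # averaged copy used for evaluation
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow = decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```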

Table 3. Comparisons between lightweight SISR models on benchmark datasets. FLOPs are measured for an HR image of 1280×720 pixels.

3.3. Ablation Studies

We will illustrate the effectiveness of fusing long-range and multi-scale features from two aspects, i.e., the modules in the backbone and the IFF in the IFM.

Model Backbone: We perform an ablation study by replacing the placeholder in LMFN. The results in Table 1 show that the combination used in LMFN achieves the best trade-off between model performance and overhead.

IFF in IFM: We conduct an ablation study by changing how the long-range and multi-scale features are integrated. The results in Table 2 illustrate that the strategies shown in Fig. 4(c) and Fig. 4(d) achieve the most competitive results.

3.4. Comparison with Advanced Models

To evaluate the performance of our approach, we compare it with state-of-the-art lightweight SISR methods, including SAFMN [6], VapSR [4], HPUN [7], OSFFNet [19], DiVANet [20], and NGswin [5], among others.

Quantitative Comparison: Benefiting from its simple yet efficient structure, the proposed LMFN obtains SR results comparable to state-of-the-art models with significantly fewer parameters, as shown in Table 3. It can be seen that our LMFN achieves the best overall performance while maintaining moderate parameters and FLOPs for all SR scales.

Qualitative Comparison: Fig. 5 provides visual comparisons on several test images from the Urban100 [17] dataset for SR×4. Our approach reconstructs parallel straight lines and grid patterns more accurately than other methods. These results also demonstrate the effectiveness of our method in efficiently integrating multi-scale information through channel interaction.

Fig. 5. Visual comparison on Urban100 for SR×4.

Fig. 6. The LAM attribution indicates the significance of each pixel in the input LR image during the reconstruction of the patch highlighted in the red rectangle. The diffusion indices (DI) are shown beneath the LAM outcomes. A larger DI value means a wider range of attention.

LAM Comparison: As depicted in Fig. 6, our model reconstructs the target area using surrounding pixels from a wider range. This broader effective receptive field reflects the effectiveness of long-range multi-scale feature fusion.

SECTION 4.

CONCLUSION

In this work, we present a simple and efficient LMFN model for lightweight SISR tasks, motivated by the incorporation of long-range sparse dependencies with the multi-scale priors of images. By virtue of large kernel attention and multi-scale feature modulation, our LMFN can extract long-range and multi-scale features in a simple and flexible way. To explore the synergy between the different features, we introduce a novel feature fusion strategy, i.e., IFM built upon the merge-and-run of sub-features. Unlike previous methods that directly use the merge-and-run for feature learning, we adopt it to learn weights for feature modulation based on the universal self-similarity prior. The experiments demonstrate the benefits of combining long-range and multi-scale dependencies for SISR tasks, and illustrate the superiority of our LMFN over other models in compromising between performance and overhead.

References
1.
Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, et al., "Sketch-based manga retrieval using manga109 dataset", Multimedia Tools and Applications, vol. 76, pp. 21811-21838, 2017.
2.
Daniel Glasner, Shai Bagon and Michal Irani, "Super-resolution from a single image", IEEE International Conference on Computer Vision, pp. 349-356, 2009.
3.
Jing Luo, Lin Zhao, Li Zhu and Wenbing Tao, "Multi-scale receptive field fusion network for lightweight image super-resolution", Neurocomputing, vol. 493, pp. 314-326, 2022.
4.
Lin Zhou, Haoming Cai, Jinjin Gu, Zheyuan Li, Yingqi Liu, Xiangyu Chen, et al., "Efficient image super-resolution using vast-receptive-field attention", European Conference on Computer Vision, pp. 256-272, 2022.
5.
Haram Choi, Jeongmin Lee and Jihoon Yang, "N-gram in swin transformers for efficient lightweight image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2071-2081, 2023.
6.
Long Sun, Jiangxin Dong, Jinhui Tang and Jinshan Pan, "Spatially-adaptive feature modulation for efficient image super-resolution", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13190-13199, 2023.
7.
Bin Sun, Yulun Zhang, Songyao Jiang and Yun Fu, "Hybrid pixel-unshuffled network for lightweight image super-resolution", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2375-2383, 2023.
8.
Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, et al., "Large kernel distillation network for efficient single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1283-1292, 2023.
9.
Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng and Shi-Min Hu, "Visual attention network", Computational Visual Media, vol. 9, no. 4, pp. 733-752, 2023.
10.
Yaolei Qi, Yuting He, Xiaoming Qi, Yuan Zhang and Guanyu Yang, "Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6070-6079, 2023.
11.
Xiaole Zhao, Yulun Zhang, Tao Zhang and Xueming Zou, "Channel splitting network for single MR image super-resolution", IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5649-5662, 2019.
12.
Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang and Lei Zhang, "Ntire 2017 challenge on single image super-resolution: Methods and results", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114-125, 2017.
13.
Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah and Kyoung Mu Lee, "Enhanced deep residual networks for single image super-resolution", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1132-1140, 2017.
14.
Marco Bevilacqua, Aline Roumy, Christine Guillemot and Marie Line Alberi-Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding", BMVC, pp. 1-10, 2012.
15.
Roman Zeyde, Michael Elad and Matan Protter, "On single image scale-up using sparse-representations", Proc. 7th Int. Conf. Curves Surf, pp. 711-730, 2010.
16.
David R. Martin, Charless C. Fowlkes, Doron Tal and Jitendra Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics", ICCV, pp. 416-425, 2001.
17.
Jia-Bin Huang, Abhishek Singh and Narendra Ahuja, "Single image super-resolution from transformed self-exemplars", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197-5206, 2015.
18.
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh and Eero P. Simoncelli, "Image quality assessment: from error visibility to structural similarity", IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
19.
Yang Wang and Tao Zhang, "OSFFNet: Omni-stage feature fusion network for lightweight image super-resolution", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 5660-5668, 2024.
20.
Parichehr Behjati, Pau Rodriguez, Carles Fernández, Isabelle Hupont, Armin Mehri and Jordi González, "Single image super-resolution based on directional variance attention network", Pattern Recognition, vol. 133, pp. 108997, 2023.
