INTRODUCTION
Single image super-resolution (SISR) is a long-standing task in low-level computer vision that aims to reconstruct a high-resolution (HR) image from a single low-resolution (LR) counterpart. A straightforward and intuitive way to improve the performance of deep SISR models is to increase network depth and model scale. However, this strategy restricts their applicability in resource-constrained scenarios, and indiscriminately increasing network depth or model scale may cause issues such as vanishing gradients. Therefore, an important branch in this field is the exploration of lightweight and efficient super-resolution (SR) models tailored for practical scenarios and application deployment, which strike a good balance between model performance and overhead.
One promising approach to improving SR performance under constrained model capacity is to incorporate image priors into model inference. In deep learning-based SISR models, self-similarity [2] is a common prior explored through non-local or global attention. In general, making full use of the self-similarity of an image helps to expand the effective receptive field (ERF) of the model, thereby facilitating the extraction of long-range dependencies [3]. This has also established a popular paradigm of boosting SISR performance by enlarging the ERF [4], [5]. However, image self-similarity is typically sparse and multi-scale [2], [6], [7], so solely and exhaustively modeling long-range dependencies cannot fully exploit the model's capability in an effective way.
Fig. 1. The trade-off between model performance and overhead of our LMFN and other lightweight SISR models on Manga109 [1] for SR×4. The size of each circle indicates the number of model parameters.
Fig. 2. The network topology of the proposed LMFN. We take LKDN [8] as the backbone, and the building-module framework (b) is instantiated with different modulation cells (c)~(e) by replacing the placeholder, i.e., (b) ↢ (c) ≜ MSM-M and (b) ↢ (e) ≜ IFM-M. Note that LKA-M is obtained by instantiating (b) with LKA [9], and that IFF (d) is a structural component of IFM (e).
Fig. 3. Our intuition is that fusing long-range and multi-scale features allows them to compensate for and promote each other. (a) and (b): Diagrammatic sketches of long-range and multi-scale features. (c) The multi-scale prior makes up for long-range feature learning. (d) Multi-scale feature learning is enabled over larger ranges by long-range modeling.
In view of the multi-scale nature and sparsity of non-local self-similarity, this work seeks to equip efficient SISR models with the ability to learn multi-scale and non-local dependencies. Specifically, we present a Long-range Multi-scale Fusion Network (LMFN) to improve the trade-off between performance and overhead of efficient SISR models. We use large kernel attention (LKA) [9] to capture sparse non-local features, which is primarily achieved by its dilated depth-wise convolution. Meanwhile, multi-scale information is acquired via a Multi-Scale Modulation (MSM) module equipped with dynamic snake convolution [10], which adapts well to multi-scale nonlinear features in images. To fuse non-local and multi-scale features effectively, we further design an Interactive Fusion Modulation (IFM) module based on channel splitting and merge-and-run operations. These structural components are integrated into a unified building-module framework and stacked periodically to strengthen the synergy between multi-scale and sparse non-local features, helping our model strike a graceful balance between SR results and overhead, as shown in Fig. 1.
METHODOLOGY
2.1. Motivation
The overall structure of our LMFN is shown in Fig. 2. Using LKA alone to model long-range dependencies not only prevents the model from learning multi-scale features, but also fails to effectively extract local features, as shown in Fig. 3(a). Similarly, solely learning multi-scale features leaves the model incapable of modeling non-local self-similarity.
To learn long-range features across scales and perform multi-scale feature learning over longer ranges, the proposed LMFN combines long-range and multi-scale dependencies from two aspects. First, in the backbone of the model, LKA-M and MSM-M are stacked periodically, allowing long-range and multi-scale features to be extracted sequentially. Second, within the IFM-M, we explicitly integrate the two kinds of features with a simple merge-and-run. Both strategies contribute to effective feature fusion.
2.2. Long-Range and Multi-Scale Modulation
Given an intermediate feature xt, the process of generating the output feature yt with LKA [9] can be formulated as:
\begin{equation*}{{\mathbf{y}}_t} = \operatorname{DWDConv} \left( {\operatorname{DWConv} \left( {\operatorname{PWConv} \left( {{{\mathbf{x}}_t}} \right)} \right)} \right) \odot {{\mathbf{x}}_t},\tag{1}\end{equation*}
where PWConv, DWConv and DWDConv denote point-wise, depth-wise and dilated depth-wise convolutions, respectively, and ⊙ denotes the Hadamard product.
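As a concrete illustration, the following is a minimal PyTorch sketch of Eq. (1); the 5×5 depth-wise and 7×7 dilated (dilation 3) depth-wise kernels follow the common large-kernel decomposition of [9] and are assumptions rather than the exact configuration of LMFN.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Sketch of Eq. (1): an attention map produced by PWConv -> DWConv -> DWDConv,
    applied to the input via a Hadamard product. Kernel sizes are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)                 # PWConv
        self.dw = nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                            groups=channels)                                   # DWConv
        self.dwd = nn.Conv2d(channels, channels, kernel_size=7, padding=9,
                             dilation=3, groups=channels)                      # DWDConv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dwd(self.dw(self.pw(x)))   # long-range attention map
        return attn * x                        # element-wise modulation of the input
```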
For multi-scale feature learning, we first split an intermediate feature xt ∈ ℝ^{h×w×c} into four parts, where h, w and c are the height, width and number of channels of xt, respectively:
\begin{equation*}\left[ {{{\mathbf{x}}_0},{{\mathbf{x}}_1},{{\mathbf{x}}_2},{{\mathbf{x}}_3}} \right] = \operatorname{Split} \left( {{{\mathbf{x}}_t}} \right).\tag{2}\end{equation*}
Each sub-feature is then processed at its own scale and the results are concatenated:
\begin{equation*}{{\mathbf{\tilde x}}_i} = {{\text{U}}_ \uparrow }\left( {\operatorname{DSConv} \left( {{{\text{D}}_ \downarrow }\left( {{{\mathbf{x}}_i},{2^i}} \right)} \right),{2^i}} \right),\quad i = 0, \ldots ,3,\tag{3}\end{equation*}
\begin{equation*}{{\mathbf{\tilde x}}_t} = \operatorname{Concat} \left( {{{{\mathbf{\tilde x}}}_0},{{{\mathbf{\tilde x}}}_1},{{{\mathbf{\tilde x}}}_2},{{{\mathbf{\tilde x}}}_3}} \right),\tag{4}\end{equation*}
where D↓(·, 2^i) and U↑(·, 2^i) denote downsampling and upsampling by a factor of 2^i, respectively, and DSConv(·) is the dynamic snake convolution [10].
To obtain the final output of MSM, we fuse the separately processed sub-features with a 1×1 convolution, followed by a GeLU activation and a Hadamard product with the original input:
\begin{equation*}{{\mathbf{y}}_t} = \operatorname{GeLU} \left( {\operatorname{PWConv} \left( {{{{\mathbf{\tilde x}}}_t}} \right)} \right) \odot {{\mathbf{x}}_t}.\tag{5}\end{equation*}
Incorporating MSM with LKA can enable multi-scale feature learning over wider ranges, as shown in Fig. 3(d).
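The following is a minimal PyTorch sketch of Eqs. (2)-(5), assuming bilinear resampling for D↓ and U↑ and substituting a plain 3×3 depth-wise convolution for the dynamic snake convolution [10], whose implementation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSM(nn.Module):
    """Sketch of multi-scale modulation (Eqs. 2-5). The depth-wise 3x3 conv is only a
    stand-in for DSConv [10]; resampling modes and kernel sizes are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.convs = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c) for _ in range(4)])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)   # PWConv in Eq. (5)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = []
        for i, (xi, conv) in enumerate(zip(torch.chunk(x, 4, dim=1), self.convs)):  # Eq. (2)
            s = 2 ** i
            y = F.interpolate(xi, scale_factor=1 / s, mode="bilinear",
                              align_corners=False) if s > 1 else xi                 # downsample
            y = conv(y)                                                             # DSConv stand-in, Eq. (3)
            y = F.interpolate(y, size=(h, w), mode="bilinear",
                              align_corners=False) if s > 1 else y                  # upsample
            outs.append(y)
        y = torch.cat(outs, dim=1)                                                  # Eq. (4)
        return self.act(self.fuse(y)) * x                                           # Eq. (5)
```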
2.3. Interactive Fusion Modulation
To further integrate long-range and multi-scale features and adequately explore the synergy between them, we design an Interactive Fusion Modulation (IFM) cell equipped with an Interactive Feature Fusion (IFF) unit, as shown in Fig. 2(d) and (e).
Given an intermediate feature xt, our IFF first splits it evenly into two sub-features xl and xm, i.e., [xl, xm] = Split(xt), and then fuses them with the merge-and-run [11]:
\begin{align*} & \left[ {\begin{array}{c} {{{\mathbf{z}}_l}} \\ {{{\mathbf{z}}_m}} \end{array}} \right] = \left[ {\begin{array}{c} {{\text{LKA}}\left( {{{\mathbf{x}}_l}} \right)} \\ {{\text{LKA}}\left( {{{\mathbf{x}}_m}} \right)} \end{array}} \right] + \frac{1}{2}\left[ {\begin{array}{ll} {\mathbf{I}}&{\mathbf{I}} \\ {\mathbf{I}}&{\mathbf{I}} \end{array}} \right]\left[ {\begin{array}{c} {{{\mathbf{x}}_l}} \\ {{{\mathbf{x}}_m}} \end{array}} \right],\tag{6} \\ & \left[ {\begin{array}{c} {{{{\mathbf{\tilde x}}}_l}} \\ {{{{\mathbf{\tilde x}}}_m}} \end{array}} \right] = \left[ {\begin{array}{c} {{\text{MSM}}\left( {{{\mathbf{z}}_l}} \right)} \\ {{\text{MSM}}\left( {{{\mathbf{z}}_m}} \right)} \end{array}} \right] + \frac{1}{2}\left[ {\begin{array}{cc} {\mathbf{I}}&{\mathbf{I}} \\ {\mathbf{I}}&{\mathbf{I}} \end{array}} \right]\left[ {\begin{array}{c} {{{\mathbf{z}}_l}} \\ {{{\mathbf{z}}_m}} \end{array}} \right],\tag{7}\end{align*}
\begin{equation*}{{\mathbf{y}}_t} = \operatorname{PWConv} \left( {\operatorname{Concat} \left( {{{{\mathbf{\tilde x}}}_l},{{{\mathbf{\tilde x}}}_m}} \right)} \right) + {{\mathbf{x}}_t}.\tag{8}\end{equation*}
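Below is a minimal sketch of Eqs. (6)-(8), reusing the LKA and MSM sketches above; the two branches use separate weights here, which is an assumption rather than a detail stated in the text.

```python
import torch
import torch.nn as nn

class IFF(nn.Module):
    """Sketch of interactive feature fusion (Eqs. 6-8) built on the LKA and MSM
    sketches above. Separate branch weights are assumed."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 2
        self.lka_l, self.lka_m = LKA(c), LKA(c)
        self.msm_l, self.msm_m = MSM(c), MSM(c)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)   # PWConv in Eq. (8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xl, xm = torch.chunk(x, 2, dim=1)                          # channel split
        run = 0.5 * (xl + xm)                                      # merge-and-run average [11]
        zl, zm = self.lka_l(xl) + run, self.lka_m(xm) + run        # Eq. (6)
        run_z = 0.5 * (zl + zm)
        tl, tm = self.msm_l(zl) + run_z, self.msm_m(zm) + run_z    # Eq. (7)
        return self.fuse(torch.cat([tl, tm], dim=1)) + x           # Eq. (8)
```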
Within the IFM, our IFF is used to generate the weights for fusing long-range and multi-scale dependencies. Assuming that the input feature of the IFM is still xt, the IFM generates the sub-features for the IFF as follows:
\begin{equation*}\left[ {{\mathbf{x}}_t^1,{\mathbf{x}}_t^2} \right] = \operatorname{Split} \left( {\operatorname{PWConv} \left( {\operatorname{LayerNorm} \left( {{{\mathbf{x}}_t}} \right)} \right)} \right),\tag{9}\end{equation*}
\begin{equation*}{{\mathbf{y}}_t} = \operatorname{PWConv} \left( {{\mathbf{x}}_t^1 \odot \operatorname{IFF} \left( {{\mathbf{x}}_t^2} \right)} \right) + {{\mathbf{x}}_t}.\tag{10}\end{equation*}
We adopt the merge-and-run to fuse long-range and multi-scale features, which is expected to promote the synergy between the self-similarity and multi-scale priors. Moreover, we primarily use the integrated features for weight generation, rather than for direct feature learning as in CSN [11].
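Putting the pieces together, the sketch below implements Eqs. (9)-(10) and, under the assumption that the same split-and-modulate wrapper (Fig. 2(b)) is shared by all cells, shows how LKA-M, MSM-M and IFM-M are obtained and stacked periodically; the channel width, depth and the GroupNorm stand-in for LayerNorm are illustrative assumptions, not the configuration reported in the experiments.

```python
import torch
import torch.nn as nn

class ModulationBlock(nn.Module):
    """Sketch of the building-module framework (Eqs. 9-10, Fig. 2(b)); swapping
    `cell` for LKA, MSM or IFF yields LKA-M, MSM-M or IFM-M, respectively."""
    def __init__(self, channels: int, cell: nn.Module):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)             # channel-wise LayerNorm stand-in
        self.pw_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.cell = cell                                  # modulation cell on half the channels
        self.pw_out = nn.Conv2d(channels // 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(self.pw_in(self.norm(x)), 2, dim=1)   # Eq. (9)
        return self.pw_out(x1 * self.cell(x2)) + x                 # Eq. (10)

def make_backbone(channels: int = 48, n_groups: int = 4) -> nn.Sequential:
    """Periodic stacking of LKA-M, MSM-M and IFM-M; width and depth are illustrative."""
    blocks = []
    for _ in range(n_groups):
        blocks += [ModulationBlock(channels, LKA(channels // 2)),   # LKA-M
                   ModulationBlock(channels, MSM(channels // 2)),   # MSM-M
                   ModulationBlock(channels, IFF(channels // 2))]   # IFM-M
    return nn.Sequential(*blocks)
```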
EXPERIMENTAL RESULTS
3.1. Datasets and Metrics
Our models are trained with the DF2K dataset (DIV2K [12] + Flickr2K [13]). We also use five benchmark datasets for testing: Set5 [14], Set14 [15], B100 [16], Urban100 [17] and Manga109 [1]. SR results are evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [18] on the Y channel of the YCbCr color space.
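For reference, the following is a minimal sketch of Y-channel PSNR using the standard BT.601 luma conversion commonly adopted in SR evaluation; it is only illustrative, not the exact evaluation script used in our experiments.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """BT.601 luma conversion; `img` is an HxWx3 RGB array in [0, 255]."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR between the Y channels of the SR output and the ground-truth HR image.
    In practice a border of `scale` pixels is usually cropped beforehand."""
    diff = rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```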
3.2. Implementation Details
We randomly crop 64 patches of 48×48 pixels from the LR images as the basic training inputs. During training, data augmentation is performed on the input patches with random horizontal flips and rotations. We train our models with the common ℒ1 loss and the Adan optimizer, and use an exponential moving average (EMA) with a decay of 0.999 to stabilize training. The learning rate is set to a constant 5 × 10^−3 for 2 × 10^6 training iterations. All experiments are conducted with the PyTorch framework on an NVIDIA A100 GPU.
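A condensed sketch of the EMA update used with these settings is given below; names such as `model`, `ema_model` and `loader` in the usage comment are hypothetical placeholders, and the Adan optimizer itself is a third-party implementation that is not reproduced here.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = 0.999) -> None:
    """Exponential moving average of the online model's weights (decay 0.999),
    called once per optimizer step to stabilize training."""
    for pe, p in zip(ema_model.parameters(), model.parameters()):
        pe.mul_(decay).add_(p, alpha=1.0 - decay)

# Per-iteration usage sketch (hypothetical names):
#   sr = model(lr_patch); loss = nn.L1Loss()(sr, hr_patch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   ema_update(ema_model, model)
```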
3.3. Ablation Studies
We illustrate the effectiveness of fusing long-range and multi-scale features from two aspects, i.e., the modules in the backbone and the IFF in the IFM.
Model Backbone: We perform an ablation study by replacing the placeholder in LMFN. The results in Table 1 show that the combination used in LMFN achieves the best trade-off between model performance and overhead.
IFF in IFM: We conduct an ablation study by changing how long-range and multi-scale features are integrated. The results in Table 2 illustrate that the strategies shown in Figs. 4(c) and 4(d) achieve the most competitive results.
3.4. Comparison with Advanced Models
To evaluate the performance of our approach, we compare it with state-of-the-art lightweight SISR methods, including SAFMN [6], VapSR [4], HPUN [7], OSFFNet [19], DiVANet [20], and NGswin [5], among others.
Quantitative Comparison: Benefiting from its simple yet efficient structure, the proposed LMFN obtains SR results comparable to state-of-the-art models with significantly fewer parameters, as shown in Table 3. Our LMFN achieves the best overall performance while maintaining moderate parameter counts and FLOPs across all SR scales.
Qualitative Comparison: Fig. 5 provides visual comparisons on several test images from the Urban100 [17] dataset for SR×4. Our approach reproduces parallel straight lines and grid patterns more accurately than the other methods. These results also demonstrate that our method efficiently integrates multi-scale information by utilizing a dual-attention mechanism for channel interaction.
Fig. 6. The local attribution map (LAM) indicates the significance of each pixel in the input LR image for the reconstruction of the patch marked by the red rectangle. The diffusion index (DI) is shown beneath each LAM result; a larger DI indicates a wider range of attended pixels.
LAM Comparison: As depicted in Fig. 6, our model reconstructs the target area using surrounding pixels over a wider range. Such a broader effective receptive field reflects the effectiveness of long-range multi-scale feature fusion.
CONCLUSION
In this work, we present LMFN, a simple and efficient model for lightweight SISR tasks, motivated by combining long-range sparse dependencies with multi-scale image priors. By virtue of large kernel attention and multi-scale feature modulation, our LMFN extracts long-range and multi-scale features in a simple and flexible way. To explore the synergy between the two kinds of features, we introduce a novel feature fusion strategy, i.e., the IFM built upon the merge-and-run of sub-features. Unlike previous methods that use the merge-and-run directly for feature learning, we adopt it to learn weights for feature modulation grounded in the universal self-similarity prior. The experiments demonstrate the benefit of combining long-range and multi-scale dependencies for SISR tasks and illustrate the superiority of our LMFN over other models in balancing performance and overhead.