Introduction
Higher-resolution images generally offer superior quality and greater detail compared to low-resolution (LR) images. While high-quality cameras can capture high-resolution (HR) images directly, they tend to be more costly and resource-intensive, so deploying such cameras is often not feasible in real-time situations. A viable alternative is to apply super-resolution (SR) techniques, which can reconstruct visually appealing HR images from one or more LR inputs [1], [2], [3], [4]. Many applications, including virtual reality, medical imaging, closed-circuit television surveillance, high-definition television, and ultra-high-definition television, require SR images. Current image super-resolution methods can be categorized into three main types: interpolation-based methods [5], [6], [7], [8], [9], reconstruction-based methods, and learning-based methods [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20].
Interpolation-based methods enhance an image's resolution by leveraging the relationships between adjacent pixels. Common methods include bicubic, bilinear, and nearest-neighbor interpolation. These techniques compute interpolated values for the pixels in low-resolution images to produce high-resolution outputs. Although interpolation methods are simple and fast, they frequently produce blurry images because they have difficulty capturing fine details [21]. Reconstruction-based approaches are inefficient, and their performance degrades quickly when LR images are upscaled by large factors.
The most popular research trend now focuses on deep convolutional neural network (CNN) models [25], [26], [27], [28], [29], [30], which have made great progress and significantly improved HR image reconstruction. Initially, Dong et al. [13] proposed a shallow network architecture for image SR. In this method, the authors used three fundamental CNN layers to learn a direct mapping between low-resolution (LR) input images and high-resolution (HR) output images, known as the super-resolution convolutional neural network (SRCNN) [13]. SRCNN employed a manually designed pre-processing step, bicubic interpolation, to upscale the LR image to the HR size. Although SRCNN outperformed earlier methods, it still encounters challenges in the quality of the reconstructed results: the model operates mainly on the interpolated version of the image, which frequently results in blurry HR outputs.
To overcome these challenges, the same authors proposed a different shallow architecture called the fast super-resolution convolutional neural network (FSRCNN) [14]. In this approach, they replaced bicubic interpolation with a transposed CNN layer to produce the high-resolution output image. However, a notable limitation of FSRCNN is its shallow architecture, which limits its representational capacity even though it converges quickly. To address the shortcomings of shallow network architectures, Kim et al. [15] introduced Very Deep Super-Resolution (VDSR), drawing inspiration from the VGG model used for ImageNet classification. For the first time, VDSR [15] increased the depth of an image super-resolution network to 20 layers, and its authors found that the model performed substantially better than earlier approaches. The leading methods for single image super-resolution are illustrated in figure 1.

While these advanced algorithms based on single-path convolutional neural networks have demonstrated enhanced performance, they face various challenges that may impede their effectiveness. The dying ReLU problem is a significant concern in deep networks: some neurons become inactive during training and consistently output zero. This can arise when the weights are adjusted so that the neuron's input stays negative, resulting in a loss of learning ability. Inactive neurons do not participate in the learning process, which diminishes the network's capacity to capture intricate features. Gradient flow issues occur when the gradients for these neurons become zero, preventing them from updating their weights during backpropagation and rendering them useless. Single-path deep CNNs are also especially vulnerable to overfitting, particularly when trained on datasets of varying sizes. Overfitting happens when the model memorizes the training data rather than generalizing to new, unseen images, leading to poor generalization: the model performs well on the training data but struggles to produce high-quality outputs on test data or real-world images. As the model becomes more complex (e.g., by adding more layers along a single path), the risk of overfitting increases, necessitating careful regularization. Furthermore, in a single-path architecture, the back-end layers may function as dead layers that fail to learn or update effectively. This can happen through accumulated inactivity: when earlier layers experience the dying ReLU problem, subsequent layers may inherit this inactivity, leading to a cascade in which multiple layers become ineffective. Dead layers contribute little to the final output and lose important features crucial for high-quality image reconstruction. As the depth of a single-path CNN increases, it may also face an information bottleneck, causing important details to be lost as data flows through multiple layers; this makes it challenging for the network to retain the context needed for effective image reconstruction. Additionally, the intensity of the feature maps can decrease toward the end of the network, resulting in a lack of detail in the final output images. Finally, a single-path architecture relies on a linear sequence of layers for feature extraction, which may restrict the model's ability to learn a varied and thorough representation of the input data. This limitation can adversely affect performance in super-resolution tasks, where capturing both low-level details and high-level semantics is crucial to producing high-quality outputs.
To address these challenges and boost LR image quality, we propose the novel Combined Attention Network for SISR (CANS) to reconstruct a visually attractive, high-quality HR image from a low-quality LR input. Our proposed network architecture is primarily composed of a Shallow Block-based Network (SBN) path, a Deep Block-based Network (DeBN) path, and a Dense Block-based Network (DBN) path. Each type of network path has unique characteristics that benefit different aspects of image super-resolution. The Shallow Block-based Network has few layers and quickly captures basic features and patterns in the input images; it is efficient to process and serves as a foundation for more complex features. The Deep Block-based Network path consists of multiple layers that extract intricate features and representations; its depth allows it to learn complex mappings from low-resolution to high-resolution images, which is essential for effective super-resolution. The Dense Block-based Network path utilizes dense connections between layers, allowing for better feature propagation and reuse. Our proposed architecture helps alleviate the vanishing gradient problem and ensures that all layers contribute to the final output, enhancing the network's ability to learn rich feature representations without requiring excessive parameters.
By integrating these various architectures, the model can extract a more comprehensive set of features from the input images. The shallow blocks efficiently capture low-level features, while the deep blocks concentrate on high-level abstractions. The dense blocks facilitate the sharing of features across layers, leading to a more comprehensive understanding of the image content. This multi-faceted feature extraction is particularly beneficial in super-resolution tasks, where fine details and overall structure are both important. Employing a parallel architecture with diverse blocks also enables the model to generalize more effectively across different types of images and resolutions. Each block can learn to address different aspects of the input data, resulting in enhanced performance on previously unseen images. This is especially relevant in super-resolution, where the model may encounter varied image characteristics. The parallel approach can also make training and inference more efficient: since each block can operate independently, the blocks can be trained simultaneously, potentially reducing the overall training time.
Additionally, during inference, the model can leverage the strengths of each block without the need for a single, monolithic architecture, allowing for faster processing times. Finally, CANS provides flexibility in design and scalability. We can also modify the number of layers in each block or the types of connections employed according to the specific needs of the super-resolution task. This flexibility facilitates experimentation with various configurations to attain the best results.
In summary, the contributions of this paper can be outlined as follows:
We introduce a new architecture that integrates multiple attention blocks within a multi-path structure, referred to as the Combined Attention Network for Single Image Super-Resolution (CANS). CANS is built on local and global residual skip connections within the shallow, deep, and dense block-based paths. It accomplishes fast and precise image super-resolution, exhibiting competitive results on SISR benchmarks with a moderate number of parameters.
We propose the Triplet Channel Attention Network (TCAN) block, which attends to spatial regions and treats information-rich areas, such as textures and boundaries, differently. Consequently, the network can place greater emphasis on these areas.
The network architecture employs Depthwise Separable Convolution (DSC) layers to decrease the computational cost by reducing the number of parameters. It also uses a customized UpSampling Block (U-Block), rather than a single UpSampling layer, to reconstruct a visually pleasing HR image.
The remainder of this paper is organized as follows: Section II reviews related work, Section III describes the proposed method along with its corresponding attention blocks, Section IV analyzes the experimental results, and Section V concludes with a summary and recommendations for future research.
Related Work
This section examines the leading performers and key image super-resolution architectures utilized in CNN-based approaches. Next, we present recently proposed methods that incorporate local and global attention mechanisms. Lastly, we explore various techniques that employ attention mechanisms in image enhancement, with a particular focus on image super-resolution.
A. CNN Based Single Image Super-Resolution
Deep learning approaches based on CNNs have emerged as a crucial research area for tackling single-image super-resolution tasks. The initial image super-resolution method presented by Dong et al. [13], known as the super-resolution CNN (SRCNN), demonstrated remarkable performance. In this approach, the authors employed three CNN layers to estimate the mapping function that converts the bicubic-upscaled version of the image into the high-resolution output image. However, this model has some weaknesses, such as the extra bicubic interpolation pre-processing step and slow training convergence. The same authors later revised SRCNN to further improve HR image quality, producing Fast SRCNN (FSRCNN) [14]. The authors of [25] and [31] implemented recursive deep learning strategies to address the computational cost and memory usage associated with the network parameters. Kim et al., the authors of VDSR, also introduced a novel method known as the Deeply-Recursive Convolutional Network (DRCN) for image SR [25]. In DRCN [25], the authors apply the same CNN layer multiple times in a recursive manner. The primary benefits of DRCN are its effectiveness and simplicity; however, it suffers from slow training. Tai et al. [31] proposed the deep recursive residual network (DRRN), which controls the number of model parameters through multi-path recursive learning with residual units. Shi et al. [19] proposed an efficient sub-pixel convolutional neural network (ESPCN) for real-time image and video applications, enhancing both quantitative and qualitative performance. ESPCN lowers computational complexity by substituting the transposed convolutional layer with a sub-pixel convolutional layer, allowing it to extract rich feature information directly from the original low-resolution image. Wang et al. [32] introduced the concept of a sparse-prior deep CNN for image super-resolution, termed the Sparse Coding Network. This methodology is straightforward to implement and shows significant improvements over SRCNN.
Following the "The Deeper the Better" idea, Mao et al. [33] suggested the Residual Encoder-Decoder Networks (RED) architecture. In this architecture [33], the authors apply symmetric convolution and deconvolution layers with residual learning. On the other hand, Romano et al. proposed a fast, shallow learning-based method named Rapid and Accurate Image Super-Resolution (RAISR) [34]. In this method, the authors learn LR-to-HR mappings by classifying input image patches based on their strength, coherence, and angle. Lai et al. [35] suggested a progressive SR method known as the Laplacian Pyramid SR Network (LapSRN). The LapSRN network [35] is built from multiple pyramid levels, with each level using deconvolution layers for upscaling. However, the network's flexibility is limited because it uses a fixed integer scaling factor. Feed-forward denoising CNNs (DnCNNs) were suggested by Zhang et al. [36] to speed up the training of a very deep CNN architecture and enhance the visually pleasing quality of the image. The DnCNN design is similar to SRCNN's, with convolution layers stacked with batch normalization (BN) layers and ReLU activations. Even though the model produces positive results, the batch normalization layers make it computationally expensive. Zhao et al. [37] suggested a gradual UpSampling network (GUN); the GUN architecture offers a more adjustable scaling factor for super-resolving LR images, and the authors used bicubic interpolation to upsample the LR features. Li et al. [38] introduced the multi-scale residual network (MSRN) to extract complete image feature information; in this approach, the authors replaced the residual block with blocks of different convolution kernel sizes to extract feature information at various scales simultaneously. The end-to-end image SR via deep-and-shallow CNN (EEDS) architecture was introduced by Wang et al. [39] to further improve the LR input image both quantitatively and qualitatively. In this approach, the authors replace the pre-processing upscaling technique with a learned CNN upscaling. The primary architecture can be divided into three distinct modules: feature extraction, upsampling, and multi-scale reconstruction. Pandey et al. [40] proposed a multi-scale feature-enhancement image SR algorithm to resolve the blurring and checkerboard artifacts introduced by the deconvolution layer. The authors replaced the conventional upscaling approach with pixel deconvolution to reconstruct a visually attractive, high-quality output image.
B. Densenet and Attention-Based Single-Image Super-Resolution
A dense connection is a particular class of short and long skip connections in which the current output feeds into all subsequent inputs. Huang et al. [41] were the first to present the DenseNet architecture, which was recognized as the best paper at CVPR 2017. Following this, SRDenseNet [42] introduced the concept of dense skip connections and demonstrated strong performance in image super-resolution. Subsequently, dense connection-based architectures were adopted by various networks for challenging SR tasks, with consistently strong performance. Nevertheless, the densely connected structure still introduces redundant and irrelevant information during the reconstruction process [43].
Attention blocks play a crucial part in high-level image reconstruction and computer vision applications [44], [45], [46]. To address the issue of feature correlations between intermediate layers, Dai et al. introduced a second-order attention network known as SAN [46]. Its architecture contains four stages: shallow feature extraction, deep feature extraction with non-locally enhanced residual groups (NLRG), an upscaling module, and a reconstruction step. Choi et al. [47] suggested a new architecture that reinterprets ReLU with point-wise multiplication, called Selection Units for Super-Resolution (SelNet). Ahn et al. [48] proposed the cascading residual network (CARN), an image SR architecture based on local and global cascading mechanisms that incorporates low- and high-level features from multiple layers. The authors also provided a modified version, CARN-M, which is more efficient and requires fewer operations.
Niu et al. [49] proposed the holistic attention network (HAN) to recover information missing between different types of layers. The HAN architecture consists of a layer attention module and a channel-spatial attention module, applied together to extract low-, mid-, and high-level feature information. To enhance the perceptual estimation of the blur kernel, Gu et al. [50] proposed iterative kernel correction for blind image super-resolution; the proposed algorithm comprises an SR model, a predictor, and a corrector. Hu et al. [51] introduced a special block termed squeeze-and-excitation (SE), which uses channel-wise relations with average-pooled features to boost a CNN's representational power and ensure reliable image classification. Zheng et al. [52] proposed a novel approach implementing an Efficient Mixed Transformer (EMT) for single image super-resolution to address the lack of a locality mechanism and the limits imposed by high complexity. This approach uses a Mixed Transformer Block (MTB) composed of several sequential transformer layers, some of which replace Self-Attention (SA) with a Pixel Mixer (PM); the PM leverages pixel-shifting to improve local knowledge aggregation. Liu et al. [53] proposed a deep recursive residual channel attention network to address the problem of features being only partially fused between different layers in deeper models. To recover the distinct features unique to each palmprint, a dense hybrid attention (DHA) network was proposed for palmprint image super-resolution (SR) [54]. The DHA network initially utilizes a single convolution layer to obtain a high-dimensional shallow representation, then employs parallel convolutional neural network (CNN) and transformer-based branches to collaboratively learn the local and global features specific to each palmprint. To enhance information flow throughout the network and extract locally fused features, the residual dense Swin Transformer (RDST) [55] was introduced; in this approach, the authors employed residual dense Transformer blocks (RDTB) to obtain richer feature information. Behjati et al. [56] suggested a computationally efficient and precise network for SISR, called the directional variance attention network (DiVANet), to address scaling and memory consumption problems. The authors describe a novel directional variance attention (DiVA) mechanism that simultaneously exploits inter-channel dependencies and captures long-range spatial correlations to produce more discriminative representations. Song et al. [57] discuss the challenges of computational cost and memory usage, especially on mobile devices, and resolve them with an efficient residual dense block search algorithm that finds fast, lightweight, and precise networks for image super-resolution; the suggested evolutionary algorithm [57] automatically searches for pooling and upsampling operators. Jiang et al. [58] propose a hierarchical dense connection network (HDN) for image super-resolution, focusing on both reconstruction quality and efficiency; the algorithm [58] employs a hierarchical matrix structure and constructs a hierarchical dense residual block (HDB) to enhance feature representation while minimizing memory usage. Chen et al. proposed a novel Hybrid Attention Transformer (HAT) [59] that activates more input pixels and leverages the complementary benefits of window-based self-attention and channel attention together. Furthermore, the authors incorporate an overlapping cross-attention module to enhance the interaction between adjacent window features, enabling the aggregation of cross-window information. Xiao et al. proposed TTST [60], identifying two major challenges for current Transformers in large-scale earth observation scenarios: (1) a single-scale representation that neglects the modeling of scale correlations among similar ground observation objects; and (2) redundant token representation resulting from the presence of numerous irrelevant tokens. This work [60] adaptively removes irrelevant token interference to allow a more compact self-attention computation.
The channel attention (CA) mechanism was first introduced to image super-resolution by Zhang et al. [61] in the deep residual channel attention network (RCAN). The architecture consists of four parts: shallow feature extraction, deep feature extraction, upscaling, and reconstruction. To explore the relationship between the inter-channel and intra-channel aspects of spatial and channel attention, Hui et al. [62] proposed a lightweight image super-resolution architecture called the Information Multi-Distillation Network (IMDN). In this architecture, the authors replace the global average pooling operation with the cumulative sum of the mean and standard deviation, and utilize cascaded distillation blocks to progressively extract hierarchical features.
To address network size and further improve training efficiency, stepwise feedback training and multi-feature maps fusion (SFTMFM) was proposed by Yao et al. [63]. A stepwise feedback training strategy is designed to further enhance the model's reconstruction ability; this strategy combines a multi-feature maps fusion module with a cross-feature maps attention module as a feedback mechanism to gradually reconstruct higher-quality images. For single image super-resolution, Zhao et al. [64] suggested a spatial shuffle multi-head self-attention that can effectively represent long-range pixel dependencies without requiring additional processing power; they also propose a local perception module that combines the translational invariance and local connectivity of convolutional neural networks. The feature-enhanced fused attention (FE-FAIR) method for image super-resolution was proposed by Guo et al. [65] to address the limitations of multi-head self-attention with shifted windows. Cao et al. [66] proposed an unsupervised hybrid network of transformer and CNN (uHNTC) for blind hyperspectral image (HSI) and multispectral image (MSI) fusion, addressing the critical challenge of generating high spatial resolution hyperspectral images (HR-HSI) from low spatial resolution hyperspectral images (LR-HSI). In this approach, the authors used three subnetworks designed to capture features more efficiently; a similar approach is also discussed in [67]. To further improve the performance of SR networks, EDT [68] investigates the influence of pre-training procedures on transformer techniques. Nevertheless, these studies do not mix global and local features properly, thereby misjudging the significance of shallow features. To improve the model's ability to render image details, our model concentrates on more efficient shallow feature extraction and on feature fusion throughout the deep feature extraction phase. These attention-based methods have achieved good performance, but an efficient and lightweight SR model is still needed because these methods incur higher computational costs and more parameters. This paper proposes a novel technique to tackle these problems by applying a combined attention mechanism to super-resolution networks. Our approach fully utilizes the benefits of the combined attention operation to build lightweight, highly efficient architectures that produce excellent results for image super-resolution.
Proposed Method
The proposed CANS network architecture for lightweight image SR is shown in figure 2. CANS is based on a multi-path approach to feature extraction. The multi-path design includes a shallow block-based network (SBN) path, a deep block-based network (DeBN) path, and a dense block-based network (DBN) path, as shown in figure 2. The connections between the different attention blocks are shown in figure 3. Within this multi-path architecture, the triplet channel attention network (TCAN) block is designed to be the backbone of the proposed model. To further improve performance, a Depthwise Separable Convolution (DSC) operation and an UpSampling block (U-Block) are used to recover a visually pleasing HR output image. The idea of combining shallow, deep, and dense network architectures is borrowed from [39], [40], and [69]. In the shallow block-based network (SBN) path, we use two depthwise separable convolution layers, each followed by LReLU, placed before and after the UpSampling (deconvolution) layer. The deep block-based network (DeBN) path uses two depthwise separable convolution layers with LReLU; the resulting features are fed to a set of Triplet Channel Attention Network (TCAN) blocks with a local skip connection. Similar connections are used in the DBN path to extract high-quality features from the original LR input. The final HR image is the sum of the reconstructions from the three paths:\begin{equation*} I^{HR} = R_{S}^{HR}(I_{LR}, \phi _{S}) + R_{D}^{HR}(I_{LR}, \phi _{D}) + R_{DE}^{HR}(I_{LR}, \phi _{DE}), \tag {1}\end{equation*} where $R_{S}^{HR}$, $R_{D}^{HR}$, and $R_{DE}^{HR}$ denote the reconstruction mappings of the three paths, with $\phi _{S}$, $\phi _{D}$, and $\phi _{DE}$ their respective parameters.
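As a minimal illustration of how the three paths in Eq. (1) can be combined, the following Keras sketch (Keras being the framework used in our experiments) builds placeholder versions of the SBN, DeBN, and DBN paths and sums their reconstructions. All layer counts, filter widths, and the scale factor are illustrative assumptions rather than the exact CANS configuration, and the TCAN blocks of the deep path are omitted here.

```python
# A hedged sketch of the three-path sum in Eq. (1); sizes are placeholders.
from tensorflow.keras import Input, Model, layers

def sbn_path(x, scale=2):
    # Shallow path: DSC + LReLU before and after the deconvolution layer
    y = layers.SeparableConv2D(32, 3, padding='same')(x)
    y = layers.LeakyReLU(0.2)(y)
    y = layers.Conv2DTranspose(32, 3, strides=scale, padding='same')(y)
    y = layers.SeparableConv2D(3, 3, padding='same')(y)
    return layers.LeakyReLU(0.2)(y)

def debn_path(x, scale=2):
    # Deep path: DSC layers with LReLU; the TCAN blocks would follow here
    y = x
    for _ in range(2):
        y = layers.SeparableConv2D(32, 3, padding='same')(y)
        y = layers.LeakyReLU(0.2)(y)
    return layers.Conv2DTranspose(3, 3, strides=scale, padding='same')(y)

def dbn_path(x, scale=2):
    # Dense path: each output is concatenated with all previous features
    feats = x
    for _ in range(3):
        y = layers.SeparableConv2D(16, 3, padding='same')(feats)
        y = layers.LeakyReLU(0.2)(y)
        feats = layers.Concatenate()([feats, y])  # dense skip connection
    return layers.Conv2DTranspose(3, 3, strides=scale, padding='same')(feats)

lr_in = Input(shape=(None, None, 3))
# Eq. (1): the HR estimate is the element-wise sum of the three reconstructions
hr_out = layers.Add()([sbn_path(lr_in), debn_path(lr_in), dbn_path(lr_in)])
model = Model(lr_in, hr_out)
```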
A. Depthwise Separable Convolution
Standard convolution applies a $K \times K$ kernel jointly across all input channels, so its parameter count grows with the product of the input and output channel counts. Depthwise separable convolution factorizes this operation into two steps. The depthwise convolution first filters each input channel independently:\begin{equation*} G(y,x,j) = \sum _{u=1}^{K} \sum _{v=1}^{K} K (u,v,j) \times I(y+u-1, x+v-1,j), \tag {2}\end{equation*} and the pointwise convolution then combines the depthwise outputs across channels with $1 \times 1$ filters:\begin{equation*} O(y,x,l) = \sum _{j=1}^{C_{in}} G(y,x,j) \times P(j,l), \tag {3}\end{equation*} where $I$ is the input, $K$ the depthwise kernel, $P$ the pointwise kernel, and $C_{in}$ the number of input channels.
Basic diagram of the different operations performed on convolutions: (a) standard convolution operation and (b) depthwise separable convolution operation.
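To make the parameter savings of this factorization concrete, the sketch below builds both variants in Keras and prints their parameter counts; the channel and kernel sizes are arbitrary examples, not the CANS settings.

```python
# Sketch contrasting standard convolution with the depthwise separable
# factorization of Eqs. (2)-(3); channel counts are illustrative.
from tensorflow.keras import Sequential, layers

c_in, c_out, k = 64, 64, 3

standard = Sequential([layers.Conv2D(c_out, k, padding='same',
                                     input_shape=(None, None, c_in))])

separable = Sequential([
    layers.DepthwiseConv2D(k, padding='same',
                           input_shape=(None, None, c_in)),  # Eq. (2): per-channel KxK filtering
    layers.Conv2D(c_out, 1, padding='same'),                 # Eq. (3): 1x1 pointwise combination
])

# Parameter count drops from roughly k*k*c_in*c_out to k*k*c_in + c_in*c_out
print(standard.count_params(), separable.count_params())
```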
B. Activation Functions
Earlier image super-resolution algorithms, such as SRCNN and VDSR, utilized ReLU as their activation function. ReLU works well for shallow networks, but in deeper networks the dying ReLU problem [70] occurs, introducing vanishing gradients during training. We therefore substituted the ReLU activation function with the Parametric Rectified Linear Unit (PReLU) and Leaky ReLU (LReLU) activation functions. PReLU and LReLU are the activation functions used in our CANS model; they resolve these problems and yield faster convergence during training. PReLU can be written mathematically as:\begin{equation*} PReLU (x_{i}) = max(x_{i}, 0) + a_{i} min(0, x_{i}), \tag {4}\end{equation*}
where $a_{i}$ is a learnable slope for negative inputs. The feature map of the $l$-th layer is then computed as:\begin{equation*} F_{l} (Y) = PReLU (W_{l} \times F_{l-1} (Y) + B_{l}), \tag {5}\end{equation*} where $W_{l}$ and $B_{l}$ denote the weights and biases of the $l$-th layer and $Y$ is the input.
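As a quick sanity check of Eq. (4), the following NumPy snippet evaluates PReLU directly; the slope value is an arbitrary example (a fixed constant turns PReLU into Leaky ReLU).

```python
# Direct NumPy check of Eq. (4); a is the learnable per-channel slope.
import numpy as np

def prelu(x, a):
    # max(x, 0) + a * min(0, x)
    return np.maximum(x, 0) + a * np.minimum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x, a=0.25))   # [-0.5, -0.125, 0.0, 1.5]
```

In Keras, these activations correspond to the built-in `PReLU` and `LeakyReLU` layers.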
C. Attention Block
Previously, most deep learning-based image super-resolution algorithms treated the features in intermediate layers uniformly. To address this, the Squeeze-and-Excitation Network (SENet) [51] architecture was introduced to recalibrate channel-wise feature responses in convolutional neural network models. Furthermore, recent research [46], [51] on image SR demonstrates that attention strategies help reconstruct high-quality HR images, improving performance significantly compared to other state-of-the-art methods. Enhancing high-frequency information by learning distinct feature maps through attentive blocks is therefore an excellent choice for any deep CNN model.
Our proposed attention block is named the Triplet Channel Attention Network (TCAN) block, as shown in figure 5. TCAN is an innovative architecture for improving feature extraction in deep learning models, especially for image processing tasks like image super-resolution. Its design is driven by the need to effectively utilize both spatial and channel information to enhance feature representation quality. This is accomplished by incorporating two essential components: the Spatial Attention Network (SAN) block and the Channel Attention Network (CHAN) block. The motivation behind the TCAN block is to capture features more efficiently, because traditional convolutional neural networks (CNNs) often struggle to capture complex relationships between spatial and channel features due to their linear processing nature. TCAN overcomes this limitation by utilizing an attention mechanism that emphasizes the most pertinent features in both dimensions, thereby improving the model's capacity to learn complex patterns in the data. Furthermore, many existing attention mechanisms, such as those relying solely on global average pooling, can lead to information loss and fail to capture fine-grained details. TCAN alleviates this problem by employing a triplet structure that integrates spatial and channel attention, enabling a more thorough representation of the input data. Finally, integrating the SAN and CHAN blocks improves the model's robustness. The TCAN architecture consists of two primary branches that work in parallel to extract and enhance features from the input data: the Spatial Attention Network (SAN) block and the Channel Attention Network (CHAN) block.
The SAN block is a sophisticated type of block designed for image super-resolution tasks. It employs a combination of pooling, deconvolution, convolutional layers, and activation functions to effectively extract and reconstruct high-resolution features from low-resolution inputs. The SAN block begins with a max-pooling layer, which reduces the spatial dimensions of the input feature maps while retaining the most salient features. Max-pooling operates by selecting the maximum value from each local region of the input, effectively downsampling the feature map. The main purpose of this layer is to capture the most significant features while discarding less important information, which is crucial for focusing on the essential aspects of the image that contribute to super-resolution. Following the max pooling layer, a deconvolution layer (also known as a transposed convolution layer) is employed. This layer increases the spatial dimensions of the pooled feature maps, reversing the down-sampling effect of the pooling layer. The deconvolution layer helps in reconstructing the spatial resolution of the feature maps, allowing the network to generate high-resolution outputs from the lower-dimensional representations. After the deconvolution layer, two convolutional layers (CNN) are applied sequentially. These layers perform standard convolution operations, applying learned filters to the feature maps. The convolutional layers refine the features extracted from the previous layers. They enhance important details and patterns in the data, enabling the network to learn complex representations necessary for high-quality image reconstruction. At the end of the two CNN layers, a PReLU (Parametric Rectified Linear Unit) activation function is applied. PReLU is an advanced version of the ReLU activation function that allows for a small, learnable slope for negative inputs. This activation function helps to mitigate the “dying ReLU” problem by allowing a small gradient when the input is negative, thus ensuring that all neurons can contribute to learning. This leads to better performance and faster convergence during training. Finally, a single CNN layer is used to produce the output feature map. This layer consolidates the features learned from the previous layers into a final representation that is suitable for generating a high-resolution image. The final CNN layer ensures that the output retains the necessary details and structures for accurate image reconstruction, effectively combining all the learned features from the previous layers.
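A minimal Keras sketch of the SAN block, following the textual description above, is given below; the pool size, stride, and filter counts are assumptions, not the exact configuration.

```python
# A hedged sketch of the SAN block: max-pooling, a deconvolution that
# restores spatial size, two refinement convolutions with PReLU, and a
# final consolidating convolution. Filter counts are placeholders.
from tensorflow.keras import layers

def san_block(x, filters=64):
    y = layers.MaxPooling2D(pool_size=2)(x)                 # keep salient responses
    y = layers.Conv2DTranspose(filters, 3, strides=2,
                               padding='same')(y)           # undo the downsampling
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)                 # learnable negative slope
    return layers.Conv2D(filters, 3, padding='same')(y)     # consolidate features
```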
The CHAN block uses two average pooling layers and one max-pooling layer connected in parallel. Average pooling reduces the spatial dimensions of the input feature maps while retaining essential information by averaging values in local regions. This multi-scale feature extraction is crucial for understanding fine details and broader contextual information in the image. A max-pooling layer is introduced alongside the average pooling layers as a third branch; max pooling selects the maximum value from each local region, emphasizing the most prominent features while discarding less significant information. This helps retain critical features that would be lost through averaging, providing a complementary perspective to the average pooling layers. After processing through the three branches (two average pooling and one max pooling), the outputs are summed together. This summation integrates the diverse information captured by each pooling method, creating a more robust feature representation that is then used for subsequent reconstruction of the high-resolution image. The main advantages of the CHAN block architecture are enhanced feature representation, multi-scale information capture, and efficient computation. The resulting pooled features are sent through an UpSampling (deconvolution) layer and a CNN layer followed by LReLU to reconstruct the HR features. Similarly, the spatial attention network block relies on one max-pooling layer followed by deconvolution and convolution layers to generate the HR features. Finally, both attention branches are concatenated and fed to the UpSampling Block (U-Block); a sketch of this structure follows.
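The sketch below outlines the CHAN block and the final concatenation of both branches, reusing `san_block` from the previous sketch. The two average-pooling branches are given different window sizes (an assumption, to reflect the multi-scale intent) but identical strides and padding so that the three branches can be summed.

```python
# A hedged sketch of the CHAN block and the TCAN branch concatenation.
from tensorflow.keras import layers

def chan_block(x, filters=64):
    avg1 = layers.AveragePooling2D(2, strides=2, padding='same')(x)
    avg2 = layers.AveragePooling2D(3, strides=2, padding='same')(x)
    mx = layers.MaxPooling2D(2, strides=2, padding='same')(x)
    y = layers.Add()([avg1, avg2, mx])                      # fuse the pooled views
    y = layers.Conv2DTranspose(filters, 3, strides=2,
                               padding='same')(y)           # restore spatial size
    y = layers.Conv2D(filters, 3, padding='same')(y)
    return layers.LeakyReLU(0.2)(y)

def tcan_block(x, filters=64):
    # The SAN and CHAN branches run in parallel and are concatenated
    # before being fed to the U-Block.
    return layers.Concatenate()([san_block(x, filters), chan_block(x, filters)])
```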
D. U-Block
Researchers [14], [19] suggested that increasing the resolution of LR images ahead of the first CNN layer raises the computational cost and harms important information, because processing speed depends on the resolution size. Additionally, interpolation used as a pre-processing step before the first layer places an extra burden on model training compared to post-processing UpSampling at the final layers. Furthermore, earlier approaches depend on a single path with one deconvolution layer to enlarge the LR image; although this strategy improves results compared to previous approaches, the reconstructed HR image often appears blurry. To handle these problems, we apply the UpSampling Block technique and extract features at the final stage to enlarge the LR features into the reconstructed HR output image, as shown in figure 6.
In figure 6, the UpSampling layer $US\uparrow$ processes the incoming features $US^{in\uparrow}$ as follows:\begin{align*} L^{f\downarrow } & = US\uparrow (US^{in\uparrow }), \tag {6}\\ H^{f\uparrow } & = D_{S} (P_{ReLU}(Weight_{US\uparrow }^{3\times 3} \times US^{in\uparrow } + bias_{US\uparrow }^{3\times 3} )), \tag {7}\\ US^{out\uparrow } & = L^{f\downarrow } + H^{f\uparrow } \tag {8}\end{align*} where $L^{f\downarrow}$ is the directly upsampled low-frequency branch, $H^{f\uparrow}$ is the learned high-frequency branch produced by a $3\times 3$ convolution with PReLU followed by the upsampling operation $D_{S}$, and $US^{out\uparrow}$ is the block output.
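A minimal sketch of the U-Block in Eqs. (6)-(8) is shown below; the concrete operators (nearest-neighbor upsampling for Eq. (6), a 1x1 channel-matching convolution, and a transposed convolution for $D_{S}$) are assumptions based on the equations rather than the exact implementation.

```python
# A hedged sketch of the U-Block: a directly upsampled low-frequency
# branch plus a learned high-frequency branch, summed per Eq. (8).
from tensorflow.keras import layers

def u_block(x, filters=64, scale=2):
    low = layers.UpSampling2D(size=scale)(x)                    # Eq. (6)
    low = layers.Conv2D(filters, 1, padding='same')(low)        # match channel count
    high = layers.Conv2D(filters, 3, padding='same')(x)         # 3x3 weights of Eq. (7)
    high = layers.PReLU(shared_axes=[1, 2])(high)
    high = layers.Conv2DTranspose(filters, 3, strides=scale,
                                  padding='same')(high)         # D_S upsampling step
    return layers.Add()([low, high])                            # Eq. (8): branch sum
```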
E. Shallow, Deep, and Dense Network Branches
The design of the shallow network is straightforward, which improves training efficiency and helps restore more image features. With the joint training of three parallel network paths, the reconstructed image detail is more accurate and the performance is admirable. The shallow network branch consists of a feature extraction part and a reconstruction part, using one depthwise separable convolution layer and one bottleneck layer.
Experiment
In this section, we first describe the technique for selecting training datasets and the model hyper-parameters. We then report quantitative and qualitative assessments on five publicly available benchmark datasets. Finally, we evaluate the computational cost and running time in terms of PSNR versus the number of parameters (in K).
A. Training and Testing Datasets
The most frequently utilized image datasets for training are the 91 images of Yang et al. [71], the Berkeley Segmentation Dataset (BSDS100) [72], and the DIV2K dataset [73]; these datasets are employed by widely cited image SR methods such as VDSR [15], DRCN [25], LapSRN [35], RCAN [61], and EDSR [16]. In our approach, we utilized the 91 images of Yang et al. [71], along with 100 images from the BSDS100 [72] dataset and 100 images from the DIV2K [73] dataset, to train and validate the CANS model. Data augmentation techniques are employed to stabilize the training process and mitigate the risk of overfitting. During dataset generation, the original high-resolution images from all three datasets are first gathered into one folder, and each high-resolution image is then cropped to a multiple of the enlargement factor. The network is trained by minimizing the mean squared error between the reconstructed and ground-truth images:\begin{equation*} L(\theta ) = \frac {1}{m} \sum _{i=1}^{m} \|F(Y_{i}; \theta ) - X_{i}\|^{2}, \tag {9}\end{equation*} where $Y_{i}$ and $X_{i}$ denote the $i$-th LR input and HR ground-truth image, $F(\cdot;\theta)$ is the network with parameters $\theta$, and $m$ is the number of training samples.
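A hedged sketch of the pair-generation step is given below: each HR image is cropped to a multiple of the scale factor and then bicubic-downsampled to obtain the LR input. The file path handling and the normalization to [0, 1] are illustrative assumptions.

```python
# Sketch of LR/HR training-pair generation; paths and scale are placeholders.
import numpy as np
from PIL import Image

def make_pair(hr_path, scale):
    hr = Image.open(hr_path).convert('RGB')
    w, h = hr.size
    w, h = (w // scale) * scale, (h // scale) * scale      # multiple of the factor
    hr = hr.crop((0, 0, w, h))
    lr = hr.resize((w // scale, h // scale), Image.BICUBIC)
    return np.asarray(lr) / 255.0, np.asarray(hr) / 255.0

# Eq. (9) is the per-batch mean squared error; in Keras this corresponds to
# model.compile(optimizer='adam', loss='mse')
```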
For evaluation, we utilize five widely recognized test datasets: Set5 [72], Set14 [74], BSDS100 [75], Urban100 [76], and Manga109 [77]. The quantitative performance of single image SR is evaluated with two quality metrics: PSNR and SSIM. Our architecture is implemented under the Windows 10 operating system on an NVIDIA GeForce RTX 2070 GPU with 64GB RAM. Under the Keras 2.2.1 framework, we trained our CANS on the selected enlargement factors.
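The two metrics can be computed as in the sketch below; the paper does not specify its evaluation toolkit, so the use of scikit-image here is an assumption (and note that many SR works compute PSNR/SSIM on the luminance channel after converting to YCbCr).

```python
# A hedged PSNR/SSIM evaluation sketch using scikit-image (>= 0.19 for
# the channel_axis argument); inputs are float arrays in [0, 1].
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(sr, hr):
    psnr = peak_signal_noise_ratio(hr, sr, data_range=1.0)
    ssim = structural_similarity(hr, sr, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```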
B. Comparisons with Existing State-of-the-Art Methods
The performance of our CANS was assessed alongside several image super-resolution algorithms, including Bicubic, SRCNN [13], VDSR [15], DRCN [25], LapSRN [35], CARN [48], IMDN [62], EDT [68], FE-FAIR [65], and SFTMFM [63]. Table 1 displays the quantitative results for the evaluated scaling factors.
For enlargement factor
Figure 7 depicts the reconstruction performance of CNN-based SR approaches versus the number of network parameters. Owing to the proposed depthwise separable convolutions, our CANS method has fewer model parameters than other state-of-the-art methods. Additionally, we assess our suggested model in terms of execution time versus PSNR. The execution time is calculated as the average over the 14 images of Set14 for each method, as shown in figure 8; processing is executed on the complete image. For the other methods, we used the publicly available codes offered by the authors, run on a 3.6GHz Intel i7 CPU machine with 32GB RAM; an NVIDIA RTX 2070 GPU with 11 GB of memory is used for testing. Finally, we compared eight deep image SR methods in terms of computational cost, measured in floating point operations (FLOPs), versus PSNR at the evaluated enlargement factor.
Performance comparison of K parameters in relation to PSNR (dB) and SSIM. The results are derived from the BSDS100 image dataset for a scaling factor of
Speed comparison of PSNR against inference time (in seconds) on the Set14 image dataset for a scaling factor
Comparison of computational cost regarding FLOPs versus PSNR using SET14 with enlargement factor
Quantitative comparison of PSNR and SSIM across all publicly available test datasets with a scaling factor of
C. Qualitative Comparison
This section assesses our CANS model qualitatively. We take images from the Set14 and BSDS100 test datasets for testing at a challenging enlargement scale.
Visual performance of image super-resolution on the Set14 and BSDS100 datasets with a
Visual performance of image super-resolution on the Set5, Set14, and BSDS100 datasets with an
D. Ablation Study
1) Investigation of the Components of CANS
We perform an ablation study of our proposed model on the SBN, DeBN, and DBN block-based network paths using the Yang91 and BSDS100 training samples.
2) Training Convergence Analysis
Figure 13 shows the PSNR and SSIM curves for our CANS during the training and validation phases on Set5 for the evaluated enlargement factors.
3) Quantitative Analysis of Attention Blocks
In this section, we conduct a more detailed assessment of the performance of our proposed TCAN block against various attention blocks, namely the CAN, SA, and SEA blocks. The performance of the SEA block and our proposed block are close to each other, but our TCAN block offers a clear advantage in the number of parameters compared to all the other attention blocks. Table 3 compares the attention blocks across several metrics, including inference time, parameters, Multi-Adds, memory size, FLOPs, PSNR, and SSIM. This overall comparison provides a clearer and more practical understanding of the model's performance.
4) Impact of Input Image Size
The dimensions of the input image are also essential for assessing the performance of the network architecture. An ablation study was performed on different input image sizes.
Ablation study performed on different input image sizes with different attention blocks.
E. Discussion
The comparative analysis of different attention mechanisms in image super-resolution centers on our proposed channel attention block, referred to as the TCAN block. The attention mechanisms chosen for this comparison are the Channel-Attention Block (CA), the Spatial-Attention Block (SA), and the Self-Attention Block (SEA). The CA block concentrates on the interdependencies among different channels in the feature maps; by adjusting the significance of each channel, the model can enhance the features most relevant to the super-resolution task. The SA block focuses on the spatial characteristics of the input image, enabling the model to target specific regions that are more pertinent to the task; this amplifies important areas while diminishing less informative ones. The SEA mechanism allows the model to concentrate on various parts of the input image by analyzing relationships among all pixels, capturing long-range dependencies that are vital for understanding image context. Our proposed TCAN block seeks to integrate the strengths of these mechanisms, improving the reconstruction of high-resolution images from low-resolution inputs. To ensure a fair and thorough comparison among the different attention mechanisms, six quality metrics were utilized. The PSNR metric evaluates the ratio between the maximum possible power of an image and the power of the noise that degrades its representation fidelity; a higher PSNR value indicates superior image quality, making it an essential metric for assessing super-resolution performance. SSIM measures perceptual image quality by comparing the similarity between original and reconstructed images; it considers luminance, contrast, and structure, providing a more holistic view of image quality than pixel-wise metrics. The number of parameters indicates the complexity of each attention block; fewer parameters often lead to faster inference times and lower memory usage, which are important for real-time applications. Floating-point operations (FLOPs) measure the computational cost associated with each model, offering insight into the processing power needed for inference. Inference time measures how long each attention block takes to process an input image and produce an output. Memory usage assesses the amount of memory consumed by each attention block during inference; efficient memory usage allows models to be deployed on devices with limited resources. We compare the TCAN block with these widely used attention mechanisms across all of the above metrics, as shown in figure 15; the results indicate that the proposed TCAN block achieves higher performance with lower resource consumption. Furthermore, we assess more intuitive quality metrics to evaluate the performance of our proposed TCAN block. In the ablation study, we employed all four attention blocks alongside nine quality metrics, including PSNR, SSIM, Mean Squared Error (MSE), Visual Information Fidelity (VIF), Universal Image Quality Index (UQI), Root Mean Square Deviation (RMSE), Multi-Scale Structural Similarity (MS-SSIM), Quality Without Reference (QNR), and Learned Perceptual Image Patch Similarity (LPIPS), as detailed in Table 4. A qualitative evaluation was conducted on the same Set5 test dataset with the same scaling factor.
Practical implementation of our proposed CANS model for image super-resolution in resource-constrained applications requires careful consideration of computational efficiency and performance. Several factors matter here. The CANS model features a lightweight architecture suitable for resource-constrained environments, such as mobile devices or edge computing platforms, which benefit greatly from lightweight models. Incorporating a channel attention mechanism within our architecture substantially decreases both the number of parameters and the computational cost; for instance, the TCAN attention block allows the model to focus on essential features without excessive complexity, optimizing performance while maintaining low resource usage. Furthermore, incorporating adaptive residual connections within the channel attention framework can enhance feature fusion without increasing the model size; this enables the model to adjust the significance of features according to the input, improving performance in low-resource situations. One application of the CANS model is in mobile devices and Internet of Things (IoT) systems, where the channel attention-based model can be integrated into AI enhancement features, such as improving the quality of images captured in low-light conditions or enhancing video streams in real time. A second major application is remote sensing and surveillance, where bandwidth and processing power may be limited; a channel attention model can help improve image quality for analysis without requiring extensive computational resources.
Conclusion and Future Work
This paper presents a rapid and efficient Combined Attention Network (CANS) designed for single image super-resolution (SISR). The network derives features directly from the original low-resolution (LR) input to produce a high-quality, high-resolution (HR) output image. The proposed approach employs shallow, deep, and dense network architectures along with a combined attention mechanism to extract various features. The whole framework includes three paths: the shallow block-based network (SBN) path, the deep block-based network (DeBN) path, and the dense block-based network (DBN) path. The suggested strategies ensure that the CANS architecture achieves rapid convergence while maintaining a low computational cost; this is achieved by substituting traditional pre-processing interpolation upscaling with a learned deconvolution layer and replacing standard convolution with depthwise separable convolution operations. Additionally, our CANS model incorporates the Triplet Channel Attention Network mechanism, which selectively emphasizes crucial input elements, thereby enhancing prediction accuracy and boosting computational efficiency. Comprehensive quantitative and qualitative evaluations on a range of images from five representative image super-resolution datasets demonstrate that our algorithm delivers impressive super-resolution performance while offering a favorable balance between computational cost and visually appealing perceptual quality. Although our proposed method can increase the perceptual quality of the LR image to some extent, there are some limitations: the reconstructed image is still blurry when the result is evaluated at higher scale factors.