Introduction
Video super-resolution (VSR) is a crucial task in computer vision and image processing and plays an important role in video enhancement. It has a wide range of applications, including satellite remote sensing [1], unmanned aerial vehicle surveillance [2], and high-definition television [3]. With advances in mobile communication technology, super-resolution reconstruction of larger images and videos can now be performed efficiently. Moreover, the popularity of high-definition (HD) and ultra-high-definition (UHD) display devices has created a pressing need for the development of super-resolution techniques. High-definition video has become an essential part of daily life, making research on the super-resolution of low-resolution videos particularly important.
Before the emergence of deep learning, traditional super-resolution methods relied on simple upsampling techniques such as nearest-neighbor, bilinear, or bicubic interpolation. However, images restored with these algorithms fail to achieve the desired quality. In recent years, deep learning techniques [4] have developed rapidly and permeated many fields, showing great promise for super-resolution. The pioneer of deep learning for image super-resolution is the SRCNN (Super-Resolution Convolutional Neural Network) proposed by Dong et al. [5] of the Chinese University of Hong Kong. SRCNN employs a three-layer convolutional neural network to sequentially perform feature extraction, non-linear mapping, and image reconstruction, training an end-to-end single-image super-resolution network. It marked the first attempt to apply deep learning to super-resolution reconstruction, and its results surpassed those of traditional methods. SRCNN also conducted multiple comparative experiments on convolutional kernel size and network depth, suggesting that, because the network is sensitive to parameter initialization and learning rate, deeper networks are harder to converge or may converge to local minima; deeper is therefore not necessarily better. However, such a shallow network cannot fully explore and exploit image features, leading to a lack of fine detail in the generated SR images. With the continued development of mainstream classification architectures such as AlexNet [6] and VGGNet [7], the use of stacked small convolution kernels in place of larger kernels in VGGNet increased network depth while preserving a sufficient receptive field. Correspondingly, Dong et al. [8] began using smaller convolution kernels and more mapping layers in image super-resolution networks. Following ResNet [9], various variants emerged, with DenseNet [10] being a representative one. Tong et al. [11] and Zhang et al. [12] used the dense connection idea of DenseNet to fully extract multi-level features, achieving good results in image super-resolution. Lim et al. [13] removed Batch Normalization [14] from the residual network, increased the number of residual blocks from 16 to 32, and improved performance by removing redundant modules from SRResNet [15] and scaling up the model. In 2018, Zhang et al. [21] introduced an image super-resolution reconstruction algorithm based on the Residual Channel Attention Network (RCAN). RCAN proposes a deep residual channel attention network that weights features across different channels, significantly improving the network's ability to discern and learn the high- and low-frequency information in low-resolution images.
Unlike image super-resolution, video super-resolution can exploit additional inter-frame information, namely the correlation between adjacent frames, so effectively utilizing inter-frame relationships is crucial in video super-resolution reconstruction. To make good use of this information, the adjacent frames must be aligned, and the alignment accuracy directly affects the reconstruction results. Several approaches have been proposed to address the alignment problem. Caballero et al. [16] used a convolutional neural network to estimate displacement parameters, which are then used to spatially transform the adjacent frames; the aligned low-resolution (LR) images are stacked together, and a sub-pixel convolutional layer reconstructs them into a super-resolved (SR) image, enabling an end-to-end video super-resolution network. Wang et al. [17] proposed SOF-VSR, a network that uses optical flow in a coarse-to-fine manner: it first predicts high-resolution (HR) optical flow progressively from coarse to fine, then uses the HR flow to perform motion compensation on the LR images, and finally reconstructs the SR image. In these methods, optical flow estimation is performed on each frame individually, followed by frame-by-frame alignment; this exploits temporal information only during the alignment step and incurs high computational costs. Tian et al. [18] proposed a temporally deformable alignment network for video super-resolution, in which motion estimation and motion compensation are treated as a single-stage task. Wang et al. [19] proposed video restoration with enhanced deformable convolutional networks (EDVR) and won first place by a significant margin in the CVPR NTIRE 2019 Image/Video Restoration Challenge. Kim et al. [20] constructed a simple end-to-end video super-resolution network using 3D convolutional neural networks (CNNs) and demonstrated that 3D convolutions can achieve better results in video super-resolution reconstruction tasks.
The reconstruction performance of these methods relies heavily on the quality of alignment. While some progress has been made by investigating more accurate optical flow estimation networks or by using deformable convolutions for improved alignment, the excessive parameter count and computational cost of 3D convolution make it difficult to construct deep networks. Moreover, although 3D convolution can effectively exploit temporal information, it treats all channels equally, without considering the varying impact of different channels on reconstruction quality. To address the above problems, this paper proposes the following innovations based on a 3D convolutional neural network:
In the conventional two-stage methods for video super-resolution reconstruction, which involve optical flow estimation and motion compensation, the initial step involves feature extraction in the spatial domain, followed by motion compensation in the temporal domain. However, this approach leads to the separation of spatio-temporal information in video sequences and a consequent reduction in their coherence. To address the limitations of the two-stage methods, this paper introduces a novel approach called the Fusion Attention Mechanism-based Back-Projection Recursive Network (D3DRRN), building upon previous research.
We propose a multi-feature extraction module in order to differentiate the importance level of information across different dimensions.
To address the limitations of global average pooling in merging information within the same channel, which may hinder the reconstruction of texture details, we propose a multi-attention (MA) module that combines temporal attention, spatial attention, and channel attention.
By further combining recursive structures and residual connections, we accelerate training speed, alleviate the loss of high-frequency information during training in deep networks, reduce parameter count, and decrease network size.
We utilize multiple short-skip connections and long-skip connections in our model to further expedite the propagation of feature information within the network, enabling the extraction of more high-level features.
Overall Network Framework
To address the limitations of two-stage methods, this paper introduces a novel approach, D3DRRN, based on deformable 3D convolution and a fusion attention mechanism. Building upon existing research, D3DRRN overcomes the drawbacks of the traditional two-stage pipeline by iteratively applying deformable 3D convolution and back-projection together with a fusion attention mechanism. This combination fuses spatial and temporal information, enhancing the coherence and effectiveness of spatio-temporal feature exploitation in video sequences. By combining residual networks with recursive networks, D3DRRN also addresses the excessive parameter and computational requirements of 3D convolution, which otherwise limit the construction of deep networks.
Furthermore, although 3D convolution effectively utilizes temporal information, it treats information across channels equally without considering the distinct impact that different channels have on the reconstruction quality. To enhance the network’s ability to capture complex inter-channel correlations and introduce greater non-linearity in the channel dimension, this paper improves upon the channel attention mechanism proposed in RCAN [21]. Additionally, multiple short-skip connections and long-skip connections are employed to facilitate the faster propagation of low-frequency information.
The D3DRRN structure proposed in this paper, as shown in Figure 1, consists of three main components: multi-scale feature extraction, deep feature extraction, and reconstruction network.
In Figure 1, the low-resolution input frames $[LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]$ are first fed into the multiscale feature extraction (MFE) network to obtain the shallow features $\tilde{F}$:
\begin{equation*}\tilde {F}=MFE([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{1}\end{equation*}
The shallow features are then refined by the recursive units (RU) of the deep feature extraction stage, and the result is added back to the shallow features through a residual connection to obtain the deep features $F'$:
\begin{equation*}F'=RU(\tilde {F})+\tilde {F} \tag{2}\end{equation*}
The reconstruction network $Re$ maps the deep features back to the image domain, and the output is added to the upsampled reference frame $B(LR_{t})$ to produce the high-resolution result:
\begin{equation*}HR=Re(F')+B(LR_{t}) \tag{3}\end{equation*}
Overall, the network can be expressed as
\begin{equation*}HR=D3DRRN([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{4}\end{equation*}
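As an illustration, the composition in Eqs. (1)–(4) can be sketched as follows in PyTorch. The module names (`mfe`, `recursive_units`, `reconstruct`) are placeholders for the networks described in the following subsections, and bicubic interpolation is assumed for the upsampling operator $B$; this is a minimal sketch, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def d3drrn_forward(lr_frames, mfe, recursive_units, reconstruct, scale=4):
    """Sketch of Eqs. (1)-(4). lr_frames: (N, C, T, H, W)."""
    shallow = mfe(lr_frames)                       # Eq. (1): shallow features from the MFE network
    feat = shallow
    for ru in recursive_units:                     # deep feature extraction with residual skips
        feat = ru(feat) + shallow                  # Eq. (2)
    center = lr_frames[:, :, lr_frames.size(2) // 2]           # reference (center) frame LR_t
    upsampled = F.interpolate(center, scale_factor=scale,
                              mode='bicubic', align_corners=False)  # B(LR_t), assumed bicubic
    # reconstruct() is assumed to output a (N, C, scale*H, scale*W) image
    return reconstruct(feat) + upsampled           # Eq. (3): HR = Re(F') + B(LR_t)
```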
A. Multiscale Feature Extraction Network
Due to the local connectivity property of convolution, smaller convolutional kernels tend to extract more localized and detailed features. Conversely, larger kernels possess a larger receptive field, enabling them to capture more global features. In video super-resolution tasks, significant motion can exist within the same scene, making a larger receptive field crucial for capturing these critical pieces of information.
In order to accommodate both a large receptive field and local details, this section presents the Multiscale Feature Extraction (MFE) module, inspired by GoogLeNet [22]. GoogLeNet introduced a parallel structure that processes the feature matrix across multiple branches concurrently and then concatenates the resulting feature matrices along the depth dimension to obtain the final output. The structure of the MFE module is depicted in Figure 2.
In Figure 2, the sequence of input LR images is first passed through a convolution $W^{3}$ with kernel size 3 to obtain the initial features $F$:
\begin{equation*}F=W^{3}([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{5}\end{equation*}
After activation, the features are processed in parallel by three branches with kernel sizes of 1, 3, and 5, producing the multiscale features $F_{0}$, $F_{1}$, and $F_{2}$:
\begin{equation*}F_{0,1,2}=W^{1,3,5}(\delta (F)) \tag{6}\end{equation*}
The three branches are then concatenated along the channel dimension, activated, and fused by the convolution $W^{1}$ to yield the shallow features $\tilde{F}$:
\begin{equation*}\tilde {F}=W^{1}(\delta ([F_{0},F_{1},F_{2}])) \tag{7}\end{equation*}
where $W^{k}$ denotes a convolution with kernel size $k$, $\delta$ denotes the activation function, and $[\cdot]$ denotes concatenation.
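A minimal PyTorch sketch of the MFE module described by Eqs. (5)–(7) is given below. The use of 3D convolutions, the channel width of 64, and the LeakyReLU activation for $\delta$ are assumptions.

```python
import torch
import torch.nn as nn

class MFE(nn.Module):
    """Multiscale feature extraction with parallel 1/3/5 branches, Eqs. (5)-(7)."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.head = nn.Conv3d(in_ch, ch, kernel_size=3, padding=1)   # W^3 in Eq. (5)
        self.b1 = nn.Conv3d(ch, ch, kernel_size=1, padding=0)        # W^1 branch
        self.b3 = nn.Conv3d(ch, ch, kernel_size=3, padding=1)        # W^3 branch
        self.b5 = nn.Conv3d(ch, ch, kernel_size=5, padding=2)        # W^5 branch
        self.fuse = nn.Conv3d(3 * ch, ch, kernel_size=1)             # W^1 fusion in Eq. (7)
        self.act = nn.LeakyReLU(0.1)                                 # delta (assumed)

    def forward(self, x):                         # x: (N, C, T, H, W)
        f = self.head(x)                          # Eq. (5)
        a = self.act(f)
        f0, f1, f2 = self.b1(a), self.b3(a), self.b5(a)              # Eq. (6)
        return self.fuse(self.act(torch.cat([f0, f1, f2], dim=1)))   # Eq. (7)
```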
B. Multi-Attention Module
In the channel attention module of RCAN, information within the same channel is fused using only global average pooling, which restricts the reconstruction of texture details. To consider global information and local details simultaneously, we present a channel attention (CA) module that combines global average pooling and global maximum pooling. Taking inspiration from CBAM [23], we integrate a time attention (TA) module, the channel attention (CA) module, and a spatial attention (SA) module into a multi-attention (MA) module.
In multi-frame inputs, not every frame has an equal impact on the reconstruction performance of the intermediate frame. In fact, some frames may introduce adverse information that negatively affects the reconstruction results. Hence, the purpose of time attention is to allocate different weights to each frame in the temporal dimension, amplifying the impact of frames that are more beneficial to the reconstruction quality. The time attention network is illustrated in Figure 3.
Assuming the temporal dimension of the input feature $\tilde{F}$ is $T$, the channel and spatial dimensions are first squeezed by global average pooling $H_{avg}$, and the resulting temporal descriptor is reduced by $W^{T}_{D}$:
\begin{equation*}t=W^{T}_{D}(H_{avg}(\tilde {F})) \tag{8}\end{equation*}
The descriptor is then activated, expanded back by $W^{T}_{U}$, and passed through the gating function $f$ to obtain the temporal weights $T'$:
\begin{equation*}T'=f(W^{T}_{U}(\delta (t))) \tag{9}\end{equation*}
Finally, the weights are applied to the input features along the temporal dimension:
\begin{equation*}F'_{T}=T'\tilde {F} \tag{10}\end{equation*}
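A hedged sketch of the time attention computation in Eqs. (8)–(10): the reduction ratio and the use of fully connected layers for $W^{T}_{D}$ and $W^{T}_{U}$ are assumptions, and the gating function $f$ is taken to be a sigmoid.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Time attention of Eqs. (8)-(10); frame count and reduction ratio are illustrative."""
    def __init__(self, frames=7, r=2):
        super().__init__()
        hidden = max(frames // r, 1)
        self.down = nn.Linear(frames, hidden)    # W^T_D
        self.up = nn.Linear(hidden, frames)      # W^T_U
        self.act = nn.ReLU()                     # delta
        self.gate = nn.Sigmoid()                 # f (assumed sigmoid)

    def forward(self, x):                        # x: (N, C, T, H, W)
        t = x.mean(dim=(1, 3, 4))                # H_avg over C, H, W -> (N, T)
        w = self.gate(self.up(self.act(self.down(t))))   # Eqs. (8)-(9)
        return x * w.view(x.size(0), 1, -1, 1, 1)         # Eq. (10)
```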
For consecutive multiple frames, the channel attention assigned to each frame is identical. Thus, Figure 4 presents an example of the channel attention mechanism using a feature map with a time dimension of 1. To allocate attention weights to each channel, it is necessary to compress the input feature map spatially. Assuming the input feature size is $h \times w \times c$, global maximum pooling $H_{max}$ and global average pooling $H_{avg}$ are applied separately, and each pooled descriptor is passed through the channel-reduction convolution $W^{C}_{D}$, the activation $\delta$, and the channel-expansion convolution $W^{C}_{U}$:
\begin{align*}C_{1}&=W^{C}_{U}(\delta (W^{C}_{D}(H_{max}(\tilde {F})))) \tag{11}\\ C_{2}&=W^{C}_{U}(\delta (W^{C}_{D}(H_{avg}(\tilde {F})))) \tag{12}\end{align*}
The two descriptors are summed, passed through the gating function $f$, and used to reweight the input features:
\begin{equation*}F'_{c}=f(C_{1}+C_{2})\tilde {F} \tag{13}\end{equation*}
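The dual-pooling channel attention of Eqs. (11)–(13) can be sketched as follows; the channel width and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention with both max and average pooling, Eqs. (11)-(13)."""
    def __init__(self, ch=64, r=16):
        super().__init__()
        self.down = nn.Conv3d(ch, ch // r, kernel_size=1)   # W^C_D
        self.up = nn.Conv3d(ch // r, ch, kernel_size=1)     # W^C_U
        self.act = nn.ReLU()                                # delta
        self.gate = nn.Sigmoid()                            # f

    def forward(self, x):                                   # x: (N, C, T, H, W)
        mx = x.amax(dim=(3, 4), keepdim=True)               # H_max: (N, C, T, 1, 1)
        avg = x.mean(dim=(3, 4), keepdim=True)              # H_avg: (N, C, T, 1, 1)
        c1 = self.up(self.act(self.down(mx)))               # Eq. (11)
        c2 = self.up(self.act(self.down(avg)))              # Eq. (12)
        return x * self.gate(c1 + c2)                       # Eq. (13)
```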
The importance of features may vary across different channels, and thus it is necessary to assign weights to different channels. Similarly, within the same channel, each pixel may have a different importance level. By assigning weights to each pixel, spatial attention can be formed as depicted in Figure 5. Similar to channel attention, spatial attention assigns a weight to each pixel and therefore requires compressing the feature map along the channel dimension. Assuming an input feature size of $h \times w \times c$, global maximum pooling and global average pooling along the channel dimension are followed by the convolutions $W^{S}_{D}$ and $W^{S}_{U}$:
\begin{align*}S_{1} &= W^{S}_{U}(\delta (W^{S}_{D}(H_{max}(\tilde {F})))) \tag{14}\\ S_{2} &= W^{S}_{U}(\delta (W^{S}_{D}(H_{avg}(\tilde {F})))) \tag{15}\end{align*}
The two maps are summed and passed through the gating function $f$ to produce the spatial weights applied to the input features:
\begin{equation*}F'_{s}=f(S_{1}+S_{2})\tilde {F} \tag{16}\end{equation*}
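A corresponding sketch of the spatial attention in Eqs. (14)–(16); the intermediate channel width and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Eqs. (14)-(16)."""
    def __init__(self, mid=8, k=7):
        super().__init__()
        self.down = nn.Conv3d(1, mid, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))  # W^S_D
        self.up = nn.Conv3d(mid, 1, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))    # W^S_U
        self.act = nn.ReLU()      # delta
        self.gate = nn.Sigmoid()  # f

    def forward(self, x):                       # x: (N, C, T, H, W)
        mx = x.amax(dim=1, keepdim=True)        # H_max along the channel dim
        avg = x.mean(dim=1, keepdim=True)       # H_avg along the channel dim
        s1 = self.up(self.act(self.down(mx)))   # Eq. (14)
        s2 = self.up(self.act(self.down(avg)))  # Eq. (15)
        return x * self.gate(s1 + s2)           # Eq. (16)
```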
C. Iterative Back Projection Module Based on Deformable 3D Convolution
DBPN [24] integrates iterative back-projection with deep learning to form a Deep Back-Projection Network. To perform the back-projection process with convolutional neural networks, DBPN introduces an up-projection unit and a down-projection unit, each consisting of two fundamental modules: a deconvolution module for upsampling and a convolution module for downsampling. When this design is extended to video with 3D convolutions, the parameter count of each three-dimensional upsampling layer is three times that of a two-dimensional upsampling layer, and the use of multiple three-dimensional upsampling layers further amplifies both the parameter count and the computational load. Beyond these costs, the limited graphics memory of each GPU poses additional constraints, and an excessive parameter count can hinder training.
In order to construct an upsampling layer without introducing an excessive parameter count, we compress and activate the temporal features as illustrated in Figure 6. Prior to upsampling, the feature maps $F'_{S}$ are compressed by the squeeze convolution $S$:
\begin{equation*}F_{S}=F'_{S}*S \tag{17}\end{equation*}
After undergoing a sequence of projection operations, the feature $L_{p}$ is expanded back by the expansion convolution $E$:
\begin{equation*}F_{E}=L_{p}*E \tag{18}\end{equation*}
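Reading `*` in Eqs. (17)–(18) as convolution, the squeeze and expansion operations can be sketched as two pointwise convolutions. The interpretation as channel compression and the channel widths are assumptions.

```python
import torch.nn as nn

# Squeeze (Eq. 17): reduce the feature width before the costly projection units.
# Expand (Eq. 18): restore the feature width after the projection sequence.
squeeze = nn.Conv3d(64, 32, kernel_size=1)   # S (channel widths assumed)
expand = nn.Conv3d(32, 64, kernel_size=1)    # E
```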
In the up-projection unit, the input LR feature map $L^{t-1}$ is first upsampled by the deconvolution $U^{1}_{t}$ to obtain $H^{t-1}_{0}$, which is projected back to the LR domain by $D_{t}$ to obtain $L^{t-1}_{0}$. The residual $e^{t}_{l}$ between the back-projected feature and the input is upsampled by $U^{2}_{t}$ and added to the initial estimate to produce the output HR feature $H^{t}$:
\begin{align*}H^{t-1}_{0}&=L^{t-1}*U^{1}_{t} \tag{19}\\ L^{t-1}_{0}&=H^{t-1}_{0}*D_{t} \tag{20}\\ e^{t}_{l}&=L^{t-1}_{0}-L^{t-1} \tag{21}\\ H^{t}_{1}&=e^{t}_{l}*U^{2}_{t} \tag{22}\\ H^{t}&=H^{t-1}_{0}+H^{t}_{1} \tag{23}\end{align*}
The computation process in the down-projection unit is the reverse of the up-projection unit. The upsampled feature $H^{t}$ is first downsampled by $D^{1}_{t}$ and projected back up by $U_{t}$; the HR-domain residual $e^{t}_{h}$ is then downsampled by $D^{2}_{t}$ and added to the initial estimate to obtain the output LR feature $L^{t}$:
\begin{align*}L^{t}_{0}&=H^{t}*D^{1}_{t} \tag{24}\\ H^{t}_{0}&=L^{t}_{0}*U_{t} \tag{25}\\ e^{t}_{h}&=H^{t}_{0}-H^{t} \tag{26}\\ L^{t}_{1}&=e^{t}_{h}*D^{2}_{t} \tag{27}\\ L^{t}&=L^{t}_{0}+L^{t}_{1} \tag{28}\end{align*}
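A hedged sketch of the up- and down-projection computations of Eqs. (19)–(28). Plain 3D (de)convolutions are used here in place of the paper's deformable 3D convolutions, and the kernel sizes, strides (2x spatial scale per unit), and channel width are assumptions.

```python
import torch.nn as nn

class UpProjection(nn.Module):
    """Up-projection unit, Eqs. (19)-(23)."""
    def __init__(self, ch=32, k=(3, 6, 6), s=(1, 2, 2), p=(1, 2, 2)):
        super().__init__()
        self.up1 = nn.ConvTranspose3d(ch, ch, k, stride=s, padding=p)   # U^1_t
        self.down = nn.Conv3d(ch, ch, k, stride=s, padding=p)           # D_t
        self.up2 = nn.ConvTranspose3d(ch, ch, k, stride=s, padding=p)   # U^2_t

    def forward(self, l_prev):
        h0 = self.up1(l_prev)     # Eq. (19)
        l0 = self.down(h0)        # Eq. (20)
        e = l0 - l_prev           # Eq. (21): LR-domain residual
        h1 = self.up2(e)          # Eq. (22)
        return h0 + h1            # Eq. (23)

class DownProjection(nn.Module):
    """Down-projection unit, Eqs. (24)-(28): the mirror of the up-projection unit."""
    def __init__(self, ch=32, k=(3, 6, 6), s=(1, 2, 2), p=(1, 2, 2)):
        super().__init__()
        self.down1 = nn.Conv3d(ch, ch, k, stride=s, padding=p)          # D^1_t
        self.up = nn.ConvTranspose3d(ch, ch, k, stride=s, padding=p)    # U_t
        self.down2 = nn.Conv3d(ch, ch, k, stride=s, padding=p)          # D^2_t

    def forward(self, h):
        l0 = self.down1(h)        # Eq. (24)
        h0 = self.up(l0)          # Eq. (25)
        e = h0 - h                # Eq. (26): HR-domain residual
        l1 = self.down2(e)        # Eq. (27)
        return l0 + l1            # Eq. (28)
```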
In order to extract high-frequency features, the up-projection and down-projection necessitate the use of larger convolutional kernels, leading to an increase in parameters. However, repeatedly performing up-projection and down-projection, as in DBPN, would not only result in a substantial parameter count but also make the network difficult to train. To enhance the network depth without adding parameters, we employ the parameter sharing characteristic of recursive neural networks and construct a recursive neural network based on iterative back-projection modules and attention modules.
Within this module, a fundamental attention back-projection unit is constructed, consisting of an attention module, a compression network, n iterative back-projection modules, and an activation network. Nonetheless, excessive recursion can pose challenges in training the model effectively. To mitigate this concern, residual connections are introduced to facilitate the flow of information by connecting the outputs of each recursive module. The Recursive Unit (RU) structure, as depicted in Figure 8, illustrates the architectural design of each recursive unit. For the first recursive module, the input comprises only the shallow features $\tilde{F}$; for the $u$-th recursive unit, the input is the sum of the previous unit's output $F'_{u-1}$ and the shallow features:
\begin{equation*}F'_{u}=RU^{u}(F'_{u-1}+\tilde {F}) \tag{29}\end{equation*}
The deep features produced after $u$ recursions can therefore be written as
\begin{align*}F'=RU^{u}(RU^{u-1}(\ldots (RU^{1}(\tilde {F})+\tilde {F})+\tilde {F})+\tilde {F})+\tilde {F} \tag{30}\end{align*}
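The recursion with residual connections in Eqs. (29)–(30) can be sketched as a loop that reuses one shared recursive unit, reflecting the parameter sharing mentioned above; the number of recursions is illustrative.

```python
import torch.nn as nn

class RecursiveBackbone(nn.Module):
    """Applies one shared recursive unit u times with residual skips to the shallow features."""
    def __init__(self, recursive_unit: nn.Module, num_recursions: int = 4):
        super().__init__()
        self.ru = recursive_unit      # shared parameters across all recursions
        self.u = num_recursions

    def forward(self, shallow):       # shallow = F~ from the MFE module
        feat = shallow
        for _ in range(self.u):
            # each pass computes RU(F'_{u-1} + F~) and adds the skip back to F~ (Eqs. 29-30)
            feat = self.ru(feat) + shallow
        return feat                   # F' in Eq. (30)
```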
D. Reconstruction Network
To alleviate computational demands, we employ low-resolution images during the training phase while keeping the image resolution constant. Subsequently, in the final stage of the network, we integrate sub-pixel convolution to perform the image upscaling. Sub-pixel convolution is a technique that performs upsampling by rearranging pixels from different channels. Assuming an upsampling factor of $r$, a convolution first produces $r^{2}$ feature channels for each output channel, and these channels are then rearranged so that every spatial position yields an $r \times r$ block of high-resolution pixels.
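A brief illustration of sub-pixel convolution using PyTorch's `PixelShuffle`; the channel counts are illustrative and only show how $r^{2}$ channels are rearranged into an $r \times r$ block of pixels.

```python
import torch
import torch.nn as nn

r = 4                                                        # upsampling factor
conv = nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1)    # produce r^2 channels per output channel
shuffle = nn.PixelShuffle(r)                                 # (N, 3*r^2, H, W) -> (N, 3, r*H, r*W)

x = torch.randn(1, 64, 32, 32)                               # a fused 2-D feature map
hr = shuffle(conv(x))                                        # shape: (1, 3, 128, 128)
```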
The primary goal of the network is to reconstruct a high-resolution (HR) image, which is represented as a two-dimensional dataset. However, the feature maps within the network exist in a three-dimensional format prior to the reconstruction process. In order to transform the feature maps from a three-dimensional to a two-dimensional representation, specific methods are employed to reduce the dimensionality of the network. During the convolution process, the absence of padding leads to a progressive reduction in the size of feature maps. Motivated by this observation, 3DSRNet [20] adopts a strategy that eliminates temporal padding. Specifically, it applies two consecutive convolutions with a temporal kernel size of 3 and no temporal padding, each of which reduces the temporal dimension by two. We denote this temporal-dimension reduction by $\rho$:
\begin{equation*}\vec {F}=\rho (F') \tag{31}\end{equation*}
The reduced features are then mapped by $\mu$ into a two-dimensional feature map $f$:
\begin{equation*}f=\mu (\vec {F}) \tag{32}\end{equation*}
Finally, the fused features are upscaled by the sub-pixel convolution $s$ and added to the upsampled reference frame $B(LR_{t})$ to obtain the reconstructed HR image:
\begin{equation*}HR=s(f)+B(LR_{t}) \tag{33}\end{equation*}
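A small shape check of the temporal-dimension reduction $\rho$ described above: 3D convolutions without temporal padding shrink the temporal dimension by two per layer. The frame count and channel width here are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 5, 32, 32)                          # (N, C, T=5, H, W), illustrative sizes
reduce_t = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=3, padding=(0, 1, 1)),   # no temporal padding: T 5 -> 3
    nn.Conv3d(64, 64, kernel_size=3, padding=(0, 1, 1)),   # T 3 -> 1
)
y = reduce_t(x)                                            # shape: (1, 64, 1, 32, 32)
feat_2d = y.squeeze(2)                                     # 2-D feature map for reconstruction
```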
Experiments and Analysis
A. Preparation Before Experiment
During the training of neural networks, the effectiveness of the network is not only dependent on its architecture and parameter count but also on the quality of the training dataset. Insufficient training samples or a lack of diversity and representativeness in the dataset can result in overfitting. Conversely, an excessive presence of noise and interference in the training set can impede network convergence. To achieve optimal network performance and facilitate fair comparisons with other algorithms, we utilized the publicly available Vimeo-90K dataset [25] for training purposes. Vimeo-90K is a widely used, large-scale, and high-quality dataset that is specifically curated for video processing tasks. It plays a significant role in training models for various video-related tasks, including video denoising, super-resolution reconstruction, and video frame interpolation.
To obtain the training data pairs of HR and LR images, the original video frames are used as HR ground-truth images, whereas the downsampled video frames are employed as LR images. During network training, patches are randomly cropped from the LR frames together with the corresponding HR regions. The pixel-wise reconstruction losses considered are the weighted $L_{1}$ and $L_{2}$ losses:
\begin{align*}L_{1}(y,\hat {y})&=w(\theta)|\hat {y}-y| \tag{34}\\ L_{2}(y,\hat {y})&=w(\theta)(\hat {y}-y)^{2} \tag{35}\end{align*}
where $y$ and $\hat{y}$ denote the ground-truth and reconstructed values, respectively, and $w(\theta)$ is a weighting term.
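A hedged sketch of the pixel-wise losses in Eqs. (34)–(35), treating $w(\theta)$ as a simple scalar weight since its exact form is not specified here.

```python
import torch

def weighted_l1(y_hat, y, w=1.0):
    """Eq. (34): weighted mean absolute error (w assumed scalar)."""
    return w * torch.mean(torch.abs(y_hat - y))

def weighted_l2(y_hat, y, w=1.0):
    """Eq. (35): weighted mean squared error (w assumed scalar)."""
    return w * torch.mean((y_hat - y) ** 2)
```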
The datasets utilized for testing are Vid4 [27] and SPMC-11 [28]. Vid4 is a test dataset composed of four video clips: “calendar,” “city,” “foliage,” and “walk.” Each video sequence comprises 31 frames and is employed to assess the overall performance of the trained models. SPMC-11 is constructed by assembling video clips depicting 11 diverse scenes, such as “car,” “city,” “people,” and “landscape.” Each video sequence in SPMC-11 consists of 31 frames and is employed for model evaluation. The experimental network architecture in this study was implemented using Python 3.7 and PyTorch 1.10.1 deep learning framework. The corresponding CUDA version used was 11.3. The experimental setup and hardware configuration are presented in Table 1.
B. Ablation Experiment
To validate the effectiveness of the proposed modules, this section conducts ablation experiments on the different modules. All networks are trained on the Vimeo-90K dataset and evaluated on the Vid4 dataset after applying the same data preprocessing. For all ablation experiments, the baseline comparison model includes all modules, with 4 recursive units and 2 attention back-projection (ABP) units. Comparisons are performed at 2x and 4x magnification scales.
1) Ablation Experiments on the Multi-Scale Feature Fusion Module
In this experiment, the comparative setup involves replacing the multi-scale feature extraction module with a conventional 3D convolution. The results are shown in Table 2, where bold font indicates the best results at different magnification scales. It can be observed that regardless of the 2x or 4x magnification, the utilization of the multi-scale fusion module leads to performance improvement. For the 2x magnification, the PSNR and SSIM values are improved by 0.42dB and 0.011, respectively, compared to not using the multi-scale fusion module. Similarly, for the 4x magnification, the PSNR and SSIM values are improved by 0.31dB and 0.017, respectively. These results indicate that leveraging the information between different scale feature maps can complement each other and yield better reconstruction results.
2) Ablation Experiments on the Attention Mechanism
In order to assess the distinct impacts of channel attention, spatial attention, and temporal attention on the network's reconstruction performance, comparative experiments were carried out by combining SA, TA, and CA in different configurations and evaluating the results on the Vid4 dataset at 2x and 4x magnification scales. The experimental results in Table 3 show that the model using SA+CA+TA achieves the highest performance, followed by SA+CA. The TA module contributes only marginal improvements, which can be attributed to the strong temporal feature extraction capability of 3D convolutions: in sufficiently deep networks, allocating separate weights to temporal features brings little additional gain.
3) Ablation Experiments on the Recursive Units
This section presents experiments that compare the number of attention back-projection (ABP) modules and recursive units (RU) in the network. By increasing the number of RU units while keeping the ABP module count constant, the network depth is increased without affecting the parameter count. Each network configuration is denoted by its number of RU units m and ABP units n.
4) Ablation Experiments on Different Input Frames
In order to determine the optimal number of consecutive input frames, which carry different amounts of spatio-temporal information, ablation experiments were conducted with 3, 5, and 7 input frames. The experimental results are shown in Table 5. It can be observed that the network performs best when the number of consecutive input frames is 7, because more input frames provide more spatio-temporal information. Increasing the number of frames from 3 to 5 introduces a significant amount of spatio-temporal features, resulting in a 0.33dB improvement in PSNR. However, the improvement from 5 to 7 frames is not significant, with only a 0.09dB increase in PSNR, because an input of 5 consecutive frames already contains a substantial amount of spatio-temporal information.
C. Experimental Comparison With Other Algorithms
In order to validate the effectiveness of D3DRRN, we conducted a comparative analysis with multiple video super-resolution reconstruction algorithms, focusing on the evaluation of PSNR and SSIM values. The comparison encompasses not only the renowned VSRNet [29], a classic video super-resolution algorithm, but also motion-based techniques such as SOF-VSR [17] and TOFlow [25]. Furthermore, we included other algorithms that leverage deformable convolutions, such as TDAN [18] and D3DNet [30]. Additionally, we performed a quantitative assessment of the network models based on their parameter count.
1) Objective Data Comparison
Figure 9 compares the proposed algorithm with other algorithms in terms of parameter count and PSNR at a scaling factor of 4. In comparison to TDAN, which also employs deformable convolutions, D3DRRN has approximately 0.2M more parameters but achieves a PSNR improvement of approximately 0.4dB. When compared to D3DNet, D3DRRN not only has a lower parameter count but also achieves superior PSNR performance. These findings demonstrate the effectiveness of the multiscale attention mechanism and recursive structure proposed in this paper, which enhance the network's feature representation capability while keeping the parameter count low.
In contrast to the two-stage methods, TOFlow and SOF-VSR, which rely on optical flow, D3DRRN surpasses them in various aspects. Despite a parameter increase of approximately 0.5M compared to SOF-VSR, D3DRRN achieves a PSNR improvement of approximately 0.6dB. This indicates the superior spatiotemporal feature extraction capability of 3D convolutions.
The Vid4 dataset and SPMC-11 dataset were both subjected to downsampling and underwent 4x super-resolution reconstruction using various algorithms. The results, as presented in Table 6, indicate that our proposed D3DRRN method consistently achieves the best performance in the tests conducted with a scaling factor of 4 for both datasets. Specifically, D3DRRN exhibits notable improvements over the classical video super-resolution reconstruction methods VESPCN and VSRNet on both the Vid4 and SPMC-11 datasets. Comparing against the current state-of-the-art methods that employ optical flow, namely TOFlow and SOF-VSR, D3DRRN achieves PSNR improvements of 0.96dB and 0.84dB on the Vid4 dataset, and 0.76dB and 0.91dB on the more diverse SPMC-11 dataset, respectively. Additionally, when compared to methods employing deformable convolutions such as TDAN and D3DNet, D3DRRN demonstrates superior performance. Specifically, on the Vid4 dataset, D3DRRN achieves PSNR improvements of 0.67dB and 0.33dB over TDAN and D3DNet, respectively, while on the SPMC-11 dataset, it achieves improvements of 0.70dB and 0.39dB, respectively. These findings suggest that the increased network depth, coupled with the multiscale attention channel mechanism and deformable convolutions, synergistically enhance the network’s performance.
2) Subjective Effect Comparison
We performed a comparative analysis of 4x super-resolution reconstruction on two video sequences from the Vid4 dataset and two video sequences from the SPMC-11 dataset. Specifically, we selected the “walk” and “city” video sequences from the Vid4 dataset, as well as the “car” and “jvc” video sequences from the SPMC-11 dataset. These sequences were subjected to different super-resolution reconstruction algorithms, and the results are illustrated in Figure 10.
During the reconstruction of the “walk” video sequence, VSRNet produced blurry images overall. TOFlow, TDAN, and D3DNet incorrectly reconstructed the belt, resulting in artifacts, and introduced a checkerboard effect in the hand region. The reconstruction by SOF-VSR exhibited partial blurriness. In contrast, our proposed D3DRRN method achieved better reconstruction of the fine details of the belt.
In the reconstruction of the “city” video sequence, other algorithms showed varying degrees of moiré patterns in the reconstructed windows of the buildings. In contrast, D3DRRN preserved details more effectively and reduced the occurrence of moiré patterns, resulting in reconstructions that closely resembled the ground truth.
For the “car” video sequence, TDAN, D3DNet, and D3DRRN all demonstrated good overall performance. However, the proposed D3DRRN method outperformed the others in accurately reconstructing the contour of the letter “F” in the license plate.
Conclusion
This paper introduces the overall network architecture of a fusion attention mechanism-based back-projection recursive network. To improve both network width and depth, the paper designs a multi-scale feature extraction network, an attention network, and a reconstruction network. The method uses iterative back-projection to repeatedly exploit the up-projection and down-projection features, and bottleneck layers together with a recursive back-projection structure are designed to reduce the parameter count and computation time.
Subsequently, a brief introduction is provided for the datasets used in this paper, namely Vimeo-90K, Vid4, and SPMC-11. The experimental software and hardware environments are described, along with the specific experimental settings. To demonstrate the effectiveness of the proposed modules, multiple ablation experiments are conducted. Finally, objective comparisons are made between the proposed algorithm and other algorithms on the Vid4 and SPMC-11 datasets, and a comparative evaluation of the 4x upscaling results on the Vid4 dataset confirms the effectiveness of the proposed algorithm. However, the proposed algorithm still suffers from slow computation speed, and there is room for further improvement in reconstruction quality, particularly in fine details. In future work, we will conduct further research to address these issues.