
Iterative Back Projection Network Based on Deformable 3D Convolution




Abstract:

Video super-resolution technology enhances the display quality of videos by obtaining high-resolution videos from low-resolution videos. Unlike single-image super-resolution, utilizing information between adjacent video frames is crucial in video super-resolution. To improve the performance of video super-resolution reconstruction, a model combining deformable 3D convolution and iterative back projection is proposed to fully exploit the temporal-spatial correlation of video frames. The model takes multiple consecutive video frames as input and outputs the super-resolution reconstruction of the middle frame, including three modules: multi-scale feature extraction, feature fusion, and high-resolution reconstruction. Firstly, multi-scale 3D convolution is used for preliminary feature extraction. Then, deformable 3D convolution and iterative back projection are combined for feature fusion. Finally, multiple residual dense blocks and sub-pixel convolution are used for high-resolution reconstruction, and global residual connections are utilized to obtain the reconstructed high-resolution video. Experimental results on the Vid4 dataset demonstrate that compared to existing methods, this method can effectively improve the peak signal-to-noise ratio and structural similarity performance and achieve better visual effects with 4x super-resolution magnification.
Published in: IEEE Access ( Volume: 11)
Page(s): 122586 - 122597
Date of Publication: 18 October 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

The video super-resolution (VSR) task is a crucial component of computer vision and image processing, playing an important role in video enhancement. VSR has a wide range of applications, including satellite remote sensing [1], unmanned aerial vehicle surveillance [2], and high-definition television [3]. With advancements in mobile communication technology, we can efficiently perform super-resolution reconstruction of larger-sized images or videos. Moreover, the popularity of high-definition (HD) and ultra-high-definition (UHD) display devices has created a pressing need for the development of super-resolution techniques. High-definition video has become an essential part of daily life, making research on the super-resolution of low-resolution videos particularly important.

Before the emergence of deep learning, traditional super-resolution methods relied on simple upsampling techniques such as nearest-neighbor, bilinear, or bicubic interpolation. However, images restored by these algorithms failed to achieve the desired results. In recent years, deep learning techniques [4] have developed rapidly and permeated many fields, showing tremendous application prospects in super-resolution. The "pioneer" of deep learning in image super-resolution is SRCNN (Super-Resolution Convolutional Neural Network), proposed by Dong et al. [5] of the Chinese University of Hong Kong. SRCNN employs a three-layer convolutional neural network to sequentially perform feature extraction, non-linear mapping, and image reconstruction, training an end-to-end single-image super-resolution network and marking the first attempt of deep learning in super-resolution reconstruction. Its results surpass those of traditional super-resolution reconstruction methods. SRCNN also conducts multiple comparative experiments on convolutional kernel size and network depth, suggesting that, because the network is sensitive to parameter initialization and learning rate, deeper networks are harder to converge or may converge to local minima; thus, deeper networks are not necessarily better. However, such a shallow network cannot adequately explore and exploit image features, leading to a lack of fine detail in the generated SR images. With the continued development of mainstream classification architectures such as AlexNet [6] and VGGNet [7], the use of stacked small convolution kernels in place of larger kernels, as in VGGNet, increased network depth while maintaining a sufficient receptive field. Correspondingly, Dong et al. [8] began using smaller convolution kernels and more mapping layers in image super-resolution networks. Following ResNet [9], various variants emerged, with DenseNet [10] being a representative one. Tong et al. [11] and Zhang et al. [12] used the dense connection idea proposed by DenseNet to fully extract multi-level features, achieving good results in image super-resolution. Lim et al. [13] removed batch normalization [14] from the residual network, increased the number of residual layers from 16 to 32, and improved model performance by removing redundant modules from SRResNet [15] and scaling up the model. In 2018, Zhang et al. [21] introduced an image super-resolution reconstruction algorithm based on the Residual Channel Attention Network (RCAN), which weights and allocates attention to features across different channels, significantly improving the network's ability to discern and learn the high- and low-frequency information in low-resolution images.

Unlike single-image super-resolution, video super-resolution can draw on additional information, namely the correlation between adjacent frames. Effectively utilizing these inter-frame relationships is crucial in video super-resolution reconstruction. To make good use of this information, the adjacent frames must be aligned, and the alignment accuracy directly affects the reconstruction results. Several approaches have been proposed to address the alignment problem. Caballero et al. [16] proposed a method in which a convolutional neural network estimates displacement parameters, which are then used to perform spatial transformations on the adjacent frames; the aligned low-resolution (LR) images are stacked together, and a sub-pixel convolutional layer reconstructs them into a super-resolution (SR) image, enabling an end-to-end video super-resolution network. Wang et al. [17] proposed a network called SOF-VSR that utilizes optical flow in a coarse-to-fine manner: it first predicts the high-resolution (HR) optical flow progressively from coarse to fine, then uses the HR optical flow to perform motion compensation on the LR images, and finally reconstructs the SR image. In the aforementioned methods, optical flow estimation is performed on each frame individually, followed by frame-by-frame alignment; this only utilizes temporal information during alignment and incurs high computational costs. Tian et al. [18] proposed a temporally-deformable alignment network for video super-resolution, where motion estimation and motion compensation are treated as a single-stage task. Wang et al. [19] proposed video restoration with enhanced deformable convolutional networks (EDVR) and won first place by a significant margin in the CVPR NTIRE 2019 Image/Video Restoration Challenge. Kim et al. [20] constructed a simple end-to-end video super-resolution reconstruction network using 3D convolutional neural networks (CNNs) and demonstrated that 3D convolutions can achieve better results in video super-resolution reconstruction tasks.

The reconstruction performance of these methods heavily relies on the quality of alignment. While some progress has been made by investigating more accurate optical flow estimation networks or utilizing deformable convolutions for improved alignment, the excessive parameter and computational requirements of 3D convolution make it difficult to construct deep neural networks. Moreover, although 3D convolution can effectively utilize temporal information, it treats information across channels equally, without considering the varying impact of different channels on the reconstruction quality. To address the above problems, this paper proposes the following innovations based on a 3D convolutional neural network:

  1. In the conventional two-stage methods for video super-resolution reconstruction, which involve optical flow estimation and motion compensation, the initial step involves feature extraction in the spatial domain, followed by motion compensation in the temporal domain. However, this approach leads to the separation of spatio-temporal information in video sequences and a consequent reduction in their coherence. To address the limitations of the two-stage methods, this paper introduces a novel approach called the Fusion Attention Mechanism-based Back-Projection Recursive Network (D3DRRN), building upon previous research.

  2. We propose a multi-feature extraction module in order to differentiate the importance level of information across different dimensions.

  3. To address the limitations of global average pooling in merging information within the same channel, which may hinder the reconstruction of texture details, we propose a multi-attention (MA) module that combines temporal attention, spatial attention, and channel attention.

  4. By further combining recursive structures and residual connections, we accelerate training speed, alleviate the loss of high-frequency information during training in deep networks, reduce parameter count, and decrease network size.

  5. We utilize multiple short-skip connections and long-skip connections in our model to further expedite the propagation of feature information within the network, enabling the extraction of more high-level features.

SECTION II.

Overall Network Framework

To address the limitations of two-stage methods, this paper introduces a novel approach called D3DRRN, which is based on deformable 3D convolution and incorporates a fusion attention mechanism. Building upon existing research, D3DRRN aims to overcome the drawbacks of the traditional two-stage pipeline by iteratively applying deformable 3D convolution and back-projection together with a fusion attention mechanism. This combination fuses spatial and temporal information, enhancing the coherence and effectiveness of super-resolution reconstruction of video sequences. D3DRRN addresses the excessive parameter and computational requirements of 3D convolution, which limit the construction of deep networks, by combining residual networks with recursive networks.

Furthermore, although 3D convolution effectively utilizes temporal information, it treats information across channels equally without considering the distinct impact that different channels have on the reconstruction quality. To enhance the network’s ability to capture complex inter-channel correlations and introduce greater non-linearity in the channel dimension, this paper improves upon the channel attention mechanism proposed in RCAN [21]. Additionally, multiple short-skip connections and long-skip connections are employed to facilitate the faster propagation of low-frequency information.

The D3DRRN structure proposed in this paper, as shown in Figure 1, consists of three main components: multi-scale feature extraction, deep feature extraction, and reconstruction network.

FIGURE 1. D3DRRN network model.

In Figure 1, the low-resolution images [LR_{1}, \ldots , LR_{t}, \ldots , LR_{2t}] are processed by the Multiscale Feature Extraction (MFE) module to extract features, yielding the shallow-level features \tilde {F} . This process can be formalized as equation 1.
\begin{equation*}\tilde {F}=MFE([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{1}\end{equation*}

After multiple rounds of recursive deep feature extraction, followed by a residual connection with the shallow features, we obtain the deep residual features F' , which can be formalized as equation 2.
\begin{equation*}F'=RU(\tilde {F})+\tilde {F} \tag{2}\end{equation*}
Finally, the deep residual features F' are upsampled and combined, via a residual connection, with the bilinearly upsampled LR_{t} to obtain the HR image. This process can be formalized as equation 3.
\begin{equation*}HR=Re(F')+B(LR_{t}) \tag{3}\end{equation*}

The overall structure can be represented by equation 4:
\begin{equation*}HR=D3DRRN([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{4}\end{equation*}
LR_{t} refers to the low-resolution image, and HR represents the high-resolution image after processing. MFE() denotes the shallow-level feature extraction network, RU() corresponds to the deep-level feature extraction network, Re() represents the reconstruction network, and B() is bilinear upsampling.
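As a rough illustration of how the tensors move through equations 1-4, the following PyTorch sketch mirrors the MFE, recursive unit, and reconstruction stages. The internals here (plain 3D convolutions, a single recursive pass, names such as D3DRRNSketch) are placeholders of our own and not the authors' implementation; only the overall data flow from the LR frame stack to the HR middle frame follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D3DRRNSketch(nn.Module):
    """Minimal sketch of the Eq. (1)-(4) pipeline; module bodies are stand-ins."""
    def __init__(self, frames=7, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.mfe = nn.Conv3d(1, channels, 3, padding=1)        # stand-in for the MFE module
        self.ru = nn.Conv3d(channels, channels, 3, padding=1)  # stand-in for one recursive unit
        self.rec = nn.Sequential(                              # stand-in for the reconstruction net
            nn.Conv2d(channels * frames, scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr_seq):                      # lr_seq: (B, 1, T, H, W), Y channel only
        f_shallow = self.mfe(lr_seq)                # Eq. (1): shallow features
        f_deep = self.ru(f_shallow) + f_shallow     # Eq. (2): recursive unit + residual
        b, c, t, h, w = f_deep.shape
        hr_res = self.rec(f_deep.reshape(b, c * t, h, w))      # Eq. (3): Re(F')
        center = lr_seq[:, :, t // 2]                          # middle LR frame LR_t
        base = F.interpolate(center, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)  # B(LR_t)
        return hr_res + base                        # Eq. (4): HR output

# usage: D3DRRNSketch()(torch.randn(1, 1, 7, 32, 32)) -> tensor of shape (1, 1, 128, 128)
```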

A. Multiscale Feature Extraction Network

Due to the local connectivity property of convolution, smaller convolutional kernels tend to extract more localized and detailed features. Conversely, larger kernels possess a larger receptive field, enabling them to capture more global features. In video super-resolution tasks, significant motion can exist within the same scene, making a larger receptive field crucial for capturing these critical pieces of information.

To accommodate both a large receptive field and local details, we take inspiration from GoogLeNet [22], which introduced a parallel structure that processes the feature matrix across multiple branches and then concatenates the resulting feature matrices along the depth dimension to obtain the final output. Based on this idea, this section presents the Multiscale Feature Extraction (MFE) module. The structure of the MFE module is depicted in Figure 2.

FIGURE 2. Multiscale Feature Extraction (MFE) module.

In Figure 2, the sequence of input LR images [LR_{1}, \ldots , LR_{t}, \ldots , LR_{2t}] is first subjected to a 3\times 3\times 3 convolution, extracting the feature F . This process can be formulated as equation 5.
\begin{equation*}F=W^{3}([LR_{1}, \ldots, LR_{t}, \ldots, LR_{2t}]) \tag{5}\end{equation*}

The feature F is then passed through three separate convolutional layers with kernel sizes of 1\times 1\times 1 , 3\times 3\times 3 , and 5\times 5\times 5 to extract features at different scales, resulting in F_{0} , F_{1} , and F_{2} . Following each convolution, the LRelu activation function is applied to introduce non-linearity into the model. This process can be formulated as equation 6.
\begin{equation*}F_{0,1,2}=W^{1,3,5}(\delta (F)) \tag{6}\end{equation*}

The features F_{0}, F_{1}, F_{2} are concatenated and then passed through a 1\times 1\times 1 convolutional layer for compression and fusion, resulting in the fused multiscale feature \tilde {F} . This process can be formalized as equation 7.
\begin{equation*}\tilde {F}=W^{1}(\delta ([F_{0},F_{1},F_{2}])) \tag{7}\end{equation*}

Here F represents the shallow-level feature, and W^{n} denotes a convolutional operation with a kernel size of n \times n \times n , so that W^{1} , W^{3} , and W^{5} correspond to convolutions with kernel sizes of 1\times 1\times 1 , 3\times 3\times 3 , and 5\times 5\times 5 , respectively. The symbol \delta () denotes the LRelu activation function. F_{0}, F_{1}, F_{2} represent features at different scales, while \tilde {F} represents the fused multiscale feature.
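The following is a minimal PyTorch sketch of the MFE computation in equations 5-7, under the assumption that every branch keeps the channel count and uses "same" padding so the T x H x W size is preserved; the channel width (64) and the LeakyReLU slope are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFE(nn.Module):
    """Sketch of the multiscale feature extraction module, Eqs. (5)-(7)."""
    def __init__(self, in_ch=1, ch=64):
        super().__init__()
        self.head = nn.Conv3d(in_ch, ch, 3, padding=1)   # Eq. (5): W^3
        self.b1 = nn.Conv3d(ch, ch, 1, padding=0)        # 1x1x1 branch
        self.b3 = nn.Conv3d(ch, ch, 3, padding=1)        # 3x3x3 branch
        self.b5 = nn.Conv3d(ch, ch, 5, padding=2)        # 5x5x5 branch
        self.fuse = nn.Conv3d(3 * ch, ch, 1)             # Eq. (7): 1x1x1 compression

    def forward(self, lr_seq):                           # lr_seq: (B, C_in, T, H, W)
        f = F.leaky_relu(self.head(lr_seq), 0.1)         # Eq. (5) followed by LRelu
        f0, f1, f2 = self.b1(f), self.b3(f), self.b5(f)  # Eq. (6): multi-scale branches
        cat = F.leaky_relu(torch.cat([f0, f1, f2], dim=1), 0.1)
        return self.fuse(cat)                            # Eq. (7): fused feature F~
```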

B. Multi-Attention Module

In the channel attention module of RCAN, fusing information within the same channel using only global average pooling restricts the reconstruction of texture details. To consider global information and local details simultaneously, we present a channel attention module (CA) that combines global average pooling and global maximum pooling. Taking inspiration from CBAM [23], we integrate a time attention module (TA), the channel attention module (CA), and a spatial attention module (SA) into a multi-attention module (MA).

In multi-frame inputs, not every frame has an equal impact on the reconstruction performance of the intermediate frame. In fact, some frames may introduce adverse information that negatively affects the reconstruction results. Hence, the purpose of time attention is to allocate different weights to each frame in the temporal dimension, amplifying the impact of frames that are more beneficial to the reconstruction quality. The time attention network is illustrated in Figure 3.

FIGURE 3. Time attention module.

Assuming the temporal dimension is represented as t , we begin by applying global average pooling to the input frames; as the number of consecutive frames is odd, a specific scaling size is utilized instead of a fixed reduction multiple. The pooled features are then compressed along the temporal dimension. This process can be expressed as equation 8.
\begin{equation*}t=W^{T}_{D}(H_{avg}(\tilde {F})) \tag{8}\end{equation*}

The compressed temporal dimension t is subsequently expanded through a scaling process, resulting in the transformation T \to t \to T . To constrain the resulting attention values T' to the range [0, 1], a gating function is applied. This process can be expressed as equation 9.
\begin{equation*}T'=f(W^{T}_{U}(\delta (t))) \tag{9}\end{equation*}

Finally, the obtained temporal attention T' is multiplied element-wise with the features, as in equation 10.
\begin{equation*}F'_{T}=T'\tilde {F} \tag{10}\end{equation*}

For consecutive multiple frames, the channel attention assigned to each frame is identical. Thus, Figure 4 presents an example of the channel attention mechanism using a feature map with a temporal dimension of 1. To allocate attention weights to each channel, the input feature map must be compressed spatially: assuming the input feature size is h \times w \times c , the resulting channel attention weights have size 1\times 1\times c . The input to the CA module is the fused multi-scale feature \tilde {F} . The CA module applies global max pooling and global average pooling to \tilde {F} , compressing the feature map to 1\times 1\times c . The compressed feature map is then downscaled to 1\times 1\times c/r using the downsampling operation W_{D} and restored to its original size using the upsampling operation W_{U} . The two pooling operations yield the attention features C_{1} and C_{2} . This process can be formulated as equations 11 and 12.
\begin{align*}C_{1}&=W^{C}_{U}(\delta (W^{C}_{D}(H_{max}(\tilde {F})))) \tag{11}\\ C_{2}&=W^{C}_{U}(\delta (W^{C}_{D}(H_{avg}(\tilde {F})))) \tag{12}\end{align*}

In Figure 4, \oplus denotes element-wise addition and \otimes denotes element-wise multiplication. C_{1} and C_{2} are added element-wise and passed through the sigmoid gate function to obtain the fused channel attention C . The element-wise multiplication of C and \tilde {F} yields the feature F'_{c} with channel attention. This process can be formulated as equation 13.
\begin{equation*}F'_{c}=f(C_{1}+C_{2})\tilde {F} \tag{13}\end{equation*}

FIGURE 4. Channel attention module.

The importance of features may vary across channels, which is why weights are assigned to different channels; similarly, within the same channel, each pixel may have a different importance. Assigning a weight to each pixel forms spatial attention, as depicted in Figure 5. Like channel attention, spatial attention assigns a weight to every pixel, which requires compressing the feature map along the channel dimension. Assuming an input feature size of h \times w \times c , spatial attention generates attention weights of size h \times w \times 1 . In the SA module, the feature map is first compressed along the channel dimension to a size of h \times w \times 1 ; it is then downscaled using W_{S} and upscaled using W_{E} , restoring an attention map of size h \times w \times 1 . This process can be formulated as equations 14 and 15.
\begin{align*}S_{1} &= W^{S}_{U}(\delta (W^{S}_{D}(H_{max}(\tilde {F})))) \tag{14}\\ S_{2} &= W^{S}_{U}(\delta (W^{S}_{D}(H_{avg}(\tilde {F})))) \tag{15}\end{align*}

As in the CA module, the attention maps obtained from the two pooling operations are summed element-wise to obtain the fused attention map S . Finally, S is multiplied element-wise with F'_{c} to obtain the feature map F'_{s} with both spatial and channel attention. This process is described by equation 16.
\begin{equation*}F'_{s}=f(S_{1}+S_{2})F'_{c} \tag{16}\end{equation*}
f() represents the sigmoid gate function, and \delta () represents the LRelu activation function. H_{max} denotes the global maximum pooling operation, and H_{avg} denotes the global average pooling operation. W^{T,C,S}_{D} represents the compression operation performed on the time, channel, and spatial dimensions using a 1 \times 1 \times 1 convolution. Conversely, W^{T,C,S}_{U} represent the activation operation performed using a 1 \times 1 \times 1 convolution. The terms C_{1}, C_{2}, S_{1}, S_{2} respectively represent the feature attention allocated through global maximum pooling and global average pooling. F'_{T,C,S} represent the features that have been assigned attention.

FIGURE 5. Spatial attention module.
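A possible PyTorch reading of the MA module (equations 8-16) is sketched below. The squeeze ratios, the use of linear layers for the temporal squeeze, and the small convolutions in the spatial branch are assumptions made for illustration; the paper's figures define the exact operators, and the ordering TA, then CA, then SA follows the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttention(nn.Module):
    """Sketch of the multi-attention (MA) module for features of shape (B, C, T, H, W)."""
    def __init__(self, ch=64, frames=7, r=4):
        super().__init__()
        # temporal attention: squeeze and re-expand the frame dimension (Eqs. 8-9)
        self.t_down = nn.Linear(frames, max(frames // 2, 1))
        self.t_up = nn.Linear(max(frames // 2, 1), frames)
        # channel attention: squeeze-and-excite over channels (Eqs. 11-12)
        self.c_down = nn.Conv3d(ch, ch // r, 1)
        self.c_up = nn.Conv3d(ch // r, ch, 1)
        # spatial attention: squeeze/expand on the 1-channel pooled maps (Eqs. 14-15)
        self.s_down = nn.Conv3d(1, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.s_up = nn.Conv3d(1, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # temporal attention
        ta = x.mean(dim=(1, 3, 4))                          # H_avg over C, H, W -> (B, T)
        ta = torch.sigmoid(self.t_up(F.leaky_relu(self.t_down(ta), 0.1)))
        x_t = x * ta.view(b, 1, t, 1, 1)                    # Eq. (10)
        # channel attention
        c_max = F.adaptive_max_pool3d(x_t, 1)               # (B, C, 1, 1, 1)
        c_avg = F.adaptive_avg_pool3d(x_t, 1)
        c1 = self.c_up(F.leaky_relu(self.c_down(c_max), 0.1))
        c2 = self.c_up(F.leaky_relu(self.c_down(c_avg), 0.1))
        x_c = x_t * torch.sigmoid(c1 + c2)                  # Eq. (13)
        # spatial attention
        s_max = x_c.max(dim=1, keepdim=True).values         # (B, 1, T, H, W)
        s_avg = x_c.mean(dim=1, keepdim=True)
        s1 = self.s_up(F.leaky_relu(self.s_down(s_max), 0.1))
        s2 = self.s_up(F.leaky_relu(self.s_down(s_avg), 0.1))
        return x_c * torch.sigmoid(s1 + s2)                 # Eq. (16)
```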

C. Iterative Back Projection Module Based on Deformable 3D Convolution

The Deep Back-Projection Network (DBPN) [24] integrates iterative back-projection with deep learning. To perform the back-projection process with convolutional neural networks, DBPN introduces an up-projection unit and a down-projection unit. Each projection unit consists of two fundamental modules: a deconvolution module for the upsampling layer and a convolution module for the downsampling layer. The parameter count of each three-dimensional upsampling layer is three times that of a two-dimensional upsampling layer, and stacking multiple three-dimensional upsampling layers further amplifies both the parameter count and the computational load. Beyond these costs, the limited graphics memory of each card poses additional constraints: an excessive parameter count can make training on GPUs infeasible.

In order to construct an upsampling layer without introducing an excessive parameter count, we compress and activate the temporal features as illustrated in Figure 6. Prior to upsampling, the feature maps are compressed to a size of 1 \times h \times w \times c , as shown in equation 17.
\begin{equation*}F_{S}=F'_{S}*S \tag{17}\end{equation*}

FIGURE 6. Time series feature scaling operations.

After a sequence of projection operations, the feature L_{p} is obtained. To facilitate the construction of the subsequent recursive structure, the resulting feature maps from the up-projection and down-projection must be activated back to d \times h \times w \times c , as shown in equation 18.
\begin{equation*}F_{E}=L_{p}*E \tag{18}\end{equation*}

The symbol * denotes the convolution operation, S represents the convolution used for compressing temporal features, and E represents the convolution used for activating temporal features. Building upon the methodology of DBPN, we construct upsampling convolutional layers with a kernel size of 8 \times 8 and a stride of 4. For the downsampling convolutional layers, we use a 3 \times 3 kernel, a stride of 1, and a padding of 1. These layers extract high-frequency and low-frequency features, respectively. The up-projection and down-projection units are structured in a similar manner, as illustrated in Figure 7.

FIGURE 7. Projection module based on deformable 3D convolution.

In the up-projection unit, the feature map L^{t-1} with a size of d \times m \times n is first projected through upsampling, yielding the preliminary estimate of the upsampled feature, H^{t-1}_{0} . This estimate then undergoes downsampling for feature extraction, resulting in the first estimated downsampled feature L^{t-1}_{0} . Subtracting L^{t-1} from L^{t-1}_{0} yields the error between the downsampled features, denoted e^{t}_{l} . This error is then subjected to a single round of back-projection, generating the upsampled error feature H^{t}_{1} . Finally, the upsampled error feature H^{t}_{1} is added to the initial estimate H^{t-1}_{0} , giving the upsampled feature H^{t} after one round of back-projection. This process is described by equations 19-23.
\begin{align*}H^{t-1}_{0}&=L^{t-1}*U^{1}_{t} \tag{19}\\ L^{t-1}_{0}&=H^{t-1}_{0}*D_{t} \tag{20}\\ e^{t}_{l}&=L^{t-1}_{0}-L^{t-1} \tag{21}\\ H^{t}_{1}&=e^{t}_{l}*U^{2}_{t} \tag{22}\\ H^{t}&=H^{t-1}_{0}+H^{t}_{1} \tag{23}\end{align*}

The computation in the down-projection unit is the reverse of the up-projection unit. The upsampled feature H^{t} obtained from the upsampling projection is projected to extract low-frequency information, producing the initial downsampled feature L^{t}_{0} . The initial downsampled feature L^{t}_{0} then undergoes one round of back-projection, yielding the initial upsampled feature H^{t}_{0} . The error between H^{t}_{0} and the input H^{t} gives the upsampled error e^{t}_{h} . This error then undergoes one round of downsampling mapping, generating the downsampled error feature L^{t}_{1} . Finally, L^{t}_{1} is connected to the initial downsampled feature L^{t}_{0} through a residual connection, resulting in the downsampled feature L^{t} after one projection. This process is described by equations 24-28.
\begin{align*}L^{t}_{0}&=H^{t}*D^{1}_{t} \tag{24}\\ H^{t}_{0}&=L^{t}_{0}*U_{t} \tag{25}\\ e^{t}_{h}&=H^{t}_{0}-H^{t} \tag{26}\\ L^{t}_{1}&=e^{t}_{h}*D^{2}_{t} \tag{27}\\ L^{t}&=L^{t}_{0}+L^{t}_{1} \tag{28}\end{align*}

The symbol * represents the convolution operation, D_{t} denotes the ordinary convolution used for extracting downsampled features, and U_{t} represents the transposed convolution used for extracting upsampled features.
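The projection units of equations 19-28 can be sketched as follows. The temporal dimension is assumed to be already compressed (Figure 6), so 2D layers suffice, and plain (transposed) convolutions stand in for the paper's deformable 3D convolutions. The kernel 8, stride 4, padding 2 setting follows the usual DBPN configuration for 4x scaling and is used here so the residuals in equations 21 and 26 have matching sizes; it is an assumption, not the authors' exact layer configuration.

```python
import torch
import torch.nn as nn

class UpProjection(nn.Module):
    """Sketch of one up-projection unit, Eqs. (19)-(23)."""
    def __init__(self, ch=64, scale=4):
        super().__init__()
        k, s, p = 8, scale, 2
        self.up1 = nn.ConvTranspose2d(ch, ch, k, s, p)   # U^1_t in Eq. (19)
        self.down = nn.Conv2d(ch, ch, k, s, p)           # D_t   in Eq. (20)
        self.up2 = nn.ConvTranspose2d(ch, ch, k, s, p)   # U^2_t in Eq. (22)

    def forward(self, l_prev):            # l_prev: LR-resolution features L^{t-1}
        h0 = self.up1(l_prev)             # Eq. (19): first HR estimate
        l0 = self.down(h0)                # Eq. (20): project back to LR
        e_l = l0 - l_prev                 # Eq. (21): LR-space error
        h1 = self.up2(e_l)                # Eq. (22): back-project the error
        return h0 + h1                    # Eq. (23): corrected HR features H^t

class DownProjection(nn.Module):
    """Sketch of one down-projection unit, Eqs. (24)-(28)."""
    def __init__(self, ch=64, scale=4):
        super().__init__()
        k, s, p = 8, scale, 2
        self.down1 = nn.Conv2d(ch, ch, k, s, p)          # D^1_t in Eq. (24)
        self.up = nn.ConvTranspose2d(ch, ch, k, s, p)    # U_t   in Eq. (25)
        self.down2 = nn.Conv2d(ch, ch, k, s, p)          # D^2_t in Eq. (27)

    def forward(self, h_t):               # h_t: HR-resolution features H^t
        l0 = self.down1(h_t)              # Eq. (24): first LR estimate
        h0 = self.up(l0)                  # Eq. (25): project back to HR
        e_h = h0 - h_t                    # Eq. (26): HR-space error
        l1 = self.down2(e_h)              # Eq. (27): back-project the error
        return l0 + l1                    # Eq. (28): corrected LR features L^t
```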

In order to extract high-frequency features, the up-projection and down-projection necessitate the use of larger convolutional kernels, leading to an increase in parameters. However, repeatedly performing up-projection and down-projection, as in DBPN, would not only result in a substantial parameter count but also make the network difficult to train. To enhance the network depth without adding parameters, we employ the parameter sharing characteristic of recursive neural networks and construct a recursive neural network based on iterative back-projection modules and attention modules.

Within this module, a fundamental attention back-projection unit is constructed, consisting of an attention module, a compression network, n iterative back-projection modules, and an activation network. However, excessive recursion can make the model difficult to train. To mitigate this, residual connections are introduced to facilitate the flow of information by connecting the outputs of each recursive module. The Recursive Unit (RU) structure is depicted in Figure 8. For the first recursive module, the input comprises the shallow features \tilde {F} , assumed to have a size of d \times h \times w \times c ; by design, the output of each recursive module has the same dimensions d \times h \times w \times c . For the u-th recursive unit, the input is F'_{u-1} . This process can be expressed as equation 29.
\begin{equation*}F'_{u}=RU^{u}(F'_{u-1}+\tilde {F}) \tag{29}\end{equation*}

The overall output is represented as equation 30.
\begin{align*}F'=RU^{u}(RU^{u-1}(\ldots (RU^{1}(\tilde {F})+\tilde {F})+\tilde {F})+\tilde {F})+\tilde {F} \tag{30}\end{align*}
RU^{u} represents the u-th recursive unit, and \tilde {F} denotes the shallow features processed by the MFE network.
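The parameter-shared recursion of equations 29 and 30 might be wrapped as in the sketch below, where `unit` is a stand-in for the full attention back-projection unit and the recursion count defaults to the four units used in the ablation study; the wrapper itself is an illustrative assumption.

```python
import torch
import torch.nn as nn

class RecursiveWrapper(nn.Module):
    """One parameter-shared unit applied repeatedly, with residuals to the shallow features."""
    def __init__(self, unit: nn.Module, num_recursions: int = 4):
        super().__init__()
        self.unit = unit                     # the same weights are reused at every recursion
        self.num_recursions = num_recursions

    def forward(self, f_shallow):            # f_shallow: \tilde{F}
        f = f_shallow
        for _ in range(self.num_recursions):
            f = self.unit(f) + f_shallow     # Eq. (30): add shallow features after each pass
        return f                             # deep residual features F'

# usage with a trivial stand-in unit:
# ru = RecursiveWrapper(nn.Conv3d(64, 64, 3, padding=1), num_recursions=4)
```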

FIGURE 8. Recursive module.

D. Reconstruction Network

To alleviate computational demands, we employ low-resolution images during the training phase while keeping the image resolution constant. Subsequently, in the final stage of the network, we integrate sub-pixel convolution to facilitate the image upscaling process. Sub-pixel convolution is a technique used to perform upsampling by leveraging pixel points from different channels. Assuming an upsampling factor of r , if the initial image size before sub-pixel convolution is (r^{2}c) \times h \times w , the resulting feature f after applying sub-pixel convolution will have a size of c \times (rh) \times (rw) .

The primary goal of the network is to reconstruct a high-resolution (HR) image, which is a two-dimensional data array, whereas the feature maps within the network are three-dimensional prior to reconstruction. Specific methods are therefore needed to reduce the dimensionality of the features. During convolution, the absence of padding leads to a progressive reduction in the size of the feature maps. Motivated by this observation, 3DSRNet [20] adopts a strategy that eliminates temporal padding: it applies two consecutive convolutions with a kernel size of 3 \times 3 \times 3 , reducing the temporal dimension of the feature maps from 5 to 1. After fusing the temporal features, the network uses sub-pixel convolution for upsampling. Another approach to feature fusion is to reorganize the three-dimensional feature maps and apply 2D convolution for upsampling. Experimental findings indicate that this restructuring method and the gradual fusion of spatial-temporal features with 3D convolution yield comparable results, but the restructuring method has the advantage of a reduced parameter count. Hence, we adopt the restructuring approach to build the upsampling reconstruction network. In the restructuring method, the length and width of the feature maps remain constant, while the pixels are redistributed along the channel dimension, converting a feature map of dimensions c \times d \times h \times w into one of dimensions (c \times d) \times h \times w , as depicted in equation 31.
\begin{equation*}\vec {F}=\rho (F') \tag{31}\end{equation*}

The channel dimension of the restructured feature map \vec {F} is then adjusted according to the upsampling factor. Assuming the input feature map has a channel dimension of c_{in} and the HR image has a final channel number of c_{out} , where the HR image is s times larger than the LR image, the channel dimension is adjusted to s^{2}c_{out} , as depicted in equation 32.
\begin{equation*}f=\mu (\vec {F}) \tag{32}\end{equation*}
Finally, the feature map f undergoes sub-pixel convolution to reach a size of c_{out} \times (sh) \times (sw) . A residual connection is then established between the result and the upsampled LR_{t} , producing the final HR output, as depicted in equation 33.
\begin{equation*}HR=s(f)+B(LR_{t}) \tag{33}\end{equation*}
In this context, F' corresponds to a tensor of size c \times d \times h \times w , \vec {F} corresponds to a tensor of size (c \times d) \times h \times w , f corresponds to a tensor of size (s^{2}c_{out}) \times h \times w , and HR corresponds to a tensor of size c_{out} \times (sh) \times (sw) . The function \rho () denotes the restructuring operation, \mu () denotes dimension transformation using a 1 \times 1 convolution, s() represents sub-pixel convolution, and B() signifies bilinear interpolation for upsampling.
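The reconstruction path of equations 31-33 could be sketched as follows, assuming c_out = 1 (the luminance channel) and a 4x scale as in the experiments; the 1x1 convolution plays the role of mu() and PixelShuffle that of the sub-pixel convolution s(). The layer choices are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reconstruction(nn.Module):
    """Sketch of the reconstruction network, Eqs. (31)-(33)."""
    def __init__(self, ch=64, frames=7, scale=4, c_out=1):
        super().__init__()
        self.scale = scale
        self.reduce = nn.Conv2d(ch * frames, (scale ** 2) * c_out, 1)  # Eq. (32): mu()
        self.shuffle = nn.PixelShuffle(scale)                          # sub-pixel convolution s()

    def forward(self, f_deep, lr_center):     # f_deep: (B, C, T, H, W); lr_center: (B, c_out, H, W)
        b, c, t, h, w = f_deep.shape
        f_flat = f_deep.reshape(b, c * t, h, w)                        # Eq. (31): rho(), channel restructuring
        f = self.reduce(f_flat)                                        # (B, s^2 * c_out, H, W)
        hr_res = self.shuffle(f)                                       # (B, c_out, sH, sW)
        base = F.interpolate(lr_center, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)     # B(LR_t)
        return hr_res + base                                           # Eq. (33)
```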

SECTION III.

Experimental and Analysis

A. Preparation Before Experiment

During the training of neural networks, the effectiveness of the network is not only dependent on its architecture and parameter count but also on the quality of the training dataset. Insufficient training samples or a lack of diversity and representativeness in the dataset can result in overfitting. Conversely, an excessive presence of noise and interference in the training set can impede network convergence. To achieve optimal network performance and facilitate fair comparisons with other algorithms, we utilized the publicly available Vimeo-90K dataset [25] for training purposes. Vimeo-90K is a widely used, large-scale, and high-quality dataset that is specifically curated for video processing tasks. It plays a significant role in training models for various video-related tasks, including video denoising, super-resolution reconstruction, and video frame interpolation.

To obtain the training data pairs of HR and LR images, the original video frames are used as HR ground truth, while the downsampled video frames serve as LR images. During network training, random 32\times 32 blocks are cropped from the LR images as network inputs, and the corresponding blocks are cropped from the HR images. The input images are converted from RGB to the YCbCr color space, and the luminance channel (Y) is extracted for processing. The batch size is set to 64, and data augmentation techniques such as rotation are employed to enhance the network's generalization capability. The Adam optimizer is used for gradient optimization, with an initial learning rate of 0.0002; the learning rate is halved every 10 epochs, and training concludes after 50 epochs. In super-resolution reconstruction, commonly used loss functions include the mean absolute error (MAE) and mean squared error (MSE), also referred to as the L1 and L2 losses, respectively. Their formulas are given by equations 34 and 35.
\begin{align*}L_{1}(y,\hat {y})&=w(\theta)|\hat {y}-y| \tag{34}\\ L_{2}(y,\hat {y})&=w(\theta)(\hat {y}-y)^{2} \tag{35}\end{align*}

The symbol \hat {y} represents the reconstructed result, while y denotes the ground-truth high-resolution image. The L2 loss is known to produce smoother outputs, converge faster, and achieve higher PSNR. However, according to reference [26], networks trained with the L1 loss may yield lower PSNR values but exhibit more comprehensive reconstruction details and better visual quality. To leverage the advantages of both loss functions, our experiment first uses the L2 loss for the first 20 epochs to prioritize faster convergence, then adopts the L1 loss for the remaining epochs to emphasize richer reconstruction details.
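The two-stage loss schedule described above can be expressed as a small helper; the 20-epoch switch point and the halving of the learning rate every 10 epochs follow the text, while the surrounding training-loop wiring is only an assumption for illustration.

```python
import torch
import torch.nn as nn

def criterion_for_epoch(epoch: int) -> nn.Module:
    """Return the reconstruction loss for a given (0-indexed) epoch: MSE first, then MAE."""
    return nn.MSELoss() if epoch < 20 else nn.L1Loss()

# illustrative wiring inside a training loop:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# loss = criterion_for_epoch(epoch)(sr, hr)
# loss.backward(); optimizer.step(); scheduler.step()
```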

The datasets utilized for testing are Vid4 [27] and SPMC-11 [28]. Vid4 is a test dataset composed of four video clips: “calendar,” “city,” “foliage,” and “walk.” Each video sequence comprises 31 frames and is employed to assess the overall performance of the trained models. SPMC-11 is constructed by assembling video clips depicting 11 diverse scenes, such as “car,” “city,” “people,” and “landscape.” Each video sequence in SPMC-11 consists of 31 frames and is employed for model evaluation. The experimental network architecture in this study was implemented using Python 3.7 and PyTorch 1.10.1 deep learning framework. The corresponding CUDA version used was 11.3. The experimental setup and hardware configuration are presented in Table 1.

TABLE 1 Experimental Equipment Configuration

B. Ablation Experiment

To validate the effectiveness of the proposed modules, this section conducts ablation experiments on different modules. All networks are trained on the Vimeo-90K dataset and evaluated on the Vid4 dataset after applying the same data preprocessing. For all ablation experiments, the comparison models include all modules, with 4 recursive units and 2 attention feedback units. The comparisons are performed at 2x and 4x magnification scales.

1) Ablation Experiments on the Multi-Scale Feature Fusion Module

In this experiment, the comparative setup involves replacing the multi-scale feature extraction module with a conventional 3D convolution. The results are shown in Table 2, where bold font indicates the best results at different magnification scales. It can be observed that regardless of the 2x or 4x magnification, the utilization of the multi-scale fusion module leads to performance improvement. For the 2x magnification, the PSNR and SSIM values are improved by 0.42dB and 0.011, respectively, compared to not using the multi-scale fusion module. Similarly, for the 4x magnification, the PSNR and SSIM values are improved by 0.31dB and 0.017, respectively. These results indicate that leveraging the information between different scale feature maps can complement each other and yield better reconstruction results.

TABLE 2 Multiscale Feature Fusion Ablation Experiment

2) Ablation Experiments on the Attention Mechanism

In order to assess the distinct impacts of channel attention, spatial attention, and temporal attention on the network’s reconstruction performance, comparative experiments were carried out by individually combining SA, TA, and CA separately, and evaluating the results on the Vid4 dataset at 2x and 4x magnification scales. The experimental results in Table 3 demonstrate that the model utilizing SA+CA+TA achieves the highest performance, followed by SA+CA. The TA module contributes only marginal improvements. This can be attributed to the strong temporal feature extraction capabilities of 3D convolutions. In sufficiently deep networks, allocating separate weights to temporal features becomes unnecessary.

TABLE 3 Multiple Attention Module Ablation Experiment

3) Ablation Experiments on the Recursive Units

This section presents experiments that compare the number of attention back projection (ABP) modules and recursive units (RU) in the network. By increasing the number of RU units while keeping the ABP module count constant, the network depth is increased without affecting the parameter count. The network configuration with m RU units and n ABP units is denoted as R_{m}A_{n} . The experimental results are summarized in Table 4. The R_{2}A_{2} network has the same depth as the R_{4}A_{1} network, but R_{2}A_{2} achieves higher PSNR and SSIM values by 0.06dB and 0.004, respectively. It can be concluded that, under the same network depth, increasing the number of ABP modules, which corresponds to a higher parameter count, leads to better network performance. Moreover, comparing R_{4}A_{2} and R_{2}A_{2} , both networks have the same parameter count, but R_{4}A_{2} has approximately twice the depth of R_{2}A_{2} . This observation suggests that, with an equal parameter count, deeper network architectures are capable of capturing more complex nonlinear mapping relationships, resulting in improved network performance.

TABLE 4 Recursive Unit Ablation Experiment

4) Ablation Experiments on Different Input Frames

In order to determine the optimal number of input consecutive frames, which contain different spatio-temporal information, ablation experiments were conducted using 3, 5, and 7 input frames. The experimental results are shown in Table 5. It can be observed that the network performs best when the number of input consecutive frames is 7. This is because increasing the number of input frames allows for more spatio-temporal information to be included. Increasing the number of frames from 3 to 5 introduces a significant amount of spatio-temporal features, resulting in a 0.33dB improvement in PSNR. However, the performance improvement is not significant when increasing the number of frames from 5 to 7, with only a 0.09dB increase in PSNR. This is because the input of consecutive 5 frames already contains a substantial amount of spatio-temporal information.

TABLE 5 Number of Input Frames Ablation Experiment

C. Experimental Comparison With Other Algorithms

In order to validate the effectiveness of D3DRRN, we conducted a comparative analysis with multiple video super-resolution reconstruction algorithms, focusing on the evaluation of PSNR and SSIM values. The comparison encompasses not only the renowned VSRNet [29], a classic video super-resolution algorithm, but also motion-based techniques such as SOF-VSR [17] and TOFlow [25]. Furthermore, we included other algorithms that leverage deformable convolutions, such as TDAN [18] and D3DNet [30]. Additionally, we performed a quantitative assessment of the network models based on their parameter count.

1) Objective Data Comparison

Figure 9 compares the proposed algorithm with other algorithms in terms of parameter count at a scaling factor of 4. Compared to TDAN, which also employs deformable convolutions, D3DRRN has approximately 0.2M more parameters but achieves a PSNR improvement of approximately 0.4dB. Compared to D3DNet, D3DRRN not only has a lower parameter count but also achieves superior PSNR performance. These findings demonstrate the effectiveness of the proposed multiscale attention mechanism and recursive structure, which enhance the network's feature representation capability while reducing parameter requirements.

FIGURE 9. Parameter quantity comparison.

In contrast to the two-stage methods, TOFlow and SOF-VSR, which rely on optical flow, D3DRRN surpasses them in various aspects. Despite a parameter increase of approximately 0.5M compared to SOF-VSR, D3DRRN achieves a PSNR improvement of approximately 0.6dB. This indicates the superior spatiotemporal feature extraction capability of 3D convolutions.

The Vid4 dataset and SPMC-11 dataset were both subjected to downsampling and underwent 4x super-resolution reconstruction using various algorithms. The results, as presented in Table 6, indicate that our proposed D3DRRN method consistently achieves the best performance in the tests conducted with a scaling factor of 4 for both datasets. Specifically, D3DRRN exhibits notable improvements over the classical video super-resolution reconstruction methods VESPCN and VSRNet on both the Vid4 and SPMC-11 datasets. Comparing against the current state-of-the-art methods that employ optical flow, namely TOFlow and SOF-VSR, D3DRRN achieves PSNR improvements of 0.96dB and 0.84dB on the Vid4 dataset, and 0.76dB and 0.91dB on the more diverse SPMC-11 dataset, respectively. Additionally, when compared to methods employing deformable convolutions such as TDAN and D3DNet, D3DRRN demonstrates superior performance. Specifically, on the Vid4 dataset, D3DRRN achieves PSNR improvements of 0.67dB and 0.33dB over TDAN and D3DNet, respectively, while on the SPMC-11 dataset, it achieves improvements of 0.70dB and 0.39dB, respectively. These findings suggest that the increased network depth, coupled with the multiscale attention channel mechanism and deformable convolutions, synergistically enhance the network’s performance.

TABLE 6 Comparison of 4x Reconstruction Performance on Vid4 and SPMC-11

2) Subjective Effect Comparison

We performed a comparative analysis of 4x super-resolution reconstruction on two video sequences from the Vid4 dataset and two video sequences from the SPMC-11 dataset. Specifically, we selected the “walk” and “city” video sequences from the Vid4 dataset, as well as the “car” and “jvc” video sequences from the SPMC-11 dataset. These sequences were subjected to different super-resolution reconstruction algorithms, and the results are illustrated in Figure 10.

FIGURE 10. Subjective effect comparison.

During the reconstruction of the “walk” video sequence, VSRNet produced blurry images overall. TOFlow, TDAN, and D3DNet incorrectly reconstructed the belt, resulting in artifacts, and introduced a checkerboard effect in the hand region. The reconstruction by SOF-VSR exhibited partial blurriness. In contrast, our proposed D3DRRN method achieved better reconstruction of the fine details of the belt.

In the reconstruction of the “city” video sequence, other algorithms showed varying degrees of “moire patterns” in the reconstructed windows of the buildings. In contrast, D3DRRN preserved details more effectively and reduced the occurrence of “moire patterns,” resulting in reconstructions that closely resembled the ground truth.

For the “car” video sequence, TDAN, D3DNet, and D3DRRN all demonstrated good overall performance. However, the proposed D3DRRN method outperformed the others in accurately reconstructing the contour of the letter “F” in the license plate.

SECTION IV.

Conclusion

This paper primarily introduces the overall network architecture of a fusion attention mechanism-based back-projection recursive network. From the perspective of improving network width and depth, the paper designs a multi-scale feature extraction network, attention network, and reconstruction network. The method utilizes iterative back-projection to iteratively utilize the up-projection and down-projection features. To reduce parameter count and computation time, bottleneck layers and recursive back-projection networks are designed.

Subsequently, a brief introduction is provided for the datasets used in this paper, namely Vimeo-90k, Vid4, and SPMC-11. The experimental software and hardware environments are described, along with the specific experimental settings. To demonstrate the effectiveness of the proposed modules, multiple ablation experiments are conducted. Finally, objective comparisons are made between the proposed algorithm and other algorithms on the Vid4 and SPMC-11 datasets. Additionally, a comparative evaluation of the 4x upscaling results is performed on the Vid4 dataset, confirming the effectiveness of the proposed algorithm. However, the algorithm proposed in this paper still suffers from the issue of slow computation speed, and there is room for further improvement in the reconstruction quality, particularly in terms of fine details. In the next step, we will conduct further research to address these aforementioned issues.
