Introduction
Videos are among the most common and comprehensive sources of information in everyday life. With the advent of advanced imaging technologies, videos can be captured in HD and UHD quality, enhancing the viewer's perceptual experience. However, in certain situations (remote sensing [1], [2], [3], UAV surveillance, etc.), capturing HD video is difficult or costly: high-resolution cameras are expensive, and transmitting such media requires enormous bandwidth.
Video super-resolution (VSR) is a computational technique that addresses these challenges by generating a high-resolution (HR) sequence of video frames from the corresponding low-resolution (LR) frames. VSR [4] has numerous applications in remote sensing, UAV surveillance, panoramic video super-resolution, security, and High Definition (HD) and Ultra HD (UHD) television.
Although various techniques have evolved for single-image super-resolution (SISR) [5], [6], [7], [8], VSR remains a demanding and ill-posed problem. In contrast to SISR, where a single image is super-resolved, VSR must capture inter-frame alignment while performing frame-to-frame super-resolution.
VSR spans two domains: spatial and temporal SR. Spatial SR increases the size of the frames while preserving and adding detail. In contrast, temporal SR [9] reduces the loss of information between two frames caused by temporal sampling: it recovers dynamic events that occur faster than the given frame rate by predicting intermediate frames. This inter-frame information is critical in VSR for maintaining motion consistency in the video. Space-time video SR is even more challenging because spatial and temporal SR are both ill-posed problems [10]. The problem is interesting and valuable as a pre-processing step in many computer vision and biomedical tasks, and it becomes still harder when the video frames are degraded.
Recently, deep-learning-based VSR models have achieved state-of-the-art (SOTA) performance, but they have limitations such as high computational complexity, dataset-specific performance [11], and adaptation only to synthetic LR degradations (e.g., bicubic downsampling) [12]. These frameworks were trained on synthetically generated LR sequences and did not generalize: they performed well on datasets with synthetic LR video sequences, but their performance deteriorated significantly on video sequences from other datasets with real degradation [13]. Real noise is heterogeneous, involves diverse degradations, and has no verifiable mathematical model; generating HR video frame sequences from data with such noise is extremely difficult. To address real-world noise in SISR, ZSSR [14], a self-supervised method, used a zero-shot setting to learn the internal information (non-local structures) of the image with a simple CNN. It outperformed other SISR SOTA models across different blur kernels with minimal computational complexity. However, it could not exploit the patterns of large external datasets, resulting in a non-generalized and less adaptive model, and it required thousands of iterations to learn the information in a sample before producing good results. This problem was overcome using meta-learning in MZSR [15], where the model is first trained on a large external dataset (a transfer-learning step) and then meta-trained over different blur kernels to make it kernel-agnostic. Starting from these weights, zero-shot training on the LR image produces the SR image. The model exploits both the information in the large external dataset and the internal non-local structures of the image to produce strong, generalized results with faster learning, adapting to new samples with different blur kernels in only a few gradient descent steps compared to ZSSR [14].
Inspired by the success of these results on SISR problems, this article introduces a novel zero-shot, meta-learning-based real-world video space-time SR method: a 3D-deep Convolutional Auto-encoder guided attention-based deep Spatio-Temporal back-projection Network (CASTNet) for super-resolving real LR videos. The 3D convolutional auto-encoder (3D-CAE) learns noise-free features from noisy LR videos, and the attention-based deep spatio-temporal back-projection network generates the HR video from these noise-free features. The outputs of the last three layers of the 3D-CAE are concatenated and fed to the deep spatio-temporal network, which upscales the features in the spatial and temporal domains and reconstructs the HR video from the upscaled features. The denoising model weights are updated using the gradients of both the denoising and SR losses, while the SR weights are updated using the gradient of the SR loss alone, and both models are trained jointly in an end-to-end fashion. As shown in Figure 1, the model is first trained on a large dataset in which various blur kernels are used to create different tasks; these tasks are learned in a meta-learning fashion to update the model weights and reduce its kernel dependency. After learning on the external dataset, in the meta-test phase the model is updated in a zero-shot setting on a given video frame sequence to generate video-specific SR. Detailed experimental results demonstrate the effectiveness of the proposed meta-learning and zero-shot video SR framework on degraded and noisy real low-resolution video compared with existing methods. Furthermore, an ablation study highlights the contribution of each component of the proposed network. The enhancement module is referred to as ADST-BPN in this article.
Figure 1. The proposed methodology: in the first stage, the external dataset is used for large-scale training; the model is then meta-trained over kernel-specific tasks and adapted in a zero-shot setting at test time (see text).
Related Work
A. Image Super-Resolution
Recent advancements in image super-resolution (SR) have been driven by deep learning techniques like [5], [7], [8], [16], [17], [18], [19], [20], [21]. While effective, these methods rely on knowing the exact degradation kernel, posing limitations in real-world applications. To overcome this, blind image super-resolution has emerged, focusing on self-supervised estimation of unknown degradation kernels. This approach categorizes methods into non-blind SR and blind SR, offering solutions for scenarios where precise kernel information is unavailable.
Non-blind SR methods use the known degradation kernel to generate high-quality, high-resolution (HR) images. Examples include SRM [22], which takes the low-resolution (LR) image and its corresponding degradation kernel as inputs, and ZSSR [14], which trains an image-specific network on pseudo LR-HR image pairs generated from the test image itself using the same kernel that produced the LR input.
In contrast to supervised methods, blind super-resolution (SR) techniques aim to infer unknown kernels through self-supervision and subsequently apply these estimated kernels to non-blind SR models. Various strategies have been developed for kernel estimation, leveraging self-similarity or employing iterative self-correction mechanisms. The pioneering work by Michaeli and Irani [23] introduced a method to estimate downscaling kernels by exploiting the patch-recurrence characteristic within a single image. Building upon this, KernelGAN [24] enhanced kernel estimation by incorporating Internal-GAN. Additionally, IKC [25] proposed an iterative correction approach, demonstrating its efficacy in producing high-fidelity SR images.
Also, the feedback mechanism is exploited by Li et al. [26] to refine the output of the network. Another mechanism is proposed in FENet by Behjati et al. [27] using a frequency-based enhancement network. Luo et al. [28] proposed a novel adversarial neural degradation (AND) model for blind image SR to generate a wide range of complex degradation effects that are highly non-linear.
In the context of blind image super-resolution (SR), self-supervised methodologies have been put forth [29], [30]. Dong et al. [29] introduced a self-supervised technique that estimates the blur kernel and intermediary high-resolution (HR) image from a single low-resolution (LR) input image. This approach employs a variational model, grounded in the image formation of SR, to enhance the quality of the intermediary HR images. A separate self-supervised method [30] has integrated contrastive learning into blind remote sensing image SR, directing the reconstruction process by promoting positive representations and penalizing negatives. Recently, diffusion-based techniques have attracted significant attention in the field of image SR. One such method, SinSR [31], accomplishes single-step SR generation through the derivation of a deterministic sampling process from the most recent state-of-the-art (SOTA) method, thereby expediting diffusion-based SR. Another diffusion-based SR method, EDiffSR [32], utilizes a diffusion probabilistic model, incorporating an Efficient Activation Network (EANet) for enhanced noise prediction performance and a Conditional Prior Enhancement Module (CPEM) for precise super-resolution. Guo et al. [33] have proposed a face video SR method that addresses video compression artifacts by capitalizing on the correlation among video, audio, and the emotional state of the face.
B. Video Super-Resolution
Video super-resolution (VSR) methods are commonly classified into traditional and deep-learning-based approaches. Schultz and Stevenson [34] introduced a conventional method employing affine models for motion estimation, while 3D steering kernel regression was applied in [35]. Ma et al. [36] utilized the expectation-maximization technique to reconstruct high-resolution frames by estimating the blur kernel. Furthermore, the method in [37] concurrently estimates the blur kernel, motion, and noise level through a Bayesian approach to reconstruct high-resolution frames.
In recent times, deep learning-based strategies have emerged to address the image super-resolution (SR) challenge. Given that a video comprises a sequence of moving images over time, image SR methodologies can be adapted for VSR by incorporating necessary modifications. Notable deep learning-based image SR models include SRCNN [5], SRGAN [20], FSRCNN [18], ESPCN [6], and ZSSR [14]. VSRnet [21], a model derived from SRCNN, is proposed for video super-resolution.
Deep-learning-based VSR methodologies are commonly divided into two categories: those incorporating frame alignment and those operating without it. The former class leverages motion estimation and compensation techniques as initial processing steps to extract precise inter-frame motion details [38], [39], [40], [41] and facilitate frame alignment [10], [21], [42], [43], [44]. This approach, demonstrated in studies such as [45], [46], [47], and [48], proves particularly effective in scenarios involving significant motion dynamics. The optical-flow approach, which uses variations and correlations in the temporal domain to compute the motion between two nearby frames, is popular in most motion estimation techniques [49], [50]. Motion compensation can be applied using either traditional methods [44], [51] or deep-learning-based approaches [49], [50], [52]. Deformable convolution methods for video SR were proposed in [53], [54], [55], [56], [57], and [58]. BasicVSR [59] used a simple RNN architecture with propagation, alignment, and upsampling modules to make VSR suitable for real-time applications. Later, to better handle misalignment, enhanced propagation along with flow-guided deformable alignment was added to BasicVSR [59], yielding BasicVSR++ [60].
Deformable attention mechanisms have gained prominence in blind video super-resolution (VSR) due to their ability to handle complex spatial transformations and focus on relevant features. One notable work in this domain is deep blind super-resolution for satellite video [61]. The BSVSR algorithm proposed in [61] is an empirical approach to blind SVSR that emphasizes sharper cues by considering pixel-wise blur levels in a coarse-to-fine manner. It utilizes multi-scale deformable convolution to aggregate temporal redundancy across adjacent frames through window-slid progressive fusion, followed by deformable attention for meticulous integration of adjacent features into the mid-feature.
Another significant contribution is the bidirectional multi-scale deformable attention for video super-resolution [62]. This method uses a Deformable Alignment Module (DAM) comprising two sub-modules: a Multi-scale Deformable Convolution Module (MDCM), which improves the robustness of adjacent-frame alignment by exploiting offset information at different scales and aligning frames at the feature level, and a Multi-scale Attention Module (MAM), which extracts both local and global features from the aligned features.
Moreover, for lightweight VSR [63], the Deformable Spatial-Temporal Attention aggregates the spatial-temporal information obtained from the multiple reference frames into the current frame to improve the reconstruction effect.
These works demonstrate the effectiveness of deformable attention mechanisms in handling the challenges of alignment and fusion in blind VSR tasks. They provide a robust and effective way to enhance the resolution of video sequences while maintaining temporal consistency.
In methods without alignment, neighbouring frames are not explicitly aligned; instead, spatio-temporal or spatial information is used for feature extraction. These techniques can be further classified into four types: 2D convolution (2D Conv) methods [64], non-local network-based methods [58], [65], [66], 3D convolution (3D Conv) methods [67], and recurrent CNN (R-CNN) based methods [68]. The 2D convolution methods fall under the umbrella of spatial methods, whereas the remaining three belong to the spatio-temporal category and utilize both the temporal and spatial information in the input videos [4]. In 2D convolution methods [64], [69], a 2D convolutional network itself absorbs the correlation information existing within the frames; the three important stages, feature extraction, fusion, and SR, are performed spatially on the input frames instead of performing motion estimation and motion compensation [4]. 3D convolution methods [67], [70], [71] utilize spatial as well as temporal information to super-resolve the video, and recurrent CNNs [68], [72], [73], [74], which can accurately represent temporal dependencies in sequential data, have also been used for video SR. Direct 3D convolution was introduced by Li et al. [75] to generate video sequences, which eliminates the need for RNNs and allows efficient processing. To avoid explicit motion compensation, Jo et al. [70] proposed an architecture that learns dynamic upsampling filters for each pixel based on its local spatio-temporal neighbourhood. CycMu-Net [76] used cycle-projected mutual learning between the spatial and temporal super-resolution tasks to produce results that are optimal in both detail and consistency: spatial features refine temporal predictions, while temporal information helps extract finer spatial details. Addressing blind VSR, DynaVSR [77] utilizes a dynamic encoder-decoder architecture that adapts to different degradation types at runtime. Beyond CNN architectures, transformers have also been exploited for VSR; Liang et al. [78] proposed a Recurrent Video Restoration Transformer (RVRT) with guided deformable attention to handle complex temporal dependencies and object deformations. In recent years, video inbetweening has also been utilized to enhance the temporal resolution of video sequences by creating new frames between known keyframes. Initial methods for video inbetweening include optical-flow-based interpolation [52], [79] and pixel motion transformation [80], [81]. For long-term video interpolation, block-based motion estimation/compensation methods or LSTM models [82] were used.
C. Zero-Shot Super-Resolution With Meta-Learning
SOTA SISR methods fail, and their performance deteriorates, on real LR input (LR images with compression, sensor, and random noise) because they are trained on datasets whose LR counterparts are generated synthetically by bicubic down-sampling of the corresponding HR images. These SOTA SR methods are also very deep and computationally complex. To overcome these challenges, Shocher et al. [14] proposed zero-shot learning for SISR: the model is very light and needs no training on an external dataset. It performs image-specific training at test time and outperforms SOTA methods on real LR images such as old photos and degraded images. However, because it adapts slowly to data with statistically distant noise and kernels, Soh et al. presented meta-transfer learning combined with ZSSR [83], initializing the model weights so that learning from an external dataset is combined with internal learning in the zero-shot setting; here, meta-transfer learning enables fast convergence during zero-shot training at test time. Emad et al. presented dual-path zero-shot learning [84], which trains a CycleGAN-based architecture in the zero-shot setting to further improve performance on real-world LR images. Reference [85] employs a similar meta-transfer-learning approach.
Performing spatio-temporal SR, i.e., creating intermediate frames along with spatial SR, is even more challenging. Existing spatio-temporal VSR methods [9], [86], [87], [88] rely primarily on bicubic and tricubic downsampling to form LR inputs, require large datasets to generalize, fail to adapt to blur produced by different kernels, and are computationally complex. This limits their use in practical applications such as HD television [89], UAV surveillance [90], [91], and security [92], [93]. Due to the heterogeneous nature of noise in real LR video, these models do not produce motion-consistent HR videos effectively.
D. Motion Estimation and Compensation
In video super-resolution, simply applying image super-resolution methods to each frame independently may not produce satisfactory results, because there may be inter-frame motion that causes temporal distortion and blur. Inter-frame motion is the movement of objects or cameras between consecutive frames, which can create misalignment and inconsistency between the frames. Therefore, motion estimation and compensation are needed to handle the inter-frame motion and align the frames before applying super-resolution.
Several deep-learning-based networks have been used to solve the motion estimation problem. For stereo matching, Zbontar and LeCun [94] and Lou et al. [95] learned patch distance measures with a CNN, whereas Fischer et al. [49] and Mayer et al. [96] proposed end-to-end architectures to predict optical flow and stereo disparity. For motion compensation, earlier VSR methods [10], [35], [38], [97], [98] achieved inter-frame motion compensation either by estimating optical flow or by applying block matching. Deep-learning-based VSR methods typically use backward warping, aligning all other frames to the reference frame, to achieve inter-frame motion compensation.
The existing work on zero-shot super-resolution with meta-learning was limited to images. This motivated us to design an architecture for meta-learning-based zero-shot video space-time SR. Our model incorporates all benefits obtained by the zero-shot and meta-learning-based training. Along with this, our model solves denoising and super-resolution simultaneously with a dedicated architecture for joint optimization of denoising and super-resolution.
Methodology
A. Proposed Architecture and Optimisation
The block diagram of the proposed methodology is depicted in Figure 2. Motion estimation and compensation modules are employed to maintain temporal coherency in the output video sequence. After frame alignment, the enhancement module, which consists of two parts, the denoising module and the spatio-temporal super-resolution module, is employed to obtain spatio-temporal super-resolution. The details of each module are as follows:
Figure 2. Block diagram of the proposed framework. Here H and W are the height and width of an image frame, and n and m denote the numbers of frames.
1) Motion Estimation
Motion estimation (ME) is the process of estimating the motion vectors between successive frames in a given video sequence. Motion vectors are the displacement of the pixels from one frame to the other, which indicate the direction as well as the magnitude of motion. The ME module takes two LR frames as input and generates an LR motion field as output, and it can be defined as\begin{equation*} F_{i\rightarrow j} = Net_{ME}(I_{i}^{L}, I_{j}^{L};\theta _{ME}) \tag {1}\end{equation*}
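The paper does not detail the internal layers of Net_ME at this point; purely as an illustration of the interface in Eq. (1), a minimal PyTorch sketch might look as follows (layer widths and depth are assumptions, not the authors' design).

```python
import torch
import torch.nn as nn

class MotionEstimationNet(nn.Module):
    """Illustrative Net_ME: takes two LR frames and returns a 2-channel
    LR motion field (u, v), matching the interface of Eq. (1).
    Layer widths and depth are assumptions for this sketch."""
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * in_ch, feat, 3, padding=1), nn.PReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.PReLU(),
            nn.Conv2d(feat, 2, 3, padding=1))   # (u, v) displacement per pixel

    def forward(self, frame_i, frame_j):
        # F_{i->j}: motion field mapping frame i toward frame j
        return self.body(torch.cat([frame_i, frame_j], dim=1))

# usage: estimate the flow between two consecutive LR frames
net_me = MotionEstimationNet()
flow = net_me(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))  # (1, 2, 64, 64)
```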
2) Motion Compensation
Motion compensation is the process of aligning frames according to the motion vectors to reduce temporal distortion and blur, which helps improve the temporal consistency and detail of the video. We use the motion compensation layer proposed by Tao et al. [42], which leverages sub-pixel information from motion to achieve simultaneous sub-pixel motion compensation (SPMC). The SPMC layer performs sub-pixel shifting and interpolation to generate aligned frames: the sub-pixel shifting operation shifts each pixel in the neighboring frames according to its corresponding motion vector, producing a set of shifted frames, and the interpolation operation combines the shifted frames using a convolutional filter to produce an aligned frame. The output of the SPMC layer is a set of aligned frames that match the reference frame. It can be defined as\begin{equation*} J_{i}^{LR} = L_{SPMC}(I_{i}^{LR}, F; \alpha ) \tag {2}\end{equation*}
\begin{align*} \begin{pmatrix}x_{p}^{s} \\ y_{p}^{s}\end{pmatrix} = W_{F;\alpha }\begin{pmatrix} x_{p} \\ y_{p} \end{pmatrix} = \alpha \begin{pmatrix} x_{p} + u_{p} \\ y_{p} + v_{p} \end{pmatrix} \tag {3}\end{align*}
here, p indexes pixels in the LR image space and q indexes pixels on the target HR grid, so that the value at HR location q is accumulated from the shifted LR pixels as\begin{equation*} J_{q}^{H} = \sum _{p} J_{p}^{L}\, M(x_{p}^{s} - x_{q})\, M(y_{p}^{s} - y_{q}) \tag {4}\end{equation*}where M(\cdot ) is the interpolation (sampling) kernel used to combine the shifted pixels.
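To make Eqs. (2)-(4) concrete, the following is an illustrative re-implementation of SPMC-style forward warping, with a bilinear kernel standing in for M(·); the original layer is the one of Tao et al. [42], and the normalization step here is an added practical detail rather than part of Eq. (4).

```python
import torch

def spmc_layer(lr, flow, scale):
    """Sketch of sub-pixel motion compensation (Eqs. 2-4): LR pixels are shifted
    by their motion vectors, scaled by `scale` (alpha), and splatted onto the
    HR grid with a bilinear kernel standing in for M(.)."""
    b, c, h, w = lr.shape
    H, W = h * scale, w * scale
    yy, xx = torch.meshgrid(torch.arange(h, device=lr.device),
                            torch.arange(w, device=lr.device), indexing="ij")
    xs = scale * (xx.float() + flow[:, 0])       # Eq. (3): x_p^s = alpha * (x_p + u_p)
    ys = scale * (yy.float() + flow[:, 1])       #          y_p^s = alpha * (y_p + v_p)
    out = lr.new_zeros(b, c, H, W)
    wsum = lr.new_zeros(b, 1, H, W)
    x0, y0 = xs.floor().long(), ys.floor().long()
    for dx in (0, 1):                            # splat onto the 4 nearest HR pixels
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - (xs - xi.float()).abs()).clamp(min=0) * \
                  (1 - (ys - yi.float()).abs()).clamp(min=0)      # bilinear kernel M
            wgt = wgt * ((xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)).float()
            idx = (yi.clamp(0, H - 1) * W + xi.clamp(0, W - 1)).view(b, 1, -1)
            out.view(b, c, -1).scatter_add_(2, idx.expand(b, c, -1),
                                            (lr * wgt.unsqueeze(1)).view(b, c, -1))
            wsum.view(b, 1, -1).scatter_add_(2, idx, wgt.view(b, 1, -1))
    # Eq. (4): weighted accumulation (normalization added to handle overlapping splats)
    return out / wsum.clamp(min=1e-6)

# usage: place an LR frame onto the x4 HR grid according to its flow field
# hr_aligned = spmc_layer(lr_frame, flow, scale=4)
```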
3) The Enhancement Module
The detailed architecture of the enhancement module, which is used to improve the spatial and temporal resolution of real LR video, is shown in Figure 3. Since the input video has low resolution as well as noise and degradation, the main objective of the proposed enhancement module is to improve the resolution of the input real LR video while removing the noise embedded in it.
Proposed Architecture of 3D-CAE guided Deep Spatio-Temporal Back Projection Network for Video Super-Resolution.
The enhancement module consists of two sub-modules: one for denoising and the other for spatio-temporal super-resolution. The denoising module removes the unwanted noise and degradation in the real LR video. It is a 3D-deep convolutional auto-encoder (3D-CAE) consisting of three 3D-convolution layers followed by three 3D-transposed-convolution layers. The output of this 3D-CAE is a noise-free LR video with the same spatial and temporal resolution as the input.
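A minimal sketch of such a 3D-CAE is shown below; the text specifies only the layer counts (three 3D convolutions followed by three 3D transposed convolutions), so the channel widths, kernel sizes, and activation are assumptions. The decoder feature maps are returned so they can be concatenated for the SR branch, as described next.

```python
import torch
import torch.nn as nn

class CAE3D(nn.Module):
    """Illustrative 3D-CAE denoiser: three Conv3d layers followed by three
    ConvTranspose3d layers; input and output share the same spatio-temporal size.
    Channel widths and kernel sizes are assumptions."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv3d(in_ch, feat, 3, padding=1),
            nn.Conv3d(feat, feat, 3, padding=1),
            nn.Conv3d(feat, feat, 3, padding=1)])
        self.dec = nn.ModuleList([
            nn.ConvTranspose3d(feat, feat, 3, padding=1),
            nn.ConvTranspose3d(feat, feat, 3, padding=1),
            nn.ConvTranspose3d(feat, in_ch, 3, padding=1)])
        self.act = nn.PReLU()

    def forward(self, x):                       # x: (B, C, T, H, W) noisy LR clip
        for layer in self.enc:
            x = self.act(layer(x))
        dec_feats = []
        for layer in self.dec:
            x = self.act(layer(x))
            dec_feats.append(x)                 # kept for concatenation in the SR branch
        # the last element approximates the denoised LR clip
        return dec_feats

clip = torch.rand(1, 3, 5, 64, 64)              # 5-frame noisy LR clip
features = CAE3D()(clip)
```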
The features from the 3D-deconvolution layers are concatenated and fed to the spatio-temporal super-resolution module, which is the ADST-BPN. In the spatio-temporal super-resolution module, a 3D-attention layer is placed at the beginning, followed by one 3D-convolution layer. After that, a series of spatio-temporal up-projection and down-projection blocks are arranged in a cascade. These spatio-temporal blocks, shown in Figure 4 and discussed in detail in the next section, are the 3D version of the up- and down-projection blocks given in [104]. The features from all spatio-temporal up-projection blocks are concatenated at the end, and these concatenated features are passed through one more 3D-convolutional layer, whose output is the desired HR video. In the enhancement module, a 3D self-attention layer is also used to learn the relations among the prominent features that contribute most to feature enhancement.
4) Spatio-Temporal Up- and Back-Projection
As shown in Figure 4, the spatio-temporal up-projection unit takes the LR video features from the preceding computation and maps them to intermediate HR features\begin{equation*} H_{t} = (L_{t} * p_{t})\uparrow _{s} \tag {5}\end{equation*}
\begin{equation*} L_{t+1} = (H_{t} * g_{t})\downarrow _{s} \tag {6}\end{equation*}
\begin{equation*} r_{t} = L_{t+1} - L_{t} \tag {7}\end{equation*}
\begin{equation*} H_{t+1} = (r_{t} * q_{t+1})\uparrow _{s} \tag {8}\end{equation*}
\begin{equation*} HR_{map} = H_{t+1} + H_{t} \tag {9}\end{equation*}
\begin{equation*} L_{t} = (H_{t} * M_{t})\downarrow _{s} \tag {10}\end{equation*}
\begin{equation*} H_{t+1} = (L_{t} * N_{t}) \uparrow _{s} \tag {11}\end{equation*}
\begin{equation*} r_{n} = H_{t+1} - H_{t} \tag {12}\end{equation*}
\begin{equation*} L_{t+1} = (r_{n} * R_{t+1})\downarrow _{s} \tag {13}\end{equation*}
\begin{equation*} LR_{map} = L_{t} + L_{t+1} \tag {14}\end{equation*}
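A compact sketch of the two projection units, following Eqs. (5)-(14), is given below; strided 3D (transposed) convolutions stand in for the convolve-then-rescale operators ↑s and ↓s, and the kernel shapes and feature widths are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class UpProjection3D(nn.Module):
    """Spatio-temporal up-projection unit (Eqs. 5-9). The (conv, upscale-s) and
    (conv, downscale-s) pairs are realized with strided 3D (transposed)
    convolutions; exact kernel shapes are assumptions."""
    def __init__(self, feat=64, s=(2, 2, 2)):   # (temporal, spatial, spatial) scale
        super().__init__()
        k = tuple(2 * x for x in s)             # kernel = 2*stride, padding = stride/2
        p = tuple(x // 2 for x in s)
        self.up1 = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)   # p_t
        self.down = nn.Conv3d(feat, feat, k, stride=s, padding=p)           # g_t
        self.up2 = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)   # q_{t+1}
        self.act = nn.PReLU()

    def forward(self, L_t):
        H_t = self.act(self.up1(L_t))           # Eq. (5):  H_t = (L_t * p_t) up_s
        L_back = self.act(self.down(H_t))       # Eq. (6):  L_{t+1} = (H_t * g_t) down_s
        r_t = L_back - L_t                      # Eq. (7):  LR residual
        H_res = self.act(self.up2(r_t))         # Eq. (8):  back-project the residual
        return H_res + H_t                      # Eq. (9):  HR_map

class DownProjection3D(nn.Module):
    """Spatio-temporal down-projection unit (Eqs. 10-14), mirroring the above."""
    def __init__(self, feat=64, s=(2, 2, 2)):
        super().__init__()
        k = tuple(2 * x for x in s)
        p = tuple(x // 2 for x in s)
        self.down1 = nn.Conv3d(feat, feat, k, stride=s, padding=p)          # M_t
        self.up = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)    # N_t
        self.down2 = nn.Conv3d(feat, feat, k, stride=s, padding=p)          # R_{t+1}
        self.act = nn.PReLU()

    def forward(self, H_t):
        L_t = self.act(self.down1(H_t))         # Eq. (10)
        H_back = self.act(self.up(L_t))         # Eq. (11)
        r_n = H_back - H_t                      # Eq. (12): HR residual
        L_res = self.act(self.down2(r_n))       # Eq. (13)
        return L_res + L_t                      # Eq. (14): LR_map
```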
B. Learning on External Dataset
Assume a high-quality external dataset of LR-HR video pairs, with $V_{LRi}$ and $V_{HRi}$ denoting the i-th LR and HR videos and $V_{NLRi}$ the corresponding degraded (noisy) LR video. The loss over this dataset is\begin{align*} \mathscr {L}^{D}(\theta )& =w_{1}\frac {1}{n}\sum _{i=1}^{n}\frac {1}{2}\parallel V_{HRi} - f_{\theta }(V_{NLRi}) \parallel _{2} \\ & \quad +w_{2}\frac {1}{n}\sum _{i=1}^{n}\frac {1}{2}\parallel V_{LRi} - f_{\theta CAE}(V_{NLRi}) \parallel _{2} \tag {15}\end{align*}
here, n is the total number of LR-HR video pairs in the external dataset, $f_{\theta }$ is the full network producing the HR output, $f_{\theta CAE}$ is its 3D-CAE denoising branch, and $w_{1}$ and $w_{2}$ weight the SR and denoising terms. The model parameters are learned by minimizing this weighted linear combination of the SR loss and the denoising loss.
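In code, the objective of Eq. (15) can be sketched as follows; `model` and `model.cae` are hypothetical handles for the full network $f_{\theta }$ and its denoising branch $f_{\theta CAE}$, and squared-error terms stand in for the norm terms.

```python
import torch.nn.functional as F

def joint_loss(model, v_nlr, v_lr, v_hr, w1=1.0, w2=1.0):
    """Weighted combination of the SR loss and the denoising loss (Eq. 15).
    `model` is the full SR network and `model.cae` its 3D-CAE branch;
    both names are placeholders for this sketch."""
    sr_out = model(v_nlr)                 # HR prediction from the noisy LR clip
    denoised = model.cae(v_nlr)           # denoised LR prediction from the 3D-CAE
    sr_loss = F.mse_loss(sr_out, v_hr)    # stands in for || V_HR - f_theta(V_NLR) ||
    dn_loss = F.mse_loss(denoised, v_lr)  # stands in for || V_LR - f_thetaCAE(V_NLR) ||
    return w1 * sr_loss + w2 * dn_loss
```

Back-propagating this joint loss reproduces the update rule described in the Introduction: the SR branch receives only the SR-loss gradient, while the denoising branch receives gradients from both terms.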
C. Meta Transfer Learning for Video SR
After learning the model parameters on the external dataset, meta-transfer learning is performed so that the model can adapt quickly to unseen blur kernels. A meta dataset is constructed by degrading videos with different blur kernels, and each kernel defines a separate task $T_{i}$.
The meta dataset is further divided into task-level training and testing sets, which are used to compute the task losses $L_{T_{i}}^{training}$ and $L_{T_{i}}^{test}$, respectively.
For each new task $T_{i}$, the model parameters are adapted by gradient descent on the task-level training loss,\begin{equation*} \theta _{i} = \theta - \alpha \nabla _{\theta }L_{T_{i}}^{training}\left ({{ \theta }}\right ) \tag {16}\end{equation*}
here, $\alpha $ is the task-level (inner) learning rate.
The model parameters are further optimized by minimizing the loss,\begin{equation*} \theta _{M} = \arg \min _{\theta } \sum _{T_{i}} L_{T_{i}}^{test} \left ({{ \theta - \alpha \nabla _{\theta }L_{T_{i}}^{training}\left ({{ \theta }}\right ) }}\right ) \tag {17}\end{equation*}
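Equations (16)-(17) correspond to a MAML-style update. A compact sketch using the first-order approximation (for readability) is shown below; `tasks` and `task_loss` are placeholders for the kernel-specific task batches and the loss of Eq. (15).

```python
import copy
import torch

def maml_meta_update(model, tasks, task_loss, alpha=0.01, beta=0.001):
    """One meta-update over a batch of kernel-specific tasks (Eqs. 16-17),
    using the first-order MAML approximation for readability. Each task yields
    a (train_batch, test_batch) pair built with its own blur kernel."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for train_batch, test_batch in tasks:
        adapted = copy.deepcopy(model)
        # inner step, Eq. (16): theta_i = theta - alpha * grad_theta L_train(theta)
        loss_tr = task_loss(adapted, *train_batch)
        grads = torch.autograd.grad(loss_tr, adapted.parameters())
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= alpha * g
        # outer objective, Eq. (17): L_test evaluated at the adapted parameters
        loss_te = task_loss(adapted, *test_batch)
        grads_te = torch.autograd.grad(loss_te, adapted.parameters())
        for acc, g in zip(meta_grads, grads_te):
            acc += g
    # apply the accumulated meta-gradient to the shared initialization theta
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= beta * g / max(len(tasks), 1)
```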
D. Adaptation to Test Video: Meta-Test
The meta-learned parameters $\theta _{M}$ are then used to initialize the model for the meta-test phase. This step is a video-specific training performed at test time: given a test LR video sequence, the model is updated in a zero-shot setting, learning from the internal information of the test video itself before producing the final space-time super-resolved output.
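A sketch of this meta-test adaptation under the stated setting is given below; `degrade` is a placeholder for the operator that builds pseudo-LR clips from the test video itself (it must downscale by the model's space-time factors), and the number of update steps is illustrative.

```python
import torch
import torch.nn.functional as F

def zero_shot_adapt(model, test_clip, degrade, steps=10, lr=1e-3):
    """Video-specific adaptation at test time, starting from the meta-learned
    weights. `degrade` creates pseudo-LR clips from the test clip itself; after
    a few updates, the adapted model super-resolves the original test clip."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        pseudo_lr = degrade(test_clip)          # child clip: degraded test video
        sr = model(pseudo_lr)                   # should reconstruct the test clip
        loss = F.mse_loss(sr, test_clip)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(test_clip)                 # final video-specific SR output
```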
Experimental Results
In this section, we present the implementation details of our proposed method, the results of the proposed network, and its comparison with existing SOTA methods for video SR. A comparative analysis in terms of evaluation metrics, as well as a visual comparison, is also presented, followed by an ablation study and a discussion of computational complexity.
A. Implementation Details
Dataset: We used two popular video super-resolution datasets for training and evaluation of our proposed model: Vid4 [43] and REDS [105]. The Vid4 dataset is a commonly used benchmark for video super-resolution algorithms. It consists of four video sequences: “walk,” “city,” “foliage,” and “calendar.” The video lengths range from 26 to 47 frames, and the resolution of the videos ranges from
These datasets provide a diverse range of video sequences essential for training robust and generalizable video super-resolution models. The high-quality ground truth frames in these datasets allow for precise evaluation of the super-resolution performance.
Degraded Video Sequence Generation: To add realistic degradation in images for video super-resolution, we used the degradation model that simulates real-world degradation. Several models [24], [25], [107], [108] have been proposed that can be used for this purpose. Using these models, we generated degraded LR images that resemble real-world degradation and used them to train the proposed VSR model.
The proposed model was initially trained on a large-scale dataset for 200 epochs, using the Adam optimizer with a learning rate of 0.0001. Meta-learning configuration: our meta-learning setup is based on the Model-Agnostic Meta-Learning (MAML) framework, with an inner learning rate of 0.01 and an outer learning rate of 0.001; meta-training was performed for 50 epochs. For motion estimation, a pretrained Motion Compensation Transformer (MCT) model trained on a large-scale video dataset is used. The MCT module was fine-tuned during training of the entire model to adapt to the specific characteristics of degraded and noisy real low-resolution videos.
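For reference, the hyper-parameters stated above can be collected in a single configuration; the values come directly from the text, and any setting not listed is left unspecified.

```python
# Hyper-parameters reported in the text; all other settings are unspecified here.
config = {
    "pretrain": {"epochs": 200, "optimizer": "Adam", "lr": 1e-4},
    "meta_learning": {                      # MAML-style meta-transfer learning
        "inner_lr": 0.01,                   # task-level (inner) learning rate
        "outer_lr": 0.001,                  # meta (outer) learning rate
        "epochs": 50,
    },
    "motion_compensation": {"pretrained_mct": True, "finetune": True},
}
```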
B. Result Analysis
The research work is evaluated in two ways: quantitative evaluation and qualitative evaluation. In the quantitative evaluation, we compare the achieved results with other SOTA methods for video super-resolution using the PSNR and SSIM metrics. Gaussian noise of different densities is added to the test images for evaluation, and in one case random noise is added. As is evident from Tables 1 and 2, the PSNR/SSIM values are close to those of MuCAN [66] when no noise is present, but they increase relative to the other methods as noise is included, because the proposed model is specifically trained on real-world noisy video frames. The proposed method surpasses the other methods for all noise densities and on both datasets.
In the qualitative evaluation, the visual results of the proposed method are presented and compared with those of other SOTA methods. Figure 5 compares the proposed CASTNet with SOTA methods, whereas Figures 6 and 7 compare spatial super-resolution on the REDS dataset. In all these figures, it can be observed that the proposed method reconstructs much finer details and texture. In the temporal domain, mid-frames between two successive frames of the input video sequence are reconstructed, and the proposed model produces these mid-frames with less flicker, owing to the better reconstruction of sharpness and smooth edges. In Figure 8, visual results of temporal super-resolution are shown for two video sequences; for each, the first row shows the original video frames and the second row the reconstructed frames, which exhibit a smoother transition from one frame to the next.
Visual Results of Temporal SR on Vid4 Dataset. (a) and (c) Original Frames (b) and (d) Predicted Frames. Even numbered frames are reconstructed frames between two successive odd-numbered frames.
We conducted an additional experiment in which we quantitatively measured flicker by calculating the temporal intensity variation between consecutive frames, providing an objective measure. This quantitative analysis, reported in Table 7, indicates that the mid-frames generated by the proposed method are very close to the ground truth.
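One plausible formulation of this flicker measure (the exact definition used for Table 7 is not reproduced here) is the mean absolute intensity change between consecutive frames:

```python
import torch

def temporal_flicker(frames):
    """Mean absolute intensity change between consecutive frames.
    `frames`: tensor of shape (T, C, H, W); lower values indicate less flicker.
    This is one plausible formulation, not necessarily the one used in Table 7."""
    diffs = (frames[1:] - frames[:-1]).abs()    # frame-to-frame differences
    return diffs.mean().item()

# usage: compare flicker of predicted vs. ground-truth sequences
# flicker_pred = temporal_flicker(pred_frames)
# flicker_gt   = temporal_flicker(gt_frames)
```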
In Figure 9, the temporal super-resolution performance of the proposed model is compared with a traditional interpolation approach. In any video sequence, when the temporal frequency of a moving object is greater than the camera frame rate, aliasing occurs; this is known as motion aliasing or temporal aliasing. Under this aliasing, the object appears to move along a false trajectory, or its motion may be distorted. The problem can be alleviated by reconstructing mid-frames between successive frames. In Figure 9, the real movement direction is clockwise, but on the left the fan appears to rotate in the other direction; on the right, this aliasing effect is reduced by using the reconstructed mid-frames.
Comparison of Temporal Super-Resolution (3x): Bicubic Interpolation Method (Left) vs. Our Method (Right).
C. Computational Complexity
This study presents an extensive investigation into the computational complexity of our proposed model, focusing primarily on the number of floating-point operations (FLOPs). Our model achieves state-of-the-art results while maintaining a balance between computational efficiency and performance. It consists of six main components: 3D Convolutional Neural Network (3D-CNN) blocks, PReLU, addition, subtraction, concatenation, and attention. Each component contributes to the total computational load of the model, so each part is analyzed individually. Table 4 summarizes the FLOPs breakdown for all major components of our model.
Additionally, we compare our proposed model with existing models in terms of model size, runtime, and memory in Table 6. Our model stands out with only 1.69 million parameters, a fast runtime of 56 ms, and minimal memory consumption of 6.48 MB, demonstrating superior computational efficiency without compromising performance.
D. Ablation Study
In Table 5, it can be seen that meta-learning yields a 1.5 dB improvement in PSNR, as it harnesses the benefits of pre-trained weights and helps the model acquire kernel-agnostic properties. Without the denoising module, the PSNR decreases significantly: the denoising module improves the quality of the LR input fed to the enhancement (ADST-BPN) module, which in turn significantly improves the quality of the SR video sequence. The attention module helps localize the optical flow in the video, which substantially aids in recovering and improving temporal and motion coherence.
Limitations and Future Work
A. Limitations
The proposed framework may be less effective in scenarios with extremely low-resolution videos or highly noisy environments; with complex motion patterns or large motion displacements; with limited training data, or when the training data does not accurately represent the target domain; and with limited computational resources, or when real-time performance is required.
B. Future Points
Future work includes extending the proposed framework to handle more complex scenarios, such as extremely low-resolution videos, multi-modal data, and real-time applications; investigating its potential for other computer vision tasks, such as video denoising, video deblurring, and video inpainting; and exploring its application in areas such as medical imaging, remote sensing, and security surveillance. To further validate the effectiveness of the proposed framework, its performance should be assessed on larger and more varied datasets, and alternative evaluation metrics such as the Perceptual Index (PI) could offer deeper insights.
Conclusion
This paper presents a zero-shot learning and meta-learning-based video space-time super-resolution (SR) algorithm. A novel noise-robust video space-time SR architecture, the 3D-deep Convolutional Auto-Encoder guided attention-based deep spatio-temporal back-projection network (CASTNet), is introduced. The proposed method can effectively handle real degradation and noise while super-resolving low-resolution (LR) video by jointly optimizing separate SR and denoising losses. The proposed solution converges faster and is robust against realistic degradation. Several comparative studies are presented in the experiment section to validate the efficacy of the proposed framework, and a detailed ablation study highlights the significance of each component of the methodology. We also discussed the limitations and future directions of this work: one possible limitation is that the framework may be less effective for extremely low-resolution videos, highly noisy videos, complex motion patterns, or large motion displacements. Looking forward, we plan to address these limitations, extend our framework to other computer vision tasks such as video denoising and video inpainting, explore other domains such as remote sensing and medical imaging, and test our model on a wider variety of videos to ensure its robustness and generalizability. We believe this work provides a solid foundation for future research in video space-time SR and opens new avenues for exploration.
ACKNOWLEDGMENT
The authors extend their appreciation to Intelligent Prognostic Private Limited Delhi, India; Universiti Sultan Zainal Abidin (UniSZA) Malaysia and Graphic Era (Deemed to be University), Dehradun, India; Bennett University, India; CSIR-CEERI, Pilani, India; and the Department of Physics, Badghis University, Afghanistan, for providing technical and non-technical support in this research work.