Introduction
Videos are among the most common and comprehensive sources of information in everyday life. With the advent of advanced imaging technologies, videos can be captured in HD and UHD quality, enhancing the viewer's perceptual experience. However, in certain situations (remote sensing [1], [2], [3], UAV surveillance, etc.), capturing HD video is difficult or costly: high-resolution cameras are expensive, and transmitting such media requires enormous bandwidth.
Video super-resolution (VSR) is a computational technique that addresses these challenges by generating a high-resolution (HR) sequence of video frames from the corresponding low-resolution (LR) frames. VSR [4] has numerous applications in remote sensing, UAV surveillance, panoramic video super-resolution, security, and High Definition (HD) and Ultra HD (UHD) television.
Although various techniques have evolved for single-image super-resolution (SISR) [5], [6], [7], [8], VSR remains a demanding and ill-posed problem. In contrast to SISR, where a single image is super-resolved, VSR must capture inter-frame alignment while performing frame-to-frame super-resolution.
VSR spans two domains: spatial and temporal SR. Spatial SR increases the size of the frames while preserving and adding detail. In contrast, temporal SR [9] reduces the loss of information between two frames caused by temporal sampling: it recovers dynamic events that occur faster than the given frame rate by predicting intermediate frames. This inter-frame information is critical in VSR for maintaining motion consistency in the video. Space-time video SR is even more challenging because spatial and temporal SR are both ill-posed problems [10]. The problem is interesting and valuable as a pre-processing step in many computer vision and biomedical tasks, and it becomes still harder when the video frames are degraded.
Recently, deep-learning-based VSR models have achieved state-of-the-art (SOTA) performance, but they have limitations such as high computational complexity, dataset-specific performance [11], and adaptation only to synthetic LR degradations (e.g., bicubic downsampling) [12]. These frameworks were trained on synthetically generated LR sequences and did not generalize: they performed well on datasets with synthetic LR video sequences, but their performance deteriorated significantly on video sequences from other datasets with real degradation [13]. Real noise is heterogeneous, involves diverse degradations, and has no verifiable mathematical model; generating HR video frame sequences from data with such noise is extremely difficult. To address real-world noise in SISR, ZSSR [14], a self-supervised method, used a zero-shot setting to learn the internal information (non-local structures) of the image with a simple CNN. It outperformed other SISR SOTA models across different blur kernels with minimal computational complexity. However, it could not exploit the patterns of large external datasets, resulting in a non-generalized and less adaptive model, and it required thousands of iterations to learn the information in a sample before producing good results. This problem was overcome using meta-learning in MZSR [15], where the model is first trained on a large external dataset (a transfer-learning step) and then meta-trained over different blur kernels to make it kernel-agnostic. Starting from these weights, zero-shot training on the LR image produces the SR image. The model exploits both the information in the large external dataset and the internal non-local structures of the image to produce strong, generalized results with faster learning, adapting to new samples with different blur kernels in only a few gradient descent steps compared to ZSSR [14].
Inspired by the success of these results on SISR problems, this article introduces a novel zero-shot, meta-learning-based real-world video space-time SR method: a 3D-deep Convolutional Auto-encoder guided attention-based deep Spatio-Temporal back-projection Network (CASTNet) for super-resolving real LR videos. The 3D convolutional auto-encoder (3D-CAE) learns noise-free features from noisy LR videos, and the attention-based deep spatio-temporal back-projection network generates the HR video from these noise-free features. The outputs of the last three layers of the 3D-CAE are concatenated and fed to the deep spatio-temporal network, which upscales the features in the spatial and temporal domains and reconstructs the HR video from the upscaled features. The denoising model weights are updated using the gradients of both the denoising and SR losses, while the SR weights are updated using the gradient of the SR loss alone, and both models are trained jointly in an end-to-end fashion. As shown in Figure 1, the model is first trained on a large dataset in which various blur kernels are used to create different tasks; these tasks are learned in a meta-learning fashion to update the model weights and reduce its kernel dependency. After learning on the external dataset, in the meta-test phase the model is updated in a zero-shot setting on a given video frame sequence to generate video-specific SR. Detailed experimental results demonstrate the effectiveness of the proposed meta-learning and zero-shot video SR framework on degraded and noisy real low-resolution video compared with existing methods. Furthermore, an ablation study highlights the contribution of each component of the proposed network. The enhancement module is referred to as ADST-BPN in this article.
Figure 1. The proposed methodology: in the first stage, the external dataset is used for large-scale training; the model is then meta-trained over kernel-specific tasks and adapted in a zero-shot setting at test time (see text).
Related Work
A. Image Super-Resolution
Recent advancements in image super-resolution (SR) have been driven by deep learning techniques like [5], [7], [8], [16], [17], [18], [19], [20], [21]. While effective, these methods rely on knowing the exact degradation kernel, posing limitations in real-world applications. To overcome this, blind image super-resolution has emerged, focusing on self-supervised estimation of unknown degradation kernels. This approach categorizes methods into non-blind SR and blind SR, offering solutions for scenarios where precise kernel information is unavailable.
Non-blind SR methods use the known degradation kernel to generate high-quality, high-resolution (HR) images. Examples include SRM [22], which takes the low-resolution (LR) image and its corresponding degradation kernel as inputs, and ZSSR [14], which trains an image-specific network on pseudo LR-HR image pairs generated from the test image itself using the same kernel that produced the LR input.
In contrast to supervised methods, blind super-resolution (SR) techniques aim to infer unknown kernels through self-supervision and subsequently apply these estimated kernels to non-blind SR models. Various strategies have been developed for kernel estimation, leveraging self-similarity or employing iterative self-correction mechanisms. The pioneering work by Michaeli and Irani [23] introduced a method to estimate downscaling kernels by exploiting the patch-recurrence characteristic within a single image. Building upon this, KernelGAN [24] enhanced kernel estimation by incorporating Internal-GAN. Additionally, IKC [25] proposed an iterative correction approach, demonstrating its efficacy in producing high-fidelity SR images.
Also, the feedback mechanism is exploited by Li et al. [26] to refine the output of the network. Another mechanism is proposed in FENet by Behjati et al. [27] using a frequency-based enhancement network. Luo et al. [28] proposed a novel adversarial neural degradation (AND) model for blind image SR to generate a wide range of complex degradation effects that are highly non-linear.
In the context of blind image super-resolution (SR), self-supervised methodologies have been put forth [29], [30]. Dong et al. [29] introduced a self-supervised technique that estimates the blur kernel and intermediary high-resolution (HR) image from a single low-resolution (LR) input image. This approach employs a variational model, grounded in the image formation of SR, to enhance the quality of the intermediary HR images. A separate self-supervised method [30] has integrated contrastive learning into blind remote sensing image SR, directing the reconstruction process by promoting positive representations and penalizing negatives. Recently, diffusion-based techniques have attracted significant attention in the field of image SR. One such method, SinSR [31], accomplishes single-step SR generation through the derivation of a deterministic sampling process from the most recent state-of-the-art (SOTA) method, thereby expediting diffusion-based SR. Another diffusion-based SR method, EDiffSR [32], utilizes a diffusion probabilistic model, incorporating an Efficient Activation Network (EANet) for enhanced noise prediction performance and a Conditional Prior Enhancement Module (CPEM) for precise super-resolution. Guo et al. [33] have proposed a face video SR method that addresses video compression artifacts by capitalizing on the correlation among video, audio, and the emotional state of the face.
B. Video Super-Resolution
Video super-resolution (VSR) methods are commonly classified into traditional and deep-learning-based approaches. Schultz and Stevenson [34] introduced a conventional method employing affine models for motion estimation, while 3D steering kernel regression was applied in [35]. Ma et al. [36] utilized the expectation-maximization technique to reconstruct high-resolution frames by estimating the blur kernel. Furthermore, the method in [37] concurrently estimates the blur kernel, motion, and noise level through a Bayesian approach to reconstruct high-resolution frames.
In recent times, deep learning-based strategies have emerged to address the image super-resolution (SR) challenge. Given that a video comprises a sequence of moving images over time, image SR methodologies can be adapted for VSR by incorporating necessary modifications. Notable deep learning-based image SR models include SRCNN [5], SRGAN [20], FSRCNN [18], ESPCN [6], and ZSSR [14]. VSRnet [21], a model derived from SRCNN, is proposed for video super-resolution.
Deep-learning-based VSR methodologies are commonly divided into two categories: those incorporating frame alignment and those operating without it. The former class leverages motion estimation and compensation techniques as initial processing steps to extract precise inter-frame motion details [38], [39], [40], [41] and facilitate frame alignment [10], [21], [42], [43], [44]. This approach, demonstrated in studies such as [45], [46], [47], and [48], proves particularly effective in scenarios involving significant motion dynamics. The optical-flow approach, which uses variations and correlations in the temporal domain to compute the motion between two nearby frames, is popular in most motion estimation techniques [49], [50]. Motion compensation can be applied using either traditional methods [44], [51] or deep-learning-based approaches [49], [50], [52]. Deformable convolution methods for video SR were proposed in [53], [54], [55], [56], [57], and [58]. BasicVSR [59] used a simple RNN architecture with propagation, alignment, and upsampling modules to make VSR suitable for real-time applications. Later, to better handle misalignment, enhanced propagation along with flow-guided deformable alignment was added to BasicVSR [59], yielding BasicVSR++ [60].
Deformable attention mechanisms have gained prominence in blind video super-resolution (VSR) due to their ability to handle complex spatial transformations and focus on relevant features. One notable work in this domain is deep blind super-resolution for satellite video [61]. The BSVSR algorithm proposed in [61] is an empirical approach to blind SVSR that emphasizes sharper cues by considering pixel-wise blur levels in a coarse-to-fine manner. It utilizes multi-scale deformable convolution to aggregate temporal redundancy across adjacent frames through window-slid progressive fusion, followed by deformable attention for meticulous integration of adjacent features into the mid-feature.
Another significant contribution is the bidirectional multi-scale deformable attention for video super-resolution [62]. This method uses a Deformable Alignment Module (DAM) comprising two sub-modules: a Multi-scale Deformable Convolution Module (MDCM), which improves the robustness of adjacent-frame alignment by exploiting offset information at different scales and aligning frames at the feature level, and a Multi-scale Attention Module (MAM), which extracts both local and global features from the aligned features.
Moreover, for lightweight VSR [63], the Deformable Spatial-Temporal Attention aggregates the spatial-temporal information obtained from the multiple reference frames into the current frame to improve the reconstruction effect.
These works demonstrate the effectiveness of deformable attention mechanisms in handling the challenges of alignment and fusion in blind VSR tasks. They provide a robust and effective way to enhance the resolution of video sequences while maintaining temporal consistency.
In methods without alignment, neighbouring frames are not explicitly aligned; instead, spatio-temporal or spatial information is used for feature extraction. These techniques can be further classified into four types: 2D convolution (2D Conv) methods [64], non-local network-based methods [58], [65], [66], 3D convolution (3D Conv) methods [67], and recurrent CNN (R-CNN) based methods [68]. The 2D convolution methods fall under the umbrella of spatial methods, whereas the remaining three belong to the spatio-temporal category and utilize both the temporal and spatial information in the input videos [4]. In 2D convolution methods [64], [69], a 2D convolutional network itself absorbs the correlation information existing within the frames; the three important stages, feature extraction, fusion, and SR, are performed spatially on the input frames instead of performing motion estimation and motion compensation [4]. 3D convolution methods [67], [70], [71] utilize spatial as well as temporal information to super-resolve the video, and recurrent CNNs [68], [72], [73], [74], which can accurately represent temporal dependencies in sequential data, have also been used for video SR. Direct 3D convolution was introduced by Li et al. [75] to generate video sequences, which eliminates the need for RNNs and allows efficient processing. To avoid explicit motion compensation, Jo et al. [70] proposed an architecture that learns dynamic upsampling filters for each pixel based on its local spatio-temporal neighbourhood. CycMu-Net [76] used cycle-projected mutual learning between the spatial and temporal super-resolution tasks to produce results that are optimal in both detail and consistency: spatial features refine temporal predictions, while temporal information helps extract finer spatial details. Addressing blind VSR, DynaVSR [77] utilizes a dynamic encoder-decoder architecture that adapts to different degradation types at runtime. Beyond CNN architectures, transformers have also been exploited for VSR; Liang et al. [78] proposed a Recurrent Video Restoration Transformer (RVRT) with guided deformable attention to handle complex temporal dependencies and object deformations. In recent years, video inbetweening has also been utilized to enhance the temporal resolution of video sequences by creating new frames between known keyframes. Initial methods for video inbetweening include optical-flow-based interpolation [52], [79] and pixel motion transformation [80], [81]. For long-term video interpolation, block-based motion estimation/compensation methods or LSTM models [82] were used.
C. Zero-Shot Super-Resolution With Meta-Learning
SOTA SISR methods fail, and their performance deteriorates, on real LR input (LR images with compression, sensor, and random noise) because they are trained on datasets whose LR counterparts are generated synthetically by bicubic down-sampling of the corresponding HR images. These SOTA SR methods are also very deep and computationally complex. To overcome these challenges, Shocher et al. [14] proposed zero-shot learning for SISR: the model is very light and needs no training on an external dataset. It performs image-specific training at test time and outperforms SOTA methods on real LR images such as old photos and degraded images. However, because it adapts slowly to data with statistically distant noise and kernels, Soh et al. presented meta-transfer learning combined with ZSSR [83], initializing the model weights so that learning from an external dataset is combined with internal learning in the zero-shot setting; here, meta-transfer learning enables fast convergence during zero-shot training at test time. Emad et al. presented dual-path zero-shot learning [84], which trains a CycleGAN-based architecture in the zero-shot setting to further improve performance on real-world LR images. Reference [85] employs a similar meta-transfer-learning approach.
Performing spatio-temporal SR, i.e., creating intermediate frames along with spatial SR, is even more challenging. Existing spatio-temporal VSR methods [9], [86], [87], [88] rely primarily on bicubic and tricubic downsampling to form LR inputs, require large datasets to generalize, fail to adapt to blur produced by different kernels, and are computationally complex. This limits their use in practical applications such as HD television [89], UAV surveillance [90], [91], and security [92], [93]. Due to the heterogeneous nature of noise in real LR video, these models do not produce motion-consistent HR videos effectively.
D. Motion Estimation and Compensation
In video super-resolution, simply applying image super-resolution methods to each frame independently may not produce satisfactory results, because there may be inter-frame motion that causes temporal distortion and blur. Inter-frame motion is the movement of objects or cameras between consecutive frames, which can create misalignment and inconsistency between the frames. Therefore, motion estimation and compensation are needed to handle the inter-frame motion and align the frames before applying super-resolution.
Several deep-learning-based networks have been used to solve the motion estimation problem. For stereo matching, Zbontar and LeCun [94] and Lou et al. [95] learned patch distance measures with a CNN, whereas Fischer et al. [49] and Mayer et al. [96] proposed end-to-end architectures to predict optical flow and stereo disparity. For motion compensation, earlier VSR methods [10], [35], [38], [97], [98] achieved inter-frame motion compensation either by estimating optical flow or by applying block matching. Deep-learning-based VSR methods typically use backward warping, aligning all other frames to the reference frame, to achieve inter-frame motion compensation.
The existing work on zero-shot super-resolution with meta-learning was limited to images. This motivated us to design an architecture for meta-learning-based zero-shot video space-time SR. Our model incorporates all benefits obtained by the zero-shot and meta-learning-based training. Along with this, our model solves denoising and super-resolution simultaneously with a dedicated architecture for joint optimization of denoising and super-resolution.
Methodology
A. Proposed Architecture and Optimisation
The block diagram of the proposed methodology is depicted in Figure 2. Motion estimation and compensation modules are employed to maintain temporal coherency in the output video sequence. After frame alignment, the enhancement module, which consists of two parts, the denoising module and the spatio-temporal super-resolution module, is employed to obtain spatio-temporal super-resolution. The details of each module are as follows:
Figure 2. Block diagram of the proposed framework. Here H and W are the height and width of an image frame, and n and m denote the numbers of frames.
1) Motion Estimation
Motion estimation (ME) is the process of estimating the motion vectors between successive frames in a given video sequence. Motion vectors are the displacement of the pixels from one frame to the other, which indicate the direction as well as the magnitude of motion. The ME module takes two LR frames as input and generates an LR motion field as output, and it can be defined as\begin{equation*} F_{i\rightarrow j} = Net_{ME}(I_{i}^{L}, I_{j}^{L};\theta _{ME}) \tag {1}\end{equation*}
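The paper does not detail the internal layers of Net_ME at this point; purely as an illustration of the interface in Eq. (1), a minimal PyTorch sketch might look as follows (layer widths and depth are assumptions, not the authors' design).

```python
import torch
import torch.nn as nn

class MotionEstimationNet(nn.Module):
    """Illustrative Net_ME: takes two LR frames and returns a 2-channel
    LR motion field (u, v), matching the interface of Eq. (1).
    Layer widths and depth are assumptions for this sketch."""
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * in_ch, feat, 3, padding=1), nn.PReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.PReLU(),
            nn.Conv2d(feat, 2, 3, padding=1))   # (u, v) displacement per pixel

    def forward(self, frame_i, frame_j):
        # F_{i->j}: motion field mapping frame i toward frame j
        return self.body(torch.cat([frame_i, frame_j], dim=1))

# usage: estimate the flow between two consecutive LR frames
net_me = MotionEstimationNet()
flow = net_me(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))  # (1, 2, 64, 64)
```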
2) Motion Compensation
Motion compensation is the process of aligning frames according to the motion vectors to reduce temporal distortion and blur, which helps improve the temporal consistency and detail of the video. We use the motion compensation layer proposed by Tao et al. [42], which leverages sub-pixel information from motion to achieve simultaneous sub-pixel motion compensation (SPMC). The SPMC layer performs sub-pixel shifting and interpolation to generate aligned frames: the sub-pixel shifting operation shifts each pixel in the neighboring frames according to its corresponding motion vector, producing a set of shifted frames, and the interpolation operation combines the shifted frames using a convolutional filter to produce an aligned frame. The output of the SPMC layer is a set of aligned frames that match the reference frame. It can be defined as\begin{equation*} J_{i}^{LR} = L_{SPMC}(I_{i}^{LR}, F; \alpha ) \tag {2}\end{equation*}
\begin{align*} \begin{pmatrix}x_{p}^{s} \\ y_{p}^{s}\end{pmatrix} = W_{F;\alpha }\begin{pmatrix} x_{p} \\ y_{p} \end{pmatrix} = \alpha \begin{pmatrix} x_{p} + u_{p} \\ y_{p} + v_{p} \end{pmatrix} \tag {3}\end{align*}
here, p indexes pixels in the LR image space and q indexes pixels on the target HR grid, so that the value at HR location q is accumulated from the shifted LR pixels as\begin{equation*} J_{q}^{H} = \sum _{p} J_{p}^{L}\, M(x_{p}^{s} - x_{q})\, M(y_{p}^{s} - y_{q}) \tag {4}\end{equation*}where M(\cdot ) is the interpolation (sampling) kernel used to combine the shifted pixels.
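To make Eqs. (2)-(4) concrete, the following is an illustrative re-implementation of SPMC-style forward warping, with a bilinear kernel standing in for M(·); the original layer is the one of Tao et al. [42], and the normalization step here is an added practical detail rather than part of Eq. (4).

```python
import torch

def spmc_layer(lr, flow, scale):
    """Sketch of sub-pixel motion compensation (Eqs. 2-4): LR pixels are shifted
    by their motion vectors, scaled by `scale` (alpha), and splatted onto the
    HR grid with a bilinear kernel standing in for M(.)."""
    b, c, h, w = lr.shape
    H, W = h * scale, w * scale
    yy, xx = torch.meshgrid(torch.arange(h, device=lr.device),
                            torch.arange(w, device=lr.device), indexing="ij")
    xs = scale * (xx.float() + flow[:, 0])       # Eq. (3): x_p^s = alpha * (x_p + u_p)
    ys = scale * (yy.float() + flow[:, 1])       #          y_p^s = alpha * (y_p + v_p)
    out = lr.new_zeros(b, c, H, W)
    wsum = lr.new_zeros(b, 1, H, W)
    x0, y0 = xs.floor().long(), ys.floor().long()
    for dx in (0, 1):                            # splat onto the 4 nearest HR pixels
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - (xs - xi.float()).abs()).clamp(min=0) * \
                  (1 - (ys - yi.float()).abs()).clamp(min=0)      # bilinear kernel M
            wgt = wgt * ((xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)).float()
            idx = (yi.clamp(0, H - 1) * W + xi.clamp(0, W - 1)).view(b, 1, -1)
            out.view(b, c, -1).scatter_add_(2, idx.expand(b, c, -1),
                                            (lr * wgt.unsqueeze(1)).view(b, c, -1))
            wsum.view(b, 1, -1).scatter_add_(2, idx, wgt.view(b, 1, -1))
    # Eq. (4): weighted accumulation (normalization added to handle overlapping splats)
    return out / wsum.clamp(min=1e-6)

# usage: place an LR frame onto the x4 HR grid according to its flow field
# hr_aligned = spmc_layer(lr_frame, flow, scale=4)
```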
3) The Enhancement Module
The detailed architecture of the enhancement module, which is used to improve the spatial and temporal resolution of real LR video, is shown in Figure 3. Since the input video has low resolution as well as noise and degradation, the main objective of the proposed enhancement module is to improve the resolution of the input real LR video while removing the noise embedded in it.
Proposed Architecture of 3D-CAE guided Deep Spatio-Temporal Back Projection Network for Video Super-Resolution.
The enhancement module consists of two sub-modules: one for denoising and the other for spatio-temporal super-resolution. The denoising module removes the unwanted noise and degradation in the real LR video. It is a 3D-deep convolutional auto-encoder (3D-CAE) consisting of three 3D-convolution layers followed by three 3D-transposed-convolution layers. The output of this 3D-CAE is a noise-free LR video with the same spatial and temporal resolution as the input.
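A minimal sketch of such a 3D-CAE is shown below; the text specifies only the layer counts (three 3D convolutions followed by three 3D transposed convolutions), so the channel widths, kernel sizes, and activation are assumptions. The decoder feature maps are returned so they can be concatenated for the SR branch, as described next.

```python
import torch
import torch.nn as nn

class CAE3D(nn.Module):
    """Illustrative 3D-CAE denoiser: three Conv3d layers followed by three
    ConvTranspose3d layers; input and output share the same spatio-temporal size.
    Channel widths and kernel sizes are assumptions."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv3d(in_ch, feat, 3, padding=1),
            nn.Conv3d(feat, feat, 3, padding=1),
            nn.Conv3d(feat, feat, 3, padding=1)])
        self.dec = nn.ModuleList([
            nn.ConvTranspose3d(feat, feat, 3, padding=1),
            nn.ConvTranspose3d(feat, feat, 3, padding=1),
            nn.ConvTranspose3d(feat, in_ch, 3, padding=1)])
        self.act = nn.PReLU()

    def forward(self, x):                       # x: (B, C, T, H, W) noisy LR clip
        for layer in self.enc:
            x = self.act(layer(x))
        dec_feats = []
        for layer in self.dec:
            x = self.act(layer(x))
            dec_feats.append(x)                 # kept for concatenation in the SR branch
        # the last element approximates the denoised LR clip
        return dec_feats

clip = torch.rand(1, 3, 5, 64, 64)              # 5-frame noisy LR clip
features = CAE3D()(clip)
```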
The features from the 3D-deconvolution layers are concatenated and fed to the spatio-temporal super-resolution module, which is the ADST-BPN. In the spatio-temporal super-resolution module, a 3D-attention layer is placed at the beginning, followed by one 3D-convolution layer. After that, a series of spatio-temporal up-projection and down-projection blocks are arranged in a cascade. These spatio-temporal blocks, shown in Figure 4 and discussed in detail in the next section, are the 3D version of the up- and down-projection blocks given in [104]. The features from all spatio-temporal up-projection blocks are concatenated at the end, and these concatenated features are passed through one more 3D-convolutional layer, whose output is the desired HR video. In the enhancement module, a 3D self-attention layer is also used to learn the relations among the prominent features that contribute most to feature enhancement.
4) Spatio-Temporal Up- and Back-Projection
As shown in Figure 4, the spatio-temporal up-projection unit takes the LR video features from the preceding computation and maps them to intermediate HR features\begin{equation*} H_{t} = (L_{t} * p_{t})\uparrow _{s} \tag {5}\end{equation*}
\begin{equation*} L_{t+1} = (H_{t} * g_{t})\downarrow _{s} \tag {6}\end{equation*}
\begin{equation*} r_{t} = L_{t+1} - L_{t} \tag {7}\end{equation*}
\begin{equation*} H_{t+1} = (r_{t} * q_{t+1})\uparrow _{s} \tag {8}\end{equation*}
\begin{equation*} HR_{map} = H_{t+1} + H_{t} \tag {9}\end{equation*}
\begin{equation*} L_{t} = (H_{t} * M_{t})\downarrow _{s} \tag {10}\end{equation*}
\begin{equation*} H_{t+1} = (L_{t} * N_{t}) \uparrow _{s} \tag {11}\end{equation*}
\begin{equation*} r_{n} = H_{t+1} - H_{t} \tag {12}\end{equation*}
\begin{equation*} L_{t+1} = (r_{n} * R_{t+1})\downarrow _{s} \tag {13}\end{equation*}
\begin{equation*} LR_{map} = L_{t} + L_{t+1} \tag {14}\end{equation*}
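A compact sketch of the two projection units, following Eqs. (5)-(14), is given below; strided 3D (transposed) convolutions stand in for the convolve-then-rescale operators ↑s and ↓s, and the kernel shapes and feature widths are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class UpProjection3D(nn.Module):
    """Spatio-temporal up-projection unit (Eqs. 5-9). The (conv, upscale-s) and
    (conv, downscale-s) pairs are realized with strided 3D (transposed)
    convolutions; exact kernel shapes are assumptions."""
    def __init__(self, feat=64, s=(2, 2, 2)):   # (temporal, spatial, spatial) scale
        super().__init__()
        k = tuple(2 * x for x in s)             # kernel = 2*stride, padding = stride/2
        p = tuple(x // 2 for x in s)
        self.up1 = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)   # p_t
        self.down = nn.Conv3d(feat, feat, k, stride=s, padding=p)           # g_t
        self.up2 = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)   # q_{t+1}
        self.act = nn.PReLU()

    def forward(self, L_t):
        H_t = self.act(self.up1(L_t))           # Eq. (5):  H_t = (L_t * p_t) up_s
        L_back = self.act(self.down(H_t))       # Eq. (6):  L_{t+1} = (H_t * g_t) down_s
        r_t = L_back - L_t                      # Eq. (7):  LR residual
        H_res = self.act(self.up2(r_t))         # Eq. (8):  back-project the residual
        return H_res + H_t                      # Eq. (9):  HR_map

class DownProjection3D(nn.Module):
    """Spatio-temporal down-projection unit (Eqs. 10-14), mirroring the above."""
    def __init__(self, feat=64, s=(2, 2, 2)):
        super().__init__()
        k = tuple(2 * x for x in s)
        p = tuple(x // 2 for x in s)
        self.down1 = nn.Conv3d(feat, feat, k, stride=s, padding=p)          # M_t
        self.up = nn.ConvTranspose3d(feat, feat, k, stride=s, padding=p)    # N_t
        self.down2 = nn.Conv3d(feat, feat, k, stride=s, padding=p)          # R_{t+1}
        self.act = nn.PReLU()

    def forward(self, H_t):
        L_t = self.act(self.down1(H_t))         # Eq. (10)
        H_back = self.act(self.up(L_t))         # Eq. (11)
        r_n = H_back - H_t                      # Eq. (12): HR residual
        L_res = self.act(self.down2(r_n))       # Eq. (13)
        return L_res + L_t                      # Eq. (14): LR_map
```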
B. Learning on External Dataset
Assume a high-quality external dataset of LR-HR video pairs, with $V_{LRi}$ and $V_{HRi}$ denoting the i-th LR and HR videos and $V_{NLRi}$ the corresponding degraded (noisy) LR video. The loss over this dataset is\begin{align*} \mathscr {L}^{D}(\theta )& =w_{1}\frac {1}{n}\sum _{i=1}^{n}\frac {1}{2}\parallel V_{HRi} - f_{\theta }(V_{NLRi}) \parallel _{2} \\ & \quad +w_{2}\frac {1}{n}\sum _{i=1}^{n}\frac {1}{2}\parallel V_{LRi} - f_{\theta CAE}(V_{NLRi}) \parallel _{2} \tag {15}\end{align*}
here, n is the total number of LR-HR video pairs in the external dataset, $f_{\theta }$ is the full network producing the HR output, $f_{\theta CAE}$ is its 3D-CAE denoising branch, and $w_{1}$ and $w_{2}$ weight the SR and denoising terms. The model parameters are learned by minimizing this weighted linear combination of the SR loss and the denoising loss.
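In code, the objective of Eq. (15) can be sketched as follows; `model` and `model.cae` are hypothetical handles for the full network $f_{\theta }$ and its denoising branch $f_{\theta CAE}$, and squared-error terms stand in for the norm terms.

```python
import torch.nn.functional as F

def joint_loss(model, v_nlr, v_lr, v_hr, w1=1.0, w2=1.0):
    """Weighted combination of the SR loss and the denoising loss (Eq. 15).
    `model` is the full SR network and `model.cae` its 3D-CAE branch;
    both names are placeholders for this sketch."""
    sr_out = model(v_nlr)                 # HR prediction from the noisy LR clip
    denoised = model.cae(v_nlr)           # denoised LR prediction from the 3D-CAE
    sr_loss = F.mse_loss(sr_out, v_hr)    # stands in for || V_HR - f_theta(V_NLR) ||
    dn_loss = F.mse_loss(denoised, v_lr)  # stands in for || V_LR - f_thetaCAE(V_NLR) ||
    return w1 * sr_loss + w2 * dn_loss
```

Back-propagating this joint loss reproduces the update rule described in the Introduction: the SR branch receives only the SR-loss gradient, while the denoising branch receives gradients from both terms.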
C. Meta Transfer Learning for Video SR
After learning the model parameters on the external dataset, meta-transfer learning is performed so that the model can adapt quickly to unseen blur kernels. A meta dataset is constructed by degrading videos with different blur kernels, and each kernel defines a separate task $T_{i}$.
The meta dataset is further divided into task-level training and testing sets, which are used to compute the task losses $L_{T_{i}}^{training}$ and $L_{T_{i}}^{test}$, respectively.
For each new task $T_{i}$, the model parameters are adapted by gradient descent on the task-level training loss,\begin{equation*} \theta _{i} = \theta - \alpha \nabla _{\theta }L_{T_{i}}^{training}\left ({{ \theta }}\right ) \tag {16}\end{equation*}
here, $\alpha $ is the task-level (inner) learning rate.
The model parameters are further optimized by minimizing the loss,\begin{equation*} \theta _{M} = \arg \min _{\theta } \sum _{T_{i}} L_{T_{i}}^{test} \left ({{ \theta - \alpha \nabla _{\theta }L_{T_{i}}^{training}\left ({{ \theta }}\right ) }}\right ) \tag {17}\end{equation*}
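Equations (16)-(17) correspond to a MAML-style update. A compact sketch using the first-order approximation (for readability) is shown below; `tasks` and `task_loss` are placeholders for the kernel-specific task batches and the loss of Eq. (15).

```python
import copy
import torch

def maml_meta_update(model, tasks, task_loss, alpha=0.01, beta=0.001):
    """One meta-update over a batch of kernel-specific tasks (Eqs. 16-17),
    using the first-order MAML approximation for readability. Each task yields
    a (train_batch, test_batch) pair built with its own blur kernel."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for train_batch, test_batch in tasks:
        adapted = copy.deepcopy(model)
        # inner step, Eq. (16): theta_i = theta - alpha * grad_theta L_train(theta)
        loss_tr = task_loss(adapted, *train_batch)
        grads = torch.autograd.grad(loss_tr, adapted.parameters())
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= alpha * g
        # outer objective, Eq. (17): L_test evaluated at the adapted parameters
        loss_te = task_loss(adapted, *test_batch)
        grads_te = torch.autograd.grad(loss_te, adapted.parameters())
        for acc, g in zip(meta_grads, grads_te):
            acc += g
    # apply the accumulated meta-gradient to the shared initialization theta
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= beta * g / max(len(tasks), 1)
```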
D. Adaptation to Test Video: Meta-Test
The meta-learned parameters $\theta _{M}$ are then used to initialize the model for the meta-test phase. This step is a video-specific training performed at test time: given a test LR video sequence, the model is updated in a zero-shot setting, learning from the internal information of the test video itself before producing the final space-time super-resolved output.
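A sketch of this meta-test adaptation under the stated setting is given below; `degrade` is a placeholder for the operator that builds pseudo-LR clips from the test video itself (it must downscale by the model's space-time factors), and the number of update steps is illustrative.

```python
import torch
import torch.nn.functional as F

def zero_shot_adapt(model, test_clip, degrade, steps=10, lr=1e-3):
    """Video-specific adaptation at test time, starting from the meta-learned
    weights. `degrade` creates pseudo-LR clips from the test clip itself; after
    a few updates, the adapted model super-resolves the original test clip."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        pseudo_lr = degrade(test_clip)          # child clip: degraded test video
        sr = model(pseudo_lr)                   # should reconstruct the test clip
        loss = F.mse_loss(sr, test_clip)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(test_clip)                 # final video-specific SR output
```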
Experimental Results
In this section, we present the implementation details of our proposed method, the results of the proposed network, and its comparison with existing SOTA methods for video SR. A comparative analysis in terms of evaluation metrics, as well as a visual comparison, is also presented, followed by an ablation study and a discussion of computational complexity.
A. Implementation Details
Dataset: We used two popular video super-resolution datasets for training and evaluation of our proposed model: Vid4 [43] and REDS [105]. The Vid4 dataset is a commonly used benchmark for video super-resolution algorithms. It consists of four video sequences: “walk,” “city,” “foliage,” and “calendar.” The video lengths range from 26 to 47 frames, and the resolution of the videos ranges from
These datasets provide a diverse range of video sequences essential for training robust and generalizable video super-resolution models. The high-quality ground truth frames in these datasets allow for precise evaluation of the super-resolution performance.
Degraded Video Sequence Generation: To add realistic degradation in images for video super-resolution, we used the degradation model that simulates real-world degradation. Several models [24], [25], [107], [108] have been proposed that can be used for this purpose. Using these models, we generated degraded LR images that resemble real-world degradation and used them to train the proposed VSR model.
The proposed model was initially trained on a large-scale dataset for 200 epochs, using the Adam optimizer with a learning rate of 0.0001. Meta-learning configuration: our meta-learning setup is based on the Model-Agnostic Meta-Learning (MAML) framework, with an inner learning rate of 0.01 and an outer learning rate of 0.001; meta-training was performed for 50 epochs. For motion estimation, a pretrained Motion Compensation Transformer (MCT) model trained on a large-scale video dataset is used. The MCT module was fine-tuned during training of the entire model to adapt to the specific characteristics of degraded and noisy real low-resolution videos.
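For reference, the hyper-parameters stated above can be collected in a single configuration; the values come directly from the text, and any setting not listed is left unspecified.

```python
# Hyper-parameters reported in the text; all other settings are unspecified here.
config = {
    "pretrain": {"epochs": 200, "optimizer": "Adam", "lr": 1e-4},
    "meta_learning": {                      # MAML-style meta-transfer learning
        "inner_lr": 0.01,                   # task-level (inner) learning rate
        "outer_lr": 0.001,                  # meta (outer) learning rate
        "epochs": 50,
    },
    "motion_compensation": {"pretrained_mct": True, "finetune": True},
}
```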
B. Result Analysis
The research work is evaluated in two ways: quantitative evaluation and qualitative evaluation. In the quantitative evaluation, we compare the achieved results with other SOTA methods for video super-resolution using the PSNR and SSIM metrics. Gaussian noise of different densities is added to the test images for evaluation, and in one case random noise is added. As is evident from Tables 1 and 2, the PSNR/SSIM values are close to those of MuCAN [66] when no noise is present, but they increase relative to the other methods as noise is included, because the proposed model is specifically trained on real-world noisy video frames. The proposed method surpasses the other methods for all noise densities and on both datasets.
In the qualitative evaluation, the visual results of the proposed method are presented and compared with those of other SOTA methods. Figure 5 compares the proposed CASTNet with SOTA methods, whereas Figures 6 and 7 compare spatial super-resolution on the REDS dataset. In all these figures, it can be observed that the proposed method reconstructs much finer details and texture. In the temporal domain, mid-frames between two successive frames of the input video sequence are reconstructed, and the proposed model produces these mid-frames with less flicker, owing to the better reconstruction of sharpness and smooth edges. In Figure 8, visual results of temporal super-resolution are shown for two video sequences; for each, the first row shows the original video frames and the second row the reconstructed frames, which exhibit a smoother transition from one frame to the next.
Visual Results of Temporal SR on Vid4 Dataset. (a) and (c) Original Frames (b) and (d) Predicted Frames. Even numbered frames are reconstructed frames between two successive odd-numbered frames.
We conducted an additional experiment in which we quantitatively measured flicker by calculating the temporal intensity variation between consecutive frames, providing an objective measure. This quantitative analysis, reported in Table 7, indicates that the mid-frames generated by the proposed method are very close to the ground truth.
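One plausible formulation of this flicker measure (the exact definition used for Table 7 is not reproduced here) is the mean absolute intensity change between consecutive frames:

```python
import torch

def temporal_flicker(frames):
    """Mean absolute intensity change between consecutive frames.
    `frames`: tensor of shape (T, C, H, W); lower values indicate less flicker.
    This is one plausible formulation, not necessarily the one used in Table 7."""
    diffs = (frames[1:] - frames[:-1]).abs()    # frame-to-frame differences
    return diffs.mean().item()

# usage: compare flicker of predicted vs. ground-truth sequences
# flicker_pred = temporal_flicker(pred_frames)
# flicker_gt   = temporal_flicker(gt_frames)
```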
In Figure 9, the temporal super-resolution performance of the proposed model is compared with a traditional interpolation approach. In any video sequence, when the temporal frequency of a moving object is greater than the camera frame rate, aliasing occurs; this is known as motion aliasing or temporal aliasing. Under this aliasing, the object appears to move along a false trajectory, or its motion may be distorted. The problem can be alleviated by reconstructing mid-frames between successive frames. In Figure 9, the real movement direction is clockwise, but on the left the fan appears to rotate in the other direction; on the right, this aliasing effect is reduced by using the reconstructed mid-frames.
Comparison of Temporal Super-Resolution (3x): Bicubic Interpolation Method (Left) vs. Our Method (Right).
C. Computational Complexity
This study presents an extensive investigation into the computational complexity of our proposed model, focusing primarily on the number of floating-point operations (FLOPs). Our model achieves state-of-the-art results while maintaining a balance between computational efficiency and performance. It consists of six main components: 3D Convolutional Neural Network (3D-CNN) blocks, PReLU, addition, subtraction, concatenation, and attention. Each component contributes to the total computational load of the model, so each part is analyzed individually. Table 4 summarizes the FLOPs breakdown for all major components of our model.
Additionally, we compare our proposed model with existing models in terms of model size, runtime, and memory in Table 6. Our model stands out with only 1.69 million parameters, a fast runtime of 56 ms, and minimal memory consumption of 6.48 MB, demonstrating superior computational efficiency without compromising performance.
D. Ablation Study
In Table 5, it can be seen that meta-learning yields a 1.5 dB improvement in PSNR, as it harnesses the benefits of pre-trained weights and helps the model acquire kernel-agnostic properties. Without the denoising module, the PSNR decreases significantly: the denoising module improves the quality of the LR input fed to the enhancement (ADST-BPN) module, which in turn significantly improves the quality of the SR video sequence. The attention module helps localize the optical flow in the video, which substantially aids in recovering and improving temporal and motion coherence.
Limitations and Future Work
A. Limitations
The proposed framework may be less effective in scenarios with extremely low-resolution videos or highly noisy environments; with complex motion patterns or large motion displacements; with limited training data, or when the training data does not accurately represent the target domain; and with limited computational resources, or when real-time performance is required.
B. Future Points
Future work includes extending the proposed framework to handle more complex scenarios, such as extremely low-resolution videos, multi-modal data, and real-time applications; investigating its potential for other computer vision tasks, such as video denoising, video deblurring, and video inpainting; and exploring its application in areas such as medical imaging, remote sensing, and security surveillance. To further validate the effectiveness of the proposed framework, its performance should be assessed on larger and more varied datasets, and alternative evaluation metrics such as the Perceptual Index (PI) could offer deeper insights.
Conclusion
This paper presents a zero-shot learning and meta-learning-based video space-time super-resolution (SR) algorithm. A novel noise-robust video space-time SR architecture, the 3D-deep Convolutional Auto-Encoder guided attention-based deep spatio-temporal back-projection network (CASTNet), is introduced. The proposed method can effectively handle real degradation and noise while super-resolving low-resolution (LR) video by jointly optimizing separate SR and denoising losses. The proposed solution converges faster and is robust against realistic degradation. Several comparative studies are presented in the experiment section to validate the efficacy of the proposed framework, and a detailed ablation study highlights the significance of each component of the methodology. We also discussed the limitations and future directions of this work: one possible limitation is that the framework may be less effective for extremely low-resolution videos, highly noisy videos, complex motion patterns, or large motion displacements. Looking forward, we plan to address these limitations, extend our framework to other computer vision tasks such as video denoising and video inpainting, explore other domains such as remote sensing and medical imaging, and test our model on a wider variety of videos to ensure its robustness and generalizability. We believe this work provides a solid foundation for future research in video space-time SR and opens new avenues for exploration.
ACKNOWLEDGMENT
The authors extend their appreciation to Intelligent Prognostic Private Limited Delhi, India; Universiti Sultan Zainal Abidin (UniSZA) Malaysia and Graphic Era (Deemed to be University), Dehradun, India; Bennett University, India; CSIR-CEERI, Pilani, India; and the Department of Physics, Badghis University, Afghanistan, for providing technical and non-technical support in this research work.