
Learning Trajectory-Aware Transformer for Video Super-Resolution


Abstract:

Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, great challenges remain in effectively utilizing temporal dependencies across entire video sequences. Existing approaches usually align and aggregate information from a limited number of adjacent frames (e.g., 5 or 7 frames), which prevents them from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories which consist of continuous visual tokens. For a query token, self-attention is learned only on relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, such a design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome the scale-changing problems that often occur in long-range videos. Experimental results demonstrate the superiority of the proposed TTVSR over state-of-the-art models through extensive quantitative and qualitative evaluations on four widely-used video super-resolution benchmarks. Both code and pre-trained models can be downloaded at https://github.com/researchmm/TTVSR.
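To make the core idea concrete, the sketch below illustrates trajectory-aware attention as described in the abstract: each query token attends only to the visual tokens lying on its pre-computed spatio-temporal trajectory, instead of to all tokens in all frames. This is a minimal illustration, not the authors' implementation; the tensor shapes, the `trajectories` index format, and all function and variable names are assumptions made for the example.

```python
# Minimal sketch (assumed shapes and names, not the official TTVSR code):
# attention restricted to tokens along a pre-aligned trajectory.
import torch
import torch.nn.functional as F

def trajectory_aware_attention(query, tokens, trajectories):
    """
    query:        (N, C)     one token per spatial location in the target frame
    tokens:       (T, N, C)  visual tokens for all T frames
    trajectories: (N, T)     for each query location, the token index it maps
                             to in every frame (its pre-aligned trajectory)
    returns:      (N, C)     aggregated features for the target frame
    """
    N, C = query.shape
    T = tokens.shape[0]

    # Gather only the tokens along each query's trajectory: (N, T, C).
    # Attention cost drops from O(N * T*N) over all tokens to O(N * T).
    idx = trajectories.t().unsqueeze(-1).expand(T, N, C)   # (T, N, C)
    traj_tokens = torch.gather(tokens, 1, idx)             # (T, N, C)
    traj_tokens = traj_tokens.permute(1, 0, 2)             # (N, T, C)

    # Standard scaled dot-product attention, restricted to the trajectory.
    scores = torch.einsum('nc,ntc->nt', query, traj_tokens) / C ** 0.5
    weights = F.softmax(scores, dim=-1)                    # (N, T)
    return torch.einsum('nt,ntc->nc', weights, traj_tokens)

# Toy usage: 7 frames of 16x16 = 256 token locations, 64 channels.
T, N, C = 7, 256, 64
out = trajectory_aware_attention(
    torch.randn(N, C), torch.randn(T, N, C),
    torch.randint(0, N, (N, T)))
print(out.shape)  # torch.Size([256, 64])
```

The key design point the sketch captures is the gather step: because the trajectories are computed in advance, each query only ever scores T candidate tokens (one per frame), which is what lets the Transformer attend across long-range video without the quadratic cost of full spatio-temporal attention.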
Date of Conference: 18-24 June 2022
Date Added to IEEE Xplore: 27 September 2022
Conference Location: New Orleans, LA, USA


1. Introduction

Video super-resolution (VSR) aims to recover a high-resolution (HR) video from its low-resolution (LR) counterpart [39]. As a fundamental task in computer vision, VSR is usually adopted to enhance visual quality, which is of great value in many practical applications, such as video surveillance [48], high-definition television [10], and satellite imagery [6], [27]. From a methodology perspective, unlike image super-resolution, which usually learns on spatial dimensions, VSR pays more attention to exploiting temporal information. As shown in Fig. 1, if detailed textures for recovering the target frame can be discovered and leveraged from relatively distant frames, video quality can be greatly enhanced.

Figure 1. A comparison between TTVSR and two state-of-the-art methods, MuCAN [24] and IconVSR [4]. TTVSR introduces finer textures for recovering the target frame from the boxed areas (indicated by yellow), tracked by the trajectory (indicated by green).

