I. Introduction
Super-resolution (SR) is a classical problem of transforming low-resolution (LR) images into their high-resolution (HR) counterparts, and it has attracted increasing attention in the computer vision community over the past years. Although SR is an active topic in image processing, video super-resolution (VSR) has received less attention than single image super-resolution (SISR), despite being more challenging. The goal of VSR is to reconstruct HR video frames from consecutive LR frames. Thus, besides the spatial relations, the additional temporal correlations among multiple LR inputs need to be exploited to obtain high-quality reconstruction results. With the flourishing of 2D convolutional neural networks (CNNs), which show a remarkable ability to model spatial relations within a single image, CNN-based methods [1]–[3] for SISR have achieved promising performance. However, numerous works [4], [5] have shown that directly applying an SISR network to the VSR problem usually yields suboptimal results. Thus, a vital issue in VSR is to effectively exploit temporal redundancies.
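To make this problem statement concrete, a common way to formulate VSR (the notation below is illustrative and not drawn from any specific prior work) is to estimate each HR frame from a temporal window of neighboring LR frames:
\[
\hat{I}^{HR}_{t} = f_{\theta}\!\left(I^{LR}_{t-N}, \dots, I^{LR}_{t}, \dots, I^{LR}_{t+N}\right),
\]
where $I^{LR}_{t}$ denotes the LR frame at time $t$, $2N+1$ is the size of the temporal window, and $f_{\theta}$ is the reconstruction network with parameters $\theta$. SISR corresponds to the special case $N=0$, which discards the temporal correlations discussed above.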