I. Introduction
The purpose of space-time video super-resolution (ST-VSR) is to convert video with a low frame rate (LFR) and low resolution (LR) into video with higher temporal and spatial resolution. With the continuous advancement of high-definition playback equipment, high-resolution slow-motion video sequences are becoming increasingly popular among the public. To increase the spatio-temporal resolution of a given video, an earlier traditional method [1] treats this task as an optimization problem; it relies on strong assumptions or prior knowledge and is consequently hard to apply to complex and diverse scenarios.

In practice, space-time video super-resolution can be divided into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). For VFI, some flow-based methods [2]–[4] employ optical flow as motion guidance to approximate the intermediate frame, while other kernel-based methods [5], [6] leverage adaptive convolution for interpolation. For VSR, most restoration algorithms can be divided into two categories: temporal sliding-window-based methods [7]–[9] and recurrent methods [10], [11].

The research progress in VFI and VSR has also promoted the development of one-stage ST-VSR, which is more efficient than two-stage approaches that cascade the two sub-tasks. Specifically, STARnet [12] first computes the optical flow between two adjacent frames and then applies feature warping to synthesize the intermediate frame. Xiang et al. [13] propose a deformable alignment structure and adopt a bidirectional ConvLSTM network to leverage preceding and succeeding information from the whole input sequence. Building on [13], Xu et al. further propose TMNet [14], which can perform controllable frame interpolation at any intermediate moment. Despite the remarkable progress of the aforementioned methods, they still suffer from pixel misplacement when handling extreme motions.
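To make the flow-based warping step mentioned above concrete, the following is a minimal sketch of bilinear backward warping, the core operation that flow-based VFI and feature-warping methods such as [2]–[4] and STARnet [12] build on. This is an illustrative single-channel NumPy implementation, not the code of any cited method; the function name and the border-clamping choice are assumptions.

```python
import numpy as np

def backward_warp(frame, flow):
    """Bilinearly sample `frame` at locations given by an optical flow field.

    frame: (H, W) single-channel image.
    flow:  (H, W, 2) displacement field; flow[y, x] = (dx, dy) points from a
           target pixel (x, y) back to its source location (x+dx, y+dy).
    Out-of-range source coordinates are clamped to the border (an assumed
    boundary policy; real systems may use zero-padding or masking instead).
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)

    # Integer corners surrounding each fractional source coordinate.
    x0 = np.floor(src_x).astype(int)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y0 = np.floor(src_y).astype(int)
    y1 = np.clip(y0 + 1, 0, H - 1)

    # Bilinear interpolation weights.
    wx = src_x - x0
    wy = src_y - y0
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

To approximate an intermediate frame at time t between two inputs, such methods typically warp both neighboring frames with flows scaled by t and 1-t and blend the results, optionally weighting by occlusion masks.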