I. Introduction
As a fundamental video processing task, video reconstruction (VR) aims to generate the desired in-between frames given a pair of consecutive image frames [1], [2]. VR involves understanding pixel motion, image appearance, and even 3D structure, which benefits many practical applications, such as slow-motion animation [3], [4], video compression [5], [6], novel view synthesis [7], [8], and other real-world systems [9], [10], [11], [12]. In recent years, a plethora of VR techniques have been actively studied around the common global shutter (GS) and rolling shutter (RS) cameras, e.g., GS video frame interpolation [3] and RS temporal super-resolution [13], with increasingly impressive results powered by the rapid progress of deep neural networks.