I. Introduction
Temporally consistent video depth is essential to many vision tasks, such as video editing effects [1], [2], [3], bokeh rendering [4], [5], and 2D-to-3D conversion [6], [7]. Although recovering depth from monocular video has been studied for decades, estimating depth that is both spatially accurate and temporally consistent remains challenging for unconstrained videos. Such a video can be any monocular video: captured indoors or outdoors, with a stationary or moving camera, and with or without moving objects. Previous methods work under specific conditions but generalize poorly to such casual videos.