1. Introduction
Monocular video depth estimation is a prerequisite for various video applications, e.g., bokeh rendering [24], [25], [47], 2D-to-3D video conversion [12], and novel view synthesis [16], [17]. An ideal video depth model should produce results that are both spatially accurate and temporally consistent. Although spatial accuracy has been greatly improved by recent advances in single-image depth models [15], [27], [28], [39], [45] and datasets [18], [41], [42], how to achieve temporal consistency, i.e., how to remove flicker from the predicted depth sequences, remains an open question.

The prevailing video depth approaches [13], [23], [48] rely on test-time training (TTT): during inference, a single-image depth model is fine-tuned on the test video using geometric constraints and estimated camera poses. These TTT-based methods have two main issues: limited robustness and heavy computational overhead. Because they depend heavily on camera poses, they degrade when pose estimation [13], [29] is inaccurate; CVD [23] then produces erroneous predictions and Robust-CVD [13] exhibits obvious artifacts on many videos. Moreover, test-time training is extremely time-consuming: CVD [23] takes 40 minutes to process 244 frames on four NVIDIA Tesla M40 GPUs.
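To make the TTT paradigm concrete, the following is a minimal PyTorch sketch, not the actual CVD or Robust-CVD code: a pretrained single-image depth model is fine-tuned on one test video so that depth predictions of neighboring frames agree after warping with optical flow. The names `depth_model`, `frames`, `flows`, and `warp` are illustrative placeholders, and the pose-based geometric constraints used by the cited methods are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp(tensor, flow):
    """Backward-warp `tensor` (B,C,H,W) with optical flow `flow` (B,2,H,W)."""
    b, _, h, w = tensor.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=tensor.device, dtype=tensor.dtype),
        torch.arange(w, device=tensor.device, dtype=tensor.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # pixel coords
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(tensor, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

def test_time_finetune(depth_model, frames, flows, steps=100, lr=1e-5):
    """Fine-tune `depth_model` on a single video (hypothetical TTT loop).

    `frames` is a list of (B,3,H,W) images; `flows[t]` is the optical flow
    from frame t to frame t+1, e.g., from an off-the-shelf flow network.
    """
    opt = torch.optim.Adam(depth_model.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for t in range(len(frames) - 1):
            d_t = depth_model(frames[t])       # (B,1,H,W) depth at frame t
            d_t1 = depth_model(frames[t + 1])  # depth at frame t+1
            # Temporal-consistency term: warped neighboring depth should
            # match the current depth; this is what suppresses flicker.
            loss = loss + F.l1_loss(warp(d_t1, flows[t]), d_t)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Even this simplified loop illustrates the cost structure: every test video requires many optimization steps over the full depth network, which is why TTT-based methods are slow at inference time.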