
Neural Video Depth Stabilizer


Abstract:

Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.
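The abstract's central idea, decoupling per-frame depth estimation from temporal stabilization in a plug-and-play fashion, can be illustrated with a toy sketch. Below, a simple exponential moving average stands in for the learned NVDS refinement network; the function names, the noise model, and the smoothing scheme are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def single_image_depth(frame):
    """Stand-in for an off-the-shelf single-image depth model:
    the per-frame estimate is the true depth plus independent
    flicker noise (temporal inconsistency)."""
    return frame + rng.normal(0.0, 0.1, size=frame.shape)

def stabilize(depth_seq, alpha=0.8):
    """Toy temporal stabilizer: an exponential moving average used
    here as a stand-in for NVDS's learned refinement network."""
    out = [depth_seq[0]]
    for d in depth_seq[1:]:
        out.append(alpha * out[-1] + (1 - alpha) * d)
    return out

def temporal_jitter(seq):
    """Mean absolute frame-to-frame change: a crude consistency proxy."""
    return float(np.mean([np.abs(a - b).mean() for a, b in zip(seq, seq[1:])]))

# A static scene: the true depth is identical in every frame, so any
# frame-to-frame change in the estimates is flicker from the estimator.
true_depth = np.linspace(1.0, 5.0, 64).reshape(8, 8)
raw = [single_image_depth(true_depth) for _ in range(30)]
smooth = stabilize(raw)
print(temporal_jitter(smooth) < temporal_jitter(raw))
```

Because the stabilizer only consumes depth maps, it is agnostic to which single-image model produced them, which is the sense in which such a design is plug-and-play.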
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France

1. Introduction

Monocular video depth estimation is a prerequisite for various video applications, e.g., bokeh rendering [24], [25], [47], 2D-to-3D video conversion [12], and novel view synthesis [16], [17]. An ideal video depth model should produce results with both spatial accuracy and temporal consistency. Although spatial accuracy has been greatly improved by recent advances in single-image depth models [15], [27], [28], [39], [45] and datasets [18], [41], [42], how to obtain temporal consistency, i.e., how to remove flickers from the predicted depth sequences, remains an open question. The prevailing video depth approaches [13], [23], [48] rely on test-time training (TTT): during inference, a single-image depth model is finetuned on the test video with geometry constraints and pose estimation. These TTT-based methods suffer from two main issues: limited robustness and heavy computational overhead. Because they rely heavily on camera poses, they degrade when pose estimation [13], [29] is inaccurate: CVD [23] produces erroneous predictions and Robust-CVD [13] yields obvious artifacts on many videos. Moreover, test-time training is extremely time-consuming; CVD [23] takes 40 minutes to process 244 frames on four NVIDIA Tesla M40 GPUs.
