1. Introduction
Monocular video depth estimation is a prerequisite for various video applications, e.g., bokeh rendering [24], [25], [47], 2D-to-3D video conversion [12], and novel view synthesis [16], [17]. An ideal video depth model should produce results that are both spatially accurate and temporally consistent. Although spatial accuracy has been greatly improved by recent advances in single-image depth models [15], [27], [28], [39], [45] and datasets [18], [41], [42], how to achieve temporal consistency, i.e., how to remove flicker from the predicted depth sequences, remains an open question.

The prevailing video depth approaches [13], [23], [48] rely on test-time training (TTT): during inference, a single-image depth model is fine-tuned on the test video using geometric constraints and estimated camera poses. These TTT-based methods have two main issues: limited robustness and heavy computational overhead. Because they depend heavily on camera poses, they degrade when pose estimation [13], [29] is inaccurate; CVD [23] then produces erroneous predictions and Robust-CVD [13] exhibits obvious artifacts on many videos. Moreover, test-time training is extremely time-consuming: CVD [23] takes 40 minutes to process 244 frames on four NVIDIA Tesla M40 GPUs.
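To make the TTT paradigm concrete, the following is a minimal PyTorch sketch, not the actual CVD or Robust-CVD code: a pretrained single-image depth model is fine-tuned on one test video so that depth predictions of neighboring frames agree after warping with optical flow. The names `depth_model`, `frames`, `flows`, and `warp` are illustrative placeholders, and the pose-based geometric constraints used by the cited methods are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp(tensor, flow):
    """Backward-warp `tensor` (B,C,H,W) with optical flow `flow` (B,2,H,W)."""
    b, _, h, w = tensor.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=tensor.device, dtype=tensor.dtype),
        torch.arange(w, device=tensor.device, dtype=tensor.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # pixel coords
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(tensor, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

def test_time_finetune(depth_model, frames, flows, steps=100, lr=1e-5):
    """Fine-tune `depth_model` on a single video (hypothetical TTT loop).

    `frames` is a list of (B,3,H,W) images; `flows[t]` is the optical flow
    from frame t to frame t+1, e.g., from an off-the-shelf flow network.
    """
    opt = torch.optim.Adam(depth_model.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for t in range(len(frames) - 1):
            d_t = depth_model(frames[t])       # (B,1,H,W) depth at frame t
            d_t1 = depth_model(frames[t + 1])  # depth at frame t+1
            # Temporal-consistency term: warped neighboring depth should
            # match the current depth; this is what suppresses flicker.
            loss = loss + F.l1_loss(warp(d_t1, flows[t]), d_t)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Even this simplified loop illustrates the cost structure: every test video requires many optimization steps over the full depth network, which is why TTT-based methods are slow at inference time.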