1 Introduction
3D scene reconstruction is a fundamental task in computer vision. Established approaches mainly rely on multi-view geometry [1], which reconstructs 3D scenes from feature-point correspondences across consecutive frames or multiple views. In contrast, we aim to recover dense 3D scene shape, up to scale, from a single in-the-wild image; given a few sparse guided depth points, our method can further recover metric shape. From a single image, some methods [2] reconstruct both visible and occluded surfaces and represent them as a volumetric model or 3D mesh, whereas our method recovers only the visible surfaces and represents them as a point cloud.

Under this setting, with no additional views available, we rely on monocular depth estimation. However, as shown in Fig. 1, existing monocular depth estimation methods [3], [4], [5] alone cannot faithfully recover an accurate 3D point cloud, and even with sparse guided points it remains challenging to generalize to diverse scenes. The key challenges are: 1) it is difficult to collect the large-scale metric depth datasets with diverse scenes that are needed to train a well-generalizing monocular depth estimation model; 2) alternatively, one can train on large-scale relative depth datasets, which are much easier to collect, but we find that recovering accurate 3D scene shape from depth learned on such data additionally requires estimating an unknown depth shift and the camera focal length (see the sketch below). This problem has rarely been studied in the literature, and we attempt to tackle it here.
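As a minimal illustration of why the depth shift and focal length matter, the sketch below back-projects a predicted depth map into a point cloud under a standard pinhole camera model; this is not the paper's implementation, and the function name, parameters, and example values are illustrative assumptions. Because the 3D coordinates depend directly on the shifted depth and the assumed focal length, errors in either one distort the recovered scene shape.

```python
import numpy as np

def unproject_depth(depth, f, cx, cy, shift=0.0):
    """Back-project a depth map into a 3D point cloud with a pinhole model.

    depth : (H, W) predicted depth, known only up to an unknown shift/scale
    f     : assumed focal length in pixels
    cx,cy : principal point in pixels
    shift : additive depth shift applied before unprojection
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    d = depth + shift                  # a wrong shift warps the geometry
    x = (u - cx) * d / f               # a wrong f stretches/compresses x, y
    y = (v - cy) * d / f
    return np.stack([x, y, d], axis=-1)  # (H, W, 3) point cloud

# The same relative depth map unprojected with two different shift /
# focal-length guesses yields geometrically different 3D shapes.
depth_pred = np.random.rand(4, 4) + 1.0   # placeholder relative depth
pc_a = unproject_depth(depth_pred, f=500.0, cx=2.0, cy=2.0, shift=0.0)
pc_b = unproject_depth(depth_pred, f=800.0, cx=2.0, cy=2.0, shift=0.5)
```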