I. Introduction
Depth estimation from stereo camera images is an important task for 3D scene reconstruction and understanding, with numerous applications ranging from robotics [30], [51], [39], [42] to augmented reality [53], [1], [35]. High-resolution stereo cameras provide a reliable solution for 3D perception - unlike time-of-flight cameras, they work well both indoors and outdoors, and compared to LiDAR they are substantially more affordable and energy-efficient [29]. Given a rectified stereo image pair, the focal length, and the stereo baseline distance between the two cameras, depth estimation can be cast into a stereo matching problem, the goal of which is to find the disparity between corresponding pixels in the two images. Although disparity estimation from stereo images is a long-standing problem in computer vision [28], in recent years the adoption of deep convolutional neural networks (CNN) [52], [32], [20], [25], [36] has led to significant progress in the field. Deep networks can solve the matching problem via supervised learning in an end-to-end fashion, and they have the ability to incorporate local context as well as prior knowledge into the estimation process.