1. Introduction
In recent years, deep neural networks (DNNs) have achieved great success in a wide range of computer vision applications. The advance of novel neural architecture designs and training schemes, however, often comes with a greater demand for computational resources in terms of both memory and time.

Consider the stereo matching task as an example. It has been empirically shown that, compared to traditional 2D convolution, 3D convolution on a 4D volume (height × width × disparity × feature channels) [17] can better capture context information and learn representations for each disparity level, resulting in superior disparity estimation results. Due to the extra feature dimension, however, 3D convolution typically operates at spatial resolutions lower than the original input image size because of time and memory constraints. For example, CSPN [8], the top-1 method on the KITTI 2015 benchmark, conducts 3D convolution at 1/4 of the input size and uses bilinear interpolation to upsample the predicted disparity volume for final disparity regression. To handle high-resolution images (e.g., 2000 × 3000), HSM [42], the top-1 method on the Middlebury-v3 benchmark, uses a multi-scale approach to compute disparity volumes at 1/8, 1/16, and 1/32 of the input size. Bilinear upsampling is again applied to generate disparity maps at the full resolution. In both cases, object boundaries and fine details are often not well preserved in the final disparity maps due to the upsampling operation.
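To make the memory pressure concrete, the following sketch counts the elements of a 4D cost volume at full versus 1/4 resolution. The disparity range (192) and channel count (32) are illustrative assumptions, not values taken from the cited methods; note that downsampling by 1/4 typically shrinks the disparity dimension as well, so the volume shrinks cubically.

```python
def cost_volume_elems(h, w, d, c):
    # Elements in a 4D cost volume of shape (height, width, disparity, channels).
    return h * w * d * c

# Hypothetical 2000 x 3000 input with an assumed disparity range of 192
# and 32 feature channels.
full = cost_volume_elems(2000, 3000, 192, 32)

# At 1/4 resolution, height, width, AND disparity all shrink by 4x.
quarter = cost_volume_elems(500, 750, 48, 32)

print(full // quarter)  # 64
```

Since the 4D volume scales with H × W × D, reducing the spatial resolution by 4× cuts memory by 64×, which is why methods like CSPN and HSM perform 3D convolution on downsampled volumes and rely on upsampling afterwards.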