1. Introduction
Depth prediction from a single image using CNNs has seen a surge of interest in recent years [7], [21], [20]. Recently, unsupervised methods that rely solely on monocular video for training (without depth or stereo groundtruth) have captured the attention of the community. Of particular note in this regard is the work of Zhou et al. [31], who proposed a strategy that learns separate pose and depth CNN predictors by minimizing photometric error across monocular video during training. Although it achieves impressive results, this strategy falls noticeably behind those trained on rectified stereo image pairs [13], [19]. These rectified stereo methods have shown accuracy comparable to supervised methods [26], [7], [23] on datasets where only sparse depth annotation is available. However, the requirement of calibrated binocular image pairs precludes the use of monocular video, which is easier to obtain and richer in variability. This performance gap between stereo [13], [19] and monocular [31] learning strategies is the central focus of this paper.