1. Introduction
Scene flow estimation, which involves estimating both 3D structure and 3D motion of a dynamic scene from its two consecutive observations, has been receiving increasing attention due to its significance in areas such as robotics [10], augmented reality [22], and autonomous vehicles [35]. Recently, deep learning has demonstrated remarkable progress in the domain of scene flow estimation based on various input modalities, including stereo images [3], [24], [32], [41], [51], [40], RGB-D pairs [31], [39], [45], [33], or Lidar points [28, 18, 54, 56, 38, 55, 12, 7, 11, 52]. These methods, however, either require strict sensor calibrations (e.g., stereo-based), or expensive devices (e.g., RGB-D or Lidar-based) for achieving satisfactory performance, which restricts their widespread applications.