1. Introduction
In our daily lives, we routinely interact with objects of widely varying appearance. Among them are those made of transparent or mirror (ToM) surfaces, ranging from the glass windows of buildings to the reflective surfaces of cars and appliances. These pose a hard challenge for any autonomous agent that relies on computer vision to operate in unknown environments. Specifically, among the many tasks involved in Spatial AI, accurately estimating depth on these surfaces remains a challenging problem for both computer vision algorithms and deep networks [64], yet it is necessary for proper interaction with the environment in robotics, autonomous navigation, picking, and other application fields. The difficulty arises because ToM surfaces convey misleading visual information about scene geometry, which makes depth estimation hard not only for computer vision systems but even for humans – e.g., we might fail to notice a glass door in front of us due to its transparency.

On the one hand, the very definition of depth appears ambiguous in such cases: is depth the distance to the scene behind the glass door, or to the door itself? From a practical point of view, we argue that the proper definition depends on the task at hand – e.g., a mobile robot should definitely be aware of the presence of the glass door. On the other hand, just as humans learn to deal with these surfaces through experience, depth sensing techniques based on deep learning, e.g., monocular [38], [37] or stereo [26], [22] networks, hold the potential to address this challenge given sufficient training data [64].