1. Introduction
Scene understanding is one of the foundations of machine perception. A real-world scene, regardless of its context, often comprises multiple objects of varying ordering and positioning, with one or more object(s) being occluded by other object(s). Hence, scene understanding systems should be able to process modal perception, i.e., parsing the directly visible regions, as well as amodal perception [1]–[3], i.e., perceiving the intact structures of entities including invisible parts. The advent of advanced deep networks along with large-scale annotated datasets has facilitated many scene understanding tasks, e.g., object detection [4]–[7], scene parsing [8]–[10], and instance segmentation [11]–[14]. Nonetheless, these tasks mainly concentrate on modal perception, while amodal perception remains rarely explored to date.