1. Introduction
Visual scene understanding has moved from an elusive goal to a focus of much recent research in computer vision [27]. Semantic reasoning about the contents of a scene is thereby done on several levels of abstraction. Scene recognition aims to determine the overall scene category by putting emphasis on understanding its global properties, e.g. [46], [82]. Scene labeling methods, on the other hand, seek to identify the individual constituent parts of a whole scene as well as their interrelations on a more local pixel- and instance-level, e.g. [41], [71]. Specialized object-centric methods fall somewhere in between by focusing on detecting a certain subset of (mostly dynamic) scene constituents, e.g. [6], [12], [13], [15]. Despite significant advances, visual scene understanding remains challenging, particularly when taking human performance as a reference.