1. Introduction
Holistic scene understanding is a long-standing task in computer vision, in which a model is trained to explain every pixel in an image, whether that pixel belongs to stuff – uncountable regions of similar texture, such as grass, road, or sky – or to a thing – a countable object with individually identifying characteristics, such as a person or a car. While holistic scene understanding received some early attention [49], [55], [48], modern deep learning-based methods have mainly modeled stuff and things independently, under the task names semantic segmentation and instance segmentation, respectively. Recently, Kirillov et al. [24] proposed the panoptic quality (PQ) metric, unifying these two parallel tracks into the holistic task of panoptic segmentation. Panoptic segmentation is a key step toward visual understanding, with applications in fields such as autonomous driving and robotics, where it is crucial to know both the locations of dynamically trackable things and the extent of static stuff classes. For example, an autonomous car must both avoid other cars with high precision and understand the location of the road and sidewalk in order to stay on a desired path.
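As a point of reference, the PQ metric of [24] first matches each predicted segment p to a ground-truth segment g of the same class whenever their intersection-over-union exceeds 0.5, which partitions all segments into matched pairs TP, unmatched predictions FP, and unmatched ground-truth segments FN (notation following [24]). It then computes

$$\mathrm{PQ} = \frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|},$$

which factors into a segmentation-quality term (the mean IoU of matched segments) and a recognition-quality term (an F1-style detection score), so a single number rewards both accurate masks and correct instance-level recognition across stuff and thing classes alike.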