1. Introduction
Scene understanding is a fundamental yet challenging task in computer vision, with broad impact on downstream applications such as autonomous driving and robotics. Classic scene-understanding tasks include object detection, instance segmentation, and semantic segmentation. This paper considers a recently proposed task named panoptic segmentation [23], which aims at finding all foreground (FG) objects (named things, mainly countable targets such as people, animals, and tools) at the instance level, while parsing the background (BG) contents (named stuff, mainly amorphous regions of similar texture and/or material such as grass, sky, and road) at the semantic level. The benchmark algorithm [23] and the MS-COCO panoptic challenge winners [1] dealt with this task by directly combining FG instance segmentation models [15] with BG scene parsing algorithms [45], an approach that ignores the underlying relationship between things and stuff and fails to exploit the rich contextual cues shared between them.
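To make the task definition concrete, the sketch below illustrates one simple way a panoptic output can be formed by combining a stuff-level semantic map with thing-level instance masks: each pixel ends up with exactly one segment id, and each segment records its category and whether it is a thing or stuff. This is a minimal illustration with an assumed overwrite heuristic (things take precedence over stuff); the function name and merging rule are hypothetical, and the combination strategies used in [23] and [1] are more elaborate.

```python
import numpy as np

def merge_panoptic(semantic_map, instance_masks, instance_classes):
    """Combine a stuff semantic map with thing instance masks into one
    panoptic map (illustrative sketch, not the cited methods' logic)."""
    panoptic = semantic_map.copy()  # stuff segments keep their class id as segment id
    segments = {int(c): {"category": int(c), "isthing": False}
                for c in np.unique(semantic_map)}
    next_id = int(panoptic.max()) + 1
    for mask, cls in zip(instance_masks, instance_classes):
        # assumed heuristic: thing pixels simply overwrite stuff pixels
        panoptic[mask] = next_id
        segments[next_id] = {"category": int(cls), "isthing": True}
        next_id += 1
    return panoptic, segments

# Toy example: a 4x4 "sky" background (class 0) with one thing instance
semantic = np.zeros((4, 4), dtype=int)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
panoptic, segments = merge_panoptic(semantic, [mask], [7])
```

In this toy example the thing instance receives a fresh segment id, while the uncovered background pixels retain the stuff class id; the one-pixel-one-segment property is what distinguishes panoptic segmentation from running instance and semantic segmentation independently.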