1. Introduction
Scene understanding is one of the grand goals for automated perception that requires advanced visual comprehension of tasks like semantic segmentation (Which semantic category does a pixel belong toƒ) and detection or instance-specific semantic segmentation (Which individual object segmentation mask does a pixel belong toƒ). Solving these tasks has large impact on a number of applications, including autonomous driving or augmented reality. Interestingly, and despite sharing some obvious commonalities, both these segmentation tasks have been predominantly handled in a disjoint way ever since the rise of deep learning, while earlier works [49], [50], [56] already approached them in a joint manner. Instead, independent trainings of models, with separate evaluations using corresponding performance metrics, and final fusion in a post-processing step based on task-specific heuristics have seen a revival.