1. Introduction
Scenes in images most often depict multiple objects that partially occlude each other. Recent studies [38], [18] showed that deep networks are less robust than humans at recognizing partially occluded objects. The main difficulties arise from the combinatorial variability of object ordering and positioning, and from the fact that scenes can contain both known and unknown object classes.