I. Introduction
The human visual system is capable of recognizing and localizing objects within cluttered scenes with ease. For artificial systems, however, the goal of achieving the same level of competence has been plagued by challenges for decades, due to viewpoint-dependent object variability (i.e. in terms of scale and rotation), differences in lighting conditions and the open-endedness that comes with defining semantically relevant object categories.