1. Introduction
Object retrieval is a fundamental and still active topic in computer vision that has attracted attention for decades. Given a query instance, the goal is to find objects in a large database that share a similar visual appearance with the query. It has long been crucial to design discriminative representations, so that the metric defined on these representations is robust to common deformations such as rotation, occlusion, and illumination changes. Conventionally, the Bag-of-Words (BoW) model has been widely employed, building on well-designed local descriptors (e.g., [27] for images, [26], [37] for shapes, [42], [43], [15], [14] for 3D models). In recent years, the rapid development of deep learning algorithms and GPU computing platforms has shifted research attention to deep-learned features [2], [31], [11], [47], [24], which yield a remarkable performance boost over conventional handcrafted features.