I. Introduction
Learning to localize and classify objects in an image is a fundamental problem in computer vision, with a wide range of applications [1], [2] including robotics, autonomous vehicles, and video surveillance. With the success of convolutional neural networks (CNNs), great leaps have been made in object detection by remarkable works including Faster R-CNN [3], Mask R-CNN [4], YOLO [5], and SSD [6]. Despite these achievements, most object detectors suffer from an important limitation: they rely on huge amounts of training data with heavy annotation. For object detection, annotating data is particularly expensive, as it requires not only identifying the categorical labels of all objects in an image but also providing accurate localization through bounding box coordinates. This creates a demand for effective object detectors that can generalize well from small amounts of annotated data.

Recently, several approaches [7], [8] have attempted to resolve few-shot object detection, which aims to detect data-scarce novel categories as well as data-sufficient base categories. These methods attach a meta-learner to an existing object detector. The meta-learner takes a support set of images containing the few examples of the novel categories and a subset of examples from the base categories. Given the support images, the meta-learner predicts categorical prototypes, which are used to reweight the feature maps of a query image into category-discriminative feature maps that remodel a prediction layer. However, in these methods the remodelled prediction layers suffer from a poor embedding space of the prototypes, because each prototype is predicted independently without considering the others.
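The prototype-based reweighting described above can be illustrated with a minimal sketch. The function and variable names below are hypothetical, and the channel-wise multiplication is one common instantiation of this reweighting scheme, not the exact implementation of the cited works:

```python
import numpy as np

def reweight_features(query_feat, prototypes):
    """Channel-wise reweighting of a query feature map by class prototypes.

    query_feat: (C, H, W) feature map extracted from the query image.
    prototypes: (N, C) array with one C-dimensional prototype per support category.
    Returns:    (N, C, H, W) category-discriminative feature maps, one per category,
                which would then feed the detector's prediction layer.
    """
    # Broadcast each prototype over the spatial dimensions and multiply
    # channel-wise, producing one reweighted feature map per category.
    return prototypes[:, :, None, None] * query_feat[None, :, :, :]

# Toy example: 4 channels, 3x3 spatial map, 2 support categories.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 3, 3))
protos = rng.standard_normal((2, 4))
out = reweight_features(feat, protos)
print(out.shape)  # (2, 4, 3, 3)
```

Note that each prototype reweights the query features independently of the others, which is precisely the limitation raised above: no interaction between prototypes shapes the embedding space.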