I. Introduction
During the past decade, research has significantly advanced the state-of-the-art in object detection and image segmentation [1]–[9]. Convolution Neural Networks (CNNs) have paved the way for groundbreaking approaches toward object detection. Earlier CNNs were unable to effectively accommodate geometric or spatial variations in terms of object scale, pose, viewpoint, and partial deformation [10]. Therefore, two approaches were followed: i) data augmentation, which includes spatial variations in the training dataset [11], [12], and ii) handcrafting of feature layers, such as pooling [13], [14]. However, such highly specialized approaches could not be generalized for new datasets or handle complicated deformations that require a different receptive field.