Introduction
Object detection is one of the fundamental problems in computer vision and has attracted the attention of many researchers. Its tasks include recognizing instances of visual objects from certain classes (such as humans, animals, or cars) in digital images and developing computational models and techniques that provide the information needed by computer vision applications, namely which objects are present and where they are located. The biggest challenge is reaching the level of performance that humans are capable of. Humans can identify many objects in images even under difficult conditions, such as when objects vary in perspective, size, and scale, or when they are translated or rotated. In addition, humans can still recognize an object that is partially obstructed from view. Computer vision, however, has not yet achieved this kind of capability. Research on human vision shows that with a single eye fixation lasting only a fraction of a second, humans can extract large amounts of information about surrounding objects, such as semantic categories, spatial layout, and object identities. Human visual abilities are fast and accurate, allowing complex tasks, such as driving, to be performed with little conscious effort.

Recent research in this field is based on increasingly complex convolutional neural networks designed to improve accuracy or speed [1], [2]. Although accuracy continues to be a key metric [3], [4], as deep learning techniques develop, there is growing attention to improving the speed of these models. This work is strongly inspired by how easily humans solve visual recognition problems, such as distinguishing very similar objects. Popular examples of similar-object recognition are the "Chihuahua or muffin" and "labradoodle or fried chicken" problems [5]. Such cases are easy for humans to tell apart, but they remain a challenge for machines.