I. Introduction
Deep neural networks (DNNs) have been shown through comprehensive experiments on various tasks [3]–[5] to possess efficient and reliable feature extraction capabilities, which play a fundamental role in their performance [1], [2]. In computer vision, the feature extraction ability of a DNN is chiefly reflected in whether it can recognize and attend to the key pixel regions of an image [6], [7]. As depicted in Fig. 1, a popular interpretability technique, Grad-CAM [8], is adopted to explicitly visualize the regions that DNNs attend to in the form of heat maps. The results show that, although the vanilla ResNet [3] achieves good performance, it exhibits non-negligible attention biases in key semantic feature extraction: (1) Position bias. In the examples in Fig. 1(a)(b), ResNet attends only to label-independent background regions rather than to the regions of the bird and the cat. Such position bias makes the extracted features sensitive to background information, leading to erroneous predictions. (2) Range bias. As shown in Fig. 1(c)(d), ResNet fails to cover the entire region of the labeled object while attending to extraneous regions such as the sky and a fence.
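To make the visualization procedure concrete, the following is a minimal PyTorch sketch of Grad-CAM applied to a ResNet, not the implementation used in this paper: the choice of `layer4` as the target layer, the random stand-in input, and the normalization details are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch (assumes torchvision >= 0.13 for the weights API).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

feats, grads = {}, {}

def fwd_hook(module, inp, out):
    # Cache the feature maps of the target layer and hook their gradient.
    feats["a"] = out.detach()
    out.register_hook(lambda g: grads.__setitem__("g", g.detach()))

# layer4 (the last convolutional stage) is a common, assumed target layer.
model.layer4.register_forward_hook(fwd_hook)

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed input image
logits = model(x)
cls = logits.argmax(dim=1).item()

model.zero_grad()
logits[0, cls].backward()  # gradients of the predicted class score

# Channel weights = gradients global-average-pooled over spatial dimensions.
w = grads["g"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))

# Upsample to input resolution and normalize to [0, 1] for the heat map.
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                    align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

Overlaying `cam` on the input image yields heat maps like those in Fig. 1; the position and range biases discussed above correspond to the heat map peaking off the object or covering only part of it.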