1. Introduction
The self-attention mechanism (SAM) [41], [15], [16], [9], [43], [4] is widely used across artificial intelligence and has markedly improved model performance on a range of vision tasks, including image classification [22], [60], [47], object detection [37], [56], [25], instance segmentation [8], [52], and image super-resolution [64], [46], [49]. However, most previous works focus on designing new self-attention methods and explore only intuitively or heuristically how the self-attention mechanism improves performance. For example, many popular channel attention methods [22], [47], [56], [35] treat the attention values as soft weights on the channels, reassigning the importance of the feature maps. These soft weights can also be viewed as a gating mechanism [60], [28] that controls the forward flow of information, and such gates are commonly applied to neural network pruning and neural architecture search [40], [65]. Another viewpoint [38] argues that the self-attention mechanism helps regulate noise by enhancing instance-specific information, yielding a better regularization effect. Moreover, the receptive field [62], [63], [67] and long-range dependencies [69], [17], [57] are also invoked to explain the role of self-attention. Although these explanations describe the behavior of self-attention to some extent, the relationship between the SAM and model performance remains ambiguous.
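For concreteness, the following is a minimal PyTorch sketch of the soft channel-weighting view described above, in the spirit of squeeze-and-excitation style channel attention; the module name, reduction ratio, and layer choices are illustrative assumptions rather than the exact design of any cited method.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Illustrative channel attention: attention values act as soft weights
    that rescale (gate) each channel of the input feature map."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                               # squeeze: per-channel global context
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                          # soft weights in (0, 1), one per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x)        # (N, C, 1, 1) soft channel weights
        return x * w          # gating: reweight each channel of the feature map

# usage: output has the same shape, with channels rescaled by the attention weights
x = torch.randn(2, 64, 32, 32)
y = ChannelAttention(64)(x)
```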