1 Introduction
Convolutional Neural Networks (CNNs) [1], [2], [3], [4], [5], [6], [7] demonstrate a high capability of learning discriminative visual representations and generalize convincingly to a series of Computer Vision (CV) tasks, e.g., image recognition, object detection, and semantic segmentation. The de-facto recipe of CNN architecture design is based on discrete convolutional operators (e.g., 3×3 or 5×5 convolution), which effectively impose spatial locality and translation equivariance. However, the limited receptive field of convolution hinders the modeling of global/long-range dependencies, and such long-range interaction benefits numerous CV tasks [8], [9]. Recently, the Natural Language Processing (NLP) field has witnessed the rise of the Transformer with self-attention in powerful language modeling architectures [10], [11], which enables long-range interaction in a scalable manner. Inspired by this, there has been a steady momentum of breakthroughs [12], [13], [14], [15], [16], [17], [18] that push the limits of CV tasks by integrating CNN-based architectures with Transformer-style modules. For example, ViT [14] and DETR [13] directly process image patches or CNN outputs with self-attention, as in the Transformer. The works of [17], [18] present stand-alone designs of local self-attention modules that can completely replace the spatial convolutions in ResNet architectures. Nevertheless, previous designs mainly hinge on independent pairwise query-key interactions to compute the attention matrix, as in the conventional self-attention block (Fig. 1a), thereby ignoring the rich contexts among neighboring keys.
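To make this limitation concrete, the following is a minimal PyTorch-style sketch of the conventional self-attention block described above (function and variable names are illustrative, not drawn from any cited implementation): every attention weight arises from an isolated query-key dot product, so the keys never interact with their neighbors before the attention matrix is formed.

```python
import torch
import torch.nn.functional as F

def pairwise_self_attention(x, w_q, w_k, w_v):
    """Conventional self-attention over a flattened feature map.

    x:             (N, C) tensor of N spatial positions with C channels
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical, for illustration)
    """
    q = x @ w_q                      # queries, (N, C)
    k = x @ w_k                      # keys,    (N, C)
    v = x @ w_v                      # values,  (N, C)
    # Attention matrix from independent pairwise query-key products:
    # entry (i, j) depends only on q_i and k_j, never on the keys
    # neighboring k_j -- this is the context the text says is ignored.
    attn = F.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)  # (N, N)
    return attn @ v                  # aggregated values, (N, C)

# Usage example with random features (16 positions, 64 channels)
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = pairwise_self_attention(x, w_q, w_k, w_v)
print(out.shape)                     # torch.Size([16, 64])
```

As the comments indicate, each entry of `attn` is computed from a single query-key pair in isolation; no operation in this block aggregates information across neighboring keys before the attention weights are produced, which is precisely the gap Fig. 1a illustrates.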