1. Introduction
Convolutional neural networks (CNNs) [40], [53] used to be the common choice of visual encoder in modern computer vision systems. Recently, however, CNNs have been greatly challenged by Vision Transformers (ViTs) [34], [59], [86], [94], which have shown leading performance on many visual tasks - not only image classification [34], [104] and representation learning [4], [9], [16], [100], but also downstream tasks such as object detection [24], [59], semantic segmentation [94], [98] and image restoration [10], [54]. Why are ViTs so powerful? Some works believe that the multi-head self-attention (MHSA) mechanism in ViTs plays the key role, providing empirical evidence that MHSA is more flexible [50], more capable (i.e., has less inductive bias) [20], more robust to distortions [66], [98], or better able to model long-range dependencies [69], [90]. Other works challenge the necessity of MHSA [115], attributing the high performance of ViTs to the proper building blocks [33] and/or dynamic sparse weights [38], [111]. More works [20], [38], [42], [95], [115] explain the superiority of ViTs from other points of view.
Figure: The effective receptive field (ERF) of ResNet-101/152 and RepLKNet-13/31, respectively. A more widely distributed dark area indicates a larger ERF. Adding more layers (e.g., going from ResNet-101 to ResNet-152) helps little in enlarging the ERF; in contrast, our large-kernel model RepLKNet effectively obtains large ERFs.
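ERF maps such as those referenced in the caption are commonly obtained by measuring how strongly each input pixel influences the central output feature, i.e., by backpropagating from the center of the final feature map and visualizing the input-gradient magnitude. Below is a minimal sketch of this standard gradient-based procedure; the choice of a torchvision ResNet-101 trunk, the 1024x1024 random input, and the channel-averaging step are illustrative assumptions, not the exact protocol used in this paper.

```python
# Minimal sketch: gradient-based effective-receptive-field (ERF) visualization.
# Assumptions (not from the paper): torchvision ResNet-101 as the example model,
# a single 1024x1024 random input, and simple averaging over output channels.
import torch
import torchvision

model = torchvision.models.resnet101(weights=None).eval()
# Keep only the convolutional trunk so the spatial feature map is preserved
# (drop the global average pooling and the fully-connected classifier).
trunk = torch.nn.Sequential(*list(model.children())[:-2])

x = torch.randn(1, 3, 1024, 1024, requires_grad=True)
feat = trunk(x)                                  # (1, C, H, W) feature map
h, w = feat.shape[-2] // 2, feat.shape[-1] // 2
# Backpropagate from the central output position, averaged over channels.
feat[0, :, h, w].mean().backward()

# The ERF map is the input-gradient magnitude, summed over RGB channels;
# larger values indicate a stronger influence on the central output unit.
erf = x.grad.abs().sum(dim=1).squeeze(0)
erf = erf / erf.max()
```

Plotting such a normalized map as a heatmap yields figures like the one above: a gradient mass concentrated near the center corresponds to a small ERF, while a widely spread dark area indicates that distant input pixels also affect the central output.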