1. Introduction
The goal of single image super-resolution (SR) is to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Many deep learning-based methods have been proposed for this task. In particular, several image restoration studies [16], [27], [55], [61], [63], [66] have adopted the window self-attention (WSA) introduced by the Swin Transformer (Swin) [32], as it combines the long-range dependency modeling of the Vision Transformer [14] with the locality of conventional convolution. However, two critical problems remain in these works. First, the receptive field of plain WSA is limited to a small local window [52], [56], [58]. This prevents the models from exploiting the textures and patterns of neighboring windows to recover degraded pixels, producing distorted images. Second, recent state-of-the-art SR [9], [27], [61], [66] and lightweight SR [6], [15], [35], [63] networks require intensive computation. If the parameter count is kept around a certain level (e.g., 1M parameters, 4MB model size), reducing operations is essential for real-world applications, because the primary energy (and hence time) consumption of neural networks on semiconductors comes from Mult-Adds operations [17], [47].
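The receptive-field limitation of plain WSA can be seen directly from how Swin partitions the feature map before attention. The following is a minimal NumPy sketch (not the paper's or Swin's actual implementation) of Swin-style window partitioning; the function name and toy shapes are illustrative. Because attention is computed independently inside each window, a pixel can never attend to tokens in a neighboring window.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Self-attention is then computed independently inside each
    (window_size x window_size) window, so a pixel's receptive
    field is confined to its own window (illustrative sketch).
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size,
                  W // window_size, window_size, C)
    # -> (num_windows, window_size * window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

# Toy example: an 8x8 single-channel map with 4x4 windows
# yields 4 windows of 16 tokens each; each attention matrix
# is only 16x16, covering one window and nothing outside it.
feat = np.arange(8 * 8 * 1, dtype=np.float32).reshape(8, 8, 1)
windows = window_partition(feat, 4)
print(windows.shape)  # (4, 16, 1)
```

Shifting windows between consecutive blocks (as in Swin) mitigates but does not remove this locality, which motivates injecting cross-window (N-Gram) context.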
Figure 1. Two tracks of this paper using the N-Gram context. (Left) NGswin outperforms previous leading SR methods with an efficient structure. (Right) Our proposed N-Gram context improves different Swin Transformer-based SR models.