1. Introduction
Image restoration aims to recover a clean image from various types of degradation (e.g., noise, blur, rain, and compression artifacts), and it strongly affects the performance of downstream tasks such as image classification [14], [56], object detection [22], [46], and segmentation [4], [10], to name a few. It is a highly ill-posed inverse problem, as multiple solutions may exist for a single degraded image. Recent restoration works [17], [36], [76] attempt to establish a mapping between degraded and clean images by leveraging the representation power of convolutional neural networks (CNNs). The local operations used in CNNs are, however, inherently limited in capturing long-range dependencies, and thus struggle to aggregate global information over an entire image. To enlarge the receptive field, increasing network depth [51], dilated convolution [66], and hierarchical architectures [40] have been proposed, but the effective receptive field remains confined to local regions and still fails to secure global information.

Recently, the non-local operation, which has long been central to non-learning-based restoration approaches [5], [15], has re-emerged as a promising solution with the success of non-local neural networks [58]. As similar patterns tend to repeat within a natural image, non-local self-similarity, which computes the response at a position as a weighted sum over all positions, has served as an important cue for image restoration [16], [28], [32], [37], [38], [43], [53], [77], [78]. The non-local operation of [58] can capture long-range dependencies within deep networks, but its quadratic complexity with respect to the input feature resolution limits the network capacity. Consequently, it is employed only in relatively low-resolution feature maps of specific layers [16], [32], [77].
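To make the quadratic cost concrete, the following is a minimal NumPy sketch of a non-local operation in the spirit of [58]: the response at each position is a softmax-weighted sum over all positions, so the attention matrix has shape (H·W, H·W). The function name and tensor shapes are illustrative, not from the paper.

```python
import numpy as np

def non_local_block(x):
    """Simplified non-local operation (weighted sum over ALL positions).

    x: feature map of shape (H, W, C). The pairwise similarity matrix
    has shape (H*W, H*W), i.e., quadratic in the spatial resolution,
    which is why such blocks are typically restricted to
    low-resolution feature maps.
    """
    h, w, c = x.shape
    feats = x.reshape(h * w, c)                  # flatten spatial dims: N = H*W
    sim = feats @ feats.T                        # pairwise similarity, (N, N)
    sim -= sim.max(axis=1, keepdims=True)        # numerical stability for softmax
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over all positions
    out = attn @ feats                           # response = weighted sum of all positions
    return out.reshape(h, w, c)

x = np.random.rand(16, 16, 8).astype(np.float32)
y = non_local_block(x)                           # same shape as the input
```

For an H×W feature map the block materializes an (H·W)×(H·W) matrix, so doubling the resolution multiplies memory and compute by sixteen, which matches the capacity limitation discussed above.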
Comparisons of different attention approaches: (a) global attention [18], [45], [57] computes self-similarity between patches globally; (b) local attention [33], [59] measures self-similarity within a single patch at the pixel level; and (c) the proposed method aggregates the k most similar patches with pair-wise local attention at the pixel level.