I. Introduction
Single image super-resolution (SISR) has drawn active attention due to its wide applications in computer vision, such as object recognition [1], remote sensing [2], [3], and others [4], [5], [6], [7], [8], [9], [10]. SISR aims to recover a high-resolution (HR) image with rich details and textures from a low-resolution (LR) input, which is a classic ill-posed inverse problem [11], [12], [13], [14], [15], [16], [17]. To establish the mapping between LR and HR images, numerous CNN-based methods have emerged [18], [19], [20], [21], [22]. These methods focus on designing novel architectures by adopting different network modules, such as residual blocks [23], attention blocks [24], non-local blocks [25], transformer layers [7], [26], and contrastive learning [27], [28]. For training, they typically rely on the MAE or MSE loss (i.e., $L_1$ or $L_2$), which provides a straightforward learning objective and directly targets the popular PSNR (peak signal-to-noise ratio) metric, but often leads to over-smoothed results [29], [30], [31], [32], [33].
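For reference, a minimal sketch of these objectives is given below; the notation ($I_{SR}$ for the network output, $I_{HR}$ for the ground-truth HR image, $N$ for the number of pixels, $L$ for the peak pixel value) is assumed here rather than taken from the cited works:

% Per-pixel losses commonly used to train SR networks (notation assumed):
% I_SR = network output, I_HR = ground-truth HR image, N = number of pixels.
\begin{align}
  \mathcal{L}_{1} &= \frac{1}{N} \sum_{i=1}^{N} \left| I_{SR}(i) - I_{HR}(i) \right|, \\
  \mathcal{L}_{2} &= \frac{1}{N} \sum_{i=1}^{N} \left( I_{SR}(i) - I_{HR}(i) \right)^{2}, \\
  \mathrm{PSNR}   &= 10 \log_{10} \frac{L^{2}}{\mathcal{L}_{2}},
\end{align}
% where L is the maximum pixel value (e.g., 255 for 8-bit images).

Since PSNR is a monotonically decreasing function of the MSE, minimizing $\mathcal{L}_{2}$ directly maximizes PSNR; such per-pixel objectives average over the set of plausible HR textures, which is why PSNR-oriented training tends to yield over-smoothed outputs.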