1. Introduction
Image inpainting is a long-standing problem in computer vision with many graphics applications. The goal is to fill in the missing regions of a masked image so that the output is a natural completion of the captured scene, with (i) plausible semantics and (ii) realistic details and textures. The latter can be achieved with traditional inpainting methods that copy patches of valid pixels, e.g., PatchMatch [3], thereby preserving the textural statistics of the surrounding regions. Nevertheless, the inpainted results often lack semantic context and do not blend well with the rest of the image.

With the advent of deep learning, inpainting networks are commonly trained in a self-supervised fashion: random masks are generated and applied to complete images, the masked images serve as the network's input, and the original images serve as the reconstruction targets. Thanks to abundant training data, these networks are able to produce semantically plausible results. However, their outputs often lack realistic details and textures, presumably due to the spectral bias [36] of neural networks: high-frequency details are difficult to learn because networks are biased toward learning low-frequency components. This is especially problematic for image restoration tasks such as inpainting, where high-frequency details must be synthesized for the result to look realistic.
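To make the self-supervised training setup concrete, the following is a minimal NumPy sketch of how a training pair can be constructed. The function names (random_stroke_mask, make_training_pair) and the axis-aligned rectangular strokes are illustrative assumptions; published pipelines typically sample free-form brush strokes or irregular holes, but the principle of corrupting a complete image and using it as its own reconstruction target is the same.

```python
import numpy as np

def random_stroke_mask(height, width, num_strokes=4, max_len=60,
                       max_thickness=20, rng=None):
    """Binary mask (1 = missing) built from random rectangular strokes.

    Rectangles keep this sketch dependency-free; real pipelines often
    use free-form brush strokes or irregular hole shapes instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((height, width), dtype=np.float32)
    for _ in range(num_strokes):
        t = rng.integers(5, max_thickness)       # stroke thickness
        l = rng.integers(10, max_len)            # stroke length
        y = rng.integers(0, height - t)
        x = rng.integers(0, width - l)
        mask[y:y + t, x:x + l] = 1.0
    return mask

def make_training_pair(image):
    """Self-supervised pair: the masked image is the network input,
    the original image is the reconstruction target.

    `image` is assumed to be an HxWxC float array in [0, 1].
    """
    h, w = image.shape[:2]
    mask = random_stroke_mask(h, w)[..., None]   # broadcast over channels
    masked = image * (1.0 - mask)                # zero out missing pixels
    return masked, mask, image
```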
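The spectral bias itself is easy to observe. The PyTorch script below is a toy sketch, not the experiment of [36]: it fits a small ReLU MLP to the sum of a low-frequency and a high-frequency sinusoid and tracks the residual error in each frequency band via the FFT. In typical runs, the low-frequency error drops long before the high-frequency error does.

```python
import torch
import torch.nn as nn

# Target signal on [0, 1): one low-frequency and one high-frequency sinusoid.
x = torch.arange(512).unsqueeze(1) / 512.0
y = torch.sin(2 * torch.pi * 1 * x) + torch.sin(2 * torch.pi * 16 * x)

net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def band_error(pred):
    """Magnitude of the residual at the two component frequencies."""
    err = torch.fft.rfft((pred - y).squeeze(), norm="forward").abs()
    return err[1].item(), err[16].item()  # FFT bins of the 1- and 16-cycle terms

for step in range(2001):
    pred = net(x)
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        lo, hi = band_error(pred.detach())
        print(f"step {step:4d}  low-freq error {lo:.4f}  high-freq error {hi:.4f}")
```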