1. Introduction
Although colorization is difficult, requiring a semantic understanding of the scene and of the natural colors found in the wild, various user-guided image colorization methods have shown remarkable results in restoring grayscale photographs as well as black-and-white films. Among user-guided approaches, point-interactive colorization methods [12], [27], [36] colorize an image from a small set of user-provided color hints while minimizing the interaction required from users. In particular, [36] proposed a U-Net-based colorization model trained on ImageNet [3] with synthetically generated user hints sampled through 2-D Gaussian sampling (a sketch of this hint simulation is given below).

However, prior works suffer from partial colorization, where regions with unclear boundaries are not colored successfully. Furthermore, they often fail to produce consistent colorization because hints are difficult to propagate to large and distant semantic regions. To tackle this problem, [33] leverages the Vision Transformer (ViT) architecture, allowing the model to learn to propagate user hints to distant but semantically similar regions through self-attention. Despite the exceptional performance of ViT in colorization applications, transformer-based models contain redundant computation, resulting in slow inference. This limits users' active interaction in a variety of real-time colorization applications.
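As a concrete illustration of the hint-simulation strategy attributed to [36], the following minimal sketch draws synthetic point hints whose locations follow a 2-D Gaussian centered on the image. The function name, distribution parameters, and patch sizes here are illustrative assumptions, not the published settings.

```python
import numpy as np

def sample_hints(ab, hint_p=1/8, sigma_frac=0.25, max_half_patch=4, rng=None):
    """Simulate sparse point hints for training a point-interactive colorizer.

    ab: (H, W, 2) ground-truth chrominance channels (e.g., the ab plane of Lab).
    Returns a hint map (H, W, 2) and a binary mask (H, W, 1); all parameter
    defaults are assumptions for illustration.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = ab.shape
    hints = np.zeros_like(ab)
    mask = np.zeros((h, w, 1), dtype=ab.dtype)

    # A geometric-style count keeps most training samples sparsely hinted
    # while occasionally revealing many hints.
    n_hints = rng.geometric(hint_p) - 1
    for _ in range(n_hints):
        # Hint location drawn from a 2-D Gaussian centered on the image,
        # biasing simulated clicks toward central (often salient) regions.
        y = int(np.clip(rng.normal(h / 2, sigma_frac * h), 0, h - 1))
        x = int(np.clip(rng.normal(w / 2, sigma_frac * w), 0, w - 1))
        s = rng.integers(1, max_half_patch + 1)  # small square patch
        y0, y1 = max(0, y - s), min(h, y + s)
        x0, x1 = max(0, x - s), min(w, x + s)
        # Reveal the mean ground-truth color of the patch as the hint value.
        hints[y0:y1, x0:x1] = ab[y0:y1, x0:x1].mean(axis=(0, 1))
        mask[y0:y1, x0:x1] = 1.0
    return hints, mask
```

In such a setup, the grayscale input, the hint map, and the mask would typically be concatenated channel-wise and fed to the network, so the model learns to propagate revealed colors into unhinted regions.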