1. Introduction
Local features [45, 7] play a fundamental role in almost all fields of computer vision where matching between images is needed, e.g. pose estimation [23, 25, 19, 17, 1], registration, 3D reconstruction, structure from motion, visual localization [37, 36, 35], object recognition, visual odometry, and simultaneous localization and mapping. Handcrafted keypoint extractors like SIFT [26], BRIEF [6], or ORB [34] are still widely used in spite of numerous alternative end-to-end learning-based approaches like R2D2 [33], LIFT [55], MatchNet [10], or DeepCompare [56]. Although learnable features have been a rather active research topic recently, their advantages over handcrafted ones are still not evident [39, 4, 7]. For example, according to [7], extracting handcrafted descriptors on a CPU is often much faster than extracting learned descriptors on a GPU; and recent benchmarks targeting image-based reconstruction and localization pipelines suggest that handcrafted features still perform as well as, or even better than, deep-learned features on such tasks.

Among learned keypoint descriptors, L2-Net [43] and its variants have become particularly popular. In [29], it was shown that a more powerful descriptor (called HardNet) with a significantly simpler learning objective can be learned via an efficient hard negative mining strategy. In [52], a modified robust angular loss function (RAL-Net) was proposed which, coupled with HardNet's negative mining strategy, yields state-of-the-art performance.

A significant amount of work on local features and feature matching (see the surveys in [39, 4, 7]) has been presented over the past 20 years to overcome common challenges like texture-less areas or repetitive patterns typical of man-made environments, found e.g. on planar building facades, traffic signs on roads, printed circuits, and various other mass-produced objects.
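To make the mining strategy of [29] concrete: for each matching descriptor pair in a batch, the hardest (closest) non-matching descriptor within the same batch is selected as the negative, and a triplet margin loss is applied. The following is a minimal NumPy sketch of such a hardest-in-batch loss; the function name, margin value, and numerical details are illustrative assumptions, not taken verbatim from [29].

```python
import numpy as np

def hardest_in_batch_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with hardest-in-batch negative mining (sketch).

    anchors, positives: (N, D) arrays of L2-normalized descriptors,
    where row i of each array forms a matching pair.
    """
    # Pairwise L2 distances; for unit-norm vectors
    # d(a_i, p_j) = sqrt(2 - 2 * <a_i, p_j>).
    sim = anchors @ positives.T
    dist = np.sqrt(np.clip(2.0 - 2.0 * sim, 1e-8, None))

    n = len(anchors)
    pos = np.diag(dist)                        # distances of matching pairs
    # Mask the diagonal so the minimum picks a true (non-matching) negative.
    masked = dist + np.eye(n) * 1e6
    # Hardest negative per pair, mined over both rows and columns.
    hardest_neg = np.minimum(masked.min(axis=1), masked.min(axis=0))
    return np.mean(np.maximum(0.0, margin + pos - hardest_neg))
```

In a training loop the same computation would run on the network's descriptor outputs with autodiff; the sketch only shows the mining and loss logic on plain arrays.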