I. Introduction
Efficient and robust extraction of image keypoints and descriptors is critical to many resource-constrained visual measurement applications, such as simultaneous localization and mapping (SLAM) [1], computational photography [2], and visual place recognition [3]. Early methods for keypoint detection and descriptor extraction relied on human-designed heuristics [4], [5], [6]. However, these handcrafted methods are neither efficient nor robust enough for such applications. To address these issues, many data-driven approaches based on deep neural networks (DNNs) have emerged in recent years. Initially, DNNs were used to extract descriptors from image patches at predefined keypoints [7]. Subsequently, the mainstream approach became extracting both keypoints and descriptors with a single network [8], [9], [10], which often yields more robust keypoints and more discriminative descriptors than handcrafted methods [11]. We refer to these methods as map-based methods because they estimate a score map and a descriptor map using two heads, the score map head (SMH) and the descriptor map head (DMH), and then extract keypoints from the score map and descriptors from the descriptor map.
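To make the map-based pipeline concrete, the following minimal Python sketch illustrates the last step only: given a score map and a dense descriptor map, it selects keypoints and reads out their descriptors. It is an illustration under stated assumptions, not the procedure of any cited method. The function names `extract_keypoints` and `sample_descriptors`, the score threshold, the non-maximum suppression window, and the L2 normalization are all hypothetical choices, and the maps are plain nested lists rather than network outputs.

```python
import math


def extract_keypoints(score_map, threshold=0.5, radius=1):
    """Return (row, col) positions whose score is at least `threshold` and
    strictly greater than every neighbour within `radius` pixels.
    This is a simple non-maximum suppression, assumed here for illustration."""
    height, width = len(score_map), len(score_map[0])
    keypoints = []
    for r in range(height):
        for c in range(width):
            score = score_map[r][c]
            if score < threshold:
                continue
            # Neighbours inside the suppression window, clipped to the image.
            neighbours = [
                score_map[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            if all(n < score for n in neighbours):
                keypoints.append((r, c))
    return keypoints


def sample_descriptors(descriptor_map, keypoints):
    """Read the descriptor vector at each keypoint and L2-normalize it
    (the normalization is an assumption, not taken from the text)."""
    descriptors = []
    for r, c in keypoints:
        vector = descriptor_map[r][c]
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        descriptors.append([x / norm for x in vector])
    return descriptors


# Toy 4x4 score map standing in for the output of a score map head.
score_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.4, 0.8],
]
# Toy 4x4x2 descriptor map standing in for the output of a descriptor map head.
descriptor_map = [[[float(r + 1), float(c)] for c in range(4)] for r in range(4)]

keypoints = extract_keypoints(score_map)
descriptors = sample_descriptors(descriptor_map, keypoints)
print(keypoints)  # [(1, 1), (3, 3)]
```

In this toy example the pixel at (2, 3) exceeds the threshold but is suppressed by its stronger neighbour at (3, 3). Practical implementations operate on tensors and may use different selection and sampling strategies.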