I. Introduction
High-quality image/video data with precise and accurate camera metadata is a growing necessity for objective, quantitative evaluation of computer vision methods and pipelines, and for the development of new data-driven, artificial-intelligence-guided systems [1]. Of particular interest to our field of city-scale aerial image/video analysis [2]–[4] are the evaluation and development of fundamental computer vision methods such as feature point detection, description, and matching, which constitute the core of many applications including optical flow estimation, image registration, Bundle Adjustment (BA), and Structure-from-Motion (SfM) [5], as well as larger 3D reconstruction pipelines such as Multi-View Stereo (MVS). Large-scale aerial video of urban scenes suffers from perspective shape distortions and other complications caused by oblique viewing angles, making accurate feature point detection and matching a challenging task. To identify robust processing algorithms that achieve optimal results, it is important to evaluate such operators on scenario-specific data.

Furthermore, deep learning approaches, which have proven successful in many computer vision tasks, generally require large amounts of training data [6], [7]. However, data availability in some domains, such as city-scale aerial video, can be rather limited. In some cases, the lack of accurate ground-truth camera metadata is the main obstacle to quantitative evaluation of feature matching, bundle adjustment, or dense 3D reconstruction methods.