I. Introduction
The task of multi-object tracking (MOT) is to localize multiple moving objects in video frames over time while preserving their identities. MOT remains a challenging problem in computer vision due to heavy occlusions and background clutter, as well as diverse object scales and spatial densities [1], [2]. Although deep learning methods have brought significant progress to MOT in computer vision, its counterpart in remote sensing, or “remote vision,” is still in its infancy. Aerial imagery was previously difficult to exploit for MOT because of the limited level of detail in the images. The development of more advanced camera systems and the availability of very high-resolution aerial images have alleviated these limitations to some extent, enabling a variety of applications ranging from the analysis of ecological systems to aerial surveillance [3], [4]. Aerial imagery efficiently covers wide areas in a short amount of time. Thus, given sufficient image acquisition speed, MOT methods for small moving objects such as pedestrians, vehicles, and ships in aerial image sequences can open new opportunities in disaster management, traffic prediction, and event monitoring. The main differences between MOT in aerial and ground-level datasets are the large number and small size of the moving objects, their multiple scales, and the very low frame rate (e.g., two fps). In addition, the diversity in visibility and weather conditions, the large image sizes, and acquisition by moving cameras add to the complexity of aerial MOT. Despite its practical importance, to the best of our knowledge, only a few research works have dealt with aerial MOT [5]–[7].