1. Introduction
Vehicle track retrieval from traffic cameras [9] is an essential component of upstream systems for urban planning and traffic-flow control. Large-scale retrieval of vehicle tracks is difficult to achieve with conventional image or video retrieval methods, due to the immense variety of motion patterns and vehicle semantics that must be considered. Describing these tracks in Natural Language (NL) is an appealing alternative that lets the retrieval system interact directly with human-given descriptions [33], [2]. The objective of NL-based vehicle track retrieval [9] is to match a given NL description to the corresponding vehicle track. The NL description is given as one or more text queries, and each vehicle track is a sequence of frames from a single camera in which the location of the vehicle is known. This task combines visual and textual modalities, so solutions should simultaneously account for intra- and inter-modality challenges. Vehicle tracks cover a wide variety of vehicle types, colors, and motion types. NL queries often contain variations and ambiguities, since different people can describe the same vehicle semantics and actions differently. The need to identify vehicle maneuvers over a time interval adds further complexity. In contrast to NL-based image or object retrieval [14], [11], an NL-based track retrieval system must address the time dimension of the task, as indicated by the related NL-based visual object tracking task defined in the literature [23], [8].