I. Introduction
Current smart cities are equipped with numerous optical sensors or cameras for data sensing, event detection and recognition, and autonomous event reporting, supporting application domains such as healthcare, security, recommender systems, and surveillance. The enormous volume of data collected by such a dense network of vision sensors creates significant difficulties in analyzing and processing video data to identify events of interest. Thus, video summarization (VS), which automatically extracts a brief yet informative summary of these videos, has recently attracted intense attention across many applications. The current literature contains VS methods based on supervised and unsupervised learning, statistical features, object detection, and action and activity recognition, which are briefly summarized in the subsequent paragraphs.