I. Introduction
Video is an important information medium in security monitoring and smart home applications. The large volume of video content calls for automated methods that summarize and compactly represent the essential content [1]. One promising approach to creating content summaries is dense video captioning, a technique that generates descriptive text for each event in a video [2]. Classification methods, such as those proposed by Naresh et al. [3] and Mostafa et al. [4], take skeleton joints or frames as input and output one of a fixed set of predefined categories. In contrast, the captions generated by dense video captioning methods can include additional information beyond a category label.