
A Space Information-Enhanced Dense Video Caption for Indoor Human Action Recognition


Abstract:

Dense video captioning detects events of interest in untrimmed videos and generates descriptive text for each event. This technology has potential applications in security surveillance and human care. However, current methods often overlook the relationships between objects in the video, which limits their applicability and makes them difficult to adapt to specific domains, such as video summarization for indoor human activities, where human actions are closely intertwined with the objects in the scene. In this paper, we propose a plug-and-play module designed to enhance existing dense video captioning methods with spatial information. Specifically, we extract spatial information about the objects of interest from Red-Green-Blue-Depth (RGB-D) images and image segmentation results. We then integrate this information into the captions generated by the Dense Video Captioning (DVC) method using a fine-tuned Large Language Model (LLM). We evaluate our model on a custom dataset and demonstrate that our system provides a convenient and effective approach to obtaining space-enhanced captions.
Date of Conference: 12-14 January 2024
Date Added to IEEE Xplore: 04 September 2024
Conference Location: Shanghai, China

I. Introduction

Video is an important information medium in security monitoring and smart home applications. The large volume of video content demands automated methods that summarize and compactly represent its essential content [1]. One promising approach to creating content summaries is dense video captioning, a technique that generates descriptive text for each event in a video [2]. Unlike classification methods that take skeleton joints or frames as input and output a predefined category, as in the methods proposed by Naresh et al. [3] and Mostafa et al. [4], the captions generated by dense video captioning methods can carry richer information than a single category label.
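
The paper's spatial-information module is not reproduced here, but as a minimal sketch of the kind of geometry it relies on: given a depth image registered to the RGB frame and a per-object segmentation mask, an object's 3D position can be recovered by pinhole back-projection. The function name, the camera intrinsics (fx, fy, cx, cy), and the 1 m "near" threshold below are illustrative assumptions, not the authors' implementation.

import numpy as np

def object_centroid_3d(depth, mask, fx, fy, cx, cy):
    """Return the centroid (x, y, z) in metres of one segmented object.

    depth : (H, W) float array, depth in metres (0 = invalid pixel)
    mask  : (H, W) bool array, True on the object's pixels
    fx, fy, cx, cy : pinhole camera intrinsics of the RGB-D sensor
    """
    # Select valid depth pixels belonging to the object.
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    # Standard pinhole back-projection into camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

# Hypothetical usage: derive a coarse spatial relation to feed the LLM.
person_c = object_centroid_3d(depth, person_mask, fx, fy, cx, cy)
sofa_c   = object_centroid_3d(depth, sofa_mask, fx, fy, cx, cy)
relation = "near the sofa" if np.linalg.norm(person_c - sofa_c) < 1.0 else "away from the sofa"

Relations of this kind (distances, relative positions) are the sort of spatial facts that a fine-tuned LLM could then weave into the DVC-generated caption.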
