A Space Information-Enhanced Dense Video Caption for Indoor Human Action Recognition

Abstract:

Dense video captioning detects events of interest in untrimmed videos and generates descriptive text for each event. This technology has potential uses in security surveillance and human care applications. However, current methods often overlook the relationships between objects in the video, which limits their applicability and makes them difficult to adapt to specific domains, such as video summarization of indoor human activities. In these scenarios, human activities are closely intertwined with the objects in the scene. In this paper, we propose a plug-and-play module that enhances existing dense video captioning methods with spatial information. Specifically, we extract spatial information about the objects of interest from Red-Green-Blue-Depth (RGB-D) images and image segmentation results. We then integrate this information into the captions generated by the Dense Video Captioning (DVC) method using a fine-tuned Large Language Model (LLM). We evaluate our model on a custom dataset and demonstrate that our system provides a convenient and effective way to obtain space-enhanced captions.
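As a rough illustration of what extracting spatial information from RGB-D images and segmentation results can involve, the sketch below back-projects the pixels of one segmentation mask into 3D camera coordinates and averages them. It is not the authors' implementation. It assumes a pinhole camera model, depth values in metres with 0 marking missing readings, and hypothetical names throughout (`mask_centroid_3d`, `describe_distance`, and the intrinsics `fx`, `fy`, `cx`, `cy`).

```python
import math
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def mask_centroid_3d(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Point3D]:
    """Average the 3D positions of masked pixels that have valid depth.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    """
    sum_x = sum_y = sum_z = 0.0
    count = 0
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object and pixels with no depth reading.
            if inside and z > 0:
                sum_x += (u - cx) * z / fx
                sum_y += (v - cy) * z / fy
                sum_z += z
                count += 1
    if count == 0:
        return None  # the object has no usable depth pixels
    return (sum_x / count, sum_y / count, sum_z / count)


def describe_distance(name_a: str, a: Point3D, name_b: str, b: Point3D) -> str:
    """Turn two object centroids into a short spatial phrase (hypothetical format)."""
    return f"{name_a} is {math.dist(a, b):.2f} m from {name_b}"
```

A phrase produced this way is the kind of spatial fact that a language model could then merge into an existing caption. How the paper actually represents and integrates spatial information is not specified in the abstract.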

Published in: 2024 8th International Conference on Robotics, Control and Automation (ICRCA)

Date of Conference: 12-14 January 2024

Date Added to IEEE Xplore: 04 September 2024

DOI: 10.1109/ICRCA60878.2024.10649311

Conference Location: Shanghai, China

I. Introduction

Video is an important information medium in security monitoring and smart home applications. The large volume of video content creates a need for automated methods that summarize and compactly represent its essential content [1]. One promising approach to creating content summaries is dense video captioning, a technique that generates descriptive text for each event in a video [2]. Classification methods, such as those proposed by Naresh et al. [3] and Mostafa et al. [4], take skeleton joints or frames as input and output one of a fixed set of predefined categories. In contrast, the captions generated by dense video captioning methods are free-form text and can therefore carry additional information beyond a category label.
