
Energy-Based Temporal Summarized Attentive Network for Zero-Shot Action Recognition



Abstract:

Action Recognition (AR) faces a scalability problem: collecting and annotating data for ever-growing action categories is exhausting and impractical. As an alternative, Zero-Shot Action Recognition (ZSAR) is receiving increasing attention in the community, since it can exploit a shared semantic/attribute space to recognize novel categories without annotated data. Unlike AR, which focuses on learning the correlation between actions, ZSAR must consider the correlations of action-action, label-label, and action-label simultaneously. However, to the best of our knowledge, no prior work provides structural guidance for the framework design of ZSAR according to its task characteristics. In this paper, we demonstrate the rationality of using the Energy-Based Model (EBM) to guide the framework design of ZSAR based on their shared inference mechanism. Furthermore, under the guidance of the EBM, we propose an Energy-based Temporal Summarized Attentive Network (ETSAN) for ZSAR. Specifically, to ensure effective cross-modal matching, the EBM needs to capture the correlations of input-input, output-output, and input-output over a discriminative and focused input and output space. To this end, we first design a Temporal Summarized Attentive Mechanism (TSAM) that captures the action-action correlation by constructing a discriminative and focused input space. Then, a Label Semantic Adaptive Mechanism (LSAM) is proposed to learn the label-label correlation by adjusting the semantic structure according to the target task. Finally, we devise an Energy Score Estimation Mechanism (ESEM) to measure the compatibility (i.e., energy score) between the video representation and the label semantic embedding. With end-to-end training, our framework captures all three of the correlations mentioned above simultaneously by minimizing the energy score of the correct action-label pair.
Experiments on the HMDB51 and UCF101 datasets show that the proposed architecture...
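The EBM inference mechanism described in the abstract can be illustrated with a minimal sketch: assign each candidate label an energy (compatibility score) with the video representation and predict the label with the minimum energy. Negative cosine similarity is used here as a stand-in energy function; this is an illustrative assumption, not the paper's learned ESEM, and all names below are hypothetical.

```python
import math

def energy(video_emb, label_emb):
    # Energy = negative cosine similarity: lower energy means higher
    # video-label compatibility. (Illustrative choice; the paper's ESEM
    # learns this score rather than fixing it.)
    dot = sum(v * l for v, l in zip(video_emb, label_emb))
    nv = math.sqrt(sum(v * v for v in video_emb))
    nl = math.sqrt(sum(l * l for l in label_emb))
    return -dot / (nv * nl)

def zero_shot_classify(video_emb, label_embs):
    # EBM inference: pick the (possibly unseen) label whose semantic
    # embedding yields the minimum energy with the video representation.
    scores = [energy(video_emb, e) for e in label_embs]
    return min(range(len(scores)), key=scores.__getitem__)
```

During training, the same scoring function would be driven to assign low energy to the correct action-label pair and higher energy to the others, matching the objective sketched in the abstract.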
Published in: IEEE Transactions on Multimedia ( Volume: 25)
Page(s): 1940 - 1953
Date of Publication: 05 April 2023


I. Introduction

The Action Recognition (AR) task has attracted considerable attention in recent years owing to its wide applications in video surveillance, video understanding [1], [2], human-computer interaction, and robotics [3]. However, with the explosive growth of video data and action categories, supervised action recognition carries the burden of laboriously collecting and annotating large amounts of new action video data [4]. In addition, a supervised action recognition model must be fine-tuned or retrained when it encounters unseen categories [5]. Unlike traditional supervised learning, Zero-Shot Action Recognition (ZSAR) can utilize a shared semantic/attribute space to recognize unseen action categories without collecting corresponding labeled data. ZSAR is therefore receiving ever-increasing attention in the community, as it achieves recognition of unseen categories in a weakly supervised way. However, to the best of our knowledge, compared with traditional fully supervised learning, the framework design of ZSAR still lacks structural guidance.

Cites in Papers - IEEE (2)

1. Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang, "GPT4Ego: Unleashing the Potential of Pre-Trained Models for Zero-Shot Egocentric Action Recognition", IEEE Transactions on Multimedia, vol. 27, pp. 401-413, 2025.
2. Xun Jiang, Xing Xu, Zailei Zhou, Yang Yang, Fumin Shen, Heng Tao Shen, "Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings", IEEE Transactions on Multimedia, vol. 26, pp. 9657-9670, 2024.

Cites in Papers - Other Publishers (2)

1. Faisal Mehmood, Xin Guo, Enqing Chen, Muhammad Azeem Akbar, Arif Ali Khan, Sami Ullah, "Extended Multi-stream Temporal-attention Module for Skeleton-based Human Action Recognition (HAR)", Computers in Human Behavior, pp. 108482, 2024.
2. Limin Xia, Xin Wen, "Zero-shot action recognition by clustered representation with redundancy-free features", Machine Vision and Applications, vol. 34, no. 6, 2023.
