
TRM: Temporal Relocation Module for Video Recognition


Abstract:

One of the key differences between video and image understanding lies in how temporal information is modeled. Due to the limited size of convolution kernels, most previous methods attempt to model long-term temporal information via sequentially stacked convolution layers. Such a conventional approach does not explicitly differentiate regions/pixels with varying temporal receptive requirements and may suffer from temporal information distortion. In this paper, we propose a novel Temporal Relocation Module (TRM), which adaptively captures long-term temporal dependencies in a spatial-aware manner. Specifically, it relocates spatial features along the temporal dimension, aligning an adaptive temporal receptive field at each location within the global temporal interval of the input video. As a result, TRM can potentially model long-term temporal information with an equivalent receptive field spanning the entire video. Experimental results on three representative video recognition benchmarks demonstrate that TRM noticeably outperforms previous state-of-the-art methods, verifying the effectiveness of our approach.
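To make the relocation idea above concrete, below is a minimal PyTorch sketch of one way spatial-aware temporal re-aggregation could be realized: each spatial position predicts attention weights over all frames and re-weights features along time, so its effective temporal receptive field spans the whole clip. The class name, the 1x1x1 scoring convolution, and the residual connection are illustrative assumptions, not the paper's actual TRM design.

    import torch
    import torch.nn as nn

    class TemporalRelocationSketch(nn.Module):
        # Hypothetical sketch of per-pixel temporal re-aggregation; not the
        # authors' implementation. Input is a (N, C, T, H, W) feature volume.
        def __init__(self, channels):
            super().__init__()
            # Assumed design: a 1x1x1 conv scores every (frame, pixel) pair.
            self.score = nn.Conv3d(channels, 1, kernel_size=1)

        def forward(self, x):
            n, c, t, h, w = x.shape
            # Normalize scores over the temporal axis, per spatial location.
            weights = self.score(x).softmax(dim=2)            # (N, 1, T, H, W)
            # Weighted sum over time gives each location a receptive field
            # covering the entire clip.
            pooled = (x * weights).sum(dim=2, keepdim=True)   # (N, C, 1, H, W)
            # Broadcast the aggregated feature back and keep a residual path.
            return x + pooled.expand(-1, -1, t, -1, -1)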
Date of Conference: 04-08 January 2022
Date Added to IEEE Xplore: 15 February 2022
Conference Location: Waikoloa, HI, USA


1. Introduction

Video understanding is an important computer vision task and has been adopted in various scenarios [12], [2], [7], [6], [16], [31]. The recent success of video understanding can be primarily attributed to advances in temporal modeling. However, it remains challenging to aggregate temporal information effectively, especially when distinguishing activities with varying temporal lengths and complex spatial-temporal contexts. In prior works, different algorithms have been proposed for temporal information aggregation. A series of works [18], [27], [32], [46] are built on two-stream 2D CNNs. In such a framework, a separate stream, which relies on extra temporal features (e.g., optical flow), is employed to incorporate the temporal information.
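For readers unfamiliar with the two-stream framework mentioned above, the following is a hedged PyTorch sketch of its typical shape: an RGB appearance stream plus a motion stream over stacked optical-flow fields, fused at the score level. The resnet18 backbone, the number of stacked flow fields, and the score-averaging fusion are assumptions for illustration, not details taken from the cited works.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class TwoStreamSketch(nn.Module):
        # Illustrative two-stream 2D CNN: one stream sees a single RGB frame,
        # the other sees a stack of optical-flow fields carrying motion.
        def __init__(self, num_classes, num_flow_pairs=10):
            super().__init__()
            self.rgb_stream = resnet18(num_classes=num_classes)
            self.flow_stream = resnet18(num_classes=num_classes)
            # Flow input has 2 channels (x/y displacement) per frame pair, so
            # the first conv of the motion stream is replaced accordingly.
            self.flow_stream.conv1 = nn.Conv2d(
                2 * num_flow_pairs, 64, kernel_size=7,
                stride=2, padding=3, bias=False)

        def forward(self, rgb, flow):
            # rgb: (N, 3, H, W); flow: (N, 2*num_flow_pairs, H, W)
            # Assumed fusion rule: average the two streams' class scores.
            return (self.rgb_stream(rgb) + self.flow_stream(flow)) / 2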
