1. Introduction
The objective of temporal action detection (TAD) is to identify both the temporal interval (i.e., the start and end points) and the class label of all action instances in an untrimmed video [3], [7]. Given a test video, existing TAD methods typically generate a set of action instance candidates via proposal generation, based on regressing predefined anchor boxes [4], [6], [13], [23], directly predicting the start and end times of proposals [2], [9], [10], [15], [25]–[27], or global segmentation masking [14]. To facilitate deep model design and improve computational efficiency, most TAD methods pre-process a varying-length video into a fixed-length snippet sequence by first extracting frame-level visual features with a frozen video encoder and subsequently sampling a smaller number of feature points (i.e., snippets) evenly (see Fig. 1(a)). As a result, a TAD model performs inference at a lower temporal resolution. This introduces a temporal quantization error that can hamper model performance. For instance, when the video temporal resolution is decreased from 400 to 25, the performance of BMN [9] degrades significantly from 34.0% to 28.1% in mAP on ActivityNet. Despite the obvious connection between this error and the performance degradation, the problem is largely ignored by existing methods.
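To make the source of this error concrete, the following minimal sketch (not any paper's actual implementation; all names and numbers are illustrative) shows how mapping a frame-level boundary onto a coarse snippet grid and back loses precision:

```python
# Minimal sketch of the temporal quantization error caused by
# snippet-level inference (illustrative only, not the paper's code).

def snippet_index(frame, num_frames, num_snippets):
    # Pre-processing: map a frame-level boundary to its nearest snippet index.
    return round(frame * num_snippets / num_frames)

def frame_from_snippet(idx, num_frames, num_snippets):
    # Inference: map a snippet-level prediction back to the original frame grid.
    return idx * num_frames / num_snippets

num_frames = 400    # original temporal resolution
num_snippets = 25   # rescaled snippet sequence length (as in the BMN example)

true_start = 133    # hypothetical ground-truth boundary, in frames
idx = snippet_index(true_start, num_frames, num_snippets)
recovered = frame_from_snippet(idx, num_frames, num_snippets)
error = abs(recovered - true_start)
# Snippet spacing is num_frames / num_snippets = 16 frames, so the
# worst-case round-trip error is 8 frames; here error == 5.
```

Predictions made at snippet granularity can therefore never be more precise than half the snippet spacing, which is exactly the gap that sub-snippet-level post-processing aims to close.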
Fig. 1: A typical pipeline for temporal action detection. (a) For efficiency and ease of model design, temporal resolution reduction is often applied during pre-processing, causing the model to perform inference at a lower (coarse) temporal resolution. (b) When the prediction results are mapped back to the original temporal resolution during inference, quantization error is inevitably introduced.
Conventional snippet-level TAD inference along with our proposed sub-snippet-level post-processing.