1. Introduction
The objective of temporal action detection (TAD) is to identify both the temporal interval (i.e., the start and end points) and the class label of all action instances in an untrimmed video [3], [7]. Given a test video, existing TAD methods typically generate a set of action instance candidates via proposal generation, based on regressing predefined anchor boxes [4], [6], [13], [23], directly predicting the start and end times of proposals [2], [9], [10], [15], [25]–[27], or global segmentation masking [14]. To facilitate deep model design and improve computational efficiency, most TAD methods pre-process a varying-length video into a fixed-length snippet sequence by first extracting frame-level visual features with a frozen video encoder and subsequently sampling a smaller number of feature points (i.e., snippets) evenly (see Fig. 1(a)). As a result, a TAD model performs inference at a lower temporal resolution. This introduces a temporal quantization error that can hamper model performance. For instance, when the video temporal resolution is decreased from 400 to 25, the performance of BMN [9] degrades significantly from 34.0% to 28.1% in mAP on ActivityNet. Despite the obvious connection between this error and the performance degradation, the problem is largely ignored by existing methods.
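To make the source of this error concrete, the following minimal sketch (not any paper's actual implementation; all names and numbers are illustrative) shows how mapping a frame-level boundary onto a coarse snippet grid and back loses precision:

```python
# Minimal sketch of the temporal quantization error caused by
# snippet-level inference (illustrative only, not the paper's code).

def snippet_index(frame, num_frames, num_snippets):
    # Pre-processing: map a frame-level boundary to its nearest snippet index.
    return round(frame * num_snippets / num_frames)

def frame_from_snippet(idx, num_frames, num_snippets):
    # Inference: map a snippet-level prediction back to the original frame grid.
    return idx * num_frames / num_snippets

num_frames = 400    # original temporal resolution
num_snippets = 25   # rescaled snippet sequence length (as in the BMN example)

true_start = 133    # hypothetical ground-truth boundary, in frames
idx = snippet_index(true_start, num_frames, num_snippets)
recovered = frame_from_snippet(idx, num_frames, num_snippets)
error = abs(recovered - true_start)
# Snippet spacing is num_frames / num_snippets = 16 frames, so the
# worst-case round-trip error is 8 frames; here error == 5.
```

Predictions made at snippet granularity can therefore never be more precise than half the snippet spacing, which is exactly the gap that sub-snippet-level post-processing aims to close.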
Fig. 1: A typical pipeline for temporal action detection. (a) For efficiency and ease of model design, temporal resolution reduction is often applied during pre-processing, causing the model to perform inference at a lower (coarse) temporal resolution. (b) When the prediction results are mapped back to the original temporal resolution during inference, quantization error is inevitably introduced.
Conventional snippet-level TAD inference along with our proposed sub-snippet-level post-processing.