Abstract:
Temporal sentence grounding (TSG) aims to locate a semantically related segment of an untrimmed video guided by a sentence query. Since the untrimmed videos are too long,...Show MoreMetadata
Abstract:
Temporal sentence grounding (TSG) aims to locate a semantically related segment of an untrimmed video guided by a sentence query. Since the untrimmed videos are too long, almost all existing TSG works first sparsely down-sample each video into a shorter video of a fixed length and then conduct multimodal interactions with the query sentence for reasoning. However, this video down-sampling process may introduce a challenging issue that confuses the latter grounding process: Due to the video down-sampling, some query-related frames may be filtered out; this process may remove the specific boundary frames of the target segment and take the adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary-bias and reasoning-bias. Therefore, it is important to keep the grounding consistency (both temporal annotations and boundary predictions) between the original and the sampled videos. To this end, in this paper, we propose a novel Conditional Video Diffusion Network (CVDN) for TSG to learn extra visual semantics to enrich and refine the biased new boundaries, which enables soft-label boundary prediction for fine-grained frame-query reasoning. Specifically, we first construct a conditional video diffusion model which is separately trained to recover the consecutive semantics of the filtered frames between the adjacent sampled frames. Through the designed stochastic interval sampling strategies in the training process, this diffusion model is able to generate absent coherent semantics between the sparsely sampled frames and in turn enrich and refine them, benefiting the integral activity understanding for TSG. In this manner, the incorrect new boundaries will be refined to be closely correlated to the original boundary frames and contain sufficient query-related information, which is crucial for accurate segment prediction. Extensive experiments on three challenging datasets demonstrate the effectiveness of CVDN.
Published in: IEEE Transactions on Multimedia ( Volume: 26)