I. Introduction
With the growing popularity of mobile devices and online video platforms, people are generating and consuming a huge amount of video content every day. Recent studies have shown that over 1 billion hours of video are watched daily on YouTube. This trend has encouraged advanced techniques to precisely recognize actions or events in the videos [27], [35], [39], which can benefit a wide range of applications, including video summarization, video retrieval, and video recommendation.