1. Introduction
Video understanding is an important computer vision task and has been adopted in various scenarios [12], [2], [7], [6], [16], [31]. Its recent success can be primarily attributed to advances in temporal modeling. However, it remains challenging to aggregate temporal information effectively, especially when distinguishing activities of varying temporal lengths and with complex spatial-temporal contexts. In previous works, different algorithms have been proposed for temporal information aggregation. A series of works [18], [27], [32], [46] is built on two-stream 2D CNNs. In such a framework, a separate stream, which relies on extra temporal features (e.g., optical flow), is employed to incorporate the temporal information.
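As a rough illustration (a minimal sketch, not the exact design of any cited work), the two-stream idea can be expressed as two independent classifiers, one over RGB appearance features and one over optical-flow motion features, whose class scores are combined by late fusion; all dimensions and the linear classifiers below are hypothetical stand-ins for the learned 2D CNN streams:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_classes = 4

# Hypothetical stand-ins for the two learned streams: each maps its
# own feature vector (RGB appearance vs. optical-flow motion) to
# class logits. Real systems use full 2D CNNs here.
W_spatial = rng.normal(size=(512, num_classes))
W_temporal = rng.normal(size=(256, num_classes))

rgb_feat = rng.normal(size=512)    # appearance features from a frame
flow_feat = rng.normal(size=256)   # motion features from stacked flow

# Late fusion: average the per-stream class probabilities, so the
# temporal stream contributes motion cues the spatial stream lacks.
p_spatial = softmax(rgb_feat @ W_spatial)
p_temporal = softmax(flow_feat @ W_temporal)
p_fused = 0.5 * (p_spatial + p_temporal)

prediction = int(p_fused.argmax())
```

The key design point this sketch captures is that the temporal stream is a *separate* network consuming precomputed optical flow, which is what makes two-stream 2D CNNs costly compared with methods that model time inside a single network.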