1. Introduction
Human action recognition is a fundamental yet challenging task with considerable efforts having been investigated for decades. Motivated by the notable-success of convolutional neural networks (CNNs) for visual recognition in still images, many recent works take advantage of deep models to train end-to-end networks for recognizing actions in videos [25], [9], [18], [35], [40], [20], [30], which significantly outperform hand-crafted representation learning methods [33], [23], [32], [17].
Illustration of our proposed MiCT that integrates 2D cnns with the 3D convolution for the spatio-temporal feature learning.