1. Introduction
Deep learning and especially Convolutional Neural Networks (CNNs) have revolutionized image-based tasks, e.g., image classification [16] and objective detection [32]. However, the progress on video analysis is still far from satisfactory, reflecting the difficulty associated with learning representations for spatiotemporal data. We believe that the major obstacle is that the distinctive motion cues in videos demand some new network designs, which are yet to be found and tested.