1. Introduction
In recent years, deep learning has established itself as a general foundation for solving a variety of machine intelligence problems, especially in the domains of object recognition, activity understanding, and natural language processing. Although these systems serve different purposes, they share a common principle: a complex high-level identification problem can be approximated by a deep neural network, typically implemented as a convolutional neural network (CNN) or a recurrent neural network (RNN). Among CNN structures, one of the most widely adopted examples is the Inception module defined in [1], also known as GoogLeNet; by cascading such modules into a pipeline, the network achieves high recognition accuracy. Earlier work can be traced back to [2], which defined the AlexNet structure, while recent studies such as the VGG network [3] and ResNet [4] have also drawn considerable attention from the community. In fact, by investigating the interaction between the data and the network layers, one can infer that the convolutional and pooling layers essentially act as feature extractors, while the dense layers, a.k.a. fully connected layers, synthesize the extracted features into a final decision. As such, typical applications of CNNs focus on static scene analysis, for example still-image analysis.

In contrast, the recurrent structure of RNNs defines how information flows over time, which makes them better suited to temporal processing systems. Recent studies have made considerable efforts to introduce RNNs into time-sensitive processing systems. For example, Long Short-Term Memory (LSTM), a special kind of RNN, has been applied to human action recognition [5], [6].
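The feature-extractor role of convolutional and pooling layers described above can be made concrete with a minimal NumPy sketch. This is an illustration, not an implementation from any of the cited networks: the kernel, image, and helper names are invented for demonstration, and the convolution is the unpadded cross-correlation commonly used in deep learning.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, no padding, stride 1):
    slides the kernel over the image, producing a feature map of local responses."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keeps the strongest response in each
    size x size window, reducing spatial resolution."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    cropped = fmap[:h * size, :w * size]
    return cropped.reshape(h, size, w, size).max(axis=(1, 3))

# A toy vertical-edge detector (an illustrative hand-crafted kernel):
# the conv layer responds only where the pattern occurs, and pooling
# summarizes that response into a coarser feature map.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # right half bright: a vertical edge
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
features = max_pool(conv2d(image, edge_kernel))
```

In a trained CNN the kernels are learned rather than hand-crafted, and many such conv/pool stages are stacked before the fully connected layers make the final decision.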
In brief, the CNN excels at spatial feature extraction, while the LSTM is better suited to sequential information analysis.
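To illustrate what makes the LSTM suitable for sequential analysis, the following is a minimal NumPy sketch of a single standard LSTM cell step. The weight shapes, gate ordering, and random inputs are illustrative assumptions, not taken from the cited works; the point is that the same cell is reused at every time step while gated cell state carries information forward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One forward step of a standard LSTM cell.
    x: input vector; h, c: previous hidden and cell states.
    W has shape (4*H, X+H); b has shape (4*H,). The four gate
    pre-activations are stacked as: input, forget, candidate, output."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:H])             # input gate: how much new info to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate cell update
    o = sigmoid(z[3 * H:])         # output gate: how much state to expose
    c_new = f * c + i * g          # cell state carries long-term memory
    h_new = o * np.tanh(c_new)     # hidden state is this step's output
    return h_new, c_new

# Run a short sequence through the cell with small random weights.
rng = np.random.default_rng(0)
X, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, X + H))
b = np.zeros(4 * H)
h = np.zeros(H)
c = np.zeros(H)
for t in range(5):                 # the same cell is applied at every step
    x = rng.normal(size=X)
    h, c = lstm_step(x, h, c, W, b)
```

In practice the weights are learned by backpropagation through time, and the recurrence over the sequence is exactly the loop shown above.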