I. Introduction
Deep learning is playing an important role even in embedded applications such as face recognition and speech recognition. Among Deep Neural Networks (DNNs), the Recurrent Neural Network (RNN) can process sequential data such as speech and video. Combined with a static object classifier such as a Convolutional Neural Network (CNN), it is very effective for image captioning and visual question answering [1]. However, running deep learning applications in embedded environments with limited power and performance requires a dedicated low-power, high-performance deep learning accelerator. Although several dedicated CNN accelerators have been reported [2], [3], no RNN semiconductor chip has been reported, because RNNs consume heavy external memory bandwidth.