I. Introduction
Reinforcement learning (RL) [1] is one of the most important learning methods for constructing robot controllers. In many tasks, whether simple or complex, human engineers possess only scattered pieces of knowledge about how to explicitly define an optimal policy, which is not enough to design a fixed, hand-programmed controller. Reinforcement learning can tie these pieces of knowledge together by representing them with two elementary concepts: a Markov Decision Process (MDP) and a reward function. Classical reinforcement learning models the process of task execution as an MDP and iteratively approximates the optimal policy defined by the Bellman optimality equation, in which the reward function appears as a component. Reinforcement learning covers a wide range of algorithms, such as SARSA [2], Q-learning [3], and their variants. In their early forms, these algorithms use a table to record the estimated value of every action in every state encountered during the MDP. Such table-based algorithms greatly limit RL's applicability to tasks with a large discrete state set or a continuous state space. Therefore, neural network approximators have been introduced into reinforcement learning to generalize beyond the table, and many successful applications have been reported in fields such as robotics, computer games, and unmanned vehicles.
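For reference, the Bellman optimality equation mentioned above can be stated in its standard action-value form, where r(s,a) denotes the reward function, P(s'|s,a) the transition probability of the MDP, and gamma in [0,1) the discount factor; the tabular Q-learning update [3] with step size alpha then approximates its fixed point from sampled transitions (s, a, r, s'):

Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, \max_{a'} Q^*(s',a'),

Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big].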
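As a minimal sketch of the table-based approach, and of why it does not scale beyond small discrete state sets, the following code runs tabular Q-learning on a small Gym-style environment. The environment (FrozenLake-v1 from the gymnasium package) and the hyper-parameters are illustrative assumptions, not settings used in this work.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n

# The entire value estimate is one table with an entry per (state, action)
# pair; this explicit enumeration is what breaks down for large discrete
# state sets or continuous state spaces.
Q = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection from the table
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update toward the Bellman optimality target
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

Replacing the table Q with a neural network that maps states (or state features) to action values is the generalization step referred to above.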