I. INTRODUCTION
Deep reinforcement learning (RL), equipped with high-capacity deep neural networks, has been applied to solve various complex decision-making tasks, including chess games [1], [2] and video games [3], [4]. In robotics, the ultimate goal of RL is to endow robots with the ability to learn, improve, adapt, and reproduce tasks, e.g., robotic manipulation [5], [6], [7], robot navigation [8], [9], robot competition [10], and other robotic control tasks [11], [12], [13], [14]. However, RL applications in robotics suffer from the poor sample efficiency of RL [15]: even for a simple task, RL still requires substantial interaction data to improve the policy. Poor sample efficiency not only slows down policy improvement but also brings about other deleterious problems for deep RL, such as memorization and sensitivity to out-of-manifold samples [16], [17].

Generally, an RL agent gathers data for policy improvement over the course of learning, which means that at the early stage of learning the amount of training data is small and the deep neural network is prone to memorizing (instead of generalizing from) the training data [18]. Unfortunately, such memorization causes the bootstrapped value function to generalize poorly to unvisited (out-of-manifold) state-action combinations, which hampers policy improvement via Bellman backups [19]. Moreover, it also degrades the agent's performance in the environment, because the agent prefers to explore unvisited states.
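To make concrete why generalization matters for the Bellman backup, the following minimal sketch (ours, not taken from the cited works) shows a one-step temporal-difference update for a bootstrapped Q-network; the network architecture, dimensions, and discount factor are illustrative assumptions. The target is computed at next state-action pairs that the network may never have been trained on, so if the network has merely memorized the visited tuples, the bootstrapped targets carry its out-of-manifold errors into the update.

```python
# Minimal sketch (illustrative assumptions throughout): a one-step Bellman
# backup with a bootstrapped Q-network, highlighting where out-of-manifold
# generalization enters the update.
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 8, 2, 0.99

q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))
target_q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                             nn.Linear(64, 1))
target_q_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)

# A small replay batch (s, a, r, s', a'); early in training this batch is
# tiny, so the network can simply memorize these exact tuples.
batch = 32
s  = torch.randn(batch, state_dim)
a  = torch.randn(batch, action_dim)
r  = torch.randn(batch, 1)
s2 = torch.randn(batch, state_dim)
a2 = torch.randn(batch, action_dim)  # next actions proposed by the policy

# Bellman backup: the target queries Q at (s', a') pairs that may be
# unvisited; poor generalization there makes the bootstrapped target
# unreliable, and the error propagates through subsequent updates.
with torch.no_grad():
    target = r + gamma * target_q_net(torch.cat([s2, a2], dim=-1))

q_pred = q_net(torch.cat([s, a], dim=-1))
loss = nn.functional.mse_loss(q_pred, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"TD loss: {loss.item():.4f}")
```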