I. Introduction
Deep reinforcement learning (DRL) algorithms have been applied to a wide range of challenging fields, from chess and card games [1], knowledge reasoning [2], recommendation systems [3], [4], and causal reasoning [5] to robotics [6]. Nevertheless, DRL still struggles to solve practical problems well, owing to slow policy convergence and low robustness. In the continuous control of various robots, the earlier an algorithm converges to a good performance level, the more efficiently robots can work in complex environments. Meanwhile, if an algorithm is not robust, its performance fluctuations may cause severe losses [7]. Unfortunately, existing models cannot maintain both high convergence speed and robustness in the presence of value misestimation and gradient variance, resulting in suboptimal policy updates. Both of these challenges degrade the performance of reinforcement learning algorithms.