
QVDDPG: QV Learning with Balanced Constraint in Actor-Critic Framework


Abstract:

The actor-critic framework has achieved tremendous success in a great many decision-making scenarios. Nevertheless, when updating the value of new states and actions in long-horizon settings, these methods suffer from value misestimation and high gradient variance, which significantly reduce the convergence speed and robustness of the policy. These problems severely limit the application scope of these methods. In this paper, we propose QVDDPG, a deep RL algorithm based on an iterative target value update process. The QV learning method alleviates misestimation by exploiting the guidance of the Q value and the fast convergence of the V value, thereby accelerating convergence. In addition, the actor uses a constrained balanced gradient and establishes a hidden state for the continuous action space network to improve the robustness of the model. We give the update relations among the value functions and the constraint conditions for gradient estimation. We evaluate our method on PyBullet and achieve state-of-the-art performance. Moreover, we demonstrate that our method attains higher robustness and faster convergence across different tasks than other algorithms.
Date of Conference: 18-23 June 2023
Date Added to IEEE Xplore: 02 August 2023
Conference Location: Gold Coast, Australia


I. Introduction

Deep reinforcement learning (DRL) algorithms have been applied in a wide range of challenging fields, from chess and card games [1], knowledge reasoning [2], recommendation systems [3], [4], and causal reasoning [5] to robotics [6]. Still, they often fail to solve practical problems well because of slow policy convergence and low robustness. In continuous control tasks for various robots, if the algorithm converges to a good performance level earlier, the robots' working efficiency in complex environments improves. Meanwhile, if the algorithm is not robust, its fluctuations may cause large losses [7]. Unfortunately, existing models cannot maintain high convergence speed and robustness in the presence of value misestimation and gradient variance, resulting in suboptimal policy updates. Both of these challenges degrade the performance of reinforcement learning algorithms.
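The paper's equations are not included in this excerpt, so the following is only a minimal PyTorch sketch of the general idea stated in the abstract: training a state-value network V alongside the Q critic and using a V-based target (in the spirit of QV-learning) for both value functions in a DDPG-style loop. All network sizes, hyperparameters, and the Polyak averaging step are illustrative assumptions; the balanced-gradient constraint and the hidden state for the action network described in the abstract are omitted, so this is not the authors' QVDDPG implementation.

```python
# Hypothetical sketch of QV-style value targets in a DDPG-like actor-critic loop.
# Assumed detail: both Q(s, a) and V(s) regress toward r + gamma * V_target(s').
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, act_dim, gamma = 8, 2, 0.99     # illustrative dimensions only
actor = MLP(obs_dim, act_dim)            # deterministic policy pi(s)
q_net = MLP(obs_dim + act_dim, 1)        # action-value critic Q(s, a)
v_net = MLP(obs_dim, 1)                  # state-value critic V(s)
v_target = MLP(obs_dim, 1)               # slowly updated target copy of V
v_target.load_state_dict(v_net.state_dict())

q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(v_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def update(batch):
    """One update step on a batch of transitions (s, a, r, s2, done)."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        # Shared QV-learning target: reward plus discounted V of the next state.
        y = r + gamma * (1.0 - done) * v_target(s2)
    # Both critics regress toward the same V-based target.
    q_loss = ((q_net(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
    v_loss = ((v_net(s) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()
    # Actor follows the deterministic policy gradient through the Q critic.
    pi_loss = -q_net(torch.cat([s, actor(s)], dim=-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak averaging of the V target network.
    with torch.no_grad():
        for p, tp in zip(v_net.parameters(), v_target.parameters()):
            tp.mul_(0.995).add_(0.005 * p)
```

Here a batch is assumed to be a tuple of float tensors with r and done shaped [batch_size, 1]; bootstrapping through V(s') rather than Q(s', pi(s')) is what lets the fast-converging V estimate guide the Q critic, which is the intuition the abstract attributes to QV learning.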
