
QVDDPG: QV Learning with Balanced Constraint in Actor-Critic Framework



Abstract:

The actor-critic framework has achieved tremendous success in a great many decision-making scenarios. Nevertheless, when updating the values of new states and actions over long horizons, these methods suffer from misestimation and high gradient variance, which significantly reduce the convergence speed and robustness of the policy and severely limit their range of application. In this paper, we propose QVDDPG, a deep RL algorithm based on an iterative target-value update process. The QV learning method alleviates misestimation by combining the guidance of the Q value with the fast convergence of the V value, thereby accelerating convergence. In addition, the actor uses a constrained, balanced gradient and maintains a hidden state for the continuous-action-space network to improve the robustness of the model. We derive the update relations among the value functions and the constraint conditions for gradient estimation. We evaluate our method on PyBullet, where it achieves state-of-the-art performance, and we demonstrate that it has higher robustness and faster convergence across different tasks than other algorithms.
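To make the target-value relations described above concrete, the following PyTorch-style sketch (illustrative only, not the authors' implementation; the network sizes, the name qv_critic_step, and all hyperparameters are assumptions) shows one way a QV-style critic update can sit inside a DDPG-like agent: the state-value network V is trained with an ordinary TD target, and the action-value network Q is regressed toward the same V-based target, so the actor is still guided by Q while bootstrapping goes through the faster-converging, action-independent V. The paper's balanced-constraint gradient and hidden-state mechanism are not reproduced here.

import torch
import torch.nn as nn

class VNet(nn.Module):
    # State-value network V(s); one hidden layer is enough for illustration.
    def __init__(self, s_dim, h=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(s_dim, h), nn.ReLU(), nn.Linear(h, 1))
    def forward(self, s):
        return self.f(s).squeeze(-1)

class QNet(nn.Module):
    # Action-value network Q(s, a) over a continuous action space.
    def __init__(self, s_dim, a_dim, h=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(s_dim + a_dim, h), nn.ReLU(), nn.Linear(h, 1))
    def forward(self, s, a):
        return self.f(torch.cat([s, a], dim=-1)).squeeze(-1)

def qv_critic_step(q_net, v_net, v_target, q_opt, v_opt, s, a, r, s2, done, gamma=0.99):
    with torch.no_grad():
        # One shared bootstrap target built from the slow-moving target V network;
        # no max over actions appears here, which tempers Q overestimation.
        y = r + gamma * (1.0 - done) * v_target(s2)
    # V-learning step: regress V(s) toward r + gamma * V(s').
    v_loss = nn.functional.mse_loss(v_net(s), y)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()
    # Q-learning step: regress Q(s, a) toward the same V-based target,
    # so a DDPG-style actor can still be updated from Q.
    q_loss = nn.functional.mse_loss(q_net(s, a), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    return v_loss.item(), q_loss.item()

# Illustrative usage with random data (state dim 8, action dim 2, batch 32).
s_dim, a_dim, B = 8, 2, 32
v_net, v_target, q_net = VNet(s_dim), VNet(s_dim), QNet(s_dim, a_dim)
v_target.load_state_dict(v_net.state_dict())
v_opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
batch = (torch.randn(B, s_dim), torch.randn(B, a_dim), torch.randn(B),
         torch.randn(B, s_dim), torch.zeros(B))
print(qv_critic_step(q_net, v_net, v_target, q_opt, v_opt, *batch))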
Date of Conference: 18-23 June 2023
Date Added to IEEE Xplore: 02 August 2023

Conference Location: Gold Coast, Australia


I. Introduction

Deep reinforcement learning (DRL) algorithms have been applied to a wide range of challenging fields, from intelligent chess and card games [1], knowledge reasoning [2], recommendation systems [3], [4], and causal reasoning [5] to robotics [6]. Still, they often fail to solve practical problems well because of slow policy convergence and low robustness. In continuous control tasks for various robots, an algorithm that converges to a good performance level earlier improves the robots' working efficiency in complex environments. Meanwhile, if the algorithm is not robust, its fluctuations may cause large losses [7]. Unfortunately, in the presence of value misestimation and high gradient variance, existing models cannot maintain both fast convergence and robustness, which leads to suboptimal policy updates. Both of these challenges degrade the performance of reinforcement learning algorithms.
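To make the misestimation problem concrete, the short numerical sketch below (illustrative only, not taken from the paper; the number of actions and the noise scale are arbitrary assumptions) reproduces the effect analyzed in [22] and [28]: a bootstrap target that maximizes over zero-mean noisy Q estimates is biased upward even though every individual estimate is unbiased. Bootstrapping through a state-value estimate, as in the QV-style update sketched after the abstract, is one way to avoid this maximization over the action space.

import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                              # ten actions, all truly worth 0
noise = rng.normal(0.0, 1.0, size=(100_000, 10))   # zero-mean estimation error
bootstrapped = (true_q + noise).max(axis=1)        # target built from a max over noisy estimates

print(true_q.max())          # 0.0   -> the true optimal value
print(bootstrapped.mean())   # ~1.54 -> systematic overestimation of the target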

References
[1] D. Zha, J. Xie, W. Ma, S. Zhang, X. Lian, X. Hu, et al., "DouZero: Mastering DouDizhu with self-play deep reinforcement learning", International Conference on Machine Learning, pp. 12333-12344, 2021.
[2] W. Xiong, T. Hoang and W. Y. Wang, "DeepPath: A reinforcement learning method for knowledge graph reasoning", arXiv preprint, 2017.
[3] R. Zhang, T. Yu, Y. Shen and H. Jin, "Text-Based Interactive Recommendation via Offline Reinforcement Learning", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11694-11702, 2022.
[4] X. Chen, L. Yao, J. McAuley, G. Zhou and X. Wang, "A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions", arXiv preprint, 2021.
[5] T. Herlau and R. Larsen, "Reinforcement Learning of Causal Variables Using Mediation Analysis", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 06, pp. 6910-6917, 2022.
[6] S. Gu, E. Holly, T. Lillicrap and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates", 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389-3396, 2017.
[7] F. Pan, J. He, D. Tu and Q. He, "Trust the Model When It Is Confident: Masked Model-based Actor-Critic", arXiv preprint, 2020.
[8] X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu and Z. Li, "MetaLight: Value-based meta-reinforcement learning for traffic signal control", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, pp. 1153-1160, 2020.
[9] M. Mohammadi, M. M. Arefi, P. Setoodeh and O. Kaynak, "Optimal tracking control based on reinforcement learning value iteration algorithm for time-delayed nonlinear systems with external disturbances and input constraints", Information Sciences, vol. 554, pp. 84-98, 2021.
[10] C. J. Watkins and P. Dayan, "Q-learning", Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
[11] Y. Zhang and K. W. Ross, "On-policy deep reinforcement learning for the average-reward criterion", International Conference on Machine Learning, pp. 12535-12545, 2021.
[12] A. X. Lee, A. Nagabandi, P. Abbeel and S. Levine, "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model", Advances in Neural Information Processing Systems, vol. 33, pp. 741-752, 2020.
[13] R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation", Advances in Neural Information Processing Systems, vol. 12, 1999.
[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, "Proximal policy optimization algorithms", arXiv preprint, 2017.
[15] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, et al., "Distributed distributional deterministic policy gradients", arXiv preprint, 2018.
[16] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, et al., "Distributed prioritized experience replay", arXiv preprint, 2018.
[17] L. Li and A. A. Faisal, "Bayesian distributional policy gradients", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, pp. 8429-8437, 2021.
[18] J. Schulman, S. Levine, P. Abbeel, M. Jordan and P. Moritz, "Trust region policy optimization", International Conference on Machine Learning, pp. 1889-1897, 2015.
[19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, et al., "Asynchronous methods for deep reinforcement learning", International Conference on Machine Learning, pp. 1928-1937, 2016.
[20] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, "Deterministic policy gradient algorithms", International Conference on Machine Learning, pp. 387-395, 2014.
[21] S. Whiteson, "Loaded DiCE: Trading off Bias and Variance in Any Order Score Function Estimators for Reinforcement Learning", arXiv preprint, 2019.
[22] S. Thrun and A. Schwartz, "Issues in using function approximation for reinforcement learning", Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, vol. 6, pp. 1-9, 1993.
[23] S. Mannor, D. Simester, P. Sun and J. N. Tsitsiklis, "Bias and variance approximation in value function estimates", Management Science, vol. 53, no. 2, pp. 308-322, 2007.
[24] R. S. Sutton, "Learning to predict by the methods of temporal differences", Machine Learning, vol. 3, no. 1, pp. 9-44, 1988.
[25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, et al., "Continuous control with deep reinforcement learning", arXiv preprint, 2015.
[26] J. Schulman, P. Moritz, S. Levine, M. Jordan and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation", arXiv preprint, 2015.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, et al., "Human-level control through deep reinforcement learning", Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[28] H. van Hasselt, A. Guez and D. Silver, "Deep reinforcement learning with double Q-learning", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[29] S. Fujimoto, H. van Hoof and D. Meger, "Addressing function approximation error in actor-critic methods", International Conference on Machine Learning, pp. 1587-1596, 2018.
[30] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, et al., "OpenAI Gym", arXiv preprint, 2016.
