I. Introduction
Adaptive critic learning (ACL), also known as adaptive critic design, has proven to be a powerful technique for solving optimization problems [1]–[3]. Its success relies mainly on an actor–critic structure, in which the actor applies a control policy to the system or environment, while the critic evaluates the cost incurred by that policy and provides reward/punishment signals to the actor. A significant advantage of the actor–critic structure is that, by employing dual neural networks (NNs) for the actor and the critic, it can alleviate the well-known “curse of dimensionality.” In the computational intelligence community, the actor–critic structure is a typical architecture used in adaptive dynamic programming (ADP) [4] and reinforcement learning (RL) [5]. Because ADP and RL have much in common with ACL (e.g., the same implementation structure), they are generally regarded as synonyms for ACL. In this paper, we treat ADP and RL as members of the ACL family.

Over the past few decades, many ADP and RL methods have been developed to handle optimal control problems, such as goal representation ADP [6], Hamiltonian-driven ADP [7], policy/value iteration ADP [8]–[10], robust ADP [11], [12], online RL [13]–[15], and off-policy RL [16]–[18]. Recently, building on the work of Lin [19], which established a relationship between robust control and optimal control, ACL has been successfully applied to robust control problems [20]–[22]. However, most of these ACL algorithms require the controlled systems to be persistently exciting during implementation. Unfortunately, verifying the persistence of excitation (PE) condition is often intractable, especially for nonlinear systems.
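For reference, the PE condition mentioned above can be stated in its standard form from the adaptive control literature; here $\varphi(t)\in\mathbb{R}^{N}$ denotes a generic bounded regressor vector (e.g., the activation vector of a critic NN), and the constants are illustrative rather than tied to any specific algorithm in this paper:
\[
\exists\, T>0,\ \beta_{1},\beta_{2}>0:\quad
\beta_{1} I_{N} \le \int_{t}^{t+T} \varphi(\tau)\,\varphi^{\mathsf T}(\tau)\, d\tau \le \beta_{2} I_{N},
\qquad \forall\, t\ge 0,
\]
where $I_{N}$ is the $N\times N$ identity matrix. In practice, this requirement is typically enforced by injecting probing noise into the control input, and verifying it a priori for nonlinear systems is difficult.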