Introduction
A. Related Works
The tracking control of wheeled mobile robots is one of the fundamental functions of autonomous robot navigation, and it has been widely used in inspection, security, cleaning, planetary exploration, military applications and so on. A wheeled mobile robot can be classified as a nonlinear system with multiple inputs and multiple outputs, and it is also an underactuated system with uncertainty. In addition, a wheeled mobile robot is subject to nonholonomic constraints, which make it challenging to construct a controller with the desired performance. As a result, tracking control of the nonholonomic wheeled mobile robot (NWMR) has been a research focus for the past decades.
Due to the existence of the nonholonomic constraints of the WMR, the approaches to tracking control include both kinematics control and dynamics control, which has become the basic pipeline of tracking control for the NWMR. Kinematics control is used to track the desired pose with the WMR's speed commands; thus, a time-varying controller based on Lyapunov theory was proposed in [1]. A kinematics model in chained form [2], [3] was used to transform this complex system into a more convenient one. Besides, a model described in polar coordinates [4] was also reported to design a robust system. In [5], a model predictive control algorithm combined with neural-dynamics optimization was proposed using the derived tracking-error kinematics, which effectively achieved tracking control under velocity constraints and velocity increment constraints. More recently, in [6], nonlinear controllers using a synthetic-analytic behavior-based control framework were presented to track with velocity constraints. A PID-based kinematic controller was proposed in [7] as a non-model-based controller to navigate a tractor-trailer wheeled robot along desired trajectories.
For a real WMR, it is obvious that a kinematics controller alone cannot perform trajectory tracking well. Therefore, various advanced dynamics controllers are adopted to track the desired velocity, which is exactly the output of the kinematics controller. These studies have mainly concentrated on overcoming system uncertainties and external disturbances. Instead of classical torque-based control, a robust control approach [8] was developed based on the voltage control strategy. In [9], a robust adaptive controller was proposed by combining adaptive control, backstepping and fuzzy logic techniques. Considering the robust performance of sliding mode control, a controller with finite-time convergence of the tracking errors was provided in [10], where a disturbance observer and an adaptive compensator were used to enhance the robustness of the system. Similarly, an integral terminal sliding mode controller [11] was adopted in the presence of parameter uncertainties and external disturbances, and an adaptive fuzzy observer was introduced to compensate for the unmeasured velocity. In [12], a fast terminal sliding mode control scheme was proposed under known or unknown upper bounds of the system uncertainty and external disturbances.
From the perspective of optimization, a controller based on model predictive control [13] was proposed to prevent sideslip and improve the performance of path tracking control. A nonlinear model predictive controller [14] was introduced with a set of modifications to track a given trajectory. Considering the uncertainties to be time-varying and dynamic, a robust control strategy [15] based on time delay control was proposed. In [16], optimization-based nonlinear control laws were analytically developed using the prediction of WMR responses, and the tracking precision was further increased by appending an integral feedback technique. Because neural networks (NNs) can approximate nonlinear functions well, an NN-based method [17] was provided to approximate the unknown modeling, skidding and slipping terms, although such a method is not common in low-level drivers. It can therefore be concluded that kinematics control and dynamics control are two different ways to address the tracking control problem, in which a variety of nonlinear control approaches can be employed. Both methods depend highly on a system model, and a more accurate model usually leads to more precise control accuracy. However, it is hard to describe the system exactly with nonlinear formulations, especially the model uncertainties and disturbances.
Apart from the model-based control methods mentioned above, learning-based (reinforcement learning) methods have become a new research focus [18], because they do not require a system model. In [19], with the candidate parameters of a PD controller defined as the action space, a hierarchical reinforcement learning approach for optimal path tracking of a WMR was proposed; however, the state and action spaces were decomposed into several subspaces, which is not amenable to the continuous control problem. Thus, RL methods with continuous spaces have been studied: an actor-critic goal-oriented deep RL architecture [20] was developed to achieve an adaptive low-level control strategy in continuous space. In [21], an RL algorithm was designed to generate an optimal control signal for uncertain nonlinear MIMO systems. In [22], an RL-based adaptive tracking control algorithm was proposed for a time-delayed WMR system with slipping and skidding. In [23], a layered deep reinforcement learning algorithm for robot composite tasks was proposed, which is superior to common deep reinforcement learning algorithms over discrete state spaces. A solution for the path following problem of a quadrotor vehicle based on deep reinforcement learning theory was proposed under three different conditions [24].
Despite the excellent performance of RL algorithms, they suffer from time-consuming training and inefficient sampling in the interaction between agent and environment [25]. Thus, in [26], a model-based reinforcement learning algorithm with excellent sample complexity was achieved by combining neural network dynamics models with model predictive control (MPC), which produces stable and plausible gaits that accomplish various complex locomotion tasks. In addition, a kernel-based dynamic model for reinforcement learning was proposed to fulfill robotic tracking tasks [27], where the optimal control policy is searched by the model-based RL method. In [28], a multi pseudo Q-learning-based deterministic policy gradient algorithm was proposed to achieve high tracking control accuracy of AUVs, which validated that increasing the number of actors and critics can further improve the performance. Recently, a data-based approach was proposed for analyzing the stability of discrete-time nonlinear stochastic systems modeled by Markov decision processes, by using the classic Lyapunov method in control theory [29]. To overcome the limited exploration ability caused by a deterministic policy, high-speed autonomous drifting was addressed in [30] using a closed-loop controller based on the deep RL algorithm soft actor critic (SAC) to control the steering angle and throttle of simulated vehicles. It should be noticed that deep reinforcement learning algorithms always require time-consuming training episodes. This may be acceptable to a certain extent for simulated robots, but it is not feasible in an actual environment. Therefore, effort should be concentrated on improving the efficiency of deep reinforcement learning algorithms.
B. Motivation of Our Approach
In general, model-based control methods have always been preferred for developing a controller, and the performance depends largely on the accuracy of the model. However, model uncertainty and external disturbances exist objectively and have to be addressed. Thus, a number of robust strategies have to be adopted to obtain a controller with more precise control accuracy. Furthermore, once the control algorithm is determined, the accuracy of the controller remains unchanged; it loses the possibility of improving itself by learning, just as humans do. In contrast, RL-based methods do not need a system model at all, and human-level performance can be obtained with a reasonable end-to-end training process. Naturally, the synthesis of model-based control and learning-based control can be an attractive alternative for an autonomous WMR.
Considering the good tracking performance of existing dynamics controllers, we prefer to control the velocity based on the kinematics model. Existing kinematics controllers solve a complex nonlinear control problem; the resulting control law is suboptimal and difficult to improve further with model-based methods alone. Therefore, a learning method can be used to optimize the existing kinematics controller to obtain better tracking performance.
Thus, in our effort on tracking control for the NWMR, the kinematics control is chosen as the model-based method, just like the "given talent" of a human, and the actor-critic based reinforcement learning method is adopted to learn the tracking experience during the whole tracking process, just like the "acquired knowledge". The main contributions of our proposed method are as follows.
A hybrid control strategy combining a model-based method and a deep reinforcement learning method for tracking control is proposed, which shows better performance in both accuracy and efficiency.
The state is defined to include the current tracking errors, the given control inputs and the one-step errors, which is one of the keys to the efficient convergence of the tracking control.
Given Control Law Based on Kinematics Model
A. Kinematics Model of NWMR With Velocity and Acceleration Constraints
As shown in Fig. 1, the NWMR has two drive wheels whose common axis passes through the geometric center of the robot body. The left and right drive wheels are driven by two hub motors to realize the forward, backward and turning motions of the robot. Point C is the midpoint of the axle connecting the two hub motors, and its coordinate in the global coordinate system is (x, y).
The kinematics of the nonholonomic wheeled mobile robot can be denoted as:\begin{align*} \dot{\mathbf{q}}= \begin{bmatrix} \dot{x}\\ \dot{y}\\ \dot{\theta}\end{bmatrix}= \begin{bmatrix} \cos\theta & 0\\ \sin\theta & 0\\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ \omega \end{bmatrix}\tag{1}\end{align*}
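As an illustration, a minimal Python sketch (not the authors' implementation) of one Euler-integration step of the kinematics (1) is given below; the sampling time and the commanded velocities are placeholder values.

```python
import numpy as np

def kinematics_step(q, v, omega, dt=0.1):
    """One Euler step of the unicycle kinematics (1); q = (x, y, theta)."""
    x, y, theta = q
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += omega * dt
    return np.array([x, y, theta])

# hypothetical commands, only to show the call
q_next = kinematics_step(np.zeros(3), v=0.5, omega=0.2)
```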
Although our method is based on a kinematics model, dynamic constraints must still be considered, because the control inputs of the NWMR need a certain response time and cannot change suddenly. The linear velocity $v$, angular velocity $\omega$, linear acceleration $a$ and angular acceleration $\alpha$ are therefore bounded:\begin{align*} \left| v \right|&\leq v_{max} \\ \left| \omega \right|&\leq \omega_{max} \\ \left| a \right|&\leq a_{max} \\ \left| \alpha \right|&\leq \alpha_{max}\tag{2}\end{align*}
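The constraints (2) can be enforced in simulation by rate-limiting and then saturating the commanded velocities; the following sketch assumes placeholder bound values rather than the settings used in the paper.

```python
import numpy as np

def limit_command(v_cmd, w_cmd, v_prev, w_prev, dt,
                  v_max=1.0, w_max=2.0, a_max=0.8, alpha_max=3.0):
    """Enforce the velocity and acceleration bounds in (2)."""
    # acceleration (rate) limits relative to the previous command
    v = np.clip(v_cmd, v_prev - a_max * dt, v_prev + a_max * dt)
    w = np.clip(w_cmd, w_prev - alpha_max * dt, w_prev + alpha_max * dt)
    # velocity limits
    return np.clip(v, -v_max, v_max), np.clip(w, -w_max, w_max)
```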
B. Given Control Law
As shown in Fig. 2, for any given desired pose $\mathbf{q}_{d}=[x_{d}, y_{d}, \theta_{d}]^{\mathrm{T}}$ and the measured pose $\tilde{\mathbf{q}}=[\tilde{x}, \tilde{y}, \tilde{\theta}]^{\mathrm{T}}$ of the mobile robot, the tracking error expressed in the robot frame is\begin{align*} \tilde{\mathbf{q}}_{e}&=R(\tilde{\theta})(\mathbf{q}_{d}-\tilde{\mathbf{q}}) \\ &=\begin{bmatrix} \tilde{x}_{e}\\ \tilde{y}_{e}\\ \tilde{\theta}_{e}\end{bmatrix} = \begin{bmatrix} \cos\tilde{\theta} & \sin\tilde{\theta} & 0\\ -\sin\tilde{\theta} & \cos\tilde{\theta} & 0\\ 0 & 0 & 1\end{bmatrix} \begin{bmatrix} x_{d}-\tilde{x}\\ y_{d}-\tilde{y}\\ \theta_{d}-\tilde{\theta}\end{bmatrix}\tag{3}\end{align*}
where the measured pose $\tilde{\mathbf{q}}$ is the actual pose $\mathbf{q}$ corrupted by the uncertainty $\mathbf{n}$,\begin{equation*} \tilde{\mathbf{q}}=\mathbf{q}+\mathbf{n}\tag{4}\end{equation*}
The error state dynamics can be written as follows:\begin{align*} \dot{\tilde{x}}_{e}&=\omega\tilde{y}_{e}-v+v_{d}\cos\tilde{\theta}_{e} \\ \dot{\tilde{y}}_{e}&=-\omega\tilde{x}_{e}+v_{d}\sin\tilde{\theta}_{e} \\ \dot{\tilde{\theta}}_{e}&=\omega_{d}-\omega\tag{5}\end{align*}
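For simulation or analysis, the error dynamics (5) can be evaluated directly; the sketch below assumes the desired velocities (v_d, w_d) and the applied velocities (v, w) are available.

```python
import numpy as np

def error_dynamics(err, v, w, v_d, w_d):
    """Right-hand side of the error dynamics (5); err = (x_e, y_e, theta_e)."""
    xe, ye, the = err
    xe_dot = w * ye - v + v_d * np.cos(the)
    ye_dot = -w * xe + v_d * np.sin(the)
    the_dot = w_d - w
    return np.array([xe_dot, ye_dot, the_dot])
```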
Tracking control seeks a control input $\mathbf{u}_{g}=[v_{g}, \omega_{g}]^{\mathrm{T}}$ that drives the tracking error (3) to zero. The given control law based on the kinematics model is chosen as\begin{align*} \mathbf{u}_{g}= \begin{bmatrix} v_{g}\\ \omega_{g}\end{bmatrix} = \begin{bmatrix} k_{1}\tilde{x}_{e}+v_{d}\cos\tilde{\theta}_{e}\\ 2v_{d}\tilde{y}_{e}\cos\frac{\tilde{\theta}_{e}}{2}+\omega_{d}+k_{2}\sin\frac{\tilde{\theta}_{e}}{2}\end{bmatrix}\tag{6}\end{align*} where $k_{1}>0$ and $k_{2}>0$ are control gains.
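A compact Python sketch of the pose-error transformation (3) and the given control law (6) is shown below; the gains k1 and k2 are illustrative values, not the ones used in the experiments.

```python
import numpy as np

def pose_error(q_d, q_m):
    """Tracking error (3): rotate the world-frame error into the robot frame."""
    xd, yd, thd = q_d
    x, y, th = q_m
    c, s = np.cos(th), np.sin(th)
    ex = c * (xd - x) + s * (yd - y)
    ey = -s * (xd - x) + c * (yd - y)
    return np.array([ex, ey, thd - th])

def given_control(err, v_d, w_d, k1=1.0, k2=2.0):
    """Given control law (6)."""
    ex, ey, eth = err
    v_g = k1 * ex + v_d * np.cos(eth)
    w_g = 2.0 * v_d * ey * np.cos(eth / 2.0) + w_d + k2 * np.sin(eth / 2.0)
    return np.array([v_g, w_g])
```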
The stability of the closed-loop system can be proved according to Lyapunov theory; see the Appendix.
With the control input in (6), the NWMR moves to a new pose $\tilde{\mathbf{q}}'=[\tilde{x}', \tilde{y}', \tilde{\theta}']^{\mathrm{T}}$ at the next time step, and the corresponding tracking error becomes\begin{align*} \tilde{\mathbf{q}}'_{e}&=R(\tilde{\theta}')(\mathbf{q}_{d}-\tilde{\mathbf{q}}') \\ &=\begin{bmatrix} \tilde{x}_{e}'\\ \tilde{y}_{e}'\\ \tilde{\theta}_{e}'\end{bmatrix} = \begin{bmatrix} \cos\tilde{\theta}' & \sin\tilde{\theta}' & 0\\ -\sin\tilde{\theta}' & \cos\tilde{\theta}' & 0\\ 0 & 0 & 1\end{bmatrix} \begin{bmatrix} x_{d}-\tilde{x}'\\ y_{d}-\tilde{y}'\\ \theta_{d}-\tilde{\theta}'\end{bmatrix}\tag{7}\end{align*}
In other words, the pose error of the NWMR makes a transition from the previous error $\tilde{\mathbf{q}}_{e}$ to the new error $\tilde{\mathbf{q}}'_{e}$ under the given control input.
Theoretically, the tracking error gradually converges to zero as time tends to infinity, which can be denoted as follows:\begin{equation*} \lim\limits_{t\to\infty}\left| \tilde{\mathbf{q}}_{e}'\right| = 0\tag{8}\end{equation*}
However, according to the error dynamics equation in the Appendix, the nonlinear kinematics controller can be suboptimal in finite time, and external disturbances or noise during the tracking control can cause the tracking error to converge only to a certain positive value $c$ within a finite time $t_{0}$:\begin{equation*} \lim\limits_{t\to t_{0}}\left| \tilde{\mathbf{q}}_{e}'\right| \geq c\tag{9}\end{equation*}
Besides, once the given control law (6) is determined, the convergence performance of the closed-loop system is fixed. The controller is not able to adjust itself to obtain more precise control performance, so an additional strategy is needed to improve it.
Hybrid Control Strategy Incorporating Deep Reinforcement Learning Approach
In this section, we consider a deep reinforcement learning method to help the NWMR learn an acquired control law from the tracking errors caused by the given control law in Section II-B.
A. Finite MDP
To convert the acquired control problem of the wheeled mobile robot into a general RL problem, we model it as a finite Markov decision process [31], which is the fundamental framework of RL theory.
First, we define the state $\mathbf{s}_{k}$ at time step $k$ as the concatenation of the current tracking error, the given control input and the one-step tracking error:\begin{align*} \mathbf{s}_{k}&=\left[{\tilde{\mathbf{q}}}_{e}^{\mathrm{T}}(k), \mathbf{u}_{g}^{\mathrm{T}}(k), {\tilde{\mathbf{q}}}_{e}^{'\mathrm{T}}(k)\right] \\ &=\left[\tilde{x}_{e}(k), \tilde{y}_{e}(k), \tilde{\theta}_{e}(k), v_{g}(k), \omega_{g}(k), \tilde{x}_{e}'(k), \tilde{y}_{e}'(k), \tilde{\theta}_{e}'(k)\right]\tag{10}\end{align*}
The action at time step $k$ is the acquired control input, generated by the deterministic policy $\mu$:\begin{equation*} \mathbf{a}_{k}=\mathbf{u}_{a}(k)=\left[v_{a}(k), \omega_{a}(k)\right]^{\mathrm{T}}=\mu(\mathbf{s}_{k})\tag{11}\end{equation*}
Then, the hybrid tracking control input at the current time step is\begin{equation*} \mathbf{u}(k)=\mathbf{u}_{g}(k)+\mathbf{u}_{a}(k)\tag{12}\end{equation*}
The immediate reward $r_{k}$ is defined as the negative sum of the absolute tracking errors after the hybrid control input is applied:\begin{equation*} r_{k}=-\left(\left| \tilde{x}_{e}'(k)\right| + \left| \tilde{y}_{e}'(k)\right| + \left|\tilde{\theta}_{e}'(k)\right|\right)\tag{13}\end{equation*}
The cumulative reward of the whole learning process is calculated with a discount factor $\gamma$:\begin{equation*} G_{k}=\sum_{i=1}^{N}\gamma^{i-1} r_{k+i}\tag{14}\end{equation*}
It should be mentioned that the state of our RL problem includes not only the current tracking error $\tilde{\mathbf{q}}_{e}$, but also the given control input $\mathbf{u}_{g}$ and the resulting one-step error $\tilde{\mathbf{q}}_{e}'$.
Therefore, the error vectors before and after the given control input jointly describe how the given control law acts on the tracking error, which is exactly the experience the acquired control is expected to learn from, and this state definition is one of the keys to the efficient convergence of the tracking control.
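The MDP quantities (10)–(14) can be assembled as in the following sketch; the helper names are illustrative, and the discount factor is a placeholder value.

```python
import numpy as np

def build_state(err, u_g, err_next):
    """State s_k in (10): current error, given input, one-step error (8-dimensional)."""
    return np.concatenate([err, u_g, err_next])

def hybrid_input(u_g, u_a):
    """Hybrid control input (12): given control plus acquired control."""
    return u_g + u_a

def reward(err_next):
    """Immediate reward (13): negative absolute one-step tracking error."""
    return -(abs(err_next[0]) + abs(err_next[1]) + abs(err_next[2]))

def discounted_return(rewards, gamma=0.99):
    """Cumulative reward (14); `rewards` holds r_{k+1}, ..., r_{k+N}."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```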
B. Acquired Control Learning With Actor-Critic Architecture
Because the experience model is difficult to represent accurately with mathematical expressions, the method in this paper is constructed with the deep deterministic policy gradient (DDPG) algorithm [32]. The deterministic policy $\pi(\mathbf{s}|\boldsymbol{\lambda})$ is approximated by the actor network with parameter vector $\boldsymbol{\lambda}$, the action-value function $Q(\mathbf{s},\mathbf{a}|\boldsymbol{\beta})$ is approximated by the critic network with parameter vector $\boldsymbol{\beta}$, and target actor and target critic networks $\hat{\pi}(\mathbf{s}|\hat{\boldsymbol{\lambda}})$ and $\hat{Q}(\mathbf{s},\mathbf{a}|\hat{\boldsymbol{\beta}})$ are introduced to stabilize training.
Before updating the actor network, the critic network is trained by minimizing the mean-squared TD error over a mini-batch of $N$ transitions:\begin{equation*} L(\boldsymbol{\beta})=\frac{1}{N}\sum_{k=1}^{N}\left(T_{k}-Q(s_{k},a_{k}|\boldsymbol{\beta})\right)^{2}\tag{15}\end{equation*}
where the TD target $T_{k}$ is computed with the target actor and target critic networks:\begin{equation*} T_{k}=r(s_{k},a_{k})+\gamma\hat{Q}\left(s_{k+1},\hat{\pi}(s_{k+1}|\hat{\boldsymbol{\lambda}})|\hat{\boldsymbol{\beta}}\right)\tag{16}\end{equation*}
Thus, the gradient of the critic loss is:\begin{equation*} \nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta})=-\frac{2}{N}\sum_{k=1}^{N}\left(T_{k}-Q(s_{k},a_{k}|\boldsymbol{\beta})\right)\frac{\partial Q(s_{k},a_{k}|\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}\tag{17}\end{equation*}
and the critic network is updated along the negative gradient direction with learning rate $\mathrm{L_{c}}$:\begin{equation*} \boldsymbol{\beta}\gets\boldsymbol{\beta}-\mathrm{L_{c}}\cdot\nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta})\tag{18}\end{equation*}
For the actor network, the performance objective is the expected return under the discounted state distribution $\rho^{\phi}$ of the behavior policy:\begin{align*} F_{\phi}(\pi_{\lambda})&=\int_{S}\rho^{\phi}(s)Q^{\pi}(s,\pi_{\lambda}(s))ds \\ &=\mathbb{E}_{s\sim\rho^{\phi}}\left[Q(s,\pi_{\lambda}(s))\right]\tag{19}\end{align*}
According to [33], the off-policy deterministic policy gradient is:\begin{align*} \nabla_{\lambda} F_{\phi}(\pi_{\lambda})&\approx\int_{S}\rho^{\phi}(s)\nabla_{\lambda}\pi_{\lambda}(s)\nabla_{a} Q^{\pi}(s,a)|_{a=\pi_{\lambda}(s)}\,ds \\ &=\mathbb{E}_{s\sim\rho^{\phi}}\left[\nabla_{\lambda}\pi_{\lambda}(s)\nabla_{a} Q^{\pi}(s,a)|_{a=\pi_{\lambda}(s)}\right]\tag{20}\end{align*}
So, when a mini-batch of data is sampled randomly from the replay memory buffer, the policy gradient is estimated as:\begin{align*} \nabla_{\lambda} F_{\phi}(\pi_{\lambda})=\frac{1}{N}\sum_{k=1}^{N}\nabla_{\lambda}\pi_{\lambda}(s|\lambda)|_{s=s_{k}}\cdot\nabla_{a} Q^{\pi}(s,a|\beta)|_{s=s_{k},a=\pi_{\lambda}(s_{k})}\tag{21}\end{align*}
Thus, the actor network is updated with learning rate $\mathrm{L_{a}}$:\begin{equation*} \boldsymbol{\lambda}\gets\boldsymbol{\lambda}+\mathrm{L_{a}}\cdot\nabla_{\lambda} F_{\phi}(\pi_{\lambda})\tag{22}\end{equation*}
For the stability of training, the parameter vectors of the target critic network and the target actor network are updated softly with rate $\varepsilon$:\begin{align*} \hat{\boldsymbol{\lambda}}&=\varepsilon\boldsymbol{\lambda}+(1-\varepsilon)\hat{\boldsymbol{\lambda}}\tag{23}\\ \hat{\boldsymbol{\beta}}&=\varepsilon\boldsymbol{\beta}+(1-\varepsilon)\hat{\boldsymbol{\beta}}\tag{24}\end{align*}
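The soft target update (23)–(24) amounts to the following one-liner, assuming the network parameters are stored as lists of arrays:

```python
def soft_update(target_params, params, eps=0.01):
    """Soft update (23)-(24): blend the online parameters into the target ones."""
    return [eps * p + (1.0 - eps) * tp for tp, p in zip(target_params, params)]
```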
Finally, when the training is over, the optimal acquired control law is given by the trained actor network:\begin{equation*} \mathbf{u}_{a}^{*}=\pi^{*}(\mathbf{s};\boldsymbol{\lambda}^{*})\tag{25}\end{equation*}
We use a fully connected model to build the target actor network, so it can be expressed with the following forward model:\begin{align*} \mathbf{l}_{1}&=f_{1}(\mathbf{W}^{*}_{1}\mathbf{s}+\mathbf{b}^{*}_{1}) \\ \mathbf{l}_{2}&=f_{2}(\mathbf{W}^{*}_{2}\mathbf{l}_{1}+\mathbf{b}^{*}_{2}) \\ &\;\;\vdots \\ \mathbf{l}_{n-1}&=f_{n-1}(\mathbf{W}^{*}_{n-1}\mathbf{l}_{n-2}+\mathbf{b}^{*}_{n-1}) \\ \mathbf{u}^{*}_{a}&=f_{n}(\mathbf{W}^{*}_{n}\mathbf{l}_{n-1}+\mathbf{b}^{*}_{n})\tag{26}\end{align*}
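A minimal numpy sketch of the forward model (26) is given below; the layer sizes, the ReLU hidden activations and the tanh output are assumptions for illustration rather than the exact architecture in Tab. 2.

```python
import numpy as np

def actor_forward(s, weights, biases):
    """Forward pass (26): hidden layers f_1..f_{n-1} (ReLU), output layer f_n (tanh)."""
    h = s
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)
    return np.tanh(weights[-1] @ h + biases[-1])   # acquired control u_a*

# example with random parameters: 8-dim state -> 2-dim acquired control
rng = np.random.default_rng(0)
sizes = [8, 64, 64, 2]
Ws = [0.1 * rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
u_a = actor_forward(np.zeros(8), Ws, bs)
```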
So far, the hybrid tracking control law for the NWMR is obtained after training, which combines the given control from the kinematics control method and the acquired control from the DRL method. The control block diagram is shown in Fig. 3 and Fig. 4. The pseudocode of our tracking control method is shown in Algorithm 1:
Algorithm 1 Hybrid Strategy of Tracking Control for NWMR
Initialize/load the actor network $\pi(\mathbf{s}|\boldsymbol{\lambda})$ and the critic network $Q(\mathbf{s},\mathbf{a}|\boldsymbol{\beta})$
Initialize the target networks with $\hat{\boldsymbol{\lambda}}\gets\boldsymbol{\lambda}$, $\hat{\boldsymbol{\beta}}\gets\boldsymbol{\beta}$
Initialize the replay buffer $R$
for episode=1 to Max-ep do
get the initial pose observation of the NWMR, compute the initial state $\mathbf{s}_{1}$
Initialize the cumulative error to zero
for step $k$=1 to Max-step do
compute the given control $\mathbf{u}_{g}(k)$ with (6)
compute the acquired control $\mathbf{u}_{a}(k)$ with (11)
execute the hybrid control input $\mathbf{u}(k)$ in (12), observe the reward $r_{k}$ and the next state $\mathbf{s}_{k+1}$
store the transition $(\mathbf{s}_{k}, \mathbf{a}_{k}, r_{k}, \mathbf{s}_{k+1})$ in $R$
if number of transitions > Memory then
extract randomly a batch of transitions from $R$
update the critic network with (15)–(18), the actor network with (21)–(22) and the target networks with (23)–(24)
end if
end for
end for
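The replay-buffer handling in Algorithm 1 can be sketched as follows; the capacity and batch size follow the values reported later (5000 and 32), and the class name is illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s') and samples random mini-batches."""
    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```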
Simulation Results
In this section, simulations are developed to demonstrate that the proposed method can achieve tracking control of the NWMR effectively. We first try to track a circle, which is defined as:\begin{align*} x_{d}&=2\cos\theta \\ y_{d}&=2\sin\theta\end{align*}
To evaluate the tracking performance quantitatively, the cumulative error over an episode is computed as\begin{align*} E=-\sum_{k=1}^{N}\Big[\xi_{1}\left(\left| x_{e}(k)\right| + \left| y_{e}(k)\right|\right) + \xi_{2}\left| \theta_{e}(k)\right| + \xi_{3}\left(\left| v_{d} - v(k)\right| + \left| \omega_{d} - \omega(k)\right|\right)\Big]\end{align*} where $\xi_{1}$, $\xi_{2}$ and $\xi_{3}$ are weighting coefficients.
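This metric can be computed as in the sketch below; the weights xi1, xi2, xi3 and the constant desired velocities are assumptions for illustration.

```python
def cumulative_error(errs, vels, v_d, w_d, xi1=1.0, xi2=0.5, xi3=0.1):
    """Cumulative error E; errs holds (x_e, y_e, theta_e), vels holds (v, w) per step."""
    E = 0.0
    for (xe, ye, the), (v, w) in zip(errs, vels):
        E -= (xi1 * (abs(xe) + abs(ye)) + xi2 * abs(the)
              + xi3 * (abs(v_d - v) + abs(w_d - w)))
    return E
```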
The bounded values in (2) and the parameters of the given control law are listed in Tab. 1.
The uncertainty in (4) is chosen as a periodic disturbance,\begin{align*} \mathbf{n}=\begin{bmatrix} n_{x}\\ n_{y}\\ n_{\theta}\end{bmatrix}=\begin{bmatrix} 0.002\sin(\pi t)\\ 0.002\cos(\pi t)\\ 0.005\sin(\pi t)\end{bmatrix}\end{align*}
The network architecture in Section III-B is built with the aid of TensorFlow; the actor/target actor networks use a deep fully connected structure similar to that of the critic/target critic networks, and the hyperparameters are shown in Tab. 2. Besides, the maximum size of the replay buffer is 5000, the batch size is 32, and the learning rates of the actor network and the critic network are 0.001 and 0.002, respectively. Training consists of a total of 400 episodes with 200 steps in each episode, at a fixed sampling time.
The results of our proposed method are shown in Figs. 5–9. The trajectory of the NWMR can be seen in Fig. 5, where the desired trajectory is shown with a dotted line and the actual trajectory with a solid line. From Fig. 6, it can be observed that the tracking errors all converge to near zero. The given control signals, acquired control signals and final hybrid control inputs of (12) are shown in Figs. 7–9, respectively. The acquired control inputs act throughout the whole process, which proves the effectiveness of our method.
To make a comparison, we also test the performance of the classical method, that is, only the given control approach (6) is active. The results are depicted in Figs. 10–11. Comparing Fig. 10 with Fig. 5, our method obviously performs better, and the comparison of Fig. 11 with Fig. 6 also supports this point. In fact, the cumulative error of the classical method in Fig. 11 is -202.0255, while that of our method in Fig. 5 is -110.4874. So, the addition of the acquired control clearly improves the tracking accuracy.
Besides, we also test the pure learning method, that is, only the acquired control (11) is active; the results are depicted in Figs. 12–15. Comparing Fig. 12 with Fig. 5, the circle-tracking performance is similar in both cases, and the cumulative error of the learning method in Fig. 12 is -110.1912. However, the comparison of the training processes in Fig. 14 and Fig. 15 shows that our proposed method converges to a stable level within 300 episodes, and the fluctuation of the reward (Y axis) of the former is smaller than that of the latter, which proves the superiority of our method in the training process.
Training process of tracking the circle with our method, with Y = −150 (red dotted line) as reference.
Training process of tracking the circle only with the learning method, with Y = −150 (red dotted line) as reference.
To further demonstrate the effectiveness of our method, we conduct another simulation to track a spiral trajectory, defined as follows:\begin{align*} x_{d}&=0.04t\cos(0.5t) \\ y_{d}&=0.04t\sin(0.5t)\end{align*}
The uncertainty in (4) is chosen as a random disturbance,\begin{equation*} \mathbf{n}=0.002\,\boldsymbol{\sigma}\end{equation*} where $\boldsymbol{\sigma}$ is a random vector.
The parameters are given in Tab. 3. The size of the replay buffer is 5000, the batch size is 32, and the learning rates of the actor network and the critic network are 0.0002 and 0.001, respectively. The maximum number of episodes is 800 and the maximum number of steps in each episode is 250, while the other parameters remain unchanged.
The results are depicted in Figs. 16–20. The trajectory of the NWMR is shown in Fig. 16, the tracking errors are depicted in Fig. 17, and the given control signals, acquired control signals and hybrid control inputs can be seen in Figs. 18–20. From Fig. 18, the angular velocity of the given control already exceeds the upper bound, but the hybrid control inputs in Fig. 20 remain bounded.
The results obtained when only the classical control approach is active are also illustrated as a comparison in Fig. 21 and Fig. 22. The cumulative error of the classical method in Fig. 21 is -358.0541, while that of our method in Fig. 16 is -93.7636, which again proves the effectiveness of the proposed method. The results obtained with only the learning method are shown in Figs. 23–26. In Fig. 23, the cumulative error is -114.4648; compared to Fig. 16, the tracking performance of our method is still better. According to the training processes in Fig. 25 and Fig. 26, our proposed method is obviously more stable, too.
Training process of tracking the spiral with our method, with Y = −110 (red dotted line) as reference.
Training process of tracking the spiral only with the learning method, with Y = −110 (red dotted line) as reference.
So, for circle tracking, our proposed method has tracking performance similar to that of the learning method, but its convergence performance is better; for spiral tracking, our proposed method evidently has advantages in both tracking performance and convergence performance.
Conclusion
In this research, the tracking control of the NWMR with constraints and uncertainty has been addressed by the proposed hybrid control strategy, which is a combination of a model-based control method and a learning-based method. The kinematics control serves as the given control (like "the talent"), and the actor-critic based DRL method is used to learn an acquired control law to compensate the remaining errors (like "the experience"). The results have demonstrated the effectiveness of the proposed method, and the comparisons show that our method has the advantage of less cumulative error; meanwhile, it is more stable and efficient than the pure learning-based method.
The strategy provided in our effort can improve tracking and convergence performance, which is a vital function for an autonomous mobile robot. Although our method has been tested on the tracking control of the NWMR, it can also be applied to other complicated control problems.
Appendix
Substituting (6) into (5), the error dynamics can be rewritten as:\begin{align*} \dot{x}_{e}&=-k_{1} x_{e}+2v_{d} y_{e}^{2}\cos\frac{\theta_{e}}{2}+k_{2} y_{e}\sin\frac{\theta_{e}}{2}+\omega_{d} y_{e} \\ \dot{y}_{e}&=-2v_{d} x_{e} y_{e}\cos\frac{\theta_{e}}{2}-\omega_{d} x_{e}-k_{2} x_{e}\sin\frac{\theta_{e}}{2}+v_{d}\sin\theta_{e} \\ \dot{\theta}_{e}&=-2v_{d} y_{e}\cos\frac{\theta_{e}}{2}-k_{2}\sin\frac{\theta_{e}}{2}\end{align*}
Define the positive-definite Lyapunov function candidate\begin{equation*} L=\frac{1}{2} x_{e}^{2}+\frac{1}{2} y_{e}^{2}+2\left(1-\cos\frac{\theta_{e}}{2}\right)\end{equation*}
Differentiating the Lyapunov function with respect to time along the error dynamics gives:\begin{align*} \dot{L}&=x_{e}\dot{x}_{e}+y_{e}\dot{y}_{e}+\dot{\theta}_{e}\sin\frac{\theta_{e}}{2} \\ &=x_{e}\left(-k_{1} x_{e}+2v_{d} y_{e}^{2}\cos\frac{\theta_{e}}{2}+k_{2} y_{e}\sin\frac{\theta_{e}}{2}+\omega_{d} y_{e}\right) \\ &\quad+ y_{e}\left(-2v_{d} x_{e} y_{e}\cos\frac{\theta_{e}}{2}-\omega_{d} x_{e}-k_{2} x_{e}\sin\frac{\theta_{e}}{2}+v_{d}\sin\theta_{e}\right)\\ &\quad+ \left(-v_{d} y_{e}\sin\theta_{e}-k_{2}\sin^{2}\frac{\theta_{e}}{2}\right)\\ &=-k_{1} x_{e}^{2}-k_{2}\sin^{2}\frac{\theta_{e}}{2} \leq 0\end{align*}
According to Lyapunov theory, together with LaSalle's invariance principle, the tracking errors asymptotically converge to zero.
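The derivative above can also be checked numerically; the sketch below evaluates the closed-loop error dynamics at random error states (with arbitrary gains and reference velocities) and verifies that dL/dt equals -k1*x_e^2 - k2*sin^2(theta_e/2).

```python
import numpy as np

k1, k2, v_d, w_d = 1.0, 2.0, 0.5, 0.3   # arbitrary illustrative values
rng = np.random.default_rng(1)

for _ in range(5):
    xe, ye, the = rng.uniform(-1.0, 1.0, 3)
    # closed-loop error dynamics from the Appendix
    xe_dot = -k1*xe + 2*v_d*ye**2*np.cos(the/2) + k2*ye*np.sin(the/2) + w_d*ye
    ye_dot = -2*v_d*xe*ye*np.cos(the/2) - w_d*xe - k2*xe*np.sin(the/2) + v_d*np.sin(the)
    the_dot = -2*v_d*ye*np.cos(the/2) - k2*np.sin(the/2)
    # dL/dt = xe*xe_dot + ye*ye_dot + sin(the/2)*the_dot
    L_dot = xe*xe_dot + ye*ye_dot + np.sin(the/2)*the_dot
    assert np.isclose(L_dot, -k1*xe**2 - k2*np.sin(the/2)**2)
```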