
A Hybrid Tracking Control Strategy for Nonholonomic Wheeled Mobile Robot Incorporating Deep Reinforcement Learning Approach


Tracking control of nonholonomic wheeled mobile robot combining kinematics control method and deep reinforcement learning approach.


Abstract:

Tracking control is an essential capability for nonholonomic wheeled mobile robots (NWMR) to achieve autonomous navigation. This paper presents a novel hybrid control strategy combining model-based control and an actor-critic based deep reinforcement learning method. Based on the Lyapunov method, a kinematic control law, termed the given control, is derived from the pose errors. The tracking control problem is then converted to a finite Markov decision process, in which the defined state contains the current tracking errors, the given control inputs, and the one-step errors. After training with the deep deterministic policy gradient method, the action, termed the acquired control input, compensates for the existing errors. Thus, the hybrid control strategy is obtained under velocity constraints, acceleration constraints, and bounded uncertainty. A cumulative error is also defined as a criterion to evaluate tracking performance. The comparison results in simulation demonstrate that our proposed method has a clear advantage in both tracking accuracy and training efficiency.
Published in: IEEE Access ( Volume: 9)
Page(s): 15592 - 15602
Date of Publication: 21 January 2021
Electronic ISSN: 2169-3536


SECTION I.

Introduction

A. Related Works

The tracking control of wheeled mobile robots is one of the fundamental functions of robot autonomous navigation, and it has been widely used in inspection, security, cleaning, planetary exploration, military applications, and so on. The wheeled mobile robot can be classified as a nonlinear multi-input multi-output system, and it is in fact an underactuated system with uncertainty. In addition, the wheeled mobile robot is subject to nonholonomic constraints, which makes it challenging to construct a controller with the desired performance. As a result, tracking control of the nonholonomic wheeled mobile robot (NWMR) has remained a research focus over the past decades.

Due to the nonholonomic constraints of the WMR, tracking control typically involves both kinematic control and dynamic control, which has become the basic pipeline of tracking control for the NWMR. Kinematic control is used to track the desired pose through the WMR’s speed commands; thus, a time-varying controller based on Lyapunov theory was proposed in [1]. Kinematic models in chained form [2], [3] were used to transform this complex system into a more convenient one. Besides, a model described in polar coordinates [4] was also reported for designing a robust system. In [5], a model predictive control algorithm combined with neural-dynamics optimization was proposed based on the derived tracking-error kinematics, which effectively achieved tracking control under velocity constraints and velocity increment constraints. More recently, in [6], nonlinear controllers using a synthetic-analytic behavior-based control framework were presented for tracking under velocity constraints. In [7], a PID-based kinematic controller was proposed as a non-model-based controller to make a tractor-trailer wheeled robot follow desired trajectories.

For a real WMR, it is obvious that a kinematic controller alone cannot perform trajectory tracking well. Therefore, various advanced dynamic controllers are adopted to track the desired velocity, which is exactly the output of the kinematic controller. This research has mainly concentrated on overcoming system uncertainties and external disturbances. Instead of classical torque-based control, a robust control approach [8] was developed based on a voltage control strategy. In [9], a robust adaptive controller was proposed using adaptive control, backstepping, and fuzzy logic techniques. Considering the robust performance of sliding mode control, a controller with finite-time convergence of the tracking errors was provided in [10], where a disturbance observer and an adaptive compensator were used to enhance the robustness of the system. Similarly, an integral terminal sliding mode controller [11] was adopted in the presence of parameter uncertainties and external disturbances, and an adaptive fuzzy observer was introduced to compensate for the unmeasured velocity. In [12], a fast terminal sliding mode control scheme was proposed under known or unknown upper bounds of the system uncertainty and external disturbances.

From the perspective of optimization, a controller [13] based on model predictive control was proposed to prevent sideslip and improve the performance of path tracking control. A nonlinear model predictive controller [14] was introduced with a set of modifications to track a given trajectory. Considering the uncertainties to be time-varying and dynamic, a robust control strategy [15] was proposed with time delay control. In [16], optimization-based nonlinear control laws were analytically developed using the prediction of WMR responses, and the tracking precision was further improved by appending an integral feedback technique. Because neural networks (NNs) can approximate nonlinear functions well, an NN-based method [17] was provided to approximate the unknown modeling, skidding, and slipping terms, although it is not common in low-level drivers. It can therefore be concluded that kinematic control and dynamic control are two different ways to address the tracking control problem, in which a variety of nonlinear control approaches can be employed. Both methods depend heavily on a system model: an algorithm built on a more accurate model may achieve more precise control. However, the system is hard to describe exactly with nonlinear formulations, especially the model uncertainties and disturbances.

Besides the model-based control methods mentioned above, learning-based (reinforcement learning) methods have become a new research focus [18], because they do not require a system model. In [19], with the candidate parameters of a PD controller defined as the action space, a hierarchical reinforcement learning approach for optimal path tracking of a WMR was proposed, but the state space and action space were decomposed into several subspaces, which is not amenable to continuous control problems. Thus, RL methods with continuous spaces have been studied: an actor-critic goal-oriented deep RL architecture [20] was developed to achieve an adaptive low-level control strategy in continuous space. In [21], an RL algorithm was designed to generate an optimal control signal for uncertain nonlinear MIMO systems. In [22], an RL-based adaptive tracking control algorithm was proposed for a time-delayed WMR system with slipping and skidding. In [23], a layered deep reinforcement learning algorithm for robot composite tasks was proposed, which is superior to common deep reinforcement learning algorithms in discrete state spaces. A solution to the path-following problem of a quadrotor vehicle based on deep reinforcement learning theory was proposed under three different conditions in [24].

Despite the excellent performance of RL algorithms, they suffer from time-consuming training and inefficient sampling in the interaction between the agent and the environment [25]. Thus, in [26], a model-based reinforcement learning algorithm with excellent sample complexity was achieved by combining neural network dynamics models with model predictive control (MPC), which produces stable and plausible gaits that accomplish various complex locomotion tasks. A kernel-based dynamic model for reinforcement learning was proposed to fulfill robotic tracking tasks [27], and the optimal control policy was searched by the model-based RL method. In [28], a multi pseudo Q-learning-based deterministic policy gradient algorithm was proposed to achieve high-accuracy tracking control of AUVs, which validated that increasing the number of actors and critics can further improve performance. Recently, a data-based approach was proposed for analyzing the stability of discrete-time nonlinear stochastic systems modeled as Markov decision processes, using the classic Lyapunov method from control theory [29]. Due to the limited exploration ability caused by a deterministic policy, high-speed autonomous drifting was addressed in [30] using a closed-loop controller based on the deep RL algorithm soft actor-critic (SAC) to control the steering angle and throttle of simulated vehicles. We should note that deep reinforcement learning algorithms always require time-consuming training episodes. This may be acceptable to a certain extent for simulated robots, but it is not feasible in an actual environment. Therefore, effort should be concentrated on improving the efficiency of deep reinforcement learning algorithms.

B. Motivation of Our Approach

In general, model-based control methods have always been preferred for developing a controller, and the performance depends largely on the accuracy of the model. However, model uncertainty and external disturbances are inevitable and have to be addressed, so a number of robust strategies must be adopted to obtain a controller with more precise control accuracy. Furthermore, once the control algorithm is determined, the accuracy of the controller remains unchanged; it loses the possibility of improving itself by learning, as humans do. In contrast, RL-based methods do not need a system model at all, and human-level performance can be obtained with a reasonable end-to-end training process. Naturally, the synthesis of model-based control and learning-based control could be an attractive alternative for an autonomous WMR.

Considering the good tracking performance of current dynamic controllers, we prefer to control the velocity based on a kinematic model. Existing kinematic controllers solve a complex nonlinear control problem; they are therefore suboptimal and difficult to improve further with model-based methods. Hence, a learning method can be used to optimize the existing kinematic controller to obtain better tracking performance.

Thus, in our work on tracking control for the NWMR, kinematic control is chosen as the model-based method, analogous to the “given talent” of a human. An actor-critic based reinforcement learning method is adopted to learn from the tracking experience during the whole tracking process, analogous to “acquired knowledge”. The main contributions of our proposed method are as follows.

  • A hybrid control strategy combining a model-based method and a deep reinforcement learning method for tracking control is proposed, which shows better performance in both accuracy and efficiency.

  • The state is defined to include the current tracking errors, the given control inputs, and the one-step errors, which is one of the keys to the efficient convergence of tracking control.

The remainder of this paper is organized as follows. In Section II, the kinematic model of the NWMR with constraints and the given control law based on Lyapunov theory are described. In Section III, we elaborate our hybrid control strategy combining model-based control and actor-critic based DRL in detail. In Section IV, we present the simulation results of our method under periodic and random disturbances. Finally, we conclude in Section V.

SECTION II.

Given Control Law Based on Kinematics Model

A. Kinematics Model of NWMR With Velocity and Acceleration Constraints

As shown in Fig. 1, the NWMR has two drive wheels whose common axis passes through the geometric center of the robot body. The left and right drive wheels are driven by two hub motors to realize the forward, backward, and turning motions of the robot. Point C is the midpoint of the axle between the two hub motors, and its coordinate in the global coordinate system is $(x,y)$; $\theta $ is the orientation of the mobile robot, $v$ is its linear velocity, and $\omega $ is its angular velocity. Moreover, $r$ represents the radius of the outer ring of a drive wheel, and $L$ denotes the vertical distance between the center of a drive wheel and the midpoint C of the axle. $v_{l}$ and $v_{r}$ are the linear speeds of the left and right drive wheels, respectively.

FIGURE 1. Nonholonomic wheeled mobile robot.

The kinematics of the nonholonomic wheeled mobile robot can be denoted as:\begin{align*} \dot{\mathbf{q}}=\begin{bmatrix} \dot{x}\\ \dot{y}\\ \dot{\theta}\end{bmatrix}=\begin{bmatrix}\cos\theta & 0\\ \sin\theta & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix} v \\ \omega\end{bmatrix}\tag{1}\end{align*}

Although our method is based on a kinematic model, dynamic constraints must still be considered, because the control inputs of the NWMR need a certain response time and cannot change abruptly. The linear velocity $v$, angular velocity $\omega$, linear acceleration $a$, and angular acceleration $\alpha$ of the NWMR are all bounded:\begin{align*} \left|v\right|&\leq v_{max}, &\left|\omega\right|&\leq\omega_{max}, \\ \left|a\right|&\leq a_{max}, &\left|\alpha\right|&\leq\alpha_{max}\tag{2}\end{align*}
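
As a rough illustration of how the kinematics (1) and the bounds (2) interact in discrete time, the sketch below propagates the unicycle model while saturating the commanded velocities and their rates. The sampling period and bound values are placeholders consistent with the simulation section, not a prescribed implementation.

```python
import numpy as np

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def step_kinematics(q, v_cmd, w_cmd, v_prev, w_prev, dt=0.1,
                    v_max=2.0, w_max=1.0, a_max=1.0, alpha_max=1.5):
    """Propagate the kinematics (1) one step while enforcing the bounds (2)."""
    # velocity bounds
    v = clamp(v_cmd, -v_max, v_max)
    w = clamp(w_cmd, -w_max, w_max)
    # acceleration bounds: limit the change w.r.t. the previous command
    v = clamp(v, v_prev - a_max * dt, v_prev + a_max * dt)
    w = clamp(w, w_prev - alpha_max * dt, w_prev + alpha_max * dt)
    x, y, th = q
    q_next = np.array([x + v * np.cos(th) * dt,
                       y + v * np.sin(th) * dt,
                       th + w * dt])
    return q_next, v, w
```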

B. Given Control Law

As shown in Fig. 2, for any given desired pose $\mathbf{q}_{d}=\left[x_{d},y_{d},\theta_{d}\right]^{\mathrm{T}}$ of the mobile robot, the current pose error of the robot in the global coordinate system is:\begin{align*} \tilde{\mathbf{q}}_{e}&=R(\tilde{\theta})(\mathbf{q}_{d}-\tilde{\mathbf{q}}) \\ &=\begin{bmatrix}\tilde{x}_{e}\\ \tilde{y}_{e}\\ \tilde{\theta}_{e}\end{bmatrix}=\begin{bmatrix}\cos\tilde{\theta} & \sin\tilde{\theta} & 0\\ -\sin\tilde{\theta} & \cos\tilde{\theta} & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}x_{d}-\tilde{x}\\ y_{d}-\tilde{y}\\ \theta_{d}-\tilde{\theta}\end{bmatrix}\tag{3}\end{align*} where the current pose of the robot consists of two parts: the ideal true pose and a bounded uncertainty caused by external disturbance or noise, \begin{equation*} \tilde{\mathbf{q}}=\mathbf{q}+\mathbf{n}\tag{4}\end{equation*}

FIGURE 2. Tracking control of NWMR.

The error state dynamics can be written as follows:\begin{align*} \dot{\tilde{x}}_{e}&=\omega\tilde{y}_{e}-v+v_{d}\cos\tilde{\theta}_{e} \\ \dot{\tilde{y}}_{e}&=-\omega\tilde{x}_{e}+v_{d}\sin\tilde{\theta}_{e} \\ \dot{\tilde{\theta}}_{e}&=\omega_{d}-\omega\tag{5}\end{align*}

Tracking control seeks $\mathbf{u}=(v,\omega)^{\mathrm{T}}$ that makes the current pose error converge to zero as $t\to\infty$. Similar to [6], our given kinematic control law is chosen as:\begin{align*} \mathbf{u}_{g}=\begin{bmatrix}v_{g}\\ \omega_{g}\end{bmatrix}=\begin{bmatrix}k_{1}\tilde{x}_{e}+v_{d}\cos\tilde{\theta}_{e}\\ 2v_{d}\tilde{y}_{e}\cos\frac{\tilde{\theta}_{e}}{2}+\omega_{d}+k_{2}\sin\frac{\tilde{\theta}_{e}}{2}\end{bmatrix}\tag{6}\end{align*} where $k_{1}$ and $k_{2}$ are positive constants.
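
The pose error (3) and the given control law (6) translate directly into code; the sketch below is a straightforward transcription, with the gains `k1` and `k2` left as free parameters to be tuned as in Section IV.

```python
import numpy as np

def pose_error(q_d, q_meas):
    """Pose error (3) expressed in the robot frame; q = [x, y, theta]."""
    xd, yd, thd = q_d
    x, y, th = q_meas
    dx, dy = xd - x, yd - y
    xe = np.cos(th) * dx + np.sin(th) * dy
    ye = -np.sin(th) * dx + np.cos(th) * dy
    return np.array([xe, ye, thd - th])

def given_control(q_e, v_d, w_d, k1, k2):
    """Given control law (6) derived from the Lyapunov design."""
    xe, ye, the = q_e
    v_g = k1 * xe + v_d * np.cos(the)
    w_g = 2.0 * v_d * ye * np.cos(the / 2.0) + w_d + k2 * np.sin(the / 2.0)
    return np.array([v_g, w_g])
```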

The stability of the closed-loop system can be proved according to Lyapunov theory; see the Appendix.

With the control input in (6), the NWMR will move to a new pose $\tilde{\mathbf{q}}'$ (containing bounded uncertainty, as in (4)) in the global coordinate system, so the one-step error is:\begin{align*} \tilde{\mathbf{q}}'_{e}&=R(\tilde{\theta}')(\mathbf{q}_{d}-\tilde{\mathbf{q}}') \\ &=\begin{bmatrix}\tilde{x}_{e}'\\ \tilde{y}_{e}'\\ \tilde{\theta}_{e}'\end{bmatrix}=\begin{bmatrix}\cos\tilde{\theta}' & \sin\tilde{\theta}' & 0\\ -\sin\tilde{\theta}' & \cos\tilde{\theta}' & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}x_{d}-\tilde{x}'\\ y_{d}-\tilde{y}'\\ \theta_{d}-\tilde{\theta}'\end{bmatrix}\tag{7}\end{align*}

In other words, the pose error of the NWMR has made a transition from the previous error $\tilde{\mathbf{q}}_{e}$ in (3) to the latest error $\tilde{\mathbf{q}}'_{e}$ in (7).

Theoretically, the tracking error gradually converges to zero as time tends to infinity:\begin{equation*} \lim_{t\to\infty}\left|\tilde{\mathbf{q}}_{e}'\right|\approx 0\tag{8}\end{equation*}

However, according to the error dynamics in the Appendix, the nonlinear kinematic controller can be suboptimal in finite time, and external disturbance or noise during tracking can instead lead the tracking error to converge to a certain positive value $c$ within a finite time $t_{0}$:\begin{equation*} \lim_{t\to t_{0}}\left|\tilde{\mathbf{q}}_{e}'\right|\geq c\tag{9}\end{equation*}

Besides, once the given control law (6) is determined, the convergence performance of the closed-loop system is fixed. It cannot adjust itself to obtain more precise control, so an additional strategy is needed to improve it.

SECTION III.

Hybrid Control Strategy Incorporating Deep Reinforcement Learning Approach

In this section, we employ a deep reinforcement learning method to help the NWMR learn the acquired control law from the tracking errors left by the given control law in Section II-B.

A. Finite MDP

To convert the acquired control problem of the wheeled mobile robot into a general RL problem, we model it as a finite Markov decision process [31], which is the fundamental framework of RL theory.

Firstly, we define the state $\mathbf{s}_{k}$ at the $k$th time step as:\begin{align*} \mathbf{s}_{k}&=\left[\tilde{\mathbf{q}}_{e}^{\mathrm{T}}(k),\ \mathbf{u}_{g}^{\mathrm{T}}(k),\ \tilde{\mathbf{q}}_{e}^{'\mathrm{T}}(k)\right] \\ &=\left[\tilde{x}_{e}(k), \tilde{y}_{e}(k), \tilde{\theta}_{e}(k), v_{g}(k), \omega_{g}(k), \tilde{x}_{e}'(k), \tilde{y}_{e}'(k), \tilde{\theta}_{e}'(k)\right]\tag{10}\end{align*}

The action at the $k$th time step, called the acquired control law, is obtained with a deterministic policy $\mu$:\begin{equation*} \mathbf{a}_{k}=\mathbf{u}_{a}(k)=\left[v_{a}(k),\ \omega_{a}(k)\right]^{\mathrm{T}}=\mu(\mathbf{s}_{k})\tag{11}\end{equation*}

Then, the hybrid tracking control input at the current time is:\begin{equation*} \mathbf{u}(k)=\mathbf{u}_{g}(k)+\mathbf{u}_{a}(k)\tag{12}\end{equation*}

The immediate reward $r_{k}$ at the $k$th time step is:\begin{equation*} r_{k}=-\left(\left|\tilde{x}_{e}'(k)\right| + \left|\tilde{y}_{e}'(k)\right| + \left|\tilde{\theta}_{e}(k)\right|\right)\tag{13}\end{equation*}

The cumulative reward of the whole learning process is calculated with a discount factor $0<\gamma\leq 1$:\begin{equation*} G_{k}=\sum_{i=1}^{N}\gamma^{i-1} r_{k+i}\tag{14}\end{equation*}

It should be mentioned that the state of our RL problem includes not only the tracking errors $\tilde{\mathbf{q}}_{e}(k)$ and $\tilde{\mathbf{q}}_{e}'(k)$, but also the given control vector $\mathbf{u}_{g}$, which is one of the key parts of our method; omitting it could cause our strategy to fail.

Therefore, the state vectors $\mathbf{s}_{k}$ constitute a finite state space $\mathcal{S}$, the adjustment control vectors $\mathbf{u}_{a}(k)$ constitute a finite action space $\mathcal{A}$, and together with the reward function $r_{k}$, a Markovian system for tracking control is complete. The tuple $\left[\mathbf{s}_{k}, \mathbf{a}_{k}, r_{k}, \mathbf{s}_{k+1}\right]$ is called a transition. In this problem, we seek an optimal policy $\mu^{*}$ that maximizes the cumulative reward with a DRL method.
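
A minimal sketch of how one transition of this MDP can be assembled from two consecutive pose errors and the given control is shown below; the replay buffer is only indicated as a comment, since its implementation is not specified here.

```python
import numpy as np

def build_state(q_e, u_g, q_e_next):
    """State (10): current error, given control, and one-step error (8-dimensional)."""
    return np.concatenate([q_e, u_g, q_e_next])

def reward(q_e, q_e_next):
    """Immediate reward (13): negative sum of absolute error terms."""
    return -(abs(q_e_next[0]) + abs(q_e_next[1]) + abs(q_e[2]))

# A transition [s_k, a_k, r_k, s_{k+1}] would then be stored, e.g.:
# replay_buffer.append((s_k, a_k, reward(q_e, q_e_next), s_next))
```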

B. Acquired Control Learning With Actor-Critic Architecture

Because the experience model is difficult to represent accurately with mathematical expressions, the method in this paper is built on the deep deterministic policy gradient (DDPG) algorithm [32]. The deterministic policy $\mu$ is represented by an actor network $\pi(s;\boldsymbol{\lambda})$ with parameters $\boldsymbol{\lambda}$, and a neural network with parameters $\boldsymbol{\beta}$ is used to represent the critic network $Q(s,a;\boldsymbol{\beta})$. Another two deep neural networks with parameters $\hat{\boldsymbol{\lambda}}$ and $\hat{\boldsymbol{\beta}}$ represent the target actor network $\hat{\pi}(s;\hat{\boldsymbol{\lambda}})$ and the target critic network $\hat{Q}(s,a;\hat{\boldsymbol{\beta}})$. The first two networks form the real-time networks, whose weights are updated at every step; the latter two form the target networks, which are updated with a soft strategy. During training, the parameters of all networks are updated with the transitions continuously stored in the replay buffer. The optimal acquired control law is obtained when training is done.

To update the actor network, we first calculate the gradient of the critic network. The TD error $L(\boldsymbol{\beta})$ of $Q(s,a;\boldsymbol{\beta})$ is defined as the mean squared error between the target Q value and the current Q value:\begin{equation*} L(\boldsymbol{\beta})=\frac{1}{N}\sum_{k=1}^{N}\left(T_{k}-Q(s_{k},a_{k}|\boldsymbol{\beta})\right)^{2}\tag{15}\end{equation*} where $N$ denotes the number of transitions in a mini-batch and $T_{k}$, serving as the label, is the output of the target critic network $\hat{Q}(s,a;\hat{\boldsymbol{\beta}})$:\begin{equation*} T_{k}=r(s_{k},a_{k})+\gamma\hat{Q}(s_{k+1},\hat{\pi}(s_{k+1}|\hat{\boldsymbol{\lambda}})|\hat{\boldsymbol{\beta}})\tag{16}\end{equation*}

Thus, the gradient of the TD error is:\begin{equation*} \nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta})=-\frac{2}{N}\sum_{k=1}^{N}\left(T_{k}-Q(s_{k},a_{k}|\boldsymbol{\beta})\right)\frac{\partial Q(s,a;\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}\tag{17}\end{equation*} The critic network is updated with:\begin{equation*} \boldsymbol{\beta}\gets\boldsymbol{\beta}+\mathrm{L_{c}}\cdot\nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta})\tag{18}\end{equation*} where $\mathrm{L_{c}}$ is the learning rate.

The actor network $\pi(s;\boldsymbol{\lambda})$ is the map from the input observation $\mathbf{s}_{k}\in\mathcal{S}$ to the output action $\mathbf{a}_{k}\in\mathcal{A}$. Assuming the transitions in the replay buffer are distributed according to a behavior policy $\phi(a|s)$ with state distribution density $\rho^{\phi}$, the objective function of the actor network can be defined as:\begin{align*} F_{\phi}(\pi_{\lambda})&=\int_{S}\rho^{\phi}(s)Q^{\pi}(s,\pi_{\lambda}(s))\,ds \\ &=\mathbb{E}_{s\sim\rho^{\phi}}\left[Q(s,\pi_{\lambda}(s))\right]\tag{19}\end{align*}

According to [33], the off-policy deterministic policy gradient is:\begin{align*} \nabla F_{\phi}(\pi_{\lambda})&\approx\int_{S}\rho^{\phi}(s)\nabla_{\lambda}\pi_{\lambda}(a|s)Q^{\pi}(s,a)\,ds \\ &=\mathbb{E}_{s\sim\rho^{\phi}}\left[\nabla_{\lambda}\pi_{\lambda}(s)\nabla_{a}Q^{\pi}(s,a)|_{a=\pi_{\lambda}(s)}\right]\tag{20}\end{align*}

When a mini-batch is sampled randomly from the replay buffer, the policy gradient is:\begin{align*} \nabla F_{\phi}(\pi_{\lambda})=\frac{1}{N}\sum_{k=1}^{N}\Big(&\nabla_{\lambda}\pi_{\lambda}(s|\lambda)|_{s=s_{k}} \\ &\cdot\nabla_{a}Q^{\pi}(s,a|\beta)|_{s=s_{k},a=\pi_{\lambda}(s)}\Big)\tag{21}\end{align*}

Thus, the actor network is updated with:\begin{equation*} \boldsymbol{\lambda}\gets\boldsymbol{\lambda}+\mathrm{L_{a}}\cdot\nabla F_{\phi}(\pi_{\lambda})\tag{22}\end{equation*} where $\mathrm{L_{a}}$ is the learning rate.

For training stability, the parameter vectors of the target critic and target actor networks are updated as:\begin{align*} \hat{\boldsymbol{\lambda}}&=\varepsilon\boldsymbol{\lambda}+(1-\varepsilon)\hat{\boldsymbol{\lambda}}\tag{23}\\ \hat{\boldsymbol{\beta}}&=\varepsilon\boldsymbol{\beta}+(1-\varepsilon)\hat{\boldsymbol{\beta}}\tag{24}\end{align*} where $\varepsilon\ll 1$.
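
For concreteness, one update step covering the critic loss (15)-(18), the policy gradient (21)-(22), and the soft target update (23)-(24) is sketched below in PyTorch. The paper's implementation uses TensorFlow; the network objects, optimizers, and batch layout here are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, eps=0.01):
    """One DDPG update step; `batch` holds tensors (s, a, r, s_next) for N transitions."""
    s, a, r, s_next = batch

    # Critic: TD target (16) and mean-squared TD error (15)
    with torch.no_grad():
        target = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()      # gradient of (15), cf. (17)
    critic_opt.step()           # parameter update, cf. (18)

    # Actor: deterministic policy gradient (21), ascent step (22)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (23), (24)
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - eps).add_(eps * p.data)
```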

Finally, when training is over, the optimal acquired control law $\mathbf{u}_{a}^{*}$ is obtained from the optimal actor network $\pi^{*}$ with the optimal parameter vector $\boldsymbol{\lambda}^{*}$:\begin{equation*} \mathbf{u}_{a}^{*}=\pi^{*}(\mathbf{s};\boldsymbol{\lambda}^{*})\tag{25}\end{equation*}

We use a fully connected model to build the target actor network, so it can be expressed with the following forward model:\begin{align*} \mathbf{l}_{1}&=f_{1}(\mathbf{W}^{*}_{1}\mathbf{s}+\mathbf{b}^{*}_{1}) \\ \mathbf{l}_{2}&=f_{2}(\mathbf{W}^{*}_{2}\mathbf{l}_{1}+\mathbf{b}^{*}_{2}) \\ &\ \ \vdots \\ \mathbf{l}_{n-1}&=f_{n-1}(\mathbf{W}^{*}_{n-1}\mathbf{l}_{n-2}+\mathbf{b}^{*}_{n-1}) \\ \mathbf{u}^{*}_{a}&=f_{n}(\mathbf{W}^{*}_{n}\mathbf{l}_{n-1}+\mathbf{b}^{*}_{n})\tag{26}\end{align*} where $\mathbf{l}_{i}$ is the output of the $i$th layer, $\mathbf{W}_{i}^{*},\mathbf{b}_{i}^{*}\in\boldsymbol{\lambda}^{*}$ are the network parameters, $f_{i}$ is the activation function of the $i$th layer, and $n$ is the number of layers.
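
At deployment time only the forward pass (26) of the trained actor is needed. A minimal NumPy version is sketched below, assuming tanh activations throughout, which is an illustrative choice rather than the exact configuration in Tab. 2.

```python
import numpy as np

def actor_forward(s, weights, biases, act=np.tanh):
    """Forward model (26) of a fully connected actor with trained parameters."""
    l = np.asarray(s, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        l = act(W @ l + b)                     # hidden layers l_1 ... l_{n-1}
    return act(weights[-1] @ l + biases[-1])   # acquired control u_a*
```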

So far, the hybrid tracking control law for the NWMR is obtained after training, combining the given control from the kinematic control method and the acquired control from the DRL method. The control block diagram is shown in Fig. 3 and Fig. 4. The pseudocode of our tracking control method is given in Algorithm 1:

FIGURE 3. Control block diagram of our hybrid control strategy for NWMR.

FIGURE 4. Actor-critic architecture in our hybrid control strategy.

Algorithm 1 Hybrid Strategy of Tracking Control for NWMR

Require: $\mathbf{q}_{r}, \mathbf{q}_{0}, k_{1}, k_{2}, v_{d}, \omega_{d}, N, \gamma, \varepsilon$
1: Initialize/load actor network and critic network, $\boldsymbol{\lambda}, \boldsymbol{\beta}$
2: Initialize target networks, $\hat{\boldsymbol{\lambda}}=\boldsymbol{\lambda}, \hat{\boldsymbol{\beta}}=\boldsymbol{\beta}$
3: Initialize replay buffer
4: for episode = 1 to Max-ep do
5:   Get the initial pose observation of the NWMR, compute the initial state $\mathbf{q}_{e}^{0}, \mathbf{u}_{g}^{0}, \mathbf{q}_{e}^{0'}$
6:   Initialize cumulative error to zero
7:   for step = 1 to Max-step do
8:     Compute given control $\mathbf{u}_{g}^{k}$ according to $\mathbf{q}_{e}^{k}$, (6)
9:     Compute acquired control $\mathbf{u}_{a}^{k}$ according to $\left[\mathbf{q}_{e}^{k}, \mathbf{u}_{g}^{k}, \mathbf{q}_{e}^{'k}\right]$, (Sec. III-B)
10:    Execute $\mathbf{u}^{k}=\mathbf{u}_{g}^{k}+\mathbf{u}_{a}^{k}$
11:    Store the $k$th transition to the replay buffer
12:    if number of transitions > Memory then
13:      Randomly extract a batch of transitions from the replay buffer
14:      Update the actor network and critic network, (18), (22)
15:      Update the target networks, (23), (24)
16:    end if
17:  end for
18: end for

SECTION IV.

Simulation Results

In this section, simulations are developed to demonstrate that the proposed method achieves tracking control of the NWMR effectively. We first track a circle, defined as:\begin{align*} x_{d}&=2\cos\theta \\ y_{d}&=2\sin\theta\end{align*} The cumulative error, including pose error and control error, is introduced as a criterion of tracking performance; a larger value (the error is negative) indicates better tracking accuracy:\begin{align*} E=-\sum_{k=1}^{N}\Big[&\xi_{1}\left(\left|x_{e}(k)\right| + \left|y_{e}(k)\right|\right) + \xi_{2}\left|\theta_{e}(k)\right| \\ &+\xi_{3}\left(\left|v_{d}-v(k)\right| + \left|\omega_{d}-\omega(k)\right|\right)\Big]\end{align*} where $\xi_{1}$, $\xi_{2}$, $\xi_{3}$ are weight coefficients corresponding to the position error, orientation error, and control vector error.
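
The criterion $E$ can be evaluated offline from the logged errors and control inputs; a direct transcription is sketched below, with the weights $\xi_{1}$, $\xi_{2}$, $\xi_{3}$ left as free parameters.

```python
def cumulative_error(x_e, y_e, th_e, v, w, v_d, w_d, xi1, xi2, xi3):
    """Cumulative tracking criterion E over the logged steps (closer to zero is better)."""
    E = 0.0
    for k in range(len(x_e)):
        E -= (xi1 * (abs(x_e[k]) + abs(y_e[k]))
              + xi2 * abs(th_e[k])
              + xi3 * (abs(v_d - v[k]) + abs(w_d - w[k])))
    return E
```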

The bounded values in (2) are $v_{max}=2\,\mathrm{m/s}$, $\omega_{max}=1\,\mathrm{rad/s}$, $a_{max}=1\,\mathrm{m/s^{2}}$, and $\alpha_{max}=1.5\,\mathrm{rad/s^{2}}$. The parameters for tracking the circle are given in Tab. 1. $k_{1}$ and $k_{2}$ are fine-tuned with the above criterion $E$, and $v_{d}=1\,\mathrm{m/s}$, $\omega_{d}=0.5\,\mathrm{rad/s}$.

TABLE 1. Parameters of Tracking Circle

The uncertainty in (4) is chosen as a periodic disturbance:\begin{align*} \mathbf{n}=\begin{bmatrix}n_{x}\\ n_{y}\\ n_{\theta}\end{bmatrix}=\begin{bmatrix}0.002\sin(\pi t)\\ 0.002\cos(\pi t)\\ 0.005\sin(\pi t)\end{bmatrix}\end{align*}
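
The reference circle and the periodic disturbance used in this simulation can be generated as below; the reference heading is taken tangent to the circle, which is an assumption consistent with $v_{d}$ and $\omega_{d}$ rather than a formula stated in the paper.

```python
import numpy as np

def circle_reference(t, w_d=0.5, radius=2.0):
    """Desired pose on the circle x_d = 2*cos(theta), y_d = 2*sin(theta)."""
    ang = w_d * t
    x_d, y_d = radius * np.cos(ang), radius * np.sin(ang)
    th_d = ang + np.pi / 2.0          # tangent heading (assumed)
    return np.array([x_d, y_d, th_d])

def periodic_disturbance(t):
    """Bounded uncertainty n added to the measured pose, as in (4)."""
    return np.array([0.002 * np.sin(np.pi * t),
                     0.002 * np.cos(np.pi * t),
                     0.005 * np.sin(np.pi * t)])
```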

The network architecture in Section III-B is built with the aid of TensorFlow.1 The actor/target actor networks use a deep fully connected structure similar to that of the critic/target critic networks; the hyperparameters are shown in Tab. 2. Besides, the maximum size of the replay buffer is 5000, the batch size is 32, and the learning rates of the actor and critic networks are 0.001 and 0.002, respectively. Training runs for a total of 400 episodes with 200 steps per episode, the sampling time is $0.1\,\mathrm{s}$, and the initial pose of the NWMR is $(0, 0, 0)^{\mathrm{T}}$.

TABLE 2. Hyperparameters of Network

The results of our proposed method are shown in Figs. 5–9. The trajectory of the NWMR can be seen in Fig. 5: the desired trajectory is shown with a dotted line, and the actual trajectory with a solid line. From Fig. 6, it can be observed that the tracking errors all converge to near zero. The given control signals, acquired control signals, and final hybrid control inputs of (12) are shown in Figs. 7–9, respectively. The acquired control inputs act throughout the whole process, which proves the effectiveness of our method.

FIGURE 5. Trajectory of tracking circle with our method.

FIGURE 6. Errors of tracking circle with our method.

FIGURE 7. Given control signals of tracking circle with our method.

FIGURE 8. Acquired control signals of tracking circle with our method.

FIGURE 9. Hybrid control inputs of tracking circle with our method.

For comparison, we also test the performance of the classical method, in which only the given control approach (6) works. The results are depicted in Figs. 10–11. Comparing Fig. 10 with Fig. 5, our method obviously performs better, and the comparison of Fig. 11 with Fig. 6 also proves this point. In fact, the cumulative error of the classical method in Fig. 11 is -202.0255, while that of our method in Fig. 5 is -110.4874. Thus, the addition of $\mathbf{u}_{a}$ in our method indeed improves tracking performance.

FIGURE 10. Trajectory of NWMR with classical control method.

FIGURE 11. Errors of tracking circle with classical control method.

Besides, we also test the learning method alone, in which only the acquired control (11) works; the results are depicted in Figs. 12–15. Comparing Fig. 12 with Fig. 5, the circle-tracking performance is similar; the cumulative error of the learning method in Fig. 12 is -110.1912. However, the comparison of the training processes in Fig. 14 and Fig. 15 shows that our proposed method converges to a stable reward within 300 episodes, and the fluctuation of the reward (Y axis) in the former is smaller than in the latter, which proves the superiority of our method in the training process.

FIGURE 12. Trajectory of NWMR with learning control method.

FIGURE 13. Errors of tracking circle with learning control method.

FIGURE 14. Training process of tracking circle with our method, taking Y = −150 (red dotted line) as reference.

FIGURE 15. Training process of tracking circle with only the learning method, taking Y = −150 (red dotted line) as reference.

To further demonstrate the effectiveness of our method, we conduct another simulation to track a spiral trajectory, defined as follows:\begin{align*} x_{d}&=0.04t\cos(0.5t) \\ y_{d}&=0.04t\sin(0.5t)\end{align*}

The uncertainty in (4) is chosen as a random disturbance:\begin{equation*} \mathbf{n}=0.002\,\boldsymbol{\sigma}\end{equation*} where $\boldsymbol{\sigma}\sim\mathcal{N}(0,1)$.

The parameters are given in Tab. 3. The size of the replay buffer is 5000, the batch size is 32, and the learning rates of the actor and critic networks are 0.0002 and 0.001, respectively. The maximum number of episodes is 800 and the maximum number of steps per episode is 250, while the other parameters remain unchanged.

TABLE 3. Parameters of Tracking Spiral

The results are depicted in Figs. 16–20. The trajectory of the NWMR is shown in Fig. 16, the tracking errors are depicted in Fig. 17, and the given control signals, acquired control signals, and hybrid control inputs can be seen in Figs. 18–20. In Fig. 18, the given angular velocity already exceeds its upper bound, but the hybrid control inputs in Fig. 20 remain bounded.

FIGURE 16. Trajectory of tracking spiral with our method.

FIGURE 17. Errors of tracking spiral with our method.

FIGURE 18. Given control signals of tracking spiral with our method.

FIGURE 19. Acquired control signals of tracking spiral with our method.

FIGURE 20. Hybrid control inputs of tracking spiral with our method.

The results with only the classical control approach are also illustrated for comparison in Fig. 21 and Fig. 22. The cumulative error of the classical method in Fig. 21 is -358.0541, while that of our method in Fig. 16 is -93.7636, which again proves the effectiveness of our proposed method. The results with only the learning method are shown in Figs. 23–26. In Fig. 23, the cumulative error is -114.4648; compared to Fig. 16, the tracking performance of our method is still better. According to the training processes in Fig. 25 and Fig. 26, our proposed method is obviously more stable as well.

FIGURE 21. Trajectory of tracking spiral with classical method.

FIGURE 22. Errors of tracking spiral with classical method.

FIGURE 23. Trajectory of tracking spiral with learning method.

FIGURE 24. Errors of tracking spiral with learning method.

FIGURE 25. Training process of tracking spiral with our method, taking Y = −110 (red dotted line) as reference.

FIGURE 26. Training process of tracking spiral with only the learning method, taking Y = −110 (red dotted line) as reference.

In summary, for circle tracking, our proposed method achieves tracking performance similar to the learning method, but with better convergence; for spiral tracking, our proposed method evidently has advantages in both tracking and convergence performance.

SECTION V.

Conclusion

In this research, the tracking control of the NWMR with constraints and uncertainty has been addressed by our proposed hybrid control strategy, which is a combination of a model-based control method and a learning-based method. The kinematic control serves as the given control (like “the talent”), while the actor-critic based DRL method learns an acquired control law to compensate for the existing errors (like “the experience”). The results have demonstrated the effectiveness of our proposed method, and the comparisons show that our method achieves a smaller cumulative error; meanwhile, it is more stable and efficient than the purely learning-based method.

The strategy provided in this work improves tracking and convergence performance, which is a vital function for an autonomous mobile robot. Although our method has been tested on the tracking control of the NWMR, it could also be applied to other complicated control problems.

Appendix

Substituting (6) into (5), the error dynamics can be rewritten as:\begin{align*} \dot{x}_{e}&=-k_{1}x_{e}+2v_{d}y_{e}^{2}\cos\frac{\theta_{e}}{2}+k_{2}y_{e}\sin\frac{\theta_{e}}{2}+\omega_{d}y_{e} \\ \dot{y}_{e}&=-2v_{d}x_{e}y_{e}\cos\frac{\theta_{e}}{2}-\omega_{d}x_{e}-k_{2}x_{e}\sin\frac{\theta_{e}}{2}+v_{d}\sin\theta_{e} \\ \dot{\theta}_{e}&=-2v_{d}y_{e}\cos\frac{\theta_{e}}{2}-k_{2}\sin\frac{\theta_{e}}{2}\end{align*}

Defining the Lyapunov function:\begin{equation*} L=\frac{1}{2}x_{e}^{2}+\frac{1}{2}y_{e}^{2}-2\cos\frac{\theta_{e}}{2}\end{equation*}

Differentiating the Lyapunov function along time:\begin{align*} \dot{L}&=x_{e}\dot{x}_{e}+y_{e}\dot{y}_{e}+\dot{\theta}_{e}\sin\frac{\theta_{e}}{2} \\ &=x_{e}\left(-k_{1}x_{e}+2v_{d}y_{e}^{2}\cos\frac{\theta_{e}}{2}+k_{2}y_{e}\sin\frac{\theta_{e}}{2}+\omega_{d}y_{e}\right) \\ &\quad+y_{e}\left(-2v_{d}x_{e}y_{e}\cos\frac{\theta_{e}}{2}-\omega_{d}x_{e}-k_{2}x_{e}\sin\frac{\theta_{e}}{2}+v_{d}\sin\theta_{e}\right) \\ &\quad+\left(-v_{d}y_{e}\sin\theta_{e}-k_{2}\sin^{2}\frac{\theta_{e}}{2}\right) \\ &=-k_{1}x_{e}^{2}-k_{2}\sin^{2}\frac{\theta_{e}}{2}\leq 0\end{align*}

According to Lyapunov theory, the error dynamics asymptotically converge to zero.
