Introduction
With their rapid development in recent years, unmanned aerial vehicles (UAVs) have come to play important roles in various engineering fields. UAVs have been used to assist or replace humans in executing dirty, dull, and difficult missions owing to their low cost, high mobility, and unmanned nature [1]. Therefore, UAVs are widely used in surveillance, searching, tracking, and other missions [2]. Thus, how to improve the autonomy of UAVs while performing such tasks, so as to avoid risking human lives, has become a research focus in various fields. For instance, UAVs have been used to deliver relief supplies, extinguish fires, and so on [3]–[5]. Consequently, improving the autonomous flight capability of UAVs has become one of the key issues for engineering applications [6]–[9].
Nowadays, UAVs are mainly used to execute tasks that can be conducted automatically instead of manually, such as target tracking, long-distance delivery, and patrol. One of the important technical issues in these tasks is to find an optimal path from the start point to the end point and to design a controller that steers the UAV along that path. The optimal path can be found by path planning algorithms [10], such as the visibility graph [11], random-sampling search algorithms including the rapidly-exploring random tree [12] and the probabilistic roadmap [13], heuristic algorithms including A-Star [14], Sparse A-Star [15], and D-Star [16], and genetic algorithms [17]. Then, a controller can be designed to make the UAV follow the planned path using various trajectory tracking algorithms [18]. However, the solution outlined above has some disadvantages. For example, finding the optimal path relies on prior knowledge about the environment, but terrain and obstacle data are so difficult to obtain that the capability of environment modelling is limited [19], [20]. Moreover, for a dynamic environment with moving obstacles [21], the scheme designed above is not flexible enough to alter its control strategy immediately; paths have to be replanned to adapt to changes in the environment. Furthermore, because conventional algorithms need considerable time to calculate the optimal path, it is difficult to apply them to real-time problems. Therefore, it is necessary to design an end-to-end algorithm that can operate autonomous UAV flight in a dynamic environment without path planning and trajectory tracking.
A research highlight was inspired by AlphaGo, developed by Google based on deep reinforcement learning (DRL) [22], and by the end-to-end decision-making algorithm called deep Q network (DQN), which can play Atari games [23]. The performance of DQN reached human level after extensive training, which has attracted many researchers from various fields to study the applications of DRL to all kinds of engineering problems [8], [9]. Meanwhile, the deep deterministic policy gradient (DDPG) [24] was proposed to overcome the dimension explosion caused by the continuity of the action space and state space. The experience replay method, which samples from an experience buffer, was adopted in these algorithms to allow agents to remember and learn from historical data. Uniform experience replay (UER) forms the training set by sampling uniformly from the experience buffer. However, UER does not fully exploit the diversity of historical data, which lowers the convergence rate of the policy and may even cause divergence after extensive training. Therefore, prioritized experience replay (PER) [25] was proposed to improve the efficiency of learning from experience. It exploits the potential value of historical data to increase the convergence rate of the policy network, because a priority model evaluates how profitable each sample is for training the policy network at the current step.
In addition, designing the reward function of the target problem is important for training the policy. Traditional formulations of the reward function come from the original model of the problem [23], [24], such as CartPole, Pendulum, and other Atari games. Thus, how to construct an appropriate reward function is not the focus of most popular works, as they mainly concentrate on improving algorithms instead of modelling. However, defining a modified model is also essential for solving problems from specific research fields when we attempt to apply DRL to a new problem. Traditionally, the reward function involved in the problem model is designed from human experience [26], [27], so the trained policy relies heavily on the capability of the designer. Although some recently published works found effective policies, different formulations of the reward function lead to different training processes and trained policies.
In the present work, we aim to tackle the challenges mentioned above and focus on the UAV maneuvering decision-making algorithm for autonomous air-delivery, including the guidance towards area and guidance towards specific point tasks. The main contributions of this paper are summarized as follows:
The UAV maneuvering decision-making model for air-delivery is built based on Markov decision processes (MDPs) [28], [29]. Particularly, we refine the guidance towards area and guidance towards specific point tasks involved in the air-delivery problem. Meanwhile, we design the flight state space, the flight action space, and the reward function of each task. Among the components of the model, we draw on traditional air-to-ground fire control theory to design and construct the UAV maneuvering decision-making model for air-delivery.
We propose the UAV maneuvering decision-making algorithm for autonomous air-delivery based on DDPG with PER sampling (PER-DDPG) to optimize the maneuvering policy. Specifically, we design the policy function with a deep neural network and generate the training samples based on PER. Moreover, we present a construction method for the reward function based on expert experience and domain knowledge.
Experiments demonstrate that the proposed algorithm improves the autonomy of the UAV during the air-delivery process and that the presented construction method of the reward function is beneficial to the convergence of the maneuvering policy and even improves the quality of the policy's output.
This paper is organized as follows: Section 2 describes the background knowledge of the methods used to design the UAV maneuvering decision-making model and algorithm for air-delivery. Section 3 presents the details of the designed experiments and compares the training metrics under different sampling methods and reward functions. Section 4 concludes our work and looks forward to future research.
Methodology
With the rapid development of UAV technology, UAVs have been used to execute various dangerous and repetitive missions, such as electric power inspection, crop protection, wildlife surveillance, traffic monitoring, and rescue operations. Demands for more capable yet simple UAV autonomous flight solutions have emerged. As mentioned above, the traditional solution for UAV guidance first plans an optimal path and then makes the UAV follow that path with a trajectory tracking method.
In this paper, we describe the process of UAV autonomous air-delivery in detail and define the guidance towards area and guidance towards specific point tasks involved in the air-delivery problem. Then, we construct the UAV maneuvering decision-making model for air-delivery based on MDPs. Meanwhile, we present a construction method for the reward function in which expert experience and domain knowledge are given due consideration. Finally, we propose the UAV maneuvering decision-making algorithm based on DRL.
As shown in Fig. 1, we first construct the UAV maneuvering decision-making model for air-delivery, consisting of the guidance towards area and guidance towards specific point tasks, based on MDPs. Within this model, we design the action space, state space, and basic reward of each task, which characterize UAV autonomous flight during air-delivery. Moreover, we design and implement the UAV maneuvering decision-making algorithm, including the maneuvering decision-making policy based on a neural network, and the policy network is optimized with data sampled from historical experience by PER. Meanwhile, we construct the shaping reward of each task to increase the convergence rate of the policy network and to improve the quality of the policy's output.
2.1 UAV Maneuvering Decision-Making Model for Air-Delivery Based on MDPs
At present, most sequential decision-making problems, also called multi-stage decision-making problems, can be modeled by MDPs. Meanwhile, most researchers focusing on autonomous control and decision-making describe problems and construct problem models based on MDPs, and we do the same for the UAV maneuvering decision-making model for air-delivery. As shown in Fig. 2, we use MDPs to design and construct the UAV maneuvering decision-making model for the guidance towards area and guidance towards specific point tasks included in air-delivery. We design and implement the simulator core consisting of the UAV kinematic model, the bomb model, and the target point. Particularly, we present the state space, action space, and reward function of the UAV maneuvering decision-making model for air-delivery in the light of MDPs.
Construction diagram of UAV maneuvering decision-making model for air-delivery based on MDPs
2.1.1 MDPs
During the process of performing an air-delivery mission, UAV maneuvering decision-making can be regarded as a sequential decision process. Moreover, the operator of a UAV usually considers only the current information from the environment while selecting the optimal action. Therefore, we can consider this decision process Markovian and use MDPs to construct the UAV maneuvering decision-making model for air-delivery.
The MDPs can be described by a tuple
\begin{equation*}\{T,S,A(s),P(\cdot\vert s,a),R(s,a)\}\end{equation*}
As shown in Fig. 3, MDPs can be described as follows: when the state of the environment is initialized to $s_{0}\in S$, the agent selects an action $a_{t}\in A(s_{t})$ at each decision step $t\in T$ according to its policy $\pi$; the environment then transitions to the next state $s_{t+1}$ according to the transition probability $P(\cdot\vert s_{t},a_{t})$ and feeds back the reward $R(s_{t},a_{t})$ to the agent.
During the process of interactions between the environment and the agent, a sequence of rewards is accumulated. The objective is to find a policy $\pi$ whose value function attains the optimal value \begin{equation*}
v(s)= \sup\limits_{\pi}v(s,\pi),\ s\in S. \tag{1}\end{equation*}
Based on the characteristics of the UAV maneuvering decision-making problem for air-delivery, we use the infinite-horizon discounted model as the utility function
\begin{equation*}
v(s,\pi)=\sum\limits_{t=0}^{\infty}\gamma^{t}\mathrm{E}_{\pi}^{s}[R(s_{t},a_{t})],\ s\in S. \tag{2}\end{equation*}
In the above formula, $\gamma\in[0,1)$ is the discount factor and $\mathrm{E}_{\pi}^{s}[\cdot]$ denotes the expectation under policy $\pi$ starting from state $s$.
In the following, we will first demonstrate the problem definition of the air-delivery mission. Then, the state space, the action space, and the reward function of each task will be designed.
2.1.2 Definition of Guidance Towards Area and Guidance Towards Specific Point Tasks Involved in Air-Delivery Mission
Before running a reinforcement-learning-based algorithm, we should construct a simulation model of the problem to be solved. Thus, we refine two tasks involved in air-delivery: the guidance towards area task and the guidance towards specific point task. As shown in Fig. 2, we first construct the UAV kinematic model and the bomb model. In this paper, we adopt a dynamic model of the air-delivery mission based on a 3-degree-of-freedom (3-DoF) kinematic model of the UAV [6]. On the other hand, we also design a 3-DoF kinematic model of the bomb [30], [31] to calculate the external ballistics parameters of the uncontrolled bomb. When the position and attitude of the UAV at the moment of release are known, the ballistic model predicts the impact point of the uncontrolled bomb.
Based on the 3-DoF kinematic model of the UAV, the flight state is defined by the UAV's position, altitude $H_{\text{UAV}}$, velocity $v_{\text{UAV}}$, and heading angle $\psi_{\text{UAV}}$.
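For concreteness, a minimal sketch of such a point-mass kinematic model is given below. The state layout, the constant airspeed and altitude, the integration step, and the single lateral-overload command steering the heading (matching the action $N_{z}$ defined later) are simplifying assumptions for illustration, not the exact model of [6] or [30], [31].

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def step_uav_3dof(state, n_z, dt=0.1, v=50.0):
    """One integration step of a simplified 3-DoF (point-mass) UAV model.

    state : (x, y, h, psi) -- horizontal position [m], altitude [m], heading [rad]
    n_z   : lateral overload command (the single action of the model)
    dt    : integration step [s]
    v     : constant airspeed [m/s] (level flight assumed for simplicity)
    """
    x, y, h, psi = state
    # A lateral overload produces a turn rate in level flight: psi_dot = g * n_z / v
    psi_dot = G * n_z / v
    psi = (psi + psi_dot * dt + np.pi) % (2.0 * np.pi) - np.pi  # wrap to [-pi, pi)
    # Constant-speed, constant-altitude kinematics
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    return np.array([x, y, h, psi])
```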
In this paper, the guidance towards area task and guidance towards specific point task are defined and described, considering the task load type and launching mode.
(i) Guidance Towards Area Task
When the UAV flies with a controlled bomb, or another steerable load, the guidance target of the UAV is usually a broad area. Thus, while using the controlled bomb, we consider the target of the UAV to be an area containing the mission point.
Fig. 4 is a vector diagram of the guidance towards area task, where the line-of-sight (LOS) vector from the UAV position $\boldsymbol{X}_{\text{UAV}}$ to the target position $\boldsymbol{X}_{\text{TGT}}$ is defined as
\begin{equation*}\boldsymbol{D}_{\text{LOS}}=\boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{UAV}}. \tag{3}\end{equation*}
Thereby, we could define the successful termination condition of guidance towards area task
\begin{equation*}\Vert \boldsymbol{D}_{\text{LOS}}\Vert_{2}\leqslant R_{\text{guidance}}. \tag{4}\end{equation*}
In the formula above, $R_{\text{guidance}}$ is the radius of the mission area; when the distance between the UAV and the target point is no larger than $R_{\text{guidance}}$, the guidance towards area task is considered successfully terminated.
(ii) Guidance Towards Specific Point Task
When the mission load of the UAV is an uncontrolled bomb, the target of the task is a specific point instead of an area. As shown in Fig. 5, if the UAV flies with an uncontrolled bomb, it performs the guidance towards specific point task and adjusts its heading so that the predicted impact point of the bomb coincides with the target point.
In Fig. 5, $\psi_{\text{UAV}}$ is the heading angle of the UAV and $\psi_{\text{LOS}}$ is the azimuth of the LOS vector, so the relative azimuth is defined as \begin{equation*}\delta_{\psi_{\text{LOS}}}=\psi_{\text{UAV}}-\psi_{\text{LOS}}. \tag{5}\end{equation*}
Meanwhile, the predicted impact point of the bomb is obtained from the UAV position and the ballistic offset $\boldsymbol{A}_{\text{Bomb}}$ given by the bomb model:
\begin{equation*}\boldsymbol{X}_{\text{Bomb}}=\boldsymbol{X}_{\text{UAV}}+\boldsymbol{A}_{\text{Bomb}}. \tag{6}\end{equation*}
Moreover, we could define the successful termination condition of guidance towards specific point task
\begin{equation*}\begin{cases}\left\vert \delta_{\psi_{\text{LOS}}}\right\vert \leqslant\delta_{\psi}\\ \Vert \boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{Bomb}}\Vert_{2}\leqslant \delta_{A}\end{cases} \tag{7}\end{equation*}
2.1.3 State Space, Action Space, and Basic Reward of Each Task
As mentioned above, we present the UAV maneuvering decision-making model for air-delivery based on MDPs. Therefore, we should design the state space, action space, and reward function of each task refined in air-delivery based on the problem definitions described above.
(i) State Space of Guidance Towards Area Task
Considering the purpose of the guidance towards area task, we define its state space as
\begin{equation*}
S_{\text{guidance}}=\{D_{\text{LOS}},\delta_{\psi_{\text{LOS}}},v_{\text{UAV}},H_{\text{UAV}}\} \tag{8}\end{equation*}
(ii) State Space of Guidance Towards Specific Point Task
For the guidance towards specific point task, we can define its state space as
\begin{equation*}
S_{\text{aim}}=\{D_{\text{LOS}},\delta_{\psi_{\text{LOS}}},v_{\text{UAV}},H_{\text{UAV}},A_{\text{Bomb}}\} \tag{9}\end{equation*}
(iii) Action Space of Each Task
Based on the flight simulation model of UAV we construct, we can establish the action space as
\begin{equation*}
A(s)=\{N_{z}\}.\tag{10}\end{equation*}
In the guidance towards specific point task, the action space is defined similarly to (10).
(iv) Reward Function of Each Task
Moreover, considering the successful termination condition of each task, we define the reward function as
\begin{equation*}
R(s,a)=\begin{cases}
1.0,\ \ \text{Successful termination}\\
0.0,\ \ \text{Failed}\end{cases}. \tag{11}\end{equation*}
The formula above shows that if the UAV's situation satisfies the successful termination condition of the corresponding task, the agent receives a reward of 1.0; otherwise, the reward is 0.0.
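As an illustration, the termination conditions (4) and (7) and the sparse reward (11) can be written compactly as below; the threshold values are placeholders rather than the parameters used in the experiments.

```python
import numpy as np

def area_task_reward(x_uav, x_tgt, r_guidance=500.0):
    """Sparse reward of the guidance towards area task, cf. (3), (4) and (11)."""
    d_los = np.linalg.norm(x_tgt - x_uav)          # LOS distance, (3)
    success = d_los <= r_guidance                  # termination condition (4)
    return (1.0 if success else 0.0), success

def point_task_reward(x_uav, a_bomb, x_tgt, psi_uav, psi_los,
                      delta_psi_max=np.radians(2.0), delta_a_max=30.0):
    """Sparse reward of the guidance towards specific point task, cf. (5)-(7) and (11)."""
    delta_psi = psi_uav - psi_los                  # heading error w.r.t. LOS, (5)
    x_bomb = x_uav + a_bomb                        # predicted impact point, (6)
    miss = np.linalg.norm(x_tgt - x_bomb)          # predicted miss distance
    success = abs(delta_psi) <= delta_psi_max and miss <= delta_a_max  # (7)
    return (1.0 if success else 0.0), success
```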
2.2 UAV Maneuvering Decision-Making Algorithm for Air-Delivery Based on PER-DDPG
After constructing the simulation environments, we can design the corresponding algorithm to solve the problem. In this paper, we propose the UAV maneuvering decision-making algorithm for the air-delivery mission based on PER-DDPG [6] with expert experience and domain knowledge. It is composed of the PER module that generates training samples, the policy consisting of an actor network and a critic network, and the shaping reward that improves the quality of the policy's output and increases the convergence rate of the policy, as shown in Fig. 6. Particularly, we introduce expert experience and domain knowledge to design the shaping reward and thereby improve the performance of the proposed algorithm.
2.2.1 Framework of PER-DDPG
PER-DDPG is a model-free, off-policy, DRL-based algorithm with an actor-critic architecture. It can effectively solve MDP problems with continuous state and action spaces. Meanwhile, because the original DDPG does not consider the diversity of data and does not fully utilize historical experience, policies trained by DDPG exhibit a low convergence rate and poor stability. Therefore, the PER is used to generate the training data, which improves the utilization of the potential value of historical data, thereby increasing the convergence rate and enhancing the stability of the trained policy. PER-DDPG has been verified in an autonomous airdrop task, showing higher performance than the original DDPG.
The framework of PER-DDPG is shown in Fig. 7, which is composed of the evaluation networks, the target networks, the PER module, and other components. At each decision-making step, the evaluation actor network outputs an action with exploration noise according to the current state. Then, the current state, action, reward, and next state are packaged as a transition and stored in the historical transition buffer.
Specifically, at each moment, the policy gives action by
\begin{equation*}
a_{t}=\mu(s_{t}\vert \theta_{\mu})\tag{12}\end{equation*}
2.2.2 UAV Maneuvering Decision-Making Policy Based on Neural Network
As mentioned above, DDPG is a kind of DRL algorithm constructed on the actor-critic framework. During the training process, the actor network $\mu(s\vert\theta_{\mu})$ outputs an action according to the current state, and the critic network $Q(s,a\vert\theta_{Q})$ estimates the action value used to evaluate and improve the actor.
(i) Actor Network
The actor network $\mu(s\vert\theta_{\mu})$ takes the state vector as input and outputs the maneuvering action $N_{z}$ through several fully connected layers.
(ii) Critic Network
The critic network $Q(s,a\vert\theta_{Q})$ takes the state vector and the action as inputs and outputs the estimated action value, which is used to evaluate the quality of the actor's output.
In addition, before running the networks, the input values should be normalized to eliminate the influence of their different physical units. Moreover, the structures of the target networks $\mu^{\prime}(s\vert\theta_{\mu^{\prime}})$ and $Q^{\prime}(s,a\vert\theta_{Q^{\prime}})$ are identical to those of the corresponding evaluation networks.
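As a concrete illustration, a minimal PyTorch sketch of such an actor-critic pair is given below; the layer widths, activation functions, and the tanh scaling of the output to an assumed overload limit are illustrative choices, not the exact structures reported in the tables.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps a normalized state to an action in [-nz_max, nz_max]."""
    def __init__(self, state_dim, nz_max=3.0, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),       # bounded output in [-1, 1]
        )
        self.nz_max = nz_max

    def forward(self, s):
        return self.nz_max * self.net(s)           # scale to the assumed overload limit

class Critic(nn.Module):
    """Q(s, a | theta_Q): evaluates a state-action pair with a scalar value."""
    def __init__(self, state_dim, action_dim=1, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                  # scalar action value
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```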
2.2.3 Shaping Reward Function Based on Expert Experience and Domain Knowledge
Although the algorithm we propose could learn an optimal policy according to the reward function shown in (11), there is a serious challenge that affects the convergence rate of the policy: the rewards returned by the environment are so sparse that the agent seldom encounters transitions with non-zero reward, from which useful experience could be learned.
Therefore, some researchers proposed a technique called reward shaping (RS) [32], which leverages expert knowledge to reconstruct the reward model of the target domain and thereby improve the agent's policy learning. More specifically, in addition to the reward from the environment, RS provides a shaping function $F$ that transforms the original MDP $M$ into a shaped MDP $M^{\prime}$: \begin{equation*}
M=\{T,S,A,P,R\}\rightarrow M^{\prime}=\{T,S,A,P,R^{\prime}\}. \tag{13}\end{equation*}
In the formula above, $R^{\prime}=R+F$. Policy invariance is guaranteed when $F$ is a potential-based reward shaping (PBRS) function built from a potential function $\varPhi(s)$ over states: \begin{equation*}
F(s,a,s^{\prime})=\gamma\varPhi(s^{\prime})-\varPhi(s) \tag{14}\end{equation*}
Without this restriction, an arbitrarily designed shaping function may mislead the agent into a cycle of states whose accumulated shaping reward is positive, i.e., \begin{equation*}
F(s_{1},a_{1}, s_{2})+F(s_{2},a_{2}, s_{3})+\cdots+F(s_{n},a_{n}, s_{1}) > 0.\end{equation*}
Fortunately, PBRS avoids this issue by making any state cycle yield zero accumulated shaping reward. Moreover, the optimal action-value functions of $M$ and $M^{\prime}$ satisfy \begin{equation*}
Q_{M^{\prime}}(s,a)=Q_{M}(s,a)-\varPhi(s). \tag{15}\end{equation*}
Thereby, if we construct the potential function over both the state and the action, we obtain the potential-based advice (PBA) shaping function \begin{equation*}
F(s,a,s^{\prime},a^{\prime})=\gamma\varPhi(s^{\prime},a^{\prime})-\varPhi(s,a). \tag{16}\end{equation*}
Similar to (15), the optimal action-value functions of $M$ and $M^{\prime}$ satisfy \begin{equation*}
Q_{M^{\prime}}(s,a)=Q_{M}(s,a)-\varPhi(s,a). \tag{17}\end{equation*}
Similarly, PBA is also a sufficient and necessary condition for preserving policy invariance. Therefore, once the optimal policy in $M^{\prime}$ is obtained, the optimal policy in $M$ can be recovered as \begin{equation*}\mu_{M}(s)= \underset{a\in A(s)}{\arg\max}[Q_{M^{\prime}}(s,a)+\varPhi(s,a)].\tag{18}\end{equation*}
Thus, we propose a construction approach to design the shaping functions of the air-delivery mission based on PBRS and PBA, introducing expert experience and domain knowledge, as shown in Fig. 10.
(i) Shaping Function of Guidance Towards Area Task
When the UAV is performing the guidance towards area task, the distance between the current position and the target point, together with the relative azimuth between the LOS and the nose direction of the UAV, are the main influencing factors. Therefore, we can construct the shaping function as
\begin{gather*}
F_{\text{guidance}}(s,a, s^{\prime})=\\ \gamma[\varPhi_{d}(s^{\prime})+ \varPhi_{\psi}(s^{\prime})]-[\varPhi_{d}(s)+\varPhi_{\psi}(s)] \tag{19}\end{gather*}
\begin{gather*}\varPhi_{d}(s)=\frac{D_{\text{LOS}}^{\max }- D_{\text{LOS}}}{D_{\text{LOS}}^{\max}- D_{\text{LOS}}^{\min}}, \tag{20}\\ \varPhi_{\psi}(s)=\frac{\pi- \delta_{\psi_{\text{LOS}}}}{\pi}. \tag{21}\end{gather*}
According to the shaping function designed above, the agent can obtain the optimal policy more quickly than with the sparse reward model alone. More specifically, we can introduce expert experience to guide the agent towards the optimal state even faster, extending the shaping function with an advice term $R^{\dagger}$:
\begin{equation*}
F_{\text{guidance}}(s,a, s^{\prime},a^{\prime})=F_{\text{guidance}}(s,a, s^{\prime})+R^{\dagger} \tag{22}\end{equation*}
In the formula above, the advice reward $R^{\dagger}$ is defined as \begin{equation*}
R^{\dagger}=\begin{cases}
1.0,\ \varPhi(s^{\prime})\geqslant\varPhi(s)\\
0.0,\ \text{otherwise}\end{cases} \tag{23}\end{equation*}
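The shaping terms (19)-(23) translate directly into code, as in the sketch below; the distance bounds, the use of the absolute heading error in $\varPhi_{\psi}$, and the dictionary-style state layout are assumptions made for illustration.

```python
import numpy as np

def potential_area(d_los, delta_psi_los, d_max=20000.0, d_min=0.0):
    """Potential of the guidance towards area task: Phi_d + Phi_psi, cf. (20)-(21)."""
    phi_d = (d_max - d_los) / (d_max - d_min)
    phi_psi = (np.pi - abs(delta_psi_los)) / np.pi   # absolute heading error assumed
    return phi_d + phi_psi

def shaping_area(s, s_next, gamma=0.99, with_advice=True):
    """PBRS shaping (19) plus the expert-advice bonus (22)-(23).

    s, s_next : dicts with keys 'd_los' and 'delta_psi_los' (illustrative layout).
    """
    phi = potential_area(s['d_los'], s['delta_psi_los'])
    phi_next = potential_area(s_next['d_los'], s_next['delta_psi_los'])
    f = gamma * phi_next - phi                       # potential-based term, (19)
    if with_advice:
        f += 1.0 if phi_next >= phi else 0.0         # advice reward R_dagger, (23)
    return f
```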
(ii) Shaping Function of Guidance Towards Specific Point Task
Similar to the shaping function of guidance towards area task, the shaping function of guidance towards specific point task can be modelled as
\begin{equation*}
F_{\text{aim}}(s,a, s^{\prime})=\gamma[\varPhi_{a}(s^{\prime})+\varPhi_{\psi}(s^{\prime})]-[\varPhi_{a}(s)+\varPhi_{\psi}(s)]\tag{24}\end{equation*}
\begin{equation*}\varPhi_{a}(s)=\exp\left(-\frac{D_{\text{Impact}}-D_{\text{Impact}}^{\min}}{D_{\text{Impact}}^{\max}-D_{\text{Impact}}^{\min}}\right) \tag{25}\end{equation*}
where the predicted miss distance $D_{\text{Impact}}$ between the target point and the predicted impact point is \begin{equation*}
D_{\text{Impact}}=\Vert \boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{Bomb}}\Vert_{2}. \tag{26}\end{equation*}
Furthermore, because of the accuracy required by the guidance towards specific point task, the shaping function with expert experience can be modelled by
\begin{equation*}
F_{\text{aim}}(s,a, s^{\prime},a^{\prime})=F_{\text{aim}}(s,a, s^{\prime})+R^{\dagger}(s,a, s^{\prime},a^{\prime}). \tag{27}\end{equation*}
In the formula defined above, the advice reward $R^{\dagger}(s,a,s^{\prime},a^{\prime})$ encodes the domain knowledge that the overload command should be proportional to the heading error, and is defined as \begin{equation*}
R^{\dagger}(s,a,s^{\prime},a^{\prime})= \exp\left[-\frac{\left\vert N_{z}-\left\vert \delta_{\psi_{\text{LOS}}}\right\vert \cdot N_{z}^{\max}/\pi\right\vert }{2N_{z}^{\max}}\right] \tag{28}\end{equation*}
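Similarly, the sketch below encodes the aim-task potential (25)-(26) and the overload-advice term (28); the ballistic offset $\boldsymbol{A}_{\text{Bomb}}$ is assumed to be provided by the bomb model, and the distance bounds and overload limit are placeholders.

```python
import numpy as np

def potential_aim(x_uav, a_bomb, x_tgt, d_min=0.0, d_max=5000.0):
    """Phi_a of the guidance towards specific point task, cf. (25)-(26)."""
    x_impact = x_uav + a_bomb                      # predicted impact point, (6)
    d_impact = np.linalg.norm(x_tgt - x_impact)    # predicted miss distance, (26)
    return np.exp(-(d_impact - d_min) / (d_max - d_min))

def advice_aim(n_z, delta_psi_los, nz_max=3.0):
    """Expert-advice reward R_dagger of the aim task, cf. (28)."""
    desired = abs(delta_psi_los) * nz_max / np.pi  # overload suggested by the heading error
    return np.exp(-abs(n_z - desired) / (2.0 * nz_max))
```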
2.2.4 Training Set Sampling Method Based on PER
During the training process, the PER [25] is used to sample training data from the historical transition buffer. The probability of sampling transition $i$ is defined by its priority as \begin{equation*}
P(i)= \frac{p_{i}^{\alpha}}{\sum\limits_{k}p_{k}^{\alpha}} \tag{29}\end{equation*}
In the formula above, $\alpha$ is the availability exponent, and the priority $p_{i}$ is determined by the TD-error $\delta_{i}$ of transition $i$ as \begin{equation*}
p_{i}=\vert \delta_{i}\vert +\varepsilon\tag{30}\end{equation*}
PER improves the availability of experiences, but the transitions sampled by PER are biased in distribution compared with UER, which also reduces the diversity of the training data. Therefore, importance-sampling (IS) weights are used to correct the distribution bias caused by PER. The IS weight of transition $j$ is defined by the buffer size $N$ and the IS exponent $\beta$ as \begin{equation*}\omega_{j}=\left(\frac{1}{N}\cdot\frac{1}{P(j)}\right)^{\beta} \tag{31}\end{equation*}
In order to ensure stable convergence of the networks, the IS weights are normalized by their maximum value, which yields \begin{equation*}\omega_{j}=\left(\frac{\min\limits_{i}P(i)}{P(j)}\right)^{\beta}. \tag{32}\end{equation*}
At the same time, in the early stage of training, the distribution bias caused by PER is small. Therefore, we define an initial value of $\beta$ smaller than 1 and anneal it towards 1 as training proceeds, so that the correction is fully applied only when the bias matters most.
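As an illustration, a compact array-based version of the proportional PER described by (29)-(32) is sketched below; the sum-tree structure of [25] is omitted for clarity, and the default values of $\alpha$, $\beta$, and $\varepsilon$ are placeholders rather than the values used in the experiments.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay, cf. (29)-(32)."""

    def __init__(self, capacity, alpha=0.6, beta0=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta0, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        max_p = self.priorities.max() if self.data else 1.0   # new samples get maximal priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.data)] ** self.alpha
        prob = p / p.sum()                                    # sampling probability, (29)
        idx = np.random.choice(len(self.data), batch_size, p=prob)
        weights = (prob.min() / prob[idx]) ** self.beta       # normalized IS weights, (32)
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps   # p_i = |delta_i| + eps, (30)
```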
2.2.5 Training Procedure of UAV Maneuvering Decision-Making Algorithm
According to MDPs, the key issue in optimizing the UAV maneuvering decision-making policy is to solve an optimization problem over the action-value function defined as
\begin{equation*}
Q(s,a)=\mathrm{E}_{\pi}[v(s,\pi)]\tag{33}\end{equation*}
The action-value function can be updated iteratively by \begin{gather*}
Q(s,a)=Q(s,a)+\\ \sigma\left[r+\gamma Q^{\prime} \left(s^{\prime}, \underset{a}{\arg \max} Q(s^{\prime},a)\right)-Q(s,a)\right] \tag{34}\end{gather*}
In the proposed algorithm, the critic network $Q(s,a\vert \theta_{Q})$ is trained by minimizing the loss function \begin{equation*}
L(\theta_{Q})=\mathrm{E}_{(s,a,r,s^{\prime})_{j}}\left[\delta_{j}^{\ 2}\right] \tag{35}\end{equation*}
where the TD-error $\delta_{j}$ of the sampled transition $j$ is \begin{equation*}\delta_{j}=y_{j}-Q(s_{j},a_{j}\vert \theta_{Q})\tag{36}\end{equation*}
and the target value $y_{j}$ is computed by the target networks as \begin{equation*}
y_{j}=\begin{cases}
r_{j},\ s_{j}\ \text{satisfies the termination condition}\\
r_{j}+\gamma Q^{\prime}(s_{j}^{\prime},\mu^{\prime}(s_{j}^{\prime}\vert \theta_{\mu^{\prime}})\vert \theta_{Q^{\prime}}),\ \text{otherwise}\end{cases} \tag{37}\end{equation*}
The gradient of the critic loss with respect to $\theta_{Q}$ is \begin{equation*}\nabla_{\theta_{Q}}L(\theta_{Q})=\mathrm{E}_{(s,a,r,s^{\prime})_{j}}\left[\delta_{j}\cdot\nabla_{\theta_{Q}}Q(s_{j},a_{j}\vert \theta_{Q})\right]. \tag{38}\end{equation*}
Therefore, when we use PER to sample the training data, the cumulative updating value $\Delta$ of the critic network is weighted by the IS weights as \begin{equation*}\Delta=\sum\limits_{j}\omega_{j}\cdot\delta_{j}\cdot\nabla_{\theta_{Q}}Q(s_{j},a_{j}\vert \theta_{Q}). \tag{39}\end{equation*}
Meanwhile, we define the loss function of the actor network $\mu(s\vert \theta_{\mu})$ as \begin{equation*}
L(\theta_{\mu})=\mathrm{E}_{s}[Q(s,\mu(s\vert \theta_{\mu})\vert \theta_{Q})]. \tag{40}\end{equation*}
Thereby, we can obtain the gradient of the actor loss with respect to $\theta_{\mu}$ as \begin{equation*}\nabla_{\theta_{\mu}}L(\theta_{\mu})=\mathrm{E}_{s}\left[\left.\nabla_{\theta_{\mu}}\mu(s\vert \theta_{\mu})\nabla_{a}Q(s,a\vert \theta_{Q})\right\vert _{a=\mu(s\vert \theta_{\mu})}\right]. \tag{41}\end{equation*}
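Assuming the networks and the prioritized buffer sketched earlier, one IS-weighted update step corresponding to (36)-(41) can be written as follows; the optimizer objects, the tensor layout of the minibatch, and the `done` flag are illustrative assumptions.

```python
import torch

def update_networks(batch, weights, actor, critic, actor_t, critic_t,
                    actor_opt, critic_opt, gamma=0.99):
    """One IS-weighted PER-DDPG update step, cf. (36)-(41)."""
    s, a, r, s_next, done = batch                  # tensors of shape [m, ...]
    w = torch.as_tensor(weights, dtype=torch.float32).unsqueeze(1)

    # TD target (37) and TD error (36), computed with the target networks
    with torch.no_grad():
        q_next = critic_t(s_next, actor_t(s_next))
        y = r + gamma * (1.0 - done) * q_next
    td_error = y - critic(s, a)

    # Critic: minimize the IS-weighted squared TD error, cf. (35) and (39)
    critic_loss = (w * td_error.pow(2)).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the action value estimated by the critic, cf. (40)-(41)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return td_error.detach().squeeze(1)            # used to refresh priorities, (30)
```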
In addition, to keep the training of the target networks stable, the parameters of $Q^{\prime}$ and $\mu^{\prime}$ are updated softly towards the evaluation networks as \begin{equation*}\begin{cases}\theta_{Q^{\prime}}=\tau\theta_{Q}+(1-\tau)\theta_{Q^{\prime}}\\ \theta_{\mu^{\prime}}=\tau\theta_{\mu}+(1-\tau)\theta_{\mu^{\prime}}\end{cases}. \tag{42}\end{equation*}
In the equation above, $\tau\in(0,1)$ is the soft update rate. Moreover, to encourage exploration during training, Gaussian noise $N(0,\sigma)$ is added to the action output by the policy: \begin{equation*}
a_{t}=\mu(s_{t}\vert \theta_{\mu})+N(0,\sigma).\tag{43}\end{equation*}
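The soft target update (42) and the noisy exploratory action (43) then correspond to the following short sketch; the clipping of the action to an assumed overload limit is an added illustrative choice.

```python
import torch

def soft_update(target_net, eval_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', cf. (42)."""
    for p_t, p in zip(target_net.parameters(), eval_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def explore_action(actor, state, sigma=0.1, nz_max=3.0):
    """a_t = mu(s_t | theta_mu) + N(0, sigma), clipped to the overload limit, cf. (43)."""
    with torch.no_grad():
        a = actor(state)
    return torch.clamp(a + sigma * torch.randn_like(a), -nz_max, nz_max)
```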
Finally, the training procedure of the UAV maneuvering decision-making algorithm for air-delivery is given in Algorithm 1.
Algorithm 1 UAV Maneuvering Decision-Making Algorithm for Air-Delivery
Input:
The hyperparameters of RL: policy learning period, discount factor $\gamma$, soft update rate $\tau$, and exploration noise variance $\sigma$
The hyperparameters of DL: size of minibatch and learning rates of the actor and critic networks
The hyperparameters of PER: availability exponent $\alpha$, IS exponent $\beta$, and minimal priority constant $\varepsilon$
The hyperparameters of environment: maximum training episodes and maximum decision steps per episode
Output:
The evaluation networks: actor network $\mu(s\vert\theta_{\mu})$ and critic network $Q(s,a\vert\theta_{Q})$
The target networks: actor network $\mu^{\prime}(s\vert\theta_{\mu^{\prime}})$ and critic network $Q^{\prime}(s,a\vert\theta_{Q^{\prime}})$
Run:
Initialize the parameters of the evaluation networks, copy them to the target networks, and empty the transition buffer
For each training episode do
  Reset environment and receive the initial state
  Calculate the normalized state vector
  For each decision step do
    Calculate the action by (43), execute it, and receive the reward and the next state
    Save current transition into the buffer with maximal priority
    If the policy learning period is reached then
      Clear the cumulative updating value $\Delta$
      For each transition of the minibatch do
        Sample data according to the probability in (29)
        Calculate IS weight by (32)
        Calculate TD-error by (36) and (37), and update the sample's priority by (30)
        Accumulate $\Delta$ by (39)
      End for
      Update the parameters of the critic network with $\Delta$
      Update the parameters of the actor network by (41)
      Update the parameters of target networks by (42)
    End if
  End for
End for
Results and Analysis
Based on the model and algorithm presented above, experiments are conducted to verify the rationality of the model and the effectiveness of the algorithm. In the following, we explain the settings of the simulation, the details of the training process, and the results of Monte-Carlo (MC) test experiments, together with their analysis.
3.1 Settings of Simulation Experiments
For the designed experiments, the mission area is restricted to a bounded region around the target point.
Moreover, because each element of the state space in the guidance towards area task and the guidance towards specific point task has a different physical unit, every dimension of the state vector should be normalized before being input into the networks. Details of the parameters defined above are given in Table 1, and we normalize these parameters according to their ranges.
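As a simple illustration of this preprocessing step, the sketch below applies min-max normalization to the state components of the guidance towards area task; the numeric bounds are placeholders standing in for the ranges of Table 1.

```python
import numpy as np

# Illustrative value ranges of the state components (placeholders for Table 1)
STATE_BOUNDS = {
    'd_los':     (0.0, 20000.0),    # LOS distance [m]
    'delta_psi': (-np.pi, np.pi),   # relative azimuth [rad]
    'v_uav':     (30.0, 80.0),      # airspeed [m/s]
    'h_uav':     (500.0, 3000.0),   # altitude [m]
}

def normalize_state(raw):
    """Min-max normalize a raw state dict into [0, 1] per dimension."""
    return np.array([(raw[k] - lo) / (hi - lo) for k, (lo, hi) in STATE_BOUNDS.items()])
```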
3.2 Simulation Results & Analysis of Guidance Towards Area Task
3.2.1 Parameters Setting of Algorithms
According to the training procedure of the algorithm, some parameters should be assigned before training starts. The parameter assignments of the algorithm are shown in Table 2. Moreover, we design the structures of the actor and critic networks used in this task accordingly.
3.2.2 Analysis of Simulation Results
Using the parameters assigned above, the networks are trained successfully. The loss curves of the critic and actor networks of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice for the guidance towards area task are shown in Fig. 11, Fig. 12, Fig. 13, and Fig. 14, respectively.
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS without advice in the guidance towards area task
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS without advice in the guidance towards area task
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS with advice in the guidance towards area task
Fig. 11 and Fig. 12 show the loss curves of the actor and critic networks of UER-DDPG and PER-DDPG without advice, respectively. The actor loss climbs gradually over time and stabilizes after sufficient training, while the critic loss decreases gradually and finally settles at a small value. In Fig. 13 and Fig. 14, similar to the algorithms without advice, the loss curves of the actor and critic networks with advice also stabilize in the end. However, the loss of PER-DDPG converges faster than that of UER-DDPG and is more stable after convergence, in terms of both the actor loss and the critic loss.
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS with advice in the guidance towards area task
Moreover, Fig. 15, Fig. 16, Fig. 17 and Fig. 18 show the training metrics generated over simulation episodes, including the episode reward and the success rate, under the settings of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice. The episode reward refers to the accumulated reward the agent receives in each episode. The success rate refers to the ratio of successful results in the last 50 experiments.
Curves of evaluation parameters for training under the setting of UER-DDPG and RS without advice in the guidance towards area task
Curves of evaluation parameters for training under the setting of PER-DDPG and RS without advice in the guidance towards area task
Curves of evaluation parameters for training under the setting of UER-DDPG and RS with advice in the guidance towards area task
Fig. 15(a), Fig. 16(a), Fig. 17(a) and Fig. 18(a) are the curves of cumulative reward over simulation episodes. They show that the cumulative reward stabilizes at a deterministic value and that all the curves follow a similar trend. Fig. 15(b), Fig. 16(b), Fig. 17(b) and Fig. 18(b) are the curves of the success rate over simulation episodes, which show that it increases until reaching 1.0. All the training experiments converge to an optimal policy, but more time is consumed during training when expert advice is introduced. This indicates that extra training time is needed for the agent to learn the additional, more valuable information.
Curves of evaluation parameters for training under the setting of PER-DDPG and RS with advice in the guidance towards area task
After training, we perform a group of MC experiments to evaluate the quality of the trained results of the algorithms described above. The statistical results of the MC test experiments are shown in Table 5. All four groups of experimental results demonstrate that the trained policies converge to the near optimal point.
Meanwhile, we visualize some assessment results from the MC experiments, including the flight trajectory of the UAV in an experiment, the action given by the agent over time, and the reward received by the agent over time, as shown in Fig. 19, Fig. 20, Fig. 21 and Fig. 22. In Fig. 19(a), Fig. 20(a), Fig. 21(a), and Fig. 22(a), the red solid line represents the flight trajectory of the UAV, the red point and the green “x” indicate the start and end positions of the UAV respectively, and the blue “+” and the blue dashed circle surrounding it indicate the target position and its effective area respectively. The (b) subplots of these figures show the curves of the action output by the algorithm and the reward over time.
Visualization of test experiment for the trained policy of UER-DDPG and RS without advice in the guidance towards area task
Visualization of test experiment for the trained policy of PER-DDPG and RS without advice in the guidance towards area task
Visualization of test experiment for the trained policy of UER-DDPG and RS with advice in the guidance towards area task
Visualization of test experiment for the trained policy of PER-DDPG and RS with Advice in the guidance towards area task
In Fig. 19(a), Fig. 20(a), Fig. 21(a), and Fig. 22(a), we can see that the policy trained by each algorithm with each kind of reward function converges to the near optimal point in the guidance towards area task, as all of them exhibit approximately optimal performance. The UAV can enter the mission area from an arbitrary position and a random azimuth, and all the algorithms perform well on this task. Meanwhile, the action curves show that the policy outputs a reasonable control value in different situations. For example, when the UAV flies towards the target area, the action of the policy is approximately equal to 0. If the target area is located on the front left side of the UAV, the action of the policy is a negative value; on the contrary, the policy gives a positive value when the target area is located on the front right side of the UAV. The reward curves show the same trend: when the UAV selects an action that brings it closer to the target area, the reward given by the model is larger. Especially for the model improved by RS with advice, when the UAV selects an action that increases the potential relative to the last state, it receives a positive bonus, as shown in Fig. 21(b) and Fig. 22(b).
Furthermore, although the trained policies all perform well in the guidance towards area task, different kinds of shaping function produce disparate performance in terms of policy output. Obviously, the outputs of the algorithms with advice in Fig. 21(b) and Fig. 22(b) are much more stable than those without advice in Fig. 19(b) and Fig. 20(b), as there are far fewer high-frequency fluctuations in the curves of reward and action in Fig. 21 and Fig. 22.
Thereby, it is shown that the algorithms and the modified model we design are reasonable and effective for solving the guidance towards area task and for improving the autonomy of the UAV while it adjusts its attitude in preparation for dropping the bomb. Furthermore, the trained results of UER-DDPG and PER-DDPG with expert advice are of higher quality in terms of policy output than those of the algorithms without expert advice, because the output of the policies trained with expert advice is smoother.
3.3 Simulation Results and Analysis of Guidance Towards Specific Point Task
3.3.1 Parameters Setting of Algorithms
Similarly, some parameters should be assigned before training starts. The parameter assignments of the algorithm are shown in Table 6. Moreover, we design the structures of the actor and critic networks used in this task accordingly.
3.3.2 Analysis of Simulation Results
Similar to the guidance towards area task, we also obtain reasonable results after training. Fig. 23, Fig. 24, Fig. 25 and Fig. 26 show the loss curves of the critic and actor networks of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice for the guidance towards specific point task, respectively.
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS without advice in the guidance towards specific point task
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS without advice in the guidance towards specific point task
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS with advice in the guidance towards specific point task
Fig. 23(a), Fig. 24(a), Fig. 25(a), and Fig. 26(a) present the loss curves of the actor networks, which increase or decrease until stabilizing at a deterministic value. Fig. 23(b), Fig. 24(b), Fig. 25(b), and Fig. 26(b) show the loss curves of the critic networks, which stabilize at a good value after extensive training, though there are some peaks during training. From these figures, we find that all the algorithms with the different reward functions converge to a near optimal point, and the loss of PER-DDPG is more stable than that of UER-DDPG after convergence in terms of the smoothness of the loss curves.
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS with advice in the guidance towards specific point task
Moreover, Fig. 27, Fig. 28, Fig. 29 and Fig. 30 show the training metrics generated over simulation episodes, including the episode reward and the success rate, under the settings of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice.
Curves of evaluation parameters for training under the setting of UER-DDPG and RS without advice in the guidance towards specific point task
Curves of evaluation parameters for training under the setting of PER-DDPG and RS without advice in the guidance towards specific point task
Curves of evaluation parameters for training under the setting of UER-DDPG and RS with advice in the guidance towards specific point task
Fig. 27(a), Fig. 28(a), Fig. 29(a), and Fig. 30(a) show that the episode reward increases gradually until reaching a fixed value. Fig. 27(b), Fig. 28(b), Fig. 29(b), and Fig. 30(b) are the curves of the success rate, which show that, although the training process fluctuates, the actor and critic of each algorithm converge to an excellent level. In all the training experiments the optimal policy is obtained, but more time is consumed during training when expert advice is introduced, for reasons similar to those discussed above. In addition, the algorithms are required to operate the UAV more accurately in the guidance towards specific point task, because even a slight adjustment of the UAV's azimuth shifts the impact point far away from the target point owing to the long range of the bomb. Thereby, it is more difficult for the algorithms to converge to the optimal point due to overestimation bias and variance.
Curves of evaluation parameters for training under the setting of PER-DDPG and RS with advice in the guidance towards specific point task
In the same way, we also design a group of MC experiments to assess the quality of the trained results of the above algorithms in the guidance towards specific point task. The statistical results of the MC test experiments are shown in Table 9. All four groups of experimental results demonstrate that the trained policies converge to the near optimal point and reach a usable and satisfactory level.
Simultaneously, we visualize some assessment results from the MC experiments, including the flight trajectory of the UAV in an experiment, the action given by the agent over time, and the reward received by the agent over time, as shown in Figs. 31–34. In the (a) subplots of each figure, the red solid line represents the flight trajectory of the UAV, the red dashed line represents the trajectory of the bomb on the horizontal plane, the red point and the green “x” indicate the start and end positions of the UAV respectively, and the blue “+” and the blue dashed circle surrounding it indicate the target position and its effective area respectively.
Visualization of test experiment for the trained policy of UER-DDPG and RS without advice in the guidance towards specific point task
Visualization of test experiment for the trained policy of PER-DDPG and RS without advice in the guidance towards specific point task
Visualization of test experiment for the trained policy of UER-DDPG and RS with advice in the guidance towards specific point task
Similar to the guidance towards area task, the proposed algorithms show good performance in the guidance towards specific point task. As shown in Figs. 31–34, the policy converges to the near optimal position after training. In Fig. 31(a), Fig. 32(a), Fig. 33(a), and Fig. 34(a), the UAV reaches an appropriate position to release the bomb by outputting a reasonable control value to adjust its attitude. For instance, if the desired impact point is located on the front left side of the UAV, the policy gives a negative action; on the contrary, the action output by the policy becomes positive when the point is located on the front right side of the UAV. These behaviours can be observed from the action curves in Fig. 31(b), Fig. 32(b), Fig. 33(b), and Fig. 34(b).
Visualization of test experiment for the trained policy of PER-DDPG and RS with advice in the guidance towards specific point task
Moreover, the trained policies perform well in the guidance towards specific point task, but different reward shaping methods lead to differences in the stability and robustness of the policy output. Obviously, the outputs of the policies trained with RS with advice are much more stable than those trained with RS without advice, since there is much less irregular noise in the output values shown in Fig. 33(b) and Fig. 34(b) compared with Fig. 31(b) and Fig. 32(b).
Thereby, it is shown that the algorithms and the modified model we design are reasonable and effective for solving the guidance towards specific point task and for enhancing the autonomy of the UAV while it aims at the target in preparation for releasing the bomb. Furthermore, the performance of the trained results of UER-DDPG and PER-DDPG with expert advice is similar to that in the guidance towards area task: the output of the algorithms introducing expert advice is smoother than that of the algorithms without it. This superiority could help the policies trained by the DRL-based algorithm transfer to the real world, because high-frequency oscillations of the control command are unacceptable for real actuators.
Conclusions
In the present work, we refine and describe the guidance towards area task and the guidance towards specific point task in air-delivery. According to the definitions of these problems, we propose a UAV maneuvering decision-making algorithm based on DRL to execute the air-delivery mission autonomously. Within this work, we design and construct the UAV maneuvering decision-making model based on MDPs, consisting of the state space, action space, and reward function of each task. Then, we present the UAV maneuvering decision-making algorithm based on PER-DDPG, in which the PER sampling method improves the utilization of historical data during the training process. Specifically, we propose a construction method for a modified reward function that takes domain knowledge and expert advice into account to improve the inference quality of the trained policy network.
Meanwhile, we design extensive experiments to verify the performance of the proposed algorithms and model. The metrics generated during training show that the PER method helps the algorithm converge more quickly than UER and that the trained results are more stable than those obtained with UER. Furthermore, the MC experiment results demonstrate that the modified reward function involving expert advice can significantly improve the quality of the trained policy and achieves more accurate operation.
In the future, we will consider the influence of the loss of information dimensions, i.e., the case in which the environment is only partially observable. We will also extend the proposed algorithm to manipulate real UAVs in a 3D environment while performing specific missions.