Introduction
With their rapid development in recent years, unmanned aerial vehicles (UAVs) have come to play important roles in various engineering fields. UAVs have been used to assist or replace humans in executing dirty, dull, and difficult missions owing to their low cost, high mobility, and unmanned nature [1]. Therefore, UAVs are widely used in surveillance, searching, tracking, and other missions [2]. Thus, how to improve the autonomy of UAVs while performing such tasks, so as to avoid risking human lives, has become a research focus in various fields. For instance, UAVs have been used to deliver relief supplies, extinguish fires, and so on [3]–[5]. Consequently, improving the autonomous flight capability of UAVs has become one of the key issues for engineering applications [6]–[9].
Nowadays, UAVs are mainly used to execute tasks that can be conducted automatically instead of manually, such as target tracking, long-distance delivery, and patrol. One of the important technical issues in these tasks is to find an optimal path from the start point to the end point and to design a controller that steers the UAV along that path. The optimal path can be found by path planning algorithms [10], such as the visibility graph [11], random-sampling search algorithms including the rapidly-exploring random tree [12] and the probabilistic roadmap [13], heuristic algorithms including A-Star [14], Sparse A-Star [15], and D-Star [16], and genetic algorithms [17]. Then, a controller can be designed to make the UAV follow the planned path using various trajectory tracking algorithms [18]. However, the solution outlined above has some disadvantages. For example, finding the optimal path relies on prior knowledge about the environment, but terrain and obstacle data are so difficult to obtain that the capability of environment modelling is limited [19], [20]. Moreover, for a dynamic environment with moving obstacles [21], the scheme designed above is not flexible enough to alter its control strategy immediately; paths have to be replanned to adapt to changes in the environment. Furthermore, because conventional algorithms need considerable time to calculate the optimal path, it is difficult to apply them to real-time problems. Therefore, it is necessary to design an end-to-end algorithm that can operate autonomous UAV flight in a dynamic environment without path planning and trajectory tracking.
A research highlight was inspired by AlphaGo, developed by Google based on deep reinforcement learning (DRL) [22], and by the end-to-end decision-making algorithm called deep Q network (DQN), which can play Atari games [23]. The performance of DQN reached human level after extensive training, which has attracted many researchers from various fields to study the applications of DRL to all kinds of engineering problems [8], [9]. Meanwhile, the deep deterministic policy gradient (DDPG) [24] was proposed to overcome the dimension explosion caused by the continuity of the action space and state space. The experience replay method, which samples from an experience buffer, was adopted in these algorithms to allow agents to remember and learn from historical data. Uniform experience replay (UER) forms the training set by sampling uniformly from the experience buffer. However, UER does not fully exploit the diversity of historical data, which lowers the convergence rate of the policy and may even cause divergence after extensive training. Therefore, prioritized experience replay (PER) [25] was proposed to improve the efficiency of learning from experience. It exploits the potential value of historical data to increase the convergence rate of the policy network, because a priority model evaluates how profitable each sample is for training the policy network at the current step.
In addition, designing the reward function of the target problem is important for training the policy. Traditional formulations of the reward function come from the original model of the problem [23], [24], such as CartPole, Pendulum, and other Atari games. Thus, how to construct an appropriate reward function is not the focus of most popular works, as they mainly concentrate on improving algorithms instead of modelling. However, defining a modified model is also essential for solving problems from specific research fields when we attempt to apply DRL to a new problem. Traditionally, the reward function involved in the problem model is designed from human experience [26], [27], so the trained policy relies heavily on the capability of the designer. Although some recently published works found effective policies, different formulations of the reward function lead to different training processes and trained policies.
In the present work, we aim to tackle the challenges mentioned above and focus on the UAV maneuvering decision-making algorithm for autonomous air-delivery, including the guidance towards area and guidance towards specific point tasks. The main contributions of this paper are summarized as follows:
The UAV maneuvering decision-making model for air-delivery is built based on Markov decision processes (MDPs) [28], [29]. Particularly, we refine the guidance towards area and guidance towards specific point tasks involved in the air-delivery problem. Meanwhile, we design the flight state space, the flight action space, and the reward function of each task. Among the components of the model, we draw on traditional air-to-ground fire control theory to design and construct the UAV maneuvering decision-making model for air-delivery.
We propose the UAV maneuvering decision-making algorithm for autonomous air-delivery based on DDPG with PER sampling (PER-DDPG) to optimize the maneuvering policy. Specifically, we design the policy function with a deep neural network and generate the training samples based on PER. Moreover, we present a construction method for the reward function based on expert experience and domain knowledge.
Experiments demonstrate that the proposed algorithm improves the autonomy of the UAV during the air-delivery process and that the presented construction method of the reward function is beneficial to the convergence of the maneuvering policy and even improves the quality of the policy's output.
This paper is organized as follows: Section 2 describes the background knowledge of the methods used to design the UAV maneuvering decision-making model and algorithm for air-delivery. Section 3 presents the details of the designed experiments and compares the training metrics under different sampling methods and reward functions. Section 4 concludes our work and looks forward to future research.
Methodology
With the rapid development of UAV technology, UAVs have been used to execute various dangerous and repetitive missions, such as electric power inspection, crop protection, wildlife surveillance, traffic monitoring, and rescue operations. Demands for more capable yet simple UAV autonomous flight solutions have emerged. As mentioned above, the traditional solution for UAV guidance first plans an optimal path and then makes the UAV follow that path with a trajectory tracking method.
In this paper, we describe the process of UAV autonomous air-delivery in detail and define the guidance towards area and guidance towards specific point tasks involved in the air-delivery problem. Then, we construct the UAV maneuvering decision-making model for air-delivery based on MDPs. Meanwhile, we present a construction method for the reward function in which expert experience and domain knowledge are given due consideration. Finally, we propose the UAV maneuvering decision-making algorithm based on DRL.
As shown in Fig. 1, we first construct the UAV maneuvering decision-making model for air-delivery, consisting of the guidance towards area and guidance towards specific point tasks, based on MDPs. Within this model, we design the action space, state space, and basic reward of each task, which characterize UAV autonomous flight during air-delivery. Moreover, we design and implement the UAV maneuvering decision-making algorithm, including the maneuvering decision-making policy based on a neural network, and the policy network is optimized with data sampled from historical experience by PER. Meanwhile, we construct the shaping reward of each task to increase the convergence rate of the policy network and to improve the quality of the policy's output.
2.1 UAV Maneuvering Decision-Making Model for Air-Delivery Based on MDPs
At present, most sequential decision-making problems, also called multi-stage decision-making problems, can be modeled by MDPs. Meanwhile, most researchers focusing on autonomous control and decision-making describe problems and construct problem models based on MDPs, and we do the same for the UAV maneuvering decision-making model for air-delivery. As shown in Fig. 2, we use MDPs to design and construct the UAV maneuvering decision-making model for the guidance towards area and guidance towards specific point tasks included in air-delivery. We design and implement the simulator core consisting of the UAV kinematic model, the bomb model, and the target point. Particularly, we present the state space, action space, and reward function of the UAV maneuvering decision-making model for air-delivery in the light of MDPs.
Construction diagram of UAV maneuvering decision-making model for air-delivery based on MDPs
2.1.1 MDPs
During the process of performing an air-delivery mission, UAV maneuvering decision-making can be regarded as a sequential decision process. Moreover, the operator of a UAV usually considers only the current information from the environment while selecting the optimal action. Therefore, we can consider this decision process Markovian and use MDPs to construct the UAV maneuvering decision-making model for air-delivery.
The MDPs can be described by a tuple
\begin{equation*}\{T,S,A(s),P(\cdot\vert s,a),R(s,a)\}\end{equation*}
As shown in Fig. 3, MDPs can be described as follows: when the state of the environment is initialized to $s_{0}\in S$, the agent selects an action $a_{t}\in A(s_{t})$ at each decision step $t\in T$ according to its policy $\pi$; the environment then transitions to the next state $s_{t+1}$ according to the transition probability $P(\cdot\vert s_{t},a_{t})$ and feeds back the reward $R(s_{t},a_{t})$ to the agent.
During the process of interactions between the environment and the agent, a sequence of rewards is accumulated. The objective is to find a policy $\pi$ whose value function attains the optimal value \begin{equation*}
v(s)= \sup\limits_{\pi}v(s,\pi),\ s\in S. \tag{1}\end{equation*}
Based on the characteristics of the UAV maneuvering decision-making problem for air-delivery, we use the infinite-horizon discounted model as the utility function
\begin{equation*}
v(s,\pi)=\sum\limits_{t=0}^{\infty}\gamma^{t}\mathrm{E}_{\pi}^{s}[R(s_{t},a_{t})],\ s\in S. \tag{2}\end{equation*}
In the above formula, $\gamma\in[0,1)$ is the discount factor and $\mathrm{E}_{\pi}^{s}[\cdot]$ denotes the expectation under policy $\pi$ starting from state $s$.
In the following, we will first demonstrate the problem definition of the air-delivery mission. Then, the state space, the action space, and the reward function of each task will be designed.
2.1.2 Definition of Guidance Towards Area and Guidance Towards Specific Point Tasks Involved in Air-Delivery Mission
Before running a reinforcement-learning-based algorithm, we should construct a simulation model of the problem to be solved. Thus, we refine two tasks involved in air-delivery: the guidance towards area task and the guidance towards specific point task. As shown in Fig. 2, we first construct the UAV kinematic model and the bomb model. In this paper, we adopt a dynamic model of the air-delivery mission based on a 3-degree-of-freedom (3-DoF) kinematic model of the UAV [6]. On the other hand, we also design a 3-DoF kinematic model of the bomb [30], [31] to calculate the external ballistics parameters of the uncontrolled bomb. When the position and attitude of the UAV at the moment of release are known, the ballistic model predicts the impact point of the uncontrolled bomb.
Based on the 3-DoF kinematic model of the UAV, the flight state is defined by the UAV's position, altitude $H_{\text{UAV}}$, velocity $v_{\text{UAV}}$, and heading angle $\psi_{\text{UAV}}$.
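For concreteness, a minimal sketch of such a point-mass kinematic model is given below. The state layout, the constant airspeed and altitude, the integration step, and the single lateral-overload command steering the heading (matching the action $N_{z}$ defined later) are simplifying assumptions for illustration, not the exact model of [6] or [30], [31].

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def step_uav_3dof(state, n_z, dt=0.1, v=50.0):
    """One integration step of a simplified 3-DoF (point-mass) UAV model.

    state : (x, y, h, psi) -- horizontal position [m], altitude [m], heading [rad]
    n_z   : lateral overload command (the single action of the model)
    dt    : integration step [s]
    v     : constant airspeed [m/s] (level flight assumed for simplicity)
    """
    x, y, h, psi = state
    # A lateral overload produces a turn rate in level flight: psi_dot = g * n_z / v
    psi_dot = G * n_z / v
    psi = (psi + psi_dot * dt + np.pi) % (2.0 * np.pi) - np.pi  # wrap to [-pi, pi)
    # Constant-speed, constant-altitude kinematics
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    return np.array([x, y, h, psi])
```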
In this paper, the guidance towards area task and guidance towards specific point task are defined and described, considering the task load type and launching mode.
(i) Guidance Towards Area Task
When the UAV flies with a controlled bomb, or another steerable load, the guidance target of the UAV is usually a broad area. Thus, while using the controlled bomb, we consider the target of the UAV to be an area containing the mission point.
Fig. 4 is a vector diagram of the guidance towards area task, where the line-of-sight (LOS) vector from the UAV position $\boldsymbol{X}_{\text{UAV}}$ to the target position $\boldsymbol{X}_{\text{TGT}}$ is defined as
\begin{equation*}\boldsymbol{D}_{\text{LOS}}=\boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{UAV}}. \tag{3}\end{equation*}
Thereby, we could define the successful termination condition of guidance towards area task
\begin{equation*}\Vert \boldsymbol{D}_{\text{LOS}}\Vert_{2}\leqslant R_{\text{guidance}}. \tag{4}\end{equation*}
In the formula above, $R_{\text{guidance}}$ is the radius of the mission area; when the distance between the UAV and the target point is no larger than $R_{\text{guidance}}$, the guidance towards area task is considered successfully terminated.
(ii) Guidance Towards Specific Point Task
When the mission load of the UAV is an uncontrolled bomb, the target of the task is a specific point instead of an area. As shown in Fig. 5, if the UAV flies with an uncontrolled bomb, it performs the guidance towards specific point task and adjusts its heading so that the predicted impact point of the bomb coincides with the target point.
In Fig. 5, $\psi_{\text{UAV}}$ is the heading angle of the UAV and $\psi_{\text{LOS}}$ is the azimuth of the LOS vector, so the relative azimuth is defined as \begin{equation*}\delta_{\psi_{\text{LOS}}}=\psi_{\text{UAV}}-\psi_{\text{LOS}}. \tag{5}\end{equation*}
Meanwhile, the predicted impact point of the bomb is obtained from the UAV position and the ballistic offset $\boldsymbol{A}_{\text{Bomb}}$ given by the bomb model:
\begin{equation*}\boldsymbol{X}_{\text{Bomb}}=\boldsymbol{X}_{\text{UAV}}+\boldsymbol{A}_{\text{Bomb}}. \tag{6}\end{equation*}
Moreover, we could define the successful termination condition of guidance towards specific point task
\begin{equation*}\begin{cases}\left\vert \delta_{\psi_{\text{LOS}}}\right\vert \leqslant\delta_{\psi}\\ \Vert \boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{Bomb}}\Vert_{2}\leqslant \delta_{A}\end{cases} \tag{7}\end{equation*}
2.1.3 State Space, Action Space, and Basic Reward of Each Task
As mentioned above, we present the UAV maneuvering decision-making model for air-delivery based on MDPs. Therefore, we should design the state space, action space, and reward function of each task refined in air-delivery based on the problem definitions described above.
(i) State Space of Guidance Towards Area Task
Considering the purpose of the guidance towards area task, we define its state space as
\begin{equation*}
S_{\text{guidance}}=\{D_{\text{LOS}},\delta_{\psi_{\text{LOS}}},v_{\text{UAV}},H_{\text{UAV}}\} \tag{8}\end{equation*}
(ii) State Space of Guidance Towards Specific Point Task
For the guidance towards specific point task, we can define its state space as
\begin{equation*}
S_{\text{aim}}=\{D_{\text{LOS}},\delta_{\psi_{\text{LOS}}},v_{\text{UAV}},H_{\text{UAV}},A_{\text{Bomb}}\} \tag{9}\end{equation*}
(iii) Action Space of Each Task
Based on the flight simulation model of UAV we construct, we can establish the action space as
\begin{equation*}
A(s)=\{N_{z}\}.\tag{10}\end{equation*}
In the guidance towards specific point task, the action space is defined similarly to (10).
(iv) Reward Function of Each Task
Moreover, considering the successful termination condition of each task, we define the reward function as
\begin{equation*}
R(s,a)=\begin{cases}
1.0,\ \ \text{Successful termination}\\
0.0,\ \ \text{Failed}\end{cases}. \tag{11}\end{equation*}
The formula above shows that if the UAV's situation satisfies the successful termination condition of the corresponding task, the agent receives a reward of 1.0; otherwise, the reward is 0.0.
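As an illustration, the termination conditions (4) and (7) and the sparse reward (11) can be written compactly as below; the threshold values are placeholders rather than the parameters used in the experiments.

```python
import numpy as np

def area_task_reward(x_uav, x_tgt, r_guidance=500.0):
    """Sparse reward of the guidance towards area task, cf. (3), (4) and (11)."""
    d_los = np.linalg.norm(x_tgt - x_uav)          # LOS distance, (3)
    success = d_los <= r_guidance                  # termination condition (4)
    return (1.0 if success else 0.0), success

def point_task_reward(x_uav, a_bomb, x_tgt, psi_uav, psi_los,
                      delta_psi_max=np.radians(2.0), delta_a_max=30.0):
    """Sparse reward of the guidance towards specific point task, cf. (5)-(7) and (11)."""
    delta_psi = psi_uav - psi_los                  # heading error w.r.t. LOS, (5)
    x_bomb = x_uav + a_bomb                        # predicted impact point, (6)
    miss = np.linalg.norm(x_tgt - x_bomb)          # predicted miss distance
    success = abs(delta_psi) <= delta_psi_max and miss <= delta_a_max  # (7)
    return (1.0 if success else 0.0), success
```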
2.2 UAV Maneuvering Decision-Making Algorithm for Air-Delivery Based on PER-DDPG
After constructing the simulation environments, we can design the corresponding algorithm to solve the problem. In this paper, we propose the UAV maneuvering decision-making algorithm for the air-delivery mission based on PER-DDPG [6] with expert experience and domain knowledge. It is composed of the PER module that generates training samples, the policy consisting of an actor network and a critic network, and the shaping reward that improves the quality of the policy's output and increases the convergence rate of the policy, as shown in Fig. 6. Particularly, we introduce expert experience and domain knowledge to design the shaping reward and thereby improve the performance of the proposed algorithm.
2.2.1 Framework of PER-DDPG
PER-DDPG is a model-free, off-policy, DRL-based algorithm with an actor-critic architecture. It can effectively solve MDP problems with continuous state and action spaces. Meanwhile, because the original DDPG does not consider the diversity of data and does not fully utilize historical experience, policies trained by DDPG exhibit a low convergence rate and poor stability. Therefore, the PER is used to generate the training data, which improves the utilization of the potential value of historical data, thereby increasing the convergence rate and enhancing the stability of the trained policy. PER-DDPG has been verified in an autonomous airdrop task, showing higher performance than the original DDPG.
The framework of PER-DDPG is shown in Fig. 7, which is composed of the evaluation networks, the target networks, the PER module, and other components. At each decision-making step, the evaluation actor network outputs an action with exploration noise according to the current state. Then, the current state, action, reward, and next state are packaged as a transition and stored in the historical transition buffer.
Specifically, at each moment, the policy gives action by
\begin{equation*}
a_{t}=\mu(s_{t}\vert \theta_{\mu})\tag{12}\end{equation*}
2.2.2 UAV Maneuvering Decision-Making Policy Based on Neural Network
As mentioned above, DDPG is a kind of DRL algorithm constructed on the actor-critic framework. During the training process, the actor network $\mu(s\vert\theta_{\mu})$ outputs an action according to the current state, and the critic network $Q(s,a\vert\theta_{Q})$ estimates the action value used to evaluate and improve the actor.
(i) Actor Network
The actor network $\mu(s\vert\theta_{\mu})$ takes the state vector as input and outputs the maneuvering action $N_{z}$ through several fully connected layers.
(ii) Critic Network
The critic network $Q(s,a\vert\theta_{Q})$ takes the state vector and the action as inputs and outputs the estimated action value, which is used to evaluate the quality of the actor's output.
In addition, before running the networks, the input values should be normalized to eliminate the influence of their different physical units. Moreover, the structures of the target networks $\mu^{\prime}(s\vert\theta_{\mu^{\prime}})$ and $Q^{\prime}(s,a\vert\theta_{Q^{\prime}})$ are identical to those of the corresponding evaluation networks.
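As a concrete illustration, a minimal PyTorch sketch of such an actor-critic pair is given below; the layer widths, activation functions, and the tanh scaling of the output to an assumed overload limit are illustrative choices, not the exact structures reported in the tables.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps a normalized state to an action in [-nz_max, nz_max]."""
    def __init__(self, state_dim, nz_max=3.0, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),       # bounded output in [-1, 1]
        )
        self.nz_max = nz_max

    def forward(self, s):
        return self.nz_max * self.net(s)           # scale to the assumed overload limit

class Critic(nn.Module):
    """Q(s, a | theta_Q): evaluates a state-action pair with a scalar value."""
    def __init__(self, state_dim, action_dim=1, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                  # scalar action value
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```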
2.2.3 Shaping Reward Function Based on Expert Experience and Domain Knowledge
Although the algorithm we propose could learn an optimal policy according to the reward function shown in (11), there is a serious challenge that affects the convergence rate of the policy: the rewards returned by the environment are so sparse that the agent seldom encounters transitions with non-zero reward, from which useful experience could be learned.
Therefore, some researchers proposed a technique called reward shaping (RS) [32], which leverages expert knowledge to reconstruct the reward model of the target domain and thereby improve the agent's policy learning. More specifically, in addition to the reward from the environment, RS provides a shaping function $F$ that transforms the original MDP $M$ into a shaped MDP $M^{\prime}$: \begin{equation*}
M=\{T,S,A,P,R\}\rightarrow M^{\prime}=\{T,S,A,P,R^{\prime}\}. \tag{13}\end{equation*}
In the formula above, $R^{\prime}=R+F$. Policy invariance is guaranteed when $F$ is a potential-based reward shaping (PBRS) function built from a potential function $\varPhi(s)$ over states: \begin{equation*}
F(s,a,s^{\prime})=\gamma\varPhi(s^{\prime})-\varPhi(s) \tag{14}\end{equation*}
Without this restriction, an arbitrarily designed shaping function may mislead the agent into a cycle of states whose accumulated shaping reward is positive, i.e., \begin{equation*}
F(s_{1},a_{1}, s_{2})+F(s_{2},a_{2}, s_{3})+\cdots+F(s_{n},a_{n}, s_{1}) > 0.\end{equation*}
Fortunately, PBRS avoids this issue by making any state cycle yield zero accumulated shaping reward. Moreover, the optimal action-value functions of $M$ and $M^{\prime}$ satisfy \begin{equation*}
Q_{M^{\prime}}(s,a)=Q_{M}(s,a)-\varPhi(s). \tag{15}\end{equation*}
Thereby, if we construct the potential function over both the state and the action, we obtain the potential-based advice (PBA) shaping function \begin{equation*}
F(s,a,s^{\prime},a^{\prime})=\gamma\varPhi(s^{\prime},a^{\prime})-\varPhi(s,a). \tag{16}\end{equation*}
Similar to (15), the optimal action-value functions of $M$ and $M^{\prime}$ satisfy \begin{equation*}
Q_{M^{\prime}}(s,a)=Q_{M}(s,a)-\varPhi(s,a). \tag{17}\end{equation*}
Similarly, PBA is also a sufficient and necessary condition for preserving policy invariance. Therefore, once the optimal policy in $M^{\prime}$ is obtained, the optimal policy in $M$ can be recovered as \begin{equation*}\mu_{M}(s)= \underset{a\in A(s)}{\arg\max}[Q_{M^{\prime}}(s,a)+\varPhi(s,a)].\tag{18}\end{equation*}
Thus, we propose a construction approach to design the shaping functions of the air-delivery mission based on PBRS and PBA, introducing expert experience and domain knowledge, as shown in Fig. 10.
(i) Shaping Function of Guidance Towards Area Task
When the UAV is performing the guidance towards area task, the distance between the current position and the target point, together with the relative azimuth between the LOS and the nose direction of the UAV, are the main influencing factors. Therefore, we can construct the shaping function as
\begin{gather*}
F_{\text{guidance}}(s,a, s^{\prime})=\\ \gamma[\varPhi_{d}(s^{\prime})+ \varPhi_{\psi}(s^{\prime})]-[\varPhi_{d}(s)+\varPhi_{\psi}(s)] \tag{19}\end{gather*}
\begin{gather*}\varPhi_{d}(s)=\frac{D_{\text{LOS}}^{\max }- D_{\text{LOS}}}{D_{\text{LOS}}^{\max}- D_{\text{LOS}}^{\min}}, \tag{20}\\ \varPhi_{\psi}(s)=\frac{\pi- \delta_{\psi_{\text{LOS}}}}{\pi}. \tag{21}\end{gather*}
According to the shaping function designed above, the agent can obtain the optimal policy more quickly than with the sparse reward model alone. More specifically, we can introduce expert experience to guide the agent towards the optimal state even faster, extending the shaping function with an advice term $R^{\dagger}$:
\begin{equation*}
F_{\text{guidance}}(s,a, s^{\prime},a^{\prime})=F_{\text{guidance}}(s,a, s^{\prime})+R^{\dagger} \tag{22}\end{equation*}
In the formula above, the advice reward $R^{\dagger}$ is defined as \begin{equation*}
R^{\dagger}=\begin{cases}
1.0,\ \varPhi(s^{\prime})\geqslant\varPhi(s)\\
0.0,\ \text{otherwise}\end{cases} \tag{23}\end{equation*}
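The shaping terms (19)-(23) translate directly into code, as in the sketch below; the distance bounds, the use of the absolute heading error in $\varPhi_{\psi}$, and the dictionary-style state layout are assumptions made for illustration.

```python
import numpy as np

def potential_area(d_los, delta_psi_los, d_max=20000.0, d_min=0.0):
    """Potential of the guidance towards area task: Phi_d + Phi_psi, cf. (20)-(21)."""
    phi_d = (d_max - d_los) / (d_max - d_min)
    phi_psi = (np.pi - abs(delta_psi_los)) / np.pi   # absolute heading error assumed
    return phi_d + phi_psi

def shaping_area(s, s_next, gamma=0.99, with_advice=True):
    """PBRS shaping (19) plus the expert-advice bonus (22)-(23).

    s, s_next : dicts with keys 'd_los' and 'delta_psi_los' (illustrative layout).
    """
    phi = potential_area(s['d_los'], s['delta_psi_los'])
    phi_next = potential_area(s_next['d_los'], s_next['delta_psi_los'])
    f = gamma * phi_next - phi                       # potential-based term, (19)
    if with_advice:
        f += 1.0 if phi_next >= phi else 0.0         # advice reward R_dagger, (23)
    return f
```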
(ii) Shaping Function of Guidance Towards Specific Point Task
Similar to the shaping function of guidance towards area task, the shaping function of guidance towards specific point task can be modelled as
\begin{equation*}
F_{\text{aim}}(s,a, s^{\prime})=\gamma[\varPhi_{a}(s^{\prime})+\varPhi_{\psi}(s^{\prime})]-[\varPhi_{a}(s)+\varPhi_{\psi}(s)]\tag{24}\end{equation*}
\begin{equation*}\varPhi_{a}(s)=\exp\left(-\frac{D_{\text{Impact}}-D_{\text{Impact}}^{\min}}{D_{\text{Impact}}^{\max}-D_{\text{Impact}}^{\min}}\right) \tag{25}\end{equation*}
where the predicted miss distance $D_{\text{Impact}}$ between the target point and the predicted impact point is \begin{equation*}
D_{\text{Impact}}=\Vert \boldsymbol{X}_{\text{TGT}}-\boldsymbol{X}_{\text{Bomb}}\Vert_{2}. \tag{26}\end{equation*}
Furthermore, because of the accuracy required by the guidance towards specific point task, the shaping function with expert experience can be modelled by
\begin{equation*}
F_{\text{aim}}(s,a, s^{\prime},a^{\prime})=F_{\text{aim}}(s,a, s^{\prime})+R^{\dagger}(s,a, s^{\prime},a^{\prime}). \tag{27}\end{equation*}
In the formula defined above, the advice reward $R^{\dagger}(s,a,s^{\prime},a^{\prime})$ encodes the domain knowledge that the overload command should be proportional to the heading error, and is defined as \begin{equation*}
R^{\dagger}(s,a,s^{\prime},a^{\prime})= \exp\left[-\frac{\left\vert N_{z}-\left\vert \delta_{\psi_{\text{LOS}}}\right\vert \cdot N_{z}^{\max}/\pi\right\vert }{2N_{z}^{\max}}\right] \tag{28}\end{equation*}
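Similarly, the sketch below encodes the aim-task potential (25)-(26) and the overload-advice term (28); the ballistic offset $\boldsymbol{A}_{\text{Bomb}}$ is assumed to be provided by the bomb model, and the distance bounds and overload limit are placeholders.

```python
import numpy as np

def potential_aim(x_uav, a_bomb, x_tgt, d_min=0.0, d_max=5000.0):
    """Phi_a of the guidance towards specific point task, cf. (25)-(26)."""
    x_impact = x_uav + a_bomb                      # predicted impact point, (6)
    d_impact = np.linalg.norm(x_tgt - x_impact)    # predicted miss distance, (26)
    return np.exp(-(d_impact - d_min) / (d_max - d_min))

def advice_aim(n_z, delta_psi_los, nz_max=3.0):
    """Expert-advice reward R_dagger of the aim task, cf. (28)."""
    desired = abs(delta_psi_los) * nz_max / np.pi  # overload suggested by the heading error
    return np.exp(-abs(n_z - desired) / (2.0 * nz_max))
```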
2.2.4 Training Set Sampling Method Based on PER
During the training process, the PER [25] is used to sample training data from the historical transition buffer. The probability of sampling transition $i$ is defined by its priority as \begin{equation*}
P(i)= \frac{p_{i}^{\alpha}}{\sum\limits_{k}p_{k}^{\alpha}} \tag{29}\end{equation*}
In the formula above, $\alpha$ is the availability exponent, and the priority $p_{i}$ is determined by the TD-error $\delta_{i}$ of transition $i$ as \begin{equation*}
p_{i}=\vert \delta_{i}\vert +\varepsilon\tag{30}\end{equation*}
PER improves the availability of experiences, but the transitions sampled by PER are biased in distribution compared with UER, which also reduces the diversity of the training data. Therefore, importance-sampling (IS) weights are used to correct the distribution bias caused by PER. The IS weight of transition $j$ is defined by the buffer size $N$ and the IS exponent $\beta$ as \begin{equation*}\omega_{j}=\left(\frac{1}{N}\cdot\frac{1}{P(j)}\right)^{\beta} \tag{31}\end{equation*}
In order to ensure stable convergence of the networks, the IS weights are normalized by their maximum value, which yields \begin{equation*}\omega_{j}=\left(\frac{\min\limits_{i}P(i)}{P(j)}\right)^{\beta}. \tag{32}\end{equation*}
At the same time, in the early stage of training, the distribution bias caused by PER is small. Therefore, we define an initial value of $\beta$ smaller than 1 and anneal it towards 1 as training proceeds, so that the correction is fully applied only when the bias matters most.
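As an illustration, a compact array-based version of the proportional PER described by (29)-(32) is sketched below; the sum-tree structure of [25] is omitted for clarity, and the default values of $\alpha$, $\beta$, and $\varepsilon$ are placeholders rather than the values used in the experiments.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay, cf. (29)-(32)."""

    def __init__(self, capacity, alpha=0.6, beta0=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta0, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        max_p = self.priorities.max() if self.data else 1.0   # new samples get maximal priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.data)] ** self.alpha
        prob = p / p.sum()                                    # sampling probability, (29)
        idx = np.random.choice(len(self.data), batch_size, p=prob)
        weights = (prob.min() / prob[idx]) ** self.beta       # normalized IS weights, (32)
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps   # p_i = |delta_i| + eps, (30)
```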
2.2.5 Training Procedure of UAV Maneuvering Decision-Making Algorithm
According to MDPs, the key issue in optimizing the UAV maneuvering decision-making policy is to solve an optimization problem over the action-value function defined as
\begin{equation*}
Q(s,a)=\mathrm{E}_{\pi}[v(s,\pi)]\tag{33}\end{equation*}
The action-value function can be updated iteratively by \begin{gather*}
Q(s,a)=Q(s,a)+\\ \sigma\left[r+\gamma Q^{\prime} \left(s^{\prime}, \underset{a}{\arg \max} Q(s^{\prime},a)\right)-Q(s,a)\right] \tag{34}\end{gather*}
In the proposed algorithm, the critic network $Q(s,a\vert \theta_{Q})$ is trained by minimizing the loss function \begin{equation*}
L(\theta_{Q})=\mathrm{E}_{(s,a,r,s^{\prime})_{j}}\left[\delta_{j}^{\ 2}\right] \tag{35}\end{equation*}
where the TD-error $\delta_{j}$ of the sampled transition $j$ is \begin{equation*}\delta_{j}=y_{j}-Q(s_{j},a_{j}\vert \theta_{Q})\tag{36}\end{equation*}
and the target value $y_{j}$ is computed by the target networks as \begin{equation*}
y_{j}=\begin{cases}
r_{j},\ s_{j}\ \text{satisfies the termination condition}\\
r_{j}+\gamma Q^{\prime}(s_{j}^{\prime},\mu^{\prime}(s_{j}^{\prime}\vert \theta_{\mu^{\prime}})\vert \theta_{Q^{\prime}}),\ \text{otherwise}\end{cases} \tag{37}\end{equation*}
The gradient of the critic loss with respect to $\theta_{Q}$ is \begin{equation*}\nabla_{\theta_{Q}}L(\theta_{Q})=\mathrm{E}_{(s,a,r,s^{\prime})_{j}}\left[\delta_{j}\cdot\nabla_{\theta_{Q}}Q(s_{j},a_{j}\vert \theta_{Q})\right]. \tag{38}\end{equation*}
Therefore, when we use PER to sample the training data, the cumulative updating value $\Delta$ of the critic network is weighted by the IS weights as \begin{equation*}\Delta=\sum\limits_{j}\omega_{j}\cdot\delta_{j}\cdot\nabla_{\theta_{Q}}Q(s_{j},a_{j}\vert \theta_{Q}). \tag{39}\end{equation*}
Meanwhile, we define the loss function of the actor network $\mu(s\vert \theta_{\mu})$ as \begin{equation*}
L(\theta_{\mu})=\mathrm{E}_{s}[Q(s,\mu(s\vert \theta_{\mu})\vert \theta_{Q})]. \tag{40}\end{equation*}
Thereby, we can obtain the gradient of the actor loss with respect to $\theta_{\mu}$ as \begin{equation*}\nabla_{\theta_{\mu}}L(\theta_{\mu})=\mathrm{E}_{s}\left[\left.\nabla_{\theta_{\mu}}\mu(s\vert \theta_{\mu})\nabla_{a}Q(s,a\vert \theta_{Q})\right\vert _{a=\mu(s\vert \theta_{\mu})}\right]. \tag{41}\end{equation*}
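Assuming the networks and the prioritized buffer sketched earlier, one IS-weighted update step corresponding to (36)-(41) can be written as follows; the optimizer objects, the tensor layout of the minibatch, and the `done` flag are illustrative assumptions.

```python
import torch

def update_networks(batch, weights, actor, critic, actor_t, critic_t,
                    actor_opt, critic_opt, gamma=0.99):
    """One IS-weighted PER-DDPG update step, cf. (36)-(41)."""
    s, a, r, s_next, done = batch                  # tensors of shape [m, ...]
    w = torch.as_tensor(weights, dtype=torch.float32).unsqueeze(1)

    # TD target (37) and TD error (36), computed with the target networks
    with torch.no_grad():
        q_next = critic_t(s_next, actor_t(s_next))
        y = r + gamma * (1.0 - done) * q_next
    td_error = y - critic(s, a)

    # Critic: minimize the IS-weighted squared TD error, cf. (35) and (39)
    critic_loss = (w * td_error.pow(2)).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the action value estimated by the critic, cf. (40)-(41)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return td_error.detach().squeeze(1)            # used to refresh priorities, (30)
```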
In addition, to keep the training of the target networks stable, the parameters of $Q^{\prime}$ and $\mu^{\prime}$ are updated softly towards the evaluation networks as \begin{equation*}\begin{cases}\theta_{Q^{\prime}}=\tau\theta_{Q}+(1-\tau)\theta_{Q^{\prime}}\\ \theta_{\mu^{\prime}}=\tau\theta_{\mu}+(1-\tau)\theta_{\mu^{\prime}}\end{cases}. \tag{42}\end{equation*}
In the equation above, $\tau\in(0,1)$ is the soft update rate. Moreover, to encourage exploration during training, Gaussian noise $N(0,\sigma)$ is added to the action output by the policy: \begin{equation*}
a_{t}=\mu(s_{t}\vert \theta_{\mu})+N(0,\sigma).\tag{43}\end{equation*}
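The soft target update (42) and the noisy exploratory action (43) then correspond to the following short sketch; the clipping of the action to an assumed overload limit is an added illustrative choice.

```python
import torch

def soft_update(target_net, eval_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', cf. (42)."""
    for p_t, p in zip(target_net.parameters(), eval_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def explore_action(actor, state, sigma=0.1, nz_max=3.0):
    """a_t = mu(s_t | theta_mu) + N(0, sigma), clipped to the overload limit, cf. (43)."""
    with torch.no_grad():
        a = actor(state)
    return torch.clamp(a + sigma * torch.randn_like(a), -nz_max, nz_max)
```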
Finally, the training procedure of the UAV maneuvering decision-making algorithm for air-delivery is given in Algorithm 1.
Algorithm 1 UAV Maneuvering Decision-Making Algorithm for Air-Delivery
Input:
The hyperparameters of RL: policy learning period, discount factor $\gamma$, soft update rate $\tau$, and exploration noise variance $\sigma$
The hyperparameters of DL: size of minibatch and learning rates of the actor and critic networks
The hyperparameters of PER: availability exponent $\alpha$, IS exponent $\beta$, and minimal priority constant $\varepsilon$
The hyperparameters of environment: maximum training episodes and maximum decision steps per episode
Output:
The evaluation networks: actor network $\mu(s\vert\theta_{\mu})$ and critic network $Q(s,a\vert\theta_{Q})$
The target networks: actor network $\mu^{\prime}(s\vert\theta_{\mu^{\prime}})$ and critic network $Q^{\prime}(s,a\vert\theta_{Q^{\prime}})$
Run:
Initialize the parameters of the evaluation networks, copy them to the target networks, and empty the transition buffer
For each training episode do
  Reset environment and receive the initial state
  Calculate the normalized state vector
  For each decision step do
    Calculate the action by (43), execute it, and receive the reward and the next state
    Save current transition into the buffer with maximal priority
    If the policy learning period is reached then
      Clear the cumulative updating value $\Delta$
      For each transition of the minibatch do
        Sample data according to the probability in (29)
        Calculate IS weight by (32)
        Calculate TD-error by (36) and (37), and update the sample's priority by (30)
        Accumulate $\Delta$ by (39)
      End for
      Update the parameters of the critic network with $\Delta$
      Update the parameters of the actor network by (41)
      Update the parameters of target networks by (42)
    End if
  End for
End for
Results and Analysis
Based on the model and algorithm presented above, experiments are conducted to verify the rationality of the model and the effectiveness of the algorithm. In the following, we explain the settings of the simulation, the details of the training process, and the results of Monte-Carlo (MC) test experiments, together with their analysis.
3.1 Settings of Simulation Experiments
For the designed experiments, the mission area is restricted to a bounded region around the target point.
Moreover, because each element of the state space in the guidance towards area task and the guidance towards specific point task has a different physical unit, every dimension of the state vector should be normalized before being input into the networks. Details of the parameters defined above are given in Table 1, and we normalize these parameters according to their ranges.
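As a simple illustration of this preprocessing step, the sketch below applies min-max normalization to the state components of the guidance towards area task; the numeric bounds are placeholders standing in for the ranges of Table 1.

```python
import numpy as np

# Illustrative value ranges of the state components (placeholders for Table 1)
STATE_BOUNDS = {
    'd_los':     (0.0, 20000.0),    # LOS distance [m]
    'delta_psi': (-np.pi, np.pi),   # relative azimuth [rad]
    'v_uav':     (30.0, 80.0),      # airspeed [m/s]
    'h_uav':     (500.0, 3000.0),   # altitude [m]
}

def normalize_state(raw):
    """Min-max normalize a raw state dict into [0, 1] per dimension."""
    return np.array([(raw[k] - lo) / (hi - lo) for k, (lo, hi) in STATE_BOUNDS.items()])
```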
3.2 Simulation Results & Analysis of Guidance Towards Area Task
3.2.1 Parameters Setting of Algorithms
According to the training procedure of the algorithm, some parameters should be assigned before training starts. The parameter assignments of the algorithm are shown in Table 2. Moreover, we design the structures of the actor and critic networks used in this task accordingly.
3.2.2 Analysis of Simulation Results
Using the parameters assigned above, the networks are trained successfully. The loss curves of the critic and actor networks of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice for the guidance towards area task are shown in Fig. 11, Fig. 12, Fig. 13, and Fig. 14, respectively.
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS without advice in the guidance towards area task
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS without advice in the guidance towards area task
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS with advice in the guidance towards area task
Fig. 11 and Fig. 12 show the loss curves of the actor and critic networks of UER-DDPG and PER-DDPG without advice, respectively. The actor loss climbs gradually over time and stabilizes after sufficient training, while the critic loss decreases gradually and finally settles at a small value. In Fig. 13 and Fig. 14, similar to the algorithms without advice, the loss curves of the actor and critic networks with advice also stabilize in the end. However, the loss of PER-DDPG converges faster than that of UER-DDPG and is more stable after convergence, in terms of both the actor loss and the critic loss.
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS with advice in the guidance towards area task
Moreover, Fig. 15, Fig. 16, Fig. 17 and Fig. 18 show the training metrics generated over simulation episodes, including the episode reward and the success rate, under the settings of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice. The episode reward refers to the accumulated reward the agent receives in each episode. The success rate refers to the ratio of successful results in the last 50 experiments.
Curves of evaluation parameters for training under the setting of UER-DDPG and RS without advice in the guidance towards area task
Curves of evaluation parameters for training under the setting of PER-DDPG and RS without advice in the guidance towards area task
Curves of evaluation parameters for training under the setting of UER-DDPG and RS with advice in the guidance towards area task
Fig. 15(a), Fig. 16(a), Fig. 17(a) and Fig. 18(a) are the curves of cumulative reward over simulation episodes. They show that the cumulative reward stabilizes at a deterministic value and that all the curves follow a similar trend. Fig. 15(b), Fig. 16(b), Fig. 17(b) and Fig. 18(b) are the curves of the success rate over simulation episodes, which show that it increases until reaching 1.0. All the training experiments converge to an optimal policy, but more time is consumed during training when expert advice is introduced. This indicates that extra training time is needed for the agent to learn the additional, more valuable information.
Curves of evaluation parameters for training under the setting of PER-DDPG and RS with advice in the guidance towards area task
After training, we perform a group of MC experiments to evaluate the quality of the trained results of the algorithms described above. The statistical results of the MC test experiments are shown in Table 5. All four groups of experimental results demonstrate that the trained policies converge to the near optimal point.
Meanwhile, we visualize some assessment results from the MC experiments, including the flight trajectory of the UAV in an experiment, the action given by the agent over time, and the reward received by the agent over time, as shown in Fig. 19, Fig. 20, Fig. 21 and Fig. 22. In Fig. 19(a), Fig. 20(a), Fig. 21(a), and Fig. 22(a), the red solid line represents the flight trajectory of the UAV, the red point and the green “x” indicate the start and end positions of the UAV respectively, and the blue “+” and the blue dashed circle surrounding it indicate the target position and its effective area respectively. The (b) subplots of these figures show the curves of the action output by the algorithm and the reward over time.
Visualization of test experiment for the trained policy of UER-DDPG and RS without advice in the guidance towards area task
Visualization of test experiment for the trained policy of PER-DDPG and RS without advice in the guidance towards area task
Visualization of test experiment for the trained policy of UER-DDPG and RS with advice in the guidance towards area task
Visualization of test experiment for the trained policy of PER-DDPG and RS with Advice in the guidance towards area task
In Fig. 19(a), Fig. 20(a), Fig. 21(a), and Fig. 22(a), we can see that the policy trained by each algorithm with each kind of reward function converges to the near optimal point in the guidance towards area task, as all of them exhibit approximately optimal performance. The UAV can enter the mission area from an arbitrary position and a random azimuth, and all the algorithms perform well on this task. Meanwhile, the action curves show that the policy outputs a reasonable control value in different situations. For example, when the UAV flies towards the target area, the action of the policy is approximately equal to 0. If the target area is located on the front left side of the UAV, the action of the policy is a negative value; on the contrary, the policy gives a positive value when the target area is located on the front right side of the UAV. The reward curves show the same trend: when the UAV selects an action that brings it closer to the target area, the reward given by the model is larger. Especially for the model improved by RS with advice, when the UAV selects an action that increases the potential relative to the last state, it receives a positive bonus, as shown in Fig. 21(b) and Fig. 22(b).
Furthermore, although the trained policies all perform well in the guidance towards area task, different kinds of shaping function produce disparate performance in terms of policy output. Obviously, the outputs of the algorithms with advice in Fig. 21(b) and Fig. 22(b) are much more stable than those without advice in Fig. 19(b) and Fig. 20(b), as there are far fewer high-frequency fluctuations in the curves of reward and action in Fig. 21 and Fig. 22.
Thereby, it is shown that the algorithms and the modified model we design are reasonable and effective for solving the guidance towards area task and for improving the autonomy of the UAV while it adjusts its attitude in preparation for dropping the bomb. Furthermore, the trained results of UER-DDPG and PER-DDPG with expert advice are of higher quality in terms of policy output than those of the algorithms without expert advice, because the output of the policies trained with expert advice is smoother.
3.3 Simulation Results and Analysis of Guidance Towards Specific Point Task
3.3.1 Parameters Setting of Algorithms
Similarly, some parameters should be assigned before training starts. The parameter assignments of the algorithm are shown in Table 6. Moreover, we design the structures of the actor and critic networks used in this task accordingly.
3.3.2 Analysis of Simulation Results
Similar to the guidance towards area task, we also obtain reasonable results after training. Fig. 23, Fig. 24, Fig. 25 and Fig. 26 show the loss curves of the critic and actor networks of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice for the guidance towards specific point task, respectively.
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS without advice in the guidance towards specific point task
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS without advice in the guidance towards specific point task
Loss curves of actor network and critic network over training steps under the setting of UER-DDPG and RS with advice in the guidance towards specific point task
Fig. 23(a), Fig. 24(a), Fig. 25(a), and Fig. 26(a) present the loss curves of the actor networks, which increase or decrease until stabilizing at a deterministic value. Fig. 23(b), Fig. 24(b), Fig. 25(b), and Fig. 26(b) show the loss curves of the critic networks, which stabilize at a good value after extensive training, though there are some peaks during training. From these figures, we find that all the algorithms with the different reward functions converge to a near optimal point, and the loss of PER-DDPG is more stable than that of UER-DDPG after convergence in terms of the smoothness of the loss curves.
Loss curves of actor network and critic network over training steps under the setting of PER-DDPG and RS with advice in the guidance towards specific point task
Moreover, Fig. 27, Fig. 28, Fig. 29 and Fig. 30 show the training metrics generated over simulation episodes, including the episode reward and the success rate, under the settings of UER-DDPG without advice, PER-DDPG without advice, UER-DDPG with advice, and PER-DDPG with advice.
Curves of evaluation parameters for training under the setting of UER-DDPG and RS without advice in the guidance towards specific point task
Curves of evaluation parameters for training under the setting of PER-DDPG and RS without advice in the guidance towards specific point task
Curves of evaluation parameters for training under the setting of UER-DDPG and RS with advice in the guidance towards specific point task
Fig. 27(a), Fig. 28(a), Fig. 29(a), and Fig. 30(a) show that the episode reward increases gradually until reaching a fixed value. Fig. 27(b), Fig. 28(b), Fig. 29(b), and Fig. 30(b) are the curves of the success rate, which show that, although the training process fluctuates, the actor and critic of each algorithm converge to an excellent level. In all the training experiments the optimal policy is obtained, but more time is consumed during training when expert advice is introduced, for reasons similar to those discussed above. In addition, the algorithms are required to operate the UAV more accurately in the guidance towards specific point task, because even a slight adjustment of the UAV's azimuth shifts the impact point far away from the target point owing to the long range of the bomb. Thereby, it is more difficult for the algorithms to converge to the optimal point due to overestimation bias and variance.
Curves of evaluation parameters for training under the setting of PER-DDPG and RS with advice in the guidance towards specific point task
In the same way, we also design a group of MC experiments to assess the quality of the trained results of the above algorithms in the guidance towards specific point task. The statistical results of the MC test experiments are shown in Table 9. All four groups of experimental results demonstrate that the trained policies converge to the near optimal point and reach a usable and satisfactory level.
Simultaneously, we visualize some assessment results from the MC experiments, including the flight trajectory of the UAV in an experiment, the action given by the agent over time, and the reward received by the agent over time, as shown in Figs. 31–34. In the (a) subplots of each figure, the red solid line represents the flight trajectory of the UAV, the red dashed line represents the trajectory of the bomb on the horizontal plane, the red point and the green “x” indicate the start and end positions of the UAV respectively, and the blue “+” and the blue dashed circle surrounding it indicate the target position and its effective area respectively.
Visualization of test experiment for the trained policy of UER-DDPG and RS without advice in the guidance towards specific point task
Visualization of test experiment for the trained policy of PER-DDPG and RS without advice in the guidance towards specific point task
Visualization of test experiment for the trained policy of UER-DDPG and RS with advice in the guidance towards specific point task
Similar to the guidance towards area task, the proposed algorithms show good performance in the guidance towards specific point task. As shown in Figs. 31–34, the policy converges to the near optimal position after training. In Fig. 31(a), Fig. 32(a), Fig. 33(a), and Fig. 34(a), the UAV reaches an appropriate position to release the bomb by outputting a reasonable control value to adjust its attitude. For instance, if the desired impact point is located on the front left side of the UAV, the policy gives a negative action; on the contrary, the action output by the policy becomes positive when the point is located on the front right side of the UAV. These behaviours can be observed from the action curves in Fig. 31(b), Fig. 32(b), Fig. 33(b), and Fig. 34(b).
Visualization of test experiment for the trained policy of PER-DDPG and RS with advice in the guidance towards specific point task
Moreover, the trained policies perform well in the guidance towards specific point task, but different reward shaping methods lead to differences in the stability and robustness of the policy output. Obviously, the outputs of the policies trained with RS with advice are much more stable than those trained with RS without advice, since there is much less irregular noise in the output values shown in Fig. 33(b) and Fig. 34(b) compared with Fig. 31(b) and Fig. 32(b).
Thereby, it is shown that the algorithms and the modified model we design are reasonable and effective for solving the guidance towards specific point task and for enhancing the autonomy of the UAV while it aims at the target in preparation for releasing the bomb. Furthermore, the performance of the trained results of UER-DDPG and PER-DDPG with expert advice is similar to that in the guidance towards area task: the output of the algorithms introducing expert advice is smoother than that of the algorithms without it. This superiority could help the policies trained by the DRL-based algorithm transfer to the real world, because high-frequency oscillations of the control command are unacceptable for real actuators.
Conclusions
In the present work, we refine and describe the guidance towards area task and the guidance towards specific point task in air-delivery. According to the definitions of these problems, we propose a UAV maneuvering decision-making algorithm based on DRL to execute the air-delivery mission autonomously. Within this work, we design and construct the UAV maneuvering decision-making model based on MDPs, consisting of the state space, action space, and reward function of each task. Then, we present the UAV maneuvering decision-making algorithm based on PER-DDPG, in which the PER sampling method improves the utilization of historical data during the training process. Specifically, we propose a construction method for a modified reward function that takes domain knowledge and expert advice into account to improve the inference quality of the trained policy network.
Meanwhile, we design extensive experiments to verify the performance of the proposed algorithms and model. The metrics generated during training show that the PER method helps the algorithm converge more quickly than UER and that the trained results are more stable than those obtained with UER. Furthermore, the MC experiment results demonstrate that the modified reward function involving expert advice can significantly improve the quality of the trained policy and achieves more accurate operation.
In the future, we will consider the influence of the loss of information dimensions, i.e., the case in which the environment is only partially observable. We will also extend the proposed algorithm to manipulate real UAVs in a 3D environment while performing specific missions.