Introduction
An autopilot system for a modern missile must be able to stabilize the missile rotational dynamics and to effectively track the sequence of acceleration commands provided by the navigation and guidance system in order to follow the desired trajectory. Generally, to achieve these objectives, missile autopilots are designed by exploiting classical model-based approaches, mainly relying on linearization of the nonlinear dynamics and gain-scheduling control (see for example [1] and the references therein). However, since the closed-loop performance might be significantly deteriorated by the presence of highly nonlinear terms in the plant dynamics [2], several nonlinear control strategies have been proposed to tackle this issue, ranging from sliding mode approaches [3] to backstepping [4], to nonlinear model predictive control [5] and, more recently, to approaches based on Reinforcement Learning (RL).
RL algorithms proposed in the literature can be classified according to two different paradigms, namely model-based and model-free, depending on the assumed knowledge of the environment model [10]. Although the approaches belonging to the first class, i.e. model-based RL, have been extensively investigated in real applications (see for example [11] and the references therein), they are generally designed under the restrictive assumption that model information is available to the agent. Therefore, the performance of these approaches relies heavily on the accuracy of the model [12]. These considerations suggest resorting to model-free approaches when such information is not available for the training phase. Indeed, in contrast to model-based methods, model-free RL requires more interactions with the external environment and relies mainly on the observed environment responses and feedback, without the need for a deep understanding of its inner functioning [11]. Therefore, these methodologies do not require an estimation of the Markov Decision Process (MDP) model, and the value or policy function can be evaluated directly by sampling in order to approximate the task solution [10]. Although these features can limit the applicability of these strategies in some real applications, they can be used in all cases where there is no a priori information useful for the training phase and, therefore, can be exploited to address the more challenging case of a completely unknown environment. Moreover, recent developments and remarkable achievements in the image processing [13], face recognition [14] and natural language processing [15] fields have suggested integrating Deep Learning into the RL framework, leading to the concept of Deep Reinforcement Learning (DRL), which leverages the ability of deep neural networks to serve as universal function approximators to achieve improved control performance [16], [17]. Thanks to DRL, it is possible to deploy RL-based control systems in all those applications where continuous or high-dimensional state and action spaces make traditional RL strategies, such as Q-learning, impractical or insufficient. In particular, Deep Deterministic Policy Gradient (DDPG, [17]) is currently one of the most common approaches in this field of research. RL and DRL have been successfully applied to various control engineering problems, ranging from autonomous vehicles [18]–[21], to energy and electrical systems [22]–[25], robotics [26], [27], IoT security [28], [29] and maritime applications [30], [31].
Surprisingly, despite their significant potential, only a few recent works propose the use of RL techniques as a control strategy for air vehicle problems. The most representative are perhaps [32] and [33], where the authors exploit a DDPG approach to design, respectively, the inner-loop controller providing attitude control for a quadrotor and the autopilot of an Unmanned Combat Aerial Vehicle, and, more recently, [34], where an RL-based missile path-planning algorithm is proposed for head-on interception. In addition, in [12] a DDPG approach is exploited to tune the control gains of a typical fixed-structure three-loop autopilot [7], with the aim of optimizing the missile autopilot performance.
In this perspective, the objective of this work is to investigate the possibility of successfully exploiting high-performance learning tools for the design of a data-driven missile autopilot in a model-free fashion. To this aim, a policy gradient model-free RL approach, specifically the DDPG strategy, is adopted to stabilize the longitudinal dynamics of a missile and to satisfy some performance requirements through the choice of a suitable reward function. DDPG is a relatively simple Policy Gradient (PG) actor-critic algorithm based on deep neural networks; it has been chosen for the purposes of this work due to its sample efficiency and the small number of hyper-parameters involved, which makes the tuning procedure more straightforward than that of more sophisticated RL techniques. Indeed, deep RL algorithms usually have a rather large number of free parameters (the structure of the neural networks, the learning rates, the soft update policy in the case of twin neural networks, as in TD3, and so on) whose effect on the final result is not always obvious or immediately interpretable. In recent years, several DRL algorithms have been proposed in the literature, some of which can improve the characteristics of the agent's training with respect to the DDPG algorithm exploited here; indeed, DDPG is sometimes prone to training instability issues (mostly because it does not implement any explicit bound on the gradient ascent step-size).
For the sake of completeness and to better motivate our work, although our focus is on DDPG, in the following discussion we give the reader an overview of other comparable DRL methods, while further details of the DDPG algorithm are given in Section IV.
In general, PG RL algorithms aim at exploiting some form of gradient ascent to optimize the policy so as to maximize a given objective function, based on the reward obtained at each time step. However, the gradient method does not prescribe a way to choose a safe step-size in the optimization procedure. For this reason, the Trust Region Policy Optimization (TRPO) algorithm was proposed in [35]; it limits the Kullback-Leibler divergence between the old and updated policies in order to bound the amplitude of the gradient steps. Proximal Policy Optimization (PPO) [36] is a revised version of TRPO, which exploits a clipping mechanism in order to obtain a Trust Region-like optimization algorithm compatible with classical Stochastic Gradient Descent. It is worth remarking that both PPO and TRPO implicitly call for stochastic policies.
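For illustration, a minimal NumPy sketch of the PPO clipping idea is reported below (the function and variable names are ours and are not taken from [36]): the probability ratio between the updated and old policies is clipped so that the surrogate objective provides no incentive to move the policy far from the old one.

import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # ratio: pi_new(a|s) / pi_old(a|s) evaluated on sampled state-action pairs
    # advantage: estimated advantages for the same samples
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # the element-wise minimum removes any benefit from overly large policy updates
    return np.mean(np.minimum(unclipped, clipped))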
On the other hand, RL research has moved along a parallel path in order to increase the sample efficiency of training algorithms for agents that employ neural networks (especially in the actor-critic framework). The simplest algorithm belonging to this class of techniques is DDPG, which builds on ideas stemming from the Deep Q-Network algorithm but is naturally suited for continuous action spaces, and which exploits a replay buffer technique. In some implementations, target networks are also used to improve the algorithm's stability.
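As an illustration of the replay buffer mechanism mentioned above, the following minimal Python sketch (illustrative only, not the implementation used in this work) stores transitions and returns uniformly sampled minibatches, which breaks the temporal correlation between consecutive samples.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # uniform sampling decorrelates the minibatch from the current trajectory
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states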
Modifications of DDPG have been proposed in the technical literature to improve some aspects of the agent's training procedure; in [37] the TD3 algorithm was proposed, which introduces additional mechanisms to avoid overestimation and reduce variance, providing better stability properties in some application cases, while a maximum entropy version of DDPG/TD3, named Soft Actor-Critic (SAC), has been introduced in [38].
In this view, our final choice fell on the DDPG algorithm, which, on the one hand, tends to be more sample efficient than PPO and, on the other, has fewer tuning parameters than more sophisticated techniques such as TD3 or SAC. Thus, in this paper, we first recast the missile autopilot design problem into the RL framework, with the primary aim of testing this approach in terms of control performance (settling time, undershoot, etc.), and then we compare the fully data-driven DDPG controller against classical model-based control strategies (such as the self-scheduled $\mathcal {H}_{\infty }$ and Adaptive Augmenting Control designs considered in Section VI).
Most notably, despite the nonlinearities that affect the process under examination, it was found that, when applying the DDPG approach to the autopilot problem, a deep knowledge of the plant model is not required: a linear model can be effectively used during the training procedure to reduce the required computational burden, without degrading the performance of the real closed-loop system, at least close to the considered equilibrium point. Along this line, the agent obtained through the proposed method is then validated on the 2-Degrees of Freedom (2-DoF) fully nonlinear model.
The analysis further discloses how a careful study and definition of the reward function allows one to easily shape the transient performance, for example by reducing the undershoot. In addition, comparison results in a realistic flight scenario confirm that the excellent capability of the proposed RL approach in capturing the underlying unknown nonlinear behaviors provides satisfactory closed-loop performance, comparable to that of state-of-the-art model-based techniques, without the need for running a detailed model of the process in real time or for having a detailed a priori knowledge of the nonlinear dynamics. In addition, simulations at different Mach numbers and with random variations in the aerodynamic coefficients, following a Monte Carlo approach, are performed in order to provide some meaningful insight into the robustness of the closed loop.
It is finally worth noting that the need for pioneering solutions to respond to unmet challenges, as well as to the new opportunities deriving from the application of AI techniques to this research field, is confirmed by the autopilot system very recently designed in [40] by leveraging a modified TRPO agent trained on a detailed nonlinear model of the plant dynamics. In particular, such a system exploits a transformed acceleration signal as the controlled variable to overcome the inherent non-minimum phase characteristics of the missile dynamics. This approach does not allow the authors to take into account, during the training of the RL agent, the typical undershoot that characterizes the transient response of a missile to a step request in the acceleration. As opposed to [40], the present work instead investigates the capability of a purely data-driven missile autopilot by explicitly considering the main performance indexes (settling time, undershoot, steady-state error, etc.) in the DDPG reward function.
The rest of the paper is organized as follows. Sections II and III describe the control requirements and the missile nonlinear 2-DoF model, respectively, while Section IV provides a brief introduction to the DDPG algorithm. The details of the proposed RL approach, in terms of agent structure, reward function engineering and training procedure, are described in Section V, while simulation results are discussed in Section VI, where the performance of the proposed RL agent is compared to that of a self-scheduled $\mathcal {H}_{\infty }$ controller and of an Adaptive Augmenting Control strategy. Finally, concluding remarks are given in Section VII.
Problem Statement
This section defines the control requirements that will be taken into account in the design of the proposed autopilot based on an RL control approach.
During the flight, the longitudinal dynamics of a missile can be unstable, depending on the relative location of the center of pressure (the point where the lifting force is considered to act, as shown in Fig. 1) with respect to the center of mass. In order to stabilize and control the longitudinal dynamics of the missile, a tail fin is introduced. It follows that the controller must generate the tail deflection required to produce the desired normal acceleration, while stabilizing the airframe rotational motion. Moreover, the transient response of the missile to a step request in the normal acceleration is characterized by an initial undershoot, which is reflected by the fact that the associated linearized model is a non-minimum phase one [39]. Ideally, this undershoot should be kept as small as possible; however, as will be shown in what follows, this results in a slower response, hence a trade-off between the bandwidth of the closed-loop system and the maximum undershoot must be sought.
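To illustrate this trade-off, consider the following simple second-order non-minimum phase transfer function with a real right-half-plane zero at $s=z>0$ (an illustrative example, not the actual missile transfer function):
\begin{align*} G(s)&=\frac {\omega _{n}^{2}\left ({1-s/z}\right)}{s^{2}+2\xi \omega _{n} s+\omega _{n}^{2}}, \qquad z>0,\\ \dot {y}(0^{+})&=\lim _{s\to \infty } s\,G(s)=-\frac {\omega _{n}^{2}}{z} < 0, \qquad y(\infty)=G(0)=1.\end{align*}
For a unit step command, the response thus initially moves in the direction opposite to its final value; moreover, for a fixed zero $z$, increasing the closed-loop bandwidth $\omega _{n}$ makes the initial reverse motion larger and faster, which is precisely the trade-off mentioned above.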
Simplified scheme of the considered missile: the velocity vector is depicted in red, while the lift forces are in blue.
Based on the previous observations, the following qualitative requirements are considered in Section V-B to design the reward function of the proposed DDPG approach, in order to ensure performance similar to that of other solutions available in the literature ([6], [39], [41]):
1) the control system shall ensure the stability of the closed-loop system over the largest possible operating range, defined in terms of the angle of attack $\alpha (t)$ and the Mach number $M$; it should be noticed that a wider range in terms of $\alpha (t)$ is preferable, since typical applications foresee the scheduling of different controllers as a function of $M$ (see [41] as an example);
2) the control system shall take into account the maximum deflection that can be applied to the tail;
3) in tracking a step command in the normal acceleration, the control system shall minimize the following quantities:
a) the rise time to 90% of the final value;
b) the overshoot;
c) the undershoot;
d) the steady-state error.
Longitudinal Missile Dynamic Model
In order to simulate the missile dynamics and to prove the effectiveness of the proposed autopilot system, the following simplified 2-DoF nonlinear model proposed in the literature [39], [41] is considered, which is capable of describing the longitudinal dynamics of a tail-controlled missile (see Fig. 1) under the following assumption.
Assumption 1 (Fully Decoupled Dynamics):
It is assumed that the pitch, yaw and roll channels are decoupled, so coupling phenomena are ignored.
Given Assumption 1, the longitudinal dynamics of the missile can be described as follows:\begin{align*} \dot {\alpha }(t)&=K_{\alpha } M C_{n}\left ({\alpha (t),\delta (t),M}\right) \cos (\alpha (t)) + q(t), \tag{1a}\\ \dot {q}(t)&=K_{q} M^{2} C_{m}\left ({\alpha (t),\delta (t),M}\right), \tag{1b}\\ \dot {\delta }(t)&=\delta _{v}(t), \tag{1c}\\ \dot {\delta }_{v}(t)&=-\omega _{a}^{2} \delta (t) -2\zeta \omega _{a}\delta _{v}(t) + \omega _{a}^{2}\delta _{c}(t), \tag{1d}\\ \eta (t)&=K_{z} M^{2} C_{n}\left ({\alpha (t),\delta (t),M}\right), \tag{1e}\end{align*}
where $\alpha (t)$ is the angle of attack, $q(t)$ the pitch rate, $\delta (t)$ the tail fin deflection, $\delta _{v}(t)$ its rate, $\delta _{c}(t)$ the commanded deflection, $\eta (t)$ the normal acceleration, $M$ the Mach number, and $K_{\alpha }$, $K_{q}$, $K_{z}$, $\omega _{a}$, $\zeta$ are constant model parameters.
Equations (1c) and (1d) define a second-order linear model of the actuator that links the tail fin deflection command $\delta _{c}(t)$ to the actual deflection $\delta (t)$, with natural frequency $\omega _{a}$ and damping ratio $\zeta$. The aerodynamic coefficients $C_{n}$ and $C_{m}$ are approximated by the following polynomial functions of the angle of attack, the tail deflection and the Mach number:\begin{align*} C_{n} (\alpha,\delta,M)&=a_{n} \alpha ^{3} + b_{n} \alpha |\alpha | + c_{n} \left ({2-\frac {1}{3} M}\right)\alpha +d_{n} \delta, \tag{2a}\\ C_{m}(\alpha,\delta,M)&=a_{m} \alpha ^{3} + b_{m} \alpha |\alpha | + c_{m}\left ({-7+\frac {8}{3} M}\right)\alpha + d_{m} \delta,\tag{2b}\end{align*}
where $a_{n},b_{n},c_{n},d_{n}$ and $a_{m},b_{m},c_{m},d_{m}$ are constant aerodynamic coefficients taken from [39], [41].
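For concreteness, a Python sketch of the state derivatives of model (1)-(2) is given below; the entries of the parameter dictionary (the constants $K_{\alpha }$, $K_{q}$, $K_{z}$, $\omega _{a}$, $\zeta$ and the aerodynamic coefficients) are placeholders to be filled with the benchmark values from [39], [41], and are not specified here.

import numpy as np

def missile_dynamics(x, delta_c, M, p):
    # x = [alpha, q, delta, delta_v]; p is a dictionary of model constants (placeholders)
    alpha, q, delta, delta_v = x
    # polynomial aerodynamic coefficients, equations (2a)-(2b)
    Cn = (p['a_n'] * alpha**3 + p['b_n'] * alpha * abs(alpha)
          + p['c_n'] * (2.0 - M / 3.0) * alpha + p['d_n'] * delta)
    Cm = (p['a_m'] * alpha**3 + p['b_m'] * alpha * abs(alpha)
          + p['c_m'] * (-7.0 + 8.0 * M / 3.0) * alpha + p['d_m'] * delta)
    alpha_dot = p['K_alpha'] * M * Cn * np.cos(alpha) + q                      # (1a)
    q_dot = p['K_q'] * M**2 * Cm                                               # (1b)
    delta_dot = delta_v                                                        # (1c)
    delta_v_dot = (-p['w_a']**2 * delta - 2.0 * p['zeta'] * p['w_a'] * delta_v
                   + p['w_a']**2 * delta_c)                                    # (1d)
    eta = p['K_z'] * M**2 * Cn                                                 # (1e), measured output
    return np.array([alpha_dot, q_dot, delta_dot, delta_v_dot]), eta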
Deep Deterministic Policy Gradient
In the RL approach, an agent must learn to interact with an unknown environment in a way that maximizes the expected cumulative value of a given reward function. Usually, the environment is modeled as a Partially Observable MDP (PO-MDP); in particular, at each time instant $t$, the agent receives an observation from the environment and must pick an action according to its policy, with the goal of maximizing the expected value of the discounted cumulative reward \begin{equation*} R_{t} = \sum _{k=0}^{N} \gamma ^{k} r_{t+k+1}, \quad \gamma \in [0,1), \tag{3}\end{equation*}
where $r_{t}$ denotes the reward received at time $t$ and $\gamma$ is the discount factor.
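As a simple numerical illustration of (3), the discounted cumulative reward can be computed from a finite sequence of rewards as follows (a sketch, not part of the agent implementation):

def discounted_return(rewards, gamma=0.99):
    # rewards[k] corresponds to r_{t+k+1} in equation (3)
    return sum(gamma**k * r for k, r in enumerate(rewards))

# e.g. discounted_return([1.0, 1.0, 1.0], gamma=0.9) returns 1 + 0.9 + 0.81 = 2.71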
In classical tabular RL methods, discrete action and observation spaces are considered. The name tabular reflects the fact that, in such methods, the agent usually stores a table that associates with each state-action pair the value of the expected cumulative reward $Q(s,a)$.
Tabular methods, however, can only handle discrete action and observation spaces and become inefficient in the presence of continuous or high-dimensional spaces. To overcome this limitation, several extensions have been proposed in the technical literature, mainly exploiting deep neural networks and their capability of serving as universal function approximators. The combination of Deep Learning techniques with Reinforcement Learning algorithms is usually referred to as Deep Reinforcement Learning. In particular, in actor-critic methods, the RL problem is separated into two subproblems:
critic: finds a good approximation of the action-value function $Q(s,a)$, where $s$ and $a$ may assume continuous values;
actor: exploits the critic to improve the policy, represented by another function approximator $\mu (s)$.
In this study, the actor-critic method known as the DDPG algorithm, originally proposed in [17], is considered. DDPG is a model-free, off-policy approach that extends the Deterministic Policy Gradient (DPG) [16] with the exploitation of deep neural networks. A simple representation of the DDPG paradigm is shown in Fig. 2. In DDPG, an actor network $\mu (s\vert \theta ^{\mu })$ with parameters $\theta ^{\mu }$ and a critic network $Q(s,a\vert \theta ^{Q})$ with parameters $\theta ^{Q}$ are maintained. The critic is trained by minimizing the loss \begin{align*} L(\theta ^{Q})&=\mathbb {E}\Big [\big (Q(s_{t},a_{t}\vert \theta ^{Q})-y_{t}\big)^{2} \Big] \\ &\approx \frac {1}{N} \sum _{i}\big (Q(s_{i}, a_{i} \vert \theta ^{Q})-y_{i}\big)^{2}, \tag{4}\end{align*}
where $N$ is the minibatch size and $y_{i}=r_{i}+\gamma Q'\big (s_{i+1},\mu '(s_{i+1}\vert \theta ^{\mu '})\vert \theta ^{Q'}\big)$ is the target value computed through the target networks $Q'$ and $\mu '$, while the actor parameters are updated along the sampled policy gradient \begin{align*} \nabla _{\theta ^{\mu }} J&=\mathbb {E} \Big [\nabla _{a} Q(s,a \vert \theta ^{Q})\big \vert _{s=s_{t}, a=\mu (s_{t})}\, \nabla _{\theta ^{\mu }} \mu (s \vert \theta ^{\mu })\big \vert _{s=s_{t}} \Big] \\ &\approx \frac {1}{N} \sum _{i} \nabla _{a} Q(s,a \vert \theta ^{Q})\big \vert _{s=s_{i}, a=\mu (s_{i})}\, \nabla _{\theta ^{\mu }} \mu (s \vert \theta ^{\mu })\big \vert _{s=s_{i}}. \tag{5}\end{align*}
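The following PyTorch sketch illustrates how a sampled minibatch can be used to implement the critic loss (4), the policy update corresponding to (5), and the soft update of the target networks; the network objects and optimizers are assumed to be defined elsewhere, and the hyper-parameter values are placeholders rather than those used in this work.

import torch
import torch.nn.functional as F

def ddpg_update(batch, critic, actor, critic_target, actor_target,
                critic_opt, actor_opt, gamma=0.99, tau=0.001):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # critic update: minimize the loss in equation (4)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor update: ascend the sampled policy gradient in equation (5)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft update of the target networks
    for target, source in ((critic_target, critic), (actor_target, actor)):
        for p_t, p_s in zip(target.parameters(), source.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p_s.data)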
Basic scheme of a DDPG agent showing the interaction between the actor, the critic (represented as deep neural networks) and the environment (represented as a missile). According to the DDPG technique, the environment provides the state and the reward to the agent, which in turn applies the action selected by the actor.
Since the networks being trained are also used to compute the target values $y_{i}$, the learning process may become unstable; for this reason, DDPG employs the target networks $Q'$ and $\mu '$, whose weights $\theta ^{Q'}$ and $\theta ^{\mu '}$ slowly track the trained networks through a soft update with rate $\tau \ll 1$, as summarized in the listing below.
Deep Deterministic Policy Gradient (DDPG)
Randomly initialize the critic and actor networks $Q(s,a\vert \theta ^{Q})$ and $\mu (s\vert \theta ^{\mu })$ with weights $\theta ^{Q}$ and $\theta ^{\mu }$;
Initialize the target networks $Q'$ and $\mu '$ with weights \begin{equation*} \theta ^{Q'} \leftarrow \theta ^{Q}, \quad \theta ^{\mu '} \leftarrow \theta ^{\mu };\end{equation*}
Initialize the replay buffer $R$;
for each training episode do
Initialize the random process $\mathcal {N}$ for action exploration;
Receive the initial observation state $s_{1}$;
for each time step $t$ of the episode do
Select the action $a_{t}=\mu (s_{t}\vert \theta ^{\mu })+\mathcal {N}_{t}$ according to the current policy and the exploration noise;
Execute the action $a_{t}$ and observe the reward $r_{t}$ and the new state $s_{t+1}$;
Store the transition $(s_{t},a_{t},r_{t},s_{t+1})$ in $R$;
Sample a random minibatch of $N$ transitions $(s_{i},a_{i},r_{i},s_{i+1})$ from $R$;
Set $y_{i}=r_{i}+\gamma Q'\big (s_{i+1},\mu '(s_{i+1}\vert \theta ^{\mu '})\vert \theta ^{Q'}\big)$;
Update the critic by minimizing the loss function in equation (4);
Update the actor policy using the sampled policy gradient in equation (5);
Update the target networks:\begin{align*} \theta ^{Q'}\leftarrow&\tau \theta ^{Q}+(1-\tau)\theta ^{Q'};\\ \theta ^{\mu '}\leftarrow&\tau \theta ^{\mu }+(1-\tau)\theta ^{\mu '}.\end{align*}
end for
end for
RL Control System for Missile
In this section, the proposed DDPG control algorithm is introduced, focusing on the control system architecture, neural networks and details concerning the reward function proposed for the training phase.
A. Controller Architecture
Starting from the state variables of the 2-DoF nonlinear missile model in (1), the observation vector for the agent training has been chosen as\begin{equation*} Obs(t)=\begin{bmatrix} \alpha (t) &\quad q(t) &\quad \delta (t) &\quad \eta _{ref}(t) \end{bmatrix}^{T}, \end{equation*}
where $\eta _{ref}(t)$ denotes the reference normal acceleration to be tracked.
The structures of the neural networks (see Fig. 2) have been defined through a trial-and-error procedure in terms of the number of hidden layers and neurons, activation functions, etc., considering a trade-off between performance and the limited computational capacity available on board. The main results of the analysis that was carried out are summarized in what follows.
The architecture of the critic neural network is shown in Table 2. This neural network has 5 input variables, i.e. the four observation variables and the action variable, and a single output variable, representing the critic's estimate of the action-value function. Note that all input variables were normalized so as to take values in the range [0, 1]. Five fully connected layers connect inputs and outputs, each characterized by 100 neurons and Rectified Linear Unit (ReLU) activation functions. In particular, fully connected layers 1 and 2 process the observation variables in sequence, while fully connected layer 3 processes the action variable. The outputs of layers 2 and 3 are then summed before passing through fully connected layers 4 and 5.
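A possible PyTorch realization of this critic structure is sketched below; the class and layer names, as well as the final scalar output layer, are our own illustrative choices and are not taken from the actual implementation.

import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, obs_dim=4, act_dim=1, width=100):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, width)   # observation path (layers 1-2)
        self.fc2 = nn.Linear(width, width)
        self.fc3 = nn.Linear(act_dim, width)   # action path (layer 3)
        self.fc4 = nn.Linear(width, width)     # common path (layers 4-5)
        self.fc5 = nn.Linear(width, width)
        self.out = nn.Linear(width, 1)         # scalar estimate of the action-value function

    def forward(self, obs, act):
        x = torch.relu(self.fc2(torch.relu(self.fc1(obs))))
        y = torch.relu(self.fc3(act))
        z = torch.relu(self.fc4(x + y))        # the two paths are summed
        z = torch.relu(self.fc5(z))
        return self.out(z)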
The architecture of the actor's neural network is shown in Table 3. This network has 4 input variables, i.e. the observation variables. Also in this case, the input variables have been normalized to take values in the range [0, 1]. The only output of the network is the control action. Between the input and output layers there are four fully connected layers, each containing 100 neurons. The first three layers have ReLU activation functions, while the last one has a hyperbolic tangent (tanh) activation function, which produces an output in the range [−1, 1]. The output of this layer is then scaled according to the maximum allowed actuator deflection.
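Analogously, the actor structure can be sketched as follows; the output dimension of the last layer and the value of the scaling factor delta_max (the maximum allowed deflection) are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=4, width=100, delta_max=1.0):
        super().__init__()
        self.delta_max = delta_max             # maximum allowed tail deflection (placeholder)
        self.fc1 = nn.Linear(obs_dim, width)
        self.fc2 = nn.Linear(width, width)
        self.fc3 = nn.Linear(width, width)
        self.fc4 = nn.Linear(width, 1)

    def forward(self, obs):
        x = torch.relu(self.fc1(obs))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # tanh bounds the output in [-1, 1]; scaling maps it to the actuator range
        return self.delta_max * torch.tanh(self.fc4(x))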
B. Reward Engineering
Once the structure of the agent has been set, a reward function must be defined, taking into account the requirements discussed in Section II. To attain the desired goals, the following reward function has been used:\begin{align*} r(t)=&-\omega _{1}(t) P_{fail} + \left ({1-\omega _{1}(t)}\right) \big [P_{step} - K_{1} e^{2}(t) - K_{2} q^{2}(t) \\ & - K_{3} \dot {\delta }^{2}(t) - \omega _{2}(t) K_{4} e^{2}(t) - \left ({1-\omega _{3}(t)}\right) K_{5} \delta ^{2}(t) \\ & +\omega _{3}(t) P_{win}\big],\tag{6}\end{align*}
where $e(t)$ is the tracking error between the commanded and the actual normal acceleration, $K_{1},\dots,K_{5}$, $P_{fail}$, $P_{step}$ and $P_{win}$ are positive constants, and $\omega _{1}(t)$, $\omega _{2}(t)$, $\omega _{3}(t)$ are binary weights that switch the corresponding penalty and bonus terms on and off according to the current tracking condition.
It can be seen how the control policy is rewarded by function (6) when the missile acceleration is steered towards and kept close to the reference, i.e. when requirements 3a and 3d are met, while it is penalized when the missile motion exceeds a prescribed range of lateral acceleration values. The quadratic terms in the missile angular velocity, actuator deflection and deflection rate are used to take into account the requirements on the overshoot and the undershoot, and to limit the control effort. Moreover, due to the non-minimum phase behavior of the linearized plant, a further error penalty is considered to limit the undershoot. Since some requirements conflict with each other, e.g. rise time and overshoot, the positive constants in (6) must be tuned so as to trade off the different objectives.
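A hedged Python sketch of the reward logic in (6) is reported below; the switching conditions used for the binary weights $\omega _{1}(t)$, $\omega _{2}(t)$ and $\omega _{3}(t)$ (failure range, undershoot detection and goal band) and the corresponding thresholds are our own illustrative assumptions based on the description above.

def reward(eta, eta_ref, q, delta, delta_dot, K, P,
           fail_limit=4.0, goal_tol=0.05):
    # K = (K1, ..., K5), P = (P_fail, P_step, P_win); thresholds are placeholders
    K1, K2, K3, K4, K5 = K
    P_fail, P_step, P_win = P
    e = eta_ref - eta                                         # acceleration tracking error
    w1 = 1.0 if abs(eta) > fail_limit else 0.0                # failure: acceleration out of range
    w2 = 1.0 if eta * eta_ref < 0.0 else 0.0                  # undershoot: moving against the reference
    w3 = 1.0 if abs(e) <= goal_tol * max(abs(eta_ref), 1e-6) else 0.0  # tracking goal reached
    return (-w1 * P_fail
            + (1.0 - w1) * (P_step - K1 * e**2 - K2 * q**2 - K3 * delta_dot**2
                            - w2 * K4 * e**2 - (1.0 - w3) * K5 * delta**2
                            + w3 * P_win))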
C. Training Procedure
According to the reward function (6), a training procedure has been performed on the missile model linearized around a selected equilibrium flight condition, defined in terms of angle of attack and Mach number.
More specifically, each training episode is characterized by a different step command, whose amplitude is chosen randomly within the range [−1, 1] [g], and terminates when the simulation time reaches its maximum value.
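As a minimal illustration of this episode setup, a per-episode reference can be generated as follows; the maximum episode duration t_max is a placeholder, since its actual value is not reproduced here.

import numpy as np

def new_episode_reference(t_max=5.0):
    # random step amplitude in [-1, 1] g, held constant for the whole episode
    amplitude_g = np.random.uniform(-1.0, 1.0)
    return lambda t: amplitude_g if 0.0 <= t <= t_max else 0.0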
Simulation Results
In this section, the effectiveness of the proposed controller is characterized through numerical simulations.
The DDPG agent has been trained by implementing the procedure defined in Section V-C. In particular, the constants in the reward function (6) have been chosen equal to \begin{align*} K_{1}&=1, \quad K_{2}=0.2,\quad K_{3}=0.002,\quad K_{4}=5,\quad K_{5}=25,\\ P_{fail}&=150, \quad P_{step}=3,\quad P_{win}=25.\end{align*}
Section VI-A shows how the proposed data-driven approach is capable of learning the nonlinear behaviour of the missile described by (1) from the limited experience that it gathers from the response of the linearized model at specific flight conditions. Moreover, we evaluate the robustness of the control system for different flight conditions and in the presence of uncertainty in the aerodynamic coefficients. A further assessment is carried out in Section VI-B, by comparing the trained DDPG agent with two robust model-based strategies, i.e. the self-scheduled $\mathcal {H}_{\infty }$ controller and the Adaptive Augmenting Control (AAC) scheme.
A. Controller Validation
In this section, the closed-loop responses of the linearized and nonlinear models are compared to validate the trained DDPG agent. A maneuver starting from the flight condition used in the training phase and consisting of three different acceleration requests is considered (see the black trace in Fig. 3). The simulation results reported in Fig. 3 show that the closed-loop responses to the first request of one additional [g] of normal acceleration are in close agreement for the two models, confirming that the agent trained on the linearized dynamics behaves consistently on the nonlinear plant close to the considered equilibrium point.
Comparison between the closed-loop responses of the linearized (blue) and nonlinear (red) models. Time traces of: (a) angle of attack; (b) pitch rotational rate; (c) tail fin deflection; (d) normal acceleration.
Moreover, the robustness of the proposed approach has been evaluated by considering 820 different nonlinear simulations, performed for a step command of fixed magnitude over a range of flight conditions.
The figure shows the variation of the cumulative reward as the flight condition changes.
Controller performance at different Mach numbers for the same step command.
Furthermore, the robustness of the proposed approach has been evaluated through Monte Carlo simulations performed with the nonlinear model, starting from the same initial flight condition considered above.
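For illustration, such a Monte Carlo study can be organized as in the following sketch; the relative perturbation level sigma, the perturbed coefficient set and the callables nominal_coefficients and simulate_closed_loop are placeholders, not quantities taken from this work.

import numpy as np

def monte_carlo_robustness(simulate_closed_loop, nominal_coefficients,
                           n_runs=100, sigma=0.1):
    # simulate_closed_loop(p) -> performance record; nominal_coefficients() -> dict of constants
    results = []
    for _ in range(n_runs):
        p = dict(nominal_coefficients())
        for name in ('a_n', 'b_n', 'c_n', 'd_n', 'a_m', 'b_m', 'c_m', 'd_m'):
            p[name] *= 1.0 + sigma * np.random.uniform(-1.0, 1.0)  # random coefficient variation
        results.append(simulate_closed_loop(p))
    return results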
Fig. 6 shows the results for 100 runs in which a step acceleration command is tracked while the aerodynamic coefficients are randomly perturbed.
Monte Carlo robustness analysis for 100 random uncertainty realizations of the aerodynamic coefficients
B. Comparison With Model-Based Methodologies
To better discuss the advantages of the proposed DDPG strategy in tracking the missile lateral acceleration, we now compare its closed-loop behaviour with that of two different robust model-based strategies proposed in the literature to solve the same control problem. Specifically, the former has been presented in [6], where the authors developed a robust self-scheduled $\mathcal {H}_{\infty }$ autopilot, while the latter is an Adaptive Augmenting Control (AAC) scheme.
The design procedure for the self-scheduled $\mathcal {H}_{\infty }$ controller is detailed in [6].
Similarly, the AAC achieves robustness by designing a baseline state-feedback controller that guarantees robust stability for all the models belonging to the convex hull defined by the linearized models over the considered operating range.
The simulation results are shown in Figs. 7 and 8, where the closed-loop responses have been compared over two different maneuvers. For the comparison in Fig. 7, we have considered a sequence of three step commands. When tracking the first 1 [g] step reference, all the controllers show the same undershoot, while the response of the RL agent is characterized by a slightly shorter settling time. When a reference change larger than 1 [g] is requested, the response of the data-driven controller always exhibits the smallest undershoot and the smallest control effort when compared with the two model-based controllers. Moreover, in the worst case, the settling time of the RL controller is similar to those of the other two approaches.
Performance comparison among the proposed data-driven controller and the two model-based controllers in tracking a reference signal which models a sequence of step maneuvers of different magnitudes. Time traces of: (a) tail fin deflection; (b) normal acceleration.
Performance comparison among the proposed data-driven controller and the two model-based controllers in tracking a reference signal which emulates the effects of a guidance system for hitting a moving target. Time traces of: (a) tail fin deflection; (b) normal acceleration.
The further simulations shown in Fig. 8 refer to the response to a reference signal similar to the one computed by a guidance system, as proposed in [42]. Here we want to remark that, although the RL controller was not trained on this class of reference signals, it shows similar performance when compared to the two model-based controllers. From these results, it is possible to conclude that the proposed DDPG autopilot exhibits the same robustness against model uncertainties as the two model-based approaches. This result is achieved without the need for a detailed system model, as required by both the self-scheduled $\mathcal {H}_{\infty }$ and the AAC designs.
Conclusion
The feasibility of a model-free controller for the lateral acceleration of a missile has been investigated in this article. Specifically, exploiting the DDPG approach, an RL agent has been trained on the linearized dynamics of a 2-DoF nonlinear missile model, taking into account the main performance indexes. To assess the effectiveness of the proposed approach, different scenarios have been simulated on the 2-DoF nonlinear model, proving the efficiency of the data-driven approach in stabilizing the rotational dynamics and in satisfying the control requirements at the design flight conditions. Furthermore, a robustness analysis has been provided to show the capability of the proposed approach to guarantee closed-loop stability over a wide range of flight conditions and in the presence of model uncertainty. Along this line, future work will involve the improvement of the robustness with respect to variations of the Mach number, model uncertainties and measurement noise, through the explicit inclusion of robustness as a further objective during the training phase.