Introduction
The rapid advancement of autonomous driving technology is propelling transportation systems towards greater levels of automation and intelligence, with primary design objectives focused on ensuring vehicular safety, providing a comfortable driving experience, and enhancing efficiency. In the evolution of autonomous driving technology, the capability to make safe decisions across all driving scenarios is paramount [1], [2]. Currently, decision-making and planning in autonomous driving primarily rely on rule-based and learning-based methods.
Rule-based methods in autonomous driving rely on a repository of predefined rules to make decisions or process information. These rules dictate when an autonomous vehicle should accelerate, decelerate, turn, or stop based on traffic laws and safe driving principles across various scenarios. In emergencies or rare events, these vehicles depend on these established rules for response. This method effectively addresses the majority of common driving scenarios [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. However, given the extensive, complex, and safety-critical nature of scenarios and conditions that vehicles encounter, the inflexibility and limited adaptability of manually constructed rules can lead to severe traffic accidents in unforeseen situations.
Learning-based methods capture complex mappings between inputs and outputs by analyzing vast datasets and employing sophisticated deep neural network models to make decisions or process information. Due to the reliance of supervised learning on the quality and diversity of training datasets, these models often exhibit limited generalization capabilities when encountering new scenarios, which has led to the adoption of deep RL methods in autonomous driving decisions. Deep RL [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], a method that integrates principles of deep learning and RL, transcends reliance on rule sets or large datasets alone. It actively engages with the environment to explore and adapt, optimizing its responses, making it a potent solution for designing autonomous driving decisions [28], [29]. Compared to rule-based methods like finite state machines, learning-based methods offer significant advantages in adaptability and flexibility. These methods autonomously learn from data and adapt to new patterns and changes, thereby reducing reliance on hardcoded rules. Furthermore, deep RL possesses the potential for continuous improvement, consistently enhancing its performance and accuracy by learning from new data, adapting to evolving circumstances.
Deep RL methods currently applied in autonomous driving, such as Deep Deterministic Policy Gradient (DDPG) [18], [19], Deep Q-Network (DQN) [13], [14], [15], [16], [17], Proximal Policy Optimization (PPO) [20], [21], Soft Actor-Critic (SAC) [22], [23], [24], Advantage Actor-Critic (A2C) [25], [26], and Asynchronous Advantage Actor-Critic (A3C) [27], have effectively addressed the poor timeliness and heavy reliance on existing data of traditional methods. However, these deep RL methods are predominantly driven by a single reward mechanism. In the pursuit of maximizing overall rewards, they frequently struggle to balance sub-rewards such as safety, comfort, and efficiency, and often fail to ensure that autonomous driving decisions consistently adhere to safety constraints. Particularly in extreme or marginal traffic scenarios, which fall outside the distribution of the training data, these RL policies struggle to consistently maintain vehicular safety [30]. In autonomous driving, safety is a fundamental requirement and should not merely be traded off against other performance metrics; the field therefore demands deep RL methods that can effectively satisfy the applicable safety constraints.
To address these limitations, this paper introduces a constraint-driven safety RL method, Multi-Level Safety-Constrained Policy Optimization (MLSCPO). This method has been applied to decision-making and planning policy in highway scenarios, seeking to maximize performance rewards while adhering to stringent safety constraints. It achieves safe, efficient, and comfortable driving, optimizes the dynamic coordination of traffic flow, and exhibits outstanding robustness. The specific contributions of this paper are as follows:
This paper introduces the MLSCPO method. Building on traditional trust-domain algorithms, it incorporates a cost-based value network to conduct tiered quantitative assessments of how well safety constraints are satisfied under the current vehicle operating conditions, and updates the policy accordingly. Under low-risk safety conditions, policy updates target reward maximization; under medium-risk conditions, a dual optimization problem is solved so that training adheres to the safety constraints, allowing policy updates to progress towards reward optimization without compromising safety costs; under high-risk conditions, natural gradient methods are employed to find solutions that downgrade the current high-risk condition to a medium- or low-risk one, thus enhancing the policy’s compliance with safety constraints.
This paper constructs refined reward and cost functions and defines the state and action spaces, framing the highway autonomous driving decision-making problem as a CMDP addressed with the proposed MLSCPO method. Specifically, the state space incorporates scenario information from a defined area around the autonomous vehicle, avoiding the issues caused by overly complex state-space designs; the cost function captures the safety concerns encountered in highway scenarios, providing a basis for updating the policy under different safety conditions, thereby enhancing traffic safety and reducing potential collision risks; and the reward function is designed to ensure safety, efficiency, and comfort for autonomous vehicles in highway settings.
The MLSCPO method has been validated in typical highway scenarios. By quantifying the metrics of safety, efficiency, and comfort, the MLSCPO method was evaluated and compared with the Proximal Policy Optimization with Lagrangian (PPO-Lag) method and the Intelligent Driver Model with Minimization of Overall Braking Induced by Lane Changes (IDM+MOBIL) method. The comparison results indicate that the proposed approach exhibits superior performance, making smarter decisions in high-interaction scenarios, and demonstrating extensive adaptability and reliability in complex and dynamic situations.
The rest of this paper is organized as follows: Section II provides an overview of current work on autonomous driving decision-making algorithms and points out its limitations. Section III introduces the framework of MLSCPO and defines the highway autonomous driving problem. Section IV describes the environment interaction module of the MLSCPO algorithm. Section V presents the policy optimization module of the MLSCPO algorithm. Section VI analyzes and discusses the experimental results. Finally, Section VII summarizes the conclusions and proposes directions for future work.
Related Works
A. Decision Planning for Autonomous Driving Based on Rule-Based Methods
Rule-based decision-making and planning in autonomous driving traditionally relies on a repository of predefined rules for making decisions or processing information. This method has been extensively researched and implemented within the autonomous driving sector. The Finite State Machine (FSM) [4] is the most commonly utilized rule-based method, prized for its practicality and clear logic, which facilitates easy debugging and implementation. Autonomous vehicles such as MIT’s Talos [5], the Defense Science and Technology University’s Red Flag CA7460, Stanford University and Volkswagen’s Junior [6], Carnegie Mellon University and Ford’s Boss [7], Mercedes-Benz’s Bertha [8], and Virginia Tech’s Odin [9] all employ decision-making methodologies based on Finite State Machines.
Finite State Machine (FSM) in autonomous vehicles can be categorized into three structural types based on the transition relations set among their sub-states: serial, parallel, and hybrid configurations. In the serial configuration, state transitions are unidirectional and non-cyclical, with MIT’s Talos autonomous vehicle [5] being a prime example. This structure’s greatest advantage lies in its simple and clear logic, making it well-suited for straightforward driving tasks and environments. However, this structure’s limitations become apparent in complex scenarios where the static state transition path may struggle with unexpected variations in traffic flow, potentially leading to a collapse of the decision-making chain if any state processes incorrectly or is inadequate. In contrast, the parallel structure offers greater flexibility, allowing for multiple sub-states that can transition among each other, forming loops. Stanford and Volkswagen’s Junior [6] autonomous vehicle utilizes this decision model. This structure can adapt to various driving environments, yet the challenge lies in accurately categorizing diverse scenarios. As system complexity increases, the number of designed rules multiplies, complicating debugging efforts. The hybrid structure [9] combines elements of both serial and parallel configurations. Carnegie Mellon and Ford’s Boss [7] autonomous vehicle employs this structure, which hierarchically processes tasks, effectively enhancing the success rate of task execution and overall system stability.
Additionally, some studies have combined Finite State Machines with other algorithms or modules for autonomous driving decision-making. For instance, Okumura et al. [10] developed a method for autonomous vehicles merging into roundabouts, which generates high-dimensional decision commands through a classifier and then utilizes an FSM to create rules tailored to the specific scenarios of merging into roundabouts. Cui et al. [11] developed a tiered emergency collision avoidance system, where the decision-making component utilizes an FSM to determine specific anti-collision measures, integrated with a collision risk model that assesses the threat level from surrounding vehicles. Vanholme et al. [12] treated decision-making actions as discrete entities, associating each action with specific states, thus breaking down the autonomous vehicle’s decision tasks into processes of environmental state recognition and response. This method boasts excellent real-time capabilities, enabling autonomous vehicles to make decisions swiftly under given conditions. However, given the variety of complex and sudden situations that may arise during driving, relying solely on predefined rules and models is insufficient to cover all possible scenarios, especially in high-speed highway environments where the range and complexity of situations are particularly extensive. Solving such complex driving tasks through manually crafted rules is a highly challenging endeavor. Furthermore, encountering situations beyond the rule set may lead to severe traffic accidents.
B. Decision Planning for Autonomous Driving Based on Learning Methods
Learning-based methods excel at addressing edge-case scenarios, outperforming rule-based methods like Finite State Machine by identifying and learning deeper, more complex patterns from data. This reduces reliance on hard-coded rules and offers significant advantages in adaptability and flexibility. Furthermore, machine learning models can continuously improve their performance and accuracy by learning from new data, showcasing superior ongoing learning capabilities. This allows them to effectively handle larger-scale and more complex problems. Learning-based autonomous driving decision methods primarily employ deep RL, which leverages the powerful function approximation capabilities of deep learning and the decision-making prowess of RL. This allows agents to “trial and error” through continuous interaction with the environment, learning from feedback to derive optimal driving policy for accomplishing tasks.
Currently, numerous scholars are researching RL-based autonomous driving decisions. Early RL algorithms advocated using function approximation to derive optimal Q-values, thus guiding agents to optimize their policies. The DQN algorithm improves RL by predicting Q-values for actions but often overestimates them. Double DQN [13], [14] uses two networks for better accuracy but struggles with continuous actions. The Normalized Advantage Function (NAF) [15], [16] addresses this but still has limitations in complex environments. Researchers are now focusing on policy optimization techniques to refine policies [17] directly. Chen et al. [18] employed the DDPG method to craft adversarial policies for managing surrounding agents in lane-changing situations. In another study, policies for controlling multiple interacting vehicles were generated by Wachi [19] through multi-agent DDPG. To enhance multi-agent interaction predictions in highway scenarios, Radtke et al. [20] applied the PPO algorithm within the AIRL framework. Shi et al. [21] used distributed PPO to handle training dimensions and communication loads in mixed traffic control, focusing on improving efficiency and communication. Demonstrating a complex balance between rule-based methods and deep RL methods, Zhang et al. [22] developed a dual-layer system for lane-changing behavior planning with SAC, enhancing collaboration among autonomous vehicles and overall traffic safety and efficiency. A hybrid method proposed by Al-Sharman et al. [23] combines a high-level learning-based behavior planning layer, based on the SAC principle, with a low-level motion planning layer using Model Predictive Control (MPC). This approach enables the learning of safe, non-conservative driving behaviors while ensuring the feasibility of the generated trajectories. Zhang et al. [24] utilized a Lyapunov-based method to guarantee the safety of policies generated by the SAC algorithm. By incorporating a mathematical function to ensure policy stability, this method addresses the inherent uncertainties in autonomous driving, providing a robust safety mechanism. Managing individual surrounding vehicles in vehicle-following scenarios was tackled by Kuutti et al. [25] with the Advantage Actor-Critic (A2C) [26] framework. Zhou et al. [27] investigated the application of an improved A3C algorithm for autonomous driving lane-changing decisions, which significantly enhances sample efficiency, thereby boosting learning efficiency and performance in complex traffic conditions.
However, the aforementioned RL methods are driven by single reward metrics, focusing on maximizing overall rewards. This often results in a trade-off among sub-rewards such as safety, comfort, and efficiency, especially in extreme or edge-case traffic scenarios, where these policies struggle to maintain vehicle safety outside the distribution of training data [30]. In the domain of autonomous driving, safety is a fundamental requirement and should not be merely considered a trade-off against other performance metrics. To address this issue, this paper introduces a constraint-driven safety RL method. This method is applied to decision-making and planning policy in highway scenarios, aiming to maximize performance rewards while adhering to stringent safety constraints, thereby ensuring safe, efficient, and comfortable driving.
Problem Definition and Method Framework
A. Problem Definition
Highways are a critical setting for travel, presenting unique complexities and challenges for autonomous driving compared to urban roads. The scenario considered in this study, as illustrated in Figure 1, involves vehicles traveling on multi-lane highways. Here, the Ego Vehicle (EV) monitors the motion states of the Surrounding Vehicles (SVs) to make real-time driving decisions. The number of SVs is random, their speeds vary within a specified range, they display random driving behaviors, and they are capable of making reasonable lane changes. Under such complex conditions, whether the decision-making algorithm can successfully capture, recognize, and produce accurate decisions and plans is a significant challenge.
Framework of the proposed MLSCPO method, comprising the Environment Interaction (EI) and Policy Optimization (PO) modules.
This work abstracts the scenarios encountered in autonomous highway driving into four main types:
Smooth Traffic Scenario: The EV encounters few SVs, which scarcely impede its progress. In this scenario, the EV can adjust its speed or change lanes within the limits of traffic regulations, enjoying a high degree of freedom. Safety for the EV is almost unthreatened by external factors;
Overtaking Behavior Scenario: On a multi-lane highway, the lanes on both sides of the EV are occupied by SVs, and there is an obstructing vehicle ahead of the EV. The EV has only two choices: to slow down and follow or to overtake using the only lane without obstructions, placing the EV in a lower-risk condition;
Emergency Avoidance Scenario: The vehicle in front of the EV suddenly decelerates or brakes urgently, placing the EV in a moderate-risk condition. The EV must urgently change lanes to the left or right to avoid a collision;
High Traffic Density Scenario: The highway is heavily congested, with all lanes ahead of the EV occupied by vehicles. The EV is unable to overtake and must slow down and follow, waiting for an opportunity, placing it in a high-risk condition.
In the real world, highway traffic scenarios are typically a random combination of the above conditions, and the unpredictable behavior of SVs creates even more complex situations, increasing both uncertainty and complexity.
At each timestep, the EV can access the real-time state information \begin{equation*} S=[S_{EV}, S_{SV}, S_{RI}],\end{equation*} where $S_{EV}$ denotes the EV's own state, $S_{SV}$ the observed states of the SVs, and $S_{RI}$ the road information.
B. Method Framework
MLSCPO constructs three neural networks: a policy network, a cost value network, and a reward value network. Specifically, the policy network maps states to actions, the cost-based value network evaluates the expected cost of policy, and the reward-based value network assesses the expected rewards of policy. Together, these value networks guide policy updates, ensuring that rewards are maximized under the constraints imposed.
Figure 1 depicts the framework of the robust highway autonomous driving decision method based on MLSCPO, which is primarily divided into two parts: Environment Interaction (EI) and Policy Optimization (PO). The EI module explores the highway environment under the current policy, acquiring the EV's state-space information at each timestep, observing the motion information of SVs, and outputting control commands according to the current policy. This interaction yields the current reward and cost values, forming a CMDP-format trajectory for use in PO.
The PO component uses the trajectories formed by EI to optimize the parameters of the policy neural network and the reward and cost value networks, based on the MLSCPO method. Specifically, the value networks conduct safety risk assessments for policy adopted by the EV under the current scenario. Different policy updates are applied depending on the level of safety risk: for low-risk scenarios, updates aim to maximize rewards; for moderate-risk scenarios, dual optimization problems are solved to ensure that training aligns with safety constraints, allowing updates to progress toward reward optimization without compromising safety costs; in high-risk scenarios, natural gradient methods are employed to seek policy updates. Once optimized, the updated policy continues to be tested through EI. The EI and PO components work in tandem to facilitate continuous iterative improvements until the optimized policy converges to the desired performance level.
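To make the interplay between the two modules concrete, the following minimal Python sketch outlines one possible EI/PO training loop. The class and method names (HighwayEnv-style environment, Policy, PolicyOptimizer, and their methods) are hypothetical placeholders rather than the actual implementation, and the loop structure is a simplified reading of the description above.

# Minimal sketch of the EI/PO alternation; all classes and methods are
# hypothetical placeholders, not the authors' implementation.
def train(env, policy, optimizer, num_epochs, episodes_per_epoch):
    for epoch in range(num_epochs):
        trajectories = []
        # Environment Interaction (EI): roll out the current policy
        for _ in range(episodes_per_epoch):
            state, done, episode = env.reset(), False, []
            while not done:
                action = policy.act(state)                     # control command [a_s, delta_s]
                next_state, reward, cost, done = env.step(action)
                episode.append((state, action, reward, cost, next_state))
                state = next_state
            trajectories.append(episode)
        # Policy Optimization (PO): risk-aware constrained policy update
        risk_level = optimizer.assess_risk(trajectories)       # low / medium / high
        optimizer.update_policy(policy, trajectories, risk_level)
        optimizer.update_value_networks(trajectories)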
MLSCPO Environment Interaction Module
The autonomous driving policy must interact with the environment within the safe RL framework, formalized as a CMDP, to support policy updates. This section translates the decision-making and planning problem of the EV in highway scenarios into a CMDP, including the construction of the state space, action space, reward function, and cost function.
A. Constrained Markov Decision Process Introduction
CMDP is a common formulation in the field of safe RL, building on the foundation of the Markov Decision Process (MDP). A CMDP introduces a cost function to compute the expected cost of an agent performing action A in state S, thereby imposing constraints on the outcomes. It can be represented by the tuple \begin{equation*} CMDP=[S, A, P, R, C, \gamma, u_{0}],\end{equation*} where $S$ is the state space, $A$ the action space, $P$ the state transition probability, $R$ the reward function, $C$ the cost function, $\gamma$ the discount factor, and $u_{0}$ the initial state distribution.
At each timestep $t$, the agent observes the current state $s_{t}$, selects an action $a_{t}$ according to the policy $\pi$, and receives a reward. The objective is to maximize the expected discounted return \begin{align*} J(\pi)=\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}R(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {1}\end{align*}
Additionally, the model requires that while maximizing rewards, the policy must also satisfy safety constraints. Therefore, a cost constraint cap $b$ is defined, and under policy $\pi$ the expected discounted cost is \begin{align*} J_{C}(\pi)=\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}C(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {2}\end{align*}
The set of feasible policies should satisfy:\begin{equation*} \Pi _{C}=\left \{{{\pi \in \Pi {: } J_{C}(\pi)\leq b}}\right \} \tag {3}\end{equation*}
The optimal policy is then given by \begin{align*} \pi ^{*}=\arg\max _{\pi \in \Pi _{C}}\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}R(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {4}\end{align*}
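As a simple illustration of Equations (1)-(3), the discounted return and discounted cost of a policy can be estimated from sampled trajectories as in the sketch below; the trajectory layout (a list of per-step reward/cost pairs) is an assumption made for illustration only.

# Minimal sketch: Monte Carlo estimates of J(pi) (Eq. (1)) and J_C(pi) (Eq. (2)).
# Each trajectory is assumed to be a list of (reward, cost) pairs per timestep.
def discounted_sum(values, gamma):
    total, discount = 0.0, 1.0
    for v in values:
        total += discount * v
        discount *= gamma
    return total

def estimate_objectives(trajectories, gamma=0.99):
    returns = [discounted_sum([r for r, _ in traj], gamma) for traj in trajectories]
    costs = [discounted_sum([c for _, c in traj], gamma) for traj in trajectories]
    j_pi = sum(returns) / len(returns)      # estimate of J(pi)
    j_c_pi = sum(costs) / len(costs)        # estimate of J_C(pi)
    return j_pi, j_c_pi

# A policy is feasible in the sense of Eq. (3) when j_c_pi <= b for the chosen cost cap b.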
B. Convert Highway Autonomous Driving Decisions Into CMDP
Definition of State and Action Spaces: Designing an appropriate state space is essential for the effective training of RL algorithms. The accuracy and completeness of state information in highly complex and dynamic scenarios help the system output suitable actions and guide the policy to learn stably. In highway scenarios, the state space includes information about the EV, the status of SVs within a certain observational range, and the road information relevant to the current location of the EV:\begin{equation*} S=[S_{EV}, S_{SV}^{\prime }, S_{RI}]\end{equation*}
The EV’s state information
Under this framework, the driving control information for the EV is output through the action space, which primarily characterizes the control commands for the vehicle's lateral and longitudinal movements:\begin{equation*} A=[a_{s}, \delta _{s}], \quad \text {s.t.}\; a_{s}\in [a_{s_{min}}, a_{s_{max}}]\, \mathrm {m/s^{2}},\; \delta _{s}\in [\delta _{s_{min}}, \delta _{s_{max}}]\, \mathrm {rad}\end{equation*}
In this model, $a_{s}$ represents the commanded longitudinal acceleration of the EV, while $\delta _{s}$ represents the commanded steering angle. A PID controller of the standard form is used to track these commands:\begin{equation*} u(t)=K_{p}e(t)+K_{i}\int _{0}^{t}e(t)dt+K_{d}\frac {de(t)}{dt} \tag {5}\end{equation*} where $e(t)$ is the tracking error and $K_{p}$, $K_{i}$, $K_{d}$ are the proportional, integral, and derivative gains.
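Equation (5) is the standard PID control law; a minimal discrete-time sketch is given below. The gains and the 0.1 s timestep are illustrative values, not those tuned in the paper.

# Minimal discrete-time PID controller corresponding to Eq. (5).
# The gains K_p, K_i, K_d and the timestep are illustrative only.
class PIDController:
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: tracking a commanded longitudinal speed.
speed_pid = PIDController(kp=1.0, ki=0.1, kd=0.05)
target_speed, current_speed = 25.0, 22.0      # m/s, illustrative values
throttle_command = speed_pid.step(target_speed - current_speed)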
Cost Function Definition: In this study, the cost function considers both instantaneous and terminal costs. The instantaneous cost represents the cost incurred by the EV at each time step (in the current state) after selecting an action, helping the EV adjust its policy during training to adhere to safety constraints (such as avoiding collisions and traffic-rule violations) and guiding it to make relatively safe and effective decisions. The instantaneous cost has two components: a beyond-road-boundary cost $c_{boun}$ and a proximity distance violation cost $c_{dis}$; the terminal cost is the collision cost $c_{crash}$. The overall cost is a weighted combination of these terms:\begin{equation*} C=k_{1}\times c_{crash}+k_{2} \times c_{boun}+k_{3} \times c_{dis} \tag {6}\end{equation*}
The construction of the cost function encompasses the potential risks between the EV and the road, as well as between the EV and SVs, covering both potential and actual collision risks. Even in highly dynamic and uncertain environments, it can still guide the policy to avoid various potential risks. Each component is assigned a fixed risk penalty value, weighted by the coefficients $k_{1}$, $k_{2}$, and $k_{3}$ in Equation (6).
In highway scenarios, should the EV collide with SVs or road obstacles, it could lead to severe traffic accidents, posing significant threats to the safety of the driver and passengers. Consequently, the baseline cost value for such collisions is set high, and $c_{crash}$ is defined as \begin{align*} c_{crash}=\begin{cases}\displaystyle 15, & if\: crash=True \\ \displaystyle 0, & else\end{cases} \tag {7}\end{align*}
Similarly, driving beyond the highway boundaries also exposes drivers and passengers to unpredictable dangers. Therefore, this work defines a boundary cost $c_{boun}$ as \begin{align*} c_{boun}=\begin{cases}\displaystyle 1, & if\: beyond\ boundary \\ \displaystyle 0, & else\end{cases} \tag {8}\end{align*}
Maintaining a safe distance is essential for safe vehicle operation, particularly on highways where higher speeds necessitate greater safety margins. Consequently, a safety distance threshold $C_{r}$ is defined, and a cost is incurred whenever the distance $\Delta x_{i}$ to a surrounding vehicle falls below this threshold:\begin{align*} c_{dis}=\begin{cases}\displaystyle 1, & if\: \Delta x_{i}\lt C_{r} \\ \displaystyle 0, & else\end{cases} \tag {9}\end{align*}
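The cost terms of Equations (6)-(9) can be combined as in the following sketch. The weights k1-k3 and the threshold C_r are treated as configurable parameters with illustrative values, and interpreting Eq. (9) as triggering when any surrounding vehicle is closer than C_r is an assumption.

# Minimal sketch of the step cost, Eqs. (6)-(9); weights and threshold are illustrative.
def step_cost(crashed, beyond_boundary, gaps_to_svs, c_r=10.0, k1=1.0, k2=1.0, k3=1.0):
    c_crash = 15.0 if crashed else 0.0                               # Eq. (7)
    c_boun = 1.0 if beyond_boundary else 0.0                         # Eq. (8)
    c_dis = 1.0 if any(gap < c_r for gap in gaps_to_svs) else 0.0    # Eq. (9), assumed over all SVs
    return k1 * c_crash + k2 * c_boun + k3 * c_dis                   # Eq. (6)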
Reward Function Definition: The reward function plays a crucial role in guiding and evaluating the EV’s behavior. In the MLSCPO algorithm, to balance safety and performance, the reward function and safety constraints are set independently; when safety and performance cannot be satisfied simultaneously, policy updates are guided by the quantified risk level. The reward provides immediate feedback, steering the agent towards desired actions and optimizing its policy. A well-designed reward function not only encourages appropriate exploration and the discovery of promising new policies but also balances short-term and long-term benefits, preventing the agent from focusing excessively on immediate rewards at the expense of long-term objectives. In this study, the reward function considers driving efficiency, riding comfort, and other factors. Specifically, it includes a driving efficiency reward $r_{effi}$, a comfort reward $r_{comf}$, a driving task optimization reward $r_{pref}$, and an overtaking reward $r_{ovrt}$, combined with the cost $C$:\begin{align*} R=k_{4} \times r_{effi}+k_{5} \times r_{comf}+k_{6} \times r_{pref}+k_{7} \times r_{ovrt} - C \tag {10}\end{align*}
In this study, driving efficiency primarily refers to the ability of the EV to travel at the highest possible speed while staying within the road speed limits $[v_{min}, v_{max}]$. The efficiency reward is defined as \begin{align*} r_{effi}= \begin{cases} \displaystyle \frac {v_{e}-v_{min}}{v_{max}-v_{min}},& if\: v_{e}\in [v_{min}, v_{max}] \\ \displaystyle -0.5,& else\end{cases} \tag {11}\end{align*}
The comfort reward function primarily considers the experience of the passengers during the EV’s journey, characterized by the magnitude of the vehicle’s lateral acceleration, which is used as an indicator for evaluating this aspect. It is defined as follows:\begin{equation*} r_{comf}=1-\left |{{\frac {\mathit v_{e, y}^{t+\Delta t}-\mathit v_{e, y}^{t}}{4\Delta t}}}\right | \tag {12}\end{equation*}
The driving task optimization reward aims to ensure the safety of the EV while also incorporating greater flexibility for executing maneuvers such as lane changes and overtaking. The reward component $r_{pref}$ is defined as \begin{align*} r_{pref} & = \left ({{\frac {n_{rj}+1}{n_{rj}-1/\: n_{rj}\: /\: n_{rj}+1}}}\right) \\ & \quad + \left |{{\frac {y_{e}-y_{rj}}{\left ({{(n_{rj}-1/\: n_{rj}\ /\: n_{rj}+1)-(n_{rj}+1)}}\right)d_{rj}}}}\right | \tag {13}\end{align*}
The overtaking reward $r_{ovrt}$ is defined as \begin{align*} r_{ovrt}=\begin{cases}\displaystyle -0.8, & if\: x_{e}\gt x_{1} \\ \displaystyle 0.1,& else\end{cases} \tag {14}\end{align*}
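Analogously, the reward terms of Equations (10)-(12) and (14) can be assembled as in the sketch below. The lane-preference term r_pref of Equation (13) is omitted because its exact form is not fully recoverable here, x_1 is treated simply as a reference longitudinal position as written in Equation (14), and all weights are illustrative.

# Minimal sketch of the step reward, Eqs. (10)-(12) and (14); r_pref (Eq. (13)) is omitted.
def step_reward(v_e, v_lat_now, v_lat_prev, x_e, x_1, cost,
                v_min=17.0, v_max=33.0, dt=0.1, k4=1.0, k5=1.0, k7=1.0):
    # Driving efficiency reward, Eq. (11)
    r_effi = (v_e - v_min) / (v_max - v_min) if v_min <= v_e <= v_max else -0.5
    # Comfort reward from lateral acceleration, Eq. (12)
    r_comf = 1.0 - abs((v_lat_now - v_lat_prev) / (4.0 * dt))
    # Overtaking-related reward, Eq. (14), with x_1 taken as a reference position
    r_ovrt = -0.8 if x_e > x_1 else 0.1
    return k4 * r_effi + k5 * r_comf + k7 * r_ovrt - cost            # Eq. (10) without r_pref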
MLSCPO Policy Optimization Module
The PO component utilizes trajectories formed through EI to optimize the parameters of the policy neural network and the reward and cost value networks, based on the MLSCPO method. This section discusses how MLSCPO constructs trust and constraint domains using neural networks to assess safety risk levels for autonomous highway vehicles and implement policy optimization under various safety levels.
A. Policy Parameter Update
In large or continuous CMDPs, the dimensionality problem makes it infeasible to find an exact optimal solution as described by Equation (4). Therefore, this work introduces a parameterized policy $\pi _{\theta }$ with parameters $\theta$ and searches within the parameterized policy class $\Pi _{\theta }$.
The policy update can be reformulated as:\begin{align*} \pi _{k+1}& =\arg\max _{\pi \in \Pi _{\theta }}J(\pi) \\ & \quad ~\text {s.t.}\; J_{\mathit {C}}(\pi)\leq \mathit {b} \\ & \qquad \mathit {D}(\pi, \pi _{k})\leq \delta \tag {15}\end{align*} where $D(\pi, \pi _{k})$ denotes a distance measure (here the KL divergence) between the candidate policy and the current policy, and $\delta$ is the trust-domain step size.
However, this update is difficult to implement directly, because it requires evaluating whether each candidate policy satisfies the cost constraint, which cannot be checked without collecting new samples under that policy. The update is therefore approximated using value and advantage functions estimated under the current policy.
Let $V^{\pi }$ and $Q^{\pi }$ denote the state-value and action-value functions under policy $\pi$:\begin{align*} V^{\pi }(s)& \equiv \mathbb {E}_{\tau \sim \pi }(R(\tau)|s_{0}=s) \tag {16}\\ Q^{\pi }(s, a)& \equiv \mathbb {E}_{\tau \sim \pi }(R(\tau)|s_{0}=s, a_{0}=a) \tag {17}\end{align*}
The advantage function is defined as:\begin{equation*} A^{\pi }(s, a)=Q^{\pi }(s, a)-V^{\pi }(s) \tag {18}\end{equation*}
The advantage function offers a more precise evaluation method by measuring the benefit of specific actions in given states, compared to solely using the value function. It reduces variance and randomness in policy updates, providing a clear gradient signal for parameter adjustments. This results in more effective updates, reducing sample requirements and computational complexity, and accelerating convergence to the optimal policy. Generalized Advantage Estimation (GAE) balances accuracy and variance of cost measurements for stable learning updates. This study employs GAE to estimate the cost advantage function, defined as follows:\begin{align*} A_{\mathit {C}}^{\pi }(s, a) & = \sum _{k=0}^{T} (\gamma \xi)^{k} \big (-C(s_{t+k}, a_{t+k}) \\ & \quad + \gamma V_{C}^{\pi }(s_{t+k+1}) - V_{C}^{\pi }(s_{t+k}) \big) \tag {19}\end{align*} where $T$ is the horizon and $\xi \in [0, 1]$ is the GAE coefficient.
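The cost-advantage estimate of Equation (19) can be computed backwards over a trajectory, as in the following sketch; the sign convention follows Equation (19) as printed, and the cost value estimates are assumed to include the value of the state after the final step.

import numpy as np

# Minimal sketch of GAE for the cost advantage, following Eq. (19) as printed.
# costs: per-step costs c_t; v_c: cost value estimates with len(costs) + 1 entries.
def cost_gae(costs, v_c, gamma=0.99, xi=0.95):
    T = len(costs)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = -costs[t] + gamma * v_c[t + 1] - v_c[t]   # per-step residual
        running = delta + gamma * xi * running
        advantages[t] = running
    return advantages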
The advantage function allows us to express more succinctly the difference in returns before and after a policy update:\begin{equation*} J(\pi ')-J(\pi)=\frac {1}{1-\gamma }\mathbb {E}_{\tau \sim \pi '}[A^{\pi }(s, a)] \tag {20}\end{equation*}
The policy update can therefore be reformulated as:\begin{align*} \pi _{k+1}& =\arg\max _{\pi \in \Pi _{\theta }}\: \mathbb {E}_{\tau \sim \pi _{k}}\left [{{A^{\pi _{k}}(s, a)}}\right ] \\ & \quad ~\text {s.t.}\: J_{\mathit {C}}(\pi _{k})+\frac {1}{1-\gamma }\mathbb {E}_{\tau \sim \pi _{k}}[A_{\mathit {C}}^{\pi _{k}}(s, a)]\leq \mathit {b} \\ & \qquad \overline {D}_{KL}(\pi \|\pi _{k})\leq \delta \tag {21}\end{align*}
Linearizing the objective and the cost constraint around the current parameters $\theta _{k}$ and approximating the KL-divergence constraint by its second-order expansion yields the following approximate update problem:\begin{align*} \theta _{k+1}& =\arg\max _{\theta }\: r^{T}(\theta -\theta _{k}) \\ & \quad ~\text {s.t.}\quad \mathit {z} + \mathit {e}^{T}(\theta -\theta _{k})\leq 0 \\ & \hphantom {\quad ~\text {s.t.}\quad }\frac {1}{2}(\theta -\theta _{k})^{T}\mathbf {H}(\theta -\theta _{k})\leq \delta \tag {22}\end{align*} where $r$ and $e$ are the gradients of the reward and cost objectives, $\mathbf {H}$ is the Hessian of the KL divergence (the Fisher information matrix), and $z$ measures the current constraint violation.
Since the matrix $\mathbf {H}$ is positive semi-definite, problem (22) is convex and can be solved through its dual:\begin{align*} \max _{\substack {\lambda \geq 0 \\ v \gt 0}}\frac {-1}{2\lambda }\Big (r^{T}\mathbf {H}^{-1}r-2g^{T}v+v^{T}Sv\Big)+v^{T}z-\frac {\lambda \delta }{2} \tag {23}\end{align*} where $\lambda$ and $v$ are the dual variables, $g=r^{T}\mathbf {H}^{-1}e$, and $S=e^{T}\mathbf {H}^{-1}e$.
The optimal primal solution is then recovered from the optimal dual variables $\lambda ^{*}$ and $v^{*}$ as:\begin{equation*} \theta ^{*}=\theta _{k}+\frac {1}{\lambda ^{*}}\mathbf {H}^{-1}(r-ev^{*}) \tag {24}\end{equation*}
The gradient of the reward value function $r$ is estimated from the sampled trajectories as:\begin{align*} r& =\nabla _{\theta } J\left ({{\pi }}\right) \\ & =\frac {1}{N_{tr}}\sum _{j=1}^{N_{tr}}\sum _{t=1}^{N_{t}}\nabla _{\theta }\mathbb {E}_{\tau _{\pi _{k}}\sim \pi _{\theta }}[\log \pi _{\theta }(a_{t}|s_{t})A^{\pi }(s_{t}, a_{t})] \tag {25}\end{align*} where $N_{tr}$ is the number of sampled trajectories and $N_{t}$ the number of timesteps per trajectory.
The gradient of the cost value function $e$ is defined as follows:\begin{align*} e& =\nabla _{\theta } J_{C}(\pi) \\ & =\frac {1}{N_{tr}}\sum _{j=1}^{N_{tr}}\sum _{t=1}^{N_{t}}\nabla _{\theta }\mathbb {E}_{\tau _{\pi _{k}}\sim \pi _{\theta }}[\log \pi _{\theta }(a_{t}|s_{t})\: A_{C}^{\pi }(s_{t}, a_{t})] \tag {26}\end{align*}
The safety level judgment factors $z$ and $D$ are obtained from the cost value function gradient and are defined as follows:\begin{align*} \mathit {z}& =\mathbb {E}\left [{{\sum _{t=0}^{N_{t}}\gamma ^{t}C(s_{t}, a_{t}, s_{t+1})}}\right ]-\frac {b}{1-\gamma } \tag {27}\\ D& =\delta -\frac {z^{2}}{\mathit {e}^{T}\mathbf {H}^{-1}\mathit {e}} \tag {28}\end{align*}
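Given the cost gradient e, the matrix H, the accumulated discounted cost, and the cost cap b, the judgment factors of Equations (27) and (28) can be computed as below. H is inverted via a direct solve for clarity; in practice a conjugate-gradient solver would typically be substituted, and this sketch is illustrative rather than the authors' code.

import numpy as np

# Minimal sketch of the safety-level judgment factors z and D, Eqs. (27)-(28).
def judgment_factors(discounted_cost, b, gamma, e, H, delta):
    z = discounted_cost - b / (1.0 - gamma)        # Eq. (27)
    h_inv_e = np.linalg.solve(H, e)                # H^{-1} e (direct solve for clarity)
    d = delta - (z ** 2) / float(e @ h_inv_e)      # Eq. (28)
    return z, d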
Based on these quantities, the trust domain and the constraint domain are defined as:\begin{equation*} \text {Trust domain}=\{\pi _{\theta }\in \Pi _{\theta }{: }\overline {D}_{KL}(\pi \|\pi _{k})\leq \delta \}\end{equation*}
\begin{align*} \text {Constrained domain} = \Big \{ \pi _{\theta }\in \Pi _{\theta }: J_{C}(\pi _{k}) + \frac {1}{1-\gamma } \mathbb {E}_{\tau \sim \pi _{k}}[A_{C}^{\pi _{k}}(s, a)] \leq b \Big \}\end{align*}
The MLSCPO algorithm categorizes the possible risk situations for agent policy updates based on the relative positions of the trust domain and the constraint domain as follows:
Low-Risk Safety Condition: When the norm of the cost value function gradient $\|e\|$ is extremely small (less than $1 \times 10^{-8}$) or the safety level judgment factors satisfy $z \lt 0$ and $D \lt 0$, it indicates that the constraint domain is entirely within the trust domain, as shown in Figure 3(a). The updated policy is highly likely to remain within both the constraint and trust domains, implying that the current policy almost never violates the constraints.
Moderate-Risk Safety Condition: When the safety level judgment factor $D \gt 0$, it suggests that although there is an intersection between the constraint domain and the trust domain, as shown in Figure 3(b), the policy that maximizes the reward might fall outside the constraint domain, posing a certain risk of dangerous situations.
High-Risk Safety Condition: When the safety level judgment factors satisfy $z \gt 0$ and $D \lt 0$, it indicates that there is no intersection between the constraint domain and the trust domain, as shown in Figure 3(c). The updated policy will definitely not satisfy the constraint domain, meaning that no matter how the policy is updated, the current dangerous situation cannot be reversed.
Schematic diagram of the relative positions of the trust domain and the constraint domain.
The MLSCPO algorithm performs corresponding policy updates according to different risk situations:
When the evaluation results indicate that the EV is operating under a low-risk safety condition with the current policy, the trust domain method is used for policy updates to maximize the expected reward function value.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}+\sqrt {\frac {2\delta }{\mathit {r}^{T}\mathbf {H}^{-1}\mathit {r}}}\mathbf {H}^{-1}r \tag {29}\end{equation*}
When the evaluation results indicate that the EV is operating under a moderate-risk safety condition with the current policy, it means that the policy update may violate safety constraints. Therefore, it is necessary to solve the dual problem to find a policy that both satisfies the safety constraints and maximizes the expected reward function.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}+\frac {1}{\lambda ^{*}}\mathbf {H}^{-1}(r-ev^{*}) \tag {30}\end{equation*}
\begin{align*} v^{*}& =\left ({{\frac {\lambda ^{*}z-g}{S}}}\right)_{+} \tag {31}\\ \lambda ^{*} & = \arg \max _{\lambda \geq 0} \begin{cases} \displaystyle f_{a}(\lambda) \triangleq \frac {1}{2\lambda } \left ({{ \frac {g^{2}}{S} {-} q }}\right) \\ \displaystyle \quad + \frac {\lambda }{2} \left ({{ \frac {z^{2}}{S} - \delta }}\right) - \frac {gz}{S} & \text {if}~ \lambda z {-} g \gt 0 \\ \displaystyle f_{b}(\lambda) \triangleq -\frac {1}{2} \left ({{ \frac {q}{\lambda } + \lambda \delta }}\right) & \text {else} \end{cases} \tag {32}\end{align*}
After obtaining the optimal values $\lambda ^{*}$ and $v^{*}$, the policy update direction is computed as:\begin{equation*} \omega _{k}=\mathbf {H}^{-1}(\mathit {r}-\mathit {e}v^{*}) \tag {33}\end{equation*}
After this, the current policy can be updated using the following equation:\begin{equation*} \theta _{k+1}=\theta _{k}+\frac {\alpha }{\lambda ^{*}}\omega _{k} \tag {34}\end{equation*}
When the evaluation results indicate that the EV is operating under a high-risk safety condition with the current policy, it means that the policy update cannot ensure the set safety constraints, posing a very high safety risk for the EV. In this case, the policy is updated using the natural gradient method.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}-\sqrt {\frac {2\delta }{\mathit {e}^{T}\mathbf {H}^{-1}\mathit {e}}}\mathbf {H}^{-1}\mathit {e} \tag {35}\end{equation*}
In this way, risk assessment for the highway EV is achieved, and safety levels are categorized based on the evaluation results. A different policy update method is selected for each safety level, ensuring that policy updates maximize rewards while maintaining safety.
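Bringing the three branches together, the following sketch shows how an update could be selected from the quantities defined above. It is an illustrative reading of Equations (29)-(35): the dual solve for the moderate-risk branch is abstracted into a placeholder function solve_dual, and the thresholds mirror the risk categorization described earlier.

import numpy as np

# Illustrative selection of the MLSCPO parameter update, Eqs. (29), (30)-(34), (35).
# solve_dual is a placeholder assumed to return the optimal (lambda*, v*) of Eq. (32).
def mlscpo_update(theta, r, e, H, z, d, delta, solve_dual, alpha=0.8, eps=1e-8):
    h_inv_r = np.linalg.solve(H, r)
    h_inv_e = np.linalg.solve(H, e)
    if np.linalg.norm(e) < eps or (z < 0 and d < 0):
        # Low risk: reward-maximizing natural-gradient step, Eq. (29)
        return theta + np.sqrt(2.0 * delta / float(r @ h_inv_r)) * h_inv_r
    if z > 0 and d < 0:
        # High risk: recovery step that purely reduces the expected cost, Eq. (35)
        return theta - np.sqrt(2.0 * delta / float(e @ h_inv_e)) * h_inv_e
    # Moderate risk: solve the dual problem and step along H^{-1}(r - e v*), Eqs. (30)-(34)
    lam_star, v_star = solve_dual(r, e, H, z, delta)
    omega = np.linalg.solve(H, r - e * v_star)     # Eq. (33)
    return theta + (alpha / lam_star) * omega      # Eq. (34)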
B. Value Network Parameter Update
Network updates are a critical part of the optimization process. Value networks are typically used to estimate the expected return in specific states, which is crucial for the efficient learning and decision-making of the policy network, and the accuracy of the value networks directly impacts the decision quality of the policy network. When updating the value networks, the EI module interacts with the environment to collect state information and generate trajectories.
The parameters of the value networks are updated by minimizing the difference between the predicted value and the actual discounted return, using a squared loss to minimize the prediction error. For each policy of the EV, a predicted benchmark value is computed from the reward function and related rules; this prediction represents the output of the value network under a relatively well-performing policy. In actual operation, however, the policy adopted by the vehicle is not guaranteed to be near-optimal, and the actual return values fluctuate continuously. Minimizing the squared prediction error drives the actual returns obtained by the vehicle's policy towards, or beyond, the values predicted under the better policy, thereby continuously optimizing the driving policy.
The update rule for the reward-based value network parameters is:\begin{equation*} \varphi _{r}=\arg\min _{\varphi }\mathbb {E}\left [{{\left ({{V_{\varphi _{r}}(s_{t})-R_{t}}}\right)^{2}}}\right ] \tag {36}\end{equation*}
The update rule for the cost-based value network parameters is:\begin{equation*} \varphi _{e}=\arg\min _{\varphi }\mathbb {E}\left [{{\left ({{V_{\varphi _{e}}(s_{t})-C_{t}}}\right)^{2}}}\right ] \tag {37}\end{equation*}
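Since the PO module is implemented with PyTorch, the squared-error updates of Equations (36) and (37) amount to a standard regression step, sketched below. The network, optimizer, and target construction are assumed to exist; the targets are the discounted returns R_t for the reward value network or the accumulated costs C_t for the cost value network.

import torch

# Minimal sketch of one value-network regression step, Eqs. (36)-(37).
# value_net, optimizer, states (tensor of s_t), and targets (R_t or C_t) are assumed given.
def update_value_network(value_net, optimizer, states, targets):
    predictions = value_net(states).squeeze(-1)        # V_phi(s_t)
    loss = torch.mean((predictions - targets) ** 2)    # squared prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()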
C. MLSCPO Algorithm
This section provides the pseudocode for the MLSCPO algorithm based on the discussion above. The input to MLSCPO is the state information S obtained from the interaction of the EV with the environment, and the output is the control information for the EV after policy optimization, corresponding to its action space A. The pseudocode for MLSCPO is shown in Algorithm 1:
Algorithm 1 MLSCPO
Input: global state S obtained from the environment
Output: EV action A
Initialize: policy network parameters θ, reward value network parameters φ_r, cost value network parameters φ_e
for epoch n = 1, 2, ..., N do
  for step k = 1, 2, ..., K do
    for episode t = 1, 2, ..., N_t do
      EI interacts with the environment to collect state s_t
      PO executes action a_t ~ π_θ(a_t | s_t) and receives reward r_t and cost c_t
      Collect trajectory tuple (s_t, a_t, r_t, c_t, s_{t+1})
    end for
    Collect global trajectory τ_k
  end for
  Update policy: estimate the advantage functions A^π and A_C^π with GAE
  Calculate r, e, H, z, and D by Equations (25)-(28)
  Safety risk level assessment by ||e||, z, and D
  if Low Risk then
    update policy network as in Equation (29)
  else if Middle Risk then
    update policy network as in Equations (30)-(34)
  else if High Risk then
    update policy network as in Equation (35)
  end if
  Update value network parameters φ_r and φ_e by Equations (36) and (37)
  Provide action A through the updated policy π_θ
end for
The MLSCPO algorithm is an iterative process. Lines 4-6 in the pseudocode represent the data collection phase: in each episode, the EV's state information is collected, actions are executed according to the current policy, and the resulting rewards and costs are recorded to form trajectories for policy optimization.
Experiment
This section describes the experiments conducted using the proposed MLSCPO algorithm on the CARLA simulation platform. First, the model parameters and other settings used in the experiments are detailed. Then, the training simulations and test results are analyzed. Finally, the proposed method is compared with the PPO-Lag method and the IDM+MOBIL method under the same scenario settings to validate the performance of the proposed method in terms of safety, comfort, and driving efficiency of the EV.
A. Experimental Setup
The version of the CARLA simulation platform used in this work is 0.9.12. The MLSCPO RL model is built on the PyTorch framework, running on an NVIDIA GeForce RTX 3070 Ti GPU with Ubuntu 20.04 as the operating system. In the CARLA simulation platform, a one-way multi-lane highway training scenario is constructed, with each lane 3.5 meters wide. During the experiments, EI is implemented using the built-in sensors of the CARLA simulator and the Python API to simulate sensors deployed on the EV. This captures real-time interaction data between the EV and the scene and controls vehicle behavior by converting the actions performed by the EV (such as speed, steering angle, and braking) into data signals, which guide the lateral and longitudinal movements of the EV. This process is implemented through three neural networks in the PO module: two value networks and one policy network. The value networks assess the level of safety risk, and based on the assessment results, the policy network carries out the corresponding policy updates.
To ensure the simulation of real-time driving in realistic scenarios, the time step is set to 0.1 seconds, meaning the interval from when the EV captures the current scene and vehicle information to when it controls the lateral and longitudinal movements is 0.1 seconds. SVs can perform random but reasonable acceleration, deceleration, and lane-changing behaviors. The initial speed of SVs is set to 25 m/s, and the expected speed range for the EV is defined as 17-33 m/s. This speed range ensures that the EV can perform car-following and overtaking behaviors. At the end of each cycle, the speed, position, yaw angle, and other information of the autonomous vehicle at the termination moment are recorded. These data are used as the initial state inputs for the next cycle, ensuring the continuity of the training process and the accuracy of the simulation.
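For reference, the scenario settings stated above can be summarized in a configuration of the following form; the key names are hypothetical, and only the numeric values quoted in the text are taken from the setup.

# Illustrative summary of the simulation settings described above; key names are hypothetical.
SIM_CONFIG = {
    "carla_version": "0.9.12",
    "timestep_s": 0.1,                     # interval between observation and control
    "lane_width_m": 3.5,                   # one-way multi-lane highway
    "sv_initial_speed_mps": 25.0,          # surrounding vehicles
    "ev_speed_range_mps": (17.0, 33.0),    # expected EV speed range
}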
The PO module is mainly composed of three neural networks: the reward value neural network, the cost value neural network, and the policy neural network. Assuming there are N vehicles, the structure of the policy and value neural networks is defined as
To evaluate the performance of the MLSCPO algorithm, this study conducted two sets of experiments. The first set analyzed the training process of MLSCPO against the current advanced safe reinforcement learning algorithm PPO-Lag, while the second set compared the performance of the trained policies of MLSCPO, PPO-Lag, and IDM+MOBIL. In the first set of training experiments, the relative positions, speeds, and driving behaviors of the EV and SVs were set with high randomness to test and validate the adaptability and effectiveness of the MLSCPO and PPO-Lag algorithms in multi-level risk assessment and constrained policy optimization across various complex traffic environments. In the second set of experiments, the highway autonomous driving decision-making policy trained by the MLSCPO algorithm was compared with those of PPO-Lag and the traditional stable algorithm IDM+MOBIL in the same scenarios. The experiments included test scenarios of varying difficulty, comprising the regular scenario in Figure 4(a) and the difficult scenario in Figure 4(b). The two scenarios differ in the real-time traffic density around the autonomous vehicle and the randomness of other vehicles during training. Both scenarios are random combinations of different traffic-density conditions, car-following conditions, overtaking conditions, and even emergency conditions. Additionally, they include out-of-distribution scenarios to fully validate the algorithm’s adaptability. This study compared the algorithms in three aspects: driving efficiency, driving comfort, and driving safety.
B. Training Process and Result
In this experiment, the proposed MLSCPO algorithm and the PPO-Lag algorithm were trained in highly uncertain training scenarios. By analyzing the training process data, the curves of the total reward and total cost for both methods were obtained.
Figure 5(a) illustrates the total reward values per epoch for the MLSCPO and PPO-Lag algorithms. The solid line represents the total reward value, averaged from the immediate rewards in each epoch, while the shaded area indicates the standard deviation. The curves reveal that after convergence, the MLSCPO algorithm, with the same reward function parameter settings, achieves higher reward values than the PPO-Lag algorithm. Additionally, MLSCPO converges faster. The reasons for this difference are that, on one hand, MLSCPO separates safety and performance, and after satisfying safety constraints, policy updates focus on performance optimization. In contrast, PPO-Lag struggles to balance safety and performance, resulting in lower total reward values after convergence. On the other hand, although both algorithms consider safety constraints, the MLSCPO method employs a multi-level safety risk assessment using a cost value function, selecting different policy update methods based on various safety assessment outcomes. Meanwhile, PPO-Lag only uses a single Lagrange multiplier to guide policy updates under safety constraints. The diverse policy update methods in MLSCPO are more comprehensive and advanced, significantly improving learning efficiency and thus leading to faster convergence compared to PPO-Lag. Furthermore, as indicated by the standard deviation, the PPO-Lag algorithm shows greater fluctuations in reward values, while the MLSCPO algorithm maintains rewards at a high level with minor fluctuations after convergence, indicating better stability.
Figure 5(b) shows the cost of the MLSCPO algorithm and the PPO-Lag algorithm over 1400 epochs of training, with the solid line representing the total cost, averaged from the costs in each epoch, and the shaded area representing the standard deviation. During training, the total cost reflects the level of safety assurance; when the total cost in an epoch is below the threshold, it can be considered that the EV has almost no collision risk and high safety. The curves indicate that the training performance of the MLSCPO algorithm is superior to that of the PPO-Lag algorithm. Specifically, the total cost of the MLSCPO method shows an overall downward trend, dropping below the benchmark value of 15 after approximately 200 policy iterations and continuing to decrease until it approaches zero. In contrast, the total cost of the PPO-Lag algorithm fluctuates greatly and does not show a stable downward trend. This demonstrates that the MLSCPO method has high stability in ensuring driving safety and can achieve a zero collision rate. The outstanding performance of MLSCPO is due to its multi-level assessment approach for policy updates. The assessment results are categorized into three safety levels: low, medium, and high, with different policy update methods applied accordingly. The outcomes of these policy updates ensure that parameters are optimized to maximize rewards within the feasible domain of the constraints, meaning that they prioritize meeting safety constraints while enhancing the overall performance of the EV.
C. Performance Comparison Between MLSCPO and IDM+MOBIL
To further confirm the control performance of the proposed MLSCPO algorithm for EVs in highway scenarios, the updated parameters of MLSCPO are used to conduct comparative tests with the PPO-Lag algorithm and the IDM+MOBIL algorithm in the same scenario. We set multiple parameters to quantify the safety, efficiency, and comfort of autonomous vehicles. For the safety performance indicators, we use the success rate and the average front-vehicle distance as metrics. The success rate effectively reflects the probability of collisions or other hazardous behaviors under the adopted methods; the average front-vehicle distance, as another safety indicator, suggests that a greater distance allows more time and space to handle emergencies, enhancing safety. For the efficiency indicators, we quantify these through the average speed and the speed standard deviation, where the speed standard deviation effectively reflects the stability of the driving speed. Regarding the comfort performance indicators, we use the average steering wheel angle, the average lateral acceleration of the vehicle, and its standard deviation as metrics. A smaller average steering wheel angle indicates smoother driving with less severe directional changes; a lower average lateral acceleration means that passengers experience less lateral force during turns, significantly improving the comfort of the ride. On the CARLA simulation platform, we use the Python API to obtain sensor data and collect this state information in real time.
Regarding the safety performance metric, data from Table 2 shows that the policies updated using the three methods achieved a success rate of over 90% in regular, difficult, and out-of-training-distribution scenarios. The policies updated using the MLSCPO method and IDM+MOBIL even achieved a 100% success rate, demonstrating the broad adaptability of the MLSCPO method. Additionally, the parameter of average distance to the car in front indicates that the MLSCPO method maintained the greatest safety distance in any scenario. This provided ample time and space for the next action, significantly ensuring the safety and reliability of the driving process.
Regarding efficiency performance metrics, the data in Table 2 shows that, whether in regular or challenging scenarios, the policies updated using the MLSCPO method achieve the highest average speed. Although the average speed of all three methods decreases in challenging scenarios, the MLSCPO method shows the smallest reduction in average speed. This demonstrates that, in uncertain scenarios, the MLSCPO method maintains the highest driving efficiency, as indicated by the average speed parameter. Additionally, the speed standard deviation parameter reflects the stability of vehicle speed across different methods. The data shows that, in various scenarios, the MLSCPO method has the smallest speed standard deviation compared to the PPO-Lag and IDM+MOBIL methods, indicating that the MLSCPO method not only ensures efficient driving but also provides the best stability.
Regarding comfort performance metrics, we quantified the parameters using the average steering wheel angle, average lateral acceleration, and acceleration standard deviation, as shown in Table 2. The data indicates that although the average steering wheel angle using the MLSCPO method is not always the smallest, the average steering wheel angle for all three methods (IDM+MOBIL, PPO-Lag and MLSCPO) is less than 0.01 rad in both regular and challenging scenarios. This suggests that the steering wheel control is very smooth for policies updated by all three methods during travel. Additionally, the average lateral acceleration for all three methods is below 0.1, meaning that there is minimal force during vehicle turns, ensuring a high level of passenger comfort. Furthermore, the acceleration standard deviation under the MLSCPO method is smaller compared to the IDM+MOBIL and PPO-Lag methods. This indicates that with the MLSCPO method, the EV avoids sudden accelerations or decelerations, resulting in a very comfortable driving experience.
Figure 6 presents the policies obtained using MLSCPO, PPO-Lag, and IDM+MOBIL algorithms, which were applied in two different difficulty highway test scenarios shown in Figure 4 for comparative simulation testing. The results include statistical data on the speed information and steering angle information of the EV under the control of the three algorithms (positive values indicate right turns, negative values indicate left turns, higher values indicate larger angles, and a value of 0 indicates the steering wheel is centered).
The speed and steering angle curves of the MLSCPO, PPO-Lag, and IDM+MOBIL algorithms in different test scenarios.
Figures 6(a) and 6(b) respectively correspond to the EV speed variation and steering wheel angle variation curves obtained from 200 time steps in the conventional test scenario shown in Figure 4, using the MLSCPO algorithm, PPO-Lag algorithm, and IDM+MOBIL algorithm. In conventional scenarios with smooth traffic and larger vehicle spacing, the EV has a high degree of freedom, with ample space for lane changes and acceleration. As seen in the curves in Figure 6(a), under sparse traffic conditions, vehicles controlled by all three algorithms can achieve high speed. However, their speed stability varies when encountering obstacles. Compared to the other two methods, the MLSCPO algorithm maintains a higher speed while avoiding obstacles. Around the 60th time step, the IDM+MOBIL algorithm opts for a policy of decelerating and following when facing an obstacle ahead of the EV, continuing this for approximately 60 time steps before overtaking, which greatly reduces driving efficiency. In contrast, the PPO-Lag algorithm, like the MLSCPO method, quickly makes an overtaking maneuver, ensuring driving efficiency, though its speed stability is not as good as that of the EV under the MLSCPO control policy. Additionally, as seen from the curve changes in Figure 6(b), during the lane change maneuver of the EV, the MLSCPO method achieves the smallest peak steering wheel angle, ensuring a comfortable driving experience. By contrast, the steering wheel angle under the IDM+MOBIL control shows abrupt changes, which is very dangerous at high speeds and reduces passenger comfort. Compared to the PPO-Lag method, the EV under the MLSCPO control exhibits smaller fluctuations in steering wheel angle changes, resulting in better driving stability.
Figures 6(c) and 6(d) correspond to the EV speed variation and steering wheel angle variation curves obtained over 400 time steps in the challenging test scenarios shown in Figure 4, using the MLSCPO algorithm, PPO-Lag algorithm, and IDM+MOBIL algorithm. In challenging scenarios, traffic flow is dense, restricting the EVs’ freedom compared to conventional scenarios. They are constrained by SVs and sometimes must decelerate and follow other vehicles. Maintaining high driving efficiency in such scenarios is a significant challenge for decision-making algorithms. As shown in the curves in Figure 6(c), even in dense traffic, the MLSCPO method demonstrates higher decision-making intelligence compared to the PPO-Lag and IDM+MOBIL algorithms, maintaining a higher speed while ensuring safety. During the sampling period, vehicles controlled by the MLSCPO algorithm exhibited significantly less speed fluctuation than those controlled by the other two methods. Even when the vehicles controlled by the MLSCPO algorithm were forced to decelerate and follow, their average speed remained higher than those controlled by the PPO-Lag and IDM+MOBIL algorithms, ensuring higher driving efficiency. According to the curves in Figure 6(d), it can also be observed that the EVs under the control of the MLSCPO algorithm performed two lane change maneuvers, while the vehicles controlled by the PPO-Lag and IDM+MOBIL methods only performed one lane change maneuver before choosing to follow the vehicle ahead. This further demonstrates the greater flexibility of the MLSCPO algorithm, maintaining driving efficiency even in challenging scenarios. Additionally, during lane change maneuvers, the EVs controlled by the MLSCPO algorithm maintained the lowest peak steering wheel angle, ensuring a comfortable driving experience while achieving efficient driving.
In summary, by comparing the vehicle performance indicators after updating the policy control using three methods, and by sampling data in both conventional and challenging test scenarios, we can see that the MLSCPO algorithm demonstrates higher decision-making intelligence compared to the PPO-Lag algorithm and the IDM+MOBIL method. It guides the vehicle to perform more efficiently while ensuring safety. When encountering obstacles, the vehicle quickly decides to change lanes when conditions allow, fully ensuring driving efficiency. Furthermore, during lane changes, the vehicle does not experience significant steering angle changes, thereby ensuring ride comfort and driving safety.
Moreover, by decoupling the state, reward, and cost value functions from the environment, the MLSCPO algorithm ensures robust training performance even in highly uncertain scenarios. It effectively reduces risk in interactive situations, demonstrates adaptive decision-making during overtaking maneuvers, and maintains efficient driving capabilities under various traffic densities. The adaptability and reliability of the MLSCPO method enable it to extend to other highway scenarios, such as bidirectional multi-lane highways and highways with varying road widths, and it also has the potential for application in urban roads and other complex traffic environments. Before deploying the algorithm in new environments, it may be necessary to fine-tune the model’s hyperparameters, such as adjusting the coefficients of the reward and cost functions, the size of the vehicle monitoring area, and other specific environmental parameters. For more complex scenarios, additional state variables (e.g., driving direction) can be introduced to provide a more comprehensive description of traffic conditions, or the range of action options (e.g., driving routes) can be expanded to achieve more sophisticated control.
Conclusion
This paper proposes a constraint-driven safe RL method and applies it to learning autonomous driving decision-making policies in highway scenarios. The framework introduces the concept of a constraint domain based on the trust domain. It formulates the autonomous driving decision-making problem as a CMDP within a safe RL framework, ensuring that the control policy of the EV maximizes rewards while satisfying safety constraints. The study further proposes the MLSCPO method, which introduces a cost function to account for safety constraints, allowing different control policy updates under various safety conditions. Through this policy update scheme, MLSCPO decouples the state from specific scenarios, significantly enhancing adaptability and reliability when facing complex and changing scenarios, even those outside the training distribution. Simulation tests on the CARLA platform demonstrate that the proposed method outperforms the advanced safe reinforcement learning method PPO-Lag and the classic stable IDM+MOBIL algorithm in terms of driving safety, driving efficiency, and driving comfort.
Although the proposed method achieves a certain degree of decoupling between state settings and scenarios, the various subjective factors present in real-world driving, along with the increased randomness in driving conditions, result in more edge cases. These factors present challenges for the practical implementation of the MLSCPO method in real-world applications. To ensure an effective transition, it is essential to further fine-tune the policy neural network trained in the simulation environment when deploying it in the real world. This process may involve adjusting the algorithm’s hyperparameters to accommodate more complex scenarios and expanding the state inputs to provide a more comprehensive and detailed description of the environment. During the interaction between the policy and the real-world environment, which is necessary to obtain rewards and costs for adapting to real-world conditions, it is important to address issues such as policy forgetting and the efficiency of real-world sampling. Furthermore, a safe transition from simulation to real-world applications may require the integration of various methodologies. For instance, integrating autonomous driving perception systems can ensure the rapid and accurate collection of scenario information necessary for the MLSCPO decision-making method, significantly enhancing adaptability, safety, and reliability under different road and traffic conditions; incorporating Large Language Models (LLMs) can enhance general knowledge, allowing for human-like understanding of diverse scenarios, thus improving the system’s adaptability and robustness in various traffic situations; implementing formal safety driving rules as a safety redundancy mechanism can help maintain the vehicle’s safety in hazardous situations and use these rule-based behaviors to guide updates in the reinforcement learning policy. In the future, we aim to incorporate the aforementioned potential extensions to improve the applicability and generalizability of the method in real-world scenarios.
Conflict of Interest
On behalf of all the authors, the corresponding author states that there is no conflict of interest.