Introduction
The rapid advancement of autonomous driving technology is propelling transportation systems towards greater levels of automation and intelligence, with primary design objectives focused on ensuring vehicular safety, providing a comfortable driving experience, and enhancing efficiency. In the evolution of autonomous driving technology, the capability to make safe decisions across all driving scenarios is paramount [1], [2]. Currently, decision-making and planning in autonomous driving primarily rely on rule-based and learning-based methods.
Rule-based methods in autonomous driving rely on a repository of predefined rules to make decisions or process information. These rules dictate when an autonomous vehicle should accelerate, decelerate, turn, or stop based on traffic laws and safe driving principles across various scenarios. In emergencies or rare events, these vehicles depend on these established rules for response. This method effectively addresses the majority of common driving scenarios [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. However, given the extensive, complex, and safety-critical nature of scenarios and conditions that vehicles encounter, the inflexibility and limited adaptability of manually constructed rules can lead to severe traffic accidents in unforeseen situations.
Learning-based methods capture complex mappings between inputs and outputs by analyzing vast datasets and employing sophisticated deep neural network models to make decisions or process information. Due to the reliance of supervised learning on the quality and diversity of training datasets, these models often exhibit limited generalization capabilities when encountering new scenarios, which has led to the adoption of deep RL methods in autonomous driving decisions. Deep RL [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], a method that integrates principles of deep learning and RL, transcends reliance on rule sets or large datasets alone. It actively engages with the environment to explore and adapt, optimizing its responses, making it a potent solution for designing autonomous driving decisions [28], [29]. Compared to rule-based methods like finite state machines, learning-based methods offer significant advantages in adaptability and flexibility. These methods autonomously learn from data and adapt to new patterns and changes, thereby reducing reliance on hardcoded rules. Furthermore, deep RL possesses the potential for continuous improvement, consistently enhancing its performance and accuracy by learning from new data, adapting to evolving circumstances.
Deep RL methods currently applied in autonomous driving, such as Deep Deterministic Policy Gradient (DDPG) [18], [19], Deep Q-Network (DQN) [13], [14], [15], [16], [17], Proximal Policy Optimization (PPO) [20], [21], Soft Actor-Critic (SAC) [22], [23], [24], Advantage Actor-Critic (A2C) [25], [26], and Asynchronous Advantage Actor-Critic (A3C) [27], have effectively addressed the poor timeliness and heavy reliance on existing data of traditional methods. However, these deep RL methods are predominantly driven by a single reward mechanism. In the pursuit of maximizing overall rewards, they frequently struggle to balance sub-rewards such as safety, comfort, and efficiency, and often fail to ensure that autonomous driving decisions consistently adhere to safety constraints. Particularly in extreme or marginal traffic scenarios, which fall outside the distribution of the training data, these RL policies struggle to consistently maintain vehicular safety [30]. In autonomous driving, safety is a fundamental requirement and should not merely be traded off against other performance metrics; the field therefore demands deep RL methods that can effectively satisfy the applicable safety constraints.
To address these limitations, this paper introduces a constraint-driven safety RL method, Multi-Level Safety-Constrained Policy Optimization (MLSCPO). This method has been applied to decision-making and planning policy in highway scenarios, seeking to maximize performance rewards while adhering to stringent safety constraints. It achieves safe, efficient, and comfortable driving, optimizes the dynamic coordination of traffic flow, and exhibits outstanding robustness. The specific contributions of this paper are as follows:
This paper introduces the MLSCPO method. Building on traditional trust-domain algorithms, it incorporates a cost-based value network to conduct tiered quantitative assessments of how well safety constraints are satisfied under the current vehicle operating conditions, and updates the policy accordingly. Under low-risk safety conditions, policy updates target reward maximization; under medium-risk conditions, a dual optimization problem is solved so that training adheres to the safety constraints, allowing policy updates to progress towards reward optimization without compromising safety costs; under high-risk conditions, natural gradient methods are employed to find solutions that downgrade the current high-risk condition to a medium- or low-risk one, thus enhancing the policy’s compliance with safety constraints.
This paper constructs refined reward and cost functions and defines the state and action spaces, framing the highway autonomous driving decision-making problem as a CMDP addressed with the proposed MLSCPO method. Specifically, the state space incorporates scenario information from a defined area around the autonomous vehicle, avoiding the issues caused by overly complex state-space designs; the cost function captures the safety concerns encountered in highway scenarios, providing a basis for updating the policy under different safety conditions, thereby enhancing traffic safety and reducing potential collision risks; and the reward function is designed to ensure safety, efficiency, and comfort for autonomous vehicles in highway settings.
The MLSCPO method has been validated in typical highway scenarios. By quantifying the metrics of safety, efficiency, and comfort, the MLSCPO method was evaluated and compared with the Proximal Policy Optimization with Lagrangian (PPO-Lag) method and the Intelligent Driver Model with Minimization of Overall Braking Induced by Lane Changes (IDM+MOBIL) method. The comparison results indicate that the proposed approach exhibits superior performance, making smarter decisions in high-interaction scenarios, and demonstrating extensive adaptability and reliability in complex and dynamic situations.
The rest of this paper is organized as follows: Section II provides an overview of current work on autonomous driving decision-making algorithms and points out its limitations. Section III introduces the framework of MLSCPO and defines the highway autonomous driving problem. Section IV describes the environment interaction module of the MLSCPO algorithm. Section V presents the policy optimization module of the MLSCPO algorithm. Section VI analyzes and discusses the experimental results. Finally, Section VII summarizes the conclusions and proposes directions for future work.
Related Works
A. Decision Planning for Autonomous Driving Based on Rule-Based Methods
Rule-based decision-making and planning in autonomous driving traditionally relies on a repository of predefined rules for making decisions or processing information. This method has been extensively researched and implemented within the autonomous driving sector. The Finite State Machine (FSM) [4] is the most commonly utilized rule-based method, prized for its practicality and clear logic, which facilitates easy debugging and implementation. Autonomous vehicles such as MIT’s Talos [5], the Defense Science and Technology University’s Red Flag CA7460, Stanford University and Volkswagen’s Junior [6], Carnegie Mellon University and Ford’s Boss [7], Mercedes-Benz’s Bertha [8], and Virginia Tech’s Odin [9] all employ decision-making methodologies based on Finite State Machines.
Finite State Machine (FSM) in autonomous vehicles can be categorized into three structural types based on the transition relations set among their sub-states: serial, parallel, and hybrid configurations. In the serial configuration, state transitions are unidirectional and non-cyclical, with MIT’s Talos autonomous vehicle [5] being a prime example. This structure’s greatest advantage lies in its simple and clear logic, making it well-suited for straightforward driving tasks and environments. However, this structure’s limitations become apparent in complex scenarios where the static state transition path may struggle with unexpected variations in traffic flow, potentially leading to a collapse of the decision-making chain if any state processes incorrectly or is inadequate. In contrast, the parallel structure offers greater flexibility, allowing for multiple sub-states that can transition among each other, forming loops. Stanford and Volkswagen’s Junior [6] autonomous vehicle utilizes this decision model. This structure can adapt to various driving environments, yet the challenge lies in accurately categorizing diverse scenarios. As system complexity increases, the number of designed rules multiplies, complicating debugging efforts. The hybrid structure [9] combines elements of both serial and parallel configurations. Carnegie Mellon and Ford’s Boss [7] autonomous vehicle employs this structure, which hierarchically processes tasks, effectively enhancing the success rate of task execution and overall system stability.
Additionally, some studies have combined Finite State Machines with other algorithms or modules for autonomous driving decision-making. For instance, Okumura et al. [10] developed a method for autonomous vehicles merging into roundabouts, which generates high-dimensional decision commands through a classifier and then utilizes an FSM to create rules tailored to the specific scenarios of merging into roundabouts. Cui et al. [11] developed a tiered emergency collision avoidance system, where the decision-making component utilizes an FSM to determine specific anti-collision measures, integrated with a collision risk model that assesses the threat level from surrounding vehicles. Vanholme et al. [12] treated decision-making actions as discrete entities, associating each action with specific states, thus breaking down the autonomous vehicle’s decision tasks into processes of environmental state recognition and response. This method boasts excellent real-time capabilities, enabling autonomous vehicles to make decisions swiftly under given conditions. However, given the variety of complex and sudden situations that may arise during driving, relying solely on predefined rules and models is insufficient to cover all possible scenarios, especially in high-speed highway environments where the range and complexity of situations are particularly extensive. Solving such complex driving tasks through manually crafted rules is a highly challenging endeavor. Furthermore, encountering situations beyond the rule set may lead to severe traffic accidents.
B. Decision Planning for Autonomous Driving Based on Learning Methods
Learning-based methods excel at addressing edge-case scenarios, outperforming rule-based methods like Finite State Machine by identifying and learning deeper, more complex patterns from data. This reduces reliance on hard-coded rules and offers significant advantages in adaptability and flexibility. Furthermore, machine learning models can continuously improve their performance and accuracy by learning from new data, showcasing superior ongoing learning capabilities. This allows them to effectively handle larger-scale and more complex problems. Learning-based autonomous driving decision methods primarily employ deep RL, which leverages the powerful function approximation capabilities of deep learning and the decision-making prowess of RL. This allows agents to “trial and error” through continuous interaction with the environment, learning from feedback to derive optimal driving policy for accomplishing tasks.
Currently, numerous scholars are researching RL-based autonomous driving decisions. Early RL algorithms advocated using function approximation to derive optimal Q-values, thus guiding agents to optimize their policies. The DQN algorithm improves RL by predicting Q-values for actions but often overestimates them. Double DQN [13], [14] uses two networks for better accuracy but struggles with continuous actions. The Normalized Advantage Function (NAF) [15], [16] addresses this but still has limitations in complex environments. Researchers are now focusing on policy optimization techniques to refine policies [17] directly. Chen et al. [18] employed the DDPG method to craft adversarial policies for managing surrounding agents in lane-changing situations. In another study, policies for controlling multiple interacting vehicles were generated by Wachi [19] through multi-agent DDPG. To enhance multi-agent interaction predictions in highway scenarios, Radtke et al. [20] applied the PPO algorithm within the AIRL framework. Shi et al. [21] used distributed PPO to handle training dimensions and communication loads in mixed traffic control, focusing on improving efficiency and communication. Demonstrating a complex balance between rule-based methods and deep RL methods, Zhang et al. [22] developed a dual-layer system for lane-changing behavior planning with SAC, enhancing collaboration among autonomous vehicles and overall traffic safety and efficiency. A hybrid method proposed by Al-Sharman et al. [23] combines a high-level learning-based behavior planning layer, based on the SAC principle, with a low-level motion planning layer using Model Predictive Control (MPC). This approach enables the learning of safe, non-conservative driving behaviors while ensuring the feasibility of the generated trajectories. Zhang et al. [24] utilized a Lyapunov-based method to guarantee the safety of policies generated by the SAC algorithm. By incorporating a mathematical function to ensure policy stability, this method addresses the inherent uncertainties in autonomous driving, providing a robust safety mechanism. Managing individual surrounding vehicles in vehicle-following scenarios was tackled by Kuutti et al. [25] with the Advantage Actor-Critic (A2C) [26] framework. Zhou et al. [27] investigated the application of an improved A3C algorithm for autonomous driving lane-changing decisions, which significantly enhances sample efficiency, thereby boosting learning efficiency and performance in complex traffic conditions.
However, the aforementioned RL methods are driven by single reward metrics, focusing on maximizing overall rewards. This often results in a trade-off among sub-rewards such as safety, comfort, and efficiency, especially in extreme or edge-case traffic scenarios, where these policies struggle to maintain vehicle safety outside the distribution of training data [30]. In the domain of autonomous driving, safety is a fundamental requirement and should not be merely considered a trade-off against other performance metrics. To address this issue, this paper introduces a constraint-driven safety RL method. This method is applied to decision-making and planning policy in highway scenarios, aiming to maximize performance rewards while adhering to stringent safety constraints, thereby ensuring safe, efficient, and comfortable driving.
Problem Definition and Method Framework
A. Problem Definition
Highways are a critical setting for travel, presenting unique complexities and challenges for autonomous driving compared to urban roads. The scenario considered in this study, as illustrated in Figure 1, involves vehicles traveling on multi-lane highways. Here, the Ego Vehicle (EV) monitors the motion states of the Surrounding Vehicles (SVs) to make real-time driving decisions. The number of SVs is random, their speeds vary within a specified range, they display random driving behaviors, and they are capable of making reasonable lane changes. Under such complex conditions, whether the decision-making algorithm can successfully capture, recognize, and produce accurate decisions and plans is a significant challenge.
Framework of the proposed MLSCPO method, comprising the Environment Interaction (EI) and Policy Optimization (PO) modules.
This work abstracts the scenarios encountered in autonomous highway driving into four main types:
Smooth Traffic Scenario: The EV encounters few SVs, which scarcely impede its progress. In this scenario, the EV can adjust its speed or change lanes within the limits of traffic regulations, enjoying a high degree of freedom. Safety for the EV is almost unthreatened by external factors;
Overtaking Behavior Scenario: On a multi-lane highway, the lanes on both sides of the EV are occupied by SVs, and there is an obstructing vehicle ahead of the EV. The EV has only two choices: to slow down and follow or to overtake using the only lane without obstructions, placing the EV in a lower-risk condition;
Emergency Avoidance Scenario: The vehicle in front of the EV suddenly decelerates or brakes urgently, placing the EV in a moderate-risk condition. The EV must urgently change lanes to the left or right to avoid a collision;
High Traffic Density Scenario: The highway is heavily congested, with all lanes ahead of the EV occupied by vehicles. The EV is unable to overtake and must slow down and follow, waiting for an opportunity, placing it in a high-risk condition.
In the real world, highway traffic scenarios are typically a random combination of the above conditions, and the unpredictable behavior of SVs creates even more complex situations, increasing both uncertainty and complexity.
At each timestep, the EV can access the real-time state information \begin{equation*} S=[S_{EV}, S_{SV}, S_{RI}],\end{equation*} where $S_{EV}$ denotes the EV's own state, $S_{SV}$ the observed states of the SVs, and $S_{RI}$ the road information.
B. Method Framework
MLSCPO constructs three neural networks: a policy network, a cost value network, and a reward value network. Specifically, the policy network maps states to actions, the cost-based value network evaluates the expected cost of policy, and the reward-based value network assesses the expected rewards of policy. Together, these value networks guide policy updates, ensuring that rewards are maximized under the constraints imposed.
Figure 1 depicts the framework of the robust highway autonomous driving decision method based on MLSCPO, which is primarily divided into two parts: Environment Interaction (EI) and Policy Optimization (PO). The EI module explores the highway environment under the current policy, acquiring the EV's state-space information at each timestep, observing the motion information of SVs, and outputting control commands according to the current policy. This interaction yields the current reward and cost values, forming a CMDP-format trajectory for use in PO.
The PO component uses the trajectories formed by EI to optimize the parameters of the policy neural network and the reward and cost value networks, based on the MLSCPO method. Specifically, the value networks conduct safety risk assessments for policy adopted by the EV under the current scenario. Different policy updates are applied depending on the level of safety risk: for low-risk scenarios, updates aim to maximize rewards; for moderate-risk scenarios, dual optimization problems are solved to ensure that training aligns with safety constraints, allowing updates to progress toward reward optimization without compromising safety costs; in high-risk scenarios, natural gradient methods are employed to seek policy updates. Once optimized, the updated policy continues to be tested through EI. The EI and PO components work in tandem to facilitate continuous iterative improvements until the optimized policy converges to the desired performance level.
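To make the interplay between the two modules concrete, the following minimal Python sketch outlines one possible EI/PO training loop. The class and method names (HighwayEnv-style environment, Policy, PolicyOptimizer, and their methods) are hypothetical placeholders rather than the actual implementation, and the loop structure is a simplified reading of the description above.

# Minimal sketch of the EI/PO alternation; all classes and methods are
# hypothetical placeholders, not the authors' implementation.
def train(env, policy, optimizer, num_epochs, episodes_per_epoch):
    for epoch in range(num_epochs):
        trajectories = []
        # Environment Interaction (EI): roll out the current policy
        for _ in range(episodes_per_epoch):
            state, done, episode = env.reset(), False, []
            while not done:
                action = policy.act(state)                     # control command [a_s, delta_s]
                next_state, reward, cost, done = env.step(action)
                episode.append((state, action, reward, cost, next_state))
                state = next_state
            trajectories.append(episode)
        # Policy Optimization (PO): risk-aware constrained policy update
        risk_level = optimizer.assess_risk(trajectories)       # low / medium / high
        optimizer.update_policy(policy, trajectories, risk_level)
        optimizer.update_value_networks(trajectories)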
MLSCPO Environment Interaction Module
The autonomous driving policy must interact with the environment within the safe RL framework, formalized as a CMDP, to support policy updates. This section translates the decision-making and planning problem of the EV in highway scenarios into a CMDP, including the construction of the state space, action space, reward function, and cost function.
A. Constrained Markov Decision Process Introduction
CMDP is a common formulation in the field of safe RL, building on the foundation of the Markov Decision Process (MDP). A CMDP introduces a cost function to compute the expected cost of an agent performing action A in state S, thereby imposing constraints on the outcomes. It can be represented by the tuple \begin{equation*} CMDP=[S, A, P, R, C, \gamma, u_{0}],\end{equation*} where $S$ is the state space, $A$ the action space, $P$ the state transition probability, $R$ the reward function, $C$ the cost function, $\gamma$ the discount factor, and $u_{0}$ the initial state distribution.
At each timestep $t$, the agent observes the current state $s_{t}$, selects an action $a_{t}$ according to the policy $\pi$, and receives a reward. The objective is to maximize the expected discounted return \begin{align*} J(\pi)=\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}R(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {1}\end{align*}
Additionally, the model requires that while maximizing rewards, the policy must also satisfy safety constraints. Therefore, a cost constraint cap $b$ is defined, and under policy $\pi$ the expected discounted cost is \begin{align*} J_{C}(\pi)=\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}C(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {2}\end{align*}
The set of feasible policies should satisfy:\begin{equation*} \Pi _{C}=\left \{{{\pi \in \Pi {: } J_{C}(\pi)\leq b}}\right \} \tag {3}\end{equation*}
The optimal policy is then given by \begin{align*} \pi ^{*}=\arg\max _{\pi \in \Pi _{C}}\mathbb {E}_{\substack {s_{0}\sim u_{0} \\ \tau \sim \pi }}\left [{{\sum _{t}\gamma ^{t}R(s_{t}, a_{t}, s_{t+1})}}\right ] \tag {4}\end{align*}
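As a simple illustration of Equations (1)-(3), the discounted return and discounted cost of a policy can be estimated from sampled trajectories as in the sketch below; the trajectory layout (a list of per-step reward/cost pairs) is an assumption made for illustration only.

# Minimal sketch: Monte Carlo estimates of J(pi) (Eq. (1)) and J_C(pi) (Eq. (2)).
# Each trajectory is assumed to be a list of (reward, cost) pairs per timestep.
def discounted_sum(values, gamma):
    total, discount = 0.0, 1.0
    for v in values:
        total += discount * v
        discount *= gamma
    return total

def estimate_objectives(trajectories, gamma=0.99):
    returns = [discounted_sum([r for r, _ in traj], gamma) for traj in trajectories]
    costs = [discounted_sum([c for _, c in traj], gamma) for traj in trajectories]
    j_pi = sum(returns) / len(returns)      # estimate of J(pi)
    j_c_pi = sum(costs) / len(costs)        # estimate of J_C(pi)
    return j_pi, j_c_pi

# A policy is feasible in the sense of Eq. (3) when j_c_pi <= b for the chosen cost cap b.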
B. Convert Highway Autonomous Driving Decisions Into CMDP
Definition of State and Action Spaces: Designing an appropriate state space is essential for the effective training of RL algorithms. The accuracy and completeness of state information in highly complex and dynamic scenarios help the system output suitable actions and guide the policy to learn stably. In highway scenarios, the state space includes information about the EV, the status of SVs within a certain observational range, and the road information relevant to the current location of the EV:\begin{equation*} S=[S_{EV}, S_{SV}^{\prime }, S_{RI}]\end{equation*}
The EV’s state information
Under this framework, the driving control information for the EV is output through the action space, which primarily characterizes the control commands for the vehicle's lateral and longitudinal movements:\begin{equation*} A=[a_{s}, \delta _{s}], \quad \text {s.t.}\; a_{s}\in [a_{s_{min}}, a_{s_{max}}]\, \mathrm {m/s^{2}},\; \delta _{s}\in [\delta _{s_{min}}, \delta _{s_{max}}]\, \mathrm {rad}\end{equation*}
In this model, $a_{s}$ represents the commanded longitudinal acceleration of the EV, while $\delta _{s}$ represents the commanded steering angle. A PID controller of the standard form is used to track these commands:\begin{equation*} u(t)=K_{p}e(t)+K_{i}\int _{0}^{t}e(t)dt+K_{d}\frac {de(t)}{dt} \tag {5}\end{equation*} where $e(t)$ is the tracking error and $K_{p}$, $K_{i}$, $K_{d}$ are the proportional, integral, and derivative gains.
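Equation (5) is the standard PID control law; a minimal discrete-time sketch is given below. The gains and the 0.1 s timestep are illustrative values, not those tuned in the paper.

# Minimal discrete-time PID controller corresponding to Eq. (5).
# The gains K_p, K_i, K_d and the timestep are illustrative only.
class PIDController:
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: tracking a commanded longitudinal speed.
speed_pid = PIDController(kp=1.0, ki=0.1, kd=0.05)
target_speed, current_speed = 25.0, 22.0      # m/s, illustrative values
throttle_command = speed_pid.step(target_speed - current_speed)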
Cost Function Definition: In this study, the cost function considers both instantaneous and terminal costs. The instantaneous cost represents the cost incurred by the EV at each time step (in the current state) after selecting an action, helping the EV adjust its policy during training to adhere to safety constraints (such as avoiding collisions and traffic-rule violations) and guiding it to make relatively safe and effective decisions. The instantaneous cost has two components: a beyond-road-boundary cost $c_{boun}$ and a proximity distance violation cost $c_{dis}$; the terminal cost is the collision cost $c_{crash}$. The overall cost is a weighted combination of these terms:\begin{equation*} C=k_{1}\times c_{crash}+k_{2} \times c_{boun}+k_{3} \times c_{dis} \tag {6}\end{equation*}
The construction of the cost function encompasses the potential risks between the EV and the road, as well as between the EV and SVs, covering both potential and actual collision risks. Even in highly dynamic and uncertain environments, it can still guide the policy to avoid various potential risks. Each component is assigned a fixed risk penalty value, weighted by the coefficients $k_{1}$, $k_{2}$, and $k_{3}$ in Equation (6).
In highway scenarios, should the EV collide with SVs or road obstacles, it could lead to severe traffic accidents, posing significant threats to the safety of the driver and passengers. Consequently, the baseline cost value for such collisions is set high, and $c_{crash}$ is defined as \begin{align*} c_{crash}=\begin{cases}\displaystyle 15, & if\: crash=True \\ \displaystyle 0, & else\end{cases} \tag {7}\end{align*}
Similarly, driving beyond the highway boundaries also exposes drivers and passengers to unpredictable dangers. Therefore, this work defines a boundary cost $c_{boun}$ as \begin{align*} c_{boun}=\begin{cases}\displaystyle 1, & if\: beyond\ boundary \\ \displaystyle 0, & else\end{cases} \tag {8}\end{align*}
Maintaining a safe distance is essential for safe vehicle operation, particularly on highways where higher speeds necessitate greater safety margins. Consequently, a safety distance threshold $C_{r}$ is defined, and a cost is incurred whenever the distance $\Delta x_{i}$ to a surrounding vehicle falls below this threshold:\begin{align*} c_{dis}=\begin{cases}\displaystyle 1, & if\: \Delta x_{i}\lt C_{r} \\ \displaystyle 0, & else\end{cases} \tag {9}\end{align*}
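The cost terms of Equations (6)-(9) can be combined as in the following sketch. The weights k1-k3 and the threshold C_r are treated as configurable parameters with illustrative values, and interpreting Eq. (9) as triggering when any surrounding vehicle is closer than C_r is an assumption.

# Minimal sketch of the step cost, Eqs. (6)-(9); weights and threshold are illustrative.
def step_cost(crashed, beyond_boundary, gaps_to_svs, c_r=10.0, k1=1.0, k2=1.0, k3=1.0):
    c_crash = 15.0 if crashed else 0.0                               # Eq. (7)
    c_boun = 1.0 if beyond_boundary else 0.0                         # Eq. (8)
    c_dis = 1.0 if any(gap < c_r for gap in gaps_to_svs) else 0.0    # Eq. (9), assumed over all SVs
    return k1 * c_crash + k2 * c_boun + k3 * c_dis                   # Eq. (6)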
Reward Function Definition: The reward function plays a crucial role in guiding and evaluating the EV’s behavior. In the MLSCPO algorithm, to balance safety and performance, the reward function and safety constraints are set independently; when safety and performance cannot be satisfied simultaneously, policy updates are guided by the quantified risk level. The reward provides immediate feedback, steering the agent towards desired actions and optimizing its policy. A well-designed reward function not only encourages appropriate exploration and the discovery of promising new policies but also balances short-term and long-term benefits, preventing the agent from focusing excessively on immediate rewards at the expense of long-term objectives. In this study, the reward function considers driving efficiency, riding comfort, and other factors. Specifically, it includes a driving efficiency reward $r_{effi}$, a comfort reward $r_{comf}$, a driving task optimization reward $r_{pref}$, and an overtaking reward $r_{ovrt}$, combined with the cost $C$:\begin{align*} R=k_{4} \times r_{effi}+k_{5} \times r_{comf}+k_{6} \times r_{pref}+k_{7} \times r_{ovrt} - C \tag {10}\end{align*}
In this study, driving efficiency primarily refers to the ability of the EV to travel at the highest possible speed while staying within the road speed limits $[v_{min}, v_{max}]$. The efficiency reward is defined as \begin{align*} r_{effi}= \begin{cases} \displaystyle \frac {v_{e}-v_{min}}{v_{max}-v_{min}},& if\: v_{e}\in [v_{min}, v_{max}] \\ \displaystyle -0.5,& else\end{cases} \tag {11}\end{align*}
The comfort reward function primarily considers the experience of the passengers during the EV’s journey, characterized by the magnitude of the vehicle’s lateral acceleration, which is used as an indicator for evaluating this aspect. It is defined as follows:\begin{equation*} r_{comf}=1-\left |{{\frac {\mathit v_{e, y}^{t+\Delta t}-\mathit v_{e, y}^{t}}{4\Delta t}}}\right | \tag {12}\end{equation*}
The driving task optimization reward aims to ensure the safety of the EV while also incorporating greater flexibility for executing maneuvers such as lane changes and overtaking. The reward component $r_{pref}$ is defined as \begin{align*} r_{pref} & = \left ({{\frac {n_{rj}+1}{n_{rj}-1/\: n_{rj}\: /\: n_{rj}+1}}}\right) \\ & \quad + \left |{{\frac {y_{e}-y_{rj}}{\left ({{(n_{rj}-1/\: n_{rj}\ /\: n_{rj}+1)-(n_{rj}+1)}}\right)d_{rj}}}}\right | \tag {13}\end{align*}
The overtaking reward $r_{ovrt}$ is defined as \begin{align*} r_{ovrt}=\begin{cases}\displaystyle -0.8, & if\: x_{e}\gt x_{1} \\ \displaystyle 0.1,& else\end{cases} \tag {14}\end{align*}
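Analogously, the reward terms of Equations (10)-(12) and (14) can be assembled as in the sketch below. The lane-preference term r_pref of Equation (13) is omitted because its exact form is not fully recoverable here, x_1 is treated simply as a reference longitudinal position as written in Equation (14), and all weights are illustrative.

# Minimal sketch of the step reward, Eqs. (10)-(12) and (14); r_pref (Eq. (13)) is omitted.
def step_reward(v_e, v_lat_now, v_lat_prev, x_e, x_1, cost,
                v_min=17.0, v_max=33.0, dt=0.1, k4=1.0, k5=1.0, k7=1.0):
    # Driving efficiency reward, Eq. (11)
    r_effi = (v_e - v_min) / (v_max - v_min) if v_min <= v_e <= v_max else -0.5
    # Comfort reward from lateral acceleration, Eq. (12)
    r_comf = 1.0 - abs((v_lat_now - v_lat_prev) / (4.0 * dt))
    # Overtaking-related reward, Eq. (14), with x_1 taken as a reference position
    r_ovrt = -0.8 if x_e > x_1 else 0.1
    return k4 * r_effi + k5 * r_comf + k7 * r_ovrt - cost            # Eq. (10) without r_pref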
MLSCPO Policy Optimization Module
The PO component utilizes trajectories formed through EI to optimize the parameters of the policy neural network and the reward and cost value networks, based on the MLSCPO method. This section discusses how MLSCPO constructs trust and constraint domains using neural networks to assess safety risk levels for autonomous highway vehicles and implement policy optimization under various safety levels.
A. Policy Parameter Update
In large or continuous CMDPs, the dimensionality problem makes it infeasible to find an exact optimal solution as described by Equation (4). Therefore, this work introduces a parameterized policy $\pi _{\theta }$ with parameters $\theta$ and searches within the parameterized policy class $\Pi _{\theta }$.
The policy update can be reformulated as:\begin{align*} \pi _{k+1}& =\arg\max _{\pi \in \Pi _{\theta }}J(\pi) \\ & \quad ~\text {s.t.}\; J_{\mathit {C}}(\pi)\leq \mathit {b} \\ & \qquad \mathit {D}(\pi, \pi _{k})\leq \delta \tag {15}\end{align*} where $D(\pi, \pi _{k})$ denotes a distance measure (here the KL divergence) between the candidate policy and the current policy, and $\delta$ is the trust-domain step size.
However, this update is difficult to implement directly, because it requires evaluating whether each candidate policy satisfies the cost constraint, which cannot be checked without collecting new samples under that policy. The update is therefore approximated using value and advantage functions estimated under the current policy.
Let $V^{\pi }$ and $Q^{\pi }$ denote the state-value and action-value functions under policy $\pi$:\begin{align*} V^{\pi }(s)& \equiv \mathbb {E}_{\tau \sim \pi }(R(\tau)|s_{0}=s) \tag {16}\\ Q^{\pi }(s, a)& \equiv \mathbb {E}_{\tau \sim \pi }(R(\tau)|s_{0}=s, a_{0}=a) \tag {17}\end{align*}
The advantage function is defined as:\begin{equation*} A^{\pi }(s, a)=Q^{\pi }(s, a)-V^{\pi }(s) \tag {18}\end{equation*}
The advantage function offers a more precise evaluation method by measuring the benefit of specific actions in given states, compared to solely using the value function. It reduces variance and randomness in policy updates, providing a clear gradient signal for parameter adjustments. This results in more effective updates, reducing sample requirements and computational complexity, and accelerating convergence to the optimal policy. Generalized Advantage Estimation (GAE) balances accuracy and variance of cost measurements for stable learning updates. This study employs GAE to estimate the cost advantage function, defined as follows:\begin{align*} A_{\mathit {C}}^{\pi }(s, a) & = \sum _{k=0}^{T} (\gamma \xi)^{k} \big (-C(s_{t+k}, a_{t+k}) \\ & \quad + \gamma V_{C}^{\pi }(s_{t+k+1}) - V_{C}^{\pi }(s_{t+k}) \big) \tag {19}\end{align*} where $T$ is the horizon and $\xi \in [0, 1]$ is the GAE coefficient.
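The cost-advantage estimate of Equation (19) can be computed backwards over a trajectory, as in the following sketch; the sign convention follows Equation (19) as printed, and the cost value estimates are assumed to include the value of the state after the final step.

import numpy as np

# Minimal sketch of GAE for the cost advantage, following Eq. (19) as printed.
# costs: per-step costs c_t; v_c: cost value estimates with len(costs) + 1 entries.
def cost_gae(costs, v_c, gamma=0.99, xi=0.95):
    T = len(costs)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = -costs[t] + gamma * v_c[t + 1] - v_c[t]   # per-step residual
        running = delta + gamma * xi * running
        advantages[t] = running
    return advantages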
The advantage function allows us to express more succinctly the difference in returns before and after a policy update:\begin{equation*} J(\pi ')-J(\pi)=\frac {1}{1-\gamma }\mathbb {E}_{\tau \sim \pi '}[A^{\pi }(s, a)] \tag {20}\end{equation*}
The policy update can therefore be reformulated as:\begin{align*} \pi _{k+1}& =\arg\max _{\pi \in \Pi _{\theta }}\: \mathbb {E}_{\tau \sim \pi _{k}}\left [{{A^{\pi _{k}}(s, a)}}\right ] \\ & \quad ~\text {s.t.}\: J_{\mathit {C}}(\pi _{k})+\frac {1}{1-\gamma }\mathbb {E}_{\tau \sim \pi _{k}}[A_{\mathit {C}}^{\pi _{k}}(s, a)]\leq \mathit {b} \\ & \qquad \overline {D}_{KL}(\pi \|\pi _{k})\leq \delta \tag {21}\end{align*}
Linearizing the objective and the cost constraint around the current parameters $\theta _{k}$ and approximating the KL-divergence constraint by its second-order expansion yields the following approximate update problem:\begin{align*} \theta _{k+1}& =\arg\max _{\theta }\: r^{T}(\theta -\theta _{k}) \\ & \quad ~\text {s.t.}\quad \mathit {z} + \mathit {e}^{T}(\theta -\theta _{k})\leq 0 \\ & \hphantom {\quad ~\text {s.t.}\quad }\frac {1}{2}(\theta -\theta _{k})^{T}\mathbf {H}(\theta -\theta _{k})\leq \delta \tag {22}\end{align*} where $r$ and $e$ are the gradients of the reward and cost objectives, $\mathbf {H}$ is the Hessian of the KL divergence (the Fisher information matrix), and $z$ measures the current constraint violation.
Since the matrix $\mathbf {H}$ is positive semi-definite, problem (22) is convex and can be solved through its dual:\begin{align*} \max _{\substack {\lambda \geq 0 \\ v \gt 0}}\frac {-1}{2\lambda }\Big (r^{T}\mathbf {H}^{-1}r-2g^{T}v+v^{T}Sv\Big)+v^{T}z-\frac {\lambda \delta }{2} \tag {23}\end{align*} where $\lambda$ and $v$ are the dual variables, $g=r^{T}\mathbf {H}^{-1}e$, and $S=e^{T}\mathbf {H}^{-1}e$.
The optimal primal solution is then recovered from the optimal dual variables $\lambda ^{*}$ and $v^{*}$ as:\begin{equation*} \theta ^{*}=\theta _{k}+\frac {1}{\lambda ^{*}}\mathbf {H}^{-1}(r-ev^{*}) \tag {24}\end{equation*}
The gradient of the reward value function $r$ is estimated from the sampled trajectories as:\begin{align*} r& =\nabla _{\theta } J\left ({{\pi }}\right) \\ & =\frac {1}{N_{tr}}\sum _{j=1}^{N_{tr}}\sum _{t=1}^{N_{t}}\nabla _{\theta }\mathbb {E}_{\tau _{\pi _{k}}\sim \pi _{\theta }}[\log \pi _{\theta }(a_{t}|s_{t})A^{\pi }(s_{t}, a_{t})] \tag {25}\end{align*} where $N_{tr}$ is the number of sampled trajectories and $N_{t}$ the number of timesteps per trajectory.
The gradient of the cost value function $e$ is defined as follows:\begin{align*} e& =\nabla _{\theta } J_{C}(\pi) \\ & =\frac {1}{N_{tr}}\sum _{j=1}^{N_{tr}}\sum _{t=1}^{N_{t}}\nabla _{\theta }\mathbb {E}_{\tau _{\pi _{k}}\sim \pi _{\theta }}[\log \pi _{\theta }(a_{t}|s_{t})\: A_{C}^{\pi }(s_{t}, a_{t})] \tag {26}\end{align*}
The safety level judgment factors $z$ and $D$ are obtained from the cost value function gradient and are defined as follows:\begin{align*} \mathit {z}& =\mathbb {E}\left [{{\sum _{t=0}^{N_{t}}\gamma ^{t}C(s_{t}, a_{t}, s_{t+1})}}\right ]-\frac {b}{1-\gamma } \tag {27}\\ D& =\delta -\frac {z^{2}}{\mathit {e}^{T}\mathbf {H}^{-1}\mathit {e}} \tag {28}\end{align*}
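Given the cost gradient e, the matrix H, the accumulated discounted cost, and the cost cap b, the judgment factors of Equations (27) and (28) can be computed as below. H is inverted via a direct solve for clarity; in practice a conjugate-gradient solver would typically be substituted, and this sketch is illustrative rather than the authors' code.

import numpy as np

# Minimal sketch of the safety-level judgment factors z and D, Eqs. (27)-(28).
def judgment_factors(discounted_cost, b, gamma, e, H, delta):
    z = discounted_cost - b / (1.0 - gamma)        # Eq. (27)
    h_inv_e = np.linalg.solve(H, e)                # H^{-1} e (direct solve for clarity)
    d = delta - (z ** 2) / float(e @ h_inv_e)      # Eq. (28)
    return z, d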
Based on these quantities, the trust domain and the constraint domain are defined as:\begin{equation*} \text {Trust domain}=\{\pi _{\theta }\in \Pi _{\theta }{: }\overline {D}_{KL}(\pi \|\pi _{k})\leq \delta \}\end{equation*}
\begin{align*} \text {Constrained domain} = \Big \{ \pi _{\theta }\in \Pi _{\theta }: J_{C}(\pi _{k}) + \frac {1}{1-\gamma } \mathbb {E}_{\tau \sim \pi _{k}}[A_{C}^{\pi _{k}}(s, a)] \leq b \Big \}\end{align*}
The MLSCPO algorithm categorizes the possible risk situations for agent policy updates based on the relative positions of the trust domain and the constraint domain as follows:
Low-Risk Safety Condition: When the norm of the cost value function gradient $\|e\|$ is extremely small (less than $1 \times 10^{-8}$) or the safety level judgment factors satisfy $z \lt 0$ and $D \lt 0$, it indicates that the constraint domain is entirely within the trust domain, as shown in Figure 3(a). The updated policy is highly likely to remain within both the constraint and trust domains, implying that the current policy almost never violates the constraints.
Moderate-Risk Safety Condition: When the safety level judgment factor $D \gt 0$, it suggests that although there is an intersection between the constraint domain and the trust domain, as shown in Figure 3(b), the policy that maximizes the reward might fall outside the constraint domain, posing a certain risk of dangerous situations.
High-Risk Safety Condition: When the safety level judgment factors satisfy $z \gt 0$ and $D \lt 0$, it indicates that there is no intersection between the constraint domain and the trust domain, as shown in Figure 3(c). The updated policy will definitely not satisfy the constraint domain, meaning that no matter how the policy is updated, the current dangerous situation cannot be reversed.
Schematic diagram of the relative positions of the trust domain and the constraint domain.
The MLSCPO algorithm performs corresponding policy updates according to different risk situations:
When the evaluation results indicate that the EV is operating under a low-risk safety condition with the current policy, the trust domain method is used for policy updates to maximize the expected reward function value.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}+\sqrt {\frac {2\delta }{\mathit {r}^{T}\mathbf {H}^{-1}\mathit {r}}}\mathbf {H}^{-1}r \tag {29}\end{equation*}
When the evaluation results indicate that the EV is operating under a moderate-risk safety condition with the current policy, it means that the policy update may violate safety constraints. Therefore, it is necessary to solve the dual problem to find a policy that both satisfies the safety constraints and maximizes the expected reward function.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}+\frac {1}{\lambda ^{*}}\mathbf {H}^{-1}(r-ev^{*}) \tag {30}\end{equation*}
\begin{align*} v^{*}& =\left ({{\frac {\lambda ^{*}z-g}{S}}}\right)_{+} \tag {31}\\ \lambda ^{*} & = \arg \max _{\lambda \geq 0} \begin{cases} \displaystyle f_{a}(\lambda) \triangleq \frac {1}{2\lambda } \left ({{ \frac {g^{2}}{S} {-} q }}\right) \\ \displaystyle \quad + \frac {\lambda }{2} \left ({{ \frac {z^{2}}{S} - \delta }}\right) - \frac {gz}{S} & \text {if}~ \lambda z {-} g \gt 0 \\ \displaystyle f_{b}(\lambda) \triangleq -\frac {1}{2} \left ({{ \frac {q}{\lambda } + \lambda \delta }}\right) & \text {else} \end{cases} \tag {32}\end{align*}
After obtaining the optimal values $\lambda ^{*}$ and $v^{*}$, the policy update direction is computed as:\begin{equation*} \omega _{k}=\mathbf {H}^{-1}(\mathit {r}-\mathit {e}v^{*}) \tag {33}\end{equation*}
After this, the current policy can be updated using the following equation:\begin{equation*} \theta _{k+1}=\theta _{k}+\frac {\alpha }{\lambda ^{*}}\omega _{k} \tag {34}\end{equation*}
When the evaluation results indicate that the EV is operating under a high-risk safety condition with the current policy, it means that the policy update cannot ensure the set safety constraints, posing a very high safety risk for the EV. In this case, the policy is updated using the natural gradient method.
The updated policy is:\begin{equation*} \theta _{k+1}=\theta _{k}-\sqrt {\frac {2\delta }{\mathit {e}^{T}\mathbf {H}^{-1}\mathit {e}}}\mathbf {H}^{-1}\mathit {e} \tag {35}\end{equation*}
In this way, risk assessment for the highway EV is achieved, and safety levels are categorized based on the evaluation results. A different policy update method is selected for each safety level, ensuring that policy updates maximize rewards while maintaining safety.
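Bringing the three branches together, the following sketch shows how an update could be selected from the quantities defined above. It is an illustrative reading of Equations (29)-(35): the dual solve for the moderate-risk branch is abstracted into a placeholder function solve_dual, and the thresholds mirror the risk categorization described earlier.

import numpy as np

# Illustrative selection of the MLSCPO parameter update, Eqs. (29), (30)-(34), (35).
# solve_dual is a placeholder assumed to return the optimal (lambda*, v*) of Eq. (32).
def mlscpo_update(theta, r, e, H, z, d, delta, solve_dual, alpha=0.8, eps=1e-8):
    h_inv_r = np.linalg.solve(H, r)
    h_inv_e = np.linalg.solve(H, e)
    if np.linalg.norm(e) < eps or (z < 0 and d < 0):
        # Low risk: reward-maximizing natural-gradient step, Eq. (29)
        return theta + np.sqrt(2.0 * delta / float(r @ h_inv_r)) * h_inv_r
    if z > 0 and d < 0:
        # High risk: recovery step that purely reduces the expected cost, Eq. (35)
        return theta - np.sqrt(2.0 * delta / float(e @ h_inv_e)) * h_inv_e
    # Moderate risk: solve the dual problem and step along H^{-1}(r - e v*), Eqs. (30)-(34)
    lam_star, v_star = solve_dual(r, e, H, z, delta)
    omega = np.linalg.solve(H, r - e * v_star)     # Eq. (33)
    return theta + (alpha / lam_star) * omega      # Eq. (34)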
B. Value Network Parameter Update
Network updates are a critical part of the optimization process. Value networks are typically used to estimate the expected return in specific states, which is crucial for the efficient learning and decision-making of the policy network, and the accuracy of the value networks directly impacts the decision quality of the policy network. When updating the value networks, the EI module interacts with the environment to collect state information and generate trajectories.
The parameters of the value networks are updated by minimizing the difference between the predicted value and the actual discounted return, using a squared loss to minimize the prediction error. For each policy of the EV, a predicted benchmark value is computed from the reward function and related rules; this prediction represents the output of the value network under a relatively well-performing policy. In actual operation, however, the policy adopted by the vehicle is not guaranteed to be near-optimal, and the actual return values fluctuate continuously. Minimizing the squared prediction error drives the actual returns obtained by the vehicle's policy towards, or beyond, the values predicted under the better policy, thereby continuously optimizing the driving policy.
The update rule for the reward-based value network parameters is:\begin{equation*} \varphi _{r}=\arg\min _{\varphi }\mathbb {E}\left [{{\left ({{V_{\varphi _{r}}(s_{t})-R_{t}}}\right)^{2}}}\right ] \tag {36}\end{equation*}
The update rule for the cost-based value network parameters is:\begin{equation*} \varphi _{e}=\arg\min _{\varphi }\mathbb {E}\left [{{\left ({{V_{\varphi _{e}}(s_{t})-C_{t}}}\right)^{2}}}\right ] \tag {37}\end{equation*}
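Since the PO module is implemented with PyTorch, the squared-error updates of Equations (36) and (37) amount to a standard regression step, sketched below. The network, optimizer, and target construction are assumed to exist; the targets are the discounted returns R_t for the reward value network or the accumulated costs C_t for the cost value network.

import torch

# Minimal sketch of one value-network regression step, Eqs. (36)-(37).
# value_net, optimizer, states (tensor of s_t), and targets (R_t or C_t) are assumed given.
def update_value_network(value_net, optimizer, states, targets):
    predictions = value_net(states).squeeze(-1)        # V_phi(s_t)
    loss = torch.mean((predictions - targets) ** 2)    # squared prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()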
C. MLSCPO Algorithm
This section provides the pseudocode for the MLSCPO algorithm based on the discussion above. The input to MLSCPO is the state information S obtained from the interaction of the EV with the environment, and the output is the control information for the EV after policy optimization, corresponding to its action space A. The pseudocode for MLSCPO is shown in Algorithm 1:
Algorithm 1 MLSCPO
Input: global state S obtained from the environment
Output: EV action A
Initialize: policy network parameters θ, reward value network parameters φ_r, cost value network parameters φ_e
for epoch n = 1, 2, ..., N do
  for step k = 1, 2, ..., K do
    for episode t = 1, 2, ..., N_t do
      EI interacts with the environment to collect state s_t
      PO executes action a_t ~ π_θ(a_t | s_t) and receives reward r_t and cost c_t
      Collect trajectory tuple (s_t, a_t, r_t, c_t, s_{t+1})
    end for
    Collect global trajectory τ_k
  end for
  Update policy: estimate the advantage functions A^π and A_C^π with GAE
  Calculate r, e, H, z, and D by Equations (25)-(28)
  Safety risk level assessment by ||e||, z, and D
  if Low Risk then
    update policy network as in Equation (29)
  else if Middle Risk then
    update policy network as in Equations (30)-(34)
  else if High Risk then
    update policy network as in Equation (35)
  end if
  Update value network parameters φ_r and φ_e by Equations (36) and (37)
  Provide action A through the updated policy π_θ
end for
The MLSCPO algorithm is an iterative process. Lines 4-6 in the pseudocode represent the data collection phase: in each episode, the EV's state information is collected, actions are executed according to the current policy, and the resulting rewards and costs are recorded to form trajectories for policy optimization.
Experiment
This section describes the experiments conducted using the proposed MLSCPO algorithm on the CARLA simulation platform. First, the model parameters and other settings used in the experiments are detailed. Then, the training simulations and test results are analyzed. Finally, the proposed method is compared with the PPO-Lag method and the IDM+MOBIL method under the same scenario settings to validate the performance of the proposed method in terms of safety, comfort, and driving efficiency of the EV.
A. Experimental Setup
The version of the CARLA simulation platform used in this work is 0.9.12. The MLSCPO RL model is built on the PyTorch framework, running on an NVIDIA GeForce RTX 3070 Ti GPU with Ubuntu 20.04 as the operating system. In the CARLA simulation platform, a one-way multi-lane highway training scenario is constructed, with each lane 3.5 meters wide. During the experiments, EI is implemented using the built-in sensors of the CARLA simulator and the Python API to simulate sensors deployed on the EV. This captures real-time interaction data between the EV and the scene and controls vehicle behavior by converting the actions performed by the EV (such as speed, steering angle, and braking) into data signals, which guide the lateral and longitudinal movements of the EV. This process is implemented through three neural networks in the PO module: two value networks and one policy network. The value networks assess the level of safety risk, and based on the assessment results, the policy network carries out the corresponding policy updates.
To ensure the simulation of real-time driving in realistic scenarios, the time step is set to 0.1 seconds, meaning the interval from when the EV captures the current scene and vehicle information to when it controls the lateral and longitudinal movements is 0.1 seconds. SVs can perform random but reasonable acceleration, deceleration, and lane-changing behaviors. The initial speed of SVs is set to 25 m/s, and the expected speed range for the EV is defined as 17-33 m/s. This speed range ensures that the EV can perform car-following and overtaking behaviors. At the end of each cycle, the speed, position, yaw angle, and other information of the autonomous vehicle at the termination moment are recorded. These data are used as the initial state inputs for the next cycle, ensuring the continuity of the training process and the accuracy of the simulation.
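For reference, the scenario settings stated above can be summarized in a configuration of the following form; the key names are hypothetical, and only the numeric values quoted in the text are taken from the setup.

# Illustrative summary of the simulation settings described above; key names are hypothetical.
SIM_CONFIG = {
    "carla_version": "0.9.12",
    "timestep_s": 0.1,                     # interval between observation and control
    "lane_width_m": 3.5,                   # one-way multi-lane highway
    "sv_initial_speed_mps": 25.0,          # surrounding vehicles
    "ev_speed_range_mps": (17.0, 33.0),    # expected EV speed range
}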
The PO module is mainly composed of three neural networks: the reward value neural network, the cost value neural network, and the policy neural network. Assuming there are N vehicles, the structure of the policy and value neural networks is defined as
To evaluate the performance of the MLSCPO algorithm, this study conducted two sets of experiments. The first set analyzed the training process of MLSCPO against the current advanced safe reinforcement learning algorithm PPO-Lag, while the second set compared the performance of the trained policies of MLSCPO, PPO-Lag, and IDM+MOBIL. In the first set of training experiments, the relative positions, speeds, and driving behaviors of the EV and SVs were set with high randomness to test and validate the adaptability and effectiveness of the MLSCPO and PPO-Lag algorithms in multi-level risk assessment and constrained policy optimization across various complex traffic environments. In the second set of experiments, the highway autonomous driving decision-making policy trained by the MLSCPO algorithm was compared with those of PPO-Lag and the traditional stable algorithm IDM+MOBIL in the same scenarios. The experiments included test scenarios of varying difficulty, comprising the regular scenario in Figure 4(a) and the difficult scenario in Figure 4(b). The two scenarios differ in the real-time traffic density around the autonomous vehicle and the randomness of other vehicles during training. Both scenarios are random combinations of different traffic-density conditions, car-following conditions, overtaking conditions, and even emergency conditions. Additionally, they include out-of-distribution scenarios to fully validate the algorithm’s adaptability. This study compared the algorithms in three aspects: driving efficiency, driving comfort, and driving safety.
B. Training Process and Result
In this experiment, the proposed MLSCPO algorithm and the PPO-Lag algorithm were trained in highly uncertain training scenarios. By analyzing the training process data, the curves of the total reward and total cost for both methods were obtained.
Figure 5(a) illustrates the total reward values per epoch for the MLSCPO and PPO-Lag algorithms. The solid line represents the total reward value, averaged from the immediate rewards in each epoch, while the shaded area indicates the standard deviation. The curves reveal that after convergence, the MLSCPO algorithm, with the same reward function parameter settings, achieves higher reward values than the PPO-Lag algorithm. Additionally, MLSCPO converges faster. The reasons for this difference are that, on one hand, MLSCPO separates safety and performance, and after satisfying safety constraints, policy updates focus on performance optimization. In contrast, PPO-Lag struggles to balance safety and performance, resulting in lower total reward values after convergence. On the other hand, although both algorithms consider safety constraints, the MLSCPO method employs a multi-level safety risk assessment using a cost value function, selecting different policy update methods based on various safety assessment outcomes. Meanwhile, PPO-Lag only uses a single Lagrange multiplier to guide policy updates under safety constraints. The diverse policy update methods in MLSCPO are more comprehensive and advanced, significantly improving learning efficiency and thus leading to faster convergence compared to PPO-Lag. Furthermore, as indicated by the standard deviation, the PPO-Lag algorithm shows greater fluctuations in reward values, while the MLSCPO algorithm maintains rewards at a high level with minor fluctuations after convergence, indicating better stability.
Figure 5(b) shows the cost of the MLSCPO algorithm and the PPO-Lag algorithm over 1400 epochs of training, with the solid line representing the total cost, averaged from the costs in each epoch, and the shaded area representing the standard deviation. During training, the total cost reflects the level of safety assurance; when the total cost in an epoch is below the threshold, it can be considered that the EV has almost no collision risk and high safety. The curves indicate that the training performance of the MLSCPO algorithm is superior to that of the PPO-Lag algorithm. Specifically, the total cost of the MLSCPO method shows an overall downward trend, dropping below the benchmark value of 15 after approximately 200 policy iterations and continuing to decrease until it approaches zero. In contrast, the total cost of the PPO-Lag algorithm fluctuates greatly and does not show a stable downward trend. This demonstrates that the MLSCPO method has high stability in ensuring driving safety and can achieve a zero collision rate. The outstanding performance of MLSCPO is due to its multi-level assessment approach for policy updates. The assessment results are categorized into three safety levels: low, medium, and high, with different policy update methods applied accordingly. The outcomes of these policy updates ensure that parameters are optimized to maximize rewards within the feasible domain of the constraints, meaning that they prioritize meeting safety constraints while enhancing the overall performance of the EV.
C. Performance Comparison Between MLSCPO and IDM+MOBIL
To further confirm the control performance of the proposed MLSCPO algorithm for EVs in highway scenarios, the updated parameters of MLSCPO are used to conduct comparative tests with the PPO-Lag algorithm and the IDM+MOBIL algorithm in the same scenario. We set multiple parameters to quantify the safety, efficiency, and comfort of autonomous vehicles. For the safety performance indicators, we use the success rate and the average front-vehicle distance as metrics. The success rate effectively reflects the probability of collisions or other hazardous behaviors under the adopted methods; the average front-vehicle distance, as another safety indicator, suggests that a greater distance allows more time and space to handle emergencies, enhancing safety. For the efficiency indicators, we quantify these through the average speed and the speed standard deviation, where the speed standard deviation effectively reflects the stability of the driving speed. Regarding the comfort performance indicators, we use the average steering wheel angle, the average lateral acceleration of the vehicle, and its standard deviation as metrics. A smaller average steering wheel angle indicates smoother driving with less severe directional changes; a lower average lateral acceleration means that passengers experience less lateral force during turns, significantly improving the comfort of the ride. On the CARLA simulation platform, we use the Python API to obtain sensor data and collect this state information in real time.
Regarding the safety performance metric, data from Table 2 shows that the policies updated using the three methods achieved a success rate of over 90% in regular, difficult, and out-of-training-distribution scenarios. The policies updated using the MLSCPO method and IDM+MOBIL even achieved a 100% success rate, demonstrating the broad adaptability of the MLSCPO method. Additionally, the parameter of average distance to the car in front indicates that the MLSCPO method maintained the greatest safety distance in any scenario. This provided ample time and space for the next action, significantly ensuring the safety and reliability of the driving process.
Regarding efficiency performance metrics, the data in Table 2 shows that, whether in regular or challenging scenarios, the policies updated using the MLSCPO method achieve the highest average speed. Although the average speed of all three methods decreases in challenging scenarios, the MLSCPO method shows the smallest reduction in average speed. This demonstrates that, in uncertain scenarios, the MLSCPO method maintains the highest driving efficiency, as indicated by the average speed parameter. Additionally, the speed standard deviation parameter reflects the stability of vehicle speed across different methods. The data shows that, in various scenarios, the MLSCPO method has the smallest speed standard deviation compared to the PPO-Lag and IDM+MOBIL methods, indicating that the MLSCPO method not only ensures efficient driving but also provides the best stability.
Regarding comfort performance metrics, we quantified the parameters using the average steering wheel angle, average lateral acceleration, and acceleration standard deviation, as shown in Table 2. The data indicates that although the average steering wheel angle using the MLSCPO method is not always the smallest, the average steering wheel angle for all three methods (IDM+MOBIL, PPO-Lag and MLSCPO) is less than 0.01 rad in both regular and challenging scenarios. This suggests that the steering wheel control is very smooth for policies updated by all three methods during travel. Additionally, the average lateral acceleration for all three methods is below 0.1, meaning that there is minimal force during vehicle turns, ensuring a high level of passenger comfort. Furthermore, the acceleration standard deviation under the MLSCPO method is smaller compared to the IDM+MOBIL and PPO-Lag methods. This indicates that with the MLSCPO method, the EV avoids sudden accelerations or decelerations, resulting in a very comfortable driving experience.
Figure 6 presents the policies obtained using MLSCPO, PPO-Lag, and IDM+MOBIL algorithms, which were applied in two different difficulty highway test scenarios shown in Figure 4 for comparative simulation testing. The results include statistical data on the speed information and steering angle information of the EV under the control of the three algorithms (positive values indicate right turns, negative values indicate left turns, higher values indicate larger angles, and a value of 0 indicates the steering wheel is centered).
The speed and steering angle curves of the MLSCPO, PPO-Lag, and IDM+MOBIL algorithms in different test scenarios.
Figures 6(a) and 6(b) respectively correspond to the EV speed variation and steering wheel angle variation curves obtained from 200 time steps in the conventional test scenario shown in Figure 4, using the MLSCPO algorithm, PPO-Lag algorithm, and IDM+MOBIL algorithm. In conventional scenarios with smooth traffic and larger vehicle spacing, the EV has a high degree of freedom, with ample space for lane changes and acceleration. As seen in the curves in Figure 6(a), under sparse traffic conditions, vehicles controlled by all three algorithms can achieve high speed. However, their speed stability varies when encountering obstacles. Compared to the other two methods, the MLSCPO algorithm maintains a higher speed while avoiding obstacles. Around the 60th time step, the IDM+MOBIL algorithm opts for a policy of decelerating and following when facing an obstacle ahead of the EV, continuing this for approximately 60 time steps before overtaking, which greatly reduces driving efficiency. In contrast, the PPO-Lag algorithm, like the MLSCPO method, quickly makes an overtaking maneuver, ensuring driving efficiency, though its speed stability is not as good as that of the EV under the MLSCPO control policy. Additionally, as seen from the curve changes in Figure 6(b), during the lane change maneuver of the EV, the MLSCPO method achieves the smallest peak steering wheel angle, ensuring a comfortable driving experience. By contrast, the steering wheel angle under the IDM+MOBIL control shows abrupt changes, which is very dangerous at high speeds and reduces passenger comfort. Compared to the PPO-Lag method, the EV under the MLSCPO control exhibits smaller fluctuations in steering wheel angle changes, resulting in better driving stability.
Figures 6(c) and 6(d) correspond to the EV speed variation and steering wheel angle variation curves obtained over 400 time steps in the challenging test scenarios shown in Figure 4, using the MLSCPO algorithm, PPO-Lag algorithm, and IDM+MOBIL algorithm. In challenging scenarios, traffic flow is dense, restricting the EVs’ freedom compared to conventional scenarios. They are constrained by SVs and sometimes must decelerate and follow other vehicles. Maintaining high driving efficiency in such scenarios is a significant challenge for decision-making algorithms. As shown in the curves in Figure 6(c), even in dense traffic, the MLSCPO method demonstrates higher decision-making intelligence compared to the PPO-Lag and IDM+MOBIL algorithms, maintaining a higher speed while ensuring safety. During the sampling period, vehicles controlled by the MLSCPO algorithm exhibited significantly less speed fluctuation than those controlled by the other two methods. Even when the vehicles controlled by the MLSCPO algorithm were forced to decelerate and follow, their average speed remained higher than those controlled by the PPO-Lag and IDM+MOBIL algorithms, ensuring higher driving efficiency. According to the curves in Figure 6(d), it can also be observed that the EVs under the control of the MLSCPO algorithm performed two lane change maneuvers, while the vehicles controlled by the PPO-Lag and IDM+MOBIL methods only performed one lane change maneuver before choosing to follow the vehicle ahead. This further demonstrates the greater flexibility of the MLSCPO algorithm, maintaining driving efficiency even in challenging scenarios. Additionally, during lane change maneuvers, the EVs controlled by the MLSCPO algorithm maintained the lowest peak steering wheel angle, ensuring a comfortable driving experience while achieving efficient driving.
In summary, by comparing the vehicle performance indicators after updating the policy control using three methods, and by sampling data in both conventional and challenging test scenarios, we can see that the MLSCPO algorithm demonstrates higher decision-making intelligence compared to the PPO-Lag algorithm and the IDM+MOBIL method. It guides the vehicle to perform more efficiently while ensuring safety. When encountering obstacles, the vehicle quickly decides to change lanes when conditions allow, fully ensuring driving efficiency. Furthermore, during lane changes, the vehicle does not experience significant steering angle changes, thereby ensuring ride comfort and driving safety.
Moreover, by decoupling the state, reward, and cost value functions from the environment, the MLSCPO algorithm ensures robust training performance even in highly uncertain scenarios. It effectively reduces risk in interactive situations, demonstrates adaptive decision-making during overtaking maneuvers, and maintains efficient driving capabilities under various traffic densities. The adaptability and reliability of the MLSCPO method enable it to extend to other highway scenarios, such as bidirectional multi-lane highways and highways with varying road widths, and it also has the potential for application in urban roads and other complex traffic environments. Before deploying the algorithm in new environments, it may be necessary to fine-tune the model’s hyperparameters, such as adjusting the coefficients of the reward and cost functions, the size of the vehicle monitoring area, and other specific environmental parameters. For more complex scenarios, additional state variables (e.g., driving direction) can be introduced to provide a more comprehensive description of traffic conditions, or the range of action options (e.g., driving routes) can be expanded to achieve more sophisticated control.
Conclusion
This paper proposes a constraint-driven safe RL method and applies it to learning autonomous driving decision-making policies in highway scenarios. The framework introduces the concept of a constraint domain based on the trust domain. It formulates the autonomous driving decision-making problem as a CMDP within a safe RL framework, ensuring that the control policy of the EV maximizes rewards while satisfying safety constraints. The study further proposes the MLSCPO method, which introduces a cost function to account for safety constraints, allowing different control policy updates under various safety conditions. Through this policy update scheme, MLSCPO decouples the state from specific scenarios, significantly enhancing adaptability and reliability when facing complex and changing scenarios, even those outside the training distribution. Simulation tests on the CARLA platform demonstrate that the proposed method outperforms the advanced safe reinforcement learning method PPO-Lag and the classic stable IDM+MOBIL algorithm in terms of driving safety, driving efficiency, and driving comfort.
Although the proposed method achieves a certain degree of decoupling between state settings and scenarios, the various subjective factors present in real-world driving, along with the increased randomness in driving conditions, result in more edge cases. These factors present challenges for the practical implementation of the MLSCPO method in real-world applications. To ensure an effective transition, it is essential to further fine-tune the policy neural network trained in the simulation environment when deploying it in the real world. This process may involve adjusting the algorithm’s hyperparameters to accommodate more complex scenarios and expanding the state inputs to provide a more comprehensive and detailed description of the environment. During the interaction between the policy and the real-world environment, which is necessary to obtain rewards and costs for adapting to real-world conditions, it is important to address issues such as policy forgetting and the efficiency of real-world sampling. Furthermore, a safe transition from simulation to real-world applications may require the integration of various methodologies. For instance, integrating autonomous driving perception systems can ensure the rapid and accurate collection of scenario information necessary for the MLSCPO decision-making method, significantly enhancing adaptability, safety, and reliability under different road and traffic conditions; incorporating Large Language Models (LLMs) can enhance general knowledge, allowing for human-like understanding of diverse scenarios, thus improving the system’s adaptability and robustness in various traffic situations; implementing formal safety driving rules as a safety redundancy mechanism can help maintain the vehicle’s safety in hazardous situations and use these rule-based behaviors to guide updates in the reinforcement learning policy. In the future, we aim to incorporate the aforementioned potential extensions to improve the applicability and generalizability of the method in real-world scenarios.
Conflict of Interest
On behalf of all the authors, the corresponding author states that there is no conflict of interest.