Introduction
Deep reinforcement learning (DRL) has emerged as a critical technology for autonomous decision-making and learning in various fields, including autonomous driving, robotics, drones, and medical devices [1], [2], [3]. DRL systems excel at solving complex decision-making problems in large state spaces, but ensuring safety during real-time operation remains a significant challenge. Incorrect decision-making by autonomous vehicles, drones, or robots can lead to catastrophic failures, making it crucial to predict and mitigate potential safety violations during both the training phase and real-world operation. Consequently, a systematic verification method is required to assess the failure probabilities and ensure that DRL systems meet strict safety requirements throughout their lifecycle [4], [5].
Given the complexity and size of the state space in DRL systems, it is essential to reduce the dimensionality of the state space to improve learning efficiency. State clustering is an effective technique that groups similar states together, simplifying the state space and reducing computational overhead during training and verification. However, a key issue in this process is determining the optimal number of clusters. If the number of clusters is too small, important state information may be lost, while too many clusters can unnecessarily increase computational cost. Therefore, balancing efficiency and safety when determining the optimal number of clusters is a critical challenge.
To validate our approach, we conducted experiments in two widely used simulation environments: LunarLander [6] and Highway [7]. The LunarLander environment represents a two-dimensional safe-landing task where an agent must control its thrust and orientation to land a spacecraft on a designated landing pad with minimal velocity. This environment serves as a benchmark for verifying safety in continuous control tasks, particularly under failure risks such as excessive velocity or angular drift. On the other hand, the Highway environment simulates multi-lane traffic where an ego-vehicle navigates through surrounding vehicles while avoiding collisions and optimizing for high-speed travel. This environment is well-suited for evaluating safety verification in autonomous driving systems, as it captures critical decision-making tasks like lane-changing and collision avoidance.
In this study, we employ the KMeans++ algorithm for state clustering and determine the optimal number of clusters by doubling the cluster count incrementally and computing the objective function at each step. The objective function balances failure probability and counterexample path complexity to identify the most effective number of clusters. This approach enables us to efficiently reduce the state space while preserving the critical state information needed for learning and verification.
After clustering the state space, we construct a DTMC model for each cluster. A DTMC models the probabilistic transitions between discrete states, providing a formal representation of state transitions in DRL systems. Based on the DTMC model, we use the PRISM model checker to verify the safety of the DRL system. PRISM is a powerful tool that allows for the formal verification of probabilistic systems. In this study, we focus on verifying the system’s failure probability to evaluate how likely it is for the system to enter a failure state during its operation.
Additionally, we verify the invariance property to ensure that the system does not transition into a failure state under any possible path. This property is essential for guaranteeing the overall safety of the system by confirming that failure is avoided in all potential trajectories.
Once these properties are verified using PRISM [8], we conduct a counterexample analysis to investigate the specific paths that could lead to system failure. A counterexample identifies concrete scenarios where the system might reach a failure state, providing critical insights into the causes of safety violations. Through this analysis, we can trace the potential failure paths and identify critical transitions that may require further attention in the system’s learning policy to prevent similar failures in future operations.
The contributions of this research are threefold. First, we present a novel approach that combines KMeans++ clustering with objective function calculations to determine the optimal number of clusters for a DRL system. This method balances clustering efficiency and safety by ensuring that important state information is preserved while reducing the computational burden. Second, we develop DTMC models for the clustered states and rigorously verify the system’s failure probabilities and invariance properties using the PRISM model checker. Third, we incorporate counterexample analysis to trace failure paths and identify key failure mechanisms, providing valuable insights that can be used to improve the system’s safety and performance.
This paper is organized as follows: Section II provides a thorough review of existing literature on state abstraction, clustering techniques, and safety verification in DRL systems, highlighting key contributions and identifying research gaps. Section III details the methodology, including state collection, labeling, and clustering with the KMeans++ algorithm, followed by the construction of the DTMC model. This section also describes the iterative process of determining the optimal number of clusters using an objective function that balances failure probability and path complexity. Additionally, it covers the use of the PRISM model checker for formal verification and counterexample analysis to identify critical failure paths. Section IV presents the experimental results and analysis conducted in the LunarLander and Highway environments, demonstrating how varying the number of clusters affects system safety and failure detection. Section V concludes the study by summarizing key findings, discussing implications, and proposing future research directions to further enhance the reliability and safety of DRL systems.
Literature Review
DRL has made significant progress in addressing challenges related to managing large state spaces efficiently and ensuring safety in real-world applications, such as autonomous driving, robotics, and complex decision-making systems. This section provides a comprehensive review of key studies related to state abstraction, clustering techniques, and safety verification, which are essential for improving the performance and safety of DRL systems. By examining these studies, this research positions itself within the broader academic landscape and identifies the unique contributions it makes.
A. State Abstraction and Reinforcement Learning
State abstraction is a critical tool for reducing the complexity of large state spaces, enabling agents to learn more efficiently. Large state spaces increase the temporal and computational costs of learning and expose agents to unnecessary exploration of redundant or irrelevant states. Therefore, state abstraction plays a pivotal role in improving sample efficiency and enhancing system performance.
Tomar et al. [9] introduced a causal structure-based abstraction technique for model-based reinforcement learning (MBRL) to reduce redundant states and improve learning efficiency. Similarly, Lan et al. [10] developed an online state abstraction method using causal transformers that dynamically simplify the state space during learning, resulting in improved sample efficiency.
Building on these advancements, Pan et al. [11] proposed an environment abstraction method that combines state clustering with parameter truncation for efficient system verification. Their method reduces complexity by abstracting the environment’s state space, enabling large-scale systems like cache coherence protocols to be analyzed effectively. However, Pan et al.’s work primarily focuses on classical systems, whereas our research applies state abstraction via KMeans++ clustering specifically for DRL. This abstraction is subsequently combined with PRISM-based formal verification to ensure quantitative safety guarantees, providing a novel and integrated approach to DRL safety verification.
B. Clustering Techniques and KMeans
State clustering plays a critical role in reducing the complexity of state spaces in RL. By grouping states with similar properties, clustering techniques enable efficient exploration, learning, and safety verification in large-scale systems. Several clustering approaches have been proposed to address the challenges of state abstraction, computational efficiency, and noise handling.
Sendi et al. [12] introduced the iSparse KMeans technique, which efficiently clusters dynamic datasets by optimizing the selection of representative samples. This method significantly reduces computational time without compromising clustering quality, making it particularly suitable for large-scale data environments. Similarly, Ouyang et al. [13] applied hierarchical clustering to divide the state space into smaller, manageable regions. By training RL agents on these clusters individually, they demonstrated improved learning efficiency for autonomous systems.
Corsi et al. [14] extended clustering methods into safety-critical applications with a verification-guided shielding approach. Their method dynamically clusters unsafe regions in the state space and activates “shields” selectively to prevent unsafe actions during runtime. While Corsi et al.’s work focuses on real-time performance improvements by dynamically identifying safety violations, our approach differs in its systematic offline abstraction of the entire state space. Specifically, we employ KMeans++ clustering to partition the continuous state space into a finite set of clusters, ensuring a well-defined and manageable state representation. This clustered state space is subsequently analyzed using PRISM for formal safety verification, allowing rigorous failure probability calculations and counterexample tracing. By integrating state abstraction with formal probabilistic verification, our methodology ensures both computational efficiency and precise safety guarantees.
Despite the advantages of KMeans++, alternative clustering techniques such as DBSCAN and Gaussian Mixture Models (GMM) have been explored to address state heterogeneity and noise. For instance, Liu et al. [15] proposed Multi-Level DBSCAN, which performs hierarchical clustering across multiple density levels, capturing both coarse and fine-grained structures. Hasana and Fitrianah [16] demonstrated that ensemble approaches combining DBSCAN with UMAP-based dimensionality reduction significantly improve clustering accuracy in noisy environments. While these methods offer robustness, their computational complexity remains a challenge for large-scale DRL state spaces. In contrast, KMeans++ strikes a balance between efficiency and accuracy, making it particularly suitable for our goal of state abstraction in probabilistic safety verification.
C. Safety Verification and PRISM
Safety verification is essential for ensuring that DRL systems satisfy safety-critical properties, particularly in applications such as autonomous driving and robotics. Probabilistic model checking tools like PRISM are widely used to analyze system behaviors by formally verifying properties such as failure probabilities, invariance, and reachability.
Akazaki et al. [17] applied falsification techniques to generate counterexamples that violate safety properties in cyber-physical systems (CPS) controlled by DRL agents. Their work demonstrated the importance of identifying unsafe trajectories during training to improve policy safety. Gupta and Hwang [18] proposed the use of reachable sets for verifying safety properties in MBRL systems, enabling the identification of safe trajectories and avoidance of dangerous regions in the state space. Ji and Filieri [19] introduced a probabilistic model checking framework to trace counterexample paths that lead to safety violations, providing actionable insights for refining RL policies.
Amir et al. [20] addressed the scalability challenge of formal verification in DRL systems by integrating k-induction and invariant inference techniques. Their approach effectively verified safety and liveness properties in large-scale systems, such as communication networks, but primarily focused on policy-level verification. Zarei et al. [21] introduced a statistical verification framework for learning-based CPS, providing scalable and confidence-based safety verification for systems such as robotic controllers and autonomous vehicles.
In contrast to these studies, our research focuses on state-level safety verification by combining KMeans++ clustering with PRISM-based probabilistic model checking. This approach provides a fine-grained analysis of failure probabilities and unsafe transitions within the clustered state space. By abstracting the continuous state space into discrete clusters, we enable rigorous verification of safety properties while managing the complexity of large-scale DRL environments. Furthermore, the counterexample path analysis conducted in our framework allows system designers to identify and mitigate specific failure mechanisms, offering a structured pathway for enhancing policy safety.
Through the integration of state abstraction, DTMC modeling, and formal verification, our work advances the state-of-the-art in DRL safety verification by providing a robust and scalable framework for assessing system behavior under safety-critical conditions.
D. Reinforcement Learning and Safety
Ensuring safety in RL while maintaining learning efficiency has been the focus of several recent studies. Many frameworks combine safety constraints, reachability analysis, and formal verification techniques to address the trade-off between exploration and safety.
Selim et al. [22] proposed BRSL, which computes reachable sets in real time to plan safe trajectories for robots. Chen and Liu [23] introduced a Safe Set Algorithm (SSA) for safe exploration in clustered environments, ensuring that agents avoid unsafe states during training.
Expanding on safety verification for complex systems, Zarei et al. [21] proposed a statistical verification method for learning-based CPS. Their approach uses statistical model checking to evaluate the safety and reliability of neural network-based controllers. While Zarei et al. provide probabilistic safety guarantees, our research enhances this by employing formal verification for DRL systems. Specifically, we integrate state clustering with PRISM to systematically analyze safety risks, ensuring deterministic and precise safety guarantees, which are critical in high-risk environments.
The studies reviewed in this section illustrate significant advancements in state abstraction, clustering techniques, and safety verification for RL. While existing approaches have focused on improving learning efficiency, runtime safety, or probabilistic guarantees, our research uniquely combines:
KMeans++-based state clustering to simplify large state spaces,
PRISM-based formal verification to rigorously analyze failure probabilities and trace unsafe counterexample paths.
This integration bridges the gap between performance optimization and safety assurance, providing a systematic and precise approach to ensuring the safety of DRL systems in real-world applications.
Methodology
This research presents a systematic methodology for verifying the safety of DRL systems, focusing on state collection and labeling, state clustering, DTMC model construction, and safety verification. Each stage is crucial for evaluating system performance and ensuring safety. Through this approach, we analyze failure probabilities and identify improvement strategies for DRL systems. Figure 1 illustrates the overall workflow of the proposed methodology, depicting the key steps including state collection, clustering using KMeans++, DTMC model construction, PRISM-based formal verification, and counterexample analysis. This diagram provides a clear visual summary of the sequential processes involved in the methodology.
A. State Collection and Labeling
To abstract the state space of DRL systems, we first collect states experienced by the agent during simulation, based on its learned policy. The collected states reflect a wide range of situations the agent encounters during operation, and these states are sampled randomly.
After collecting the states, we label each state based on its outcome to understand whether it is associated with a system success or failure, and to trace how the system transitions into failure states. In this study, we assign appropriate labels based on the experimental environments of LunarLander and Highway (autonomous driving simulation).
In the LunarLander environment, the lander’s states during its landing attempts are labeled as follows:
Landed: The lander has successfully landed, making a safe touchdown on the surface. This is determined when the reward is equal to 100.
Crashed: The lander has landed unsuccessfully, with its body colliding with the surface. This occurs when the reward is equal to -100.
Out of bounds: The lander has moved out of the bounds of the x-axis, specifically when its x-coordinate exceeds the range of -1.0 to 1.0, making a valid landing attempt impossible.
Landing: If none of the above conditions are met, the lander is labeled as attempting to land, descending towards the surface.
The labeling criteria are derived from the reward signals and positional constraints provided by the simulation environment, ensuring an objective and consistent categorization of states.
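As a concrete sketch, the LunarLander labeling rules above can be expressed as a small function. The function name and the assumption that the reward comes from the environment's step output and the x-coordinate from the observation vector are illustrative, not part of the original implementation:

```python
def label_lander_state(reward, x):
    """Label a LunarLander state from its reward signal and x-coordinate,
    following the criteria described in Section III-A."""
    if reward == 100:
        return "landed"          # safe touchdown on the landing pad
    if reward == -100:
        return "crashed"         # body collided with the surface
    if x < -1.0 or x > 1.0:
        return "out_of_bounds"   # outside the valid x-range [-1.0, 1.0]
    return "landing"             # still descending toward the surface
```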
In the autonomous driving scenario within the Highway environment, the states of the vehicle are labeled as follows:
Collision: The vehicle has collided with another vehicle or object, causing the episode to terminate before completing the predefined step limit of 30 steps.
Safe: The vehicle has successfully completed the episode without any incidents, maintaining a safe state until the predefined step limit.
Driving: If the episode is ongoing and neither a collision nor the safe termination condition has been met, the vehicle is labeled as driving normally.
The labeling for the Highway environment is determined based on the episode’s termination flags and the number of elapsed steps, providing a clear distinction between safe, failure, and intermediate driving states.
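The Highway labeling can be sketched analogously; the names and the exact form of the termination flag are illustrative assumptions:

```python
STEP_LIMIT = 30  # predefined episode step limit stated in the text

def label_highway_state(crashed, step):
    """Label a Highway state from the episode's termination flag and the
    number of elapsed steps (Section III-A)."""
    if crashed:
        return "collision"   # episode terminated early by a crash
    if step >= STEP_LIMIT:
        return "safe"        # completed the episode without incident
    return "driving"         # episode still in progress
```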
These labels allow for a precise understanding of the agent’s behavior across various scenarios, enabling further analysis during the clustering and safety verification processes.
B. State Clustering
After labeling, we perform state clustering separately for each labeled group to ensure that clustering is conducted on states within the same label. This approach reduces the complexity of the state space while preserving the structure necessary for subsequent analysis.
State clustering is performed using the KMeans++ algorithm, an improved version of the traditional KMeans algorithm. KMeans++ optimizes the initialization of cluster centroids, which accelerates convergence and enhances the consistency of the clustering results. The steps for KMeans++ clustering are as follows:
Initialization: The first cluster centroid $\mu _{1}$ is selected randomly. Subsequent centroids are chosen probabilistically based on their squared distances from the nearest existing centroids:\begin{equation*} P(s) = \frac {d(s, \mu _{j})^{2}}{\sum _{s' \in \mathcal {S}} d(s', \mu _{j})^{2}}, \tag {1}\end{equation*} where s is a state, $\mu _{j}$ is the nearest existing centroid, and $\mathcal {S}$ is the set of all states.
Assignment Step: Each state $s \in \mathcal {S}$ is assigned to the nearest cluster centroid $\mu _{j}$ based on the squared Euclidean distance:\begin{equation*} d(s, \mu _{j}) = \| s - \mu _{j} \|^{2}. \tag {2}\end{equation*}
Centroid Update: The centroids $\mu _{j}$ are updated as the mean of all states assigned to cluster $C_{j}$:\begin{equation*} \mu _{j} = \frac {1}{|C_{j}|} \sum _{s \in C_{j}} s, \tag {3}\end{equation*} where $|C_{j}|$ is the number of states in cluster $C_{j}$.
Convergence Check: The process repeats until the centroids stabilize. Convergence is achieved when the change in centroids satisfies:\begin{equation*} \|\mu _{j}^{(t+1)} - \mu _{j}^{(t)}\| \lt \epsilon, \quad \forall j, \tag {4}\end{equation*} where $\epsilon $ is a small predefined threshold, and t is the iteration number.
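The probabilistic seeding of Eq. (1) can be sketched with a minimal NumPy implementation (in practice we rely on scikit-learn's built-in k-means++ initialization; the function name and seed handling here are illustrative):

```python
import numpy as np

def kmeanspp_init(states, k, seed=0):
    """Sketch of k-means++ seeding: the first centroid is chosen uniformly
    at random; each subsequent centroid is drawn with probability
    proportional to its squared distance to the nearest centroid chosen so
    far, i.e. P(s) in Eq. (1)."""
    rng = np.random.default_rng(seed)
    centroids = [states[rng.integers(len(states))]]
    for _ in range(k - 1):
        # d(s, mu_j)^2 to the nearest already-chosen centroid
        d2 = np.min([np.sum((states - c) ** 2, axis=1) for c in centroids],
                    axis=0)
        probs = d2 / d2.sum()  # normalize squared distances to get P(s)
        centroids.append(states[rng.choice(len(states), p=probs)])
    return np.array(centroids)
```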
To determine the optimal number of clusters k, we apply the Elbow Method. This method calculates the Sum of Squared Errors (SSE) for increasing values of k and identifies the point where increasing k yields diminishing returns:\begin{equation*} SSE = \sum _{i=1}^{k} \sum _{s \in C_{i}} \| s - \mu _{i} \|^{2}, \tag {5}\end{equation*} where $C_{i}$ is the i-th cluster and $\mu _{i}$ its centroid.
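The SSE computation of Eq. (5) can be sketched with scikit-learn, whose `inertia_` attribute is exactly the sum of squared distances of samples to their closest centroid; the synthetic data and the doubling schedule below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, ks):
    """Compute the SSE (Eq. 5) for each candidate cluster count k."""
    return [
        KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
        .fit(X)
        .inertia_
        for k in ks
    ]

# illustrative data and a doubling schedule like the 1X..128X experiments
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
sse = sse_curve(X, [1, 2, 4, 8])  # SSE shrinks as k grows; the "elbow"
                                  # marks diminishing returns
```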
The application of KMeans++ in this study relies on the following assumptions:
The state space is continuous and sufficiently well-structured, such that distances (e.g., Euclidean) between states provide meaningful clustering results.
The number of clusters k is fixed and determined using the Elbow Method, ensuring a balance between abstraction quality and computational cost.
The state distance metric effectively captures relationships and transitions between states after sufficient training.
Despite its advantages, KMeans++ has the following limitations:
Sensitivity to Initialization: Although KMeans++ improves centroid selection, it remains sensitive to initialization. To address this, we perform multiple clustering runs with different initial conditions and select the clustering result with the lowest SSE.
Fixed Cluster Number: A static k may limit adaptability to highly dynamic or heterogeneous state spaces. However, empirical validation ensures that the chosen k maintains abstraction quality without excessive computational cost.
Alternative clustering methods were considered but deemed less suitable for the scope of this study:
DBSCAN: Capable of identifying arbitrarily shaped clusters without a predefined k, but struggles with high-dimensional and non-uniform state spaces.
GMM: Provides probabilistic clustering with soft assignments but is computationally expensive and sensitive to initialization.
Hierarchical Clustering: Enables multi-scale clustering but incurs quadratic time complexity, making it impractical for large-scale DRL systems.
The choice of KMeans++ in this study is driven by its computational efficiency, scalability, and practical performance in clustering continuous DRL state spaces. While alternative methods have their strengths, this study focuses on integrating state clustering with formal safety verification to simplify the state space and facilitate rigorous failure probability analysis and counterexample tracing.
The KMeans++ algorithm was implemented using the scikit-learn Python library. The default initialization strategy of KMeans++ was employed for centroid selection. To ensure reproducibility of clustering results, we fixed the random seed across all clustering runs.
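A minimal sketch of this per-label, reproducible clustering setup (the function name and seed value are illustrative; Section III-B states only that clustering is performed separately per label with a fixed seed):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_label(states, labels, k, seed=42):
    """Cluster states separately within each label group, using k-means++
    seeding and a fixed random_state for reproducibility.
    Returns a {label: fitted KMeans model} mapping."""
    labels = np.asarray(labels)
    models = {}
    for lab in np.unique(labels):
        group = states[labels == lab]
        n = min(k, len(group))  # cannot request more clusters than states
        models[lab] = KMeans(
            n_clusters=n, init="k-means++", n_init=10, random_state=seed
        ).fit(group)
    return models
```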
C. DTMC Model Construction
Once clustering is complete, we construct the DTMC model using the clustered states. The DTMC model abstracts the system’s behavior into a probabilistic framework, enabling the analysis of state transitions and verification of safety properties.
A DTMC is formally defined as a 4-tuple:\begin{equation*} M = (S, P, \pi _{0}, AP), \tag {6}\end{equation*} where:
S: A finite set of states, where each state corresponds to a cluster generated by the KMeans++ algorithm.
$P: S \times S \to [{0, 1}]$: The state transition probability matrix. $P(s, s')$ represents the probability of transitioning from state $s \in S$ to state $s' \in S$, subject to the condition:\begin{equation*} \forall s \in S, \quad \sum _{s' \in S} P(s, s') = 1. \tag {7}\end{equation*}
$\pi _{0}$: The initial state distribution, which specifies the probability of the system starting in each state.
AP: A set of atomic propositions that define specific properties or conditions of interest for states (e.g., failure states labeled as fail).
1) Transition Probability Calculation
The transition probabilities in the DTMC model are derived from the frequency of transitions observed in the system’s simulation data. Given the set of clusters $\{C_{1}, \ldots, C_{k}\}$, the probability of transitioning from cluster $C_{i}$ to cluster $C_{j}$ is estimated as:\begin{equation*} P(C_{j} | C_{i}) = \frac {\sum _{s \in C_{i}, s' \in C_{j}} N(s, s')}{\sum _{s \in C_{i}} \sum _{s' \in S} N(s, s')}, \tag {8}\end{equation*} where $N(s, s')$ denotes the number of observed transitions from state s to state $s'$.
Within a cluster $C_{i}$, the probability of occupying a particular state $s'$ is computed from the observed intra-cluster transitions:\begin{equation*} P_{\text {intra}}(s' | C_{i}) = \frac {\sum _{s \in C_{i}} N(s, s')}{\sum _{s \in C_{i}} \sum _{s' \in C_{i}} N(s, s')}, \quad \forall s' \in C_{i}. \tag {9}\end{equation*}
2) DTMC Transition Matrix
The state transition matrix P for the DTMC is constructed using the calculated inter-cluster and intra-cluster probabilities. Each entry $P(s, s')$ must satisfy the row-normalization condition:\begin{equation*} \sum _{s' \in S} P(s, s') = 1, \quad \forall s \in S. \tag {10}\end{equation*}
The resulting transition matrix P is a row-stochastic $|S| \times |S|$ matrix, where $|S|$ equals the number of clusters.
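The frequency-based estimation of P (Eqs. 8 and 10) can be sketched as follows, assuming trajectories have already been mapped to cluster indices; the absorbing-state fallback for clusters with no observed outgoing transitions is our modeling choice for the sketch, not prescribed by the text:

```python
import numpy as np

def build_transition_matrix(trajectories, n_clusters):
    """Estimate a DTMC transition matrix from cluster-index trajectories by
    frequency counting (Eq. 8), then normalize each row so Eq. (10) holds."""
    counts = np.zeros((n_clusters, n_clusters))
    for traj in trajectories:
        for i, j in zip(traj, traj[1:]):
            counts[i, j] += 1                 # N(s, s'): observed transitions
    P = np.zeros_like(counts)
    for i, row in enumerate(counts):
        total = row.sum()
        if total > 0:
            P[i] = row / total                # relative transition frequency
        else:
            P[i, i] = 1.0                     # absorbing self-loop keeps P
                                              # row-stochastic
    return P
```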
3) Initial State Distribution
The initial state distribution $\pi _{0}$ is assumed to be uniform over the clustered states:\begin{equation*} \pi _{0}(s) = \frac {1}{|S|}, \quad \forall s \in S. \tag {11}\end{equation*}
4) Atomic Propositions and Safety Properties
To facilitate safety verification, we define a set of atomic propositions AP, which describe specific properties of interest for the states. For example, a failure state can be denoted as:\begin{equation*} \textit {fail} \in AP. \tag {12}\end{equation*}
5) DTMC Model Integration
The DTMC model constructed in this way abstracts the complex dynamics of the system into a manageable state space. By leveraging the clustered states generated by KMeans++, the DTMC enables the analysis of failure probabilities, identification of unsafe paths, and formal verification of safety requirements.
D. Safety Verification: Formal Verification with PRISM
After constructing the DTMC model, we use the PRISM tool to formally verify the safety of the DRL system. PRISM is a model checker for probabilistic systems that allows us to verify system properties using Probabilistic Computation Tree Logic (PCTL). In this research, we verify both the failure probability and invariance properties.
We verify the failure probability of the system using a PCTL formula that calculates the likelihood of the system reaching a failure state. The property we verify in PRISM is:\begin{equation*} P=? [F \ "fail"]\end{equation*} where F denotes the “eventually” operator, so the formula queries the probability that the system eventually reaches a state labeled fail.
We also verify the invariance property, which ensures that the system does not transition into a failure state along any possible path. This is essential for ensuring the global safety of the system and is expressed in PCTL as:\begin{equation*} A [G \ \neg "fail"]\end{equation*}
Here, A means “for all paths,” and G stands for “globally,” indicating that the system must avoid the failure state in all possible trajectories.
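For illustration, a clustered DTMC with a failure label can be encoded in the PRISM modelling language roughly as follows; the three states and the transition probabilities are placeholders, not values from our experiments:

```
dtmc

module clustered_drl
  // s indexes the abstract clusters; cluster 2 plays the role of the
  // failure cluster in this toy example
  s : [0..2] init 0;

  [] s=0 -> 0.9 : (s'=1) + 0.1 : (s'=2);
  [] s=1 -> 0.7 : (s'=0) + 0.3 : (s'=2);
  [] s=2 -> 1.0 : (s'=2); // failure cluster is absorbing
endmodule

label "fail" = s=2;
```

Against such a model, the two properties above can be checked as written, with PRISM's `!` denoting negation in the invariance property.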
E. Counterexample Path Analysis
PRISM also generates counterexamples, which provide detailed failure paths that lead to the system entering a failure state. Analyzing these counterexamples helps us understand the specific transitions that caused the failure and identify points where the system’s policy can be improved.
Collecting Counterexamples: PRISM identifies specific paths through which the system reaches a failure state. These paths consist of a sequence of state transitions.
Step-by-Step Analysis: Each step in the counterexample is analyzed to determine where the system encountered a risk and failed to recover.
Policy Improvement: Based on the counterexample analysis, we refine the system’s policy to prevent similar failures in future operations.
For example, in the LunarLander experiment, rapid vertical velocity changes and an increase in angular velocity during the landing phase were identified as key failure factors. In the autonomous driving experiment, the failure to detect an approaching vehicle in time led to a collision. These insights help us modify the system’s policy to avoid such failures in future scenarios.
This methodology outlines the steps for collecting and labeling states, clustering these states using KMeans++, constructing a DTMC model, and verifying the system’s safety using PRISM. Through formal verification and counterexample analysis, we can accurately assess the failure probabilities and identify areas where the system can be improved to ensure safer operation in real-time environments. This approach provides a structured framework for analyzing and improving the safety of DRL systems, contributing to their reliable performance in critical applications.
Result and Analysis
A. Lunarlander
The goal of the LunarLander experiment is to analyze the failure probabilities and counterexample paths under varying cluster numbers to evaluate how reliably the system can perform a safe landing. Clustering was performed using the KMeans++ algorithm, and the optimal number of clusters was determined using the elbow method. Experiments were conducted with cluster sizes ranging from 1X to 128X, where the cluster size was doubled in each successive experiment.
Table 1 shows the failure probabilities and counterexample path lengths for each cluster configuration.
As observed in Table 1, the failure probability decreases from 0.7443 to 0.6644 when increasing the cluster size from 1X to 2X, indicating that better state distinction leads to improved failure detection. The probability remains relatively low for 4X, but rises again from 8X onward, reaching 0.7871 at 128X. This suggests that overly fine state segmentation introduces excessive sensitivity, causing the system to perceive minor state changes as failures.
Also, Table 1 shows the counterexample path lengths, which increase as the cluster size grows. The shortest path occurs with 1X (length 4), and the path length increases to 14 for 128X, indicating more complex failure paths at higher cluster sizes. In particular, the 16X cluster, with a path length of 8, strikes a balance between path complexity and failure detection. It allows for the identification of intricate state transitions without becoming overly convoluted.
To determine the optimal number of clusters, we formulated a multi-objective optimization problem that balances failure probability and counterexample path complexity. The objective function is defined as follows:\begin{equation*} \text {Objective}(k) = P_{f}(k) + \lambda \cdot \frac {L(k)}{k}\end{equation*}
where $P_{f}(k)$ is the failure probability for a given cluster size k, $L(k)$ is the counterexample path length, and $\lambda $ is a weighting factor that balances the importance of path complexity against failure probability.
The goal is to minimize this objective function, which represents a trade-off between minimizing failure probability and keeping counterexample paths manageable. A lower objective value indicates a better balance.
The failure probability function is empirically modeled as:\begin{equation*} P_{f}(k) = a \cdot e^{-b \cdot k} + c \cdot \log (k)\end{equation*} and the counterexample path length as:\begin{equation*} L(k) = \alpha \cdot \log (k) + \beta \end{equation*} where a, b, c, $\alpha $, and $\beta $ are constants fitted to the experimental data.
Substituting these into the objective function, we get:\begin{equation*} \text {Objective}(k) = a \cdot e^{-b \cdot k} + c \cdot \log (k) + \lambda \cdot \frac {\alpha \cdot \log (k) + \beta }{k}\end{equation*}
After evaluating and differentiating the objective function to find the optimal cluster size, we determined that 16X provides the best balance between failure probability and path complexity. Using the constants $a = 0.8$, $b = 0.1$, $c = 0.01$, $\alpha = 2$, $\beta = 3$, and $\lambda = 0.5$, the evaluation at $k = 16$ gives:\begin{align*} P_{f}(16) & = 0.8 \cdot e^{-0.1 \cdot 16} + 0.01 \cdot \log (16) = 0.1892 \\ L(16) & = 2 \cdot \log (16) + 3 = 8.5452 \\ \text {Objective}(16) & = 0.1892 + 0.5 \cdot \frac {8.5452}{16} = 0.4563\end{align*}
This confirms that 16X achieves the best trade-off between failure probability and counterexample path complexity.
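The worked values above can be reproduced directly from the two fitted models. The following is only a sketch of the evaluation, using the constants quoted in the text (not the authors' code):

```python
import math

# Constants from the worked example in the text:
# P_f(k) = a*exp(-b*k) + c*log(k),  L(k) = alpha*log(k) + beta,  weight lambda.
a, b, c = 0.8, 0.1, 0.01
alpha, beta = 2.0, 3.0
lam = 0.5

def failure_probability(k: int) -> float:
    """Fitted failure-probability model P_f(k)."""
    return a * math.exp(-b * k) + c * math.log(k)

def path_length(k: int) -> float:
    """Fitted counterexample path-length model L(k)."""
    return alpha * math.log(k) + beta

def objective(k: int) -> float:
    """Objective(k) = P_f(k) + lambda * L(k) / k."""
    return failure_probability(k) + lam * path_length(k) / k

# Reproduces the worked example for k = 16:
# P_f(16) ≈ 0.1892, L(16) ≈ 8.5452, Objective(16) ≈ 0.4563
```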
The 16X cluster’s counterexample path demonstrates how the LunarLander system progresses from an initially stable descent to a failure state, providing critical insights into system weaknesses. The lander’s state is represented by an 8-dimensional vector comprising its horizontal and vertical position, horizontal and vertical velocity, angle, angular velocity, and two Boolean flags indicating whether each leg is in contact with the ground.
Initial State (landing_0): The lander starts with a stable descent, maintaining a slight vertical velocity and a minor lateral drift. The angle and angular velocity are negligible, and neither leg is in contact with the ground. At this point, there is minimal risk of failure.
Minor Lateral Adjustment (landing_853): The lander begins moving slightly left, with a moderate horizontal velocity of 0.16. The vertical velocity has decreased slightly, and there is still no significant rotational movement. The lander remains stable but has not yet made ground contact.
Sudden Speed Increase (landing_750): Both the horizontal and vertical velocities increase sharply to 0.52 and 0.51, respectively, suggesting a loss of control. The angle remains stable, but the acceleration increases the likelihood of an unstable landing.
Increased Instability (landing_186): The lander drifts further left, with a horizontal velocity of 0.33 and a vertical velocity of 0.17. The descent continues, and although no significant angular motion is present, the lander’s control deteriorates.
Approaching Ground (landing_268): The lander moves left with a significant horizontal velocity of 0.74 and vertical velocity of 0.35, indicating a dangerously fast approach toward the ground. The increasing speed makes it more difficult for the lander to stabilize before impact.
Ground Contact Attempt (landing_535): One leg of the lander touches the ground, but the high horizontal velocity and angular velocity cause instability. Although the lander is close to landing, the conditions for a stable descent are not met.
Rapid Spin (landing_82): Both legs make contact, but the lander’s rotation accelerates dramatically, leading to an unstable situation. Despite ground contact, the angular velocity increases uncontrollably.
Crash (crashed_87): The lander spins rapidly out of control, with its angle sharply tilted and its angular velocity peaking at 2.56. The lander crashes due to the inability to reduce rotational speed and stabilize the descent.
The primary factors contributing to landing failures are identified as a rapid increase in both vertical and horizontal velocity and a loss of control over angular velocity. While ground contact is established, excessive speed and uncontrolled rotation significantly hinder a stable landing. These failure modes highlight critical limitations in the current DRL-based landing policy, particularly in its ability to manage high-velocity descent and stabilize orientation. For reinforcement learning-based landing policies to ensure a higher success rate, precise velocity regulation and angular stabilization mechanisms must be reinforced. In particular, the policy should be adapted to respond more sensitively to rapid changes in velocity and attitude immediately before touchdown, mitigating the risk of destabilization.
From the perspective of safety verification in deep reinforcement learning, these findings underscore the detrimental impact of dynamic instability and excessive velocity variations on policy generalization. Without proper control over these variables, the learned policy may fail to generalize effectively across diverse landing conditions. Consequently, future research should focus on designing reinforcement learning frameworks that can predict and proactively mitigate abrupt changes in velocity and attitude, thereby enhancing the robustness and reliability of autonomous landing systems.
B. Highway
The primary goal of the Highway environment experiment is to evaluate the safety and reliability of an autonomous driving system by analyzing failure probabilities and counterexample paths across different numbers of state clusters. By using the KMeans++ clustering algorithm and the elbow method, the optimal number of clusters was determined. Experiments were conducted by varying the number of clusters, from 1X to 128X, by doubling the number of clusters in each successive experiment.
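The clustering and elbow procedure can be illustrated with a stdlib-only sketch; the actual experiments would use a library implementation of KMeans++, and the toy two-blob data here is purely hypothetical:

```python
import random

def kmeans(points, k, iters=20, rng=None):
    """Plain k-means with k-means++ seeding (Lloyd's algorithm), stdlib only."""
    rng = rng or random.Random(0)
    # k-means++ seeding: pick the first center at random, then bias later
    # picks toward points far from the existing centers (distance-squared weights).
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
              for pt in points]
        total = sum(d2)
        r, acc, idx = rng.random() * total, 0.0, len(points) - 1
        for i, d in enumerate(d2):
            acc += d
            if acc >= r:
                idx = i
                break
        centers.append(list(points[idx]))
    # Lloyd iterations: assign each point to its nearest center, recompute means.
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for pt in points:
            j = min(range(k),
                    key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            clusters[j].append(pt)
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = [sum(xs) / len(cl) for xs in zip(*cl)]
    # Inertia = within-cluster sum of squared distances, used by the elbow method.
    inertia = sum(min(sum((p - q) ** 2 for p, q in zip(pt, ctr)) for ctr in centers)
                  for pt in points)
    return centers, inertia

# Toy 2-D "states": two well-separated groups standing in for environment states.
pts = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.4),
       (10.0, 10.0), (10.3, 9.8), (9.7, 10.2)]
_, inertia1 = kmeans(pts, 1)
_, inertia2 = kmeans(pts, 2)
# The elbow method looks for the k after which inertia stops dropping sharply;
# here inertia2 is far smaller than inertia1, so k = 2 matches the structure.
```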
The following table presents the failure probabilities and the lengths of the counterexample paths for each cluster configuration.
As observed in Table 2, the failure probability decreases as the number of clusters increases from 1X to 32X. For example, with 1X, the failure probability is 0.4522, which drops to 0.4044 at 16X, and further to 0.3934 at 32X, indicating that more granular state distinctions help the system detect potential risks more accurately.
However, for 64X and 128X, the failure probability increases significantly to 0.5001 and 0.5047, respectively. This trend suggests that over-segmentation of states can introduce excessive sensitivity, where the system begins to classify minor state transitions as failures. Therefore, increasing the number of clusters beyond an optimal level can lead to an undesirable increase in false positives, as the system becomes overly sensitive to unimportant state changes.
Thus, it can be concluded that an intermediate number of clusters, such as 16X, allows the system to detect critical risks effectively while avoiding excessive sensitivity to noise or minor state changes.
Table 2 also shows the lengths of the counterexample paths. The 1X cluster has the shortest path with a length of 1, indicating a simple failure path with minimal complexity. As the number of clusters increases, the path length reaches 3 at 16X, the longest observed in the experiment. This indicates that more clusters allow the system to detect more complex failure paths, as the increased state segmentation enables the system to identify more detailed state transitions leading to failure.
At 64X and 128X, the counterexample path length decreases to 2 again, suggesting that excessive state segmentation simplifies failure paths by classifying even trivial state changes as failures. This reduction in path length is an indication that the system is losing its ability to capture the more intricate state transitions that lead to significant failures. To balance these two effects and determine the optimal cluster size, we define a weighted objective function:\begin{equation*} J(k) = w_{1} \cdot P(k) + w_{2} \cdot \frac {1}{L(k)}\end{equation*}
where $P(k)$ is the failure probability for a given cluster size $k$, $L(k)$ is the length of the counterexample path for cluster size $k$, and $w_{1} = 0.7$ and $w_{2} = 0.3$ are the weights assigned to failure probability and counterexample path length, respectively.
The objective function combines these two factors, aiming to minimize failure probability while maximizing the path complexity (expressed as the inverse of path length). A lower value of the objective function indicates a better balance between failure detection and the system’s ability to capture complex failure scenarios.
The table below shows the results of applying the weighted objective function to the experimental data:
As shown in Table 3, the 16X cluster has the lowest objective function value of 0.3831, indicating that it provides the best balance between failure probability and counterexample path complexity. Therefore, 16X is identified as the optimal cluster size, offering the most effective combination of failure detection and system complexity management.
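The weighted objective can be checked against the values quoted from Tables 2 and 3. The sketch below evaluates $J(k)$ only for the configurations whose failure probability and path length are both reported in the text (the 2X–8X and 32X path lengths are not quoted, so they are omitted):

```python
# J(k) = w1 * P(k) + w2 * (1 / L(k)); lower is better.
w1, w2 = 0.7, 0.3

# (failure probability, counterexample path length) per configuration,
# as reported in the text.
table2 = {
    "1X": (0.4522, 1),
    "16X": (0.4044, 3),
    "64X": (0.5001, 2),
    "128X": (0.5047, 2),
}

scores = {name: w1 * p + w2 / length for name, (p, length) in table2.items()}
best = min(scores, key=scores.get)
# best == "16X"; scores["16X"] == 0.7*0.4044 + 0.3/3 ≈ 0.3831, matching Table 3.
```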
The counterexample path for the 16X cluster provides a detailed view of how the autonomous driving system transitions from a stable state to a failure state. The system’s state is represented by a vector of 25 values, which includes information about the ego vehicle (the autonomous vehicle) and four surrounding vehicles. These values describe, for each vehicle, a presence flag together with its longitudinal and lateral position and velocity (five features per vehicle).
The analysis of the counterexample path reveals several critical stages where the system transitions from a safe state to a failure:
Initial State (driving_0): The ego vehicle is traveling at a stable speed with $vx = 0.6667$ and $vy = 0.2603$, with no immediate threats. The surrounding vehicles are trailing behind, with significantly lower velocities, indicating no imminent danger.
State Change (driving_234): The ego vehicle maintains a stable speed, but Vehicle 1, located behind the ego vehicle, begins to approach. The velocity difference is still small, but the gap between the vehicles is narrowing, signaling the onset of potential risk.
Increased Risk (driving_65): As the ego vehicle continues its course, Vehicle 1 moves closer, reducing the gap further. The surrounding vehicles begin to accelerate slightly, increasing the risk of collision as the ego vehicle’s space on the road narrows.
Collision (collision_53): The proximity between the ego vehicle and Vehicle 1 becomes critical. A minor speed difference causes Vehicle 1 to make contact with the ego vehicle, resulting in a collision. Despite maintaining a stable trajectory, the ego vehicle was unable to avoid the crash due to the insufficient distance from the approaching vehicle.
The failure progression in the 16X cluster can be summarized as follows:
Initially, the ego vehicle is in a safe state, with surrounding vehicles trailing behind at a safe distance.
As the scenario evolves, Vehicle 1 begins to approach the ego vehicle, reducing the gap and increasing the risk of a collision.
The system fails to detect the increasing risk in time, and a collision occurs due to the insufficient gap between the ego vehicle and Vehicle 1.
In the Highway environment, the 16X cluster provides the optimal balance between failure probability and counterexample path complexity. The failure probability is minimized, while the counterexample path is complex enough to capture the significant state transitions leading to a failure. This cluster configuration allows the system to detect critical risks effectively without becoming overly sensitive to minor state changes.
The analysis of the counterexample path reveals that the primary failure mode of the autonomous driving system lies in its inability to maintain a safe distance from surrounding vehicles. In particular, the system exhibits a limited capacity to perceive and appropriately respond to gradual changes in vehicle proximity, increasing the likelihood of collision. To mitigate this issue, it is imperative to enhance the system’s ability to precisely detect dynamic variations in inter-vehicle distance and execute appropriate evasive and deceleration strategies accordingly. Rather than relying solely on static distance thresholds, the policy must be refined to learn dynamic proximity patterns and adapt its responses in a context-aware manner.
This insight provides a concrete direction for improving collision avoidance algorithms and overall system safety. From the perspective of safety verification in deep reinforcement learning, the findings indicate that while the current policy may be effective in responding to abrupt proximity changes exceeding a predefined threshold, it struggles to handle more gradual variations in vehicle distance. Consequently, future research should focus on optimizing reinforcement learning frameworks to detect subtle distance variations at an early stage, predict potential collision risks proactively, and refine avoidance strategies dynamically, thereby improving the robustness and reliability of autonomous driving systems.
Conclusion
This research presents a novel approach for verifying the safety of DRL systems by employing state clustering techniques and formal probabilistic model checking. DRL systems often operate in large and complex state spaces, making it crucial to reduce the dimensionality of these state spaces while ensuring that important state information is preserved for accurate verification. To address this, we used the KMeans++ algorithm to cluster similar states and constructed DTMC models to capture the probabilistic transitions between clustered states. These DTMC models were then analyzed using the PRISM model checker to assess failure probabilities and evaluate the system’s safety properties.
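The DTMC construction step described above can be sketched as follows: given trajectories whose states have been mapped to cluster ids, transition probabilities are estimated as empirical frequencies. This is a minimal illustration with hypothetical toy labels, not the authors' implementation:

```python
from collections import defaultdict

def build_dtmc(trajectories):
    """Estimate DTMC transition probabilities from cluster-labeled trajectories.

    Each trajectory is a sequence of cluster ids; the transition probability
    from s to t is the empirical frequency of observed s -> t transitions
    (the maximum-likelihood estimate).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, t in zip(traj, traj[1:]):
            counts[s][t] += 1
    dtmc = {}
    for s, successors in counts.items():
        total = sum(successors.values())
        dtmc[s] = {t: n / total for t, n in successors.items()}
    return dtmc

# Hypothetical toy trajectories over three clusters:
# 0 = stable, 1 = unstable, 2 = failed (labels illustrative only).
trajs = [[0, 0, 1, 2], [0, 1, 1, 2], [0, 0, 0, 1]]
dtmc = build_dtmc(trajs)
# dtmc[0] == {0: 0.5, 1: 0.5}; dtmc[1] == {1: 1/3, 2: 2/3}
```

The resulting transition matrix can then be written out as a PRISM model for probabilistic model checking of failure-reachability properties.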
Two experimental environments, LunarLander and Highway, were used to validate the proposed methodology. By incrementally increasing the number of clusters from 1X to 128X, we identified 16X as the optimal cluster size for both environments. This optimal number of clusters strikes a balance between minimizing failure probability and managing the complexity of counterexample paths. The results show that excessively fine-grained state segmentation (such as with 64X or 128X clusters) increases the likelihood of over-sensitivity, leading to a higher rate of false positives. Conversely, too few clusters risk losing critical state information, resulting in inaccurate verification. The 16X cluster provides the best trade-off, capturing important state distinctions while avoiding unnecessary complexity.
The counterexample analysis in both experiments provided valuable insights into how the system transitions from stable to failure states. In the LunarLander environment, rapid increases in vertical velocity and angular velocity were identified as key failure mechanisms during landing attempts. Similarly, in the Highway environment, the failure to adequately detect and respond to approaching vehicles was identified as the primary cause of collisions. These findings are critical for refining the system’s policies to prevent similar failures in future operations.
A. Practical Implications for DRL Safety Verification
A key contribution of this work is its emphasis on safety verification as a complementary objective to reward maximization in DRL systems. In real-world applications, such as autonomous driving and robotics, achieving high rewards in training environments does not inherently guarantee safe operation. Systems optimized solely for performance may exhibit unsafe behavior in unexpected or critical scenarios, which can lead to catastrophic failures.
Our approach addresses this limitation by formally quantifying failure probabilities through state abstraction and probabilistic model checking. By leveraging DTMC models constructed from clustered states, the proposed framework allows for:
Quantitative evaluation of failure risks, ensuring that safety is assessed rigorously and not inferred solely from performance metrics such as reward values.
Systematic trade-offs between reward maximization and safety, where avoiding critical failures may require accepting lower rewards in specific conditions to enhance robustness.
Efficient handling of continuous and large-scale state spaces through KMeans++ clustering, enabling scalable verification in complex environments.
These contributions are particularly significant for safety-critical applications where even minor failures can lead to unacceptable consequences. For instance, in autonomous vehicles, failure to respond to rare but hazardous conditions may compromise passenger safety. Similarly, robotic systems operating in dynamic or hazardous environments must prioritize avoiding failures over optimizing task performance. The proposed methodology provides a systematic and formal approach to addressing these challenges, bridging the gap between RL performance optimization and safety assurance.
B. Real-World Applicability
This study validates the proposed methodology using widely recognized simulation environments, namely OpenAI Gym’s LunarLander and Highway. These environments are extensively used as benchmarks in the DRL research community for evaluating decision-making and safety verification systems. The successful application of our approach in these environments demonstrates its feasibility for analyzing failure probabilities and verifying safety properties.
While the current work focuses on simulation-based validation, this serves as a critical first step toward real-world implementation. The proposed methodology can be extended to address practical safety verification challenges in real-world applications as follows:
Pre-deployment safety verification: Before deploying DRL policies in complex environments, such as autonomous vehicles or robotic control systems, the proposed framework can be used to quantitatively verify safety properties. This step minimizes risks by identifying unsafe transitions and failure-prone states in simulation prior to real-world deployment.
Testbed validation: The DTMC-based failure probability analysis provides an intermediate step for verifying DRL policies on testbeds or emulated systems. For example, an autonomous driving system can utilize the proposed method to assess safety-critical decisions, such as collision avoidance, in controlled environments.
Integration into complex systems: The ability to cluster continuous state spaces and abstract system behavior makes our approach suitable for integration into multi-layer decision-making systems. This includes modular architectures in robotic systems, where global behavior emerges from local policy decisions.
By enabling formal safety verification prior to real-world deployment, this methodology provides a robust foundation for improving the safety and reliability of DRL systems. The quantitative assessment of failure probabilities ensures that safety-critical decisions are rigorously evaluated, particularly in high-stakes applications such as autonomous driving and robotic control.
Future efforts will focus on validating the proposed framework in real-world testbeds and dynamic environments to assess its robustness and scalability. Additionally, the integration of real-time verification will further enhance its applicability for systems operating under changing conditions.
C. Limitations
While this study provides a systematic framework for DRL safety verification in simulated environments, it has certain limitations. The experiments were conducted in noise-free simulation settings, assuming perfect data collection without external disturbances. In real-world applications, state data acquired from sensors or dynamic environments are often noisy or incomplete, which can impact the clustering accuracy and overall verification performance.
Addressing this limitation will require exploring noise-resilient clustering algorithms and robust data preprocessing techniques. These approaches will ensure that the proposed methodology remains effective when applied to real-world DRL systems where sensor noise and environmental uncertainties are unavoidable.
D. Contributions and Future Work
This study makes several significant contributions to the field of DRL safety verification:
Optimal State Clustering for Safety: By combining KMeans++ clustering with an objective function that balances failure probability and path complexity, we developed an efficient method to determine the optimal number of clusters. This approach not only reduces the state space but also ensures that critical safety-related information is retained, thereby improving the efficiency and reliability of safety verification.
DTMC Modeling and Formal Verification: We introduced a framework that constructs DTMC models from clustered states and verifies the system’s failure probabilities and invariance properties using the PRISM model checker. This combination of state clustering and formal verification provides a robust methodology for assessing the safety of DRL systems in both controlled and real-time environments.
Counterexample Path Analysis: Through detailed counterexample analysis, we identified specific failure mechanisms within DRL systems. This analysis allows system designers to trace the exact paths leading to failures and refine their policies accordingly. Our work demonstrates the importance of counterexample analysis in improving system safety by addressing failure points that might otherwise go undetected.
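One common way to extract a single representative counterexample from a DTMC (an illustrative choice on our part; model checkers and companion tools provide their own counterexample generation) is to compute the most probable path to a failure state. Maximizing a product of transition probabilities equals minimizing a sum of negative log-probabilities, so Dijkstra's algorithm applies:

```python
import heapq
import math

def most_probable_failure_path(dtmc, start, failure_states):
    """Return (path, probability) of the most probable path from `start`
    to any state in `failure_states`, searching over -log(p) edge weights."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, s = heapq.heappop(heap)
        if s in visited:
            continue
        visited.add(s)
        if s in failure_states:
            # Reconstruct the path back to the start state.
            path = [s]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return list(reversed(path)), math.exp(-d)
        for t, p in dtmc.get(s, {}).items():
            nd = d - math.log(p)
            if nd < dist.get(t, float("inf")):
                dist[t] = nd
                prev[t] = s
                heapq.heappush(heap, (nd, t))
    return None, 0.0

# Hypothetical toy DTMC: 0 = stable, 1 = unstable, 2 = failed.
dtmc = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 2: 0.5}}
path, prob = most_probable_failure_path(dtmc, 0, {2})
# path == [0, 1, 2]; prob == 0.3 * 0.5 == 0.15
```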
Despite the contributions made by this research, there are several limitations that warrant further investigation. First, while KMeans++ provides an effective method for clustering, it does not explicitly account for state correlations or noise, which could lead to suboptimal clustering in highly dynamic environments. Incorporating more sophisticated clustering techniques or deep learning-based state representations could improve the accuracy of the DTMC models. Additionally, this study focused on two specific simulation environments—LunarLander and Highway—which limits the generalizability of the results. Future studies should apply this methodology to a wider range of real-world scenarios to validate its robustness.
In future research, exploring more advanced state abstraction methods that can adapt dynamically during training could further enhance the efficiency of DRL systems. Additionally, expanding the methodology to include real-time verification and on-demand safety checking would be valuable, especially in applications where systems must continuously adapt to changing environments, such as autonomous driving or robotics.
While this study has focused on controlled simulation environments, generalizing the proposed methodology to real-world applications requires additional research. Specifically, introducing external noise and environmental disturbances into the simulation environment will allow for a more robust validation of the method’s generalizability. Future work will explore how the current framework performs under these conditions to ensure applicability in real-world scenarios.
Multi-agent RL environments, where agents interact in more complex and unpredictable ways, also present a promising area for extending this research. Ensuring safety in multi-agent systems is particularly challenging due to the interdependencies among agents’ decisions and actions. By incorporating state clustering and formal verification into multi-agent systems, we aim to systematically analyze the collective safety properties and failure probabilities. This will be crucial for applications such as swarm robotics, collaborative autonomous vehicles, and multi-agent traffic management systems, where multiple autonomous entities operate simultaneously.
This research has introduced a structured framework for improving the safety and reliability of DRL systems through state clustering, DTMC model construction, and formal verification. By combining state clustering with probabilistic model checking, we have shown how failure probabilities can be quantitatively assessed and critical safety violations identified. The insights gained from the counterexample analysis provide a practical pathway for improving system policies to avoid failures. As DRL systems become more prevalent in high-stakes applications such as autonomous vehicles and robotics, ensuring their safety will be paramount. This research represents an important step toward achieving that goal and lays the foundation for future advancements in DRL safety verification.