Introduction
Automated Driving Systems (ADSs) promise enormous benefits to society in terms of increased comfort, safety and efficiency of transportation systems, by relieving the driver of the responsibility of driving, and even of supervising the system, while it is operating in traffic [1]. To achieve such benefits, it is essential to provide evidence that adequately supports the belief that such systems are safe, not least to ensure public acceptance. However, such evidence has, so far, proven difficult to amass, as is evident from the lack of widespread deployment of ADSs, contrary to the promises of several high-profile automakers. This is, in part, due to the fact that systems with such high-level autonomous decision-making capabilities have never previously been deployed for mass application on public roads. Moreover, the provision of safety evidence is inherently difficult for a number of additional reasons related to the complexities of the system, its open operational environment [2], as well as the high dependability requirements imposed on the system to perform at least on par with human drivers [3], [4]. All these aspects complicate the construction of a complete and predictive safety argumentation; in the following, we elaborate on the key challenges involved in collecting such evidence. With the increased scale of commercial ADS operations, the likelihood of ADS-related accidents will also increase. Given the human negativity bias [5], according to which negative events are given higher credence than positive ones, there will therefore be an increased need for thorough and reliable safety cases able to instil confidence in the technology.
Of course, safety cases have always been required to be both complete and predictive. For ADSs, however, it is practically infeasible to collect sufficient evidence from in-traffic testing alone [6], [7], [8], which is often referred to in the industry as the “billion miles problem”. Closely related to this, the complexity of the operational environment makes it difficult to completely assess the system through any form of testing, as this complexity leads to “infinitely many characteristics” [9], calling into question how such evidence should indeed be acquired. Further, the increasing reliance on machine learning in safety-critical parts of the system contests the applicability of traditional design and development processes [10], [11], [12], [13].
There already exist several well-established methods to design, develop, verify and validate dependable and safety-critical systems; see [14] for a list of more than 800 safety methods spanning multiple domains. However, it is widely recognised that assuring the reliability and safety of ADSs requires new methodologies that go beyond existing standards and practices. The standardisation landscape for the safety of ADSs is also fragmented, with several existing standards, pending revisions, as well as new standards, each covering complementary aspects related to the safety of ADSs [15], [16], [17], [18], [19]. This landscape reflects the shift away from established methods towards novel approaches to integrating safety and risk.
The picture painted in this work suggests that one reason for this shift is the inability of current methods to simultaneously address all the operational complexities and developmental hurdles pertaining to an ADS. The fact that no single current method provides a comprehensive solution to the development of a safe ADS further makes it difficult to grasp the current progress, as well as which gaps remain.
Therefore, in this work we review the state of the art in the literature regarding methods for providing safety evidence. This review led to the identification of eight challenges, which were used to review and discuss the abilities and shortcomings of the existing methods, in particular with respect to the difficulties encountered when providing evidence for the safety of an ADS. We group the eight challenges into five areas:
operational uncertainties,
behavioural and structural complexities,
high-dependability requirements,
the use of machine learning components, and
the shift to agile development and continuous deployment.
The challenges are derived based on the review of the literature and further elaborated upon in Sec. III.
This paper lays the foundation for a holistic perspective on the safety of ADSs, highlighting what areas are presently addressed in the literature as well as what challenges remain to be solved. Furthermore, we discuss which methods can help to reduce the gap with respect to each of the eight identified challenges presented in Sec. III. It is our aim with this work to provide practitioners and researchers with a starting point for grasping the larger picture of how the different methods contribute to safety, as well as where they face difficulties when used in the context of the development of an ADS. Hopefully, this overview can also help further (interdisciplinary) research, spanning multiple of the reviewed methods, to close the remaining gaps on the road to safe ADSs. To improve the clarity and structure of the review, the covered methods are grouped into four main categories: design techniques, verification and validation methods, run-time risk assessment and run-time (self-)adaptation, as illustrated in the mind-map in Fig. 1.
A mind-map of the concepts discussed in this paper, grouped into four major themes, each supplying methods that contribute evidence as a basis for the assurance case of an ADS.
The contributions of this paper can be summarised as follows:
Identification of eight challenges for providing safety evidence for ADSs;
A state-of-the-art review of existing methods, in the light of the aforementioned challenges; and
Identification and elaboration of research gaps and corresponding research questions, based on the two contributions above.
Note that these contributions are primarily directed towards designers, developers or organisations developing ADSs. However, both the challenges, as well as the provided structure of the reviewed methods, would also naturally hold relevance for a number of other stakeholders, including assessors, legislators, consumers, and the general public.
The layout of this paper is illustrated in Fig. 2. We start by defining the addressed research questions in Sec. II, the methodology followed in this work in Sec. II-A, the delimitations in Sec. II-B, related work in Sec. II-C, and the preliminaries and definitions in Sec. II-D. Sec. III presents the challenges related to providing safety evidence for ADSs. Based on the proposed articulation structure, design techniques are discussed in Sec. IV, verification and validation methods in Sec. V, run-time risk assessment concepts in Sec. VI, and run-time (self-)adaptation methods in Sec. VII. Along with the discussions, each method category section also includes a summary table, presenting the safety evidence provided by each of the reviewed methods. The results of the review are presented in Sec. VIII, where each method's ability to address each of the eight challenges is classified and promising methods for addressing each of the presented challenges are discussed. In Sec. IX, research gaps are identified and future research avenues are given, followed by a discussion of the threats to validity in Sec. X. Finally, concluding remarks are provided in Sec. XI.
The layout of the paper. Sections are illustrated in blue, and subsections and visuals/contributions in indigo. The yellow boxes group the sections corresponding to the four method categories: design techniques, verification and validation methods, run-time risk assessment, and run-time (self-)adaptation, respectively.
Research Questions, Method and Preliminaries
In order to provide a pertinent and meaningful articulation of the existing literature, this work and its discussions are focused on the following research questions:
RQ1:
What are the present challenges for providing safety evidence for an ADS?
RQ2:
What methods exist in the literature that support such evidence provision?
RQ3:
How do these methods address, and how are they affected by, the challenges from RQ1?
RQ4:
Based on the results from RQ3, what are the gaps in the state-of-the-art of RQ2 considering the challenges of RQ1?
The deliverables on the right-hand side of Fig. 3 indicate where, within this paper, the research questions are answered. The answer to RQ1 corresponds to 4.b), and RQ2 relates to both 4.a) and 5.a). Further, RQ3 is answered by TABLE VI, i.e. deliverable 6.a), and the answer to RQ4 is indicated by 8.a).
Illustration of the research methodology. The indigo boxes on the left indicate the main research steps, whereas the orange boxes on the right indicate the corresponding deliverables within the paper.
A. Method and Approach
Fig. 3 illustrates, as a step-wise process, the approach and research methodology of this work. The initial literature survey regarding safety evidence for ADSs helped in retrieving challenges and methods pertinent to this work. Steps 2 and 3 identify, respectively, the challenges regarding safety evidence for ADSs and the applicable development methods addressing and providing such safety evidence. Step 4 structured and organised the above-mentioned challenges and methods according to existing frameworks and safety standards (elaborated upon in Sec. II-A1 below). This effort resulted in the mind-map of Fig. 1 and the list of challenges presented in Sec. III. Based on this structure, the identified methods were investigated in light of the challenges (step 5), resulting in the discussions of Sec. IV through Sec. VII. The classification in TABLE VI resulted from step 6, and a gap analysis (step 7) was conducted based on this classification. Subsequently, the gap analysis laid the foundation for step 8, wherein research questions aimed at addressing those gaps were derived. Lastly, a focused high-level systematic literature survey was conducted to validate the proposed structure of the methods (i.e. the mind-map of Fig. 1) as well as to ensure that no other related surveys were omitted.
1) A Note on the Structure and Content of This Work:
It is important to mention that the existing literature on ADSs and all their inherent aspects, including safety and evidence provision, is extremely large and spans different topics and scientific communities, an aspect that motivates this work. In an effort to provide a structure to this vast landscape of methods and approaches, we propose a structuring of the methods in terms of the mind-map in Fig. 1. In addition to structuring the reviewed methods, this mind-map also aims to guide the reader through the discussions of this paper by contextualising where, within the development cycle (design-time or run-time), each discussed method is positioned. Furthermore, the proposed categories and mind-map should be seen as one way of organising the methods discussed, providing the scientific community with one perspective on the current state of the art. While we make no claim for this structure to be exhaustive or complete, we are not aware of a better structure/mapping in the literature, and we have yet to find a reference to a method that cannot be fitted into one (or several) of the categories shown in Fig. 1. Note that the review, analysis and discussion of the literature presented within each of the subsections are independent of the proposed structure and the provided mind-map. The analysed methods are collected under four main categories: design techniques, V&V methods, run-time risk assessment and run-time adaptation. The first two categories correspond to activities and methods commonly included before the deployment of the system, during design-time. The latter two collect methods encompassing activities during the operation of the system, i.e. at run-time.
The design techniques can be seen as a collection of the activities on the left side of the V-model [20] (cf. Sec. II-D8), whereas the V&V methods correspond to the activities in the right leg of the V-model. The run-time concepts also have a strong anchoring in the V-model, as all models and systems developed for run-time naturally need to be designed, developed, verified and validated. The run-time concepts do not, however, strictly fit within a classic “waterfall” development cycle, since their main contributions reside at run-time. The two run-time categories aim at collecting methods supporting the evaluation, evidence provision, as well as adaptation of the ADS at run-time.
Comparing to the Cyber-Physical Systems (CPS) framework by the National Institute of Standards and Technologies (NIST) [21, Fig. 4], the scope of this paper covers methods both within the conceptualisation and realisation facets. The design techniques covered in our work fit nicely into both the conceptualisation and realisation facets of the NIST framework. The V&V methods, on the other hand, correspond to the realisation facet as do the run-time concepts (or more specifically to operating the CPS/ADS). The aim of our work is to elucidate the evidence that each method provides toward the safety assurance of the ADS, thus effectively providing a link between the first two facets (conceptualisation and realisation) and the third facet (assurance) of the CPS framework [21, Fig. 4].
Note that several of the methods discussed in this review could be positioned in two or more of the categories, and the judgement of the authors has been exercised in order to position each method in the section where the most relevant aspects for providing safety evidence can be appropriately highlighted. For example, supervisor architectures are allocated to design techniques to highlight the underlying architectural aspects, even though such architectural considerations are integral to both effective monitoring (pertaining to methods of run-time risk assessment) and run-time adaptation. Further, the application of supervisors is a design decision, but it also strongly supports the validation of the system.
2) Included Methods:
With the purpose of providing a holistic perspective on the safety of ADSs, spanning a wide range of methods and research disciplines, we eventually discarded explicit systematic literature searches as a basis for forming the reference list of this work. The issue with such a systematic approach was that applicable search strings either returned far too many references (on the order of tens of thousands of papers) or far too few (less than ten). However, the effort of developing such a systematic approach yielded a good understanding of the available references as well as of the methods commonly appearing across the research landscape. These insights, together with the domain knowledge of the authors, provided a starting point for the investigation. The set of included methods was expanded iteratively as new appropriate references were found through snowballing of the initial set of references and additional exploratory (non-systematic) searches of the literature.
3) Focused High-Level Systematic Literature Survey:
To complement the (non-systematic) approach taken for the methods reviewed in this paper, a focused high-level systematic literature survey was also conducted. The aim of this focused survey is to identify works similar to this paper, as well as to validate the proposed structure and exhaustiveness of the methods included in this review.
The focused search was done within a narrower scope, by identifying other journal articles reviewing safety aspects of ADSs. References with titles matching the following search string were retrieved:\begin{equation*}\$AD ~\mathbf{AND}~ \$Safe ~\mathbf{AND}~ \$Survey.\end{equation*}
The focused survey only considers references contributing with perspectives on safety evidence provision and methods related thereto. Together with the restriction to journal articles, this resulted in 17 papers being excluded. A simplified PRISMA-like [22] flowchart for the complementary search is presented in Fig. 4.
The process of the focused high-level literature survey, yielding a final set of n=12 relevant references.
The complementary references include the following main topics:
testing and scenario-based methods [23], [24], [25], [26], [27];
AI and the safety implications of using AI in the context of ADSs [28], [29], [30];
Safety Of The Intended Functionality (SOTIF)-related aspects [31];
surrogate safety measures for ADSs [32] (closely related to the threat assessment techniques covered in Sec. VI-B);
standards and practices for the safety of autonomous systems, where [33] provides a cross-domain review beyond the automotive domain; and
industrial safety and design approaches, where [34] reviews publications related to Mobileye's Responsibility-Sensitive Safety (RSS) approach.
It is worth mentioning that the methods and subtopics studied in all the complementary references were easily mapped onto the structure proposed in this work. On the other hand, many of the details pertaining to the use of AI-based methods are not thoroughly reviewed and discussed in this paper. This is, however, justified by the original scope of this paper, whose primary focus is on methods for providing safety evidence, rather than methods for realising the ADS functionality.
B. Delimitation
The research and development efforts for the successful introduction and productification of ADSs are immense and include a wide variety of obstacles [3]. In this paper we limit ourselves to the challenges pertaining to safety in the sense of functional safety (as per, e.g. ISO 26262 [20]) and nominal safety (e.g. SOTIF [35]). We highlight technical areas and methods that provide evidence supporting the safety claims of an ADS.
In the interest of length and clarity of this paper and its contributions, we consider the following areas as out of scope:
the definition of what is safe enough, in particular in comparison to human drivers, and what are appropriate (quantitative) definitions of risk (as reviewed in, e.g., [36] and discussed in, e.g., [37], [38]),
collaborative and communicating systems, for example: vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I) and vehicle-to-everything (V2X) communications, despite the indicated importance for safe deployment of ADSs, e.g., [39],
cyber-security and its relation to safety of the system (see [40] for a survey on the cyber-security considerations for connected and automated vehicles),
human-machine interface and interaction, including safety considerations on hand-over procedures and mode confusion, as analysed in, e.g., [41], as well as potential safety issues concerning automation complacency [42], [43],
the (safety) implications of using driver monitoring systems (see, e.g., [44] for a review of DMSs in the context of automated vehicles),
the physical platform, on which the ADS operates, is assumed to be reliable (something that would be addressed through classic functional safety work following, e.g., ISO 26262 [20]. Further, note that architectural patterns for, e.g., fault tolerance are discussed in Sec. IV-E),
the question of how to scale from an advanced driver assistance system into a full ADS (considering, e.g., the differences in required sensor setup and compute [45, Figure 5] and ODDs [1]),
organisational complexity [2], the impact of safety culture and the way the development organisation is organised, as, for example, in the case of the Boeing 737 Max accidents [46], and
explicit quantification techniques of operational risks originating from ML/AI-based components (especially in relation to computer vision task), as reviewed in, e.g., [28].
Furthermore, aspects pertaining to tracking, organisation, and traceability of arguments and evidence of the safety case will also be excluded from the discussion presented. Additionally, the interconnections, relationships and dependencies in-between the reviewed methods are not detailed in this work, but deferred to future work.
The delimitation of this work could also be seen in the light of the levels provided in the CPS framework [21, Fig. 1], where we cover the innermost levels: device and system, but consider systems-of-systems and human interaction (at least in terms of the human being the user of the system, human traffic participants would naturally be in scope) as out of scope.
Note that the methods and references included in our work have been collected through exploratory searches across multiple domains, stemming from the authors' own areas of expertise. Considering the high volume of research on this topic, the studied reference list could naturally be extended and complemented. The presented selection is nevertheless considered to be pertinent and representative of the existing methods, and the arguments and conclusions made in this paper are believed to be valid. An extended discussion of the threats to validity is provided in Sec. X-A.
C. Related Work
Our review focuses on a holistic perspective on technical methods providing safety evidence for ADSs. While autonomous vehicle technology and safety (assurance) have been approached in many different research works, no other work, to the best of the authors’ knowledge, has conducted a holistic analysis of the existing methods relating to providing safety evidence.
There are, however, several works pertinent to the topic of our work and worth mentioning here.
For instance, Nair et al. [47] review the state of the art of safety evidence provision for safety certification across multiple application domains. Nair et al. [47] use the term evidence provision to cover three different aspects: the information constituting evidence, how to structure the evidence, and how to assess the evidence. We focus our discussions on the first and last of these aspects and, at least partly, leave the discussion of how to structure the evidence for future work. Further, whereas our work aims to elucidate and discuss the contributions of different concrete methods towards the safety of ADSs, [47] gives an overview and classification of what information and artefacts could be considered as evidence when fulfilling different safety standards. There are, nevertheless, some bridges between our work and [47]. More precisely, the taxonomy provided in [47] relates to the four main categories of our work, which are nicely covered by the four leaf nodes of the System Life cycle Plan category in the taxonomy of [47, Figure 2]. However, the methods covered in this paper are discussed not only in the light of what Process Information they provide, but also in terms of what Product Information can be supplied.
In [2], the authors review methods for safety assurance of CPSs and evaluate each method in terms of its effectiveness in addressing structural, dynamic and organisational complexities, which are identified as the three main sources of complexity pertaining to the development and safety assurance of CPSs. For the purpose of the discussions in our work, we restrict ourselves to the former two categories of complexity. In our work, we contrast the different methods' abilities to address a number of central challenges for providing safety evidence for an ADS; such a discussion is not included in [2]. Further, [2] places a large emphasis on risk assessment and hazard identification methods as well as verification and validation methods. In our work, these aspects are covered in Sec. IV and Sec. V, respectively. Our paper also goes beyond the scope of [2] by covering run-time methods.
Aspects concerning SOTIF are surveyed in [31], which provides a comprehensive exploration of the SOTIF landscape. Interestingly, Wang et al. [31] also acknowledge the operations phase subsuming the traditional V-model [31, Fig. 6]. On a high level, that structure is similar to the one proposed in this work, but Wang et al. [31] use it in the context of SOTIF. Furthermore, that structure does not cover run-time adaptations explicitly, even though restricted operations are mentioned in relation to the operational phase of the ADS. Notably, we were able to recognise and position all the methods covered in [31] within the structure of this work, providing evidence for the usefulness of the structure proposed in this paper.
Similar to our paper, Burton et al. [48] also discuss the importance of a holistic perspective for the safety assurance of ADSs. However, the framework presented in [48] provides a view of the problem complementary to the one discussed in our work. In particular, while our paper focuses on methods for providing safety evidence, [48] addresses the causes of system complexity and the exacerbating factors worsening the consequences of such complexity. For that discussion, [48] includes the business context, development context and organisational aspects, while we restrict our analysis to technical methods providing safety evidence.
The complexity of systems such as an ADS has also been acknowledged as a key challenge in [3]. The factors of CPS complexity, in general, are discussed in [2] and [49]. We partly take support from the considerations presented therein, but restrict ourselves to ADSs and methods for safety evidence provision.
The comprehensive review on the impact of AI on ADS safety presented by Nascimento et al. [50] further complements our work by focusing on different ML/AI techniques and the associated ADS safety-related problems. Reviews of AI-safety related works are also presented in [28] and [30], complementing the discussions presented in this paper. Notably, Araújo et al. [28] provide a comprehensive and detailed overview of uncertainty quantification methods. In [29], safety management of ADSs is reviewed in relation to software and perception perspectives.
To support the case for why dynamic risk management is needed for ensuring the safety of autonomous systems, Adler et al. [51] provide a structure of the related solutions and open research challenges. Three such solutions are covered in our work, namely dynamic risk assessment methods (Sec. VI-D), run-time certification (Sec. VII-B) and dynamic safety management (Sec. VII-C). With respect to the structure presented in [51, Fig. 1], we cover the notion of situational risk management and partly touch on the notion of bootstrapping confidence in safety claims. The latter notion is, however, not central to our work and is only treated in passing when discussing operational data collection (Sec. VI-A). While the structure provided in [51] indeed helps in giving an overview of the methods of dynamic risk management, we go beyond that to discuss how well each of the three solutions copes with each of the eight challenges of Sec. III.
In a survey on the safety assurance considerations for open adaptive systems, Trapp and Schneider [52] discuss four types of models at run-time: safety certificates, safety assurance, V&V models and HARA. While these four models are not explicitly addressed in this work, the first (and partly the second) is captured by the discussion on run-time certification in Sec. VII-B. Furthermore, even if moving V&V models to run-time is not directly discussed here, it could be argued that run-time monitoring, in terms of both operational data collection for fleet level assessment (discussed in Sec. VI-A) as well as vehicle-level risk assessment (as discussed in Sec. VI-B and VI-D), could at least partially support such a shift. Lastly, conducting the HARA at run-time is elaborated upon as part of the Dynamic Safety Management concept discussed in Sec. VII-C.
In conclusion, even though there is a trove of relevant previous work on different methods for supplying safety evidence for ADSs (or comparable systems) no previous work has approached a holistic analysis of the landscape of available methods and how they relate to the development and life cycle of ADSs. This motivates the work presented in this paper.
In the rest of this paper, we reference and draw upon a set of surveys to support the discussion of each of the included methods. While many surveys provide invaluable insights into their respective domains and scopes, none provides the same holistic perspective as is covered in our work. A summary of which surveys we draw upon and reference in each section is given in Sec. X-A.
D. Preliminaries and Definitions
The following subsections introduce the definitions of some central terms used throughout this work.
1) ADS:
An ADS is a system that operates at SAE automation level 3–5 [1]. This entails that the ADS is completely responsible for the Dynamic Driving Task (DDT), at least within a confined Operational Design Domain (ODD) (applicable to levels 3–4). Consequently, the ADS does not require a driver to supervise the operations directly, which is one of the main differences compared to Advanced Driver Assistance Systems (ADAS) (corresponding to SAE automation levels 1–2 [1]). While this shift of responsibility results in a number of different considerations, we will not venture further into a direct discussion of the differences between ADAS and ADSs here. Instead, this shift is included as one of the identified challenges of the following section, and the impact of shifting the responsibility for the entire DDT to the ADS is thus discussed for each of the reviewed methods.
To have a common reference frame for further discussions, we consider the classical functional breakdown illustrated in Fig. 5, identified as the “pipeline method” to motion planning by [53]. Decision and path planning are considered to be made in the decision-making block, which receives as input the perceived surroundings of the vehicle and the available capabilities of the platform. The output path is then used within the vehicle control block and ultimately executed on the vehicle platform. Note that this breakdown does not include monitoring or redundancy aspects (as discussed further in Sec. IV-E), but merely represents the purely functional view of the system. Further, one could consider an additional step within the environment perception block, tasked with predicting the intentions and trajectories of other traffic participants. For the purpose of this illustration, this step is kept within the environment perception block.
A common breakdown of subsystems constituting an ADS. The Environment Perception (EP) provides inputs to the Decision Making (DM), which requests a path to be enacted by the Vehicle Control (VC).
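To make this functional breakdown concrete, the following minimal sketch renders the dataflow of Fig. 5 in code. It is purely illustrative: all class, field and method names are our own placeholders rather than part of any specific ADS implementation.
\begin{verbatim}
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class WorldModel:
    """Perceived surroundings, e.g. tracked objects."""
    objects: List[Any] = field(default_factory=list)

@dataclass
class Path:
    """Planned trajectory as (x, y, t) waypoints."""
    waypoints: List[Tuple[float, float, float]] = field(default_factory=list)

class EnvironmentPerception:
    def perceive(self, sensor_data: Any) -> WorldModel:
        # sensor fusion, detection/tracking (and, here, prediction)
        return WorldModel()

class DecisionMaking:
    def plan(self, world: WorldModel, capabilities: dict) -> Path:
        # tactical decisions and path planning, bounded by the
        # capabilities reported by the platform
        return Path(waypoints=[(0.0, 0.0, 0.0)])

class VehicleControl:
    def enact(self, path: Path) -> None:
        # longitudinal/lateral control requests to the platform
        pass

def step(sensor_data, capabilities, ep, dm, vc) -> None:
    world = ep.perceive(sensor_data)     # EP -> DM
    path = dm.plan(world, capabilities)  # DM -> VC
    vc.enact(path)                       # VC -> vehicle platform
\end{verbatim}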
2) Safety and Risk:
Safety is commonly defined as the absence of unreasonable risk [20, §3.132], where risk is understood as \begin{equation*}R(x) = P(x)\,S(x),\end{equation*} where $P(x)$ denotes the probability of occurrence of the harmful event $x$ and $S(x)$ the severity of its consequences.
The acceptable level of safety can be considered in relation to a positive risk balance [4], as compared to the driving performance of human drivers [6], [7], [54]. Junietz et al. [36] provide an overview of quantitative risk levels from different industries and discuss these in relation to ADSs. Further, Warg et al. [55] propose the concept of a Quantitative Risk Norm (QRN), collecting quantitative safety requirements for an ADS. For the purpose of the discussions in our work, we assume that quantitative requirements differentiating reasonable from unreasonable risks are present. Furthermore, we interpret the term safety broadly, such that it encompasses not only functional safety (e.g. ISO 26262 [20]) but also the Safety Of The Intended Functionality (SOTIF) [35].
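As a minimal illustration of such a quantitative frame, the sketch below checks estimated per-hour incident rates against the budgets of a QRN. The consequence classes and budget values are hypothetical placeholders of our own, not figures from [55].
\begin{verbatim}
# Illustrative QRN: maximum tolerable rate per consequence class
# (events per operating hour); classes and numbers are made up.
QRN_BUDGET_PER_HOUR = {
    "light_injury": 1e-6,
    "severe_injury": 1e-7,
    "fatality": 1e-9,
}

def risk(probability_per_hour: float, severity_weight: float) -> float:
    """R(x) = P(x) * S(x), cf. the definition above."""
    return probability_per_hour * severity_weight

def within_qrn(estimated_rates_per_hour: dict) -> bool:
    """True iff every consequence class stays within its budget."""
    return all(
        estimated_rates_per_hour.get(cls, 0.0) <= budget
        for cls, budget in QRN_BUDGET_PER_HOUR.items()
    )

# e.g. rates estimated from V&V activities and statistical modelling
print(within_qrn({"light_injury": 3e-7, "severe_injury": 5e-8,
                  "fatality": 2e-10}))  # -> True
\end{verbatim}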
3) Dependability:
In a wider context, safety is just one attribute of the system’s Dependability [56], along with: Availability, Reliability, Confidentiality, Integrity, and Maintainability. While all dependability aspects are not considered in our work we note that availability, reliability, and maintainability are attributes tightly linked to safety.
4) Fault Tolerance:
In [56], four means to achieve a dependable system are detailed: fault prevention, fault tolerance, fault removal and fault forecasting, all four of which are discussed in this work. The first two relate to the design techniques and the latter two to the reviewed V&V methods, where operational data collection and EVT modelling would supply the fault-forecasting aspects. Managing faults is a way to increase the reliability of the system, and in particular its ability to tolerate certain types and frequencies of faults. Following the dependability terminology of Avizienis et al. [56], a fault in the system (e.g. a software bug or an inherent performance limitation of the system) might cause an error (an incorrect state in, e.g., a software variable), which, in turn, might lead to a failure (at some level of the system, with potential continued fault propagation). The faults considered do not necessarily result in safety-critical failures. Stolte et al. [57] build on the work of Avizienis et al. [56] and present a taxonomy including three fault-tolerance regimes: fail-operational, fail-degraded and fail-safe.
Further, for the discussions of our paper, we define a Fault-Containment Unit (FCU) as a unit within which a fault is contained [58]. Such a unit (i.e. subsystem) should exhibit a defined failure at the boundary to its environment, and have its own software and hardware to contain the direct effects of an internal fault. Clearly, the value of FCUs is higher if the separated units are ensured to fail independently.
5) Safety/Assurance Case:
A safety case, an important concept for safety argumentation, consists of “[...] a structured argument, supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment”, see [59, §13.2.1]. In the context of functional safety in the automotive domain, one could instead consider the definition of ISO 26262, defining it as an “[...] argument that functional safety is achieved for items, or elements, and satisfied by evidence compiled from work products of activities during development” [20]. Yet another view of a safety case is to require the safety arguments (i.e. the safety case) to provide justified belief in the safety of the system (e.g. [60]). All three definitions regard evidence as essential. The first definition would be appropriate for the discussions of this paper, particularly as it gives a broader scope compared to the definition of ISO 26262, not limiting the evidence to be rendered during the development stages only. However, to support the discussions in this work, we consider a definition closely related to the third, where we instead refer to providing sufficient and appropriate evidence to support the safety claims of the system.
Similar to a safety case, an assurance case may instead consider one or more requirements placed on the system, including dependability, security, safety and quality.
6) Safety Evidence:
The safety case, as described above, requires a “body of evidence” in order to provide support to the argument put forward. In this paper we use safety evidence to refer to any data or artefacts produced during the development and operation of the ADS that contribute (or can contribute) to the safety case. The methods considered in this paper each provide such evidence, but in different ways. For example, the design techniques would mostly provide artefacts that support or help provide structure to the safety argument, whereas verification and validation methods output data which can contribute to the confidence in a safe ADS.
7) Leading and Lagging (Safety) Metrics:
To support the safety argumentation, there is a need for leading metrics providing a predictive assessment of the safety of the ADS. These can be contrasted with safety metrics providing lagging indicators, such as the reported disengagement rate or the number of collisions, as required by, e.g., the California DMV for autonomous vehicle testing [61]. While most measurements available in the vehicle are lagging by nature, they might still provide leading indicators for safety. For example, measurements of the Brake Threat Number (BTN) could provide an estimate of the collision rate of the ADS even though no actual collisions are recorded, for example through statistical modelling as discussed in Sec. V-B. In general, the threat assessment metrics discussed in Sec. VI-B do not provide a direct measurement of safety, but rather a proxy thereof. Consequently, with appropriate modelling, they might provide leading, rather than lagging, indicators of safety.
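As a concrete illustration, the BTN is commonly computed as the deceleration required to avoid a collision, normalised by the maximal deceleration the platform can deliver. The sketch below assumes a simplified straight-line approach towards a slower lead object; the parameter values and threshold are illustrative assumptions.
\begin{verbatim}
def brake_threat_number(v_ego: float, v_lead: float, gap: float,
                        a_max: float = 9.0) -> float:
    """BTN for a straight-line approach to a slower lead object.

    v_ego, v_lead: speeds in m/s; gap: distance in m; a_max: assumed
    maximal deceleration in m/s^2 (illustrative value). BTN >= 1 means
    braking alone can no longer avoid the collision.
    """
    v_rel = v_ego - v_lead
    if v_rel <= 0.0 or gap <= 0.0:
        return 0.0  # not closing in: no braking demand
    a_required = v_rel ** 2 / (2.0 * gap)  # constant-speed lead assumed
    return a_required / a_max

# Fleet-level use: the frequency of, e.g., BTN > 0.7 can feed the
# statistical models of Sec. V-B to estimate collision rates even
# when no actual collisions have been recorded.
print(brake_threat_number(v_ego=25.0, v_lead=15.0, gap=20.0))  # ~0.28
\end{verbatim}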
8) Design-Time Activities:
Traditionally, the activity of compiling safety evidence has been carried out and completed before the deployment of the system. In such a context, all evidence needed to support the safety claim is collected throughout the specification, analysis, design, development, verification and validation of the system. For example, in the V-model depicted in Fig. 6: the system is specified (Item Definition); a Hazard Analysis and Risk Assessment (HARA) is conducted; the Functional and Technical Safety Concepts (FSC/TSC) are devised; and the requirements are further refined. The system is then designed, implemented, verified and validated. Following this process, which is, for example, prescribed by ISO 26262 [20], has proven to yield sufficient safety for most automotive Electrical and Electronic (E/E) systems operating today.
The Challenges for Safety Evidence Provision
There are several traditional safety processes and concepts that have proven highly valuable and useful in the past, but that do not suffice in providing safety evidence for ADSs, as already discussed in the introduction. In the following, we explore the different reasons why it is difficult to provide adequate safety evidence for ADSs, and conclude by listing eight challenges that summarise these aspects.
Firstly, there are aspects regarding uncertainty. The provision of adequate safety evidence for ADSs is difficult due to the uncertainties present in the operational environment as well as the structural and dynamic complexities associated with the system itself [2], [49], [62]. The operational environment is rife with uncertainties, both in terms of the possible configurations, occlusions and actions of the other traffic participants [39] and in terms of the uncertainties arising from the interaction between these other actors and the ADS. To be able to operate in such an uncertain environment, the ADS tends to require a complex set of interwoven functions and subsystems [13], [63], which relates to the structural complexity [2] of the system itself and the complexity arising from (system) size as well as computability-related aspects [49].
Secondly, there are aspects of behavioural and structural complexity (some of which have already been mentioned in the paragraph above). Through the transition from a system of SAE automation levels 1–2 to a system of levels 3–5 (i.e. an ADS), the system is required to shoulder the entire DDT [1]. Thus, the system needs to be able to safely cope not only with unpredictable situations in its surrounding environment, but also with faults present (such as a software bug) or faults occurring (such as a transient hardware failure) within the system itself. These challenges are also acknowledged by Koopman and Wagner [62].
Thirdly, to be accepted by society, the system needs to perform at least on par with its human counterparts in traffic. However, since human drivers (especially unimpaired and attentive ones) are quite good at driving, this results in a high dependability requirement on the system (or an ultra-dependable system, as suggested in [3]), with enormous amounts of effort required to provide evidence thereof. In fact, as already mentioned, it is practically infeasible to collect sufficient evidence from in-traffic testing alone to support such high dependability requirements [6], [7], [8], [62].
Fourthly, to perceive and understand the surrounding environment, the ADS relies on Machine Learning (ML)-based components for central safety-critical parts of the system. Since the safety implications of such components, and how to validate them, remain elusive [10], [11], [12], [13], [62], [64], [65], it is difficult to assess the resulting safety of the system with such ML components integrated. These safety implications arise both from the lack of interpretability and explainability of such components (i.e. due to their black-box nature) and from potential issues with respect to robustness. The robustness issues are especially troublesome considering the vulnerability of ML-based components to adversarial attacks [66]. Furthermore, when considering end-to-end methods for motion planning (encompassing the entire driving task, from raw perception to control signals), the entire ADS would be based on ML components [53], further aggravating the safety implications.
Lastly, the software intensiveness of the development of an ADS (and, to some extent, of modern automotive systems in general) feeds the industrial shift towards agile development processes with frequent releases and continuous deployment, which, however, also leads to challenges for verification and validation. While this aspect is partly a result of an industrial and organisational shift, it also opens up for quick feedback and learning from operations, which will be paramount to cope with changes to the operational context of the ADS or shifting user needs [67].
The above discussion is summarised in the list below, with eight challenges for providing safety evidence for ADSs:
Uncertainties:
C-U-env
Uncertainties associated with the operational environment of the ADS,
C-U-inter
Uncertainties originating from the interaction of the ADS with other traffic participants,
Behavioural and structural complexity:
C-B-resp
ADS’s responsibility for all strategic, tactical and operational decisions of the driving task,
C-B-func
Complex set of interwoven functions and sub-systems required to realise an ADS,
C-B-adapt
Self-adaptation capabilities of the ADS, in particular, to cope with (temporary) degradations of the system,
Dependability requirements:
C-reqs
High dependability requirements imposed on the system, e.g. originating from a comparison with human performance, highlighting the contribution of corner and edge cases to the overall safety,
AI and ML components
C-AI
Validation of (black box) components relying on Artificial Intelligence (AI) and Machine Learning (ML),
Agile development and continuous deployment
C-agile
Frequent releases and continuous learning, due to a shift to an iterative and agile development process including software upgrades, requiring a reduction of the safety/assurance case compilation efforts (or of the re-verification and re-validation of the system).
While most of the challenges could be pertinent also to other safety-critical systems, challenges (C-B-resp) and (C-agile) stand out as particularly present and relevant for an ADS. Further, the challenges collectively provide a representative view of the difficulties of providing safety evidence for a class of highly automated CPSs acting in open environments, such as an ADS, compared to other safety-critical systems.
Design-Time Design Techniques
The way a system is specified, analysed, designed and implemented greatly contributes to safety evidence provision. In the following subsections we discuss some common design techniques and methods, as well as their limitations in relation to the challenges listed in Sec. III. The reader can also refer to Fig. 1 for an articulation of the different methods and the corresponding domain areas. TABLE II summarises the safety evidence provided by the methods discussed in this section.
A. Operational Design Domain (ODD)
According to [1], an ODD is defined as “Operating conditions under which a given driving automation system or feature thereof is specifically designed to function [...]”. The ODD thus provides the scope for the design intent of the system and delimits the design-time activities. Specifically, the ODD provides the context for the Hazard Analysis and Risk Assessment (HARA) and the conditions to consider when verifying and validating the ADS. While the ODD defines the design intent and the scope of the V&V activities, thus specifying what can be called the problem domain, it is difficult to ensure that this design intent adequately captures all operational uncertainties related to challenges (C-U-env) and (C-U-inter). However, an appropriate specification of the problem domain, as provided by the ODD, can simultaneously help avoid certain aspects of the same challenges by explicitly avoiding or limiting certain uncertainties. This aspect is also suggested by Thomas and Vandenberg [68] as a primary motive for the use of the ODD in the first place.
To leverage the benefits of using an ODD for scoping the V&V activities, however, an ADS needs to remain inside its ODD during operation. For that purpose, [69] presents four strategies to mitigate ODD exits and further discusses and elaborates on the ODD's ability to support the safety argumentation of an ADS. In more detail, the four provided strategies can be grouped based on the two following aspects:
relying on checking the conformance between the conditions of the requested mission (i.e. trip) and the ODD, before accepting the mission in the first place; and
using run-time monitoring of appropriate triggering conditions for avoiding exiting the ODD.
Such triggering conditions should provide a predictive indication of the violation of the ODD (i.e. the violation of an operating condition of the ODD). For example, while the ODD could include light rain, the ADS might not have been tested and verified for heavy rain. In such a case, a combination of a weather forecast with a local measurement of the rain intensity could be used as a predictive indicator (i.e. triggering condition) to avoid operating in heavy rain, as sketched below. Note that the need for run-time monitors is tightly linked to the concepts of Minimal Risk Conditions (MRCs) and Restricted Operational Domains (RODs), which are further discussed in Sec. VII-A.
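A minimal sketch of such a triggering-condition monitor for the heavy-rain example is given below; the threshold, margin and signal names are hypothetical assumptions for illustration only.
\begin{verbatim}
HEAVY_RAIN_MM_PER_H = 7.6  # hypothetical ODD limit on rain intensity

def odd_exit_imminent(forecast_mm_per_h: float,
                      measured_mm_per_h: float,
                      margin: float = 0.8) -> bool:
    """Predictive triggering condition for the heavy-rain ODD limit.

    Triggers before the operating condition is actually violated:
    either the forecast predicts heavy rain, or the local measurement
    approaches the limit (within the given margin).
    """
    predicted_exit = forecast_mm_per_h >= HEAVY_RAIN_MM_PER_H
    approaching = measured_mm_per_h >= margin * HEAVY_RAIN_MM_PER_H
    return predicted_exit or approaching

# A positive trigger would typically initiate a transition towards a
# Minimal Risk Condition or a Restricted Operational Domain (Sec. VII-A).
\end{verbatim}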
The operating conditions explicitly defined in the ODD, for which the ADS is designed and verified, can be matched against the operating conditions required by the intended real-world use cases, as suggested in [69]. Such matching of the operating conditions between the ODD and prospective real-world use cases could help facilitate incremental improvements (partly tackling challenge (C-agile)) and might further help in coping with restricted capabilities of the ADS (as formulated in challenge (C-B-adapt)). However, it is currently not clear how to manage versioning of the ODD across the life cycle of the ADS, as pointed out in [70]. Further, one major challenge in the use of an ODD is how to ensure that the information regarding the ODD is appropriately manifested throughout the ADS and the development cycle. This is important not only for the purpose of operating within the ODD, but also when it comes to understanding the connection between specific ODD limitations and the configuration and design decisions of the ADS. For example, a specific selection of sensors might require certain operating conditions, leading to an ODD limit. In consecutive software releases, such a sensor setup might be swapped, rendering it difficult to know how the rest of the system behaves with respect to conditions outside the previous ODD limit. Unless the knowledge and information of the ODD are appropriately linked and distributed throughout the ADS's components and subsystems, expansion outside the previous ODD would be unnecessarily onerous.
If the ODD information is adequately manifested throughout the ADS, this could greatly strengthen the evidence for the completeness of the V&V activities and the safety assurance of the system. Conversely, if that is not the case, it would be difficult to determine which V&V activities are sufficient for evaluating such completeness. Rather than addressing this challenge, many research works have focused on monitoring the functional boundaries of the ADS, i.e. the limits within which the function is intended to operate (as defined by, e.g., its ODD), e.g. [71], [72], on the definition of an ODD for the ADS, e.g. [73], and on what dimensions to consider in such a definition, e.g. [74], [75], [76]. Notably, an ISO standard [16] has been published to provide a set of considerations and taxonomies for the construction of an ODD. In [70], a comprehensive overview of the research on ODDs is presented.
B. Hazard Analysis and Risk Assessment
Hazard Analysis and Risk Assessment (HARA) is one of the central processes of standards such as ISO 26262 [20], whereby the hazards faced by the system are identified and their related risks are assessed in order to provide input to the design process of the system. The purpose of the subsequent design process is then to identify the root causes of the hazards, such that they can be removed, and/or to ensure that the resulting risks are reduced. The HARA is traditionally a manual effort, wherein all hazards and the associated risks are identified.
For (future) ADSs, however, this is no longer tractable, considering challenges (C-U-env), (C-U-inter), (C-B-resp) and (C-B-func). Indeed, challenges (C-U-env), (C-U-inter) and (C-B-func) make the enumeration of all hazards difficult (if not impossible) [77], [78]. Furthermore, challenge (C-B-resp) highlights the ADS's responsibility, whereby the autonomous system should have the ability to mitigate the hazards it might face even before they occur, thereby impacting the applicable hazards as well as the associated risks. The complexity of the ADS (related to challenge (C-B-func)) also imposes an obstacle to the safety-goal breakdown following the HARA activities. The ability of the ADS to handle degradations (relating to challenge (C-B-adapt)) further aggravates the issue of enumerating all hazards.
The effectiveness of a safety or risk analysis technique is, as of today, not clearly understood or quantified, as discussed in, e.g., [79], [80], which also implicitly suggest a correlation between the results of the analysis and the availability of experts with appropriate domain knowledge. For a novel system, such as an ADS, it is obviously difficult to gather such a collection of experts. The limitations of current methods have been demonstrated even for relatively simple systems, such as an Autonomous Emergency Braking (AEB) system, where the hazard analysis techniques System-Theoretic Process Analysis (STPA) [81] and Failure Modes and Effects Analysis (FMEA) [82] were shown to be insufficient for identifying all hazards [83] and might yield different sets of identified failures [84].
There have, however, been several suggestions for how the HARA process can better support the development of safe ADSs. For instance, Kramer et al. [85] suggest a method for the iterative and data-driven identification of hazards for ADSs, but this method still falls short with respect to achieving completeness of the provided set of hazards. In another work, Khastgir et al. [86] suggest a run-time alteration of the Automotive Safety Integrity Levels (ASILs) associated with each hazard, in the light of the tactical decisions made by the ADS, which also provides a method to guide those tactical decisions. This method relies on a high-integrity run-time hazard detection system, and consequently also does not address the completeness of the hazards. In [55], a tailoring of the HARA process is suggested by using a Quantitative Risk Norm (QRN) with consequence classes, thus relieving some of the burden of achieving a complete enumeration of hazards. The question of how to collect “sufficient” evidence to support the ADS's fulfilment of such a QRN, however, is still under debate.
For the purpose of the HARA, ML-based components (related to challenge (C-AI)) could be considered as any other subsystem [87]. However, some particular challenges arise in relation to classification accuracy and adversarial attacks, which is why Salay et al. [87] suggest an alternative analysis method called Classification Failure Mode Effects Analysis (CFMEA).
C. (Qualitative) Process Arguments
The outcome of the HARA and FSC/TSC steps of the V-model, e.g. of ISO 26262 [20] as depicted in Fig. 6, is a set of ASIL requirements allocated to a collection of subsystems or components. In ISO 26262 [20], a set of requirements, including qualitative process arguments, is prescribed to ensure the fulfilment of the ASILs. Even though such qualitative processes seem to jointly work for less complex systems, the exact quantitative contribution of each risk reduction method is, as already mentioned, not fully known. In fact, this holds true for the entire safety case approach, for which it is hard to prove the overall effectiveness and quantitative contribution to safety, as discussed in [67] and [80]. Regarding challenge (C-B-func), as the complexity of the system grows, so does the number of process arguments. Hence, if each proposed process leaves a shred of residual risk, this might eventually amount to a considerable contribution for the entire ADS. One way to circumvent this would be to transition from a focus on qualitative processes to a focus on quantitative ones. If one is to consider safety in a quantitative sense, however, the top-level claims also need to be prescribed as quantitative targets, for example according to a QRN [55].
Traditional development processes also do not suffice for tackling challenge (C-AI), related to the integration of ML-based components, even if some steps in this direction have been accomplished, as surveyed by Rabe et al. [88]. Notably, a W-model for learning assurance is suggested in [89, Fig. 6.1, p. 43], which might provide a stepping stone towards a design process for learning-based components. Assurance of Machine Learning for use in Autonomous Systems (AMLAS) [12] also provides guidance on how to incorporate ML-based components, by providing safety case patterns and adjoined processes for the systematic integration of safety when developing such components. Furthermore, Diemert et al. [90] suggest an extension of the current process approach by combining the traditional Safety Integrity Level (SIL) with the complexity of the tasks to be performed by the (AI/ML) system to render an “AI-SIL”.
D. Contract-Based Design (CBD)
Similar to the Hoare triple [91], Contract-Based Design (CBD) suggests expressing the interactions between elements (systems and components) in terms of contracts, expressing the preconditions that each element expects and under which the element can provide its postconditions. Given a suitable formalism, these contracts can be implemented and monitored at compilation, configuration or execution time, where a failed assertion of the contract would result in an exception. The authors of [92] provide a formalisation of Assume-Guarantee (A/G) contracts for system design, describing the preconditions (assumes) and postconditions (guarantees) of the system elements. A simple example of such contracts, allocated at component level, is depicted in Fig. 7.
A simplified example of contracts allocated to components (I), (i) and (ii). If the assumes (A) of the contract are fulfilled the associated component can guarantee (G) the output.
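To illustrate the run-time monitoring of such contracts, the sketch below wraps a component with assume/guarantee checks. This is only a toy rendering of the idea: the formalism of [92] is considerably richer, and the example component, signal names and bounds are hypothetical.
\begin{verbatim}
class Contract:
    """Runtime assume/guarantee checks around a component."""
    def __init__(self, assumes, guarantees):
        self.assumes = assumes        # predicate over the inputs
        self.guarantees = guarantees  # predicate over the outputs

    def wrap(self, component):
        def checked(inputs):
            if not self.assumes(inputs):
                raise AssertionError("assume (precondition) violated")
            outputs = component(inputs)
            if not self.guarantees(outputs):
                raise AssertionError("guarantee (postcondition) violated")
            return outputs
        return checked

# Hypothetical component: a speed estimator that assumes plausible
# wheel-tick counts and guarantees a bounded, non-negative speed.
speed_contract = Contract(
    assumes=lambda i: all(t >= 0 for t in i["wheel_ticks"]),
    guarantees=lambda o: 0.0 <= o["speed"] <= 70.0,
)

def speed_estimator(inputs):
    ticks = inputs["wheel_ticks"]
    return {"speed": 0.1 * sum(ticks) / max(len(ticks), 1)}  # toy model

checked = speed_contract.wrap(speed_estimator)
print(checked({"wheel_ticks": [10, 10, 10, 10]}))  # {'speed': 1.0}
\end{verbatim}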
CBD for safety-critical properties is explored in, e.g., [93], [94], whereas highly configurable systems have been considered in, e.g., [95]. A/G-like contracts are notably also the approach of Digital Dependability Identities (DDIs) [96] and ConSerts [97], which we discuss further in Sec. VII-B. Furthermore, Warg et al. [98] suggest using contracts at all abstraction levels of the ADS in order to achieve a continuous assurance case, thus mitigating challenge (C-agile). In [99], these methods are discussed in more detail with respect to their potential contributions to safety assurance in the scope of ADSs.
E. Supervisor Architectures
The architecture of an ADS is crucial for reaching dependability targets (e.g. [100]). The generic layout presented in Fig. 5 provides a basis for the common functionalities that need to be realised in an ADS. This generic view can also be deduced from more complex approaches, such as the ones analysed in [101] or the functional architecture proposed in [4, p. 68, Figure 27]. To form an architecture, these functionalities need to be mapped to a structure that meets various requirements. For an ADS, architectural design can be seen as having the goal of realising sufficient dependability while reaching performance, cost and scalability targets, subject to further constraints such as energy consumption, space, etc. (recall Fig. 6 for the position of architecture design within the development process). Given the inherent complexity of such an architecture, most proposals include some form of monitoring, redundancy and diversity to achieve the necessary availability and reliability [4], [100], [101], [102], [103], [104], [105]. Such capabilities are further discussed in this section. Discussions on particular metrics and ways to assess the risk incurred by the system during run-time are, however, deferred to Sec. VI. Similarly, detailed discussions on how detected limitations and faults should be handled through degradations are postponed to Sec. VII.
The ISO 42010 standard [106] defines “architecture” as “fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution”. Architecture design thus involves deciding on these principles and coming up with a design that meets the posed requirements while dealing with the involved trade-offs [107].
An ADS (its functions, software and hardware) will be subject to failures and unintended behaviours in several ways (related to challenges (C-U-env), (C-U-inter) and (C-B-adapt)), with relevance both for functional safety (faults in HW and SW) and for safety of the intended functionality (performance limitations and an incomplete understanding of the environment and of how it interacts with the ADS). Moreover, the high dependability requirements (i.e. challenge (C-reqs)), and the fact that an ADS in general does not have a fail-safe state, see e.g. [100], [103], lead to the question of what an appropriate, highly dependable, yet cost-effective architecture looks like.
Key elements of ADS architecture design include the consideration of relevant fault models (or fault hypotheses), suitable patterns, where to deploy error detection mechanisms, how to contain failures, and how detected errors should be handled. It appears generally accepted that traditional fault-tolerance concepts such as triple modular redundancy (as applied in flight controls [108]) would be too expensive and unable to deal with common-cause failures [100], [103]. The appropriate level of redundancy and diversity required for cost-effective designs remains an open issue, receiving increasing attention in both industry and academia, with many current proposals [109]. Further open research challenges include how environment perception sensors could be shared between different channels of the system (for cost-efficiency) and the level of independence of such channels (relating to potential common-mode failures) [103].
A common solution for realising a supervisory architecture is to use a nominal channel and a supervisory/safety/fallback/high-assurance channel [100], [103], [105], [111], an example of which is shown in Fig. 8. The idea is that a high-performance system (possibly of low dependability, or even relying on ML-based components) is monitored by a high-dependability component, and control is, when necessary, handed over to a supervisory channel (also of high dependability). For this solution to support the required high dependability of the system, i.e. addressing challenge (C-reqs), Kopetz [100] stresses the importance of each subsystem being its own Fault Containment Unit (FCU). Monitoring, and deciding when to switch to the supervisory channel, represents an essential ingredient of such solutions [4], [100], [101], [102], [103], [104], [105].
Version of the simplex architecture [110] in the ADS context. A nominal channel and a safety channel run in parallel; a monitor of the nominal channel and a decision module are tasked with switching between the two based on the input from the monitor. Both the monitor and the decision module could optionally be allocated inside the safety channel, as suggested in [103], or as separate components, as recommended in [100].
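The switching logic of such an architecture is conceptually simple, as the following minimal sketch suggests; the plausibility check, the envelope bound and the state flags are illustrative placeholders, not the concrete mechanisms of [100] or [103].

```python
def plausible(nominal_cmd, state, envelope=5.0):
    """Monitor: trust the nominal channel only while its command stays
    within a plausibility envelope and the channel reports as alive."""
    return abs(nominal_cmd) <= envelope and state.get("nominal_alive", True)


def select_channel(nominal_cmd, safety_cmd, state):
    """Decision module: forward the nominal command while the monitor
    trusts it; otherwise hand control to the high-dependability safety channel."""
    if plausible(nominal_cmd, state):
        return "nominal", nominal_cmd
    return "safety", safety_cmd


# One control cycle: the nominal channel proposes an implausible command,
# so control is handed over to the safety channel.
print(select_channel(nominal_cmd=9.3, safety_cmd=-2.0,
                     state={"nominal_alive": True}))   # -> ('safety', -2.0)
```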
Detecting failures is non-trivial given the inherent uncertainty in ADS operation (again related to challenges (C-U-env) and (C-U-inter)). Deriving supervision mechanisms (constraints, rules, etc.) for detecting failures represents an important area needing more research (see [112], [113]). The definition of such mechanisms, and of their specifications, faces some of the same challenges as those enumerated later in relation to the V&V methods of Sec. V, especially when deploying formal techniques [111] to realise the supervision. In particular, formally capturing the conditions and scenarios arising from the challenges related to uncertainty and to the behavioural and structural complexities of the ADS, while ensuring fulfilment of the high dependability required by challenge (C-reqs), poses an obstacle. Further, challenge (C-B-func) makes it difficult to anticipate all possible system states.
As a final note on supervisory architectures, Jha et al. [65] give an alternative approach for supervising the environment perception block using predictive processing, where the monitored quantities are the deviations of each sensor measurement from the constructed (internal) world model. The approach is still conceptual and a concrete implementation is yet to be provided, but it seems promising, as such a system directly incorporates redundancy in the sensor processing. This, the authors of [65] argue, could also provide a way forward to trust and rely on black-box algorithms, such as ML-based components, i.e. addressing challenge (C-AI).
Verification and Validation Methods
Verification and Validation (V&V) form an integral part of providing evidence of the safety of a system, not least with respect to the system’s fulfilment of its specifications. Riedmaier et al. [24] give a comprehensive overview of existing methods for V&V of ADSs, focusing on methods using scenarios in the assessment, but also provide an overview of safety-assessment approaches in general. Complementing this work, Wishart et al. [114] present a comprehensive list of current V&V activities within industry and academia. The authors of [9] argue that there exist an implementation, a specification and a testing gap when considering the validation of ADSs. The implementation gap arises from the discrepancy between the implemented behaviour and the specified behaviour, the specification gap manifests between the required and the specified behaviour, and the testing gap pertains to the difference between what is implemented (and tested) and what is required [9].
Additional works considering assessment and testing of ADSs include [25], [26]. Khan et al. [26] review safety testing of ADSs, identifying testing methods, tools and datasets, and provide a taxonomy of safety features for ADSs. We cover the testing methods identified by Khan et al. [26], but do not discuss the testing techniques (e.g. search-based testing) deployed to perform the testing. Further, Pang et al. [25] survey safety assessment methods for the decision making of ADSs, comparing different methods as well as discussing their respective limitations. The methods surveyed in [25] are covered primarily in Sec. V-C on scenario-based V&V.
The presupposition for V&V is the existence of a system (or at least an implementation at some abstraction level) and a specification that the system is expected to fulfil. In terms of supplying evidence for the assurance case of the system, there are four general caveats regarding the V&V activities:
The completeness of the specification vis-à-vis the real operational use case (c.f. the specification gap [9]),
The verification might not exhaustively verify the system with respect to the specification (i.e. not exhaustively uncovering potential implementation gaps [9]),
The scalability of the method, needed to achieve verification coverage of the complete ADS with respect to its specification, and
The complexity of the ADS will also propagate to the V&V methods and tools [49], implying enormous efforts to develop these and to ensure that they can be trusted, relating also to the topic of “tool qualification” in functional safety standards such as ISO 26262.
The impact of these four caveats on the remaining residual risk, after the complete V&V processes/activities, constitutes an open research question and one which, in the light of challenges (C-U-env), (C-U-inter), (C-B-resp), (C-B-func) and (C-reqs), impacts the overall assurance of the ADS.
In the sequel, we explore a representative set of approaches to V&V, as illustrated in Fig. 1, and discuss their limitations in the light of addressing the eight challenges of Sec. III. The different forms of safety evidence provided by the methods are summarised in TABLE III.
The observant reader will notice the absence of an explicit discussion of Fault Injection (FI) in this section. For the purpose of providing safety evidence, the use of FI faces similar challenges as those discussed for scenario-based testing, with the addition that FI scenarios also need to consider the time and location of the injected faults [115]. For a slight expansion on FI used for V&V of the decision-making and planning of ADSs, we refer the interested reader to [115, Sec. III].
Similarly, there is no explicit section on the different testing environments (e.g. physical, driving simulators or virtual). While such environments are central to the V&V activities, they do not add any safety evidence on their own; rather, the environment used depends on the testing method, and the methods are the focus of the following subsections. Additional evidence would, however, need to be provided for the reliability of the outputs from such environments, through so-called tool-qualification activities.
A. Field Operational Tests
In order to test a system in its real operational conditions, one can make use of Field Operational Tests (FOTs), where the system is tested directly within its operational environment. In the case of automated vehicles, this would include additional safety precautions, e.g. the use of safety operators. A FOT is arguably the validation method that gets the system closest to its real operating conditions, therefore capturing the uncertainties related to challenge (C-U-env) most accurately. Systems that do not make tactical decisions, such as Autonomous Emergency Braking (AEB) or Lane Keeping Assist (LKA), or (sub)systems merely providing input to tactical decisions, such as the perception system, can be evaluated in open loop. This entails running the system passively (without intervening) in a vehicle that is otherwise manoeuvred by a human driver (or another system), an approach sometimes referred to as shadow-mode testing. From an ADS perspective, while this could alleviate some testing gaps (corresponding to challenges (C-U-env), (C-B-func), (C-B-adapt) and (C-AI)), understanding the behaviour of the full system requires evaluating it in closed loop. By allowing the actions of the ADS (formulated in challenge (C-B-resp)) to be enacted, one can also measure the ADS’s interactions with its environment (corresponding to challenge (C-U-inter)). From a safety perspective, closed-loop verification is difficult, as it might be dangerous to rely on safety drivers as backup due to human performance issues, such as automation complacency [42], [43]. Furthermore, it is costly and difficult to achieve such testing at the scale required to provide sufficient evidence in relation to the high dependability requirements prescribed for an ADS [6], [7], [8], therefore failing to address challenge (C-reqs).

Collecting operational data from the field, as discussed later in Sec. VI-A, is closely related to FOTs, and further to the supervisory architectures discussed in Sec. IV-E. The difference between FOTs and operational data collection is that during a FOT the system is still under test, whereas operational data collection happens while the system is operating, presumably serving customers. Operational data of both sorts, however, supports the development of an ADS in several ways, in particular for:
characterising the Operational Design Domain (ODD) of the system [69], and consequently the specification of the system,
extrapolating the performance of the ADS through, for example, the use of Extreme Value Theory (EVT) models [116] (as discussed in Sec. V-B),
supplying operational data to be used for consecutive releases of the ADS [117] (as discussed in Sec. VI-A), and
serving as a baseline data set for (scenario-based) testing and evaluation [118], [119], [120], [121] (discussed in Sec. V-C).
A non-exhaustive list of large-scale FOTs is provided in [122, p. 4, Table 1].
B. Extreme Value Theory
Extreme Value Theory (EVT) focuses on modelling the tails of a probability distribution by considering the “extreme” events present in the data. Such modelling could be conducted as part of the V&V approach of the ADS, or be included in a fleet level monitor, as suggested in [123]. The data supporting EVT analyses could come from any source but, for the purpose of V&V, it is beneficial to draw upon data collected through FOTs or from the system in operation (i.e. operational data).
For the purpose of providing validation evidence for the ADS, one could consider different types of metrics to gauge the threats incurred by the system during operations. These metrics (or threat measures, as we will refer to them later in Sec. VI-B) are used as a proxy for estimating the risk faced by the ADS. Based on field data, such metrics can be calculated and the extreme events modelled through EVT in order to predict the fulfilment of a safety requirement, e.g. the collision rate of the ADS [116]. The use of an EVT method, such as the one of [116], does not require detailed models of the system itself and its operational environment, therefore alleviating challenges (C-U-env), (C-B-func) and (C-B-adapt); such detailed models are, in contrast, a prerequisite for the other V&V methods discussed in this section. Furthermore, EVT approaches also alleviate challenge (C-reqs) by extrapolating the ADS’s performance from the available data, e.g. from FOTs, therefore allowing for inference on the integrity of the system beyond the data collected.
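As an illustration of the peaks-over-threshold idea underlying such analyses, the following sketch fits a Generalised Pareto Distribution to the extreme tail of a synthetic threat metric and extrapolates a collision probability. The metric, the threshold choice and the synthetic data are illustrative assumptions and do not reproduce the method of [116].

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Synthetic per-encounter minima of a threat metric (seconds of margin);
# lower is more critical, and zero would correspond to a collision.
margins = rng.gamma(shape=4.0, scale=1.0, size=100_000)

u = 2.0                                   # threshold: margins below u are treated as extreme
exceedances = u - margins[margins < u]    # peaks over threshold, as positive values

# Fit a Generalised Pareto Distribution to the exceedances (location fixed at 0).
c, loc, scale = genpareto.fit(exceedances, floc=0.0)

# Extrapolated probability of margin <= 0 per encounter:
# P(margin < u) * P(exceedance >= u), the latter taken from the fitted tail model.
p_extreme = len(exceedances) / len(margins)
p_collision = p_extreme * genpareto.sf(u, c, loc=0.0, scale=scale)
print(f"extrapolated collision probability per encounter: {p_collision:.2e}")
```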
However, the results depend on the metric used [124], which might affect their validity with respect to the ADS’s actual failure rates in real traffic. It should be noted that this could impose a major limitation on the usefulness of EVT if an appropriately predictive threat metric cannot be selected.
C. Scenario-Based Verification and Validation Methods
Most situations occurring in traffic are relatively mundane and consequently bring little value to the testing of an ADS; FOTs are particularly susceptible to this phenomenon. Using simulations or directed testing could circumvent this by relying on scenario-based techniques or intelligent agents [121] as a means to expose the ADS to more relevant test cases, and in a safer way. Scenarios describe “the temporal development between several scenes in a sequence of scenes” [125] in order to provide a description of the “test case” that the system should be exposed to. Such scenarios can be:
generated from real data (as suggested in, e.g. [27], [119], [120], [126], [127], [128], [129]),
synthesised based on models (e.g. using adaptive generation [130] or advanced scenario generation techniques such as deep or reinforcement learning, as discussed in [115] and [121]),
derived using System Theoretic Process Analysis (STPA) [131], or
defined through expert knowledge (e.g. by using an ontology [132], [133]).
In all scenario generation approaches, the goal is to find relevant scenarios to challenge the ADS. One could note that, while not directly relying on defined scenarios, approaches using intelligent agents to challenge the ADS during virtual testing (e.g. [121]) exhibit the same benefits and shortcomings as scenarios. Thus, for the purpose of this paper, such methods are not discussed separately.
Ma et al. [115] provide an overview of scenario-based methods and note that different scenario generation methods allow for either unilateral or multiple interactions: some methods postulate a fixed (unilateral) action on the part of the other actors in the scene, whereas other frameworks (such as those relying on deep learning) allow interaction between the system under test and the other actors present. Further, Riedmaier et al. [24] give a comprehensive literature survey on scenario-based approaches, Sun et al. [23] review scenario-based test automation methods, and [134] systematically maps different methods for critical scenario identification.
Similar to the note in [135] regarding software testing, the authors of [24] point out that, despite the value and usefulness of scenario-based approaches, scenarios can only provide evidence for falsification of the system, or provide a means to construct test cases. To make the test results more granular, de Gelder and Op den Camp [136] explore how to quantify the uncertainties arising from the use of scenario-based testing.
Several studies have provided methods and discussions on the test coverage of the scenario space achieved through testing [24], [122], [134], [137], [138], [139], [140], but how well the scenario space represents the real-world operational conditions remains to be shown (i.e. how well it supports uncovering the testing gap [9]). Thus, an important factor of a quantitative measure of the residual risk after scenario-based assessment is still missing. Scenario-based methods do, however, provide an efficient means for falsification of the ADS, especially if coupled with search strategies rewarding critical scenarios, such as importance sampling [141], sub-set simulation [142], or densifying the critical states of intelligent agents [121].
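The following minimal sketch illustrates falsification over a parametrised scenario space using naive random search; the two cut-in parameters and the simulate() stub are hypothetical, and the search strategies cited above would replace the random sampler with more sample-efficient alternatives.

```python
import random

def simulate(cut_in_distance, cut_in_speed):
    """Stub standing in for a full closed-loop simulation of the ADS.
    Returns the minimum safety margin observed in the scenario (metres);
    a negative value means the scenario falsified a safety requirement."""
    return 0.4 * cut_in_distance - 0.6 * max(0.0, 15.0 - cut_in_speed)

def falsify(trials=10_000, seed=1):
    """Naive random search over a two-parameter cut-in scenario space,
    keeping the most critical (lowest-margin) scenario found."""
    rng = random.Random(seed)
    worst = None
    for _ in range(trials):
        d = rng.uniform(5.0, 60.0)    # cut-in distance (m)
        v = rng.uniform(0.0, 25.0)    # cut-in vehicle speed (m/s)
        margin = simulate(d, v)
        if worst is None or margin < worst[0]:
            worst = (margin, d, v)
    return worst

margin, d, v = falsify()
print(f"most critical scenario found: margin={margin:.2f} m at d={d:.1f} m, v={v:.1f} m/s")
```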
In Fig. 9, three different types of scenarios are shown in a two-dimensional scenario space: the scenarios applicable to the ADS given its intended operational domain, A, the scenarios generated for V&V, G, and the scenarios that would lead to safety-critical failures, C. The system could only be completely assured if A is fully contained in G and, in particular, if the intersection between C and A lies inside G. In effect, such a configuration (i.e. when (C ∩ A) ⊆ G) would mean that all safety-critical scenarios applicable to the ADS are among those generated for V&V.
Illustration of the “scenario space”: A corresponds to all possible scenarios in the intended operational design domain of the ADS, G to the identified scenarios, and C to the safety-critical scenarios for the system.
In addition to generating the (concrete) scenario itself, the scenario needs to be executed in a (virtual) testing or simulation environment. Such testing environments could target different abstraction levels of the ADS (c.f. the different abstraction levels of the V-model, Fig. 6) [143]. For example, Szalay [143] presents an X-in-the-loop framework for virtual testing, where X denotes software, hardware, the full vehicle, or a combination thereof. The selected component (X) is then tested in the loop against a virtual simulation environment [143]. If the environment perception (recall the overview of the ADS in Fig. 5) is included in the virtual simulation environment, synthetic behaviours might need to be complemented with synthetic sensor data when executing the scenario. Similarly, if one wants to evaluate the vehicle control component, there is a need for accurate and precise models of the underlying physical phenomena (e.g. friction). The general problem of transferring training and validation performance from simulations to the real world, called Sim2Real [144], forms part of the engineering and tool-qualification efforts that, in turn, provide the enablers required for using evidence garnered from simulation-based methods.
While the major benefits in terms of testing efficiency are to be gained from virtual testing, scenarios can also be used for testing on proving-grounds. Further, scenarios can also be used to assess testing/validation completeness from field data collected through FOTs or even operational data collection.
It is difficult to generate scenarios that capture all uncertainties of the ADS’s environment and its interactions with other road users (outlined in challenges (C-U-env) and (C-U-inter)). In fact, for non-trivial ADSs these aspects result in an infinite scenario space. Further, the high dependability requirements of the ADS (challenge (C-reqs)) mean that rare scenarios, corner cases and edge cases will have an impact on the assurance of the ADS. Hence, capturing all relevant rare scenarios is paramount to ensure sufficient reliability of the tested ADS. However, covering all relevant rare scenarios would either require huge amounts of driving hours [6], [7] or, if scenarios are generated from expert knowledge, risk being irrelevant from an exposure perspective and resulting in worst-case assumptions (corresponding to the excessively generated scenarios, G \ A, in Fig. 9).
When testing black-box components such as neural networks or other ML-based components, as per challenge (C-AI), the vastness of the scenario space similarly represents a problem. Indeed, it is challenging to decide on an appropriate fidelity of the virtual testing environment as well as an appropriate granularity for testing across the scenario space, as the validity of interpolating the results is unclear [145], [146]. While some approaches could partly mitigate this, the underlying coverage problem for ML-based components remains open.
D. Formal Methods
Formal methods provide a means to perform verification of the system. Relying on models of the system as well as of its operational context, such methods verify the system’s fulfilment of its specification. Riedmaier et al. [24] give an overview of existing methods for safety verification of ADSs [24, pp. 12-13] and distinguish between three branches within formal verification: theorem proving, reachability analysis and correct-by-construction synthesis. In [147], a complementary classification of automatic formal methods for automotive systems that provide some guarantee of quality is presented, including abstract interpretation, model checking and deductive methods. Abstract interpretation methods assume an approximation of the system in order to support the verification, whereas deductive methods correspond to the theorem-proving category of [24]. Model checking and automated theorem proving are also acknowledged by Mehdipour et al. [2]. In [148], the use of formal methods to ensure compliance with the rules of the road for ADSs is reviewed, and the formal verification category then also includes barrier certificates as well as worst-case behaviours. Further, Mehdipour et al. [148] distinguish between formal verification, monitoring and formal synthesis. These three methods can all, in some regard, contribute safety evidence of the ADS’s compliance with the rules of the road.
One of the key strengths of formal methods is their potential to allow for exhaustive verification. While formal methods can assess the system’s fulfilment of its specification, it might nevertheless be resource-intensive to transfer the results from the assessment of one part of the system to other parts. A modular design would circumvent such problems of transferring assessment results, suggesting the usefulness of CBD.
Despite the merits and advantages of formal methods, one can identify several potential difficulties and limitations with respect to the safety assurance of an ADS, as detailed below. Indeed, it may be difficult to:
construct a complete specification with respect to the real operating environment of the ADS, related to challenges in the categories of uncertainty, and behavioural and structural complexity, which may also be exacerbated by challenge (C-reqs),
scale such methods to cover the entirety of the ADS, corresponding to challenge (C-B-func),
ensure validity of the formal verification when considering ML-based components [24, Figure 8], corresponding to challenge (C-AI). This is especially relevant considering the lack of interpretability and explainability of the provided outputs from such components, and
ensure the correctness of the models and parameter values used in conducting the verification, which again stem from the challenge categories of uncertainty, and behavioural and structural complexity.
It is worth noting that correct-by-construction synthesis could be considered a design technique rather than a V&V method. However, the challenges it faces are the same as for the other methods within the formal verification domain, which is why it is discussed in this section rather than in a separate section of Sec. IV.
Formal Rules for Driving Behaviour: One approach to adopting formal methods in support of ADS development is to define formal rules for how the ADS should behave, in order to ensure that the system is never at fault in case of an accident. Such approaches are reviewed in [148]. More specific examples of approaches tackling such problems include Mobileye/Intel’s “Responsibility Sensitive Safety” (RSS) [149], Nvidia’s “Safety Force Field” [150], the “rulebooks” approach taken by nuTonomy [151] and Arechiga [152], and worst-case assumptions for collision avoidance [153].
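As a concrete example of such a formal rule, the following sketch implements the RSS minimum safe longitudinal distance from [149]; the parameter values are illustrative, and, as discussed below, their validity is itself an assumption that must be assured.

```python
def rss_min_gap(v_rear, v_front, rho=0.5,
                a_max_accel=3.0, a_min_brake=4.0, a_max_brake=8.0):
    """RSS minimum safe longitudinal distance (m) [149]: the rear vehicle
    may accelerate at a_max_accel for the response time rho before braking
    at (at least) a_min_brake, while the front vehicle may brake at up to
    a_max_brake. Speeds in m/s; the parameter values here are illustrative."""
    v_rear_worst = v_rear + rho * a_max_accel
    gap = (v_rear * rho
           + 0.5 * a_max_accel * rho ** 2
           + v_rear_worst ** 2 / (2.0 * a_min_brake)
           - v_front ** 2 / (2.0 * a_max_brake))
    return max(0.0, gap)

# Two vehicles at 25 m/s (90 km/h) require roughly 62 m with these parameters.
print(f"{rss_min_gap(v_rear=25.0, v_front=25.0):.1f} m")
```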
There are four key limitations to keep in mind when using (formal) rules or specifications, as detailed below:
the methods assume that other traffic participants follow (the same) set of rules, which might not be the case with human drivers,
the methods (implicitly) define who is to blame for an accident. While human drivers tend to naturally help out and collectively avoid accidents, rigidly following a set of rules might instead inhibit such collaborative avoidance by the ADS, therefore increasing the overall number of accidents,
the approaches rely on assumptions on the parameters used in the models and rules. For instance, RSS [149] implicitly relies on assumptions regarding the vehicle’s braking capabilities, as well as those of the surrounding vehicles [154]. Ensuring that these assumptions are correct in all operational conditions of the ADS is central to safety, as a mismatch could yield safety issues, as discussed in [154], and
accurately estimating the system’s parameters (as well as those of the operating environment) is difficult and one is often left with making worst-case assumptions, which could yield a system that is unable to operate due to an overly pessimistic view of the system’s capabilities [39].
Run-Time Risk Assessment
Despite the existing design techniques and the methods for verification and validation covered earlier in this paper, the eight challenges outlined in Sec. III also warrant efforts to uphold the safety of the system during operations. We divide run-time methods into two parts: the first pertains to risk assessment (treated in the following), and the second, covered in Sec. VII, to how the system adapts to run-time information. These two parts are closely related, in that the output of the risk assessment is consumed by, and guides, the adaptation. Further, the available adaptations of the system impact and determine which metrics should be monitored and assessed during operations. The monitoring and degradation capabilities of the system are themselves reflected in the deployed architecture.
Run-time monitoring of the ADS can have several purposes. Firstly, it can support appropriate collection of operational data, as further discussed in Sec. VI-A, which is also associated with fleet level monitoring of the ADS’s safety. Secondly, run-time monitoring can provide essential information for assessing the operational risk related to run-time (safety) supervision, effectively providing necessary input to the (tactical) decision-making of the ADS.
There are several reasons for having run-time monitoring (beyond operational data collection). To support the discussions in this section, we consider three different types of run-time monitoring, with slightly different purposes:
(i) Contribute to safe tactical decisions despite internal errors or (unexpected) changes to the operating environment, while in the ODD,
(ii) Cope with (more permanent) system degradations, and
(iii) Avoid leaving the ODD.
The first two types of monitoring are related to the system’s fault tolerance, where the focus is to identify errors and faults and to establish appropriate counter-measures, in order to avoid safety-critical failures. The latter type (iii) might not be considered a means for fault tolerance, but nevertheless contributes to avoiding the safety-critical situation of operating outside the ODD. One way of viewing this monitoring problem is by partitioning the operational space into safe, warning and catastrophic states [111, Fig. 1]. With respect to the different monitoring considerations (i)–(iii) above, the states that one needs to avoid are slightly different.
Firstly, regarding (i), the focus concerns (“catastrophic” and “warning”) states related to (temporary) errors in any one of the ADS’s sub-systems, or to (unexpected) changes to the operating environment; either may result in an accident or a requirement violation if left unmitigated. Such states can, for example, be reached due to an erroneous perception of the world by the Environment Perception (EP) block (see Fig. 5), an erroneous plan by the Decision Making (DM) block, or unsuitable path following by the Vehicle Control (VC) block, all of which might lead to a violation of a safety requirement. A change to the operating environment could also lead to a catastrophic state without the presence of any internal failure, such as another road user bending traffic rules in very unexpected ways (beyond what was predicted by the EP block).
As for (ii), the “catastrophic” states are similar to those caused by the internal errors considered in (i), but pertain to challenge (C-B-adapt), associated with more permanent or larger degradations, such as the (permanent) loss of a sensor, reduced braking capabilities or limited computational resources. This type of failure might also render the ADS unable to safely fulfil its strategic mission. One might consider using Restricted Operational Domains (RODs) as a means to analyse and cope with situations of system degradation, as elaborated in Sec. VII-A. The capabilities of the ADS to monitor its own system performance are largely dependent on the architecture as well as on the requirements imposed on each of the subsystems and components. Architectural considerations have already been covered in Sec. IV-E; more detailed discussions on component requirements supporting internal monitoring are left for future work.
Lastly, the purpose of (iii) is to avoid the “catastrophic” state of operating outside of the ODD, which can be mitigated by employing the ODD-strategies given in [69] and, where appropriate, transitioning into a Minimal Risk Condition (MRC).
Note that run-time verification [155] would provide a means for run-time assessment of the system, especially in relation to the fulfilment of the specification by the present system configuration. However, such methods face similar challenges as the formal methods already discussed in Sec. V-D and will thus not be given a separate subsection here.
TABLE IV presents the different safety evidence corresponding to the methods discussed in this section.
A. Operational Data Collection
Beyond monitoring for the purpose of adaptation at run-time, there is also a need to collect operational data for the purpose of safety assurance. For the purposes of our discussion, we can distinguish between two types of fleet-level monitoring capabilities:
(a) monitors for the collection of operational data (e.g. similar to gathering data from FOTs, see Sec. V-A), and
(b) monitors for assurance reasons and for inhibition/recall of the system, contributing to the containment of the residual risk.
The first, (a), is tightly linked to the V&V of the system, while the second, (b), rather highlights the need to monitor the assumptions made at design-time in order to assess the system’s safety. While (a) refers to the general case of data collection for (off-line) modelling, (b) refers to the monitoring of specific indicators for the purpose of assurance processes carried out centrally, i.e. not in the vehicle itself. Both capabilities complement the monitoring capabilities discussed in the introduction of this section, the main difference being that the data is collected for central or offline analyses. Both aspects are further elaborated in this section.
Several of the assurance approaches discussed in [99] require type (b) monitoring of the system, its operational contexts and/or some Key Performance Indicators (KPIs) [156], [157], [158]. Concrete measures and metrics underpinning such KPIs are elaborated upon in Sec. VI-B. In the dynamic safety case concept of Denney et al. [156], it is acknowledged that monitoring is to be done both inter-mission, corresponding to our second category (b) above, and intra-mission, corresponding more to the type of threat and risk assessment discussed in the following sections and relating to monitoring capabilities (i)–(iii) above.
Asaadi et al. [158] suggest monitoring certain indicators that can be analysed to identify trends and shifts, and to trigger an update of the assurance case. Similarly, the UL 4600 standard [17] suggests monitoring Safety Performance Indicators (SPIs) of the system. These leading measures should give a predictive indication of when the system might operate unsafely and spur appropriate actions to mitigate the associated risks.
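A minimal sketch of such fleet-level SPI monitoring is given below; the indicator, window length and limit are illustrative assumptions rather than prescriptions of UL 4600 [17] or [158].

```python
from statistics import mean

def spi_alert(history, limit, window=30):
    """Flag an adverse Safety Performance Indicator: trigger when the
    windowed mean of a leading indicator (e.g. hard-braking events per
    100 km) exceeds its limit, or when the window trends monotonically worse."""
    recent = history[-window:]
    if len(recent) < window:
        return False                           # insufficient evidence so far
    worsening = all(b >= a for a, b in zip(recent, recent[1:]))
    return mean(recent) > limit or worsening

history = [0.8 + 0.01 * i for i in range(40)]  # a slowly worsening indicator
if spi_alert(history, limit=1.5):
    print("adverse SPI trend: trigger re-assessment of the assurance case")
```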
The first type of monitoring, (a), on the other hand, particularly supports an increased release cadence as well as promoting a learning cycle (related to challenge (C-agile)) by improving the basis on which models, analyses and designs are founded [67]. Continuously adding new data from operations also provides a basis for retroactively fulfilling a statistical proof of the high dependability requirements [51], [159], pertaining to challenge (C-reqs). Further, systems for retroactive in-vehicle assessment could also be used for validation purposes (see e.g. [160], [161], [162]), notably for ML-based components of the perception system, and thus reduce the gap of challenge (C-AI). Incorporating this type of operational data is also beneficial for capturing changes to traffic (behaviour) due to the increased penetration of technologies such as ADSs, or other types of unknowns [9].
Even though collecting and incorporating operational data into the development process helps to ameliorate all of the challenges of Sec. III, it should be noted that it might be difficult to ensure the applicability of the data collected. For example, it might be hard to ascertain that data collected with a previous version of the ADS is also useful for the next generation of the system. Thus, frequent releases (challenge (C-agile)) might in fact limit the usability of operational data. Further, interruptions of the driving task by a driver-initiated hand-over could potentially impact the validity of such data. Additionally, the collection of operational data could be challenging for the following reasons:
potential lack of computational resources for run-time evaluation,
limited transmission bandwidth, requiring a careful selection and curation of the data, and
limited predictive power of the KPI/SPIs resulting in limited risk reduction, related to monitoring activity (b).
B. Threat Assessment Techniques
Within the domain of ADASs, assessing the collision threat is an integral part of being able to trigger appropriate corrective measures that avoid collisions through driver support functions. While such corrective measures are relevant for an ADS as well, ideally many of these situations should be avoided altogether by appropriate tactical decisions from the ADS. The formal rules for driving behaviour (discussed in Sec. V-D), for example, rely heavily on the appropriate use of Threat Assessment (TA) metrics to initiate (early) corrective actions that uphold the postulated formal rules with respect to the chosen TA metric(s). Thus, TA techniques are relevant for the safe operation of an ADS but, perhaps more importantly, they are central to some of the approaches discussed earlier in this paper, namely in the sections on EVT (Sec. V-B) and operational data collection (Sec. VI-A). When it comes to providing input to the adaptation of the tactical decisions of the ADS (beyond formal rules), Dynamic Risk Assessment (DRA) has been proposed to provide a more nuanced and accurate situational awareness than TA. DRA is discussed more in-depth in Sec. VI-D.
There exist a number of literature overviews focusing on threat or criticality metrics and TA [32], [163], [164], [165], [166], [167]. For instance, a comprehensive analysis of different metrics for collision avoidance is provided in [163, Table 3.1, pp. 44–48]. In addition to listing the metrics, [163] also provides a short description of which situation each metric targets, what assumptions are made in terms of prediction models, and what actions and situations such metrics aim to capture. This exposé is similar to those provided by Wishart et al. [164] and later Westhofen et al. [165]; the latter further provide a detailed visualisation of the interrelations between different criticality metrics [165, Fig. 5].
The authors of [32] review surrogate safety measures for ADSs and conclude that the available metrics are appropriate for estimating the relative safety performance of ADSs. Lefèvre et al. [168] also present a survey on motion prediction and risk assessment for intelligent vehicles, dividing the methods into three categories: physics-based, manoeuvre-based and interaction-aware motion models. Lefèvre et al. [168] conclude that while the latter is the most refined, it faces issues with computational complexity due to the high number of considerations, consequently inhibiting run-time applications (at least at the time the survey was conducted). This obstacle in particular is addressed in [169], where a new risk assessment methodology is provided, merging a “network-level” collision estimate (i.e. the risks estimated across the traffic and road system) with an estimate at vehicle level. More precisely, this approach integrates a dynamic Bayesian network and interaction-aware motion models [169], not too different from the DRA system suggested in [170], which is further discussed in Sec. VI-D.
Extending traditional criticality metrics, Kusano and Victor [171] suggest a method for determining the maximum injury potential of a given situation.
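To ground the discussion, the following sketch computes time-to-collision (TTC), one of the classic criticality metrics covered by the surveys above, under the constant-velocity (physics-based) prediction model; the scenario values are illustrative.

```python
def time_to_collision(gap, v_ego, v_lead):
    """TTC (s) for a car-following situation under a constant-velocity
    prediction: the time until the gap closes, infinite if not closing.
    gap in metres, speeds in m/s."""
    closing_speed = v_ego - v_lead
    if closing_speed <= 0.0:
        return float("inf")       # not closing: no predicted collision
    return gap / closing_speed

# Ego at 30 m/s closing on a lead vehicle at 22 m/s, 40 m ahead -> TTC = 5 s.
print(time_to_collision(gap=40.0, v_ego=30.0, v_lead=22.0))
```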
Naturally, these different threat metrics come with their own set of assumptions and limitations. In the sequel, we will nevertheless try to assess their collective ability to alleviate the challenges of Sec. III.
C. Out-of-Distribution Detection
In order to integrate and trust AI/ML-based components, their abilities, performance and failure modes naturally need to be properly handled. It is difficult, however, to assess the abilities of neural-network-based algorithms, as small changes in input might drastically alter their results [145]. Further, the estimated accuracy and performance of such components are measured on a validation set representing the intended operational domain. Thus, to be able to rely on performance estimates from such validation sets, it is paramount that AI/ML-based components operate on samples from the same distribution that they have been trained on. For that purpose, anomaly or Out-of-Distribution (OoD) detection approaches can be used [172], [173], [174], [175]. Such methods assess the similarity between a sample and the training data of the AI/ML-based component; if an anomaly is detected (i.e. the dissimilarity is too large), the system can decide not to rely on the output of the AI/ML-based component. See [28, Sec. 4.1] for a comprehensive review of OoD detection methods.
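As one concrete instance of the detector families reviewed in [28, Sec. 4.1], the following sketch scores samples by their Mahalanobis distance to the training feature distribution; the feature dimensionality, the synthetic data and the threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
train_features = rng.normal(size=(5_000, 8))   # stand-in for training-set features

mu = train_features.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_features, rowvar=False))

def ood_score(x):
    """Mahalanobis distance of a sample's feature vector to the training
    distribution; larger values mean more dissimilar to the training data."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

threshold = 5.0                      # e.g. calibrated on held-out in-distribution data
sample = rng.normal(loc=4.0, size=8) # a shifted, anomalous input
if ood_score(sample) > threshold:
    print("OoD detected: do not rely on the ML component's output")
```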
An alternative to OoD detection is using a network that directly rejects unrelated open-set inputs [176], whereby outputs are produced only for inputs from within some defined set. The proposal in [176] could be seen as a means for a network to operate under a contract, requiring the inputs to come from the specified set.
D. Dynamic Risk Assessment
By allowing the system to dynamically adapt to the current situation, one can circumvent the need to completely assure the tactical decisions of the ADS at design-time. This approach also avoids the need for worst-case assumptions or hard limits on operational parameters (as would be the case when using formal rules, as previously discussed). In practice, such dynamic adaptation relies on creating situation awareness, according to which the ADS can modulate or adapt its behaviour [177]. Through this adaptation, the ADS could achieve improved performance while ensuring safety [178]. Situation awareness is constructed from the perceived surroundings of the ADS and prediction models of how the current state will evolve [177], as well as from knowledge of the capabilities of the own system [179].
If such situation awareness is used for adapting the behaviour of the ADS, one could view the action space of the ADS as being restricted by the system’s capabilities, including the uncertainties of the perception system, and by the surrounding environment. An example of this restriction of the action space is shown in Fig. 10. For example, to account for uncertainties in the steering and braking capabilities of the ADS (e.g. due to varying road friction), the possible action space (in terms of what the system can “reach” given its current capabilities) needs to be limited from the nominal set I to a restricted subset, as illustrated in Fig. 10.
The action space (striped region) is limited by the internal capabilities of the ADS and by the surrounding environment.
Intention prediction is the common name for the task of predicting the movements of other traffic participants, for which Brown et al. [180] present a taxonomy constructed around four core tasks: state estimation, intention estimation, trait estimation and motion prediction [180]. Furthermore, the authors of [180] acknowledge that risk estimation constitutes an auxiliary task of the modelling. Generally speaking, these types of models rely heavily on ML-based algorithms and could provide a means for improving situation awareness. This weight on ML-based algorithms can be seen in the review of trajectory-planning methods presented in [181], where the four categories of methods considered are based on: physics, classic ML, deep learning, and reinforcement learning.
Dynamic Risk Assessment (DRA) is an integral part of the proposed framework for Dynamic Safety Management (DSM) [178] (discussed in more detail in Sec. VII-C). DRA relies on situation awareness to support risk-aware run-time decision-making of an ADS, e.g. [170], [177], [182], [183]. In addition to situation awareness, DRA capabilities need to connect a given situation to the safety requirements of the system, or at least to some kind of (quantitative) risk measurement. Similar to the supervision methods discussed in Sec. IV-E and the TA metrics discussed in Sec. VI-B, Reich and Trapp [182] and Feth et al. [183] suggest using risk metrics as a proxy for deducing the current (dynamic) risk. This connection to a quantitative measurement of risk (and optionally to the safety requirements of the system themselves) sets the DRA approach apart from the TA metrics discussed earlier.
In [183], DRA is performed by three parallel components, one for each integrity level (low, mid and high), corresponding to the integrity levels with which each (sub)system has been developed, i.e. by following ISO 26262. In the approach of [183], the Environment Perception (EP) block (see Fig. 5) also reports with respect to these three integrity levels (in line with the proposal of [184]), and each of the DRA components consumes the EP outputs corresponding to its integrity level. Considering our previous discussion in Sec. IV-C on the quantitative contributions of process arguments related to different integrity levels, the question remains how well the approach of [183] helps quantify and limit the risks of the ADS.
Dynamic behaviour risk assessment is further elaborated upon in the thesis [163], where the connection between safety supervision and such a DRA method is also explored.
In [182], a framework for Situation-Aware DRA (SINADRA) is outlined; this framework is further substantiated and exemplified in [170]. Notably, SINADRA makes use of probabilistic environmental knowledge to support the risk assessment, achieved through a pipeline in which situation class detection is followed by behaviour prediction (c.f. intention prediction above) and the generation of correlated trajectory distributions [170]. The final risk estimation is based on the approach outlined in [185], which provides a means to connect the estimations to the (dynamic) risk faced by the ADS.
Run-Time (Self) Adaptation
Having presented different risk assessment techniques in the previous section, we now focus on the task of adapting the ADS’s behaviour based on such situation awareness.
One definition of self-adaptation, adopted in this work, holds that the system adapts its behaviour to its environment and context [186]. For an ADS, this adaptation can be viewed at different levels. The ADS is explicitly designed to be self-adaptive in the sense of avoiding collisions with other (dynamic) objects, following the road, etc., effectively integrating monitoring aspect (i) discussed in Sec. VI. However, the system can also adapt the way these objectives are fulfilled, and it can possibly adapt its available capabilities and features, formulated as challenge (C-B-adapt); the latter aspect corresponds to monitoring aspects (ii) and (iii) discussed in Sec. VI. For the clarity of the following discussion, let us distinguish between three notions of self-adaptiveness of an ADS:
(a) Adaptation to changed user requirements, to unexpected changes in the operational context, or to changes of the system state (i.e. degradation), in terms of available services, features and capabilities (c.f. monitoring aspects (ii) and (iii) in Sec. VI),
(b) Operational adaptation enabling the fulfilment of certain (safety) objectives (c.f. monitoring aspect (i) from Sec. VI), and
(c) Adaptation of the assumptions and models used for determining the (safety) objectives.
The focus of (a) is on situations where the user requests changes to the mission, or where system-level changes or degradations result in a need for adaptation. (b), on the other hand, regards the abilities of the ADS to avoid obstacles in its environment as well as to account for the intentions and predicted behaviours of other traffic participants. Thirdly, (c) refers to changing the admissible risk levels in the light of the present operating conditions. These three adaptation notions are discussed in the following subsections in relation to the reviewed methods on run-time (self-)adaptation.
While run-time adaptation could provide significant efficiency and performance gains for the ADS, it should be noted that incorporating run-time adaptation (including additional dynamics, switching algorithms, etc.) adds complexity. Furthermore, the adaptation mechanisms will themselves be highly safety-critical: if they fail, the whole system is at risk, and failures in the adaptation mechanisms may also introduce new hazards and system-level failure modes. Thus, specific care must be taken in developing such safety-critical run-time features. In practice, the design, development and V&V of these mechanisms would need to draw on the methods from the previous sections of this review in order to provide appropriate safety evidence. This is not too dissimilar from the tool qualification needed for the testing and simulation environments used for V&V, as discussed briefly in Sec. V.
In TABLE V, the different types of safety evidence provided by the run-time adaptation methods are given.
A. Degradation Strategies
Unexpected changes to the operating context that risk violating the ODD, or severe system degradations (relating to the first type of adaptation, (a)), are commonly mitigated through a fail-safe state [57], i.e. by transitioning to a Minimal Risk Condition (MRC). However, simply stopping the ADS upon each and every (small) change to the system’s capabilities might be neither safe nor desirable. Hence, the concept of Restricted Operational Domains (RODs) has been proposed in the literature [187].
The MRC is a “stable stopped condition at a position with an acceptable risk [...] The ADS is brought to this state by the user or the system itself, by performing the Dynamic Driving Task Fall-Back (DDT-FB), when a given trip cannot or should not be completed” [188]. Using the MRC reduces the likelihood of an ODD exit [69] and avoids operating the system under severe degradations that inhibit the fulfilment of the original, user-defined strategic mission [188]. In Fig. 11, the relationship between the MRC and the different levels of decision-making is illustrated as a decision hierarchy.
Depicts a decision hierarchy pertaining to the use of an MRC, adopted from [188]. The ODD governs the overall context and the available decisions in the lower layers. These decisions might further be restricted through a ROD upon system degradations. When the MRC takes precedence over the user-defined strategic goal, the Dynamic Driving Task (DDT) is aborted and the DDT Fall-Back (DDT-FB) is initiated.
To avoid abandoning the strategic mission of the ADS upon any given system degradation, Colwell et al. [187] suggest using a ROD, which encodes the operational domain of the “new” system after the degradation. The ROD could thus effectively help determine whether it is feasible to safely fulfil the strategic mission despite the system degradation, or whether the mission should be abandoned in favour of an MRC. The relationship between the MRC and the ROD is elaborated upon in [188], where the contribution of such concepts to safety assurance is also discussed.
Fu et al. [189] present a distributed safety mechanism concept that provides multiple layers of monitoring and enables degradation policies for the ADS. Degradation strategies may range from a reduced driving envelope (e.g. corresponding to a ROD) all the way to a worst-case, immediate stop (corresponding to a highly restrictive MRC). It is worth mentioning that any degradation strategy will require sufficient internal capabilities of the ADS, as well as architectural support (e.g. sensors and computing to actually carry out the required manoeuvre). In [190], an architecture with such support is outlined, and the planning of an appropriate safe stop is formulated as an optimal control problem.
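The following minimal sketch illustrates the decision logic implied by the ROD concept: a degradation restricts the operational domain, and the strategic mission is continued only if it still fits within the restricted domain; otherwise the DDT-FB to an MRC is initiated. The capability names and limits are illustrative, not taken from [187] or [188].

```python
FULL_ODD = {"max_speed": 25.0, "night": True}

def restricted_domain(odd, degradations):
    """Derive a Restricted Operational Domain from the current degradations."""
    rod = dict(odd)
    if "camera_degraded" in degradations:
        rod["night"] = False                    # no night operation without full vision
    if "brake_derated" in degradations:
        rod["max_speed"] = min(rod["max_speed"], 15.0)
    return rod

def continue_or_mrc(rod, mission):
    """Continue the strategic mission only if it still fits inside the ROD;
    otherwise perform the DDT fall-back to a Minimal Risk Condition."""
    fits = (mission["speed"] <= rod["max_speed"]
            and (not mission["at_night"] or rod["night"]))
    return "continue under ROD" if fits else "DDT-FB to MRC"

rod = restricted_domain(FULL_ODD, {"brake_derated"})
print(continue_or_mrc(rod, {"speed": 20.0, "at_night": False}))  # -> DDT-FB to MRC
```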
B. Run-Time Certification
As the space of all possible configurations, in relation to adaptation type (a), could be vast, it has been proposed to shift the certification of the specific configuration of the system from design-time to run-time [191]. As more evidence of the system’s operational capabilities and context is available at run-time, shifting parts of the certification task to run-time facilitates an accurate evaluation of the certificates. Schneider and Trapp [97] elaborate on conditional safety certificates (ConSerts) (originally proposed in [192]) with the purpose of providing such run-time certification for open adaptive systems. In essence, ConSerts represent contracts with demands (c.f. the assumes of contract-based design, Sec. IV-D) under which the subsystem guarantees the supply of a specific output. The ConSerts of a (run-time configured) system are evaluated at run-time in the light of the available run-time evidence, in order to assess the applicability of a particular configuration. If one configuration is found to be invalid, Schneider and Trapp [97] propose to continue evaluating the next available configuration of the system, suggesting the existence of some hierarchy of system configurations. As such, ConSerts provide a potential way of managing system degradations (related to challenge (C-B-adapt)).
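A minimal sketch of such hierarchical run-time evaluation is given below: each configuration carries demands (assumes) checked against run-time evidence, and the first valid configuration in the hierarchy is selected. The configuration names and evidence keys are illustrative, not the ConSerts formalism of [97].

```python
# Configurations ordered from most to least capable; each guarantees a
# service level only if all of its demands hold in the run-time evidence.
consert_hierarchy = [
    {"config": "full_ads", "guarantee": "highway_pilot",
     "demands": ["lidar_ok", "hd_map_fresh", "clear_weather"]},
    {"config": "degraded", "guarantee": "reduced_speed_pilot",
     "demands": ["lidar_ok"]},
    {"config": "minimal", "guarantee": "minimal_risk_manoeuvre",
     "demands": []},  # always-available fallback
]

def evaluate_conserts(evidence):
    """Return the first configuration whose demands are all met."""
    for cert in consert_hierarchy:
        if all(evidence.get(d, False) for d in cert["demands"]):
            return cert
    raise RuntimeError("no valid configuration")  # unreachable with a fallback

runtime_evidence = {"lidar_ok": True, "hd_map_fresh": True, "clear_weather": False}
print(evaluate_conserts(runtime_evidence)["guarantee"])  # -> reduced_speed_pilot
```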
A dynamic measure for the risks of the operational environment, provided e.g. by DRA (discussed in Sec. VI-D), could be matched to the assumes of the ConSerts. Consequently, the two approaches could likely support each other in the construction of a safe and performant ADS.
The formalisation required for run-time certification faces some of the same obstacles as those discussed in relation to formal methods, see Sec. V-D. With a modular system, an appropriate abstraction of the system configuration, and limitations on the factors modelled, the impact of such obstacles might be reduced. Nevertheless, this remains to be shown.
In a somewhat complementary vein to run-time certification, Fredericks et al. [193] suggest run-time testing as a paradigm to cope with dynamically adaptive systems and propose MAPE-T (Monitor-Analyse-Plan-Execute and Test), a run-time monitoring and adaptation of test cases built as an extension of the MAPE-K [194] feedback loop for monitoring and analysing adaptive systems. While ADSs could be considered dynamically adaptive systems, this approach has not been widely adopted, perhaps due to the challenges of testing and providing evidence for the safety and functionality of the system even with non-run-time methods and approaches. Thus, run-time testing is not expanded upon further, but it is acknowledged that such methods would face challenges similar to those iterated in this section on run-time certification, as well as some of the issues of completeness and soundness discussed in Sec. V-D on formal methods and in Sec. V-C on scenario-based V&V.
C. Dynamic Safety Management
Solutions to adaptation type (b) are largely provided through the different methods discussed in Sec. IV, and are also the focus of the V&V methods discussed in Sec. V. However, to show that such solutions yield a safe ADS while considering all operational uncertainties (i.e. challenges (C-U-env) and (C-U-inter)), there is a need to make certain assumptions, often worst-case assumptions, which encapsulate all (statistically) relevant operational situations. These worst-case assumptions could yield a safe, but oftentimes unnecessarily conservative, system. To circumvent this, one can monitor the operational environment, as discussed in Sec. VI, and adapt the (worst-case) assumptions to the present operational situation of the ADS, which effectively corresponds to adaptation type (c). Thus, the specific solution provided for adaptation type (b) is deferred to run-time by adapting with respect to dynamic objectives or constraints, i.e. according to adaptation type (c).
In [195], Dynamic Safety Management (DSM) is proposed as a solution to adapt the system objectives and constraints at run-time according to the system’s safety awareness. The DSM framework helps circumvent the need for a static safety analysis at design-time, allowing the system to “self-optimise its performance during run-time” [195]. DSM presumes access to run-time information providing contextual as well as self-awareness, the combination of which is called safety awareness [195], and allows the system to reason about the current risk and adapt its behaviour accordingly. Notably, Trapp et al. [195] explore this idea by proposing a dynamic risk analysis, where the quantification of the HARA is performed at run-time based on such safety awareness: the controllability, severity and exposure related to the hazard(s) are assessed at run-time, as opposed to being quantified before operations.
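The following sketch conveys the run-time HARA idea in a heavily simplified form: exposure, severity and controllability are estimated at run-time and mapped to a behaviour restriction. The normalised scores and the mapping are illustrative assumptions and not the quantification scheme of [195].

```python
def runtime_hazard_risk(exposure, severity, controllability):
    """Combine run-time estimates of the three classic HARA factors
    (each normalised to [0, 1], higher = worse) into a scalar risk."""
    return exposure * severity * (1.0 - controllability)

def adapt_behaviour(risk, speed_limit):
    """Map the assessed risk to a behaviour restriction: the higher the
    current risk, the more conservative the permitted driving envelope."""
    if risk > 0.5:
        return 0.0                      # stop / minimal risk manoeuvre
    return speed_limit * (1.0 - risk)   # proportionally reduced speed

# A dense pedestrian area in rain: high exposure, moderate severity,
# reduced controllability -> a noticeably restricted driving envelope.
risk = runtime_hazard_risk(exposure=0.8, severity=0.5, controllability=0.4)
print(f"permitted speed: {adapt_behaviour(risk, speed_limit=13.9):.1f} m/s")
```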
Developed in parallel, Khastgir et al. [86] also suggest dynamically updating the parameters of the HARA at run-time, based on the current operational situation. However, [86] suggests that the update to the HARA should imply changes to the driving behaviour of the ADS, by restricting or relaxing the integrity requirements for the system to solve the current operational situation.
While [178] suggests DSM as a means to optimise the system (configuration) according to the current safety awareness, and [86] proposes to alter the behaviour of the ADS implicitly by evaluating the HARA at run-time, one can also consider adapting the (tactical) behaviour of the ADS according to the dynamically assessed risk (e.g. from DRA). This latter concept is explored in [163] and [170], and is also hinted at in [196], [197] and [198].
Calinescu et al. [157] also suggest allocating some of the assurance tasks to run-time, by dynamically generating the assurance case throughout both design-time and run-time. This run-time assurance generation is predominantly dependent on formal methods and model checking [157], assuming formalisable system models and requirements. There is also a link from the formalisation-dependent ideas of run-time certification and ConSerts (discussed in Sec. VII-B above) to DSM, as the DSM framework [178], [195] builds on the fundamental ideas of ConSerts [97].
D. Precautionary Safety
The concept of a Precautionary Safety (PcS) policy was first introduced by de Campos et al. [199], with the purpose of achieving improved ADS performance while ensuring the fulfilment of ambitious quantitative safety requirements, prescribed as a Quantitative Risk Norm (QRN) [55]. The proposed methodology accounts for the system's emergency response capabilities, sensing performance and the exposure levels to different adverse events in order to enable the derivation of an appropriate driving policy, with which the ADS is able to fulfil the prescribed safety requirements. For example, the speed is set to ensure that the probabilities of the resulting consequences of an adverse event stay below the frequencies prescribed by the QRN. In particular, the emergency response is modelled given a certain speed and, considering the detection error rates of the ADS as well as the expected exposure level of the adverse event, the speed is chosen such that the consequence probabilities remain below those allowed by the QRN. As such, the fulfilment of the quantitative requirements is shown in a statistical way, rather than by proving the fulfilment of safety requirements based on worst-case assumptions. In [117], this methodology is elaborated upon to include more complex perception error rates and a process of rate estimation, both for the perception error rates and for the arrival rate of an adverse event. These aspects lead to a probabilistic approach for coping with random errors or degradations of the system (related to challenge (C-B-adapt)). Adapting the driving policy based on feedback regarding the fulfilment of the QRN corresponds to adaptation type (c).
The notion of PcS could also be merged with a framework for DRA in order to achieve a dynamic adaptation of the policy based on the current risk levels of the ADS, which could help to achieve an even more performant system. This is partly exemplified in [170], but with the main difference that the requirements on the ADS are not posed as quantitative elements. However, the risks are dynamically estimated in [170], including a probabilistic formulation of the uncertainties, in the same manner as suggested in [117] and [199]. For example, the jay-walking avoidance use case analysed in [199] presents a very crude form of DRA, where the two considered road types are associated with different exposure levels. Thus, given the knowledge of which road type the ADS is operating on, it is possible to adapt the driving policy in order to ensure the fulfilment of the safety requirements.
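To make this precautionary reasoning concrete, the following minimal sketch selects the highest speed whose resulting rate of harmful outcomes stays below a QRN-style budget, with the exposure depending on the road type as in the jay-walking example. All numbers, the miss rate and the consequence model are invented for illustration; the actual models of [117] and [199] are considerably richer.

```python
"""Illustrative sketch of a precautionary (PcS) speed policy (hypothetical)."""

# QRN-style budget: maximum tolerated rate of harmful outcomes [1/h].
QRN_BUDGET = 1e-7

# Crude DRA: exposure (arrival rate of a jay-walking pedestrian, [1/h])
# differs between the two considered road types.
ARRIVAL_RATE = {"residential": 1e-2, "arterial": 1e-3}

# Probability that the pedestrian is not detected in time for a
# nominal (comfortable) response.
MISS_RATE = 1e-3


def p_harm_given_event(speed_kph: float) -> float:
    # Hypothetical consequence model: probability that an undetected
    # event at the given speed leads to a harmful outcome, reflecting
    # the shrinking emergency-response capability at higher speeds.
    return min(1.0, (speed_kph / 130.0) ** 3)


def max_precautionary_speed(road_type: str) -> int:
    # Highest speed (in 5 km/h steps) whose resulting rate of harmful
    # outcomes stays below the QRN budget.
    best = 0
    for speed in range(5, 135, 5):
        outcome_rate = ARRIVAL_RATE[road_type] * MISS_RATE * p_harm_given_event(speed)
        if outcome_rate <= QRN_BUDGET:
            best = speed
    return best


for road in ("residential", "arterial"):
    print(road, max_precautionary_speed(road), "km/h")
```

With these made-up numbers, the higher pedestrian exposure on the residential road yields a substantially lower precautionary speed than on the arterial road, illustrating how even a crude road-type-based DRA lets the policy relax where the risk permits.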
In a similar vein, Khonji et al. [200] suggest a risk-aware architecture for an autonomous vehicle, where the uncertainties estimated by the perception and intention prediction subsystems are consumed by the planner to produce a risk-aware control policy. However, the considered uncertainties do not seem to include obscured or unseen objects, nor does the rendered policy ensure fulfilment of a quantitative requirement such as the QRN.
Review Results
TABLE VI summarises the ability of each of the discussed methods to overcome the eight challenges listed in Sec. III. The table is the result of a qualitative assessment made by the authors, resulting in a classification (identified by letters) that indicates how each of the surveyed methods responds to the identified challenges. The classifications are motivated in the sections for each of the reviewed methods. However, for conciseness of this paper, some of the less central discussions are made available in [201]. Jointly, the discussions presented in this paper and in [201] provide explicit motivations for all the challenge classifications of TABLE VI.
The proposed classification framework can be structured into three main groups:
the ones indicating a positive contribution (S and A),
the ones indicating a neutral contribution (U and N), and
the ones indicating that the particular challenge is difficult for, or not tackled by, the method (O and FC).
Solution, S: Methods promising a solution to a particular challenge are annotated with an S. For example, completely adopting contract-based design promises to solve challenge (C-agile). Whether this is feasible considering the remaining seven challenges is, however, questionable.
Amelioration, A: An A is reserved for methods that support or partly solve the given challenge. For instance, operational data ameliorates all discussed aspects in one way or another, but is not sufficient on its own to provide a complete solution to any of the challenges.
Uncertain/Unclear, U: A U indicates that the authors are unable to deduce the method’s applicability to solve a given challenge. This suggests the need for future work to answer that question.
Neutral, N: A cell annotated with N implies that the method is deemed indifferent with respect to the particular challenge, since it neither ameliorates nor exacerbates the challenge.
Obstacle, O: An O marks challenges that impose an obstacle to the method providing valuable assurance evidence. For example, while scenario-based V&V could increase the efficiency of testing and verification, ensuring that the provided test cases actually correspond to the system fulfilling its high-dependability requirements (corresponding to challenge (C-reqs)) remains an obstacle.
Fundamental Challenge, FC: Finally, an FC denotes challenges that are so fundamental to the method that, despite continued efforts and future work, they will likely remain troublesome. While an FC is used to indicate a fundamental limitation of the method with respect to a given challenge, an O suggests that future work might provide solutions to overcome the obstacles currently present. It is, for example, unlikely that continued efforts into modelling and formalising the operational uncertainties of the ADS (challenges (C-U-env) and (C-U-inter)) would be sufficient to enable complete use of CBD, rendering an FC classification in TABLE VI.
The novelty of a system such as an ADS, as well as the lack of best practices and sufficient data, affect all methods and techniques discussed. The proposed classification may come to change as new best practices evolve and, especially, as more data is gathered. In particular, the challenges pertaining to uncertainties (i.e. challenges (C-U-env) and (C-U-inter)) might not be as daunting given the existence of billions of miles of operational data. Thus, the assessment of TABLE VI represents a snapshot of which challenges currently impose issues for the discussed methods.
A. Addressing the Eight Challenges
Below we analyse in more detail the collected results of TABLE VI in order to deduce which methods suggest solutions to each of the challenges. Notably, each challenge has at least one method that seems to ameliorate it, albeit without providing a complete solution.
In general, many of the challenges are (at least to some extent) supported by the collection of more operational data together with the use of DRA and DSM.
For challenges (C-U-env) and (C-U-inter), pertaining to the uncertainties imposed on the ADS, it appears necessary (and promising) to address them by shifting at least parts of the assurance provision into run-time. This can be achieved through monitoring of the operations of the system, using a run-time monitor or a supervisor. Ideally, such monitoring is coupled with DSM or a precautionary driving policy to optimise available performance of the ADS.
Challenge (C-B-resp), the tactical responsibility, might be addressable through DRA, but more work remains before a conclusive statement can be made. However, the challenge can be partly ameliorated on the specification side, by appropriate limitations through the ODD. Further, challenge (C-B-resp) can be mitigated through appropriate supervisor architectures and operational data collection.
Coping with challenge (C-B-func), the complexity of the ADS, can be supported by DSM, together with DRA, as well as through confinement using the ODD. Further, the challenge can be approached through collection of operational data (through FOTs, operational data collection and by extension EVT) as well as through scenario-based methods.
Handling degradations of the system (challenge (C-B-adapt)) requires supervisor architectures, but can also be addressed through appropriate degradation strategies, run-time certification and DSM (all of which are closely related, as e.g. supervisor architectures typically embody some level of degradation). All of the V&V techniques also provide means to assess the validity of such degradations, albeit under the assumption of an explicit analysis of each subsystem in relation to its capabilities, performance and ROD.
The high dependability requirements (challenge (C-reqs)) are difficult to ensure through the V&V methods presently available, even though operational data collection and EVT would provide some support. These requirements seem best addressed by supervisor architectures coupled with appropriate degradation strategies. However, the probabilistic formulation of precautionary safety also suggests a solution.
As for incorporating and growing trust in AI and ML-based components (challenge (C-AI)), the precautionary safety approach also seems to offer a solution; however, operational data collection, supervisor architectures and FOTs, coupled with EVT, also provide ameliorating solutions, as does the use of OoD detection methods.
Lastly, many of the discussed methods appear to be compatible with agile development and frequent releases, related to challenge (C-agile). Most notably, the contract-based techniques, such as contract-based design and run-time certification, lend themselves particularly well to this purpose, but with the caveat of scalability and the ability to compose contracts in the light of challenges (C-U-env), (C-U-inter), (C-B-resp), (C-B-func), and (C-B-adapt).
Research Gaps
Following the challenge classification we now turn to the identification of research gaps. The method for this is outlined in Sec. IX-A below, after which we present the five categories of research gaps shown in Fig. 12.
A. Method for Derivation of Research Gaps
As outlined in Fig. 3, the methodology of this paper concludes with a gap analysis and the formulation of research questions. The gap analysis is conducted by consulting TABLE VI and identifying, for each method, the challenges that constitute an obstacle (O), a fundamental challenge (FC), or where the assessment is yet uncertain/unclear (U). While the fundamental challenges (FC) might not directly warrant further development of the method itself, they could still leave a missing piece for the safety assurance of the ADS and are hence included in the derivation of the research gaps. All cells classified as (FC), (O), or (U) are subsequently collected in an intermediate table [202], with one row per collected cell. Drawing on the investigation of the methods, as presented in the discussions of Sec. IV–Sec. VII, research questions are posed for each row of the intermediate table, in such a way that the answer to each question would help address the gap, or assess the consequences of the gap, indicated by that row. Overlaps between the questions formed for the different rows are subsequently analysed in order to group similar questions. Finally, the derived questions are gathered into themes, forming the five categories of research gaps presented below.
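For concreteness, the filtering step of this procedure can be pictured as in the following sketch; the method names, challenge labels and classifications shown are a hypothetical excerpt, not the actual content of TABLE VI or of the intermediate table [202].

```python
"""Illustrative sketch of the gap-derivation filtering step (hypothetical)."""

# Hypothetical excerpt: (method, challenge) -> classification letter.
classification = {
    ("contract-based design", "C-U-env"): "FC",
    ("scenario-based V&V", "C-reqs"): "O",
    ("DRA", "C-B-adapt"): "U",
    ("operational data", "C-agile"): "A",
    ("supervisor architectures", "C-B-adapt"): "S",
}

# Collect every cell classified as FC, O or U into the intermediate
# table, one row per collected cell. A research question addressing
# (or assessing the consequences of) the gap is then posed per row.
intermediate_table = [
    {"method": method, "challenge": challenge, "class": grade}
    for (method, challenge), grade in classification.items()
    if grade in {"FC", "O", "U"}
]

for row in intermediate_table:
    print(row)  # three rows survive the filter in this excerpt
```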
B. Identified Research Gaps
The research gaps are clustered into five categories. The specific research questions within each category are presented in the following subsections.
1) Completeness of Provided Safety Evidence:
How to ensure that the confinement to the design made through the ODD (IV-A) is appropriate with respect to the uncertainties of challenges (C-U-env) and (C-U-inter)?
How to amend or tailor the process for HARA (IV-B) to ensure completeness of the provided hazards with respect to the operational uncertainties ((C-U-env) and (C-U-inter)), the fact that the ADS is responsible for the tactical decisions (challenge (C-B-resp)), the complexity of the system (challenge (C-B-func)) as well as the system’s obligations to handle degradations (challenge (C-B-adapt))?
How could the tactical responsibility and operational uncertainties of the ADS (corresponding to challenges (C-U-env), (C-U-inter), (C-B-resp), (C-B-func), and (C-B-adapt)) be formalised? And, if they cannot, what are the implications and potential remedies for contract-based design (IV-D), formal methods (V-D) and run-time certification (VII-B)?
When deploying formal methods (V-D), how to minimise the specification gap [63] when considering the high dependability requirements of challenge (C-reqs) and the operational uncertainties (challenges (C-U-env) and (C-U-inter))?
How to mitigate the impact from a mismatch between the real operational uncertainties of the ADS (challenges (C-U-env) and (C-U-inter)) and the considered scenario space for scenario-based V&V (V-C)?
2) Improvements, Analyses, and Automation of Methods:
How to automate HARA (IV-B) to support challenge (C-agile), with frequent releases and continuous learning, while perhaps integrating data from operations to cope with the high integrity requirements of challenge (C-reqs)?
What are the quantitative contributions from current (safety) design and development processes (IV-C), especially considering the operational uncertainties of the ADS, i.e. challenges (C-U-env) and (C-U-inter), but also challenges (C-B-resp), (C-B-func), (C-B-adapt), and (C-reqs)?
How to integrate the best elements of current safety processes with agile releases (challenge (C-agile)); alternatively, how could adequate safety evidence to support such processes be rendered from within the agile development cycle?
How to derive realistic and statistically probable (albeit rare) scenarios (V-C) corresponding to challenge (C-reqs), the high dependability requirements of an ADS?
What are appropriate leading metrics for (safety) operational data collection (VI-A) of an ADS, in particular to capture the operational uncertainties ((C-U-env) and (C-U-inter)) and the tactical responsibilities (C-B-resp) in relation to the high dependability requirements of challenge (C-reqs)?
What are appropriate metrics for threat assessment (VI-B) to capture the uncertainties of challenges (C-U-env) and (C-U-inter), especially considering the rareness of events (related to challenge (C-reqs))?
How to assure the integrity of run-time methods: OoD detection methods (VI-C), DRA (VI-D), and DSM (VII-C) in the light of the high dependability requirements on the ADS (challenge (C-reqs))?
How well does DRA (VI-D) accommodate degradations of the system (challenge (C-B-adapt))?
How to construct run-time contracts (VII-B) to appropriately capture the uncertainties present in run-time (i.e. challenges (C-U-env) and (C-U-inter))?
3) Collecting Closed Loop Data and Handling the Responsibility of Tactical Decisions Allocated to the ADS:
How to collect large quantities of closed loop data (supporting the fulfilment of the high dependability requirements of challenge (C-reqs)) from FOTs (V-A) without compromising safety (especially when considering interaction uncertainties of challenge (C-U-inter) and the tactical responsibilities of challenge (C-B-resp))? These considerations for safety are at least three-fold:
the operations of the system need to be safe, especially considering the automation complacency related to supervising the operations [42], [43];
the data collection should not interfere with the closed loop operations of the system, nor impact the validity of the gathered data; and
the data collected needs to be relevant and correct, to avoid incorrect analyses and wrongly informed decisions.
How to ensure that tested scenarios are relevant, considering the ability of the ADS to avoid the situation leading up to the scenario through its tactical decisions, i.e. challenge (C-B-resp)?
How does the tactical decision responsibility of the ADS (challenge (C-B-resp)) impact the DRA (VI-D), DSM (VII-C) and precautionary safety (VII-D) methods?
4) Coping With AI/ML-Based Components (Challenge C-AI):
This category of gaps corresponds to the column with challenge (C-AI) of TABLE VI.
What are appropriate design and development processes (IV-C) to incorporate and rely on AI/ML-based components?
What is the impact on scenario-based V&V considering the non-interpolatable results when testing AI/ML-based components?
How to ensure validity when using formal methods (V-D) for such components, especially in relation to the high dependability requirements of challenge (C-reqs)?
How to compose contracts for AI/ML-based components, both for contract-based design (IV-D) as well as run-time certification (VII-B)?
How to derive quantitative risk measures from such components for the use in DRA (VI-D) and in turn for DSM (VII-C), while also ensuring dependability of the resulting outputs?
5) Scalability of Methods and Patterns:
How do contract-based design (IV-D), supervisor architectures (IV-E), formal methods (V-D), run-time certification (VII-B), degradation strategies (VII-A), and precautionary safety (VII-D) scale when applied to a complex system such as an ADS (cf. challenge (C-B-func))?
How to best leverage FOTs (V-A) for providing safety evidence of the system in relation to an agile development process (challenge (C-agile)) and considering the high dependability requirements of challenge (C-reqs)?
How to facilitate frequent releases (challenge (C-agile)) while ensuring appropriate analysis and development of corresponding degradation strategies (VII-A), especially concerning MRCs and RODs?
How can a precautionary safety design methodology (VII-D) support frequent releases (challenge (C-agile))?
Discussion
In this section we turn to discussing the threats to validity pertaining to the methodology of the paper itself and conclude with areas of future work.
A. Threats to Validity
This paper presents a holistic perspective on safety evidence provision for an ADS and discusses methods related thereto. For each of the methods discussed, we draw upon a selection of papers to support the view presented. Due to the diversity of topics included in our work, and the novelty and specificity of the application (ADS), we eventually discarded explicit systematic literature searches and exploration methods, due to the vastness of publications found through such an approach. Further, a systematic search for relevant surveys only yielded a small number of applicable results. The work of trying to systematise the search did, however, yield a solid basis for further work, both in terms of a comprehensive list of related work and in terms of the mind-map of Fig. 1, which provided structure both to the review and to this paper.
We will here discuss two aspects regarding the validity of the work presented herein: i) the difficulty of reproducibility (due to the ad hoc collection of references), and ii) the potential lack of completeness with respect to the identified challenges and gaps. To address the latter concern, a focused high-level systematic literature survey was conducted, and is presented in Sec. II-A3. This focused survey provides evidence for the usefulness and exhaustiveness of the results presented in this paper. Further, the methods as well as the challenges considered in the references presented in Sec. II-A3 suggest that the structure of methods, as well as the challenges identified in our work, are adequate and complete for the purpose of our review. Moreover, the focused survey highlights the novelty of our approach in taking a holistic perspective on methods related to safety evidence provision. A reproduction of our review would likely not yield the specific references used to support the views and discussions throughout Sec. IV to Sec. VII. However, to cover the same variety of methods, and thus to provide a holistic view, such a reproduction would necessarily need to present a structure similar to the one guiding our review. Thus, the final results would likely remain the same despite potential differences in the reference list. Where applicable, we rely on literature surveys conducted in the field to provide an overview of the respective areas. A summary of which survey we use for which sections is given in the following.
Mehlhorn et al. [70] provide a holistic overview of the research on ODDs, complementing the discussion provided in Sec. IV-A. In [88], an overview of design processes for the safety of AI/ML-based components is given, and Habli [79] discusses qualitative processes in relation to safety in general, giving support to the section on qualitative process arguments of Sec. IV-C. For supervisor architectures, we draw upon the results of [101], [104], and [109].
An overview of V&V methods for ADSs is given by [24] and [114], guiding the discussions of Sec. V together with [25], [26], where reviews of assessment and testing approaches for ADSs are given. Insights into scenario-based V&V methods are provided by [24], [115], [134], [137], and [203], whereas we draw upon the works in [2], [24], [147], and [148] for the section on formal methods. As for the run-time risk assessment methods, treated in Sec. VI, the connection between operational data and assurance methodologies is given in [99]; threat assessment techniques are discussed with support from [32], [163], [164], [165], [166], [167], and [168]; intention and trajectory prediction (as part of DRA) draws upon [180], [181]; and the discussion of OoD detection draws upon [28].
Based on these surveys we implicitly inherit completeness with respect to at least those sub-topics. For other topics, we have relied on snowballing [204], starting from one or two prominent papers on the topic. We have naturally also drawn upon the complementary expertise of the co-authors for the covered areas. Jointly with the results of the focused, high-level systematic literature survey, we believe that this paper provides a representative and useful overview of the current challenges and methods for safety evidence provision for ADSs. Furthermore, works potentially overlooked in our review would not have a significant impact on the assessments of TABLE VI, nor on the derived research gaps.
B. Future Work
How well the reviewed methods work together, and how they can be combined to address the challenges presented in this work, has only been briefly discussed herein, and an in-depth analysis thereof is suggested for future work. In a similar vein, analysing the dependencies, connections and relationships between the reviewed methods is also deferred to future work. Another line of research would be to examine the discussed methods (and likely an extended set thereof) when considering collaborative and connected facets of the ADS, including systems-of-systems. Further, an analysis of which assumptions, models and uncertainties each method imposes, consumes, and mitigates or leaves unresolved would be a reasonable next step to extend the results presented in this work. Finally, to understand the holistic safety perspective when also including assumptions, models and uncertainties, it will be paramount to also include assurance methodologies in the scope. In particular, such an addition should focus on the organisation and traceability of the arguments and evidence supporting the assurance case.
Conclusion
In this paper we identify eight challenges pertaining to the safety evidence provision for ADSs. Furthermore, we analyse and discuss state-of-the-art design, development and run-time methods in relation to these eight challenges, thereby providing a holistic perspective of the current progress of safety evidence provision for ADSs. The results of the discussion are summarised in TABLE VI, where the ability of each method to mitigate the challenges is given. Additionally, the challenges considered especially onerous are highlighted for the respective methods. Supported by these results, a list of research gaps is identified, grouped into five major themes: IX-B1 completeness of provided safety evidence, IX-B2 improvements and analysis needs, IX-B3 safely collecting closed loop data and accounting for tactical responsibility on the part of the ADS, IX-B4 coping with AI/ML-based components, and IX-B5 the scalability of the approaches with respect to the complexity of the ADS.
We conclude that the existing methods provide a good base for safety evidence provision, but several challenges remain when considering the complexity and novelty of an ADS. Several methods need to come together to bridge this gap. This provides the baseline for extending this work: to analyse the interconnections and dependencies between the methods. Such an analysis could beneficially include the assumptions and models deployed by each method, in order to help elucidate the interplay amongst the methods. As an additional extension, we propose to include assurance concepts (i.e. how to organise, trace and present the assurance arguments and evidence in an assurance case) in the analysis. Finally, we suggest analysing how and where, throughout the assurance life-cycle of the ADS, uncertainties originate and are mitigated in relation to the analysed methods.
ACKNOWLEDGMENT
The authors would like to thank Mattias Brännström (formerly at Zenseact, currently at Waymo) for initial discussions inspiring this work; Christine Räisänen (Chalmers) for her excellent feedback and guidance through the course Writing up for publication, resulting in clear improvements to the text's readability and focus; Jonas Krook (Zenseact) for a thorough review and valuable feedback; Fredrik Warg (RISE) for insightful comments and feedback during the writing process; and Yuvaraj Selvaraj (Zenseact) for feedback on the section on formal methods. Lastly, they would like to extend their gratitude to the anonymous reviewers for their challenging and thoughtful comments, which have greatly contributed to the quality of the final paper.