Introduction
In highly automated vehicles, artificial intelligence (AI) and deep learning (DL) are crucial components for complex tasks like environmental perception [1]. Deep learning takes a data-driven approach, in which deep neural networks (DNNs) learn these tasks from high-dimensional data [2]. After training and testing activities, the DNN black box models are integrated into the overall automated driving (AD) system architecture, as shown in the upper part of Fig. 1, and perform safety-critical tasks such as object detection and image classification. However, ensuring safe behavior of these data-driven algorithms is challenging due to their insufficiencies and the infinite number of scenarios in the open world. For example, distributional shift and adversarial attacks can force the DNN to predict high confidence scores on incorrect outputs [2]. In the context of safety of the intended functionality (SOTIF) and the corresponding automotive safety standard ISO 21448 [3], the open-world problem is addressed by systematically minimizing the area of unknown scenarios through iterative process steps of analysis, functional modification, verification, and validation. For the remaining area of unknown, potentially unsafe scenarios, appropriate monitoring methods have to be provided during runtime. However, methods for monitoring traditional software, which are recommended in the functional safety standard ISO 26262 [4] (e.g., range checks), are not applicable to DL-based algorithms like DNNs [2]. For this reason, various DNN runtime monitoring methods have been published in recent years, which we discuss in the related work in Section II-C. However, these methods do not take various DNN insufficiencies into account simultaneously. Rather, they focus on individual error root causes (i.e., triggering conditions), like out of distribution inputs and adversarial attacks. Furthermore, to the best of our knowledge, established SOTIF concepts like the cause and effect chain, which is shown in Fig. 2, have not yet been taken into account in the context of DNN runtime monitoring. To address these gaps, we propose in Section III an insufficiency-driven approach for monitoring DNNs, as shown in the lower part of Fig. 1. In Section IV, we describe the implementation and application of our proposed error detection method on the safety-related automated driving use case of traffic sign recognition (TSR). Afterwards, we present and discuss our results by comparing our error detection approach to a baseline DNN model and to state of the art monitoring methods in Section V. Finally, we summarize our work and draw conclusions in Section VI.
Proposed architecture for an insufficiency-driven detection of DNN errors (using the example of DNN-based perception).
Related Work
In this section, we discuss the theoretical background and research relevant to modeling, simulating, and testing our proposed DNN error detection approach. First, we introduce SOTIF and the corresponding safety standard ISO 21448 in Section II-A. Afterwards, we discuss testing in automated driving and deep learning, including relevant datasets and toolsets, in Section II-B. Finally, we summarize methods for monitoring DNNs during runtime in Section II-C.
A. Safety of the Intended Functionality
According to the automotive safety standard ISO 21448, SOTIF is defined as “the absence of unreasonable risk due to a hazard caused by functional insufficiencies” [3] of the intended functionality. In the context of automated driving, functional insufficiencies can be insufficiencies of specification (e.g., gaps in the specification of the operational design domain) or performance insufficiencies (e.g., technical limitations of sensors and perception algorithms). Certain conditions of a scenario (i.e., triggering conditions) can activate these functional insufficiencies and might provoke hazardous behavior of the AD system [3], [5]. Fig. 2 shows the corresponding cause and effect chain and illustrates the connection between functional insufficiency, triggering condition, and hazardous behavior. To ensure the SOTIF, ISO 21448 defines an iterative development process including phases of analysis, design, verification, validation, and monitoring [2], [3]. Thereby, it addresses the open-world problem for complex systems (e.g., unknown, unsafe scenarios in automated driving), which are not sufficiently covered by the established functional safety standard ISO 26262 [2], [3]. In doing so, ISO 21448 augments the process of ISO 26262 [2]. Furthermore, ISO 21448 also covers reasonably foreseeable misuse of the intended functionality and proposes measures like human machine interface (HMI) improvement and driver monitoring implementation [2], [3]. However, ISO 21448 does not go into detail about insufficiencies in the context of deep learning algorithms and related measures to address the safety of DNNs. This is where various research activities come in, like Willers et al. [5], who identified DNN-specific safety concerns and corresponding mitigation measures, or Burton et al. [6], who proposed an approach for the construction of confidence arguments in the context of DNN performance evaluation.
B. Testing in Automated Driving and Deep Learning
Virtual testing is becoming increasingly important in automated driving due to well-known advantages like repeatability, scalability, safety, and cost [7], [8]. Especially with respect to the development of DL algorithms (e.g., DNNs for perception tasks), testing in virtual environments is an important field [9]. Recent survey papers [7], [9] summarize relevant datasets for testing DL-based AD systems and virtual testing environments for open- and closed-loop simulations, like the MATLAB Automated Driving Toolbox. Using the example of the traffic sign recognition task, well-known datasets like GTSDB (German Traffic Sign Detection Benchmark) [10] and LISA (Laboratory for Intelligent and Safe Automobiles) [11] are publicly available. Furthermore, Zhu et al. [12] introduced the Tsinghua-Tencent 100K dataset with Chinese traffic signs in bad weather conditions. To test the AD system in relevant scenarios and to derive potential failure cases, Ghodsi et al. [13] and Wang et al. [14] recently published methods for the generation of safety-critical scenarios. However, the data-driven approach in deep learning results in major differences compared to traditional software, which requires DL-specific testing methods at the software level. Recent survey papers [15], [16] highlight these differences and summarize research on appropriate testing methods to address DL-specific insufficiencies, like robustness limitations and black-box characteristics. For example, Pei et al. [17] introduced the white-box testing framework DeepXplore to measure neuron coverage and to uncover thousands of incorrect DNN corner case behaviors [16]. Furthermore, Tian et al. [18] proposed DeepTest as a systematic testing tool, which supports the derivation of erroneous DNN behaviors in the context of automated driving. Additionally, various explainability methods make it possible during testing to explain the decisions of a DL algorithm and to analyze and understand its errors [19]. For example, explainability methods like GradCam [20] and Occlusion Sensitivity [21] provide saliency maps to highlight relevant features within the input image.
C. Runtime Monitoring Methods for Deep Neural Networks
In addition to DL-specific testing activities, it is of great practical importance to monitor the tested DL algorithm during runtime [22]. To discuss state of the art monitoring methods in deep learning, we define a DNN as a high-dimensional function $f_{\theta}(x)$ with trained parameters $\theta$, which maps an input $x$ (e.g., a camera image) to a prediction (e.g., class confidence scores). In the following, we summarize the three main literature fields of DNN runtime monitoring, which are illustrated in Fig. 3: predictive uncertainty, out of distribution (OOD) detection, and adversarial detection.
Runtime monitoring methods (predictive uncertainty, out of distribution detection, adversarial detection) address different types of triggering conditions (using the example of traffic sign recognition).
1) Predictive Uncertainty Methods
Well-known predictive uncertainty methods are based on Bayesian theory, like Monte Carlo (MC) dropout and Bayesian neural networks [25]. These methods estimate an output probability distribution by sampling multiple non-deterministic forward passes during runtime and computing the predictive mean and variance according to (1): \begin{equation*} \mu = \frac {1}{n} \cdot \sum _{i=0}^{n-1} f_{\theta,i}(x); \quad \sigma ^{2} = \frac {1}{n} \cdot \sum _{i=0}^{n-1} \big (f_{\theta,i}(x)-\mu \big)^{2}\tag{1}\end{equation*}
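For illustration, the following minimal Python/NumPy sketch computes the predictive mean and variance of (1). The `stochastic_forward` callable is a hypothetical model call with dropout kept active at inference time, and reducing the variance vector to a single uncertainty score for the winning class is our assumption, not a prescription of the methods above.

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_forward, x, n=30):
    # Collect n non-deterministic forward passes (dropout active at inference).
    samples = np.stack([stochastic_forward(x) for _ in range(n)])  # shape: (n, num_classes)
    mu = samples.mean(axis=0)            # predictive mean per class, Eq. (1)
    sigma2 = samples.var(axis=0)         # predictive variance per class, Eq. (1)
    s_unc = sigma2[int(np.argmax(mu))]   # variance of the winning class as uncertainty score
    return mu, sigma2, s_unc
```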
2) Out of Distribution Detection Methods
Instead of estimating a probability distribution for the current DNN prediction, OOD detection methods deal with a binary classification problem: whether the current input is inside or outside the training data distribution. For example, Cheng et al. [27] proposed a method for monitoring the neuron activation of the hidden DNN layers. The neuron activation patterns in the last hidden layer are stored with Binary Decision Diagrams (BDD) [28] during the design phase [23]. These patterns are built up with Boolean variables, which indicate whether the monitored neurons are active or not. To detect anomalies, the neuron activation pattern is compared to the stored patterns during runtime by measuring the Hamming distance. Furthermore, Hendrycks et al. [29] proposed Outlier Exposure, where the training process of the DNN is augmented with OOD samples from auxiliary datasets (e.g., 80 Million Tiny Images [30]). The DNN is trained on the original dataset with the original labels and, in addition, on the OOD samples with the uniform distribution as target, according to the modified objective in (2). During runtime, the OOD score is then calculated as the entropy of the DNN prediction, as shown in (3): \begin{align*} \min _{\theta } &\Big [ \mathbb {E}_{D_{ID}\left ({x,y}\right)} \big [-\log f_{\theta }(x) \big] + \lambda \cdot \mathbb {E}_{D_{OOD}\left ({x^{\prime }}\right)} \big [KL \big (\mathcal {U}(y) \,\|\, f_{\theta }\left ({x^{\prime }}\right)\big) \big] \Big] \tag{2}\\ S_{OOD} &= H \big (f_{\theta }(x) \big).\tag{3}\end{align*}
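A minimal Python/NumPy sketch of the per-batch objective in (2) and the runtime score in (3) is given below. It assumes softmax probability arrays `p_id` and `p_ood`; the weighting `lam=0.5` and the additive stabilizer `1e-12` are assumptions, and an actual implementation would compute these terms inside a DL framework's training loop.

```python
import numpy as np

def outlier_exposure_loss(p_id, y_id, p_ood, lam=0.5):
    # Cross-entropy on in-distribution samples (first term of Eq. 2).
    ce = -np.log(p_id[np.arange(len(y_id)), y_id] + 1e-12).mean()
    # KL divergence from the uniform distribution to the softmax output
    # on OOD samples (second term of Eq. 2).
    k = p_ood.shape[1]
    kl_uniform = (-np.log(k) - np.log(p_ood + 1e-12).mean(axis=1)).mean()
    return ce + lam * kl_uniform

def ood_score(p):
    # Runtime OOD score: entropy of the softmax output (Eq. 3).
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```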
3) Adversarial Detection Methods
To address adversarial perturbations within the input data during runtime, Meng and Chen [33] introduced the autoencoder-based method MagNet, which is trained on the original dataset. The input samples are processed through the autoencoder, which compresses and reconstructs the input. Afterwards, adversarial inputs are detected by a high reconstruction loss between the original and the reconstructed input. Additionally, the reconstructed input is processed through the DNN, and shifts in the DNN prediction between the original and the reconstructed input are measured for adversarial detection as well. Furthermore, Xu et al. [34] proposed Feature Squeezing, which is based on DNN predictions on squeezed input images. Therefore, a reduced color depth image and a spatially smoothed image are generated from the original input. The DNN predictions on the two squeezed images are compared with the DNN prediction on the original input image. If the difference $S_{adv}$ in (4) exceeds a predefined threshold, the input is classified as adversarial: \begin{equation*} S_{adv} = \big \| f_{\theta }(x) - f_{\theta }\left ({x^{squeezed}}\right)\big \|_{1}\tag{4}\end{equation*}
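The following sketch illustrates the score in (4) under simple assumptions: `predict` is a hypothetical callable returning softmax probabilities, pixel values lie in [0, 1], and the optional `smooth` function stands in for the spatial smoothing squeezer. Taking the maximum over squeezers follows the Feature Squeezing idea; the bit depth is an illustrative choice.

```python
import numpy as np

def feature_squeezing_score(predict, x, bit_depth=4, smooth=None):
    # Colour-depth reduction: quantise pixel values (x assumed scaled to [0, 1]).
    levels = 2 ** bit_depth - 1
    x_depth = np.round(x * levels) / levels
    # L1 distance between predictions on the original and squeezed inputs (Eq. 4).
    scores = [np.abs(predict(x) - predict(x_depth)).sum()]
    if smooth is not None:  # optional spatial smoothing squeezer
        scores.append(np.abs(predict(x) - predict(smooth(x))).sum())
    return max(scores)  # compare against a threshold to flag adversarial inputs
```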
Modeling of an Insufficiency-Driven Error Detector for Deep Neural Networks
In this section, we describe the development of a general model for DNN error detection in the context of safety of the intended functionality, as shown in Fig. 1. Therefore, we take into account the SOTIF cause and effect chain from Fig. 2 and the literature fields of DNN runtime monitoring from Section II-C. First, we create an error root cause model for DNNs with DNN-specific insufficiencies and triggering conditions in Section III-A. Afterwards, we derive general DNN monitor categories and link them to the insufficiencies in Section III-B. Finally, we combine the monitor categories with a meta model, based on the stack learning technique, in Section III-C. This enables a general approach for monitoring DNNs, which addresses safety-related DNN insufficiencies and considers various types of DNN errors related to in distribution, out of distribution, and adversarial input data.
A. Creation of an Error Root Cause Model for Deep Neural Networks
Willers et al. [5] recently defined nine DNN-specific safety concerns leading to insufficiencies, like data distribution shift to real world, distributional shift over time, incomprehensible behavior, unknown behavior in rare critical situations, unreliable confidence information, brittleness, dependence on labeling quality, inadequate separation of test and training data, and insufficient consideration of safety metrics. Cheng et al. [35] identified core properties and corresponding insufficiencies of DNNs, which have to be addressed in safety context, like robustness, interpretability, completeness, and correctness. Dataset limitations as well as limitations regarding robustness, explainability, and uncertainty are also highlighted by Houben et al. [36].
In the following, we summarize these results in six general DNN insufficiencies and categorize them by using the SOTIF terms of performance insufficiency and insufficiency of specification, as shown in Table 1. Furthermore, we make a distinction between data- and model-related insufficiencies. Whereas data-related insufficiencies refer to the underlying datasets, which are used in DNN development process for training, validation, and testing, model-related insufficiencies refer to limitations of the trained DNN model. Considering the DNN-specific triggering condition types and the categorized insufficiencies, we create an error root cause model in the context of deep learning. Fig. 4 shows the error root cause model and illustrates the relation between DNN-specific functional insufficiencies, triggering conditions, and DNN errors. In distribution, out of distribution, and adversarial inputs can activate corresponding DNN insufficiencies and lead to a DNN error (i.e., false positive or false negative prediction). If the DNN error is not detected on software level, it can contribute to a hazardous behavior on vehicle level. To prevent DNN error propagation, we propose an error detection approach on software level, which takes into account the DNN insufficiencies to detect all types of DNN-specific triggering conditions.
Error root cause model for DNNs: Triggering conditions in input data activate functional insufficiencies of the DNN and lead to an error.
B. Derivation of General Monitor Categories to Address DNN Insufficiencies During Runtime
To enable a general error detection approach, which addresses the safety-related DNN insufficiencies in Table 1, appropriate monitoring methods have to be provided for each insufficiency. Therefore, we derive five general monitor categories, as shown in the upper part of Fig. 5. During runtime, each monitor category should provide an error score that indicates whether its associated insufficiency is activated by the current input.
DNN monitor categories address safety-related insufficiencies of DNNs for detection of different triggering condition types.
We introduce the OOD monitor category to address insufficiencies regarding data completeness. For the implementation of the OOD monitor, well-known methods from the OOD detection literature field can be used to calculate an OOD score $S_{OOD}$, for example Outlier Exposure according to (2) and (3).
C. Runtime Detection of Triggering Conditions With a Combining Meta Model Approach
To consider the DNN insufficiencies during runtime, we combine the results of the individual monitor categories from Section III-B, as shown in the lower part of Fig. 5. In doing so, we introduce a meta model $\mathcal {M}$, which maps the individual monitor scores to a triggering condition probability $P_{TC}$, as defined in (5): \begin{equation*} P_{TC} = \mathcal {M}\left ({S_{OOD}, S_{sal}, S_{plau}, S_{adv}, S_{unc}}\right)\tag{5}\end{equation*}
Various machine learning architectures are applicable for the implementation of the meta model $\mathcal {M}$. For example, a logistic regression (LR) meta model estimates the triggering condition probability according to (6) and (7): \begin{align*} P_{TC,LR} &= \frac {1}{1 + e^{-z}} \tag{6}\\ z &= \beta _{0} + \beta _{1} \cdot S_{OOD} + \beta _{2} \cdot S_{sal} + \beta _{3} \cdot S_{plau} + \beta _{4} \cdot S_{adv} + \beta _{5} \cdot S_{unc}\tag{7}\end{align*}
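As a minimal sketch of the stacked LR meta model in (6) and (7), the function below maps the five monitor scores to a triggering condition probability; the coefficient vector `beta` is assumed to have been fitted beforehand on labeled monitor outputs (e.g., with a standard logistic regression solver such as scikit-learn's LogisticRegression, which is an illustrative choice and not necessarily the optimization described later in this work).

```python
import numpy as np

def lr_meta_model(scores, beta):
    # scores = (S_OOD, S_sal, S_plau, S_adv, S_unc); beta = (beta_0, ..., beta_5).
    z = beta[0] + np.dot(beta[1:], scores)   # linear combination of monitor scores, Eq. (7)
    return 1.0 / (1.0 + np.exp(-z))          # triggering condition probability, Eq. (6)
```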
Experimental Set-Up and Implementation on Automated Driving Use Case of Traffic Sign Recognition
In this section, we implement the proposed error detection method from Section III on the safety-related automated driving use case of traffic sign recognition by using the MATLAB Simulink and MATLAB RoadRunner toolset, as shown in Fig. 6. Therefore, we adapt the general monitor categories and the meta model to the TSR use case and implement them in the model-based MATLAB Simulink language. The overall model is then embedded in a MATLAB Simulink open loop simulation, which contains a self-developed TSR function consisting of DNN-based traffic sign detection and classification components. We create diverse 3D traffic sign scenarios with MATLAB RoadRunner and feed the simulated camera data into the open loop simulation. Our error detection approach is implemented to supervise the baseline DNN for the traffic sign classification task. In doing so, we treat the DNN as a black box with a non-modifiable architecture and inaccessible inner states to increase flexibility in application. Therefore, we focus on the DNN input (clipped bounding box with traffic sign image from traffic sign detection) and the DNN output (confidence scores for traffic sign classes) as input data for the error detector.
Experimental set-up in MATLAB Simulink and MATLAB RoadRunner environment for training and testing of the insufficiency-driven DNN error detector.
A. Traffic Sign Detection
For the traffic sign detection task, we train a YOLO v2 [41] object detection algorithm on the German Traffic Sign Detection Benchmark (GTSDB) dataset [10]. The dataset comprises 900 images with a resolution of 1360 × 800 pixels.
B. Traffic Sign Classification (Baseline DNN)
For the classification task, we train a DNN with a 5-layer convolutional neural network (CNN) architecture as baseline model on in distribution traffic sign data (i.e., speed sign classes of the GTSRB dataset). For comparison with our error detection approach, the maximum softmax confidence of this baseline DNN is used as a reference error score, as defined in (8): \begin{equation*} P_{TC,DNN} = 1 - f_{\theta,max}(x)\tag{8}\end{equation*}
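A one-line Python/NumPy sketch of the baseline score in (8), assuming `softmax_scores` holds the class confidence vector(s) of the baseline DNN:

```python
import numpy as np

def baseline_error_score(softmax_scores):
    # Eq. (8): one minus the maximum softmax confidence of the baseline DNN.
    return 1.0 - np.max(softmax_scores, axis=-1)
```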
C. 3D Training and Test Scenarios
To simulate entire prediction tracks of various triggering conditions, we create diverse 3D scenarios for the TSR use case in the MATLAB RoadRunner environment. In doing so, we design a straight street with different traffic signs, which are spaced 100 m apart. The trajectory of the ego vehicle with a camera mounted in the front grill (height relative to road level: 0.3 m) is defined with a constant speed of 80 km/h by using the MATLAB Automated Driving Toolbox. To generate the camera data, we simulate the 3D scenario with the ego vehicle trajectory. Afterwards, we export the simulated camera data and feed it into the open loop simulation.
We model Out of Distribution Inputs with traffic sign classes of the GTSRB dataset that are not in the training data distribution of the baseline DNN (i.e., no speed signs). We consider these classes as out of distribution because they have never been seen during training time. To generate Adversarial Inputs, we apply the Projected Gradient Descent (PGD) [45] method, which is an iterative variant of the Fast Gradient Sign Method (FGSM) [24], on in distribution traffic sign data (i.e., speed signs in GTSRB). According to (9), a FGSM perturbation with step size $\alpha$ is calculated from the sign of the loss gradient with respect to the input. PGD applies this step iteratively and clips the perturbed input to an $\epsilon$-neighborhood of the original input after each iteration, as shown in (10): \begin{align*} x_{adv} &= x + \alpha \cdot \text{sign}\Big (\nabla _{x} \mathcal {L} \big (f_{\theta }(x),y\big)\Big)\tag{9}\\ x_{adv}^{t+1} &= \text {Clip}_{\epsilon } \Big \{\text {FGSM}\big (x_{adv}^{t}\big)\Big \}\tag{10}\end{align*}
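A minimal Python/NumPy sketch of the PGD loop in (9) and (10) is shown below. The `grad_loss` callable is a hypothetical function returning the gradient of the DNN loss with respect to the input (in practice obtained from a DL framework), pixel values are assumed in [0, 1], and the values of `eps`, `alpha`, and `steps` are illustrative rather than the settings used in this work.

```python
import numpy as np

def pgd_attack(grad_loss, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.copy()
    for _ in range(steps):
        # FGSM step with step size alpha (Eq. 9).
        x_adv = x_adv + alpha * np.sign(grad_loss(x_adv, y))
        # Projection back into the eps-ball around the clean input (Eq. 10).
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # Keep a valid image (pixel values assumed in [0, 1]).
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```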
D. Implementation of the Monitor Categories
For the implementation of some monitor categories, different runtime monitoring methods from the literature fields in Section II-C have to be chosen. Note that the selection criterion for these methods is based on popularity and citation frequency; a complete benchmark on the use case would go beyond the scope of this work. Rather, the aim is to show the benefit of their combination within the meta model approach, which addresses different DNN insufficiencies simultaneously during runtime.
We treat the baseline DNN as a black box and consider this boundary condition in the following specification of the monitor categories. We implement the Out of Distribution Monitor with an Outlier Exposure [29] approach. In doing so, we train a similar DNN architecture on ID data (i.e., speed signs in the GTSRB dataset) as well as on OOD data with the help of the modified loss function described in (2), using the penalty parameter $\lambda$ to weight the OOD term.
Saliency monitor calculates saliency maps for current DNN prediction and interprets the results, whereas plausibility monitor performs color- and shape-based plausibility checks.
Algorithm 1: Occlusion Sensitivity.
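Since the listing of Algorithm 1 is not reproduced here, the following Python/NumPy sketch illustrates the general Occlusion Sensitivity technique [21] under simple assumptions: `predict` is a hypothetical callable returning softmax probabilities for an H × W (× C) image, and the patch size, stride, and fill value are illustrative parameters rather than the settings of the saliency monitor in this work.

```python
import numpy as np

def occlusion_sensitivity(predict, image, target_class, patch=8, stride=4, fill=0.5):
    # Confidence of the unmodified image for the class under investigation.
    base = predict(image)[target_class]
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch, ...] = fill  # grey out one region
            # Confidence drop caused by occluding this region.
            heatmap[i, j] = base - predict(occluded)[target_class]
    return heatmap  # high values mark image regions the prediction relies on
```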
E. Implementation of the Meta Models
Finally, we implement different meta model architectures for triggering condition detection. We optimize the meta models on training data (from the training scenario in Section IV-C) to classify the results of the implemented monitor categories. For the optimization of the LR meta model, we use a maximum likelihood estimation with an iteratively reweighted least squares algorithm. The optimized LR meta model is then used for the detection of triggering conditions on test data (from the test scenario in Section IV-C) by estimating the triggering condition probability $P_{TC}$ according to (6) and (7).
Results and Discussion
In this section, we present and discuss our results related to the error detection performance of the baseline DNN, the monitor categories, and the meta models. Table 2 shows our adapted confusion matrix for the task of triggering condition detection. The diagonal entries (TP, TN) indicate that a prediction of the DNN error detector (i.e., triggering condition vs. safe input) matches the ground truth, whereas the off-diagonal entries indicate misclassification (FP, FN). We label our training and test data from Section IV-C according to the baseline DNN prediction: false DNN predictions are labeled as triggering conditions, whereas correct DNN predictions are labeled as safe inputs.
First, we analyze the baseline DNN and the individual monitor categories with respect to their detection performance on different triggering condition types in Section V-A. Afterwards, we compare the different meta model architectures in Section V-B under consideration of the trade-off between interpretability and detection performance.
A. Results and Discussion for Monitor Categories
We analyze the performance of the baseline DNN and the monitor categories on the test data from the test scenario. The test dataset consists of 72 traffic sign tracks. We evaluate the detection performance with the true positive rate (TPR) and false positive rate (FPR), as defined in (11), and summarize the resulting ROC curves by the area under the curve (AUC), as defined in (12): \begin{align*} \text {TPR} &= \frac {\text {TP}}{\text {TP} + \text {FN}}; \quad \text {FPR} = \frac {\text {FP}}{\text {FP} + \text {TN}} \tag{11}\\ \text {AUC} &= \int _{0}^{1} \text {ROC}(x) \, dx\tag{12}\end{align*}
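A simplified Python/NumPy sketch of how an ROC curve and its AUC per (11) and (12) can be computed from detector scores and binary labels is shown below; it sweeps the decision threshold over all observed scores, does not handle tied scores specially, and is meant only to make the metric definitions concrete.

```python
import numpy as np

def roc_auc(scores, labels):
    # Sort detector scores in descending order; labels: 1 = triggering condition.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)        # true positives as the threshold is lowered
    fp = np.cumsum(1 - labels)    # false positives as the threshold is lowered
    tpr = np.concatenate(([0.0], tp / max(labels.sum(), 1)))        # Eq. (11): TP / (TP + FN)
    fpr = np.concatenate(([0.0], fp / max((1 - labels).sum(), 1)))  # Eq. (11): FP / (FP + TN)
    auc = np.trapz(tpr, fpr)                                        # Eq. (12): area under ROC
    return fpr, tpr, auc
```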
Triggering condition detection performance on test dataset, shown by ROC diagram. (a) Overall performance of monitor categories, LR & GBT meta model, and baseline DNN. (b) Performance of LR meta model and baseline DNN on different triggering condition types (ID, OOD, Adversarial).
B. Results and Discussion for Meta Models
We optimize the meta models (LR, NB, KNN, RF, GBT, SVM, FFNN) on the training data from the training scenario. The training dataset consists of 144 traffic sign tracks.
Separation ability between safe inputs and triggering conditions on overall test dataset, shown by histogram. (a) Baseline DNN and uncertainty monitor (Deep Ensembles). (b) LR and GBT meta model.
Conclusion
Reliable detection of environmental error root causes (i.e., triggering conditions) and related DNN errors is a crucial part of automated driving systems. In this work, we proposed a general error detection approach for DNNs, which addresses various safety-related DNN insufficiencies simultaneously during runtime. We developed our approach taking into account concepts of the automotive safety standard ISO 21448 (SOTIF). Therefore, we created an error root cause model with DNN-specific insufficiencies and triggering conditions. Afterwards, we derived general monitor categories to address the identified DNN insufficiencies during runtime. In doing so, we considered the main literature fields of DNN runtime monitoring. Finally, we introduced a meta model, based on stack learning, that combines the results of the individual monitor categories for an insufficiency-driven error detection. We applied our error detection approach to the traffic sign recognition use case in a MATLAB Simulink simulation by using self-created 3D scenarios built with MATLAB RoadRunner. Our simulation covered all types of triggering conditions, like in distribution inputs, out of distribution inputs, and adversarial attacks. In the performance evaluation, we showed that each monitor category has different strengths and weaknesses with respect to the different triggering condition types. Furthermore, we optimized various meta model architectures like logistic regression, naive Bayes, k-nearest neighbors, random forest, support vector machine, gradient boosted trees (XGBoost), and feed forward neural network to predict an error probability based on the monitor category outputs. We showed that the meta models are able to clearly distinguish between safe input data and all types of triggering conditions, in contrast to a baseline DNN and to state of the art DNN monitoring methods. Furthermore, the interpretable meta model architecture, based on logistic regression, achieves similar error detection performance as more complex models like gradient boosted trees and the feed forward neural network, due to the linear relationship between the monitor category predictions and the logit transformation of the triggering condition probability. However, the performance of the meta models is influenced by the use case and the implemented monitor categories, which requires an application-specific assessment of meta model architectures. Furthermore, care must be taken that the implemented monitor categories are not highly correlated in their predictions in order to utilize stack learning. Additionally, data balance in the training data is important to achieve optimal detection performance with respect to the different triggering condition types. Finally, the available computing capacity on the target hardware has to be taken into account due to the usage of multiple monitoring methods in parallel. In future work, we plan to reduce the required computing effort by minimizing the number of sample-based approaches. Furthermore, an extension of the 3D driving scenarios with further OOD data (e.g., U.S. and Chinese traffic signs) and adversarial data (e.g., sticker-based adversarial attacks) is planned. We also plan field studies on automated driving target hardware for the traffic sign recognition use case to evaluate the runtime performance of our approach on live data.
Acknowledgment
Thanks to Heiko Möckel, Nico Kem, Markus Baumeister, and Marek Kowalczyk for inspiring discussions during the research activities.