Introduction
Recently, significant improvements have been reported in the development of vehicular sensors for simple and complex tasks, including object detection [1], localization [2], tracking [3], and activity recognition [4], across numerous applications. Such advancements have improved the sensing and computing processes of autonomous driving (AD) [5]. Despite this, AD demands further attention from industry and academia due to its safety-critical nature and its key role in reducing the number of accidents and saving human lives. For instance, in the USA alone, 6 million car accidents happen on average every year, in which around 3 million people are injured and around 2 million suffer permanent injuries [6]. Besides injuries, more than 90 people die in car accidents every day. The main causes of these accidents include alcohol (40%), speeding (30%), and reckless driving (33%). Distracted driving also results in a huge number of accidents [7]. According to a report, more than 9 people are killed each day due to distracted driving in the USA [8]. Similarly, more than 1060 people are injured in crashes due to driver distraction. These crashes can be dramatically reduced by using driverless vehicle technology, either as a support tool for drivers or in full automation. Furthermore, the disabled community can greatly benefit from this technology [9].
Due to this wide range of benefits, governments and companies worldwide are taking an interest in AD. For instance, the top twenty-five countries are evaluated in [10] for AD readiness and scored in terms of policy and legislation, technology and innovation, and infrastructure. As depicted in Figure 1, Singapore leads in policy and legislation, Israel in technology and innovation, and the Netherlands in infrastructure for AD. Countries such as the UAE have good infrastructure but remain limited in the advanced technologies needed to operate AD on their roads.
Figure 1. Successes achieved so far by the top twenty-five countries in the AD race in terms of policy and legislation, technology and innovation, and infrastructure.
The literature describes six levels of driving automation, numbered 0 to 5, as defined by the SAE International standard [11]. This level-based roadmap is visualized in Figure 2. Level 0 vehicles are under the full control of the driver [12]. At Level 1, the car performs minor tasks such as acceleration or steering, e.g., adaptive cruise control, while the rest of the control remains with the human driver [13]. A Level 2 car can take some safety actions such as emergency braking, but the driver still needs to stay alert while driving. Tesla’s Autopilot and Nissan’s ProPilot can be regarded as Level 2 because they can keep the car in the desired lane. At Level 3, the car can drive automatically in certain conditions by monitoring the surrounding environment, but the human driver must remain ready to take control if the autonomous system fails [14]. Audi claimed that its A8 models featuring Traffic Jam Pilot reach Level 3. At Level 4, the car can safely retain control and proceed accordingly if its request for human intervention receives no response [15]. Level 4 cars are not recommended to be driven in uncertain weather conditions or unmapped areas. Lastly, Level 5 vehicles cover full automation in all conditions and modes [16].
To date, major industries have launched several efforts and initiatives to mature AD technology. For instance, the famous DARPA Grand Challenge of 2004 [17] required a driverless car to cover a 150-mile route, and all 15 participating vehicles failed it. Improvements followed, and 5 out of 23 participants passed the challenge in the 2005 edition of the contest. Another event, the “DARPA Urban Challenge” [18], was held in 2007, in which six participants completed the mission. Other noteworthy events include the “Intelligent Vehicle Future Challenge” [19] (2009~2013), the “Hyundai Autonomous Challenge” [20] (2010), and the “Public Road Urban Driverless-Car Test” [21] (2013). Most recently, in 2015~2016, the Google self-driving car and Tesla’s Autopilot system [22] were introduced as commercial examples. Apart from these milestones, several well-known companies are planning to announce autonomous vehicles of different levels in the near future. For instance, Ford plans to deliver a “Level 4” driverless vehicle in 2021. Similarly, BMW is targeting a “Level 4” or “Level 5” autonomous car in 2021 [23].
Despite the aforementioned achievements, several issues still restrict the usefulness of AD technology in numerous environments, such as the maturity of Artificial Intelligence (AI) methods for visual sensors (particularly those relying on Deep Learning (DL)), the interdependent performance of the individual parts of the AD system, and social acceptance at large scale [24]. Among these, the first two are important aspects and key enablers of an AD system: addressing them can significantly increase its safety and consequently earn broad public acceptance. Indeed, safety is a key requirement of AD, which should assist drivers and minimize the risk of potential accidents. This requirement can be mainly ensured by seven key tasks: road detection [25], lane detection [26], vehicle detection [27], pedestrian detection [28], drowsiness detection [29], collision avoidance [30], and traffic sign detection [31], for which numerous methods based on hand-crafted and learned representations have been presented. The current literature contains surveys of traditional methods on different aspects of AD such as planning and controls [32], traffic light recognition [33], and vehicle localization [34]. Many others have emphasized the paramount role of DL in the intelligent transportation systems (ITS) domain [35]–[38]. However, a detailed study on DL methods for safe AD, which are considered a backbone of AD and its safety, is missing.
This survey aims to fill this gap in the literature by analyzing the most recent DL works related to the aforementioned seven tasks for safe AD. We organize the seven tasks into a three-step pipeline of measurement, analysis, and execution, and list their major achievements and key limitations. Our work also highlights the current issues of safe AD, with several research recommendations focused on enhancing the applicability of DL methods in realistic vehicular environments, with safety as their primary requirement. We complement our critical analysis of the existing literature with an excerpt of empirical results of different DL model architectures for several safety-related AD tasks, which shed light on the enormous potential of these models. We end with a discussion of current research areas in the DL realm that remain insufficiently studied to date, despite their straightforward connection to safety issues in this particular application field.
The remainder of this article is structured as follows: Section 2 briefly describes the major control processes of AD. Section 3 details the recent DL approaches for AD systems along with their strengths and limitations. In Section 4, the major challenges of safe AD are discussed, and future directions are suggested in Section 5. Section 6 concludes the survey with final remarks and an outlook.
Major Embodiments of AD and Related Studies
This section briefly describes the three-step pipeline of measurement, analysis, and execution (MAE) and presents several related studies associated with AD. “Measurement” refers to data collection from the surrounding environment via sensors, cameras, or radars, together with the processing needed to detect roads, lanes, vehicles, pedestrians, etc. [39]. Accordingly, all the methods in this survey associated with these tasks are covered under “M”. The “analysis” phase uses more advanced algorithms for filtering, tracking, and other concrete steps to fulfill a certain set of optimization requirements for AD. Based on the result of the analysis, the “execution” part uses actuators to trigger an alarm or take over control of the vehicle. In this phase, automatic braking can be enabled to avoid a collision, thus ensuring safety on roads for AD systems.
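As an illustration only, the MAE pipeline can be read as three chained stages. The following Python sketch makes that data flow explicit; the names `Detection`, `measure`, `analyze`, and `execute`, the 1.5 s reaction time, and the distance-based risk rule are hypothetical choices for this example and are not taken from any of the surveyed works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str         # e.g. "pedestrian" or "vehicle" (hypothetical labels)
    distance_m: float  # estimated distance ahead of the ego vehicle, in meters


def measure(frames: List[List[Detection]]) -> List[Detection]:
    """Measurement: stand-in for sensor capture plus DL-based detection."""
    return [det for frame in frames for det in frame]


def analyze(detections: List[Detection], speed_mps: float,
            reaction_s: float = 1.5) -> bool:
    """Analysis: flag a risk if any object lies within the distance
    travelled during the assumed reaction time."""
    danger_zone_m = speed_mps * reaction_s
    return any(det.distance_m <= danger_zone_m for det in detections)


def execute(risk: bool) -> str:
    """Execution: map the analysis result to an actuator command."""
    return "brake" if risk else "continue"


def mae_step(frames: List[List[Detection]], speed_mps: float) -> str:
    """Run one measurement -> analysis -> execution cycle."""
    return execute(analyze(measure(frames), speed_mps))
```

For example, at 10 m/s the assumed danger zone is 15 m, so `mae_step([[Detection("pedestrian", 10.0)]], 10.0)` requests braking, while the same pedestrian at 40 m does not.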
Studies show that MAE is the necessary pipeline for controlling the automatic behavior of such vehicles, and thus it is covered in this survey. Other aspects of such systems are already covered in detail by the surveys listed in Table I. For instance, the state of the art in vision-based traffic light recognition for AD is covered in [33]. Similarly, the planning and control aspect of AD in urban settings is investigated in [16]. To the best of our knowledge, an in-depth study of DL approaches for safe AD is missing from the current literature and is thus presented in this paper, with an overview given in Figure 3.
Figure 3. Distribution of DL approaches across the major embodiments (measurement, analysis, and execution, MAE) of AD systems.
Critical Literature Analysis of AD Tasks
In this section, the state-of-the-art DL approaches listed in Table II are briefly described for the seven target tasks. Many studies address different parts of the AD system, such as sensing, image processing, and communication, which work collectively to enable the vehicle to drive itself; however, certain parts have attracted more attention due to their large impact on the overall performance of such vehicles. The most important parts are the seven tasks associated with MAE, in the context of which the relevant studies are explored as follows:
A. Road Detection
This task aims at detecting road boundaries and the areas where an autonomous vehicle can possibly drive. In this context, four representative works are selected. The first framework [52] applies CNNs to estimate the road course over longer distances for augmented reality applications. The second investigates a cascaded end-to-end CNN (CasNet) for accurate road detection and centerline localization in the presence of complex backgrounds and significant occlusions by trees and cars [53]. The other works present a Siamese fully convolutional network-based framework for accurate detection of road boundaries using RGB images, semantic contours, and location priors [54], and a completely end-to-end model called RBNet [55] that detects road presence as well as road boundaries in a single network.
B. Lane Detection
Lane detection plays a key role in ensuring the safety of autonomous vehicles via lane keeping and lane departure control systems, which keep them in their specified lane and minimize the chances of collision. In this context, four recent DL approaches are selected as examples. In the first approach, [56] passed multi-sensor data through a deep neural network for lane detection in 3D space. The second approach investigates waveforms and CNNs for detecting lane markings for safe AD, as discussed in detail in [57]. In the third work, an energy-friendly lane detection and classification strategy is proposed using stereo vision and a CNN for lateral positioning of the ego-car and for issuing forward collision warnings for safe AD [58]. In the fourth work [59], a recurrent neural network is utilized for road lane detection. Together, these approaches support both lane detection and collision avoidance.
C. Vehicle Detection
To avoid possible accidents, the autonomous vehicle needs to detect and track other vehicles on the road. For this task, it needs to estimate different attributes of surrounding vehicles such as their shape, relative speed, size, and 3-dimensional location. In this context, some state-of-the-art techniques from recent literature are described as examples. The first is an automatic approach for vehicle detection and counting using a convolutional regression neural network for traffic management and safe AD, discussed in detail in [60]. Chen et al. [61] presented a framework for 3D object detection that uses a deep CNN model to predict object, location, and contextual boxes. Similarly, Rajaram et al. [2] presented a mathematical strategy for object localization, utilizing Faster R-CNN along with RefineNet and region-of-interest pooling for vehicle detection and localization. Another work is a vehicle detection framework using a multi-task deep CNN and a region-of-interest voting strategy [62]. These methods enable the autonomous vehicle to detect other on-road vehicles and initiate safety measures, thus increasing the safety level of AD.
D. Pedestrian Detection
Vehicle-to-pedestrian accidents are a common scenario on roads. An autonomous vehicle must distinguish humans from other objects because of their higher importance. Thus, visual cameras are installed on autonomous vehicles for the detection, tracking, and possible recognition of pedestrians, for collision avoidance and other purposes. For instance, Ouyang et al. [63] presented a joint framework of deep feature extraction, deformation and occlusion handling, and classification for pedestrian detection, which helps increase the safety of AD. Cai et al. [64] formulated complexity-aware cascade training for pedestrian detection, integrating the cascade with a CNN to enable accurate pedestrian detection at a faster speed. Similarly, Wang et al. [65] proposed a pedestrian detection approach that investigates body part semantics and contextual information with handling of complex occlusions, achieving highly accurate localization results, which consequently increase the safety of AD.
E. Drowsiness Detection
This task concerns the driver and is especially relevant for Level 1 to Level 3 autonomous vehicles, as Level 4 and Level 5 vehicles are fully driverless. It is one of the key contributors to safety applications, as the system can automatically take the necessary action once the driver appears distracted or a drowsy state is detected. Several approaches exist in the literature. For instance, Lyu et al. [66] proposed a multi-granularity deep framework that uses a CNN and LSTM for drowsiness detection in videos. Vijayan and Sherly [67] presented three CNN architectures, ResNet50, VGG16, and InceptionV3, for drowsiness detection in first-person driver videos. These models are trained together by fusing them through a feature-fused architecture layer. A similar approach is followed by Park et al. [68], who integrated the results of AlexNet, VGG-FaceNet, and FlowNet through fully connected layers for drowsiness detection. In another similar work, Guo and Markoni [69] investigated CNN and LSTM for drowsiness detection.
F. Collision Avoidance
The previous tasks show that autonomous vehicles can detect and track the important objects around them; however, detection and tracking alone are not enough to take a decision. The important decisions and actions are taken by the collision avoidance system. Thus, it is a higher-level task on which the safety of AD heavily depends, and numerous studies have been conducted in this direction. For example, the method of Song et al. [58] both detects lanes and avoids collisions by taking the necessary action. Similarly, Nguyen et al. [70] proposed a system that detects obstacles, recognizes them using an autoencoder, TCNN, and R-TCNN based DL architecture, and finally tracks them to reduce the chance of collision during AD. In another approach, Long et al. [71] proposed a novel end-to-end collision avoidance system using a deep neural network operating on noisy sensory measurements.
G. Traffic Sign Detection
This task mainly concerns protecting vehicles from collisions at zebra crossings and road junctions, reducing speed at speed bumps, notifying the driver before turns, suggesting U-turns, etc. Its function is simple, yet it is very important and makes decision making challenging, as discussed in several studies. For instance, Zhu et al. [72] developed an object proposal-based framework for traffic sign detection and recognition. The search area for traffic signs is reduced through a CNN, and then detection and classification are performed using R-CNN and EdgeBox methods. Li et al. [73] proposed a traffic light recognition model for on-vehicle cameras that uses prior frame information, keeping the detection record of the previous frame, and aggregates channel features to analyze the inter-frame information. Wang and Zhou [74] recognized traffic lights in dynamic images using a lightweight DL model. A dual-channel mechanism is proposed for traffic light detection in dark frames, and a lightweight CNN model is developed to classify them in real time. For the dark channel, a saliency model is developed to extract lights of different colors simultaneously. Jensen et al. [75] applied a real-time object detector to traffic light detection using various YOLO versions and achieved state-of-the-art results on challenging datasets. Ouyang et al. [76] used a heuristic candidate region selection module for traffic light identification and developed a lightweight traffic light detection (TDL) model for its classification. The model is evaluated on both collected and benchmark datasets, and is tested through offline simulation and an on-road test. For on-road testing, the model is integrated with an Nvidia Jetson and run in normal traffic on a bus and a car. Yuan et al. [77] developed the VSSA-NET architecture for traffic sign detection and treated it as a regression and sequence classification task.
The network architecture is based on vertical spatial sequence attention and a multi-resolution feature learning module. Contextual information is also extracted through regression and classification with an attention procedure. In a similar method, Tabernik and Skočaj [42] used the Mask R-CNN object detection algorithm with several adaptations of the network to achieve the final detection. For better performance, appearance and geometric distortion distributions are applied as data augmentation to enlarge the dataset. All these tasks contribute to the overall safety of AD, and thus researchers are increasingly investigating these areas.
Performance Evaluation of the Safe AD
This section compares the performance of different state-of-the-art DL models on tasks closely related to AD safety. The tasks are evaluated using multiple metrics, including F-measure, precision, recall, overall accuracy, average precision (AP), area under the curve (AUC), and runtime. However, we discuss only those evaluation results that were obtained under a common assessment criterion. For instance, the majority of road detection techniques are assessed using the F-measure. The F-measure, also known as the F1-score, is the harmonic mean of precision and recall and thus captures the trade-off between them. It is calculated as given in Eq. 1.
\begin{equation*} F1=2\times \frac {Precision\times Recall}{Precision+Recall}\tag{1}\end{equation*}
Similarly, the mainstream techniques for lane detection are evaluated using AP and AUC. AP (often reported as mean average precision (mAP), its mean over object classes) is a performance evaluation metric for object detectors, computed from the precision (Eq. 2) at different recall (Eq. 3) levels. More generally, AP is the area under the precision-recall curve and lies in the range 0 to 1.
\begin{align*} Precision=&\frac {True~positive}{True~positive+False~positive}\tag{2}\\ Recall=&\frac {True~positive}{True~positive+False~negative}\tag{3}\end{align*}
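To make Eqs. 1 to 3 and the notion of AP concrete, a minimal Python sketch is given here. It assumes that each detection has already been matched against the ground truth and labeled as correct (1) or incorrect (0), and that every ground-truth positive appears among the scored detections; the function names are hypothetical. AP is integrated step-wise without interpolation, whereas individual benchmarks may use interpolated variants, so the numbers are not directly comparable with those reported in the surveyed works.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision (Eq. 2), recall (Eq. 3), and F1 (Eq. 1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(scores, labels):
    """Area under the precision-recall curve (step-wise, non-interpolated).

    scores: confidence of each detection
    labels: 1 if the detection is a true positive, 0 otherwise
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        precision = tp / (tp + fp)
        # Recall only grows when a true positive is added.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

With 8 true positives, 2 false positives, and 2 false negatives, precision, recall, and F1 all equal 0.8. A detector that ranks all correct detections first obtains an AP of 1.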
Likewise, AUC is used for the analysis of AI models: the true positive rate is plotted against the false positive rate over all decision thresholds, which shows at which threshold the trained model performs well. Figure 4 (a) visualizes the state-of-the-art results achieved by different DL models on the KITTI [78] benchmark dataset. This dataset is one of the challenging datasets for AD tasks such as road detection, lane detection, on-road pedestrian detection, and vehicle detection. It is divided into three sets, urban marked (easy), urban multiple marked lanes (moderate), and urban unmarked (hard), and has 289 training images and 290 testing images. Among the compared methods, RBNet [55] has achieved the current highest F-measure score. It relies on a DCNN with five convolutional layers for feature extraction, followed by post-processing for road boundary detection. The model is trained for 100k epochs with a learning rate of 0.01 and requires 0.18 seconds per processed frame. DNN [79], s-FCN-loc [54], and Up Conv [80] achieved F-measure scores of 93.43%, 93.26%, and 93.83%, respectively. Furthermore, DNN [79] and s-FCN-loc [54] utilize very deep CNN architectures followed by complex post-processing; as a result, they require 2 and 0.4 seconds of processing time per frame, respectively. Up Conv [80] takes only 0.083 seconds per processed frame, but its accuracy is lower than that of RBNet [55].
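For readers unfamiliar with the AUC, the following Python sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration; the surveyed works may compute the AUC differently.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: model confidence per sample
    labels: 1 for positive samples, 0 for negative samples
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; ties count as 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that scores every positive above every negative reaches an AUC of 1, while a model whose scores carry no information is expected to reach about 0.5.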
Figure 4 (b) shows the comparison of different state-of-the-art methods for lane detection using AP and AUC scores. The experiments use the Caltech lanes dataset [83], which contains 1225 challenging images taken in the busy streets of Pasadena. SCNN [81] and DMS [56] are evaluated using the AP score and achieved 59.5 and 84.7, respectively. SCNN [81] uses the VGG16 CNN architecture as its backbone, with three additional fully connected layers added for road detection. The extended model is trained with a learning rate of 0.01 and a weight decay of 0.0001 using the “poly” learning rate policy. The model is not efficient enough for real-time processing because the VGG16 backbone leads to 0.115 seconds of processing time. Similarly, Li et al. [82] utilized an architecture of two convolutional layers and fully connected layers followed by multitask object detection, where the first task detects the object and the second estimates the geometry output. They experimented with RNN, CNN, and SVM on the features extracted from the fully connected layers for multitask learning; the RNN performed best and achieved the highest AUC value of 0.99. Because AUC values lie between 0 and 1, we rescaled them for better visualization, so 0.99 appears as 99 in the graph.
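The “poly” learning rate policy mentioned above is commonly defined as the base rate multiplied by (1 - iteration / max_iter) raised to a power. A short Python sketch follows; the base rate of 0.01 comes from the text, while the exponent of 0.9 and the iteration counts are assumptions for illustration, since the text does not report them.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial ("poly") learning-rate decay.

    power=0.9 is an assumed, commonly used value, not one reported in the text.
    """
    if max_iter <= 0 or not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power


# Learning rate at the start, middle, and end of a hypothetical 1000-step run.
schedule = [poly_lr(0.01, it, 1000) for it in (0, 500, 1000)]
```

The rate starts at the base value and decays smoothly to zero at the final iteration.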
The pedestrian detection techniques given in Figure 5 have been evaluated using mAP scores on the KITTI [78] benchmark dataset. In the state of the art, three scenarios defined by the benchmark are used for pedestrian detection evaluation: easy, moderate, and hard. In the easy scenarios, the minimum bounding box height of the object is 40 pixels and the objects are fully visible without any occlusion. In the moderate scenarios, the minimum bounding box height is 25 pixels and the objects are partly occluded. In the hard scenarios, the minimum bounding box height is 25 pixels and the objects are heavily occluded and difficult to see. For the easy, moderate, and hard data, F-PointNet [85] achieved the maximum mAP scores of 87.81, 77.25, and 74.46, respectively, using 2D and 3D CNN architectures and their fusion for pedestrian detection. MM-MRFC [86] utilized color, motion, and depth features and achieved the second highest accuracy with a per-frame processing time of 0.05 seconds. The overall performance of the state of the art is around 85% in easy scenarios, between 70% and 80% in moderate scenarios, and less than 65% in hard scenarios. Therefore, pedestrian detection remains a very challenging issue for safe and trustworthy AD, where the accuracy should reach the level of human perception at the easy, moderate, and hard levels.
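The three difficulty levels described above can be expressed as a simple rule on bounding box height and occlusion. The Python sketch below uses only the thresholds stated in the text; the integer occlusion encoding (0 fully visible, 1 partly occluded, 2 heavily occluded), the function name, and the "ignored" fallback are assumptions for illustration, and any further criteria of the benchmark that the text does not mention are left out.

```python
def difficulty(bbox_height_px, occlusion):
    """Assign a difficulty level from box height and occlusion state.

    occlusion: 0 = fully visible, 1 = partly occluded, 2 = heavily occluded
    (hypothetical encoding chosen for this sketch).
    """
    if bbox_height_px >= 40 and occlusion == 0:
        return "easy"
    if bbox_height_px >= 25 and occlusion <= 1:
        return "moderate"
    if bbox_height_px >= 25 and occlusion <= 2:
        return "hard"
    return "ignored"  # too small to fall into any of the three levels
```

Under this rule, a fully visible pedestrian with a 30-pixel box counts as moderate rather than easy, because it misses the 40-pixel height threshold.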
Similar to pedestrian detection, vehicle detection is evaluated on the easy, moderate, and hard scenarios of the KITTI [78] benchmark dataset, as given in Figure 6. These methods are evaluated using the AUC. The mainstream methods perform well in the easy cases, reaching a maximum of 93.04 by 3DOP [87]; for the moderate and hard cases, the maximum AUC values of 88.64 and 79.27 are achieved by SubCNN [89]. 3DOP [87] encodes object size priors, the ground plane, and several depth-informed features that reason about free space, point cloud densities, and distance to the ground. It utilizes a structured SVM, which takes input-output pairs and learns the parameters through the proposed optimization function. SubCNN [89] exploits two very deep CNN architectures based on Fast R-CNN: 1) a region proposal network and 2) an object detection network. The processing time is not discussed in the article, but it is widely agreed that methods based on multiple CNNs are not well suited for real-time processing due to their huge computational complexity, and are thus of limited use for AD.
\begin{equation*} Overall~accuracy=\frac {Number~of~correct~predictions}{Total~number~of~predictions~made}\tag{4}\end{equation*}
Drowsiness detection is assessed using the overall accuracy metric, and the results of well-known DL methods are compared in Figure 7. The overall accuracy states what proportion of all test samples were classified correctly, as given in Eq. 4. It is usually expressed as a percentage, with 100% corresponding to a perfect model. Drowsiness detection is a very challenging task: it is difficult even for humans, whose average accuracy over day and night scenarios is only 80% [68]. FFA reached an accuracy of 75.57%, which is the highest achieved so far in drowsiness detection. Furthermore, AlexNet [91], VGG-FaceNet [92], LRCN [93], FlowImageNet [93], and DDD-FFA [68] achieved 65.85%, 67.85%, 61.5%, 62.99%, and 70.81%, respectively. AlexNet [91] and VGG-FaceNet [92] are very deep CNN architectures containing 60 and 138 million parameters, respectively, which is not efficient for a real-time task like AD. FlowImageNet [93] was originally trained for action recognition but is fine-tuned to recognize drowsy states, such as face and head gestures, from the input image sequence. It has five convolutional and two fully connected layers, and its final layer is changed from 101 to 4 classes in the fine-tuning process. It is fast but not effective enough for trustworthy AD. In DDD-FFA [68] and FFA [67], features from the FC layers of three CNN models are ensembled for drowsiness detection. This strategy increases the overall accuracy; however, the processing time also increases by up to three times. Drowsiness is very dangerous in driving and is a frequent cause of accidents. Therefore, improving detection accuracy should be considered in future work for safe AD, jointly with other aspects of predictive modeling that are often overlooked, such as the quantification of the model’s output confidence, the explainability of the knowledge captured by the model, or the accountability of predictions.
Since safe AD decisions may put human lives at risk, explaining what a model observes in its input to produce its output becomes a critical factor for its compliance with regulatory constraints.
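As a small worked example of the overall accuracy in Eq. 4, which is the metric used for the drowsiness results discussed above, the following Python sketch counts correct predictions over all predictions. The function name and the class labels "drowsy" and "alert" are hypothetical and serve only to illustrate the computation.

```python
def overall_accuracy(y_true, y_pred):
    """Eq. 4: number of correct predictions / total number of predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical ground truth and predictions for four video clips.
truth = ["drowsy", "alert", "drowsy", "alert"]
preds = ["drowsy", "alert", "alert", "alert"]
acc = overall_accuracy(truth, preds)  # three of four clips are correct
```

Multiplying the result by 100 gives the percentage form in which the accuracies are reported in this section.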
The mainstream techniques for traffic sign detection are evaluated using precision, recall, and intersection over union (IOU). The precision and recall scores achieved on the Swedish traffic-sign dataset (STSD) [94] by different models are shown in Figure 8. The STSD is a very challenging dataset that contains 19236 images of 20 traffic sign categories. R-CNN [72], FCN [72], Faster R-CNN [42], Mask R-CNN [42], MR Features [77], and MR Features + VSSA [77] achieved precision scores of 91.2%, 97.7%, 95.4%, 97.5%, 98.83%, and 99.18%, respectively, and recall scores of 87.2%, 92.9%, 94.6%, 96.7%, 93.96%, and 94.42%, respectively. As this task is more about detection than classification or recognition, researchers also utilize IOU. IOU evaluates a result by dividing the area of overlap between the detected bounding box and the ground-truth box by the area of their union. A high IOU indicates good performance of the detection model, and vice versa. The performance of state-of-the-art traffic sign detection models is given in Figure 9. These results are achieved on a very challenging dataset known as VIVA [73], which is captured in extremely complex scenes and contains almost all traffic light signals, including green, red, and their related right and left turns, with short- and long-distance images under different day/night conditions and illuminations. Traffic sign detection is one of the time-critical applications of AD, where both the effectiveness of the method and its processing time are very important for the decision making that controls the vehicle. Therefore, we investigated the results of fast yet effective methods for traffic sign detection. The well-known YOLO and its variants achieved IOUs of 25%, 21%, 18%, and 16% for YOLOv2 [95], YOLOv2-tiny [95], YOLOv3 [96], and YOLOv3-tiny [96], respectively.
SSD [97] and Faster R-CNN [88] achieved 10% and 12% IOU, and the recent Rttld detector [76] achieved the highest IOU of 44% on the VIVA traffic light detection dataset [73]. These results show that several well-known detectors have low accuracy on this challenging dataset, which calls for serious attention from researchers. Therefore, improving accuracy should be considered in future work for safe AD. Furthermore, this task takes place in an open environment where conditions change with location and weather; therefore, model evaluation under such varying conditions is important future work for traffic sign detection.
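The IOU used in the comparison above can be computed for two axis-aligned boxes as the overlap area divided by the union area. The Python sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2; this box format and the function name are choices made for illustration and may differ from the conventions of the cited detectors.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2 x 2 boxes offset by one pixel in each direction overlap in a 1 x 1 region and therefore have an IOU of 1/7, which illustrates how quickly the score drops under small localization errors.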
The competence and performance of the aforementioned techniques on the seven tasks targeting safe AD reveal many challenges. None of the investigated techniques is capable of addressing measurement, analysis, and execution together. Even though these methods show promising results for their targeted task, they are computationally very expensive. Furthermore, the mainstream methods run on high-spec GPUs and cloud servers, which is not a realistic setting for real application environments, as it neglects important aspects such as energy consumption and prediction latency. Besides the computational complexity, there exist several other open challenges, which are discussed in the next section.
Challenges in Safe AD
Despite significant investment of both academia and industry in AD technology, certain aspects of these systems are still facing difficulties due to numerous challenges discussed as follows:
Complexity of AD Systems: AD systems consist of a series of decision-making problems where the solution of one problem is the input to another. Although significant improvement can be noted in certain parts, the performance of the overall AD system depends on the performance of its individual parts [98]. For instance, Furda and Vlacic [99] presented a multiple-criteria decision-making approach for selecting the most suitable driving maneuver, in which the decision making is divided into successive stages. The first stage in their proposal is safety critical. However, multiple decisions need to be taken, similar to human thinking. Consequently, an efficient motion planner for AD may be compatible only with an energy-intensive feedback controller. On the other hand, simpler controllers need less energy but may be less robust [100] and require a motion-planning approach of significant detail. Thus, intelligent frameworks need to be developed to balance such conflicting metrics and come up with an optimal solution on the fly.
Dynamicity of Road Environment: It is widely agreed that modern cities are becoming more dynamic due to significant digitization on roads, with colored advertisements and illumination. Researchers have presented solutions based on multiple sensors, including radar [101], vision [2], lasers [102], and other modalities [103]; however, in dynamic road scenarios the level of accuracy is still very low. There is also a growing tendency for people to own personal luxury vehicles, which increases road traffic. These practices make the environment of autonomous vehicles even more complex and thus harder to handle, as they affect the detection, tracking, and recognition accuracy of the different tasks associated with AD.
Big Data and Real-Time Processing: To keep the autonomous vehicle well aware of its surrounding environment, a variety of sensing devices, including sensors [104], cameras [2], and LIDAR [105], are attached to it. These devices capture data continuously, resulting in big data. In addition, high-quality data [106]–[108] (e.g., higher-resolution videos) are collected because of the critical nature of AD. Thus, processing such a huge amount of data in real time is a big challenge when accuracy, power consumption, and cost are taken into account [109].
Intelligent Data Prioritization: As discussed, a significant amount of heterogeneous data is captured, resulting in big data. The literature shows that it is infeasible for an autonomous vehicle to process all captured data; thus, a data prioritization mechanism [110], [111] is needed to retain only important content for further processing and discard unnecessary data. This mechanism should be intelligent enough to prioritize the variety of data captured in different environmental scenarios [112].
Robustness and Adaptability: Studies [56]–[65] suggest that it is comparatively easy to capture and process data for an autonomous vehicle in a specific environment. Most of the mainstream AI techniques for AD are trained on data collected in a specific environment and are not reliable across weather conditions. This issue was recently addressed by a Google team, who presented the idea of an all-weather autonomously driven vehicle. However, the task remains inherently challenging when the environment is uncertain and the captured data are affected by snow, rain [113], and fog [114]. Thus, systems for the different tasks associated with MAE should be robust enough to adapt to the surrounding environment.
Integration/Fusion of Sensory Data for Dynamic Decision Making: In real-world applications, it is very hard to achieve ideal performance and target accuracy using a single sensor. Decision-making algorithms are therefore widely used to process fused data acquired from multiple sensors [115], [116]. Two major types of sensors are used in autonomous vehicles: environment perception sensors and localization sensors. Environment perception sensors detect the objects surrounding the vehicle, while localization sensors track the location of the vehicle. Fusion algorithms are categorized into two groups: 1) machine learning methods (deep neural networks) and 2) multi-sensor information fusion methods for state estimation, such as the Kalman filter (KF). Many sensor-fusion-based models using various sensors and fusion algorithms have been proposed in the literature. These frameworks mainly focus on improving accuracy, whereas the implementation feasibility of these methods is less explored. The main challenges in AD are perception, real-time computing and communication, and learning-based control methods. There is considerable room for efficient, lightweight, and robust fusion-based pipelines for autonomous vehicles [117].
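As an illustration of the second group of fusion algorithms, the following is a minimal sketch of a scalar Kalman filter that fuses readings from two sensors observing the same quantity. It is not taken from any of the surveyed works: the class name `ScalarKalman`, the helper `fuse`, the static-state model, and all variances are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter for a (nearly) static quantity."""
    x: float  # current state estimate
    p: float  # variance of the estimate

    def predict(self, process_var: float) -> None:
        # Static-state model (an assumption): the estimate is unchanged,
        # only its uncertainty grows by the process noise.
        self.p += process_var

    def update(self, z: float, meas_var: float) -> None:
        # Kalman gain: how much the new measurement is trusted
        # relative to the current estimate.
        k = self.p / (self.p + meas_var)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)


def fuse(readings, x0=0.0, p0=1e6, process_var=0.0):
    """Fuse (measurement, variance) pairs; returns (estimate, variance)."""
    kf = ScalarKalman(x0, p0)
    for z, r in readings:
        kf.predict(process_var)
        kf.update(z, r)
    return kf.x, kf.p


# Hypothetical example: a noisy sensor (variance 4.0) and a more
# precise one (variance 1.0) measure the same distance in meters.
estimate, variance = fuse([(10.0, 4.0), (12.0, 1.0)])
```

The fused estimate lies closer to the more precise sensor, and its variance is lower than that of either sensor alone, which is the basic motivation for multi-sensor fusion discussed above.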
Fairness, Accountability, and Transparency in DL for AD: Recent studies have stressed the necessity of explaining the decisions provided by AI models in scenarios where such decisions ultimately affect human lives (e.g., health, law, etc.) [118], [119]. AD harnessing DL approaches is no exception to this requirement for model explainability, particularly given the black-box nature of this kind of AI model. By virtue of eXplainable AI (XAI) techniques [120], it is possible not only to shed light on the internals and knowledge learned by DL models, but also to ease the traceability and post-mortem analysis of incorrect decisions (accountability) for model refinement. Likewise, ensuring that DL models [121] for AD are not affected by severely imbalanced or scarce training samples (i.e., that they are free of model bias) improves generalization performance, and hence yields a more reliable contextual awareness of the vehicle and compliance with eventual regulatory constraints [122]. Interpreting what the model has learned and constructing plausible counterfactuals via XAI techniques has the potential to delimit the performance boundaries of the model, unveil possible sources of bias, and reveal how decisions were made in search of possible deficiencies in the model. Without advances in model explainability, AD functionalities harnessing the powerful modeling capability of DL will remain far from practical.
Online Learning Capabilities in AD: One of the major challenges in AD is dealing with various environments using a scalable model. For instance, a model trained for an urban environment may not be applicable in rural areas, since the traffic rules differ considerably between the two scenarios. The same issue arises with newly constructed areas, weather conditions, and climate changes [123]. This problem can be tackled with an online learning strategy (updating the model with new data). Recently, researchers have applied online learning strategies in many domains such as surveillance [124], where the deep model iteratively fine-tunes itself and updates the parameters of the trained model to adapt to the changed environment. Similarly, Guaranteed Safe Online Learning via Reachability (GSOLR) [125], Stochastic Online Learning [126], and Online Learning via meta-learning [127] are recent approaches that can be adopted for online learning in autonomous cars to update different deep models according to maps, weather conditions, and visual changes.
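The idea of updating a model with new data can be sketched with a toy example. The code below is only an illustration under strong assumptions: a linear regressor stands in for the deep model, plain per-sample stochastic gradient descent stands in for the fine-tuning procedure, and the names `OnlineLinearRegressor` and `stream` as well as the two synthetic "environments" are hypothetical. It does not implement GSOLR or any of the cited methods.

```python
class OnlineLinearRegressor:
    """Linear model updated one sample at a time with SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        # One gradient step on the squared error of a single new sample.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err * err


def stream(model, slope, cycles):
    """Feed the model a synthetic data stream with y = slope * x."""
    for _ in range(cycles):
        for x in (-1.0, -0.5, 0.5, 1.0):
            model.update([x], slope * x)


model = OnlineLinearRegressor(n_features=1, lr=0.1)
stream(model, slope=2.0, cycles=200)   # first environment
w_before_shift = model.w[0]
stream(model, slope=-1.0, cycles=200)  # environment changes
w_after_shift = model.w[0]
```

Because the parameters are refreshed with every incoming sample, the model tracks the new relationship after the shift without being retrained from scratch; a real AD system would additionally need safeguards against forgetting and unsafe updates.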
Robustness Against Adversarial Attacks: Analogously to the above, much has been said lately about the weakness of DL models against intelligently crafted examples that, even if visually imperceptible, lead them to incorrect decisions (e.g., misclassification [128]). Adversarial attacks pose enormous challenges in the vehicular domain, as exemplified by traffic signs being wrongly classified by vehicular cameras due to physical adversarial modifications in the form of printable stickers [129]. Although research on defense strategies against adversarial attacks is currently vibrant, there is still a long road ahead in making the effectiveness of these defenses comply with design specifications and admissible risk limits.
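To make the notion of a crafted example concrete, the sketch below applies the fast gradient sign method (FGSM), a widely known attack that is used here purely as an illustration and is not necessarily the method of the cited works. A tiny logistic classifier stands in for a DL model, and the weights, input, and perturbation budget `eps` are hypothetical values, deliberately exaggerated so the effect is visible in three dimensions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """Perturb x by eps in the direction that increases the loss.

    For a logistic model with binary cross-entropy loss, the gradient
    of the loss with respect to the input is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical classifier and input (true label y = 1).
W, B = [2.0, -3.0, 1.0], 0.0
X = [0.2, -0.1, 0.1]
X_ADV = fgsm(W, B, X, y=1, eps=0.2)

clean_is_class_1 = predict_proba(W, B, X) > 0.5
adv_is_class_1 = predict_proba(W, B, X_ADV) > 0.5
```

Each input component moves by at most `eps`, yet the predicted class changes. In image models with thousands of input dimensions, a much smaller per-pixel budget can have the same effect, which is why such perturbations can remain visually imperceptible.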
Variability of Traffic Sign Boards: Object detection models are usually trained with data of a fixed resolution. However, most traffic signs appear very small in the captured frame, and they shrink further when high-resolution images are resized to the required input size of the model. Large sign boards can still be easily captured in the resized image, but resizing leads to the misdetection of small traffic sign boards [42]. Furthermore, when the vehicle travels at a very high speed, e.g., 100 km/h, the resulting high-speed camera motion destroys the structure of small sign boards. Detecting and recognizing all types of traffic sign boards is therefore very challenging; it can possibly be achieved by feeding high-resolution images, rather than resized images, to the model.
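The effect of resizing and of vehicle speed can be quantified with simple arithmetic. The sketch below is a rough illustration; the frame size, model input size, sign size, and exposure time are assumed values, not figures reported in the cited works.

```python
def resized_extent(object_px, source_px, target_px):
    """Side length of an object (in pixels) after the frame is resized."""
    return object_px * target_px / source_px


def distance_during_exposure(speed_kmh, exposure_s):
    """Distance (in meters) the vehicle travels while the shutter is open."""
    return speed_kmh / 3.6 * exposure_s


# Assumed setup: a 2048-pixel-wide frame resized to a 512-pixel model
# input, containing a traffic sign that is 32 pixels wide.
sign_after_resize = resized_extent(32, 2048, 512)

# Assumed exposure time of 1/100 s at 100 km/h.
travel_m = distance_during_exposure(100, 0.01)
```

Under these assumptions, the sign shrinks from 32 to 8 pixels per side, leaving very little structure for a detector, and the vehicle moves roughly 0.28 m during a single exposure, which contributes to motion blur.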
Recommendations for Future Research
In light of the challenges raised in Section 5 and in the literature, a list of important areas for further research in safe AD is presented with brief details for industry and academia. Improvement in these directions can increase the significance of AD systems and contribute to their safety and reliability.
Energy-Friendly Convolutional Neural Networks (CNNs): Several surveys show that CNNs have achieved state-of-the-art results in various computer vision tasks associated with AD, such as tracking [2], [130], speed control [76], and obstacle avoidance [131], [132]. However, their high memory requirement and computational complexity limit their usefulness. Thus, energy-friendly and efficient CNN models should be designed for improving the driving safety of AD technology.
Reinforcement Learning for AD: Reinforcement learning (RL) is an active research focus in various domains of AD, such as control [133], [134] and path planning [135], [136]. RL techniques have undoubtedly achieved good performance levels, evincing their capability to learn near-optimal policies that efficiently operate different subsystems of the autonomous vehicle. Nevertheless, most research contributions reported so far in the literature have been conducted on simulators or in restricted trial environments for a variety of reasons, ranging from established regulatory restrictions to the limited availability of vehicle prototypes or the early stage of research outcomes. As a result, current RL models cannot fully cope with real-world environments, which are full of uncertainties that hinder the provision of safety guarantees [137]. Even though simulators allow driving scenarios to be generated at a low cost, models trained off-line in virtual environments cannot be expected to perform as effectively in real conditions, and ultimately cannot be deployed directly. Therefore, further research is needed towards ensuring good generalization of RL models across simulated and real environments. To this end, several directions should be targeted in the near future, such as increasingly higher levels of realism in vehicular simulation software (for instance, procedural generation of urban scenarios), the latest advances in data augmentation methods (e.g., to imprint varying meteorological conditions on data captured in driving tests), or specific algorithmic proposals aimed at improving the generalization of RL models to unseen environments and/or tasks (Meta Reinforcement Learning [138], with initial findings in the ITS domain appearing very recently [139]).
Sequence Learning and Generative Adversarial Network for AD: Vision sensors deployed on AVs capture pedestrians performing different activities. The patterns underlying these activities cannot be captured from a single frame; rather, they require learning over a sequence of consecutive frames [124]. This augmented information substratum requires efficient sequence learning techniques for pedestrian activity recognition in the AV surroundings, considering additional elements of complexity such as partial occlusions or camera angles that change over time. To this end, data augmentation techniques capable of imprinting these effects on the data from which sequence learning models are built constitute a promising path to follow. Similarly, generative adversarial networks (GANs) can be investigated to generate accurate simulated environments for training self-driving car policies. GANs can learn to re-render a scene from a different viewpoint, which could be useful for creating new learning environments for RL methods and ultimately producing more generalizable policies for self-driving cars.
Reliable and Efficient Motion Planners and Feedback Controllers: The motion planner and the feedback controller are among the critical parts of AD systems, as they play a key role in the overall running time of the system [140]. However, their requirements are inversely related, as described in Section 4. Thus, further investigation is needed to come up with a reliable and efficient motion planner and feedback controller that balance computational burden, speed, and safety [141].
Universal Benchmark Datasets: Despite the available datasets [142]–[145] for evaluating different individual aspects of AD systems (such as the KITTI benchmark [78] and publicly accessible datasets [146]), there is a need for universal benchmark datasets that measure the overall performance of AD prototypes. Such efforts will make AD a hot topic for both academia and industry, helping in benchmarking and in arranging competitions for the concerned research community to improve individual aspects as well as the overall performance of AD systems.
Industrialization and Personalization: Although significant research is in progress on almost every aspect of AD, such as tracking [147]–[149], velocity control [150], [151], localization and mapping [34], [152], path planning [153]–[156], and visual guidance [98], [43], such systems are not yet globally recognized or adopted because of safety risks and the lack of large-scale industrialization. Thus, AD models should be made mature enough to be universally trusted and adoptable at large scale. Further, personalization (such as preliminary explorations for cruise control [157] and lane departure [158], [159]) can be another interesting research direction, allowing users to adjust their preferences in terms of safety, speed limit, available features, and cost. As an example, companies like Google and NVIDIA are building powerful AI-based self-driving cars, investing resources in dedicated high-performance GPU and TPU devices for AD that can efficiently run DL models such as the ones addressed in this study.
Edge Computing for Autonomous Vehicles: To guarantee the safety and robustness of AD, AVs are equipped with various smart sensors and high-performance embedded computing devices. Data acquired from these sensors are processed through DL models to reach accurate decisions. In this context, one of the main challenges is to properly balance the tradeoff between the cost of processing devices and the competence of the computational model [160], [161]. In general, manufacturing industries prioritize the fabrication of sensors at minimum cost with maximum performance [162]. This opens an unprecedented opportunity for Edge Computing to contribute to safe AD. Edge Computing for DL [163] requires research on online training over the edge because vehicular data change dynamically over time [164]. Traditionally, models are trained on devices with high computational resources and, once trained, are deployed on the edge. This strategy is far from effective for tasks associated with AD, because the knowledge captured by the model needs to be refreshed. This is a challenging research direction, requiring effective and optimized online learning mechanisms for training DL models over the edge [165].
Specifically, there is a growing need for mature software frameworks capable of federating DL models learned locally across different distant contexts, without compromising protected data. This federated learning scenario fits perfectly in safe AD, wherein models can be enriched by sharing model information among vehicles rather than the captured data themselves. The relative youth of this research area deserves further attention from the community towards extrapolating the early findings achieved with already existing frameworks to the automotive domain, placing an emphasis on crucial implementation aspects such as latency, reliability/reputation of the federated models, and the obsolescence of the model [166].
Privacy-Aware Knowledge Sharing: In safety matters, human life is the last asset that should be put at risk. Therefore, the community should work together towards more accurate models. Unfortunately, the huge variability of vehicular situations and environments makes it complicated to build DL models capable of maintaining their performance levels across diverse environmental scenarios. In this context, a need emerges for feeding models with diverse datasets that represent as many practical safety-critical situations as possible. However, technical aspects aside and despite ultimately targeting an increase in safety, this workaround is complex to implement in competitive markets with stakeholders reluctant to share the data acquired from their portfolio of clients. Bearing this in mind, the focus should shift to Federated Learning [167], a new distributed computing paradigm with DL at its core, by which locally trained models deployed on vehicles can share their knowledge (embodied in their adjusted parameters, delivered to a central server) and exploit the shared knowledge locally to improve their performance [175]. Interestingly, this distributed computation is accomplished without compromising the privacy of the data on which the local models were trained. We foresee an exciting application scenario of federated DL in vehicular perception, allowing manufacturers to attain unprecedented levels of vehicular perception without major concerns with respect to the privacy and confidentiality of their datasets.
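The parameter-sharing loop described above can be sketched as follows. This is a minimal illustration of federated averaging under simplifying assumptions: each "vehicle" holds a one-parameter linear model and a few private samples, and the function names (`local_update`, `federated_average`, `run_rounds`) and data are hypothetical. Real deployments involve deep models, secure communication, and many more clients.

```python
def local_update(global_params, data, lr=0.1):
    """Train locally on private data, starting from the shared parameters."""
    (w,) = global_params
    for x, y in data:
        w -= lr * (w * x - y) * x  # SGD step on squared error for y = w * x
    return [w]


def federated_average(updates):
    """Server step: average parameters, weighted by local sample counts.

    `updates` is a list of (params, n_samples); raw data never appear here.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(p[i] * n for p, n in updates) / total for i in range(dim)]


def run_rounds(client_data, rounds, init=(0.0,)):
    params = list(init)
    for _ in range(rounds):
        # Each client sends back only its parameters and its sample count.
        updates = [(local_update(params, d), len(d)) for d in client_data]
        params = federated_average(updates)
    return params


# Two hypothetical vehicles whose private samples both follow y = 3 * x.
vehicle_a = [(1.0, 3.0), (2.0, 6.0)]
vehicle_b = [(-1.0, -3.0), (0.5, 1.5), (1.5, 4.5)]
shared_params = run_rounds([vehicle_a, vehicle_b], rounds=40)
```

Only model parameters and sample counts reach the server, so the shared model benefits from both vehicles' data while the samples themselves stay on the vehicles.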
Internet of Everything for Increased Safety: In future smart cities [169], different entities associated with roads, such as vehicles, sign boards, and traffic lights, will be connected with each other to share useful information [170], [171]. These entities need to be interoperable; thus, a diverse set of communication standards needs to be investigated for autonomous vehicles so that no interoperability issues arise [172], [173]. This will enable autonomous vehicles to obtain necessary information about traffic jams, the best available routes in real time, and expected collisions, which can increase the safety of AD.
Risk Assessment: One of the goals of AD is to reduce road fatalities and eliminate human error on roads. However, AVs are not completely risk-free due to the prevalence of real-world uncertainties. Therefore, risk assessment is crucial for improving the safety of AD. A large body of literature has focused on various aspects of AD, including path planning, motion planning, scene segmentation and understanding, DL-based solutions, and pedestrians’ receptivity towards fully autonomous vehicles, to reduce the risk of AD in real environments. For instance, Cunneen et al. [174] studied the ethical framings of AD technology to reduce the risk. Further, the challenge of AD lies in how AVs perceive the external environment and understand different situations so as to minimize the overall risk. A survey on road and lane detection by Hillel et al. [175] suggests that path planning for AVs involves two strategies: 1) bounding-box detection, which maximizes the likelihood of detecting an object inside the box, and 2) semantic segmentation, which classifies each pixel in the input frames. In both scenarios, neural networks have been predominantly successful in AVs, efficiently segmenting the lanes so that the vehicle can follow the road up to the final destination. Moreover, there is a high probability of risk when AVs drive autonomously in a completely new environment. To minimize the risk of AD in such environments, new large-scale datasets have been proposed [176] for benchmarking scene understanding. Related efforts include the SYNTHIA dataset [177], which contains images for scene understanding, and the algorithm of [178] for real-time scene segmentation in AVs, both aimed at reducing risks in AD. Furthermore, Johnson-Roberson et al. [179] gathered more than 200,000 images from the computer game “Grand Theft Auto V” for vehicle detection and speed optimization of AVs to reduce the risk in real environments. The experimental results showed that using virtual environment images in the training process significantly reduced the risk of AVs in real-world environments.
Despite the significant achievements of DL in AVs, a major limitation of DL-based perception systems is their inadequate feedback of uncertainty. Cunneen et al. [180] reviewed the challenges of AI-based decision making, its risks, and societal benefits. Bayesian DL bridges DL and Bayesian probability and offers principled risk analysis within DL. The uncertainty of a model can be estimated using Monte Carlo dropout sampling, i.e., by passing the input data through the network multiple times, each time with a different randomly sampled dropout mask. Furthermore, as suggested by McAllister et al. [181], using a Bayesian network to estimate and propagate the risk assessment would enable AVs to cope with uncertainty. Other techniques related to risk assessment should be inspected further in the years to come towards addressing this issue.
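Monte Carlo dropout sampling can be sketched in a few lines. The example below is a toy illustration under stated assumptions: a two-layer network with fixed, hypothetical weights stands in for a trained perception model, dropout is kept active at inference time, and the spread of the sampled outputs is used as a simple uncertainty indicator. The function names `forward` and `mc_dropout_predict` are introduced here for illustration.

```python
import random
import statistics


def forward(x, w1, w2, drop_p, rng):
    """One stochastic forward pass with dropout on the hidden layer."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    keep = 1.0 - drop_p
    # Inverted dropout: surviving units are rescaled so that the
    # expected activation matches the deterministic network.
    hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w2, hidden))


def mc_dropout_predict(x, w1, w2, drop_p=0.5, passes=200, seed=0):
    """Return (mean, standard deviation) over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [forward(x, w1, w2, drop_p, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)


# Hypothetical weights of an already trained network.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.7, 0.1]]
W2 = [1.0, -0.5, 0.25, 0.6]
mean_out, std_out = mc_dropout_predict([1.0, 2.0], W1, W2)
```

The mean serves as the prediction and the standard deviation as a rough measure of how much the model's output depends on which units are active; a downstream planner could, for example, act more conservatively when this spread is large.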
Another risk factor of AD is the autonomous vehicle itself, because driving involves complex tasks in which several motor and cognitive actions are applied simultaneously and sometimes in quick succession. Also, the performance of AVs depends profoundly on varying weather, lighting conditions, and road surfaces. Moreover, pedestrian behavior is a critical factor that adds further uncertainty to the vehicle’s decisional environment [182]. Given these challenges, it is perhaps not surprising that the cost will be very high if anything goes wrong. For AVs to be considered reliable by the public, they must be driven for billions of miles in complex environments and under varying conditions.
Conclusion
Recent advances in sensing, perception, and signal processing technologies have brought significant improvement to the maturity of AD, thereby reducing human drivers’ efforts and contributing to overall driving safety. DL strategies have recently solved numerous complex problems in different areas in general and in AD in particular; however, a detailed investigation of their use in the control processes of AD is not covered by the current literature. This article pointed out the key strengths of DL methods and surveyed state-of-the-art approaches for safe AD, covering both their major achievements and limitations. In addition, this survey identified the major embodiments of the AD pipeline, i.e., measurement, analysis, and execution (also known as control processes), and investigated the performance of DL methods for several safety-related AD tasks, including road, lane, vehicle, pedestrian, and drowsiness detection, collision avoidance, and traffic sign detection. Lastly, this paper highlighted the major challenges and issues faced by the AD community and suggested recommendations for future research towards safe AD.
Research on safe AD has been in the spotlight of the ITS community for decades. DL experts are in a continuous race to reach a sufficient level of maturity in the AV domain, so that vehicles achieve a thorough, reliable context awareness through sensors endowed with this family of powerful modeling approaches. We advocate for a new era in which investigations should not only aim at improving the accuracy of modern DL flavors, but also inspect aspects related to their usability and practicability [183], such as the need for explanations, robustness to adversarial attacks, the assessment of epistemic uncertainty and risk characterizing such models, or the derivation of neural architectures capable of lowering their energy consumption. Unless these directions are actively pursued by the research community, DL will remain relegated to academic research and controlled trial environments, and vehicular safety will not harness the enormous potential of this branch of Artificial Intelligence.