Introduction
Every year, the lives of more than 1.25 million people are cut short as a result of road traffic accidents. Between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury. Road traffic injuries cause considerable economic losses to individuals, their families, and nations as a whole. These losses arise from the cost of treatment, as well as from lost productivity for those killed or disabled by their injuries and for family members who need to take time off work or school to care for the injured. Road traffic accidents cost most countries 3% of their gross domestic product [1]. Traffic accident severity prediction is an important step in accident management: it provides essential information for emergency personnel to assess the severity of an accident, evaluate its potential impact, and implement effective accident management procedures [2].
Nevertheless, two problems remain in the analysis of traffic accident severity. First, traditional methods in the transportation field cannot reliably determine the severity of a traffic accident; second, the performance of current traffic accident severity prediction methods is generally low. Both problems can lead to untimely rescue or even inadequate rescue facilities, resulting in casualties. A correct prediction of traffic accident severity therefore provides extremely important help in saving the lives of those involved in accidents, and predicting the severity of traffic accidents remains a major challenge in the field of traffic safety.
At present, deep learning has attracted great attention from researchers. Deep learning methods interpret text, images, and speech and have been widely used in text, image, and speech recognition [3]–[9]. Among them, the convolutional neural network (CNN) has become a popular research topic in many scientific fields. The CNN is a fast and effective feedforward neural network that is widely used in computer vision, image recognition, and speech recognition and has achieved remarkable results [10]–[15]. CNNs have the following feature extraction characteristics: 1) the convolution layer is locally connected rather than fully connected, meaning that each output neuron is connected only to locally adjacent input neurons; 2) the pooling layer selects only the most significant features from its receptive field, which greatly reduces the parameter scale of the model; 3) fully connected layers are used only in the final stage of the CNN [16]–[18].
This paper aims to solve the above two problems with the proposed TASP-CNN model, a CNN-based model that considers and mines in detail the combination relationships among the traffic accident features that affect accident severity. Because the input of TASP-CNN must be an image, this paper first proposes a method for measuring the weights of traffic accident features and an algorithm, FM2GI, that converts the single-feature relationships of traffic accident data into gray images containing combination relationships, in parallel and based on the feature weights. A TASP-CNN model is then constructed to predict traffic accident severity.
The main contributions of this paper are summarized as follows:
An algorithm, FM2GI, is proposed to convert the single-feature relationships of traffic accident data into gray images containing combination relationships, in parallel and based on the weights of the traffic accident features.
The combination relationships among traffic accident features are considered and applied simultaneously in traffic accident severity prediction using the proposed algorithm, FM2GI, and the TASP-CNN model.
The performance of the proposed TASP-CNN model was compared with that of 9 competitive models, and the results show that TASP-CNN outperformed all of them.
The remainder of this article is organized as follows. Section 2 discusses related work. Section 3 presents the method for measuring the weights of traffic accident features, the methodology for converting the single-feature relationships of traffic accident data into gray images containing combination relationships in parallel based on those weights, and the TASP-CNN model for predicting traffic accident severity. Section 4 gives the experimental results of the TASP-CNN model and an analysis of these results. Finally, the conclusion and future work are described in Section 5.
Related Work
According to the primary literature on traffic accident severity prediction in recent years, prediction methods for traffic accident severity can be divided into two categories: (1) statistical-learning methods and (2) machine-learning methods [19].
A. The Statistical Methods
The statistical methods have been widely used in traffic accident severity prediction. For example, various researchers have proposed the logistic regression model, the ordered probit model, and the mixed logit model, which have been used to analyze traffic accident data and to examine the influence of multiple variables on accident severity [2], [20]–[26]; the purpose of these models is to predict the severity of traffic accidents. References [2], [27], and [28] have shown that most regression models rely on assumptions and predefined relationships between independent and dependent variables (e.g., a linear relationship between variables). When these assumptions are violated, the severity of a traffic accident is very likely to be predicted inaccurately. De Oña et al. [28] presented an analysis of 1,536 accidents on rural highways in Spain, where 18 variables representing the aforementioned contributing factors were used to build three different Bayesian networks (BNs) that classified accident severity as slightly injured versus killed or severely injured. The variables that most accurately identified the factors associated with accidents in which someone was killed or seriously injured (accident type, driver age, lighting, and number of injuries) were identified by inference. In particular, Zong et al. [2] presented a comparison between two modeling techniques, Bayesian networks and regression models, by applying them to accident severity analysis. Three severity indicators, including the number of fatalities, the number of injuries, and property damage, were investigated using the two methods, and the major contributing factors and their effects were identified. The results indicated that the goodness-of-fit of the Bayesian network was higher than that of the regression models in accident severity modeling. Abellán et al.
[29] demonstrated that rule extraction is limited by the structure of the decision tree (DT), and that some important relationships between variables cannot be extracted when only one DT is used. They therefore proposed a more effective method to extract rules from DTs, but this method can only be applied to a specific traffic accident data set. Chang and Chien [30] collected 2005–2006 truck-involved accident data from national freeways in Taiwan and developed a non-parametric classification and regression tree (CART) model to establish the empirical relationship between injury severity outcomes and driver/vehicle characteristics, highway geometric variables, environmental characteristics, and accident variables. The results showed that drinking and driving, seatbelt use, vehicle type, collision type, contributing circumstances leading to driver/vehicle action, number of vehicles involved in the accident, and accident location were the key determinants of injury severity outcomes in truck accidents. Li et al. [31] compared the performance of the support vector machine (SVM) model and the ordered probit (OP) model. The SVM model produced better prediction performance for crash injury severity than the OP model: the percentage of correct predictions from the SVM model was 48.8%, higher than that of the OP model (44.0%). Even though the SVM model may suffer from a multi-class classification problem, it still provided better prediction results for small-proportion injury severities than the OP model did. Chen et al. [32] utilized a classification and regression tree (CART) model to identify significant variables, and then SVM models with polynomial and Gaussian radial basis function (RBF) kernels were used for model performance evaluation. SVM models were shown to have a reasonable prediction performance, with the polynomial kernel outperforming the Gaussian RBF kernel.
Hashmienejad [33] proposed a novel rule-based method to predict traffic accident severity according to users' preferences instead of conventional DTs. In their method, they customized a multi-objective genetic algorithm, the non-dominated sorting genetic algorithm (NSGA-II), to optimize and identify rules according to support, confidence, and comprehensibility metrics. The evaluation results revealed that the proposed method outperformed classification methods such as ANN, SVM, and conventional DTs in terms of classification metrics, such as accuracy (88.2%), and performance metrics of rules, such as support and confidence (0.79 and 0.74, respectively).
B. Machine Learning Methods
In recent years, machine learning methods have been widely used in traffic prediction problems because of their ability to process multi-dimensional data, flexibility of implementation, versatility, and strong predictive capabilities [34]–[37]. In terms of traffic accident severity prediction, Kunt et al. [38] used twelve accident-related parameters in a genetic algorithm (GA) pattern search and multi-layer perceptron (MLP) structural modeling methods in an artificial neural network (ANN) to predict the severity of freeway traffic accidents. The models were constructed based on 1,000 crashes that occurred during 2007 on the Tehran–Ghom Freeway. The best-fit model was selected according to the R-value, root mean square error (RMSE), mean absolute error (MAE), and the sum of squared errors (SSE). The highest R-value was obtained for the ANN, approximately 0.87, demonstrating that the ANN provided the best prediction. Zeng and Huang [39] proposed a convex combination (CC) algorithm to rapidly and stably train a neural network (NN) model for crash injury severity prediction, and a modified NN pruning function approximation (N2PFA) algorithm to optimize the network structure. According to the results, the CC algorithm outperformed the BP algorithm in both convergence ability and training speed. Compared with a fully connected NN, the optimized NN contained far fewer network nodes and achieved comparable classification accuracy. Both had better fitting and predictive performance than the OL model, which again demonstrates the NN's superiority over statistical models for predicting crash injury severity. Sameen and Pradhan [40] developed a machine-learning model using a recurrent neural network (RNN) and employed it to predict the injury severity of traffic accidents based on 1,130 accident records that occurred on the North-South Expressway (NSE), Malaysia, over a six-year period from 2009 to 2015.
The proposed RNN model was compared with the multilayer perceptron (MLP) and Bayesian logistic regression (BLR) models to understand its advantages and limitations. The results of the comparative analyses showed that the RNN model outperformed the MLP and BLR models. The validation accuracy of the RNN model was 71.77%, whereas the MLP and BLR models achieved 65.48% and 58.30%, respectively.
The majority of traditional statistical and machine learning models treat the data's features one by one, either ordered or disordered, as shown in Figure 1.
Considering the data’s features of traditional models (a) ordered, (b) disordered.
As Figure 1 shows, traditional models consider only the single-feature relationships of the data and ignore the combination relationships among features. For this reason, based on the characteristics of the CNN, this paper proposes a method that combines the features of the data to obtain their combination relationships, as shown in Figure 2: the red rectangle no longer extracts a single feature relationship but a combination relationship among four features. The size of the red rectangle can be changed flexibly according to different situations to obtain different combination relationships; specifically, this is achieved by changing the kernel size of the convolution operation in the CNN. The key to this method, which is also the key to the model's performance, is how to arrange the features of the data into the corresponding positions of the matrix. This requires the features' weights; measuring those weights is explained in the next section.
Although converting features into images is not new [41], [42], the method we propose differs clearly from those papers, because our inspiration comes from the inherent attributes and characteristics of the CNN: by combining the measured feature weights, the features are filled into an all-zero matrix and finally converted into images as the CNN input. Figure 2 illustrates why. The core of a CNN is the convolution operation. Suppose the kernel size is 2 and the stride is 1; when all convolution operations are completed, the number of convolutions at the center of the matrix is the largest, while the number at the edges is the smallest. Feature 5 participates in the convolution operation four times, while features 1, 3, 7, and 9 participate only once. We may therefore conclude that feature 5 contributes more to the feature information extracted by the CNN than edge-position features such as features 1, 3, and 7. This is why we need to measure the weights of the features: generally speaking, the greater a feature's weight, the greater its impact on the prediction results. Accordingly, we fill the features with larger weights into the center of the all-zero matrix, and the features with smaller weights into its edge positions. In this way, we can give full play to the inherent attributes and characteristics of the CNN to improve the performance of the model.
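The counting argument above can be checked with a small sketch (illustrative code, not from the paper's implementation): for a 3 × 3 matrix with kernel size 2 and stride 1, we count how many convolution windows cover each position.

```python
import numpy as np

def coverage(matrix_size, kernel, stride):
    """Count how many convolution windows cover each cell of a
    matrix_size x matrix_size input (valid convolution)."""
    counts = np.zeros((matrix_size, matrix_size), dtype=int)
    for r in range(0, matrix_size - kernel + 1, stride):
        for c in range(0, matrix_size - kernel + 1, stride):
            counts[r:r + kernel, c:c + kernel] += 1
    return counts

# Feature 5 sits at the center and is convolved 4 times; the corner
# features 1, 3, 7, 9 are convolved only once.
print(coverage(3, 2, 1))
# [[1 2 1]
#  [2 4 2]
#  [1 2 1]]
```

The same count generalizes to any matrix size: the farther a cell is from the edge, the more windows cover it, which is exactly why high-weight features are placed at the center.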
Methodology
Traffic accident data with feature information must be considered comprehensively to predict the severity of a traffic accident. In accordance with Chang and Wang [27] and Kopelias et al. [43], the factors affecting the severity of traffic accidents mainly comprise the following five parent features: roadway, accident, vehicle, casualty, and environmental features. However, most of the existing literature has not effectively used or mined the combination relationships among these features. Based on these five parent features, this section presents the method for measuring the weights of traffic accident features; the algorithm, FM2GI, that converts the single-feature relationships of traffic accident data into gray images containing combination relationships in parallel; and the TASP-CNN architecture.
A. Measuring the Weight of Traffic Accident’s Features
In order to analyze the combination relationships of traffic accident features and their contributions to the accident, it is necessary to measure the weights of the child and parent features of a traffic accident. The principle of measuring these weights is based on the Gradient Boosting Decision Tree (GBDT) [44], [45]. For a single decision tree T with J leaf nodes, the squared importance of feature \ell is the sum of the squared improvements {\hat {l}^{2}_{t}} over the J-1 internal nodes t at which feature \ell is the splitting variable (v(t)=\ell):\begin{equation*} \Im _{\ell }^{2} \left ({T }\right)=\mathop \sum \limits _{t=1}^{J-1} {\hat {l}^{2}_{t}}I(v(t)={\ell })\tag{1}\end{equation*}
This importance measure generalizes readily to additive tree expansions; it is simply averaged over the trees, as in equation (2) [45].\begin{equation*} \Im _{\ell }^{2} =\frac {1}{M}\mathop \sum \limits _{m=1}^{M} \Im _{\ell }^{2} \left ({{T_{m}} }\right)\tag{2}\end{equation*}
For a K-class classification problem, K coupled tree expansions are induced, one for each class:\begin{equation*} f_{k} (x)= \sum \limits _{m=1}^{M} {T_{km} (x)}\tag{3}\end{equation*}
The squared importance of feature \ell for predicting class k is obtained by averaging over the trees of the kth expansion:\begin{equation*} \Im _{\ell k}^{2} =\frac {1}{M}\mathop \sum \limits _{m=1}^{M} \Im _{\ell }^{2} \left ({{T_{km}} }\right)\tag{4}\end{equation*}
and the overall importance of feature \ell is the average over all K classes:\begin{equation*} \Im _{\ell }^{2} =\frac {1}{K}\mathop \sum \limits _{k=1}^{K} \Im _{\ell k}^{2}\tag{5}\end{equation*}
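Tree libraries report the averaged measure of equations (1)–(5) as feature importances. A minimal sketch with scikit-learn on synthetic data (the data here are illustrative stand-ins, not the paper's accident set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: feature 0 carries most of the signal,
# feature 1 a little, features 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)

gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbdt.fit(X, y)

# feature_importances_ is the tree-averaged squared importance of
# equation (2), normalized so that the weights sum to 1.
print(gbdt.feature_importances_)
```

The returned vector plays the role of wc in the definitions below: one weight per child feature, summing to 1.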
B. FM2GI: Converting Traffic Accidents’ Feature Matrix to Gray Image
According to the feature weights and the characteristics of the CNN, this paper proposes an algorithm for converting the single-feature relationships of the data into gray images containing the combination relationships of the data's features, as explained below.
First, the data features and their correlation were formally defined.
Definition 1 (Feature Vector):
The feature vector is the expression form of a piece of data's features and is a 3-tuple, FV = {FP, FC, wc}, where:
FP represents all the parent features of a piece of data, FP = {FP_1,…, FP_m}, where m is the number of parent features; in this paper, m = 5.
FC represents all the child features of a piece of data, FC = {FC_{1,1},…, FC_{m,n}}, where n is the number of child features; in this paper, n = 12. For every FC_{i,j}, i ∈ [1, m], j ∈ [1, n], FC_{i,j} represents the jth child feature, and its parent feature is FP_i. The parent features satisfy ⋃_{i=1}^{m} FP_i = FC and FP_i ∩ FP_j = ∅ for all i ≠ j, and the number of child features included in the ith parent feature is recorded as NP_i = |FP_i|.
wc represents the weight vector of all child features of a piece of data, wc = (w_{1,1},…, w_{m,n}), where each w_{i,j}, i ∈ [1, m], j ∈ [1, n], is the weight of the jth child feature, which belongs to the parent feature FP_i.
According to Definition 1, the feature vector of a piece of data in this traffic accident data set can be formally described as FV = (FP, FC, wc), where:
FP = {Accident Features_1, Roadway Features_2, Environmental Features_3, Vehicle Features_4, Casualty Features_5};
FC = {Easting_{1,1}, Northing_{1,2}, 1st Road Class_{1,3}, Accident Time_{1,4}, Number of Vehicles_{1,5}, Road Surface_{2,6}, Lighting Conditions_{3,7}, Weather Conditions_{3,8}, Type of Vehicle_{4,9}, Casualty Class_{5,10}, Sex of Casualty_{5,11}, Age of Casualty_{5,12}};
wc = (0.165774538_{1,1}, 0.171530785_{1,2}, 0.082228259_{1,3}, 0.047771472_{1,4}, 0.060763375_{1,5}, 0.048847406_{2,6}, 0.041826936_{3,7}, 0.04354843_{3,8}, 0.126314657_{4,9}, 0.067057589_{5,10}, 0.049116389_{5,11}, 0.095220163_{5,12})
Definition 2 (Feature Matrix):
The feature matrix is the expression of all data features in the data sets. The feature matrix is a set of feature vectors whose expression is:\begin{equation*} \textit {FM}= \{\textit {FV}_{1},\ldots, \textit {FV}_{k}\},\quad \text {and}~\textit {FM}\in R^{k\times n}\end{equation*}
Algorithm 1 (FV2GI): Converting the Feature Vector of a Piece of Data Into a Gray Image
Feature vector of a piece of data: FV
Gray image form of a piece of data: grayImage
Steps:
Begin
for i = 1 to m
    max_dim = max (max_dim, NP_i, m);
end for
Initialize an all-zero max_dim × max_dim matrix, namely Mat;
for i = 1 to m
    for j = 1 to NP_i
        wp_i = wp_i + w_{i,j};
    end for
end for
Define wp = (wp_1,…, wp_m);
Sorted_FP = sort (FP, wp, ‘Descend’);
for i = 1 to m
    Mat = fill (Mat, Sorted_FP[i]);
    Sorted_FC = sort (FC of Sorted_FP[i], wc, ‘Descend’);
    for j = 1 to NP_i
        Mat = fill (Mat, Sorted_FC[j]);
    end for
end for
grayImage = reshape (Mat);
Return grayImage;
End
Algorithm 1 first classifies the features in the data set and determines the dimension of the matrix: max_dim is the maximum of the number of parent features m and the largest number of child features NP_i under any parent feature, so that all features fit into a square all-zero matrix.
After obtaining the all-zero matrix, the features of the data set are filled into it according to the following steps and rules. First, all the parent features are arranged in descending order, where the ordering standard is the weight of each parent feature (the weight of a parent feature is the sum of the weights of all the child features belonging to it). All parent features are then filled from the center of the all-zero matrix outward along the longitudinal axis, according to the rule that the maximum weight is in the center and the upper weight is greater than the lower weight. The child features under each parent feature are likewise sorted in descending order of weight and filled into that parent feature's row from the center outward under the same rule.
In order to clearly understand the process of Algorithm 1, we give an example, as shown in Figure 3. Suppose there are three parent features in the data set: parent feature1, parent feature2, and parent feature3. Parent feature1 contains five child features, child feature_{1,1}, child feature_{1,2}, child feature_{1,3}, child feature_{1,4}, and child feature_{1,5}, whose weights are 0.03, 0.06, 0.01, 0.11, and 0.15, respectively. Parent feature2 contains three child features, child feature_{2,6}, child feature_{2,7}, and child feature_{2,8}, whose weights are 0.1, 0.04, and 0.3, respectively. Parent feature3 contains four child features, child feature_{3,9}, child feature_{3,10}, child feature_{3,11}, and child feature_{3,12}, whose weights are 0.02, 0.03, 0.08, and 0.07, respectively.
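The center-out filling order used in this example can be sketched as follows; `center_out_order` and `fill_center_out` are hypothetical helper names (not the paper's code), and the parent weights are the sums of the child-feature weights given in the example above.

```python
def center_out_order(n):
    """Return slot indices from the center outward, preferring the
    upper/left side first (max weight in the center, upper > lower)."""
    center = (n - 1) // 2
    order = [center]
    step = 1
    while len(order) < n:
        if center - step >= 0:
            order.append(center - step)
        if center + step < n and len(order) < n:
            order.append(center + step)
        step += 1
    return order

def fill_center_out(weights):
    """Place feature names, sorted by weight descending, into slots
    following the center-out order."""
    names = sorted(weights, key=weights.get, reverse=True)
    slots = [None] * len(names)
    for name, pos in zip(names, center_out_order(len(names))):
        slots[pos] = name
    return slots

# Parent weights from the Figure 3 example (sums of child weights):
wp = {"parent1": 0.03 + 0.06 + 0.01 + 0.11 + 0.15,   # 0.36
      "parent2": 0.10 + 0.04 + 0.30,                  # 0.44
      "parent3": 0.02 + 0.03 + 0.08 + 0.07}           # 0.20
print(fill_center_out(wp))  # ['parent1', 'parent2', 'parent3']
```

The heaviest parent (parent2, 0.44) lands in the center row, the next heaviest (parent1, 0.36) above it, and the lightest (parent3, 0.20) below, matching the "upper weight greater than lower weight" rule; the same ordering is applied to the child features within each row.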
Algorithm 2 converts the feature matrix of the data set into gray images in parallel as follows:
Algorithm 2 (FM2GI): The Feature Matrix of the Data Sets is Converted into Gray Images in Parallel
Feature matrix of all the data: FM
Gray image linked list of all the data: grayImageList
Steps:
Begin
k = size (FM);
for i = 1 to k in parallel
    grayImage = FV2GI (FM[i]);
    grayImageList = grayImageList ∪ {grayImage};
end for
Return grayImageList;
End
Algorithm 2 (FM2GI) uses a parallel method to convert the feature matrix of the data set into gray images. The algorithm first obtains the size of the feature matrix FM and records it as k; each of the k feature vectors is then converted into a gray image by FV2GI in parallel, and the results are collected in grayImageList.
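A sketch of this parallel map (with a placeholder standing in for FV2GI, since the paper's implementation is not reproduced here):

```python
from concurrent.futures import ThreadPoolExecutor

def fv2gi(feature_vector):
    # Placeholder for Algorithm 1: in the real FM2GI this would fill the
    # all-zero matrix by feature weights and reshape it into a gray image.
    return [v * v for v in feature_vector]

def fm2gi(feature_matrix, workers=4):
    """Convert every feature vector of FM into its gray-image form in
    parallel and collect the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fv2gi, feature_matrix))

fm = [[1, 2], [3, 4], [5, 6]]
print(fm2gi(fm))  # [[1, 4], [9, 16], [25, 36]]
```

`pool.map` preserves the input order, so the ith gray image in the list always corresponds to the ith row of FM regardless of which worker finished first.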
Figure 4 shows how to convert the feature vector of traffic accidents’ data sets into gray images.
A schematic diagram of how to convert feature vector of traffic accidents’ data sets into gray images.
C. TASP-CNN Architecture
As shown in Figure 5, the structure of TASP-CNN includes four main parts: model input, convolution layer, full connection layer, and model output layer. Each section will be described in detail below.
First, the input of TASP-CNN is a gray image converted from the traffic accident data, which includes the 5 parent features and 12 child features of a traffic accident. Accordingly, the mathematical form of the model input is expressed as follows:\begin{align*} x_{i}=&\left [{ {{\begin{array}{*{20}c} {p_{11}} &\quad {p_{12}} &\quad \ldots &\quad {p_{1M}} \\ {p_{21}} &\quad {p_{22}} &\quad \ldots &\quad {p_{2M}} \\ \ldots &\quad \ldots &\quad \ldots &\quad \ldots \\ {p_{M1}} &\quad {p_{M2}} &\quad \ldots &\quad {p_{MM}} \\ \end{array}}} }\right]\!,\quad i\in [1,N], \\ M=&\max (PC,CC)\tag{6}\end{align*}
Then, the core layer of TASP-CNN is the convolution layer, whose purpose is to extract abstract features from the traffic accident data. To clearly describe the calculation process of the convolution layer, each pixel of a traffic accident image is first numbered, and the output of the convolution is \begin{equation*} a_{i,j} =f\left({\sum \limits _{c=1}^{C} {\sum \limits _{m,n=1}^{F} {w_{c,m,n} p_{c,i+m,j+n} +w_{b}}} }\right)\tag{7}\end{equation*}
The activation function used in this article is the rectified linear unit (ReLU) [47], calculated as follows:\begin{equation*} f(x)=\max (0,x)\tag{8}\end{equation*}
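Equations (7) and (8) can be implemented directly; a plain NumPy sketch (single input example, 'valid' convolution, one output channel, scalar bias):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # equation (8)

def conv2d(p, w, w_b):
    """Direct implementation of equation (7): p has shape (C, H, W),
    w has shape (C, F, F), w_b is a scalar bias; each output pixel is
    the ReLU of the weighted sum over a C x F x F window."""
    C, H, W = p.shape
    _, F, _ = w.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = relu(np.sum(w * p[:, i:i + F, j:j + F]) + w_b)
    return out

# A 1-channel 3x3 "gray image" convolved with a 2x2 kernel of ones:
img = np.arange(9, dtype=float).reshape(1, 3, 3)
kernel = np.ones((1, 2, 2))
print(conv2d(img, kernel, w_b=0.0))
# [[ 8. 12.]
#  [20. 24.]]
```

In the actual model this operation is performed by TensorFlow's convolution layers; the sketch only makes the index arithmetic of equation (7) concrete.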
Next, the full connection layer converts the final, highest-level features extracted and learned by the last convolution layer, flattening them into a one-dimensional vector according to the following equation:\begin{equation*} a^{flatten}=flatten([a_{1},a_{2},\ldots,a_{c}]),\quad c\in [1,C]\tag{9}\end{equation*}
Finally, the output of each full connection layer is taken as the input of the next, and the last layer feeds the output layer. The output layer uses the softmax activation function [48] to classify the severity of a traffic accident; the output of the model is the corresponding severity level: slight, serious, or fatal. The calculation equation for the full connection layer is as follows:\begin{equation*} \hat {y}=w_{f} a_{D}^{flatten} +b_{f}\tag{10}\end{equation*}
In addition, batch normalization [49] is used between the convolution layers, between the convolution layer and the full connection layer, and between successive full connection layers to accelerate model training and prevent over-fitting.
Experimental Results and Analysis
The TASP-CNN model proposed in this study was implemented in Python using Google's open-source deep-learning framework TensorFlow [50]. TensorFlow was used because of its availability, flexibility, and high efficiency; it can easily define and execute various deep-learning networks. The experimental environment was a GPU server with an Intel Xeon E5-2682 v4 (Broadwell) processor at 2.5 GHz clock speed, an NVIDIA P100 GPU with 12 GiB of display memory, and 9.3 TFLOPS single-precision and 4.7 TFLOPS double-precision floating-point computing capability. All the experiments in this paper were completed in this environment; the final result of each experiment is the average over ten repetitions, and the TASP-CNN model was trained for 100 epochs in each experiment.
A. Data Description
The traffic accident data for an 8-year period (2009–2016) from the Leeds City Council, United Kingdom, were used in this study. The total number of accident records obtained for this period was 21,436. The severity was categorized into three levels: slight, serious, and fatal. Twelve child features were collected from each accident record: Easting, Northing, 1st road class, Accident time, Number of vehicles, Road surface, Lighting conditions, Weather conditions, Type of vehicle, Casualty class, Sex of casualty, and Age of casualty. Each belongs to one of the five parent features, as shown in Table 1: Easting, Northing, 1st road class, Accident time, and Number of vehicles belong to the Accident feature; Road surface belongs to the Roadway feature; Lighting conditions and Weather conditions belong to the Environmental feature; Type of vehicle belongs to the Vehicle feature; and Casualty class, Sex of casualty, and Age of casualty belong to the Casualty feature.
Based on the above, the weights were measured through the interface provided by XGBoost, an improved algorithm based on the GBDT principle that is efficient and can be parallelized [51]. The weight distributions obtained by 1,000 iterations over the 12 child features of the traffic accident data are shown in Figure 6 and Table 2 below.
According to Table 2, from the perspective of parent features, the weight of the Accident feature is the highest among the five parent features, which indicates that the Accident feature has the greatest impact on a traffic accident. From the perspective of child features, the weights of Northing, Easting, and Type of vehicle are the highest among the twelve child features, which indicates that there are accident black spots where traffic accidents occur frequently and that the probability of a traffic accident differs across vehicle types.
B. Data Preprocessing and Evaluation Metrics
Before applying the traffic accident data as input to TASP-CNN, the data needed to be preprocessed. The preliminary processing included deleting incomplete, erroneous, and repeated records, normalizing the data, and handling class imbalance. There were 18,727 records in the entire data set available for training after incomplete, erroneous, and repeated data were deleted.
Since the dimensions of the 12 child features differ, it was necessary to normalize the data under each feature, removing the unit limitation and converting the values into dimensionless pure numbers so that features with different units could be compared. In addition, normalization of the data can improve the convergence speed and accuracy of the model [49], [52]. The standard z-score normalization, also called zero-mean normalization, was used; after normalization, the data under each feature follow a standard normal distribution with mean 0 and standard deviation 1. The transformation function is:\begin{equation*} x^{\ast }=\frac {x-u}{\sigma }\tag{11}\end{equation*}
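Applied column-wise, equation (11) looks as follows (toy values, not the accident data):

```python
import numpy as np

def zscore(x):
    """z-score normalization of equation (11), applied per column so that
    every child feature becomes dimensionless with mean 0 and std 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 600.0]])
normalized = zscore(data)
print(normalized.mean(axis=0))  # ~[0, 0]
print(normalized.std(axis=0))   # ~[1, 1]
```

Note that the column means and standard deviations must be computed on the training set only and reused for the testing set, so that no testing information leaks into preprocessing.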
If the traffic accident data are not balanced, the training of the model will focus on the categories that account for a large proportion of the data while ignoring those that account for a small proportion. This eventually results in a model that over-fits the majority categories and under-fits the minority categories [53], [54]. Generally speaking, there are two sampling approaches to imbalanced data: under-sampling and over-sampling. Under-sampling discards part of the data, so the data set cannot be fully utilized; moreover, the accident records in the fatal and serious categories are much scarcer than those in the slight category. To make full use of the data, the improved Borderline-SMOTE2 algorithm [55], based on the Synthetic Minority Oversampling Technique (SMOTE) [56], was used to solve the imbalance problem; it has proved effective for processing imbalanced data [57]–[60].
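The synthesis step shared by the SMOTE family (Borderline-SMOTE2 included) can be sketched as follows. This shows only the core interpolation rule, not the borderline-sample selection of [55]; in practice an implementation such as imbalanced-learn's `BorderlineSMOTE` with `kind='borderline-2'` would be used, and the feature values below are hypothetical.

```python
import numpy as np

def interpolate(sample, neighbor, rng):
    """SMOTE-style synthesis: a new minority sample drawn uniformly on
    the line segment between a minority sample and one of its neighbors."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(42)
fatal_a = np.array([1.0, 5.0])   # two hypothetical normalized features
fatal_b = np.array([3.0, 9.0])
synthetic = interpolate(fatal_a, fatal_b, rng)
print(synthetic)  # lies componentwise between fatal_a and fatal_b
```

Because the synthetic sample sits between two real minority samples, oversampling enriches the fatal and serious classes without simply duplicating records.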
To prevent the testing set from being affected by synthetic data generated by Borderline-SMOTE2, the testing set was isolated from the training set during preprocessing, and only the training set was oversampled. First, we randomly extracted 20% of the data as the testing set, and then oversampled the remaining 80% as the training set. The data distribution of the ten experiments is shown in Table 3.
Since the testing set contains only real data and is therefore imbalanced, total accuracy is not an appropriate measure. The Micro_F1score is used because some classes are much larger (have more instances) than others in the traffic accident data set [61]; moreover, to reflect the practical application of the model, precision and recall are introduced to analyze the testing set [62]. They are calculated as follows:\begin{align*} Micro\_{}Precision=&\frac {\sum \limits _{i=1}^{n} {TP_{i}}}{\sum \limits _{i=1}^{n} {TP_{i}} +\sum \limits _{i=1}^{n} {FP_{i}}} \tag{12}\\ Micro\_{}Recall=&\frac {\sum \limits _{i=1}^{n} {TP_{i}}}{\sum \limits _{i=1}^{n} {TP_{i}} +\sum \limits _{i=1}^{n} {FN_{i}}} \tag{13}\\ Micro\_{}F1score=&\frac {2\times Micro\_{}Precision\times Micro\_{}Recall}{Micro\_{}Precision+Micro\_{}Recall} \\\tag{14}\end{align*}
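Equations (12)–(14) translate directly into code; the per-class counts below are hypothetical, chosen only to show the computation:

```python
def micro_f1(tp, fp, fn):
    """Micro-averaged F1 of equations (12)-(14), computed from per-class
    true-positive, false-positive, and false-negative counts."""
    precision = sum(tp) / (sum(tp) + sum(fp))   # equation (12)
    recall = sum(tp) / (sum(tp) + sum(fn))      # equation (13)
    return 2 * precision * recall / (precision + recall)  # equation (14)

# Hypothetical counts for the classes slight / serious / fatal:
tp = [300, 40, 5]
fp = [20, 15, 2]
fn = [10, 25, 2]
print(round(micro_f1(tp, fp, fn), 4))  # 0.9031
```

Because micro-averaging sums the counts before dividing, the larger classes dominate the score, which is exactly why it suits a test set where slight accidents greatly outnumber fatal ones.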
C. Determination of the TASP-CNN Network
Through the interface provided by scikit-learn [63] and a combination of the GridSearchCV and RandomizedSearchCV algorithms, the hyperparameter space of TASP-CNN was searched for 100 epochs, and the optimal hyperparameter combination was determined. Using only GridSearchCV incurs a high computational cost, while using only RandomizedSearchCV may find merely a locally optimal combination of hyperparameters. To make better use of both, RandomizedSearchCV was first used to search globally for a promising combination of hyperparameters, and GridSearchCV was then used to refine the search locally. Thus, the computational cost was reduced, and the search was less likely to settle on a locally optimal combination. By building models with various hyperparameter combinations and evaluating each with 5-fold cross-validation, the combination with the highest Micro_F1score was obtained. Table 4 shows the hyperparameter combination used in TASP-CNN after this hybrid search.
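The coarse-to-fine idea can be sketched with scikit-learn (shown here on a toy classifier rather than TASP-CNN; the parameter grid and data set are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stage 1: a randomized search over a wide range locates a promising value.
coarse = RandomizedSearchCV(
    model, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=4, cv=5, random_state=0)
coarse.fit(X, y)
best_c = coarse.best_params_["C"]

# Stage 2: a grid search refines around the value found in stage 1,
# again scored by 5-fold cross-validation.
fine = GridSearchCV(model, {"C": [best_c / 2, best_c, best_c * 2]}, cv=5)
fine.fit(X, y)
print(fine.best_params_)
```

The same two-stage pattern applies to network hyperparameters (filter counts, learning rate, and so on); only the estimator and the parameter dictionary change.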
In general, in a deep-learning model, multiple modules and multiple layers can be stacked together. Therefore, analyzing the depth of the network is very important for understanding its behavior. Generally speaking, the depth of a CNN should be neither too large nor too small [35], so that the CNN can learn sufficiently complicated relationships while still converging. Different depth values, from small to large, were assigned to the TASP-CNN model for testing. Table 5 lists the network structures of TASP-CNN at different depths, together with the Micro_F1score obtained on the testing set for each structure. When the depth of TASP-CNN was 4, the Micro_F1score of the testing set was 0.86. The Micro_F1score reached its highest value, 0.87, when the depth was 5. When the depth of the TASP-CNN model exceeded 5, the Micro_F1score of the testing set gradually decreased. The best Micro_F1score was achieved with two convolution layers of 256 filters each, one flatten layer, one fully connected layer with 128 hidden units, and one softmax fully connected layer with 3 hidden units, for which the Micro_F1score of the testing set reached 0.87. Therefore, the TASP-CNN model with a depth of 5 was used in this experiment.
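The depth-5 architecture selected above can be sketched in Keras as follows. The layer widths (two convolution layers with 256 filters, a 128-unit fully connected layer, and a 3-unit softmax output) follow the text; the input image size (5×5×1) and the 3×3 kernel are assumptions, since the exact dimensions of the FM2GI gray images are not restated in this section:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_tasp_cnn(input_shape=(5, 5, 1), n_classes=3):
    """Depth-5 sketch: Conv-Conv-Flatten-Dense-Softmax, per Table 5."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),  # 3 severity classes
    ])

model = build_tasp_cnn()
out = model.predict(np.zeros((2, 5, 5, 1)), verbose=0)
```

The softmax output gives a probability over the three severity levels (slight, serious, fatal) for each input gray image.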
D. Experiment Results
To illustrate the effectiveness of the TASP-CNN model, this experiment compared the model with six statistical models and three machine learning models.
The six statistical models were: the K-nearest neighbor algorithm (K-NN) [64], DT [65], the naive Bayes classifier (NBC) [66], logistic regression (LR) [67], gradient boosting (GB) [68], and support vector machines (SVMs, also known as support vector networks) [69]. Correspondingly, the three machine learning models were: neural networks (NNs), also known as connectionist systems [70], the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [71]–[73], and 1D convolution (Conv1D, a one-dimensional form of convolutional neural network).
Moreover, the above six statistical models were implemented using the interface provided by scikit-learn [63], and their parameters were optimized. The NN model was tuned to have up to five hidden layers with 128 hidden units in each layer and one softmax fully connected layer. The activation function was ReLU, and the optimizer was SGD (Stochastic Gradient Descent). In addition, the kernel initializer was uniform. The LSTM-RNN was optimized to contain one LSTM layer with 128 hidden units and three hidden layers with 64, 128, and 256 hidden units, respectively, the last layer being a softmax fully connected layer. The optimizer was SGD with learning rate = 0.001, decay = 0.9, and momentum = 0.8. The Conv1D model was configured with three hidden layers of 256 hidden units in each 1D convolution layer, followed by one softmax fully connected layer. Finally, its activation function was ReLU, and its optimizer was Adam [74].
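As an illustration, the tuned NN baseline above maps closely onto scikit-learn's MLPClassifier: five hidden layers of 128 units, ReLU activation, and SGD optimization. This is a rough equivalent rather than the paper's exact implementation (the uniform kernel initializer is framework-specific and has no direct MLPClassifier counterpart), and the data set here is a synthetic placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic three-class stand-in for the traffic accident data set.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

# Five hidden layers of 128 units, ReLU activation, SGD solver,
# approximating the tuned NN baseline described in the text.
nn = MLPClassifier(hidden_layer_sizes=(128,) * 5, activation="relu",
                   solver="sgd", max_iter=300, random_state=0)
nn.fit(X, y)
acc = nn.score(X, y)  # training-set accuracy of the fitted baseline
```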
Table 6 and Figure 7 show the average Micro_F1score and the Micro_F1score of all ten experiments after applying the six statistical models, the three machine learning models, and TASP-CNN to the traffic accident data sets. The results show that the TASP-CNN model proposed in this study is superior to the other statistical and machine learning models in the Micro_F1score of the testing set, which is evidence that TASP-CNN can generalize to new traffic accident data sets. One possible reason is that the statistical models treat the features of the traffic accident data sets as having no local correlation, ignoring the combination relationships among those features. Similarly, the machine learning models cannot, by their structure, analyze the combination relationships among the features of the traffic accident data sets, even though these features have strong combination and correlation relationships. In contrast, the TASP-CNN model proposed in this study is locally perceptive and fully extracts the combination relationships among the features of the traffic accident data sets, as explained above.
The essential purpose of predicting the severity of traffic accidents is to provide timely medical assistance to the personnel involved in an accident, reduce casualties, inform the corresponding emergency decision-making departments in time, and avoid greater property losses. Therefore, the predicted severity of traffic accidents was further divided into three levels for analysis: slight traffic accidents, serious traffic accidents, and fatal traffic accidents.
Table 7 and Figures 8–10 show the average precision and recall, as well as the precision and recall of all ten experiments, for the different models under slight, serious, and fatal traffic accidents.
Figure 8. Precision and recall line chart predicted using different models under slight traffic accidents.
Figure 9. Precision and recall line chart predicted using different models under serious traffic accidents.
Figure 10. Precision and recall line chart predicted using different models under fatal traffic accidents.
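The per-class precision and recall reported in Table 7 can be obtained with scikit-learn by passing `average=None`, which returns one value per severity level. The labels below are illustrative placeholders (0 = slight, 1 = serious, 2 = fatal), not results from the paper:

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels only: 0 = slight, 1 = serious, 2 = fatal.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 0, 1]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0, 0, 1]

# One precision/recall value per class, in label order [0, 1, 2].
per_class_p = precision_score(y_true, y_pred, average=None)
per_class_r = recall_score(y_true, y_pred, average=None)
```

This per-class view is what allows the separate slight/serious/fatal analysis below, in contrast to the single micro-averaged score used for overall model comparison.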
From Table 7 and Figures 8–10, the results on the slight traffic accident testing set show that the NBC model achieves the highest precision of all models, while TASP-CNN achieves the highest recall. The results on the serious traffic accident testing set show that TASP-CNN and GB achieve relatively high precision compared with the other eight models. The results on the fatal traffic accident testing set show that TASP-CNN achieves the highest precision among all models. Considering the actual application situation, a certain amount of error can be tolerated in the precision of predictions for slight traffic accidents, because such accidents have a high probability of causing neither significant casualties nor major property losses. However, for serious and fatal traffic accidents, especially fatal ones, the precision requirement must be relatively high: even a slightly inaccurate prediction may mean that the corresponding emergency medical support and emergency department decisions are not provided, resulting in significant casualties and property losses. Therefore, when analyzed from the perspective of these specific situations, the TASP-CNN model performs better than the other models.
To sum up, the performance of the TASP-CNN model proposed in this study was better than that of the other nine models, both in terms of the Micro_F1score of model prediction and when the specific application scenarios of traffic accidents with different severity levels were taken into account.
Conclusion and Future Work
In this paper, a deep-learning approach with a TASP-CNN model was proposed for traffic accident severity prediction. Unlike previous methods that consider only the shallow structure of traffic accident data, the proposed method successfully found latent feature representations of traffic accident severity, such as feature combinations and deeper feature correlations, from traffic accident data. The performance of the proposed TASP-CNN model was evaluated using traffic accident data for an eight-year period (2009–2016) from the Leeds City Council, and it was compared with the NBC, KNN, LR, DT, GB, SVC, Conv1D, NN, and LSTM-RNN models. The results show that the proposed TASP-CNN model outperformed the competing models.
For future work, it would be worthwhile to investigate other machine learning algorithms for traffic accident severity prediction and apply them to different public traffic data sets to examine their effectiveness. Furthermore, the TASP-CNN model in this study is novel, especially the FM2GI algorithm for converting traffic accident data into gray images. Therefore, the generalization ability of the proposed TASP-CNN model needs to be analyzed. More specifically, it remains to be determined whether the TASP-CNN model can be applied to other areas, and whether it is more accurate there than the corresponding competing models.