Introduction
Autonomous driving has recently attracted considerable attention from researchers. The key aim of Intelligent Transport Systems (ITS) is to avoid accidents while accurately guiding the vehicle along the road, observing traffic safety rules and avoiding obstacles in the way [1], [2]. Advancements in the field of Autonomous Vehicles (AV) have put forth enormous challenges for the automotive industry [3]. These challenges include real-time processing along with accurate and reliable segmentation. The availability of safe, secure and reliable AVs will benefit humanity, enabling visually impaired and handicapped people to commute more easily and efficiently [4].
To reduce the risk of road accidents, it is necessary to accurately distinguish the road region from the other regions. This helps autonomous vehicles navigate correctly and understand the surrounding environment, including traffic signs [5], [6] and signals [7], [8], pedestrians [9], road lanes [10] and other vehicles on the road [11]. Mersky et al. [12] concluded that reducing the prediction time of the detection/segmentation algorithms used in AVs can improve fuel economy, since more aspects of driving (i.e., acceleration, speed, etc.) are then based on decisions made by the AV instead of the human driver.
Typically, object detection and tracking techniques use a bounding box to locate specific objects in an image [11], [13], [14]. Road detection differs from other object detection tasks because the road region is usually occluded by vehicles, cyclists and pedestrians. For road detection, semantic segmentation of the perspective view can differentiate the road region from the other regions.
Researchers have proposed different techniques to address road detection and segmentation [15]–[22]. Some of these are based on Digital Image Processing (DIP) [17], [19], [22], whereas others use Machine Learning (ML) techniques [23], [24].
Initially, researchers [17], [19], [22] used DIP techniques to detect road lines in order to assist drivers. DIP techniques can only detect roads based on structural and color information. Similarly, ML techniques have also been used in object detection [25], [26], classification [27], [28], and segmentation [29]–[31].
DIP and ML techniques operate on predefined criteria and features (such as color, edges, corner points, and structures) in images. Hence, such techniques do not perform well on noisy images or images captured under different lighting conditions. They also require pre-processing of images and manual fine-tuning, which varies across image types and circumstances [32].
Recently, Deep Learning (DL) has demonstrated excellent results in artificial intelligence and computer vision applications [33]–[37]. Many DL based applications have been proposed, including image classification [32], [38], [39], image segmentation [40], [41], change detection in images [42], [43] and object tracking [44], [45]. While Deep Convolutional Neural Networks (DCNN) have been useful in the aforementioned DL applications, they also prove helpful in semantic segmentation [29], [46].
Semantic segmentation models classify each pixel of an image into one of a set of predefined classes. Recently, several semantic segmentation solutions based on DCNN have been proposed [18], [40], [47], [48]. The impressive classification results achieved by AlexNet [49] and GoogleNet [37] led to an increase in DCNN based computer vision applications, including segmentation.
In a DCNN, each convolutional layer extracts features and learns abstract information from the input images. The deeper the network, the more generalized the information it tends to learn. Most state-of-the-art DL architectures, like UNet [33], ResNet [34], SegNet [35], VGGNet [36], and GoogleNet [37], deepen the network to achieve improved model accuracy and segmentation performance. However, deeper networks are data-hungry and require much more data to perform well. When the available data is insufficient, making the network deeper can cause over-fitting, and the objective of learning generalized features will not be achieved.
Similarly, the deeper the network, the longer the time required to obtain results. However, in computer-aided real-time applications like driver assistance, processing time is of key importance. Thus, there is a pressing need for shallow, generalized networks that require less processing time to segment and classify images. Keeping the above points in view, it would be useful if the required abstract information were pre-calculated and passed along with the original input. This results in a shallower yet accurate network that requires less data for training.
The recent state-of-the-art DL techniques use DCNN [33], [35] to achieve better results. However, deeper networks suffer from the vanishing gradient problem, which has been handled with residual networks in [34]. Residual connections, though, make the network complex, which increases the time required for training. Such networks cannot be used in real-time computer-vision applications.
The proposed Multi-feature View-based Shallow Convolutional Neural Network (MVS-CNN) utilizes gradient information as additional features along with the input image. The gradient of an image captures the change in intensity or color, which is useful for extracting essential information from images (e.g., edges and texture). By examining the outputs of the convolution layers of a DCNN, shown in Fig. 1, it can be observed that the outputs resemble gradient images. Since convolutional layers in a DCNN learn such features, supplying gradient features along with the input image helps the DL model enhance its learning process, and can also reduce the processing time of the DL architecture.
Fig. 1. The outputs generated by each filter of the first convolutional layer of a CNN with eight filters.
The novelty of our proposed technique is that gradient information is used to enhance the learning process. The network has a minimal number of convolutional layers, hence it is neither very deep nor complex and requires less processing time. The proposed technique is a sequential model without complexities such as shortcut connections or encoder-decoder structures, which makes it fast and efficient in comparison to other recently proposed DL architectures. The key contributions of this study are:
A novel MVS-CNN model is developed for the road segmentation task, and its performance is measured against recent state-of-the-art segmentation models on publicly available databases.
The multi-feature view based on gradient information of the images is used to enhance the network learning process.
The proposed segmentation architecture consists of only an encoder network with seven (7) convolution layers, and as a result is less complex than state-of-the-art encoder-decoder based segmentation networks.
The proposed scheme is evaluated against various state-of-the-art DL architectures based on training accuracy, processing time, and segmentation performance.
Literature Review
Recently, researchers have proposed different techniques for road detection and segmentation. Initially, researchers proposed image processing techniques to detect road lanes and boundaries [17], [19], [22].
Aly et al. [17] used a Hough transform based method to recognize lane marks and classify the road region in an image. Since the Hough transform extracts shapes, it can produce false detections: structures with boundaries similar to lanes, such as railings, signboards or road surface cracks, can be mis-recognized as lane markings. To identify lanes more accurately, Wang et al. [22] proposed a novel B-Snake based lane model capable of detecting different lane structures, such as straight, curved or parabolic lanes. The authors reduce the detection of the side lanes to the problem of detecting the mid-line of the road.
Kim [19] introduced a lane detection algorithm based on Random Sample Consensus (RANSAC). The algorithm is robust and able to perform in real-time. First, candidate lane lines are computed, which are then grouped by a hypothesize-and-verify method and separated into right and left lanes. Since image processing techniques are designed for specific scenarios, they cannot cope with complex and unpredictable situations. This drawback prompted researchers to explore techniques that could learn and adapt to diverse scenarios.
Unlike image processing techniques, Machine Learning (ML) techniques can extract and learn features for classification. ML techniques have been used in prior AV models [23], [24] and can perform detection and segmentation tasks in a wide range of scenarios. A method to detect road lane boundaries, named Vector Fuzzy Connectedness (VFC), has been recommended by Fang and Wang [23]. Lane curves are computed from a skeleton image, and the control points for the right and left lanes are then estimated from the curves using VFC.
Zhu et al. [24] studied and evaluated Extreme Learning Machines (ELM) for road and vehicle detection. For road detection, a color histogram is used, whereas gray-level features and Histograms of Oriented Gradients (HOG) are used to detect vehicles. Their network is shown to perform better than Support Vector Machines (SVM) and Back Propagation Networks (BPN). However, the road segmentation results of the ELM network are inaccurate because of the rectangular pattern of segmentation. It is also worth noting that, since the feature extraction and classification processes are independent of each other, such classifiers do not provide adequate results.
DL architectures extract and learn features from input images, which results in better segmentation and classification. However, DL requires a large dataset of images for training. Keeping this in view, AlHaija et al. [15] proposed an augmented reality method to augment road scene images and obtain a large dataset for training deep networks. Augmented reality is used to create additional images with more traffic information, which are then used to train the state-of-the-art Multi-task Network Cascade (MNC) [50]. The addition of augmented images improves segmentation performance compared to using only real or synthetic images.
Liu and Deng [20] proposed a fully convolutional deep residual network with pyramid pooling. Since light exposure in images also affects the training and prediction of deep networks, the authors suggest exposure-based augmentation: underexposure compensation is used to augment the training images and obtain a larger training dataset. However, such a technique requires extensive image processing knowledge to collect images with similar illumination features and generate realistic scenes.
The impact of using a higher number of convolutional layers in a DCNN has been studied in [34]. The authors suggest that using residual layers in deep networks yields higher accuracy than conventional networks. Using the ImageNet [51] dataset, they compare 34-layer plain and 34-layer residual networks and conclude that deeper plain networks have higher training errors due to the vanishing gradient problem, whereas residual connections overcome this degradation and allow accuracy gains from deeper networks. Their proposed network reduces the training error by 3.5% compared to conventional DL networks.
Long et al. [52] discussed Fully Convolutional Neural Networks (FCNN) for pixel-to-pixel semantic segmentation. The authors present a network that replaces the dense layer with an up-sampling layer to produce a pixel-to-pixel segmented output. Converting the network to a fully convolutional one lowers the inference time, which is very important in real-time applications. A network-in-network CNN model has been proposed by Mendes et al. [21], which is converted into an FCNN to obtain fast road detection after training. The proposed technique achieves comparably accurate results while maintaining a fast inference time; due to the minimized inference time, the method could be used effectively in real-time road detection. However, it fails to correctly classify road regions under varied lighting conditions.
Romera et al. [53] presented a real-time solution for road segmentation using a CNN with residual connections. Residual connections overcome the degradation problem faced by most architectures with a large number of layers; in this way, their architecture allows accurate classification while remaining efficiently fast. Further contributing to real-time road segmentation, Romera et al. [54] presented a method to redesign the residual layer, aiming to achieve speed efficiency while retaining the accuracy of recently proposed techniques. Although deep and complex architectures provide more competitive results than DIP and ML techniques, they require high computational power and large amounts of training data.
A self-ensembling attention network (SEANet) was proposed by Xu et al. [55]. The network consists of a student model that acts as the base network and a teacher model acting as the ensembling network. The student model learns from the output of the teacher network; as training progresses, the student model becomes more accurate, and the predictions of the teacher model also get closer to the correct labels in the target domain. Since the student network learns from the teacher network, the performance of the overall framework mainly depends on the teacher network. Since multi-scale features are an important part of DL, a scale-aware model is proposed in [56]: the same image at different scales is used by the network for feature extraction, and the extracted features are then merged and used to classify every pixel in the image.
DIP based techniques [17], [19], [22] are sensitive to variations such as changes in lighting conditions, variations in structures, and color differences; they work well only under specified conditions. The methods proposed in [15], [20] require high domain expertise, namely image processing knowledge to create augmented images that look similar to the original ones. Similarly, DL techniques require high computing power and large datasets. To reduce the complexity and tackle the huge data requirement, a shallow CNN that uses multi-feature views is proposed. The proposed shallow CNN provides higher segmentation accuracy while minimizing the complexity of the network.
Proposed Methodology
A high-level diagram of the proposed DL architecture for road segmentation is shown in Fig. 2. In this study, segmentation is performed using the MVS-CNN architecture, which comprises a feature extraction module (E), a feature selection module (S), and a prediction module (P), as can be seen in Fig. 2.
Fig. 2. Example of DL modules with input image and ground-truth labelled image.
In road images, the road boundaries differentiate the road region from the background regions. The gradient of an image highlights these boundaries. Feature extraction in the segmentation architecture can therefore be improved by adding such information, enhancing both the efficiency and efficacy of the architecture. Such views help in training the architecture without going deeper.
The network takes training data as input in the form of multi-feature views derived from the road images. In addition to the RGB channels, gradient views are computed. The horizontal and vertical gradient views, $G_{x}$ and $G_{y}$, are obtained by summing the squared directional differences $\Delta_{x}$ and $\Delta_{y}$ over a local window of half-size $s$ centered at pixel $(x_{i}, y_{i})$:
\begin{align*} G_{x} &= \sum_{u=-s}^{s} \sum_{v=-s}^{s} \Delta_{x}\left({x_{i} + u,\; y_{i} + v}\right)^{2} \tag{1}\\ G_{y} &= \sum_{u=-s}^{s} \sum_{v=-s}^{s} \Delta_{y}\left({x_{i} + u,\; y_{i} + v}\right)^{2}\tag{2}\end{align*}
The gradient magnitude view $G_{Mag}$ is then computed from the directional gradient views as:
\begin{equation*} G_{Mag} = \sqrt{G_{x} + G_{y}}\tag{3}\end{equation*}
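As a minimal sketch of how these views can be computed, the following Python function evaluates Eqs. (1)–(3) with NumPy and SciPy. The text does not specify the derivative operator, so central differences (np.gradient) are assumed here; the function name gradient_views and the default window half-size s = 1 are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

def gradient_views(gray: np.ndarray, s: int = 1):
    """Compute the gradient views of Eqs. (1)-(3).

    gray : 2-D float array (grayscale image).
    s    : half-size of the local summation window.
    """
    # Per-pixel vertical and horizontal derivatives (Delta_y, Delta_x),
    # approximated with central differences.
    dy, dx = np.gradient(gray)

    # Sum the squared derivatives over a (2s+1) x (2s+1) window.
    # uniform_filter computes the window mean, so multiply by the
    # window area to obtain the window sum.
    win = 2 * s + 1
    gx = ndimage.uniform_filter(dx ** 2, size=win) * win ** 2   # Eq. (1)
    gy = ndimage.uniform_filter(dy ** 2, size=win) * win ** 2   # Eq. (2)

    # Gradient magnitude view, Eq. (3).
    g_mag = np.sqrt(gx + gy)
    return gx, gy, g_mag
```

The three views can then be stacked with the RGB channels, e.g. np.dstack([rgb, gx, gy, g_mag]), to form the six-channel input described later in the paper.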
Pictorial representations of the views are shown in Fig. 3. Fig. 3(a) shows a sample image from the KITTI dataset, whereas Figs. 3(b), 3(c) and 3(d) show the horizontal gradient, the vertical gradient, and the gradient magnitude of the sample image, respectively. The purpose of road segmentation is to learn a representation function $R$ that predicts the labels $Y$ from the input $X$, given weights $W$ and biases $b$:
\begin{equation*} \hat{y} = R\left({Y|X;W, b}\right)\tag{4}\end{equation*}
Since road segmentation is a binary (road versus background) task, the prediction is passed through a sigmoid activation:
\begin{equation*} \hat{y} = \text{Sigmoid}(R\left({Y|X;W, b}\right))\tag{5}\end{equation*}
Fig. 3. Different views created from a sample road image.
The function $R$ is realized by the proposed MVS-CNN:
\begin{equation*} R = MVS_{CNN}(X)\tag{6}\end{equation*}
which is the composition of $M$ feature extraction modules $E$ followed by $N$ feature selection modules $S$:
\begin{equation*} MVS_{CNN}(X)=S^{(N)}\ldots (S^{(1)}(E^{(M)}\ldots (E^{(2)}(E^{(1)}(X)))))\tag{7}\end{equation*}
Each feature extraction module $E^{(l)}$ applies a convolution with weights $W^{(l)}$ and biases $b^{(l)}$, followed by an activation function $f_{act}$, a pooling operation $P$, and a normalization $f_{norm}$:
\begin{align*} E^{(l)} &= g^{(l)} (E^{(l-1)}; W^{(l)},b^{(l)}) \\ &= f_{norm} (P(f_{act}(W^{(l)}*E^{(l-1)}+b^{(l)})))\tag{8}\end{align*}
Each feature selection module $S^{(l)}$ is a fully connected layer followed by an activation function:
\begin{align*} S^{(l)} &= h^{(l)} (S^{(l-1)}; W^{(l)},b^{(l)}) \\ &= f_{act} (W^{(l)}\cdot S^{(l-1)}+b^{(l)})\tag{9}\end{align*}
Fig. 4. Proposed MVS-CNN architecture with its input views.
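To make the composition of Eqs. (6)–(9) concrete, the following is a minimal TensorFlow/Keras sketch of such an architecture, assuming the six-channel multi-feature input described above. The filter counts, input resolution, pooling placement, and dense-layer sizes are illustrative assumptions for readability only; they are not the configuration of Table 2 and will not match the parameter count reported later in Table 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mvs_cnn(input_shape=(96, 320, 6), mask_pixels=12 * 40):
    """Sketch of MVS-CNN: seven convolutional feature-extraction
    modules (Eq. 8), dense feature-selection modules (Eq. 9), and
    a sigmoid prediction (Eq. 5). Sizes are assumptions."""
    x_in = layers.Input(shape=input_shape)           # RGB + Gx + Gy + GMag
    x = x_in
    filters = [32, 32, 64, 64, 128, 128, 256]        # assumed filter counts
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, 3, padding="same")(x)   # W * E + b
        x = layers.ReLU()(x)                         # f_act
        if i % 2 == 1:                               # pool every two convs
            x = layers.MaxPooling2D(2)(x)            # P
        x = layers.BatchNormalization()(x)           # f_norm
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)                 # S^(1)
    y = layers.Dense(mask_pixels, activation="sigmoid")(x)      # prediction
    return models.Model(x_in, y)
```

In this sketch the sigmoid layer emits one probability per pixel of a downsampled road mask, consistent with the binary prediction of Eq. (5); the exact output resolution used by the authors is not stated in the text.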
Experimentation and Results
The proposed MVS-CNN road segmentation network is designed to make computer-aided road segmentation systems less complex. The primary experimental questions are: (i) whether the inclusion of the multi-feature views improves the network performance without significant degradation of segmentation accuracy, and (ii) whether the proposed network is reliable and efficient at the road segmentation task and achieves improved performance compared to recent state-of-the-art segmentation models.
A. Dataset
To evaluate the performance of the DL architectures in this study, two well-known datasets, the KITTI Vision Benchmark Suite [57] and the Cityscapes dataset [61], are utilized for experimentation. They are discussed below:
1) The KITTI Dataset
The KITTI dataset [57] consists of 289 input images along with lane and road label images. The dataset comprises different variations of roads, including marked, multiple-marked, and unmarked roads.
2) Cityscapes Dataset
The Cityscapes dataset [61] comprises 5000 labeled images, collected from 50 cities during different months and seasons. There are 2975 training, 1525 testing and 500 validation images in the dataset, annotated with 19 categories. Since this study focuses on the segmentation of roads, only road labels are selected and all other labels are treated as background regions.
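For illustration, this label conversion can be sketched as follows, assuming the standard Cityscapes label id of 7 for the 'road' class (per the official cityscapesScripts label definitions); the function name is hypothetical:

```python
import numpy as np

CITYSCAPES_ROAD_ID = 7  # 'road' id in the official label definitions

def to_binary_road_mask(label_img: np.ndarray) -> np.ndarray:
    """Map a Cityscapes label image to a binary mask:
    1 for road pixels, 0 for every other (background) class."""
    return (label_img == CITYSCAPES_ROAD_ID).astype(np.uint8)
```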
B. Proposed Network Architecture
The proposed network architecture is derived by exhaustively testing combinations of the following four network hyper-parameters to find the best-performing segmentation architecture:
The number of convolution layers is varied between 5 and 9, with max-pooling performed after every layer or after stacking two or three convolution layers.
For training, different optimization algorithms such as SGD [60] and RMSProp [62] are tested.
ReLU [59] and Leaky-ReLU [63] activation functions are evaluated after every convolution layer.
For regularization, both Batch-Normalization (BN) [64] and Dropout [65] have been evaluated.
Based on this extensive testing, the network consisting of seven (7) convolution layers performed better than the other configurations. Amongst the tested optimizers, RMSProp exhibited faster convergence and better accuracy than SGD over 50 epochs. The tested ReLU and Leaky-ReLU activation functions achieved comparable accuracy, but ReLU was more efficient. Furthermore, BN after each convolution layer performed better, increasing the speed, performance, and stability of the proposed architecture. The MVS-CNN configuration yielding the best results is given in Table 2. The experiments in this study are performed using the TensorFlow [66] library on the Google Colaboratory platform [67].
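Given these choices, the training setup can be sketched as follows; the learning rate, batch size, loss function, and the train_views/train_masks arrays are illustrative placeholders, not values reported in the paper:

```python
# Training configuration following the hyper-parameter search:
# RMSProp optimizer, ReLU activations, BN, trained for 50 epochs.
model = build_mvs_cnn()
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),  # assumed lr
    loss="binary_crossentropy",   # binary road-vs-background masks
    metrics=["accuracy"],
)
model.fit(train_views, train_masks,          # hypothetical training arrays
          validation_data=(val_views, val_masks),
          epochs=50, batch_size=8)           # batch size assumed
```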
C. Proposed MVS-CNN Network Configuration
In this section, the MVS-CNN is evaluated with different multi-feature view combinations to find the optimal combination. Views such as the gradients in horizontal and vertical directions ($G_{x}$, $G_{y}$) and the gradient magnitude ($G_{Mag}$) are combined with the RGB input image.
To study the effect of the individual feature views on the proposed MVS-CNN architecture, the KITTI dataset [57] is considered, and the results of the proposed network with the different feature view combinations are presented in Table 3. The baseline MVS-CNN with RGB input achieved a training accuracy of 95.9% along with a testing accuracy of 94.2%. When the gradient views are added to the input image (i.e., RGB, $G_{x}$, $G_{y}$, and $G_{Mag}$, forming a six-channel input), the testing accuracy improves to 96.9%, a gain of 2.7% over the RGB baseline.
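For reference, the input stacks for these view combinations can be assembled along the following lines, building on the hypothetical gradient_views function sketched earlier (the rgb and gray arrays are assumed to hold the color image and its grayscale version):

```python
# Assemble the input stacks for the view-combination study (Table 3).
gx, gy, g_mag = gradient_views(gray)
combos = {
    "RGB":            [rgb],
    "RGB+Gx+Gy":      [rgb, gx[..., None], gy[..., None]],
    "RGB+Gx+Gy+GMag": [rgb, gx[..., None], gy[..., None], g_mag[..., None]],
}
inputs = {name: np.concatenate(parts, axis=-1)
          for name, parts in combos.items()}
```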
D. Network Performances
To carry out further experimentation, both datasets (i.e., KITTI [57] and Cityscapes [61]) are considered for the performance analysis of the different network architectures. The performances of the MVS-CNN, SegNet [35], UNet [33], and ResNet [34] models are compared in terms of model accuracy, processing time, and network complexity (number of trainable parameters) in Table 4. It can be seen that UNet [33] has a large number of trainable parameters (31 million) and slow training and prediction times compared to the other networks. ResNet [34], despite having the highest number of trainable parameters (35 million), achieved better training and validation accuracy than UNet [33], along with lower processing and prediction times. SegNet [35], with the fewest trainable parameters (5.4 million), achieved better model accuracy and lower processing and prediction times than both UNet [33] and ResNet [34]. However, the proposed MVS-CNN, with 6.6 million trainable parameters, performs better than these state-of-the-art networks, achieving the highest training and testing accuracies of 98.8% and 96.9% on the KITTI dataset, respectively, and training and testing accuracies of 99.1% and 96.2% on the Cityscapes dataset. The proposed MVS-CNN also dominates in terms of processing time, achieving significantly lower training and prediction times. It is clear from Table 4 that although SegNet [35] has fewer trainable parameters, the proposed MVS-CNN outperforms it in both model accuracy and processing time. Moreover, in comparison to the other state-of-the-art DL architectures, our proposed architecture has fewer convolution layers and can handle the additional multi-feature views while maintaining superior performance.
E. Metrics for Semantic Segmentation Accuracy
In this study, the segmentation accuracy is calculated using the most commonly used semantic segmentation metrics [32], [52]: mean Intersection over Union (IoU), mean accuracy, pixel accuracy, and frequency weighted (f.w.) IoU. Let $n_{ij}$ be the number of pixels of class $i$ predicted as class $j$, $n_{cl}$ the number of classes, and $t_{i} = \sum_{j} n_{ij}$ the total number of pixels of class $i$. These metrics are computed as follows [52]:
\begin{align*} Pixel\;accuracy &= \frac{\sum_{i} n_{ii}}{\sum_{i} t_{i}} \tag{10}\\ Mean\;accuracy &= \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_{i}} \tag{11}\\ Mean\;IoU &= \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_{i} + \sum_{j} n_{ji} - n_{ii}} \tag{12}\\ f.w.\;IoU &= \left({\sum_{k} t_{k}}\right)^{-1} \sum_{i} \frac{t_{i}\, n_{ii}}{t_{i} + \sum_{j} n_{ji} - n_{ii}}\tag{13}\end{align*}
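As an illustration, all four metrics can be computed from a pixel-level confusion matrix as in the following sketch (the function name and the use of NumPy are assumptions, not part of the original evaluation code):

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray) -> dict:
    """Compute Eqs. (10)-(13) from a confusion matrix where
    conf[i, j] = number of pixels of class i predicted as class j."""
    n_ii = np.diag(conf).astype(float)       # correctly classified pixels
    t_i = conf.sum(axis=1).astype(float)     # total pixels of class i
    pred_i = conf.sum(axis=0).astype(float)  # pixels predicted as class i

    union = t_i + pred_i - n_ii              # t_i + sum_j n_ji - n_ii
    iou = n_ii / union

    return {
        "pixel_accuracy": n_ii.sum() / t_i.sum(),   # Eq. (10)
        "mean_accuracy": (n_ii / t_i).mean(),       # Eq. (11)
        "mean_iou": iou.mean(),                     # Eq. (12)
        "fw_iou": (t_i * iou).sum() / t_i.sum(),    # Eq. (13)
    }
```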
The predicted outputs of SegNet [35], ResNet [34], UNet [33] and the proposed MVS-CNN on the KITTI dataset are illustrated in Fig. 7. Similarly, the predicted results of the utilized and proposed DL architectures on the Cityscapes dataset are shown in Fig. 8. It can be observed that our proposed MVS-CNN and SegNet [35] obtain comparable results by precisely segmenting road regions very close to the ground-truth labels, whereas UNet [33], ResNet [34] and SEANet [55] misclassify some pixels in either the road or background regions.
Fig. 7. Segmentation results of all deep learning architectures on input images from the KITTI [57] dataset.
Fig. 8. Segmentation results of all deep learning architectures on input images from the Cityscapes [61] dataset (top-to-bottom: ‘frankfurt_000000_004617’, ‘frankfurt_000001_015768’, ‘lindau_000000_000019’, ‘lindau_000010_000019’, ‘lindau_000058_000019’, ‘munster_000001_000019’, ‘munster_000022_000019’ and ‘munster_000038_000019’). (a) Input image (b) Ground truth labeled image (c) UNet [33] (d) ResNet [34] (e) SegNet [35] (f) Proposed MVS-CNN (g) SEANet [55].
The experimental results show that the proposed MVS-CNN model is superior in terms of model accuracy, processing time, and segmentation accuracy when compared to other state-of-the-art DL networks.
Conclusion
In this study, an MVS-CNN model has been proposed, which uses additional features such as gradient information along with the RGB channels of road images to train the network. The smaller number of convolutional layers in the proposed MVS-CNN architecture makes it faster and less complex, while the additional features help the network learn quickly without going deeper. The validation accuracy of the MVS-CNN with six input channels achieved an improvement of 2.7% compared to the baseline three-channel (RGB) CNN. The proposed MVS-CNN has been evaluated in terms of model accuracy, computational complexity and segmentation accuracy. In comparison with ResNet [34], UNet [33] and SEANet [55], the proposed MVS-CNN outperforms them while achieving results comparable to SegNet [35]. The proposed shallow architecture, consisting of only seven convolutional layers, requires a prediction time of 0.6–0.8 milliseconds (depending on the size of the input image), making it less complex and faster than the other evaluated state-of-the-art DL architectures. In this study, only road labels are considered for evaluation and experimentation. In future work, lane labels as well as a multi-class segmentation approach will be considered to expand the scope of the proposed MVS-CNN. Similarly, besides the gradient information of input images, additional feature representations (e.g., curvature, blobs, ridges) will be considered to enhance the performance of the proposed multi-feature view-based network architecture. Finally, since this study deals only with driving image databases, the performance of the proposed MVS-CNN architecture can be further evaluated on driving video repositories.