
Multi-Feature View-Based Shallow Convolutional Neural Network for Road Segmentation



Abstract:

This study presents a shallow and robust road segmentation model. Computer-aided real-time applications, like driver assistance, require real-time and accurate processing. Current studies use Deep Convolutional Neural Networks (DCNN) for road segmentation. However, DCNN requires high computational power and large amounts of labeled data to learn abstract features for the deeper layers. The deeper the layer is, the more abstract the information it tends to learn. Moreover, the prediction time of the DCNN network is an important consideration for autonomous vehicles. To overcome these issues, a Multi-feature View-based Shallow Convolutional Neural Network (MVS-CNN) is proposed that utilizes the abstract features extracted from explicitly derived representations of the input image. Gradient information of the input image is used as additional channels to enhance the learning process of the proposed deep learning architecture. The multi-feature views are fed to a fully-connected neural network to accurately segment the road regions. The testing accuracy demonstrates that the proposed MVS-CNN achieves an improvement of 2.7% compared to a baseline CNN consisting of only RGB inputs. Furthermore, a comparison of the proposed method with the popular semantic segmentation network SegNet has shown that the proposed scheme performs better while being more efficient during training and evaluation. Unlike traditional segmentation techniques, which are based on the encoder-decoder architecture, the proposed MVS-CNN consists of only the encoder network. The proposed MVS-CNN has been trained and validated with two well-known datasets: the KITTI Vision Benchmark and the Cityscapes dataset. The results have been compared with state-of-the-art deep learning architectures, which the proposed MVS-CNN outperforms in terms of model accuracy, processing time, and segmentation accuracy. Based on the experimental results, the proposed architecture can be considered as an efficient road ...
Topic: Scalable Deep Learning for Big Data
Graphical abstract: Proposed MVS-CNN architecture with input ($I_{RGB} + G_{Mag}$) image “um_000027”; the network output is the segmented road region.
Published in: IEEE Access (Volume: 8)
Page(s): 36612 - 36623
Date of Publication: 10 February 2020
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Autonomous driving has recently gained considerable attention from researchers. The key aim of Intelligent Transport Systems (ITS) is to avoid accidents while accurately guiding the vehicle along the road, observing traffic safety rules and avoiding obstacles in the way [1], [2]. Advancements in the field of Autonomous Vehicles (AV) have posed enormous challenges to the automotive industry [3], including real-time processing along with accurate and reliable segmentation. The availability of safe, secure, and reliable AVs will benefit society, enabling visually impaired and disabled people, among others, to commute more easily and efficiently [4].

To reduce the risk of road accidents, it is necessary to accurately distinguish the road region from other regions. This helps autonomous vehicles navigate correctly and understand the surrounding environment, including traffic signs [5], [6] and signals [7], [8], pedestrians [9], road lanes [10], and other vehicles on the road [11]. Mersky et al. [12] concluded that reducing the prediction time of the detection/segmentation algorithms used in AVs can improve fuel economy, since more aspects of driving (i.e., acceleration, speed, etc.) are based on decisions made by the AV instead of the human driver.

Typically, object detection and tracking techniques use a bounding box to locate specific objects in an image [11], [13], [14]. Road detection differs from other object detection tasks because the road region is usually occluded by vehicles, cyclists, and pedestrians. For road detection, semantic segmentation of the perspective view can differentiate the road region from the other regions.

Researchers have proposed different techniques to address road detection and segmentation [15]–[22]. Some of these are based on Digital Image Processing (DIP) [17], [19], [22], whereas others use Machine Learning (ML) techniques [23], [24] to address the problem.

Initially, researchers [17], [19], [22] used DIP techniques to detect road lines in order to help drivers. DIP techniques can only detect roads based on structural and color information. Similarly, ML techniques have also been used in object detection [25], [26], classification [27], [28], and segmentation [29]–[31].

DIP and ML techniques work based on predefined criteria and features (such as color, edges, corner points, structures, etc.) in images. Hence, such techniques do not perform well on noisy images or images captured under different lighting conditions. They also require pre-processing of images and manual fine-tuning, which varies with the type of image and the circumstances [32].

Recently, Deep Learning (DL) has demonstrated excellent results in artificial intelligence and computer vision applications [33]–[37]. Many DL-based applications have been proposed recently, including image classification [32], [38], [39], image segmentation [40], [41], change detection in images [42], [43], and object tracking [44], [45]. While Deep Convolutional Neural Networks (DCNN) have been useful in the aforementioned DL applications, they also prove helpful in semantic segmentation [29], [46].

Semantic segmentation models classify the pixels of an image into one of a set of predefined classes. Recently, several semantic segmentation solutions based on DCNN have been proposed [18], [40], [47], [48]. The impressive classification results achieved by AlexNet [49] and GoogleNet [37] led to an increase in the use of DCNN-based computer vision applications, including segmentation.

In a DCNN, each convolutional layer extracts features and learns abstract information from the input images. The deeper the network is, the more generalized the information it tends to learn. Most state-of-the-art DL architectures, like UNet [33], ResNet [34], SegNet [35], VGGNet [36], and GoogleNet [37], deepen the network to achieve improved model accuracy and segmentation performance. However, deeper networks are data-hungry, requiring much more data to perform well. When the available data is insufficient, making the networks deeper can cause over-fitting, and the objective of learning generalized features will not be achieved.

Similarly, the deeper the network, the longer the time required to obtain results. However, in computer-aided real-time applications like driver assistance, processing time is of key importance. Thus, there is a pressing need for shallow and generalized networks that require less processing time to segment and classify images. Keeping these points in view, it would be useful if the required abstract information were pre-calculated and passed along with the original input. This results in a shallower yet accurate network that requires less data for training.

Recent state-of-the-art DL techniques use DCNN [33], [35] to achieve better results. However, deeper networks face the vanishing gradient problem, which has been handled with residual networks in [34]. Residual connections, however, make the network complex, which increases the time required for training. Such networks cannot be used in real-time computer-vision-based applications.

The proposed Multi-feature View-based Shallow Convolutional Neural Network (MVS-CNN) utilizes gradient information as additional features along with the input image. The gradient of an image captures the change in intensity or color, which is useful for extracting essential information from images (e.g., edges and texture). By examining the outputs of the convolution layers in a DCNN, shown in Fig. 1, it can be observed that the outputs resemble gradient images. Since convolutional layers in a DCNN learn features, gradient features supplied along with the input image help the DL model enhance its learning process. Furthermore, they can also help reduce the processing time of DL architectures.

FIGURE 1. The outputs generated by each filter of the first convolutional layer of a CNN with eight filters.

The novelty of our proposed technique is that gradient information is used to enhance the learning process. The network has a minimal number of convolutional layers, hence it is neither very deep nor complex and requires less processing time. The proposed technique is a sequential model with no complex constructs such as shortcut connections or encoder-decoder structures, which makes it fast and efficient in comparison to other recently proposed DL architectures. The key contributions of this study are:

  1. A novel MVS-CNN model is developed for road segmentation task and its performance is measured against recent state-of-the-art segmentation models on publicly available databases.

  2. A multi-feature view based on the gradient information of the images is used to enhance the network learning process.

  3. The proposed segmentation architecture consists of only the encoder network with seven (7) convolution layers, and as a result it is less complex than state-of-the-art encoder-decoder based segmentation networks.

  4. The proposed scheme is evaluated against various state-of-the-art DL architectures based on training accuracy, processing time, and segmentation performance.

The rest of the paper is organized as follows: Section II discusses related studies. The proposed methodology is discussed in Section III. Results and comparisons are discussed in Section IV. Finally, Section V concludes the paper.

SECTION II.

Literature Review

Recently, researchers have proposed different techniques for road detection and segmentation. Initially, researchers proposed image processing techniques to detect road lanes and boundaries [17], [19], [22].

Aly et al. [17] used a Hough transform-based method to recognize lane marks and classify the road region in an image. The Hough transform extracts shapes; therefore, it can produce false detections: structures with boundaries similar to lanes, such as railings, signboards, or road surface cracks, can be mis-recognized as lane markings. To identify lanes accurately, Wang et al. [22] proposed a novel B-Snake based lane model for detecting different types of lanes. The model can detect various lane structures, such as straight, curved, or parabolic lanes. The authors merge the side-lane detection problem into detecting the middle line of the road.

Kim [19] introduced a lane detection algorithm based on Random Sample Consensus (RANSAC). The algorithm is robust and able to perform in real-time. First, possible lane lines are calculated, which are then grouped by a hypothesize-and-verify method. The selected lines are separated into right and left lanes. Since image processing techniques are designed for specific scenarios, they are unable to perform in complex and unpredictable situations. This drawback prompted researchers to explore techniques that could learn and adjust to diverse scenarios.

Unlike image processing techniques, Machine Learning (ML) techniques can extract and learn features for classification. ML techniques have been used in prior AV models [23], [24] and can perform detection and segmentation tasks in a wide range of scenarios. A method to detect road lane boundaries, named Vector Fuzzy Connectedness (VFC), was proposed by Fang and Wang [23]. The lane curves are calculated from a skeleton image, and the control points for the right and left lanes are then estimated from the curves using VFC.

Zhu et al. [24] studied and evaluated Extreme Learning Machines (ELM) for road and vehicle detection. For road detection, a color histogram is used, while gray-level features and Histograms of Oriented Gradients (HOG) are used to detect vehicles. Their network is shown to perform better than Support Vector Machines (SVM) and a Back Propagation Network (BPN). However, the road segmentation results of the ELM network are not accurate because of its rectangular segmentation pattern. It is also worth noting that because feature extraction and classification are performed independently of each other, such classifiers do not provide adequate results.

DL architectures extract and learn features from input images, which results in better segmentation and classification. However, DL requires a large dataset of images for training. Keeping this in view, AlHaija et al. [15] proposed an augmented reality method to augment road scene images and obtain a large dataset for training deep networks. Augmented reality is used to create additional images with more traffic information, which are then used to train the state-of-the-art Multi-task Network Cascade (MNC) [50]. The addition of augmented images helps improve segmentation performance compared to using real or synthetic images alone.

Liu and Deng [20] proposed a fully convolutional deep residual network with pyramid pooling. Since light exposure in images also affects the training and prediction of deep networks, the authors suggest exposure-based augmentation: underexposure compensation is used to augment the training images and obtain a larger training dataset. However, such a technique requires substantial image processing knowledge to collect images with similar illumination features and generate realistic scenes.

The impact of using a higher number of convolutional layers in a DCNN has been studied in [34]. The authors suggest that using residual layers in deep networks can yield higher accuracy than conventional networks. Using the ImageNet [51] dataset, they compare 34-layer plain and 34-layer residual networks and conclude that deeper plain networks have higher training errors due to the vanishing gradient problem, whereas residual connections overcome this degradation and allow accuracy gains from deeper networks. Their proposed network reduces training error by 3.5% compared to conventional DL networks.

Long et al. [52] discussed the Fully Convolutional Neural Network (FCNN) for pixel-to-pixel prediction in semantic segmentation. The authors present a network that replaces the dense layer with an up-sampling layer to obtain pixel-to-pixel segmented output. Converting the network to a fully convolutional one results in lower inference time, which is very important in real-time applications. A network-in-network CNN model has been proposed by Mendes et al. [21], which is converted into an FCNN after training to obtain fast road detection. The technique achieves comparably accurate results while maintaining a fast inference time, so it could be used effectively in real-time road detection. However, it is incapable of correctly classifying road regions under varied lighting conditions.

Romera et al. [53] presented a real-time solution for road segmentation using a CNN with residual connections. Residual connections overcome the degradation problem faced by most architectures with a large number of layers; in this way, their architecture allows accurate classification while being efficiently fast. Contributing further to real-time road segmentation, Romera et al. [54] presented a method to redesign the residual layer, aiming for speed efficiency while retaining the accuracy of recently proposed techniques. Although deep and complex architectures provide more competent results than DIP and ML techniques, they require high computational power and big data for training.

A self-ensembling attention network was proposed by Xu et al. [55]. The network consists of a student model that acts as the base network and a teacher model acting as the ensembling network. The student model learns from the output of the teacher network; as training proceeds, the student model becomes more accurate, and the predictions of the teacher model also get closer to the correct labels in the target domain. Since the student network learns from the teacher network, the performance of the overall framework depends mainly on the teacher network. Since multi-scale features are an important part of DL, a scale-aware model is proposed in [56]: the same images at different scales are used by the network for feature extraction, and the extracted features are then merged and used for the classification of every pixel in an image.

DIP-based techniques [17], [19], [22] are sensitive to variations such as changes in lighting conditions, variations in structure, and color differences; they work well only under specified conditions. Methods proposed in [15], [20] require high domain expertise, namely image processing knowledge to create new augmented images that look similar to the original ones. Similarly, DL techniques require high computing power and large datasets. To reduce the complexity and tackle the huge data requirement, a shallow CNN that uses multi-view features is proposed. The proposed shallow CNN provides higher segmentation accuracy while minimizing the complexity of the network.

SECTION III.

Proposed Methodology

A high-level diagram of the proposed DL architecture for road segmentation is shown in Fig. 2. In this study, segmentation is performed using the MVS-CNN architecture, which comprises a feature extraction module (E), a feature selection module (S), and a prediction module (P), as can be seen in Fig. 2.

FIGURE 2. Example of DL modules with input image and ground-truth labelled image of file “um_000027” from the KITTI [57] dataset.

In road images, the road boundaries differentiate the road region from the background regions. The gradient of an image highlights such boundaries, so feature extraction in the segmentation architecture can be improved by adding this information, enhancing the efficiency and efficacy of the architecture. Such views help in training the architecture without going deeper.

The network takes training data as input in the form of $(X, Y)$. Here $X$ is the road image, denoted $I^{(r \times c \times v)}$, where $r$ and $c$ represent the height and width of the road image, respectively, and $v$ represents the number of feature views (including the RGB views). $Y$ is the corresponding segmented ground truth of the road image such that $Y \in \{0,1\}^{(r \times c)}$, with the same height and width as the input image. The view combinations considered in this study include the horizontal gradient $(G_{x})$, vertical gradient $(G_{y})$, and gradient magnitude $(G_{Mag})$ views. The horizontal and vertical gradient views are constructed by computing the respective gradients of the road image using (1) and (2), where $\Delta_{x}$ and $\Delta_{y}$ represent the gradient components computed using a Gaussian filter and $i$ and $j$ represent row and column pixel coordinates, respectively:
\begin{align*} G_{x} &= \sum_{u=-s}^{s} \sum_{v=-s}^{s} \Delta_{x}\left(x_{i}+u,\; y_{i}+v\right)^{2} \tag{1}\\ G_{y} &= \sum_{u=-s}^{s} \sum_{v=-s}^{s} \Delta_{y}\left(x_{i}+u,\; y_{i}+v\right)^{2} \tag{2}\end{align*}

The gradient magnitude view $G_{Mag}$ is computed from $G_{x}$ and $G_{y}$ as given by (3):
\begin{equation*} G_{Mag} = \sqrt{G_{x} + G_{y}} \tag{3}\end{equation*}
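As an illustration, the following minimal sketch (in Python, using NumPy and SciPy) constructs the gradient views of (1)–(3) and stacks them with the RGB channels to form the multi-channel network input. The Gaussian scale sigma and window half-width s are assumptions, since their exact values are not specified here.

```python
# Hedged sketch: gradient views G_x, G_y, G_Mag of Eqs. (1)-(3) and the
# 6-channel MVS-CNN input. sigma and s are assumed parameters.
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def gradient_views(rgb, sigma=1.0, s=1):
    """rgb: float32 array of shape (r, c, 3), values scaled to [0, 1]."""
    gray = rgb.mean(axis=-1)                         # single-channel image
    dx = gaussian_filter(gray, sigma, order=(0, 1))  # Gaussian derivative along x
    dy = gaussian_filter(gray, sigma, order=(1, 0))  # Gaussian derivative along y
    w = 2 * s + 1
    # Eqs. (1)-(2): sum of squared derivative responses over a (2s+1)^2 window
    gx = uniform_filter(dx ** 2, size=w) * w ** 2
    gy = uniform_filter(dy ** 2, size=w) * w ** 2
    gmag = np.sqrt(gx + gy)                          # Eq. (3)
    return gx, gy, gmag

def six_channel_input(rgb):
    gx, gy, gmag = gradient_views(rgb)
    # I_RGB + G_x + G_y + G_Mag: the 6-channel view fed to MVS-CNN
    return np.concatenate(
        [rgb, gx[..., None], gy[..., None], gmag[..., None]], axis=-1)
```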

Pictorial representations of the views are shown in Fig. 3: Fig. 3(a) shows a sample image from the KITTI dataset, whereas Figs. 3(b), 3(c), and 3(d) show the horizontal gradient, the vertical gradient, and the gradient magnitude of the sample image, respectively. The purpose of road segmentation is to learn a representation function $R$ that can predict a binary image $\hat{y}$ relative to the ground-truth segmented image $Y$ given input $X$, as given by (4):
\begin{equation*} \hat{y} = R\left(Y \mid X; W, b\right) \tag{4}\end{equation*}
where $W$ and $b$ are the parameters of $R$. The function $R$ is parameterized with the Sigmoid [58] function to compute the output probability of each pixel between 0 and 1:
\begin{equation*} \hat{y} = \text{Sigmoid}\left(R\left(Y \mid X; W, b\right)\right) \tag{5}\end{equation*}
A threshold of 0.5 is applied to the predicted image $\hat{y}$ to generate the final output image.

FIGURE 3. Different views created from sample road image “um_000069” from the KITTI [57] dataset. (a) Input image $(I_{RGB})$. (b) Horizontal gradient $(G_{x})$. (c) Vertical gradient $(G_{y})$. (d) Gradient magnitude $(G_{Mag})$.

The function $R$, in this case, represents the proposed MVS-CNN and can be written as:
\begin{equation*} R = MVS_{CNN}(X) \tag{6}\end{equation*}
Specifically,
\begin{equation*} MVS_{CNN}(X) = S^{(N)}(\ldots S^{(1)}(E^{(M)}(\ldots E^{(2)}(E^{(1)}(X))))) \tag{7}\end{equation*}
where $M$ and $N$ indicate the total number of feature extraction and feature selection modules, respectively. The feature extraction module consists of sequential processes that define the layer-wise operations at each layer $l$, given as $g^{(l)}$, and extracts features represented as $E^{(l)}$:
\begin{align*} E^{(l)} &= g^{(l)}(E^{(l-1)}; W^{(l)}, b^{(l)}) \\ &= f_{norm}(P(f_{act}(W^{(l)} * E^{(l-1)} + b^{(l)}))) \tag{8}\end{align*}
where $b^{(l)}$ and $W^{(l)}$ represent the $l^{th}$ layer bias and filter weights, $*$ represents the convolution operation, $f_{act}$ is the non-linear activation function, $P$ defines the max-pooling operation, $f_{norm}$ represents the normalization function, and $E^{(l-1)}$ is either the input image $X$ for the first layer ($l = 1$) or the $(l-1)^{th}$ activation for subsequent layers ($l > 1$). In the convolution layer, the input image is processed to learn filter weights that extract useful features. To introduce non-linearity, a Rectified Linear Unit (ReLU) [59] is used after each convolution layer. The pooling layer reduces the size of the feature maps by merging locally associated features into a single feature, taking either the average or the maximum of the feature values. The feature selection module at layer $l$, represented as $h^{(l)}$ where $l > (M-1)$, is a dense neural network layer in which each neuron is connected to all neurons in the previous layer. The process at this layer involves dot product ($\cdot$) operations and non-linear activations, as given by (9):
\begin{align*} S^{(l)} &= h^{(l)}(S^{(l-1)}; W^{(l)}, b^{(l)}) \\ &= f_{act}(W^{(l)} \cdot S^{(l-1)} + b^{(l)}) \tag{9}\end{align*}
where $S^{(l-1)}$ is either the $(l-1)^{th}$ layer activation for $l > M$ or $E^{(M)}$ for $l = M$, connecting the feature selection module with the feature extraction module. The parameters of $R$ are optimized using Stochastic Gradient Descent (SGD) [60] with the cross-entropy loss function. The proposed MVS-CNN is illustrated in Fig. 4.
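A minimal sketch of the layer-wise operations in (8) and (9) follows, written with the TensorFlow/Keras layers also used in the experiments of Section IV. The filter counts and kernel sizes are placeholders; the exact values are given in Table 2.

```python
# Hedged sketch of one feature-extraction module, Eq. (8):
# E(l) = f_norm(P(f_act(W * E(l-1) + b))), and one dense
# feature-selection layer, Eq. (9). Sizes are placeholders.
from tensorflow.keras import layers

def extraction_module(x, filters, kernel_size=3):
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)  # W * E + b
    x = layers.ReLU()(x)                      # f_act
    x = layers.MaxPooling2D(pool_size=2)(x)   # P
    return layers.BatchNormalization()(x)     # f_norm

def selection_layer(x, units):
    # Eq. (9): f_act(W . S(l-1) + b), a fully-connected layer
    return layers.Dense(units, activation="relu")(x)
```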

FIGURE 4. Proposed MVS-CNN architecture with input ($I_{RGB}+G_{x}+G_{y}+G_{Mag}$, 6-channel) image “um_000027”; the network output is the segmented road region.

SECTION IV.

Experimentation and Results

The proposed MVS-CNN road segmentation network is designed to make computer-aided road segmentation systems less complex. The primary experimental verifications required are: (i) whether the inclusion of the multi-feature views improves the network performance without significant degradation of segmentation accuracy; and (ii) whether the proposed network is reliable and efficient at the road segmentation task and achieves improved performance compared to recent state-of-the-art segmentation models.

A. Dataset

To evaluate the performance of the DL architectures in this study, two well-known datasets, the KITTI Vision Benchmark Suite [57] and the Cityscapes dataset [61], are utilized for experimentation. They are discussed below:

1) The KITTI Dataset

The KITTI dataset [57] consists of 289 input images along with road and lane label images. The dataset comprises different road variations, including marked roads (um), multiple marked roads (umm), and unmarked roads (uu). Lane labels are available for the marked and multiple marked roads only. This study focuses on road segmentation; therefore, only road labels have been selected from the dataset. Since deep networks require a large number of images, these images are augmented to artificially expand the dataset to 1500 images. The parameters used for augmentation are presented in Table 1. The dataset is randomly partitioned into an 80% training and a 20% validation set. The same augmented dataset is used for training and validation of all DL architectures evaluated in this study to ensure a fair comparison.

TABLE 1. Data Augmentation Parameters
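Since only the caption of Table 1 survives in this extraction, the following sketch shows a typical paired image/label augmentation pipeline; the specific operations (flips, brightness) are assumptions, not the parameters listed in Table 1.

```python
# Hedged sketch of paired augmentation; the chosen operations are
# assumptions, not the actual Table 1 parameters.
import tensorflow as tf

def augment(image, label, seed=(1, 2)):
    """image: (r, c, channels) float32 in [0, 1]; label: (r, c, 1)."""
    # Same stateless seed flips image and label identically, keeping pairs aligned.
    image = tf.image.stateless_random_flip_left_right(image, seed)
    label = tf.image.stateless_random_flip_left_right(label, seed)
    image = tf.image.stateless_random_brightness(image, 0.2, seed)
    return tf.clip_by_value(image, 0.0, 1.0), label

# Usage: dataset = dataset.map(augment) repeated until 1500 samples are reached.
```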

2) Cityscapes Dataset

The Cityscapes dataset [61] comprises 5000 labeled images collected from 50 cities during different months and seasons. There are 2975 training, 1525 testing, and 500 validation images, classified into 19 categories. Since this study focuses on the segmentation of roads, only road labels are selected and all other labels are treated as background regions.

B. Proposed Network Architecture

The proposed network architecture is derived by exhaustively testing combinations of the following four network hyper-parameters to find the best-performing segmentation architecture:

  • The number of convolution layers in the network is varied between 5 and 9, and max-pooling is performed either after every layer or after stacking two or three convolution layers.

  • For training, different optimization algorithms such as SGD [60] and RMSProp [62] are tested.

  • ReLU [59] and Leaky-ReLU [63] activation functions are evaluated after every convolution layer.

  • For regularization, both Batch-Normalization (BN) [64] and Dropout [65] have been evaluated.

Based on extensive testing, the network consisting of seven (7) convolution layers performed better than the other network configurations. Among the tested optimizers, RMSProp exhibited faster convergence and better accuracy than SGD over 50 epochs. Of the tested ReLU and Leaky-ReLU activation functions, both achieved comparable accuracy, but ReLU was more efficient. Furthermore, BN after each convolution layer performed better, improving the speed, performance, and stability of the proposed architecture. The MVS-CNN architecture yielding the best results is given in Table 2. The experiments in this study are performed using the TensorFlow [66] library on the Google Colaboratory platform [67].
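Putting the pieces together, a hedged end-to-end sketch of this best-performing configuration follows. The filter counts, pooling placement, dense-layer sizes, and input resolution are assumptions (the exact architecture is listed in Table 2), so the parameter count of this sketch will not match the 6.6 million reported for MVS-CNN.

```python
# Hedged sketch: seven conv layers with ReLU and Batch-Normalization,
# dense feature selection, sigmoid output (Eq. (5)), RMSProp + cross-entropy.
# All sizes are illustrative placeholders, not the Table 2 values.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mvs_cnn(input_shape=(128, 256, 6)):
    inp = layers.Input(shape=input_shape)        # I_RGB + G_x + G_y + G_Mag
    x = inp
    for i, f in enumerate([32, 32, 64, 64, 128, 128, 256]):  # 7 conv layers
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        if i % 2 == 1:                           # assumed pooling placement
            x = layers.MaxPooling2D(2)(x)        # P in Eq. (8)
        x = layers.BatchNormalization()(x)       # f_norm in Eq. (8)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # feature selection module
    out = layers.Dense(128 * 256, activation="sigmoid")(x)  # per-pixel probability
    out = layers.Reshape((128, 256))(out)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Pixels with predicted probability above 0.5 are labelled as road:
# y_hat = (model.predict(x_batch) > 0.5).astype("uint8")
```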

TABLE 2. Architecture of the Proposed MVS-CNN (Baseline)

C. Proposed MVS-CNN Network Configuration

In this section, the MVS-CNN is evaluated by varying the multi-feature view combinations to find the optimal combination. Views such as the gradients in the horizontal and vertical directions ($G_{x}$ and $G_{y}$) and the gradient magnitude ($G_{Mag}$) are derived from the road images, as discussed in Section III, with an example shown in Fig. 3.

To study the effect of individual feature views on the proposed MVS-CNN architecture, the KITTI dataset [57] is considered, and the results of the proposed network with the different feature view combinations are presented in Table 3. The baseline MVS-CNN with RGB input achieved a training accuracy of 95.9% and a testing accuracy of 94.2%. When the horizontal and vertical gradients are used with the input image (i.e., $I_{RGB} + G_{x} + G_{y}$, 5-channel input), the feature learning process improves due to the gradient features, leading to training and testing accuracies of 97.3% and 95.7%, respectively. Similarly, in another experiment using the input image with the gradient magnitude (i.e., $I_{RGB} + G_{Mag}$, 4-channel input), the accuracy increased over the baseline MVS-CNN (i.e., $I_{RGB}$, 3-channel input), resulting in training and testing accuracies of 96.5% and 95.1%, respectively. Furthermore, when all gradient information is used along with the input image (i.e., $I_{RGB}+G_{x}+G_{y}+G_{Mag}$, 6-channel input), the proposed MVS-CNN achieved 98.4% training accuracy and 96.9% testing accuracy. It can be observed from Table 3 that the gradient information helps improve the learning process of the proposed MVS-CNN architecture. The effect of the different multi-feature view combinations is further illustrated in Fig. 5, which shows the feature maps extracted by the first (Conv1), an intermediate (Conv4), and the last (Conv7) convolution layers of the proposed architecture: the proposed multi-feature view combination (i.e., $I_{RGB}+G_{x}+G_{y}+G_{Mag}$) learns more features than the other combinations. Similarly, it can be observed that the combination of the input image with the horizontal and vertical gradients (i.e., $I_{RGB}+G_{x}+G_{y}$) retains more useful features in the convolutional layers than the combination with the gradient magnitude (i.e., $I_{RGB}+G_{Mag}$), resulting in better accuracy and an output image closer to the ground truth. Moreover, the gradient information enhances the model learning process compared to training on the input images alone (i.e., $I_{RGB}$), yielding higher model accuracy. Based on this analysis, further experimentation and evaluation of the proposed MVS-CNN architecture are carried out using all multi-feature views along with the input image (i.e., $I_{RGB}+G_{x}+G_{y}+G_{Mag}$).

TABLE 3. Model Accuracy of the Proposed MVS-CNN on Different Multi-Feature View Combinations for the KITTI [57] Vision Benchmark Suite Database

FIGURE 5. Feature maps from the first, intermediate, and last convolution layers of the proposed MVS-CNN on image “um_000015” from the KITTI [57] dataset.

D. Network Performances

To carry out further experimentation, both datasets (i.e., KITTI [57] and Cityscapes [61]) are considered for the performance analysis of the different network architectures. The performances of the MVS-CNN, SegNet [35], UNet [33], and ResNet [34] models are compared in terms of model accuracy, processing time, and network complexity (number of trainable parameters) in Table 4. It can be seen that UNet [33] has a large number of trainable parameters (i.e., 31 million) and slow training and prediction times compared to the other networks. ResNet [34], with the highest number of trainable parameters (i.e., 35 million), achieved better training and validation accuracy than UNet [33], along with lower processing and prediction times. Similarly, the state-of-the-art SegNet [35], with the fewest trainable parameters (i.e., 5.4 million), achieved better model accuracy and lower processing and prediction times than UNet [33] and ResNet [34]. However, the proposed MVS-CNN, with 6.6 million trainable parameters, performs better than the state-of-the-art networks, achieving the highest training and testing accuracies of 98.8% and 96.9% on the KITTI dataset, respectively. Similarly, the proposed MVS-CNN achieved training and testing accuracies of 99.1% and 96.2% on the Cityscapes dataset. The proposed MVS-CNN also excels in processing time, achieving significantly lower training and prediction times. It is clear from Table 4 that, although SegNet [35] has fewer trainable parameters, the proposed MVS-CNN outperforms it in terms of model accuracy and processing time. Moreover, compared to the other state-of-the-art networks, the proposed architecture has fewer convolution layers and can handle the additional multi-feature views while maintaining network performance.

TABLE 4. Network Complexities in Terms of Trainable Parameters, Training Time, and Testing Time

E. Metrics for Semantic Segmentation Accuracy

In this study, the segmentation accuracy is calculated using the most commonly used semantic segmentation metrics [32], [52]: mean Intersection over Union (IoU), mean accuracy, pixel accuracy, and frequency weighted (f.w.) IoU. These metrics are computed as follows [52]:
\begin{align*} Pixel\;accuracy &= \frac{\sum_{i} n_{ii}}{\sum_{i} t_{i}} \tag{10}\\ Mean\;accuracy &= \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_{i}} \tag{11}\\ Mean\;IoU &= \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_{i} + \sum_{j} n_{ji} - n_{ii}} \tag{12}\\ f.w.\;IoU &= \Big(\sum_{k} t_{k}\Big)^{-1} \sum_{i} \frac{t_{i}\, n_{ii}}{t_{i} + \sum_{j} n_{ji} - n_{ii}} \tag{13}\end{align*}
where $n_{ii}$ is the number of pixels of class $i$ correctly predicted as class $i$, $t_{i} = \sum_{j} n_{ij}$ is the total number of pixels of class $i$, $n_{cl}$ is the total number of classes, $n_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and $n_{ji}$ is the number of pixels of class $j$ incorrectly predicted as class $i$. In addition to UNet, ResNet, and SegNet, the SEANet [55] model is also utilized for comparison and evaluation of segmentation accuracy. In this study, the pre-trained SEANet model is used for comparison. Since the SEANet [55] model is pre-trained on different labels, its predicted output is refined so that all labels other than road regions are represented as a single background region, as shown in Fig. 6. The segmentation accuracy results and the comparison of the proposed network with the aforementioned DL architectures are presented in Table 5, which shows that the MVS-CNN architecture performs better on both KITTI [57] and Cityscapes [61], with SegNet achieving comparable results. It is clear from the results reported in Table 5 that the proposed MVS-CNN predicts the road regions in the input images more precisely than the other DL architectures.
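For reference, a short sketch computing (10)–(13) from a per-class confusion matrix (here with $n_{cl} = 2$: background and road):

```python
# Hedged sketch of the metrics in Eqs. (10)-(13); cm[i, j] counts pixels
# of true class i predicted as class j.
import numpy as np

def segmentation_metrics(cm):
    n_ii = np.diag(cm).astype(float)   # correctly classified pixels per class
    t_i = cm.sum(axis=1)               # total pixels of class i (ground truth)
    pred_i = cm.sum(axis=0)            # pixels predicted as class i
    iou = n_ii / (t_i + pred_i - n_ii) # per-class intersection over union
    return {
        "pixel_accuracy": n_ii.sum() / t_i.sum(),  # Eq. (10)
        "mean_accuracy": (n_ii / t_i).mean(),      # Eq. (11)
        "mean_iou": iou.mean(),                    # Eq. (12)
        "fw_iou": (t_i * iou).sum() / t_i.sum(),   # Eq. (13)
    }
```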

TABLE 5. Comparison of Road Region Segmentation Accuracy (in %) Using DL Architectures

FIGURE 6. Results of the utilized SEANet [55] architecture on input images from the Cityscapes [61] dataset (top-to-bottom: “frankfurt_000000_004617”, “lindau_000058_000019”). (a) Input image. (b) SEANet [55] predicted output. (c) Final output with only the road region.

The predicted outputs of SegNet [35], ResNet [34], UNet [33], and MVS-CNN on the KITTI dataset are illustrated in Fig. 7. Similarly, the predicted results of the utilized and proposed DL architectures on the Cityscapes dataset are shown in Fig. 8. It can be observed that the proposed MVS-CNN and SegNet [35] obtain comparable results by precisely segmenting road regions, very close to the ground-truth labels, whereas UNet [33], ResNet [34], and SEANet [55] misclassify some pixels in either the road or background regions.

FIGURE 7. Segmentation results of all deep learning architectures on input images from the KITTI [57] dataset (top-to-bottom: “um_000083”, “umm_000011”, “umm_000015”, “uu_000063”, “uu_000066”, and “uu_000072”). (a) Input image. (b) Ground-truth labeled image. (c) UNet [33]. (d) ResNet [34]. (e) SegNet [35]. (f) Proposed MVS-CNN.

FIGURE 8. Segmentation results of all deep learning architectures on input images from the Cityscapes [61] dataset (top-to-bottom: “frankfurt_000000_004617”, “frankfurt_000000_004617”, “frankfurt_000001_015768”, “lindau_000000_000019”, “lindau_000010_000019”, “lindau_000058_000019”, “munster_000001_000019”, “munster_000022_000019”, and “munster_000038_000019”). (a) Input image. (b) Ground-truth labeled image. (c) UNet [33]. (d) ResNet [34]. (e) SegNet [35]. (f) Proposed MVS-CNN. (g) SEANet [55].

The experimental results show that the proposed MVS-CNN model outperforms other state-of-the-art DL networks in terms of model accuracy, processing time, and segmentation accuracy.

SECTION V.

Conclusion

In this study, an MVS-CNN model has been proposed, which uses additional features, such as gradient information, along with the RGB channels of road images to train the network. The small number of convolutional layers in the proposed MVS-CNN architecture makes it faster and less complex, while the additional features help the network learn quickly without going deeper. The six-channel MVS-CNN achieved a 2.7% improvement in validation accuracy compared to the baseline three-channel (RGB) CNN. The proposed MVS-CNN is evaluated in terms of model accuracy, computational complexity, and segmentation accuracy: it outperforms ResNet [34], UNet [33], and SEANet [55], while achieving results comparable to SegNet [35]. The proposed shallow architecture, consisting of only seven convolutional layers, requires a prediction time of 0.6–0.8 milliseconds (depending on the size of the input image), making it less complex and faster than the other evaluated state-of-the-art DL architectures. In this study, only road labels are considered for evaluation and experimentation. In future work, lane labels as well as a multi-class segmentation approach will be considered to expand the scope of the proposed MVS-CNN. Similarly, beyond the gradient information of input images, additional feature representations (e.g., curvature, blobs, ridges) will be considered to enhance the performance of the proposed multi-feature view-based network architecture. This study deals with driving image databases only; the performance of the proposed MVS-CNN architecture can be further evaluated on driving video repositories.
