Illumination Robust Semantic Segmentation Based on Cross-Dimensional Multispectral Edge Fusion in Dynamic Traffic Scenes

The proposed Cross-dimensional Multispectral Edge Fusion Semantic Segmentation Network (CMEFNet) contains: A bilateral multimodal fusion encoder to integrate attentional ...

Abstract:

Semantic segmentation in dynamic urban traffic scenes holds paramount importance for enhancing the safety and efficiency of intelligent vehicles. However, the performance of this technique can be hindered in challenging lighting conditions and intricate backgrounds, even with the aid of geometric information derived from depth images or LiDAR point clouds alongside single-modal RGB images. To overcome this limitation, introducing thermal data alongside visible images is a feasible way to boost the accuracy and robustness of semantic segmentation in dynamic urban traffic scenes with varying illumination conditions. Accordingly, we propose a novel cross-dimensional multispectral edge fusion network (CMEFNet) specifically designed for RGB-Thermal (RGB-T) semantic segmentation in dynamic traffic scenes. It incorporates dual encoders that regroup low-level and high-level RGB-T features, a hierarchical attentional fusion module to refine multiscale multimodal features, a skip edge guidance structure that supplements information and further improves accuracy, and a deep supervision mechanism for fine-tuning. Experimental results demonstrate that our method significantly improves segmentation accuracy and robustness (improving mIoU by 0.9-18.3%), particularly in detecting dynamic traffic participants such as vehicles and pedestrians (improving by 0.4-24.6%). Furthermore, it displays commendable accuracy and robustness in both daytime and nighttime scenarios, surpassing the performance of state-of-the-art networks.

Published in: IEEE Access ( Volume: 12)

Page(s): 171589 - 171600

Date of Publication: 15 November 2024

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2024.3498896

SECTION I.

Introduction

Traffic safety is an important subject in social research, which has attracted significant attention [1]. To mitigate the pressure and adverse effects of traffic congestion, intelligent vehicles have emerged as a focal point. In an intricate intelligent driving system, acquiring surrounding information and achieving scene understanding are the foundation for implementing various advanced functions, such as semantic mapping, decision making, and safe efficient navigation [2]. Semantic segmentation categorizes each pixel in an image and has become a crucial computer vision task [3], enabling the deep understanding of the driving environment.

Convolutional neural networks (CNNs) have attracted significant attention for their outstanding achievements in semantic segmentation and are widely applied in intelligent vehicles. Most of these CNN-driven techniques rely primarily on visual cues [4]. However, despite their remarkable progress on visible-light images, single-modal sensory approaches have proven inadequate for comprehending scenes captured under diverse lighting and weather conditions or against complex backgrounds. Specifically, images captured in darkness, with glare, or under shadows are often of especially poor quality, resulting in a sharp decline in semantic segmentation accuracy. Furthermore, low visibility and poor lighting conditions at night can hinder the ability of drivers and pedestrians to observe road conditions accurately.

To address this, depth data is employed to supplement two-dimensional (2D) RGB images with three-dimensional (3D) geometric information, enhancing image segmentation accuracy to some extent. However, depth information may exhibit shortcomings in certain scenarios. Firstly, LiDAR sensors provide sparse and uneven depth data, which fails to align precisely with the dense semantic information of images on a pixel-by-pixel basis, thus limiting the utilization of the image’s rich content [5], [6], [7], [8], [9], [10]; secondly, depth data captured by depth cameras may become blurred during high-speed movements, and its time-of-flight-based measurement could encounter challenges in handling multiple reflections [11], [12], [13]. Consequently, depth information lacks robustness and reliability in conditions with insufficient illumination or cluttered backgrounds.

Thermal infrared cameras provide a unique approach to overcome depth mapping constraints by capturing the infrared thermal energy emitted by objects, thereby reducing the reliance on external lighting. Such cameras excel at imaging objects in challenging lighting conditions, from night-time scenarios to bright lights and shadows. They are particularly adept at observing objects whose temperatures exceed those of their surroundings, such as vehicles and pedestrians.

Despite historically high costs and limited availability [14], recent developments have made thermal cameras more competitively priced and accessible, subsequently boosting their utilization. This increased accessibility has created new possibilities for computer vision to directly benefit from thermal imaging. Numerous studies [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27] have demonstrated the effectiveness of RGB-and-thermal (RGB-T) semantic segmentation in enhancing sensitivity to these objects. However, compared with colour images, thermal infrared images usually lack colour and texture details, have low contrast, and contain cluttered noise.

These large domain gaps hinder the effective integration of cross-modal features and the full exploitation of thermal image features, as different modality features contribute unequally to the final results. Moreover, inappropriate feature fusion strategies fail to adequately consider the diverse properties of, and interactions between, low- and high-level features.

In order to address the aforementioned challenges, we propose a novel cross-dimensional multispectral edge fusion network (CMEFNet) to enhance the performance and robustness of RGB-T semantic segmentation under complex lighting conditions. The proposed CMEFNet addresses the challenge of effectively integrating cross-modal features and fully considering the different properties and mutual connection of low-level spatial details and high-level semantic information in RGB-T data.

The proposed network is compared with classical unimodal and multimodal networks in several aspects and the comparison results fully demonstrate its effectiveness in enhancing segmentation accuracy during both daytime and night-time. The main contributions of this article are summarized as follows.

A novel RGB-T semantic segmentation network based on cross-dimensional multi-spectral edge fusion is proposed to achieve robust semantic segmentation results in dynamic traffic scenarios by effectively performing multilevel feature fusion and multi-type feature aggregation.
A bilateral multimodal fusion encoder based on hierarchical attention is proposed to perform the thorough extraction of multiscale multimodal features. It could dynamically reweight and select effective features at different scales, thereby optimizing the fusion process.
A skip edge guidance structure based on multi-head cross attention is proposed to merge edge cues with high-level semantic features for further enhancing RGB-T semantic segmentation accuracy. It not only fuses semantic and edge features, but also addresses the problem of information loss.

The rest of this paper is organized as follows. Section II reviews the related work. Section III describes our network structure. Section IV presents the experimental results. Section V provides a discussion. Finally, Section VI concludes the paper.

SECTION II.

Related Works

A. RGB Image Semantic Segmentation

RGB image semantic segmentation is a vital task for intelligent vehicles and has witnessed significant advancements.

Pioneering methods such as Fully Convolutional Networks (FCN [28]) revolutionize the field by replacing the fully connected layer with convolutional layers, enabling end-to-end networks for pixel-wise labeling of RGB images. U-Net [29] introduces a symmetric design with skip connections, efficiently merging low-level semantic features with high-level ones, and is particularly adept at handling irregular and coarse image edges. Meanwhile, DeepLab [30] and its successors capitalize on dilated convolution to enlarge the receptive field while minimizing computational overhead. These methods leverage sophisticated feature extraction networks, like ResNet [31] and DenseNet (Dense Convolutional Network) [32], to capture deep discriminative features for precise dense segmentation prediction. PSPNet (Pyramid Scene Parsing Network) [33] aggregates features in different regions and at different scales to comprehensively learn local and global contextual information. SegNet [34] stands out with its encoder-decoder architecture, utilizing a VGG network [35] as the encoder and a mirror-symmetric structure as the decoder.

The image progressively loses fine details through repeated convolution and pooling operations, resulting in a progressively shrinking feature map that fails to accurately delineate the specific object contour and the classification of individual pixels. ESPNet (Efficient Spatial Pyramid Network) [36] mitigates this by replacing traditional convolutions with a combination of pointwise convolutions and dilated convolution spatial pyramids, thereby minimizing parameters and expanding the receptive field of features. MDCN (Multi-stream Densely Connected Network) [37] proposes a multi-stream network to capture features across varying scales utilizing intra- and inter-connections to densify receptive fields. STDC2 (Short-Term Dense Concatenate network) [38] introduces a detail guidance module aimed at encoding low-level spatial information and exhibiting a trade-off between efficiency and performance. MFRM (Multi-directional Feature Refinement Module) [39] designs three sub-branches to gather information at different scales and directions, while reducing computational load through strip pooling and dilated convolution operations.

Despite the commendable outcomes achieved by these single-modal approaches in semantic segmentation, their performance is hindered by their limited environmental adaptability. When confronted with intricate or challenging scenarios characterized by varying lighting conditions and complex backgrounds, it is difficult for a single perception modality to provide necessary and effective knowledge for accurate semantic segmentation results.

B. RGB-T Image Semantic Segmentation

Given the inherent robustness of thermal images to varying illumination and intricate backgrounds, RGB-T image semantic segmentation stands out as an effective approach for comprehending cluttered surroundings.

MFNet (Multi-spectral Fusion Networks) [15] pioneers the use of a CNN-based two-encoders-one-decoder architecture for multispectral image scene understanding and released an RGB-T dataset for semantic segmentation tasks. RTFNet (RGB-Thermal Fusion Network) [16] employs an encoder-decoder architecture for feature extraction, while the decoder focuses on restoring feature-map resolution. Building upon this foundation, FuseSeg [17] innovates by introducing a fusion architecture for RGB-T images, employing ladder fusion to enhance urban semantic segmentation.

PST900 (Penn Subterranean Thermal 900) [18] efficiently processes RGB images separately, while MLFNet (Multi Level Fusion Network) [19] presents a novel framework with RGB-T images fusion to enhance the robustness and accuracy of scene understanding in a lighting environment. FEANet (Feature-Enhanced Attention Network) [20] applies channel and spatial attention to enhance multilevel fusion. ABMDRNet (Adaptive-weighted Bi-directional Modality Difference Reduction Network) [21] presents a bidirectional modality difference reduction adaptive-weighted framework to effectively minimize the modality disparity. AFNet (Attention fusion network) [22] introduces an attentional fusion mechanism for enhancing feature presentation. MMNet (Multi-Stage and Multi-Scale Fusion Network) [23] addresses potential cross-modal conflicts by separately extracting features from different modalities in distinct stages. GCNet (Grid-like Context-aware Network) [24] proposes a grid-like context-aware network. [25] introduces an edge-aware guidance fusion network, emphasizing the importance of edge information in the fusion process. CEKD (Cross-modal Edge-privileged Knowledge Distillation) [26] proposes a cross-modal edge privileged knowledge distillation method specifically tailored for RGB-T urban scene semantic segmentation. [27] presents a modality difference reduction mask-guided network, offering an innovative approach to addressing disparities in the RGB-T urban scene semantic segmentation process.

While the aforementioned approaches have significantly advanced RGB-T urban scene semantic segmentation, a common characteristic among them is the reliance on unitary and simplistic fusion strategies, which often overlook the complexity and diversity of the features being fused. Such strategies may fail to integrate and fully leverage the specific characteristics inherent in multilevel features. In response to this limitation, we propose the Cross-dimensional Multispectral Edge Fusion Semantic Segmentation Network (CMEFNet), which consistently fuses multilevel multimodal features.

SECTION III.

Methodology

We propose a novel Cross-dimensional Multispectral Edge Fusion Semantic Segmentation Network (CMEFNet) shown in Figure 1.

FIGURE 1.

Overall architecture of the proposed method.

First, a bilateral multimodal fusion encoder based on hierarchical attention is proposed to extract and link low- and high-level features from RGB-T modalities. Then, an edge detection branch is introduced to standardize the intermediate features and enhance the utilization of global semantic information using the skip edge guidance structure. Finally, a deep supervision mechanism based on sideout fusion is designed to supervise the network with multiple loss functions.

In the following subsections, the architecture of the proposed method is described in detail.

A. Bilateral Multimodal Fusion Encoder

In the encoding stage, we first adopt two identical modified ResNet blocks as the feature extraction backbones for RGB and thermal images (see Figure 2), which is consistent with recent RGB-T semantic segmentation frameworks and facilitates fair comparison. Specifically, to reduce both the loss of informative features and the number of network parameters, we choose the ResNet18 variant and remove its last fully connected layer. The number of input channels of the ResNet initialization module in the thermal branch is set to 1, matching the single-channel thermal images.

FIGURE 2.

The structure of bilateral multimodal fusion encoder.

Given the inherent sensitivity of RGB images to environmental conditions, directly incorporating RGB image features may degrade overall performance. Therefore, we selectively fuse only the effective RGB image features with the thermal image features.

Recognizing the varied impact of RGB image features across different regions and channels, it becomes imperative to conduct a pixel-level evaluation to discern the nuanced effects on the results. This granular assessment ensures that the integration of RGB image features is informed by the understanding of their influence on the overall performance. Such meticulous scrutiny at the pixel level serves to optimize the fusion process, safeguarding against any adverse effects and fostering a refined synergy between RGB and thermal image features in the pursuit of enhanced performance in urban semantic segmentation.

Motivated by the efficacy of attention modules, we introduce a hierarchical attention-based feature fusion module to dynamically reweight and select effective features (see Figure 3). It takes RGB image features at four distinct scales as inputs, employing separate convolutional blocks to compute weights associated with features at each scale.

FIGURE 3.

The structure of hierarchical attention based feature fusion module.

Notably, the shapes of the resultant weights are the same as those of the original features. Subsequently, these weights (which are the pixel-by-pixel confidences for the corresponding feature maps and are learned automatically through supervised training) are applied to the corresponding features through element-wise multiplication. In this manner, we can selectively emphasize or de-emphasize specific features at different scales. This attention-based feature fusion aims to enhance the adaptability and discriminative power of the model, thereby optimizing the fusion process. The process can be expressed as follows:

\begin{align*} F_{ci}^{\prime }& =f\left (F_{ci}\right ), \quad i=1,2,3,4 \\ f\left (F_{ci}\right )& =ConvA\left (F_{ci} \odot \sigma \left (ConvA\left (F_{ci}\right )\right )\right ) \tag {1}\end{align*}

where $ConvA(\cdot)$ represents a convolution operation with activation, $\odot $ represents element-wise multiplication and $\sigma (\cdot)$ represents the sigmoid function.

After that, the selected RGB image features $F_{ci}^{\prime }$ are added to thermal image features $F_{ti}$ obtaining the fusion features $F_{fi}$ as the input of the next layer.

Finally, the fusion of the multiscale selected RGB image features $F_{c}^{\prime }$ is skip-connected to the input feature of the last decoder layer. In this manner, the error gradient propagates directly back to the feature fusion module, realizing direct supervision.
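The gating in (1) and the additive fusion with the thermal features can be illustrated with a minimal NumPy sketch at a single scale. It rests on several assumptions that are not stated in the text: $ConvA(\cdot)$ is approximated by a $1\times 1$ convolution followed by ReLU, the weights are random rather than learned, and the names `conv_a`, `select_rgb_features` and `fuse` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_a(feat, weight):
    """Stand-in for ConvA (assumption): 1x1 convolution + ReLU.
    feat: (C_in, H, W), weight: (C_out, C_in)."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, feat), 0.0)

def select_rgb_features(f_c, w_gate, w_out):
    """Eq. (1): F'_c = ConvA(F_c * sigmoid(ConvA(F_c)))."""
    gate = sigmoid(conv_a(f_c, w_gate))  # confidence map, same shape as f_c
    return conv_a(f_c * gate, w_out)     # reweighted RGB features

def fuse(f_c, f_t, w_gate, w_out):
    """Fusion feature F_f = F'_c + F_t (element-wise addition)."""
    return select_rgb_features(f_c, w_gate, w_out) + f_t

# Toy example with random features and random (untrained) weights.
rng = np.random.default_rng(0)
C, H, W = 4, 6, 8
f_c = rng.standard_normal((C, H, W))  # RGB feature map at one scale
f_t = rng.standard_normal((C, H, W))  # thermal feature map at the same scale
w_gate = rng.standard_normal((C, C))
w_out = rng.standard_normal((C, C))
f_f = fuse(f_c, f_t, w_gate, w_out)
print(f_f.shape)
```

Because the gate has the same shape as the feature map, each pixel in each channel receives its own confidence. If the output convolution contributes nothing, the fused feature reduces to the thermal feature alone.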

B. Skip Edge Guidance Structure

The redundant information of background can introduce noise during the feature re-aggregation process. Fortunately, edge information can deal with these noises and make up for the loss of information caused by down-sampling. This is because semantic segmentation and edge detection are two dual problems with interchangeable outputs and can reinforce and optimize each other’s results [40].

Considering that, we propose an edge detection branch and integrate global edge information into the semantic segmentation branch. This ensures comprehensive exploration and exploitation of relevant contextual features while preserving edge details, thereby mitigating noise and optimizing object extraction in semantic segmentation.

Nevertheless, the traditional edge enhancing modules primarily focus on aggregating features at the same scale. While effective, this approach may lack comprehensiveness, limiting the space and flexibility for feature aggregation.

Moreover, in these two branches, repeated pooling operations could result in the loss of spatial information, which is more severe in detecting edges. The lost information cannot be easily recovered by skip connections. To enhance the learning ability for global information, we design a skip edge guidance structure (SEG, see Figure 4). It fuses global edges through multi-skip connection, which not only supplements global information, but also enhances the network’s attention to target edge information.

FIGURE 4.

The structure of skip edge guidance (SEG).

To balance segmentation effectiveness and efficiency, we design three dual attention feature enhancement blocks (DAFEBs), shown in Figure 4, which are inspired by the memory function of recurrent neural networks (RNNs). Benefiting from the mechanism of RNNs [41], DAFEBs can transmit critical information in the same way that an RNN transmits hidden-layer information. Thus, DAFEBs not only fuse semantic and edge features, but also enhance features by introducing information from before the pooling operation. It should be emphasized that the three DAFEBs have the same structural design but do not share parameters, and their inputs and outputs differ in size.

Here, we take the first DAFEB (shown in Figure 5) as an example to explain its process. Details are as follows.

\begin{align*} F_{DAFEB}& =\mathcal {R}\left (LN\left (F_{CA2}+MLP\left (F_{CA2}\right )\right )\right ) \tag {2}\\ F_{CA1}& =LN\left (\hat {F}_{s}+MHCA\left (\hat {F}_{s},\hat {F}_{e},\hat {F}_{e}\right )\right ) \tag {3}\\ F_{CA2}& =LN\left (\hat {F}_{CAB}+MHCA\left (\hat {F}_{CAB},F_{CA1},F_{CA1}\right )\right ) \tag {4}\end{align*}

FIGURE 5.

The structure of dual attention feature enhancement block (DAFEB).

The inputs are the intermediate feature $F_{s}$ from the semantic branch, the intermediate feature $F_{e}$ from the edge branch, and $F_{in}$ from the input data. Both $F_{s}$ and $F_{e}$ have size $\left [\frac {w}{2}\times \frac {h}{2}\times 4\right ]$ and are reshaped to $\hat {F}_{s}$ and $\hat {F}_{e}$ with size $\left [\frac {wh}{4}\times 4\right ]$. $\hat {F}_{s}$ and $\hat {F}_{e}$ are fused by multi-head cross attention (MHCA [42], [43]) and layer normalization (LN) to generate the feature $F_{CA1}$. Specifically, MHCA in (3) takes the query Q from $\hat {F}_{s}$ and the key K and value V from $\hat {F}_{e}$. $F_{in}$, with size $\left [w\times h\times 1\right ]$, is reshaped to $\hat {F}_{in}$ with size $\left [\frac {wh}{4}\times 4\right ]$. Then $\hat {F}_{in}$ and $F_{CA1}$ are also fused by MHCA and LN to generate the feature $F_{CA2}$. $F_{CA2}$ is processed by a multilayer perceptron (MLP) and LN, and is reshaped to $F_{DAFEB}$ with size $\left [\frac {w}{4}\times \frac {h}{4}\times 16\right ]$. $F_{DAFEB}$ is the input of the third layer of the two branches and of the second DAFEB.
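The shape bookkeeping of the first DAFEB can be traced with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the attention has no learned query/key/value projections, LN has no learned scale or shift, all reshapes are plain row-major reshapes, the MLP is a two-layer ReLU network with random weights, and the function names are hypothetical. Equation (4) writes the query as $\hat {F}_{CAB}$; the sketch follows the textual description and uses $\hat {F}_{in}$ in that position.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (no learned affine).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhca(q_src, k_src, v_src, num_heads=2):
    """Multi-head cross attention; learned projections omitted (assumption)."""
    n, d = q_src.shape
    dh = d // num_heads
    # Split channels into heads: (heads, tokens, dh).
    q = q_src.reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = k_src.reshape(-1, num_heads, dh).transpose(1, 0, 2)
    v = v_src.reshape(-1, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    return (attn @ v).transpose(1, 0, 2).reshape(n, d)

def dafeb(f_in, f_s, f_e, w1, w2):
    w, h = f_in.shape[:2]
    s_hat = f_s.reshape(-1, 4)    # [w/2, h/2, 4] -> [wh/4, 4]
    e_hat = f_e.reshape(-1, 4)    # [w/2, h/2, 4] -> [wh/4, 4]
    in_hat = f_in.reshape(-1, 4)  # [w, h, 1]     -> [wh/4, 4]
    f_ca1 = layer_norm(s_hat + mhca(s_hat, e_hat, e_hat))    # Eq. (3)
    f_ca2 = layer_norm(in_hat + mhca(in_hat, f_ca1, f_ca1))  # Eq. (4), query from F_in
    mlp = np.maximum(f_ca2 @ w1, 0.0) @ w2                   # two-layer MLP
    out = layer_norm(f_ca2 + mlp)                            # Eq. (2) before reshape
    return out.reshape(w // 4, h // 4, 16)                   # [w/4, h/4, 16]

rng = np.random.default_rng(0)
w, h = 8, 8
f_in = rng.standard_normal((w, h, 1))
f_s = rng.standard_normal((w // 2, h // 2, 4))
f_e = rng.standard_normal((w // 2, h // 2, 4))
w1 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((8, 4))
f_dafeb = dafeb(f_in, f_s, f_e, w1, w2)
print(f_dafeb.shape)
```

All three reshapes preserve the total number of elements ($wh$), which is why the three inputs can be attended against each other as sequences of $\frac{wh}{4}$ tokens with 4 channels each.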

The fusion process follows the direction from low resolution to high resolution (the corresponding pseudocode is described in Algorithm 1), so that the corresponding details are supplemented. These features cover different scales and different semantic levels. It not only enhances the attention of the network to object edges, but also supplements the useful information of the global edge features to the objects. In this manner, edges with semantic information can improve the ability of distinguishing between classes especially on the similar external features of adjacent objects, and can strengthen the optimization of semantic features.

Algorithm 1 SEG

Input: Thermal images ${Input}_{T}$ :

${Input}_{T}=\left \{{{ {Input}_{T}^{1},{Input}_{T}^{2},\ldots ,{Input}_{T}^{n} }}\right \}$ ;

Result: Semantic feature maps ${Output}_{s}$ and edge feature maps ${Output}_{e}$ :

${Output}_{s}=\left \{{{ {Output}_{s}^{1},{Output}_{s}^{2},\ldots ,{Output}_{s}^{n} }}\right \}$ ;

${Output}_{e}=\left \{{{ {Output}_{e}^{1},{Output}_{e}^{2},\ldots ,{Output}_{e}^{n} }}\right \}$ .

/ n denotes the number of frames in the dataset /

for each ${Input}_{T}^{i}$ in ${Input}_{T}$ do

Feed ${Input}_{T}^{i}$ separately into the first two encoders of the semantic branch and the edge branch to generate the corresponding intermediate features $F_{s1}^{i}$ and $F_{e1}^{i}$ ;

Compute $F_{DAFEB1}^{i}=DAFEB({Input}_{T}^{i},F_{s1}^{i},F_{e1}^{i})$ and feed it separately into the next two encoding modules of the semantic branch and the edge branch to generate the corresponding intermediate features $F_{s2}^{i}$ and $F_{e2}^{i}$ ;

Compute $F_{DAFEB2}^{i}=DAFEB(F_{DAFEB1}^{i},F_{s2}^{i},F_{e2}^{i})$ and feed it separately into the first two decoders of the semantic branch and the edge branch to generate the corresponding intermediate features $F_{s3}^{i}$ and $F_{e3}^{i}$ ;

Compute $F_{DAFEB3}^{i}=DAFEB(F_{DAFEB2}^{i},F_{s3}^{i},F_{e3}^{i})$ and feed it separately into the last decoder of the semantic branch and the edge branch to generate the corresponding feature maps ${Output}_{s}^{i}$ and ${Output}_{e}^{i}$ ;

end for

return ${Output}_{s}$ , ${Output}_{e}$ ;
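The per-frame data flow of Algorithm 1 can also be sketched in Python. This is only an illustration of the control flow: `seg_forward`, `sem_stages`, `edge_stages` and `dafebs` are hypothetical names, and every stage is treated as an opaque callable rather than an actual network layer.

```python
def seg_forward(x, sem_stages, edge_stages, dafebs):
    """One frame through the skip edge guidance structure.

    sem_stages / edge_stages: four callables per branch (hypothetical grouping
    of the encoder and decoder stages named in Algorithm 1).
    dafebs: three callables taking (guide, f_s, f_e).
    """
    f_s = sem_stages[0](x)   # F_s1
    f_e = edge_stages[0](x)  # F_e1
    guide = x                # the first DAFEB receives the raw input
    for k in range(3):
        guide = dafebs[k](guide, f_s, f_e)  # F_DAFEB(k+1)
        f_s = sem_stages[k + 1](guide)      # next semantic feature / output
        f_e = edge_stages[k + 1](guide)     # next edge feature / output
    return f_s, f_e

# Toy stand-ins on plain numbers, just to exercise the control flow.
sem = [lambda v: v + 1] * 4
edge = [lambda v: v * 2] * 4
blocks = [lambda g, s, e: g + s + e] * 3
out_s, out_e = seg_forward(1, sem, edge, blocks)
print(out_s, out_e)  # 86 170
```

Each DAFEB output serves both as the input to the next stage of the two branches and as the guide passed to the next DAFEB, which mirrors the way an RNN carries its hidden state forward.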

C. Deep Supervision

During the decoding process, there are a total of 4 decoded feature outputs. Traditional encoder-based methods use only the last feature for prediction, which is insufficient and does not adapt well to targets of different scales.

Considering that, we utilize all the features in the decoding process and add an output layer after each decoded output to get the segmentation output of the network. Specifically, the decoding features of each layer are first interpolated to restore the resolution to the image input size, and then the number of output channels is adjusted by $1\times 1$ convolution. After that, all the decoded outputs are fused together and refined by a $3\times 3$ convolution.

In order to implement effective supervision on the whole network, we design a deep fusion feature supervision method. We supervise not only the side outputs of all decoded features, but also the fused output. A total of 5 losses jointly supervise the network. Due to the large imbalance in the number of pixels per class in the datasets used, we use the weighted cross-entropy function to train the network, and the weighted loss function can be calculated as:

\begin{equation*} L_{wce}\left (x\right )=-\frac {1}{N}\sum \limits _{i=1}^{N} \sum \limits _{c=1}^{C} y_{i,c}\cdot \log \left (x_{i,c}\right )\cdot \mathrm {w}_{c} \tag {5}\end{equation*}

where $y_{i,c}$ and $x_{i,c}$ denote the target label and predicted probability of class c at the i-th pixel in the batch, N denotes the number of pixels in one batch, C denotes the number of classes, and $\mathrm {w}_{c}$ indicates the weight of class c.

The loss function $L_{fuse}$ of the fusion output is calculated as:

\begin{align*} L_{fuse}& =L_{wce}\left (x_{fuse}\right ) \tag {6}\\ x_{fuse}& =\mathrm {Softmax}\left (\mathrm {w}_{fuse}\left |d_{i}\right |_{i=0}^{i=M}\right ) \tag {7}\end{align*}

where $d_{i}$ is the decoding feature and $\mathrm {w}_{fuse}$ is the fusion output weight. The overall loss function of the network is:

\begin{equation*} Loss=L_{fuse}+\sum \limits _{m=0}^{M} L_{wce}\left (x_{m}\right ) \tag {8}\end{equation*}

where $x_{m}$ is the prediction output of each decoding output layer and M is the decoding stage (in this paper $M=4$).
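As a rough illustration of (5) to (8), the following NumPy sketch computes the weighted cross-entropy and the deeply supervised total. It makes simplifying assumptions: pixels are flattened into rows, the fused prediction is a weighted sum of the side outputs (standing in for the convolutional fusion described above), and the example uses four side outputs plus the fused output, matching the stated total of five losses. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_ce(probs, labels, class_weights):
    """Eq. (5) with one-hot targets.
    probs: (N, C) predicted probabilities, labels: (N,) class indices,
    class_weights: (C,) per-class weights w_c."""
    n = labels.shape[0]
    picked = probs[np.arange(n), labels]  # x_{i,c} at the true class
    return float(-np.mean(class_weights[labels] * np.log(picked + 1e-12)))

def deep_supervision_loss(side_logits, labels, class_weights, fuse_weights=None):
    """Eq. (8): loss of the fused output plus the loss of every side output."""
    m = len(side_logits)
    if fuse_weights is None:
        fuse_weights = np.full(m, 1.0 / m)  # assumption: equal fusion weights
    fused = sum(w * d for w, d in zip(fuse_weights, side_logits))
    loss = weighted_ce(softmax(fused), labels, class_weights)  # L_fuse
    for d in side_logits:                                      # side outputs
        loss += weighted_ce(softmax(d), labels, class_weights)
    return loss

rng = np.random.default_rng(0)
num_pixels, num_classes = 6, 9
labels = rng.integers(0, num_classes, size=num_pixels)
side = [rng.standard_normal((num_pixels, num_classes)) for _ in range(4)]
weights = np.ones(num_classes)
total = deep_supervision_loss(side, labels, weights)
print(total > 0)
```

With uniform predictions and unit class weights, each term equals $\log C$, which gives a quick sanity check on an implementation.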

SECTION IV.

Experiments and Results

A. Dataset

Due to the lack of a dedicated thermal image edge detection dataset, we use the RGB image edge detection dataset BSDS500 [44] to train the edge detection network. BSDS500 includes 500 natural images with carefully annotated edges, with an average of 5 different objects to be detected per image. The dataset is divided into three parts: 200 images for training, 100 for validation, and the remaining 200 for testing.

We use the public RGB-T dataset provided by MFNet [15]. This dataset contains 1569 pairs of RGB-T images with a resolution of $480\times 640$, of which 749 pairs are taken at nighttime and 820 pairs are taken at daytime. The dataset labels eight classes of common obstacles encountered during driving (car, person, bike, curve, car stop, guardrail, color cone, and bump) plus a background class. The training set comprises 50% of the daytime images and 50% of the nighttime images, whereas the validation set and the test set each contain 25% of the daytime images and 25% of the nighttime images.

In addition, we use another public RGB-T dataset, PST900 [18], to verify that the improvements are not due to random chance. PST900 provides 894 RGB-thermal images with a resolution of $1280\times 720$, captured in cave and subterranean environments for the DARPA Subterranean Challenge. The dataset contains annotated segmentation labels for five classes, including one background class (i.e., unlabeled) and four object classes.

Data augmentation techniques include flipping, cropping and noise injecting.

B. Implementation Details

Before training, the RGB and thermal encoders are initialized with the ResNet weights provided by PyTorch.

During training, the semantic segmentation branch and the edge detection branch are trained as a whole. Since there are no paired RGB and thermal image edge detection datasets, we use the BSDS500 dataset to pretrain the edge detection network. In detail, we convert each RGB image of BSDS500 into a single-channel grayscale image as the input of the thermal branch, and feed the RGB image into the RGB feature extraction branch. The main purpose is to provide the whole network with an edge detector trained under supervision. The results are shown in the Appendix. After that, we use the learned parameters for further training on the RGB-T datasets.

Besides, to supervise and train the edge branch, we need to generate pseudo ground-truth binary edge label maps. Specifically, we check the neighbors within a $3\times 3$ window of each valid pixel in each semantic label map. If a valid pixel has at least one neighbor with a different semantic label, it is marked as an edge point.
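The $3\times 3$ neighbor check can be sketched in NumPy as follows. The treatment of invalid pixels is an assumption, since the text does not define it: here an optional, hypothetical `ignore_label` marks pixels that are neither tested nor counted as differing neighbors, and image borders are handled by replicating the border pixels.

```python
import numpy as np

def edge_labels(label_map, ignore_label=None):
    """Binary edge map: 1 where a valid pixel has a 3x3 neighbor with a different label."""
    lab = np.asarray(label_map)
    h, w = lab.shape
    # Replicate borders so that border pixels are only compared with real labels.
    padded = np.pad(lab, 1, mode="edge")
    edge = np.zeros((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neigh = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            differs = neigh != lab
            if ignore_label is not None:
                differs &= neigh != ignore_label  # skip invalid neighbors
            edge |= differs
    if ignore_label is not None:
        edge &= lab != ignore_label               # only valid pixels can be edges
    return edge.astype(np.uint8)

# Two regions split down the middle: the two columns at the boundary are edges.
labels = np.array([[0, 0, 1, 1]] * 4)
print(edge_labels(labels))
```

Note that this rule marks pixels on both sides of a class boundary, so the resulting edges are two pixels wide.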

We implement CMEFNet in PyTorch and train and validate the network on a single 3090 graphics card. The network is trained for 300 epochs with a batch size of 2, using a stochastic gradient descent optimizer with a momentum of 0.9 and a weight decay of 0.005, following [27].

C. Evaluation Metrics

For quantitative evaluation, we use mean Intersection over Union (mIoU) to evaluate semantic segmentation performance. It is calculated as follows:

\begin{equation*} mIoU=\frac {1}{K}\sum \limits _{i=1}^{K} \frac {{TP}_{i}}{{TP}_{i}+{FP}_{i}+{FN}_{i}} \tag {9}\end{equation*}

where ${TP}_{i}$, ${FP}_{i}$ and ${FN}_{i}$ denote the total numbers of true positives, false positives and false negatives for class i, respectively, and K represents the number of categories in the dataset ($K=9$ in this article).
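A minimal NumPy sketch of (9) is given below. It accumulates the counts over whatever arrays it is given, and it skips classes that appear in neither the prediction nor the ground truth; that skipping rule is an assumption made to avoid division by zero and is not specified in the text. The function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Eq. (9): average of TP / (TP + FP + FN) over classes."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:  # assumption: classes absent from both are skipped
            ious.append(tp / denom)
    return float(np.mean(ious))

# Class 0: TP=1, FP=0, FN=1 -> 1/2; class 1: TP=2, FP=1, FN=0 -> 2/3.
target = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)
print(round(miou, 4))  # 0.5833
```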

D. Comparative Results and Discussion on MFNet Dataset

To test the segmentation performance of CMEFNet, this section performs experimental analysis on the RGB-T dataset.

We compare with three unimodal models (U-Net, SegNet, and Deeplabv3), which take 4-channel inputs formed by concatenating thermal and RGB images; one representative RGB-D model (FuseNet), where the depth channel is replaced by the thermal channel; and twelve representative RGB-T models. The quantitative comparison results are shown in Table 1, where the results of models marked with “$\ast $” were computed using their released code (with their original training strategy to ensure fairness) and the others are taken from their own studies; bold numbers indicate the best results. Note that the mIoU is calculated with the unlabeled class included, but the results for the unlabeled class are not displayed, which is consistent with existing methods.

TABLE 1 Comparison Results (IoU, %) of Typical Methods on MFNet RGB-T Dataset

From the results in Table 1, it is obvious that our method CMEFNet outperforms most models in terms of mIoU (improving 0.9% over the second best model) and achieves the best performance in 4 of all 8 classes, especially on the important traffic participants like pedestrian and vehicle categories (improving 0.4-0.7% over the second best model). Comparing the three-channel (3C) and four-channel (4C) results of single-modal models, we can see that the latter all perform better in terms of mIoU and in almost all categories (except for cars). These uniform improvements indicate the effectiveness of introducing thermal image data in benefitting overall performance.

The visual comparison results are presented in Figure 6 and Figure 7, for nighttime and daytime scenes respectively. Among the open-access methods, we show CMEFNet, Deeplabv3, FuseNet, and RTFNet, which have better mIoU performance, and omit UNet, SegNet, and MFNet.

FIGURE 6.

Visual comparisons with representative methods (nighttime).

FIGURE 7.

Visual comparisons with representative methods (daytime).

Figure 6 demonstrates that segmentation using both visible RGB images and thermal images gives satisfactory results in night scenes. It should be noted that some targets are missing from the RGB-T dataset labels, such as the distant pedestrians in the first column. Night scenes are poorly lit, so the visible RGB image contains very little information. The thermal images, however, supply additional scene information and significantly improve segmentation accuracy at night. Compared with other methods, the segmentation results of CMEFNet have more complete targets and clearer edges. The other methods produce missegmented regions, and their edge quality also needs improvement.

From Figure 7, we can see that CMEFNet segments daytime scenes well, not only capturing the tiny pedestrians in the distance but also segmenting targets in front of complex backgrounds. Together with the nighttime results, this suggests that the proposed CMEFNet provides superior performance under various lighting conditions.

For further evaluation, we also compare the methods separately under nighttime and daytime conditions (see Table 2). The proposed method achieves superior performance in both daytime and nighttime scenes. In addition, comparing the first 6 rows shows that the models based on multispectral data fusion achieve better segmentation results in both day and night, with segmentation accuracy significantly improved over the single-modality setting. This indicates that fusing different perceptual data has a positive effect on understanding the same scene.

TABLE 2 Comparison Results (mIoU, %) of Typical Methods in Nighttime and Daytime

E. Comparative Results and Discussion on PST900 Dataset

To further evaluate the effectiveness of the proposed CMEFNet, we also conduct a quantitative analysis on the PST900 RGB-T dataset. Table 3 lists the evaluation results.

TABLE 3 Comparison Results (IoU, %) of Typical Methods on PST900

The proposed CMEFNet outperforms the previous SOTA method by 1.7% in mIoU, demonstrating its generalization ability.

F. Comparison of Computation Time

In addition, we compare the tradeoff between efficiency and performance of the proposed method with that of other methods, as shown in Table 4. CMEFNet achieves higher performance than the compared methods. Although UNet, SegNet, MFNet and FuseNet have an advantage in inference time, they do not provide satisfactory segmentation results. RTFNet, whose segmentation results are closest to those of CMEFNet, has a much longer inference time, which limits its timeliness. CMEFNet therefore offers a better balance of effectiveness and efficiency, making it an efficient algorithm suitable for real-time applications.
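The text does not describe how the inference times in Table 4 were measured. A minimal sketch of one common approach, assuming a callable `infer` that runs a single forward pass, is to discard a few warm-up runs and average the wall-clock time of the remaining ones. The names `mean_latency_ms` and `dummy_model`, and the warm-up and run counts, are hypothetical. On a GPU, the device would additionally need to be synchronized before each timestamp, which this CPU-only sketch omits.

```python
import statistics
import time


def mean_latency_ms(infer, sample, warmup=3, runs=20):
    """Average wall-clock time of infer(sample) in milliseconds."""
    for _ in range(warmup):
        infer(sample)  # warm-up runs are excluded from the measurement
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def dummy_model(x):
    """Stand-in for a network forward pass (hypothetical)."""
    return [v * 2 for v in x]


latency = mean_latency_ms(dummy_model, list(range(1000)))
print(f"mean latency: {latency:.3f} ms")
```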

TABLE 4 Quantitative Comparison of Various Methods With Time (ms)

G. Ablation Study

Various ablation studies were conducted to verify the effectiveness of each component in CMEFNet, including the effects of hierarchical attention based fusion module (“HAF”), skip edge guidance structure (“SEG”) and deep supervision strategy (“DS”). We use the same parameters to train these methods. The results are shown in Table 5.

TABLE 5 Results of Ablation Studies for Network Components

1) Effectiveness of HAF

To verify the effectiveness of HAF, we replace it with a traditional skip connection, obtaining the results listed in row 1 of Table 5. The mIoU decreases by 1.6%, which supports the contribution of HAF in effectively reweighting and fusing RGB-T features and improving semantic segmentation results.

Moreover, we report additional experiments in Table 6, where row 1 indicates that the fusion is performed only at the last layer of the encoder. The sustained improvements verify the effectiveness of feature fusion across different feature scales.

TABLE 6 Results of Feature Fusion Across Different Feature Scales

2) Effectiveness of SEG

To verify the effectiveness of SEG, we remove it (WO) and, separately, replace it with single-level skip connections (W/Skip, a series of convolutions applied after directly adding intermediate edge features and semantic features), obtaining the results listed in rows 2 and 3 of Table 5. The performance of WO and W/Skip decreases by approximately 3.1% and 2.3%, respectively. This verifies the importance of guidance by edge information and the effectiveness of SEG in comprehensively fusing cross-dimensional features.

3) Effectiveness of DS

We remove the lateral fusion output of CMEFNet and supervise only the decoded output of the last layer (W/Oneout), obtaining the results listed in row 4 of Table 5. Comparing rows 4 and 5, we note that our deep fusion feature supervision has a positive effect on segmentation accuracy relative to single-output supervision (an increase of 2.7%). Moreover, the results also show that the strategy of convolution optimization after fusing all decoding features is more advantageous in feature utilization, because the multi-scale decoding features enhance the network’s ability to adapt to objects of different sizes.
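The difference between single-output supervision and deep supervision can be sketched as a loss that sums a per-output term over every supervised prediction. The example below is an illustration only, not the loss used in CMEFNet: the cross-entropy term, the equal weights, the function names and the toy probabilities are all assumptions, and every output is assumed to have already been resized to the label resolution.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood.

    probs[n][c] is the predicted probability of class c for pixel n;
    labels[n] is the true class of pixel n.
    """
    eps = 1e-12  # avoids log(0)
    total = sum(math.log(p[y] + eps) for p, y in zip(probs, labels))
    return -total / len(labels)


def deep_supervision_loss(outputs, labels, weights=None):
    """Weighted sum of the loss of every supervised output."""
    if weights is None:
        weights = [1.0] * len(outputs)  # assumption: equal weighting
    return sum(w * cross_entropy(o, labels) for w, o in zip(weights, outputs))


# Two pixels, two classes (hypothetical values).
labels = [0, 1]
final_out = [[0.9, 0.1], [0.2, 0.8]]  # decoded output of the last layer
side_out = [[0.6, 0.4], [0.5, 0.5]]   # an additional supervised output

single = deep_supervision_loss([final_out], labels)          # one supervised output
deep = deep_supervision_loss([final_out, side_out], labels)  # two supervised outputs
print(single, deep)
```

With only one output in the list the function reduces to ordinary single-output supervision, which corresponds to the W/Oneout setting described above.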

SECTION V.

Discussion

Despite the above advantages, there are still areas where our work can be improved. The IoUs of our method are not the best in every class, especially when segmenting small objects with sparse geometric information, such as guardrails and color cones. These failures are visible in the visualized results. Specifically, the color cones highlighted by the red circle in Figure 6 do not have the correct shapes, and the guardrail highlighted by the red circle in Figure 7 is confused with the pedestrian. This is because the ResNet-based backbone is not specially designed for small-scale objects. Therefore, our future work will focus on improving the segmentation accuracy of these classes by adding a multi-scale learning module, designing new loss functions and introducing recently released high-performance backbones (such as ViT).

SECTION VI.

Conclusion

In this paper, we propose CMEFNet for semantic segmentation of RGB-T images of dynamic traffic scenes under variable illumination conditions. First, the multispectral images are passed through two similar but independent encoders. Then, the hierarchical attention based feature fusion module achieves feature complementarity in the encoding stage. Subsequently, a newly designed skip edge guidance structure optimizes feature extraction with edge features. Finally, a deep supervision mechanism is applied to mix the multiscale decoded features and the fused features. The proposed CMEFNet achieves efficient and effective segmentation results on public RGB-T datasets when compared to SOTA methods in both nighttime and daytime. The effectiveness of our network components is demonstrated by a thorough ablation study.

Appendix

To verify the effectiveness of the edge detection network designed in this paper, we train and test it on the BSDS500 dataset and compare it with the common method HED [45]. The comparison results are shown in Table 7, and the detection results are shown in Figure 8 and Figure 9.

TABLE 7 Comparison Results

FIGURE 8.

Visible results of edge detection on BSDS500 dataset.

FIGURE 9.

Visible results of thermal image edge detection.

