Introduction
Road information [1] is one of the foundational elements of geographic databases and a core component of electronic navigation maps. It plays a vital role in drone operations, where air platforms can either accurately strike roads to disrupt traffic or rapidly gather road network information in unfamiliar regions. When the wireless positioning systems of drones are disrupted, the unique topology of road networks can aid in auxiliary positioning and target search tasks. However, roads are often challenging to precisely identify in aerial images because they tend to appear as inconspicuous lines.
High-resolution remote sensing images (HRSIs) [2], [3], [4] offer exceptional spatial resolution, detailed spatial data, and complex surface features, enhancing our understanding of Earth’s terrain. These images are crucial for earthquake response, vehicle navigation, and disaster assessment. However, they also present challenges such as geometric complexity, diverse texture information, and limited spectral data, leading to spectral differences, spatial data loss, and registration inaccuracies between images. Seasonal changes and varying lighting conditions can create unique, nonuniform spectral features that affect road recognition accuracy. Additionally, shadows cast by trees and buildings can obscure data and disrupt road continuity, and they are sometimes misclassified as road features [5]. The visual similarity between dirt roads and railways further complicates feature differentiation. Current research focuses on effectively extracting comprehensive feature information from remote sensing images while minimizing redundant data interference to improve segmentation accuracy. Historically, road extraction methods have depended predominantly on manually crafted techniques that differentiate road-related data from intricate backgrounds by leveraging spectral, geometric, chromatic, textural, and topological attributes, alongside machine learning algorithms for identifying road features [6], [7], [8]. However, the accuracy of these methods usually falls short of the desired outcome.
Deep learning has transformed road extraction from remote sensing images, with convolutional neural networks (CNNs) [9] becoming central to this process. CNNs are now the backbone of computer-aided road extraction, with most methods employing encoder-decoder architectures. U-Net [10] has served as the foundation for numerous studies on road extraction from remote sensing images, addressing the challenge of automatic road detection across varying spatial resolutions. Subsequently, well-known models such as DeepLabv3 [11] and D-LinkNet [12] have been developed based on U-Net’s symmetrical encoder-decoder architecture with skip connections, which effectively extracts and reconstructs fine image details. The encoder processes and downsamples the input image through convolution operations, while the decoder restores the spatial dimensions of these features and generates a classification map matching the size of the input image [13], [14], [15], [16], [17], [18], [19]. This approach effectively extracts semantic features of roads in complex scenes. Yuan et al. [20] introduced SCTransNet, a spatial channel cross-transformer network that addresses these challenges by employing spatial channel cross-transformer blocks on top of the skip connections in a U-shaped architecture. Yao et al. [21] proposed an iterative semisupervised CNN framework based on active learning and superpixel segmentation techniques, dubbed SA-CNNs. While most foundation models are tailored to process RGB images for various visual tasks, Hong et al. [22] introduced SpectralGPT, the first universal remote sensing model designed to process spectral images using a novel 3-D generative pretrained transformer. This model handles images of varying sizes, resolutions, time series, and regions through progressive training, optimizing the use of extensive remote sensing data. However, relying solely on a single source of information makes it difficult to detect roads from aerial or remote sensing images, especially when roads are heavily obscured by trees. Therefore, extracting roads from single-modal satellite remote sensing images remains highly challenging.
Many current studies have introduced other modal data such as GPS trajectories [23] or LiDAR [24] to overcome the difficulties in road feature extraction, helping HRSIs acquire more accurate road information. If an area contains a large number of GPS trajectories, it is likely to correspond to a road structure, which makes trajectory-based road extraction highly feasible. In addition, LiDAR data contain depth and distance information and can distinguish roads, buildings, and trees based on their different laser reflectivity characteristics. Qi et al. [25] introduced a dual enhancement module for channels, promoting interaction between the two modalities to complement the information missing from each single modality. Hong et al. [26] proposed a decoupled-and-coupled network called DC-Net for the HS-SR task. It is a novel progressive fusion framework, from pixel-level to subpixel-level fusion and from image-level to feature-level fusion. Wu et al. [27] proposed a deep learning framework for multimodal remote sensing data classification (CCR-Net), using CNNs as the backbone and featuring an advanced cross-channel reconstruction module. This approach has greatly inspired our design and accelerated the development of feature extraction strategies for multimodal models.
Previous research methods can be broadly classified into two categories. The first involves simple feature fusion, where remote sensing images and GPS trajectories/LiDAR images are concatenated and then fed into a semantic segmentation network. This can lead to redundant road segments being treated as noise, which significantly impacts model performance and hinders multimodal integration. The second approach first extracts features from individual modalities and then uses the differences between modalities to complement the extracted features. However, these methods can result in over-segmentation due to similar terrain features, such as rivers or light rail, creating redundant or erroneous information that cannot be corrected by the other modality. Furthermore, multimodal fusion incurs substantial computational complexity, which makes it challenging to deploy the model on mobile devices. To address these problems, we propose a multimodal fusion strategy, called the lightweight cross-modal information interaction network for road feature extraction (LCIRE-Net), which fully leverages the complementarity between pairs of modalities among HRSIs, GPS trajectories, and LiDAR. Specifically, LCIRE-Net designs two encoders for modality feature learning; subsequently, we propose a cross-modal information dynamic interaction (CMIDI) mechanism, which refines different modal features through mutual information complementation via a progress propagator. To further enhance robustness, we integrate CMIDI into each layer of the encoder to enhance the features of both modalities layer by layer. Eventually, the outputs of the two encoders are fused and weighted through a multimodal feature fusion module (MFFM) to improve prediction accuracy. To address issues such as poor edge feature extraction, we adopt a feature refinement and enhancement module (FREM) between the encoder and decoder to expand the receptive field, allowing more detailed features to enter the decoder. Our main contributions are summarized as follows.
We propose the CMIDI mechanism, which enhances complementary information at different scales between multimodal images, supplementing the information differences between features extracted from each downsampling and those extracted from another modality.
Based on D-LinkNet, we replace the original residual blocks with an improved ghost basic block, reducing parameter computation and enhancing inference speed.
An MFFM is designed after the encoder outputs to better fuse features from both modalities with minimal noise interference.
We design the FREM for the connection between the encoder and decoder, which improves the extraction of edge details by refining the spatial elements of feature maps.
The rest of this article is organized as follows. Section II reviews related work from single-modal and multimodal perspectives. Section III elaborates the proposed LCIRE-Net and details each newly developed module. Section IV presents experimental results in comparison with current SOTA methods. Finally, Section V draws a conclusion with a possible future outlook.
Related Works
A. Single-Modal-Based Road Feature Extraction Method
Based on the type of input data, previous research methods can be divided into three categories. We review the related work in each category in depth below.
Remote Sensing Images-Based Road Extraction: With the rapid development of satellite remote sensing imaging technology, it has become feasible to obtain a large number of HRSIs conveniently. Early works typically relied on handcrafted texture, contour features, and shallow models (deformable models [28] and Markov random fields [29]) to identify road features. However, these traditional methods often struggle to capture high-level semantic information, significantly limiting the model’s ability to accurately extract roads. In recent years, CNNs, known for their excellent representation learning, have gradually become the mainstream models in this field. Yang et al. [30] proposed a method in which recurrent units replace the traditional convolutional units in U-Net, enabling the preservation of detailed spatial information through multiple summations using dilated convolutions [31]. This approach is crucial for enhancing the model’s ability to capture details when extracting roads from high-resolution satellite images. Zhang et al. [32] pioneered an end-to-end road segmentation method that significantly improves the model’s perception of road edges and shapes by effectively leveraging the multilevel features of convolutional layers. This approach addresses the imbalance between CNN depth and spatial resolution. However, relying solely on visual data for road extraction remains challenging, especially in complex or occluded environments. Therefore, exploring additional data sources is crucial for enhancing accuracy.
GPS Trajectories-Based Road Extraction: Studies have leveraged vehicle trajectories to identify road segments, under the assumption that dense GPS data indicates road presence. While this approach improves road extraction, it has limitations. For example, parking lots can be misclassified as roads due to high trajectory density [33], and GPS data may become unreliable in areas with poor signal, such as tunnels or mountainous regions. Communication delays can lead to unstable trajectories, complicating accurate road width measurement, which high-resolution imagery can more effectively provide. Previous research has focused on reducing GPS noise and uncertainty through methods such as cluster-based models, trajectory merging, kernel density estimation, and, more recently, neural network-based approaches. Ruan et al. [34] proposed a deep learning-based framework that infers road centerlines from trajectory data in spatial and transitional views. Due to the constraints of GPS trajectory information and the challenges of GPS noise reduction, these methods still face limitations in using GPS trajectories for road extraction.
LiDAR-Based Road Extraction: Compared to aerial images, LiDAR data provides depth and distance information, offering unique characteristics based on the different reflectivity of objects such as buildings, trees, and roads. This makes the smoothness of road surfaces prominent, aiding in distinguishing road proposals from buildings and trees. Many researchers have designed algorithms to identify roads using LiDAR data. For example, after obtaining ground intensity images, Hu et al. [35] designed structural templates to search for roads and determined road width and direction based on LiDAR characteristics. Despite certain advancements, challenges remain in LiDAR-based road extraction due to the sparsity of LiDAR data and the adverse effect of noisy points in complex scenes.
B. Multimodal-Based Road Feature Extraction Method
Each modality of remote sensing images, GPS trajectories, and LiDAR has its own advantages and disadvantages. Therefore, an effective method for road extraction research is to combine these single modalities to utilize the complementary useful information between them. Xu et al. [36] first segmented road primitives from optical images and LiDAR data, then used an iterative Hough transform algorithm to detect road stripes, and finally formed the road network structure through topological analysis. Parajuli et al. [37] developed a modular deep convolutional network called TriSeg, which involves using two SegNets to extract features from remote sensing images and LiDAR data separately, and another SegNet fusion module to estimate the final road map.
Single modalities such as remote sensing images or LiDAR data often lack sufficient detail to identify roads obscured by trees or buildings, leading to segmentation gaps. Combining visual information helps eliminate noise, reveals hidden roads, and accurately distinguishes false nonroad areas. Therefore, Xu et al. [38] integrated GPS trajectory maps and remote sensing images into neural networks such as U-Net, Res-UNet, LinkNet, and D-LinkNet for road or semantic segmentation to improve the accuracy of road segmentation and prediction. Liu et al. [18] input GPS trajectory maps and remote sensing images into different networks for feature extraction and fused modular features from multiple layers to predict the final roads. Despite these advancements, the fusion methods did not fully utilize the complementarity of different modalities. Bai et al. [39] designed a new model to accurately acquire road information. They introduced an OR operation-based fusion strategy to combine image and trajectory data to extract road information, avoiding the impact of trajectory noise on network training. Hong et al. [40] proposed HighDAN, a high-resolution domain adaptation network, to enhance AI model generalization across multiple cities. HighDAN effectively preserves the spatial topological structure of urban scenes through parallel high-to-low resolution fusion and uses adversarial learning to bridge the representation gap between remote sensing images from different cities.
Although these methods improve accuracy over single-modal approaches, they generally use a single downsampling module to extract features from one modality and fuse them with labels from another, leading to noise and inadequate information complementarity. We propose a CMIDI module to fully utilize complementary information between modalities and an MFFM to minimize noise interference, thus enhancing road feature extraction. Additionally, we have implemented lightweight techniques to ensure efficient deployment on mobile edge computing devices.
Proposed Approach
A. Network Architecture
The overall network architecture design of LCIRE-Net is shown in Fig. 1. We combined the structural diagrams of most semantic segmentation networks, using the encoder and decoder as the main framework structure of our network. Since it is a multimodal data task for road feature extraction, we designed a downsampling parallel dual-branch road feature extraction structure in the encoder, with one branch taking remote sensing images as input and the other branch taking GPS trajectory/LiDAR as input.
Overall architecture of the proposed LCIRE-Net for road extraction. Our framework consists of the following three parts. 1) Dual-input encoder with CMIDI mechanism for cross-modal scenario tasks. 2) MFFM is employed to capture more complex semantic information. 3) FREM for edge feature enhancement. Moreover, the proposed LCIRE-Net is also suitable for RSIs and LiDAR scenarios.
We designed the CMIDI mechanism to dynamically propagate the global context information and local detail information of the two different modalities, achieving complementation of the differences between them and thereby mutually improving and enhancing the features of each modality. Simultaneously, the interacted difference information in the CMIDI is used as a skip connection and cascaded with the upsampled feature maps in the decoder to obtain feature maps of the same size.
In addition, we have illustrated the process of extracting roads from remote sensing images and LiDAR in LCIRE-Net in the lower right corner of Fig. 1, which is consistent with the principles of remote sensing images and GPS trajectories.
B. CMIDI Mechanism
For specific cross-modal feature learning, we set up a CMIDI mechanism based on a message-passing mechanism in the encoder. This module uses information from the two modalities to optimize the extracted road features mutually. It identifies the differential information between the features extracted at each downsampling layer and further dynamically complements and interacts with these differences. In this section, we take the refinement of the HRSI and GPS trajectory/LiDAR features at one encoder layer as an example to describe the process.
As shown in Fig. 2, the input features $F_{\mathrm{HRSIs}}$ and $F_{\mathrm{GPS/LiDAR}}$ are first refined by $3\times3$ convolutions, pooled, and passed through a fully connected layer to obtain the shared descriptor $G$, which is then processed by $1\times1$ convolutions to produce $G_1$ and $G_2$
\begin{align*} F'_{\mathrm{HRSIs}}& ={\mathrm{Conv}}_{3\times 3}({F}_{\mathrm{HRSIs}}) \tag{1}\\ F'_{\mathrm{GPS/LiDAR}}& ={\mathrm{Conv}}_{3\times 3}({F}_{\mathrm{GPS/LiDAR}}) \tag{2}\\ G& =\mathrm{FC}\big (\mathrm{MaxPool}\big ({F}'_{\mathrm{HRSIs}} \oplus {F}'_{\mathrm{GPS/LiDAR}}\big)\big) \tag{3}\\ {G}_{1},{G}_{2}& ={\mathrm{Conv}}_{1\times 1}(G),\quad {G}_{1}={G}_{2}. \tag{4}\end{align*}
CMIDI mechanism structure, which mainly illustrates how the complementary differences between remote sensing image and GPS trajectory/LiDAR features are used to extract information and dynamically enhance the features of both modalities. Learnable fusion weights dynamically fuse local and global information to obtain cross-modal information. We set a threshold of 0.15 to decide whether the cross-modal refinement is applied.
Additionally, we designed an information balance constraint $\theta$ that decides whether the cross-modal refinement is applied
\begin{align*} \theta & =\mathrm{Sigmoid}\big ({G}_{1}-{F}'_{\mathrm{HRSIs}},\, {G}_{2}-{F}'_{\mathrm{GPS/LiDAR}}\big) \tag{5}\\ {F}''_{\mathrm{HRSIs}}& ={G}_{1} \otimes {F}_{\mathrm{HRSIs}},\quad \theta \geq 0.15 \tag{6}\\ {F}''_{\mathrm{GPS/LiDAR}}& ={G}_{2}\otimes {F}_{\mathrm{GPS/LiDAR}},\quad \theta \geq 0.15 \tag{7}\\ {F}''_{\mathrm{HRSIs}}& ={F}_{\mathrm{HRSIs}},\quad {F}''_{\mathrm{GPS/LiDAR}}={F}_{\mathrm{GPS/LiDAR}},\quad \theta \lt 0.15 \tag{8}\end{align*}
where $F''$ denotes the refined output of the corresponding modality.
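A minimal PyTorch sketch of how (1)-(8) could be wired together is given below. Realizing the FC layer as a $1\times1$ convolution, reading $\oplus$ as element-wise summation, and reducing the two-argument sigmoid in (5) to a single scalar gate per sample are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CMIDI(nn.Module):
    """Sketch of the cross-modal information dynamic interaction mechanism, Eqs. (1)-(8)."""

    def __init__(self, channels, threshold=0.15):
        super().__init__()
        self.conv_rs = nn.Conv2d(channels, channels, 3, padding=1)   # Eq. (1)
        self.conv_gps = nn.Conv2d(channels, channels, 3, padding=1)  # Eq. (2)
        self.pool = nn.AdaptiveMaxPool2d(1)                          # MaxPool in Eq. (3)
        self.fc = nn.Conv2d(channels, channels, 1)                   # FC in Eq. (3), as a 1x1 conv
        self.split = nn.Conv2d(channels, 2 * channels, 1)            # Eq. (4): yields G1 and G2
        self.threshold = threshold

    def forward(self, f_rs, f_gps):
        f_rs_p = self.conv_rs(f_rs)
        f_gps_p = self.conv_gps(f_gps)
        g = self.fc(self.pool(f_rs_p + f_gps_p))                     # Eq. (3), '⊕' read as summation
        g1, g2 = self.split(g).chunk(2, dim=1)                       # Eq. (4)
        # Eq. (5): gate derived from the gap between global context and each modality
        theta = torch.sigmoid((g1 - f_rs_p) + (g2 - f_gps_p)).mean(dim=(1, 2, 3), keepdim=True)
        gate = (theta >= self.threshold).float()                     # Eqs. (6)-(8): refine only when the gap is large
        f_rs_out = gate * (g1 * f_rs) + (1.0 - gate) * f_rs
        f_gps_out = gate * (g2 * f_gps) + (1.0 - gate) * f_gps
        return f_rs_out, f_gps_out
```

One CMIDI block of this form can be attached to every encoder stage, as described above.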
C. Multimodal Feature Fusion Module
To better fuse the features of the two modalities while producing less noise interference, we designed an MFFM. In Fig. 1, HRSIs and GPS trajectories/LiDAR are used as fusion objects. The encoder extracts road information from HRSIs or GPS trajectories/LiDAR, which are treated as two feature extraction tasks, i.e., two parallel downsampling modules. One branch is used only to extract features from HRSIs, and the other branch is used only to extract features from GPS trajectory/LiDAR images. Finally, the two output results are fused. To fully utilize the complementarity of road features in HRSIs and GPS trajectory/LiDAR images, we concatenate the two downsampled features in the fusion module along the channel dimension to combine their feature information in the spatial direction. Subsequently, the features are divided into left and right adaptive fusion branches to generate weights. The adaptive fusion module computes a softmax weight for each modality
\begin{align*} {S}_{\mathrm{HRSIs}}& =\frac{{e}^{{F}_{\mathrm{HRSIs}}\times {W}_{1}}}{{e}^{{F}_{\mathrm{HRSIs}} \times {W}_{1}}+{e}^{{F}_{\mathrm{GPS/LiDAR}} \times {W}_{2}}} \tag{9}\\ {S}_{\mathrm{GPS/LiDAR}}& =\frac{{e}^{{F}_{\mathrm{GPS/LiDAR}} \times {W}_{2}}}{{e}^{{F}_{\mathrm{HRSIs}} \times {W}_{1}}+{e}^{{F}_{\mathrm{GPS/LiDAR}} \times {W}_{2}}}. \tag{10}\end{align*}
The weights are then applied to the corresponding modality features
\begin{align*} {X}_{1}& ={S}_{\mathrm{HRSIs}} \odot {F}_{\mathrm{HRSIs}} \tag{11}\\ {X}_{2}& ={S}_{\mathrm{GPS/LiDAR}} \odot {F}_{\mathrm{GPS/LiDAR}}. \tag{12}\end{align*}
Two maxpool layers are used to extract edge and texture information from the features and suppress background information, compensating for lost feature details. The reweighted features of the remote sensing images and GPS trajectory/LiDAR images are added to obtain the combined feature, and the final fusion output is computed as
\begin{align*} {\mathrm{Output}}_{\mathrm{Fusion}} = \mathrm{Conv}\big(\mathrm{BN}\big(\mathrm{ReLU}\big(\mathrm{Concat}\big({X}_{1}+{X}_{2},\, \mathrm{Maxpool}({F}_{\mathrm{HRSIs}}),\, \mathrm{Maxpool}({F}_{\mathrm{GPS/LiDAR}})\big)\big)\big)\big). \tag{13}\end{align*}
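The following PyTorch sketch illustrates one way to realize (9)-(13); implementing $W_1$ and $W_2$ as $1\times1$ convolutions, the max-pooling window, and the Conv-BN-ReLU ordering of the fusion head are our assumptions.

```python
import torch
import torch.nn as nn

class MFFM(nn.Module):
    """Sketch of the multimodal feature fusion module, Eqs. (9)-(13)."""

    def __init__(self, channels):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, 1, bias=False)        # W1 in Eq. (9)
        self.w2 = nn.Conv2d(channels, channels, 1, bias=False)        # W2 in Eq. (10)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # edge/texture cues in Eq. (13)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rs, f_gps):
        # Eqs. (9)-(10): softmax competition between the two modality scores
        scores = torch.stack([self.w1(f_rs), self.w2(f_gps)], dim=0)
        s_rs, s_gps = torch.softmax(scores, dim=0)
        x1 = s_rs * f_rs                                               # Eq. (11)
        x2 = s_gps * f_gps                                             # Eq. (12)
        # Eq. (13): reweighted sum concatenated with max-pooled originals
        fused = torch.cat([x1 + x2, self.pool(f_rs), self.pool(f_gps)], dim=1)
        return self.fuse(fused)
```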
D. Lightweight Strategy in D-LinkNet
In recent years, designing lightweight CNN architectures [41] with small parameter counts and FLOPs [42] has been an important research topic in computer vision. This article proposes a lightweight CNN with fewer parameters at the encoder end. It is used for downsampling to extract road features and simplifies the redundancy of conventional network structures through a unified submodule design. By introducing the concept of depthwise separable convolutions from MobileNetV1, regular convolutions are decomposed into depthwise convolutions (DW-Convs) [43] and pointwise convolutions (PW-Convs) [44] for extracting spatial and channel features, respectively. This approach significantly reduces the computational load and parameters of the network, improving image processing speed.
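As an illustration, a ghost basic block in the spirit of GhostNet can be sketched as follows; the 50/50 channel split, kernel sizes, and residual form are assumptions rather than the exact TDNet configuration.

```python
import torch
import torch.nn as nn

class GhostBasicBlock(nn.Module):
    """Illustrative ghost basic block: a pointwise convolution produces the
    intrinsic features, a cheap depthwise convolution generates the ghost
    features, and a residual connection preserves the input."""

    def __init__(self, channels):
        super().__init__()
        assert channels % 2 == 0, "channel count must be even for the 50/50 split"
        half = channels // 2
        self.primary = nn.Sequential(                 # PW-Conv: channel mixing
            nn.Conv2d(channels, half, 1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                   # DW-Conv: cheap spatial features
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1) + x   # residual keeps gradient flow
```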
As shown in Fig. 3, when the input is a four-channel image, it is downsampled through stages 0-4 of the lightweight encoder (TDNet).
Lightweight structure in the encoder (TDNet). The downsampling layer’s entire process, from input to output, is divided into stages 0 to 4. In stage 0, a reduced max pooling technique is employed. Stage 1 utilizes a ghost basic block.
E. Feature Refinement and Enhancement Module
Solely relying on CNN for downsampling HRSIs and GPS trajectory/LiDAR data and multimodal fusion modules makes it difficult to fully retain the spatial details of the images. The loss of spatial details significantly affects the quality of fused images. Expanding the receptive field of roads improves the accuracy of road edge feature information, thereby obtaining effective multiscale information. Therefore, we designed the FREM to refine the spatial details of feature mappings.
Most CNNs are executed with square kernels that allow the network to learn feature mappings within a square window. However, square kernels cannot effectively capture the characteristics of linearly distributed strip features such as roads, and they inevitably include unrelated information from adjacent pixels, increasing unnecessary computational costs. To address these issues, we constructed a BLCM, which uses horizontal strip convolutions and vertical strip convolutions to convolve along the horizontal and vertical directions of the feature map, respectively. These directional convolutions capture features of different shapes and directions, enhancing the richness and accuracy of feature representation. To capture more comprehensive feature information and better adapt to edges and textures in different directions, we added depthwise separable convolutions and introduced channel attention (CA) after the horizontal and vertical convolutions. The horizontal branch thus consists of a horizontal strip convolution followed by a depthwise separable convolution and CA, and the vertical branch is constructed symmetrically, as sketched below.
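A minimal sketch of such a strip-convolution block is shown here; the strip length $k$, the SE-style channel attention, and the residual fusion of the two branches are our assumptions.

```python
import torch
import torch.nn as nn

class BLCM(nn.Module):
    """Sketch of the strip-convolution module: horizontal (1 x k) and vertical
    (k x 1) depthwise branches, pointwise mixing, and channel attention."""

    def __init__(self, channels, k=9, reduction=4):
        super().__init__()
        self.h_branch = self._branch(channels, (1, k), (0, k // 2))
        self.v_branch = self._branch(channels, (k, 1), (k // 2, 0))
        self.ca = nn.Sequential(                      # channel attention (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    @staticmethod
    def _branch(c, kernel, pad):
        return nn.Sequential(
            nn.Conv2d(c, c, kernel, padding=pad, groups=c, bias=False),  # strip depthwise conv
            nn.Conv2d(c, c, 1, bias=False),                              # pointwise mixing
            nn.BatchNorm2d(c),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.h_branch(x) + self.v_branch(x)   # fuse directional responses
        return x + y * self.ca(y)                 # attention-weighted residual
```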
As shown in Fig. 4, FREM first takes the output features of the MFFM, then uses five dilated convolutions with different dilation rates and five BLCMs to obtain multiscale road features. We adopt shared weights for the different BLCM branches to reduce the number of network parameters. The outputs of the five branches are processed through different transposed convolution layers to obtain feature maps of the same dimensions. These feature maps are then concatenated to merge a large amount of contextual information into a larger receptive field, thereby restoring lost details. This process can be represented as follows:
\begin{align*} {F}'_{1}& ={f}_{\mathrm{deconv1}}({f}_{\mathrm{BLCM}}({\mathrm{Dilate}}_{1}(F))) \tag{14}\\ {F}'_{2}& ={f}_{\mathrm{deconv2}}({f}_{\mathrm{BLCM}}({\mathrm{Dilate}}_{2}(F))) \tag{15}\\ {F}'_{3}& ={f}_{\mathrm{deconv3}}({f}_{\mathrm{BLCM}}({\mathrm{Dilate}}_{3}(F))) \tag{16}\\ {F}'_{4}& ={f}_{\mathrm{deconv4}}({f}_{\mathrm{BLCM}}({\mathrm{Dilate}}_{4}(F))) \tag{17}\\ {F}'_{5}& ={f}_{\mathrm{deconv5}}({f}_{\mathrm{BLCM}}({\mathrm{Dilate}}_{5}(F))) \tag{18}\\ \text{Feature Map}& ={\mathrm{Conv}}_{1\times 1}\big(\mathrm{Switch}\big(\mathrm{BN}\big(\mathrm{Concat}\big({F}'_{1},{F}'_{2},{F}'_{3},{F}'_{4},{F}'_{5}\big)\big)\big)\big) \tag{19}\end{align*}
where the concatenated output in (19) is denoted ${F}''$. The FREM output is then obtained through a residual connection with the input feature $F$
\begin{align*} \text{Out}={\mathrm{Conv}}_{2}({F}'' \oplus F). \tag{21}\end{align*}
Structure of FREM. In order to increase the convolution kernel’s receptive field and obtain edge feature information, the fused feature map is processed using dilated convolutions at different dilation rates. Each branch is then processed by the BLCM, and the branch outputs are merged through transposed convolutions and concatenation.
Here, $F$ denotes the input feature from the MFFM, ${\mathrm{Dilate}}_{i}$ is a dilated convolution with the $i$th dilation rate, ${f}_{\mathrm{BLCM}}$ and ${f}_{\mathrm{deconv}i}$ denote the shared BLCM and the $i$th transposed convolution, and ${F}'_{i}$ is the output of the $i$th branch.
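Putting the pieces together, a possible PyTorch realization of (14)-(21) is sketched below, reusing the BLCM sketch above; the dilation rates, the stride-1 transposed convolutions, and the use of ReLU in place of the "Switch" activation are assumptions.

```python
import torch
import torch.nn as nn

class FREM(nn.Module):
    """Sketch of the feature refinement and enhancement module, Eqs. (14)-(21),
    built on the BLCM sketch above (shared across the five branches)."""

    def __init__(self, channels, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        self.dilate = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False) for r in rates]
        )
        self.blcm = BLCM(channels)                                    # shared weights, Eqs. (14)-(18)
        self.deconv = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, 3, padding=1) for _ in rates]
        )
        self.project = nn.Sequential(                                 # Eq. (19): Concat -> BN -> act -> 1x1 conv
            nn.BatchNorm2d(len(rates) * channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(len(rates) * channels, channels, 1),
        )
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)   # Conv_2 in Eq. (21)

    def forward(self, f):
        branches = [dec(self.blcm(dil(f))) for dil, dec in zip(self.dilate, self.deconv)]
        feature_map = self.project(torch.cat(branches, dim=1))        # Eq. (19)
        return self.out_conv(feature_map + f)                         # Eq. (21), '⊕' read as summation
```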
Experiments and Results
A. Experimental Road Datasets
We selected three publicly available multimodal road datasets (BJRoad [45], Porto [46], and TLCGIS [47]) as experimental data for feature extraction and segmentation of roads in complex scenes using the semantic segmentation network. As shown in Fig. 5, these three datasets are publicly available and widely used for semantic segmentation in multimodal tasks.
Three road dataset image examples; the first three images in the first row are BJR, the last three images are PRD, the second row corresponds to GPS trajectory maps, and the third and fourth rows are two modal maps of TRD.
The BJRoad datasets (BJR) contain 1350 HRSIs covering an area of approximately 100 km². They include about 50 million GPS trajectory records from 28000 vehicles with features such as latitude and longitude, speed, direction, sampling interval, and vehicle status, collected from various devices with different sampling intervals and measurement resolutions. The image resolution is
The Porto road datasets (PRDs) are a multimodal dataset consisting of remote sensing images and GPS trajectories of Porto, the largest port city in Portugal, covering an area of approximately 209 km². It includes GPS trajectory information collected from 442 taxis between 2013 and 2014. Due to the lack of specific details about the training and test sets, we split the complete images of the area into 6048 nonoverlapping subimages, each with a resolution of
The TLCGIS road datasets (TRDs) contain 5860 pairs of remote sensing and LiDAR images. The resolution of these images is
B. Configurations and Implementation Details
The experimental system environment is Linux CentOS 7, using the PyTorch deep learning framework. Two 24 GB NVIDIA GeForce RTX 4090 GPUs are used for model training and testing, respectively. The complexity of the network is measured by the number of parameters and floating-point operations (FLOPs). Before training, the training images are further augmented by horizontal or vertical flipping, grid distortion, and 90° rotation, expanding the number of training images to three times the original training set.
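One possible implementation of this augmentation pipeline using the albumentations library is sketched below; the probabilities, image sizes, and the use of albumentations itself (rather than the authors' own pipeline) are assumptions.

```python
import numpy as np
import albumentations as A

# Placeholder arrays standing in for one training sample (sizes are illustrative).
hrsi = np.zeros((512, 512, 3), dtype=np.uint8)      # remote sensing image
gps_map = np.zeros((512, 512), dtype=np.uint8)      # GPS trajectory / LiDAR raster
road_label = np.zeros((512, 512), dtype=np.uint8)   # binary road mask

train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.GridDistortion(p=0.3),
        A.RandomRotate90(p=0.5),
    ],
    # apply identical geometric transforms to the second modality and the mask
    additional_targets={"trajectory": "image"},
)
augmented = train_transform(image=hrsi, trajectory=gps_map, mask=road_label)
```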
We selected the Adam optimizer [48] to minimize the loss function and find the optimal control parameters. The batch size for training was set to 4, the initial learning rate was set to 0.0001, and the total number of training and validation epochs was set to 200. We introduced a custom learning rate decay strategy that compares each new epoch's loss with the previous best. If the new loss is lower, the best loss is updated and the model weights are saved. If the loss does not improve for more than three epochs, the learning rate is divided by 5 for the following epochs. Training terminates if the updated learning rate falls below a preset minimum.
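The decay rule described above can be expressed as a small helper, sketched here under the assumption that the epoch loss drives the schedule; the checkpoint file name and the minimum learning rate are placeholders.

```python
import torch

def adjust_training(optimizer, epoch_loss, state, model,
                    patience=3, factor=5.0, min_lr=1e-7):
    """Save weights on improvement, divide the LR by `factor` after `patience`
    stale epochs, and return False once the LR drops below `min_lr`."""
    if epoch_loss < state["best_loss"]:
        state["best_loss"] = epoch_loss
        state["stale_epochs"] = 0
        torch.save(model.state_dict(), "best_lcire_net.pth")   # placeholder checkpoint path
    else:
        state["stale_epochs"] += 1
        if state["stale_epochs"] > patience:
            for group in optimizer.param_groups:
                group["lr"] /= factor                           # decay the learning rate
            state["stale_epochs"] = 0
    # keep training only while every parameter group stays above the minimum rate
    return all(group["lr"] >= min_lr for group in optimizer.param_groups)
```

The training loop initializes `state = {"best_loss": float("inf"), "stale_epochs": 0}` and stops when the helper returns False.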
For the BJRoad datasets, the original image resolution is
C. Evaluation Metrics
We use four evaluation metrics to measure the model’s extraction performance: Precision, Recall, IoU, and $F_1$-score
\begin{align*} \text{Precision} & = \frac{\mathrm{TP}}{\mathrm{TP + FP}} \tag{22}\\ \text{Recall} & = \frac{\mathrm{TP}}{\mathrm{TP + FN}} \tag{23}\\ {F_{1}} & = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\mathrm{TP}}{2\mathrm{TP + FN + FP}} \tag{24}\\ \text{IoU} & = \frac{\mathrm{TP}}{\mathrm{TP + FN + FP}} \tag{25}\end{align*}
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
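These metrics can be computed from binary prediction and label masks as follows; the small `eps` term guarding empty denominators is our addition.

```python
import numpy as np

def road_metrics(pred, target, eps=1e-8):
    """Precision, Recall, F1, and IoU for binary road masks, Eqs. (22)-(25).
    `pred` and `target` are 0/1 numpy arrays of the same shape."""
    tp = np.logical_and(pred == 1, target == 1).sum()
    fp = np.logical_and(pred == 1, target == 0).sum()
    fn = np.logical_and(pred == 0, target == 1).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * tp / (2 * tp + fn + fp + eps)
    iou = tp / (tp + fn + fp + eps)
    return precision, recall, f1, iou
```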
D. Hybrid Loss Function
This article combines distribution-based BCE loss [49] with region-based Dice loss [50] to form a composite loss aimed at optimizing the segmentation performance of the model. The Dice term enhances the overlap between the predicted and labeled road regions, while the BCE term preserves pixel-level texture details of both modalities and edge information. The specific formula for the hybrid loss function is as follows:
\begin{align*} {{L}_{\mathrm{Dice-Road}}}& =1-\frac{2\sum _{i}^{H\times W}\varphi _{\mathrm{true}}^{i}\varphi _{\mathrm{pred}}^{i}+1}{\sum _{i}^{H\times W}{\big(\varphi _{\mathrm{true}}^{i}\big)}^{2}+\sum _{i}^{H\times W}{\big(\varphi _{\mathrm{pred}}^{i}\big)}^{2}+1} \tag{26}\\ {{L}_{\mathrm{BCE-Edge}}}& = -\frac{1}{N} \sum _{i=1}^{N} \Big(\gamma \varphi _{\mathrm{true}}^{i} \log \big(\varphi _{\mathrm{pred}}^{i}\big) + \big(1-\varphi _{\mathrm{true}}^{i}\big) \log \big(1-\varphi _{\mathrm{pred}}^{i}\big)\Big) \tag{27}\\ {{L}_{\mathrm{Total}}}& ={L}_{\mathrm{Road}}+{L}_{\mathrm{Edge}} ={{L}_{\mathrm{Road}}}\big(\varphi _{\mathrm{true}},\varphi _{\mathrm{pred}}\big)+{{L}_{\mathrm{Edge}}}\big(\varphi _{\mathrm{true}},\varphi _{\mathrm{pred}}\big) \tag{28}\end{align*}
where $\varphi _{\mathrm{true}}^{i}$ and $\varphi _{\mathrm{pred}}^{i}$ denote the ground-truth and predicted values of pixel $i$, and $N=H\times W$ is the number of pixels.
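A compact PyTorch sketch of this hybrid loss is given below; using `BCEWithLogitsLoss` and exposing the BCE:Dice weights (set here to the 1:2 ratio selected later in the ablation study) as constructor arguments are our choices.

```python
import torch
import torch.nn as nn

class HybridRoadLoss(nn.Module):
    """Sketch of the BCE + Dice composite loss, Eqs. (26)-(28)."""

    def __init__(self, w_bce=1.0, w_dice=2.0, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.w_bce, self.w_dice, self.smooth = w_bce, w_dice, smooth

    def forward(self, logits, target):
        # logits, target: (B, 1, H, W); target holds 0/1 road labels as floats
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = (prob ** 2).sum(dim=(1, 2, 3)) + (target ** 2).sum(dim=(1, 2, 3))
        dice_loss = 1.0 - (2.0 * inter + self.smooth) / (denom + self.smooth)   # Eq. (26)
        bce_loss = self.bce(logits, target)                                      # Eq. (27)
        return self.w_bce * bce_loss + self.w_dice * dice_loss.mean()            # weighted Eq. (28)
```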
E. Evaluation of Model Performance
We selected ten networks for experimental comparison: U-Net, LinkNet [15], D-LinkNet, DeepLabv3+ [16], Res-UNet [51], TransUNet [17], DeepDualMapper [52], SA-Gate [53], CMMPNet [18], and CMIPNet [19]. The first six networks are single-modal, while the latter four are multimodal. Performance metrics including Precision, Recall, IoU, and $F_1$-score are reported for all methods.
Table I provides quantitative results for various classical methods, showing that multimodal models generally outperform single-modal models. Our proposed LCIRE-Net achieved the best IoU metrics on the three datasets (0.6623, 0.6757, and 0.6655). Additionally, it achieved the best or highly competitive results on the remaining metrics, as analyzed dataset by dataset below.
BJRoad Datasets: Among single-modal models, DeepLabv3+ performed the worst with an IoU of 51.07%, and U-Net had the worst Recall at 68.91%. On the other hand, D-LinkNet achieved the best $F_1$-score at 77.83%, and LinkNet achieved the best Precision at 77.33%. These methods were originally designed for single-modal feature extraction tasks, but we applied them directly to the multimodal task of HRSIs and GPS trajectories, limiting their feature capture and multimodal interaction capabilities and resulting in IoU below 60%. Among multimodal models, DeepDualMapper, using a gated fusion module for image and trajectory features, achieved a competitive IoU of 61.41%. CMMPNet, which is based on D-LinkNet, achieved a significant advantage with an IoU of 64.22%. CMIPNet, using an improved BN layer constraint factor, achieved a highly competitive IoU of 65.21%. Our LCIRE-Net achieved the best IoU of 66.23% and was also the best model in terms of Recall. This demonstrates that our designed CMIDI mechanism effectively addresses the noise interference common in cross-modal models, ultimately raising performance to a new level.
Porto Datasets: Compared to D-LinkNet, LCIRE-Net’s Recall increased by 7.6%, indicating that our model, based on image and GPS trajectory data, effectively extracts the complete road network structure. In terms of IoU, although LCIRE-Net’s performance is slightly lower than that of the latest CMIPNet, we lead by a narrow margin in the other metrics. Compared to single-modal road networks (D-LinkNet and Res-UNet), our model achieved the best metrics, showing that LCIRE-Net, supplemented by another modality, extracts road labels that better match the ground truth. Although CMMPNet and CMIPNet performed best in Precision and Recall, respectively, our network surpassed them in the other metrics. Compared to other multimodal road segmentation methods (DeepDualMapper and SA-Gate), LCIRE-Net performed worse on some individual metrics across the three datasets but outperformed them in overall performance, demonstrating the superiority of our proposed strategy in identifying pixel categories. LCIRE-Net also has fewer parameters and FLOPs than other multimodal road segmentation networks. Additionally, we found that the Precision of transformer-based single-modal road extraction outperformed all multimodal networks, which we attribute to GPS trajectory data possibly misinterpreting scenes such as parking lots or open spaces as roads, causing interference and leading to worse segmentation results than single-modal transformer-based road extraction.
TLCGIS Datasets: For single-modal networks, due to the dataset being largely occluded by trees or buildings, single remote sensing images have a significant impact on obtaining road occlusion information. As a result, single-modal networks generally performed poorly on this dataset. Among multimodal networks, DeepDualMapper and SA-Gate had considerable competitive IoU and Recall, with IoU of 63.67% and 64.41%, respectively. Our LCIRE-Net achieved the state-of-the-art IoU of 66.55% on this dataset, with an improvement of nearly 2%, significantly outperforming other multimodal road segmentation networks. Overall, the comparison of these performance metrics indicates that our LCIRE-Net is effective in extracting traffic roads from remote sensing images and LiDAR data.
Additionally, LCIRE-Net did not achieve the best precision scores (0.7029, 0.7144, and 0.6933) on the three datasets. Our analysis suggests that the incomplete and blank background areas in the remote sensing images caused invalid segmentation of some false region features. However, the complementary information from the other modality enabled LCIRE-Net to outperform other segmentation models overall. This confirms that our cross-modal information interaction strategy has good generalization and superiority in road feature extraction tasks.
F. Ablation Study
To demonstrate the effectiveness of each module in our proposed LCIRE-Net, we conducted ablation experiments using TDNet as the baseline model. The metrics significantly improved after incorporating the CMIDI mechanism for direct fusion, and the model achieved optimal performance after adding the MFFM fusion strategy and enhancing edge features with FREM. This showcases the innovation and effectiveness of our designed modules. Additionally, we visualized the ablation experiments using feature heatmaps to compare each module.
1) Effectiveness Analysis of LCIRE-Net:
As shown in Table II, the network with the CMIDI mechanism for cross-modal information interaction in the encoder outperformed the baseline on BJRoad and TLCGIS in all evaluation metrics, demonstrating the value of the complementary differences of the other modality in enhancing feature extraction. Removing FREM resulted in Recall dropping from 76.01% and 81.45% to 75.36% and 80.44%, indicating that multiscale feature fusion and dilated convolutions, which increase the receptive field of convolution kernels, significantly improve segmentation accuracy for global roads and edge details. Without MFFM, Precision on BJRoad decreased by 1.77%, and IoU dropped by 0.12% and 1.58%, confirming that our design effectively aggregates road features.
2) Visual Feature Heatmap Comparison:
Fig. 6 shows the feature heatmaps visualizing the last layer after introducing each module in the ablation experiments. Fig. 6(a) is the input remote sensing image, (b) is the corresponding GPS trajectory map, (c) is the segmentation heatmap of the baseline, and (d) is the feature heatmap after introducing the CMIDI mechanism. It can be seen that multimodal information extraction effectively aids road extraction in complex background environments with occlusions. (e) and (f) are the heatmaps after introducing the MFFM and FREM modules, respectively, indicating that the network learns richer background features, contains more detailed information, and achieves more comprehensive semantic representation with better accuracy and continuity, especially for small roads. (g) is our final network model LCIRE-Net, which extracts more effective edge road feature information, proving that our model has considerable capability in complementing another modality and extracting edge features.
Visualization of the feature maps of the last decoder. (a) Remote sensing image. (b) GPS trajectory. (c) Baseline. (d) Baseline + CMIDI. (e) Baseline + CMIDI + MFFM. (f) Baseline + CMIDI + MFFM + FREM. (g) LCIRE-Net.
3) Comparative Analysis of Different Fusion Strategies:
To test whether the single-modal information provided by multimodal data is reliably used for road extraction tasks, we conducted ablation experiments for verification. As shown in Table III, in single-modal scenarios, when only GPS trajectories were input into the model, poor performance was obtained on the BJRoad dataset, with an IoU of only 52.38% and a correspondingly low $F_1$-score.
4) Analysis of the Effectiveness of Lightweight Design:
Table IV demonstrates a comparative analysis of the parameter counts, FLOPs, and IoU for the lightweight ghost basic blocks and the original residual blocks they replace.
5) Comparative Analysis of Intermediate Connection Modules in Encoder-Decoder:
To validate the effectiveness of the FREM design, Table V details a comparison of different classic modules (FPN [56] and ASPP [57]) and the BLCM replacement within our FREM, focusing on parameter size and performance (IoU/$F_1$-score).
6) Selection of Loss Function:
Table VI summarizes the performance metrics (IoU and $F_1$-score) obtained with different loss functions across the three datasets.
7) Analysis of the Proportional Impact of Hybrid Loss:
Table VII presents a comparison of performance metrics for BCE-Dice at different weight ratios across the three datasets. We find that a 1:2 weight ratio achieves optimal results, so we select it as the final ratio for the loss function. A 1:1 ratio maintains overall model performance well; although it does not reach the optimal level, it is closest to the 1:2 ratio. Analysis across the datasets reveals that a higher proportion of BCE results in mediocre model performance, while a higher proportion of Dice generally yields better results. The best performance is observed on BJRoad, where three metrics reach their optimal values, with two metrics achieving optimal results on the other two datasets. Even when metrics do not reach optimal levels, our results are close to the best. Additionally, we observe that more extreme weight ratios tend to degrade model performance. This is likely because the region-based and distribution-based terms compete: as the weight of one increases, the contribution of the other diminishes or disappears, leading to suboptimal segmentation results and additional computational cost, which ultimately affects the model’s computational load and processing time.
G. Model Parameter and Inference Time
Table VIII presents a comparative analysis of ten segmentation networks, including traditional semantic segmentation networks (U-Net, LinkNet, and Res-UNet), single-modal networks for road feature extraction from remote sensing images (D-LinkNet and DeepLabv3+), Transformer-based road extraction networks (TransUNet), multimodal networks (DeepDualMapper, SA-Gate, and CMMPNet), and a lightweight multimodal remote sensing network (CMIPNet).
It is evident that multimodal networks generally have more parameters and FLOPs than single-modal networks. This is due to the substantial computational demands and noise introduced during multimodal fusion, necessitating additional parameters for noise suppression. Given the exceptional performance of transformers, TransUNet exhibits the highest overall performance among single-modal networks. Although our network (LCIRE-Net) has fewer model parameters than TransUNet and DeepDualMapper, with reductions of 49.15 and 12.79 M parameters, respectively, and FLOPs reduced by 229.46 and 23.22 G, its overall performance significantly surpasses that of single-modal networks. In comparison with multimodal networks, we found that the SA-Gate model, which is smaller than our network, performs less effectively in feature extraction due to its lightweight design and lack of feature enhancement. Compared to the latest CMMPNet and CMIPNet, our LCIRE-Net, which employs a parameter-shared D-LinkNet backbone, has fewer parameters than CMMPNet. While CMIPNet improves efficiency through its lightweight design, LCIRE-Net achieves a better overall balance between accuracy and model complexity.
We evaluated the computation time of various network models for processing individual images in the test set. The comparison revealed that traditional road segmentation networks generally outperform Transformer-based networks in processing time. Our LCIRE-Net has more parameters than U-Net or LinkNet because those networks are simple U-shaped architectures without feature enhancement modules. Compared with TransUNet, our model is 77%, 85%, and 76% faster in processing time across the three datasets. We attribute this to the complex computations introduced by the Transformer in TransUNet, whereas our design employs a lightweight CNN for feature extraction in the encoder. This approach, as demonstrated in the performance comparison (Section IV-E), significantly improves performance over single-modal and multimodal networks. Additionally, our network achieves competitive inference time and speed compared to the latest multimodal networks (CMMPNet and CMIPNet).
H. Analysis of Experimental Visualization Results
As shown in Fig. 7, the visualization results of six networks on the BJRoad test set indicate that multimodal scenarios outperform single-modal ones. U-Net struggles with occlusions caused by trees or buildings, while D-LinkNet uses standard dilated convolution to introduce context, extracting some occluded roads but still performing poorly overall. TransUNet, due to the strong attention mechanism of the transformer, segments relatively complete road information. CMMPNet and CMIPNet, benefiting from complementary multimodal information, produce predictions close to the labels. Our LCIRE-Net achieves even closer predictions to the true labels with more complete segmentation details. This is because our CMIDI mechanism fully leverages interactive information, significantly enhancing segmentation accuracy, while the FREM refines edge feature information. The red boxes highlight the distinct differences between our network and various road networks, and the blue boxes demonstrate our method’s superior edge feature extraction capability. LCIRE-Net accurately identifies and segments detailed features and edge information. The visualization results confirm that LCIRE-Net effectively addresses road occlusions and discontinuities, providing more complete road extraction results for image details and boundaries, demonstrating its ability to segment fine features.
Visualization results of BJRoad datasets using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.
Fig. 8 compares the performance of six networks on the Porto dataset in scenarios where pseudo-road features such as rivers or urban rail transit in remote sensing images interfere with network extraction. The red boxes mainly highlight our network’s superiority over other segmentation networks. In the first and second rows, compared to other multimodal road networks, our advantage is mainly reflected in edge feature extraction. This demonstrates that LCIRE-Net accurately distinguishes small-scale targets from the background in complex real-world scenarios. However, compared to the labels, there is still potential for improvement in segmenting road structural features in densely interconnected urban areas. Some models in the third row mistakenly extract river information as road information and directly merge it with the GPS trajectory modality, failing to correct the erroneous features. Our network dynamically complements, potentially eliminating erroneous features at certain layers, minimizing misclassification of objects with road-like features, and achieving more accurate predictions.
Visualization results of PRDs using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.
Fig. 9 presents the predicted results of six networks on the TLCGIS dataset for segmenting fine features in occluded road scenes. The red and blue boxes have the same significance as in the BJRoad dataset. We found that in the first row, the single-modal remote sensing images barely reveal road information, leading the model to conclude there are no road features, thus resulting in no roads being detected. In the second and third rows, although the multimodal approach supplements information from another modality, the information is not dynamically complemented during each downsampling extraction. LiDAR data can help detect some occluded or inconspicuous roads in aerial images. LCIRE-Net enables remote sensing images to dynamically complement LiDAR information through multiple layers of downsampling, accurately distinguishing small-scale targets from the background. Moreover, our method simultaneously extracts occluded large and small object information without being disturbed, thus avoiding redundant false roads and achieving clearer detailed contours than existing road extraction methods.
Visualization results of TLCGIS using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.
Additionally, although our experiments in the Ablation Study (Section IV-F) demonstrate that the IoU of remote sensing images is significantly better than that of trajectories, we believe that GPS trajectories are equally crucial for the robustness of road extraction. In cities like Chongqing and Chengdu in China, and London in the U.K., which are often heavily covered by fog and mist, the data collected by remote sensing images can be greatly disturbed. Since there are no foggy images in the BJRoad dataset, we used the fog effect renderer in Photoshop (PS) to generate some foggy images to explore road feature extraction from foggy images and GPS trajectories. As shown in Fig. 10, in terms of segmentation results, LCIRE-Net can still generate road network maps almost identical to the labels by fully utilizing the complementary information from the GPS trajectory. This demonstrates that our model is also suitable for extreme weather conditions like heavy fog or mist, further enhancing its generalization performance.
The first row shows some foggy remote sensing images from the BJRoad test set. Although the traffic roads in these images are severely obscured by heavy fog, the second row shows that LCIRE-Net can still generate a road network map close to the label by making full use of the complementary information from the GPS trajectory of the other modality.
Conclusion
In this study, we propose a novel LCIRE-Net with a parallel dual-branch multimodal encoder. The CMIDI mechanism dynamically complements the differential features of the two modalities during simultaneous downsampling, effectively addressing the issue of road occlusion that a single modality cannot handle well. The MFFM redefines the fusion of information between different modalities, and when applied to the encoder output features, it balances global and local road feature information. This approach focuses on effectively capturing long-range dependencies and extracting occlusion details in satellite remote sensing image road extraction. Since single-modal targets often have uneven distributions, we introduce the FREM, which provides valuable road structure information and, based on the expansion characteristics of dilated convolutions, enables the network to fully extract edge feature information.
Additionally, in terms of model lightweighting, we designed a novel ghost basic block, alternating DW-Convs and PW-Convs to replace the original residual blocks, inspired by the SOTA GhostNet method. Extensive experiments on three public datasets (BJRoad, TLCGIS, and Porto) demonstrate the superiority and generalization of our proposed LCIRE-Net. The ablation study validates the necessity and effectiveness of CMIDI, MFFM, and FREM. Although the introduction of these innovative modules significantly improves road feature segmentation and extraction, there is still potential for optimization in terms of computational cost, model size, and inference speed. In future work, we will focus on making the model even more lightweight and conduct related research.