
LCIRE-Net: Lightweight Cross-Modal Information Interaction for Road Feature Extraction From Remote Sensing Images and GPS Trajectory/LiDAR


Abstract:

Due to obstructions such as trees and buildings, single-modal satellite or aerial images are insufficient for continuous high-precision representation of road features. To address this problem, this article proposes a lightweight cross-modal information interaction network for road feature extraction (LCIRE-Net) from high-resolution remote sensing images (HRSIs) and GPS trajectory/LiDAR images. We design two parallel encoders for modality feature learning, using pairs of multimodal information as inputs to the encoders. A cross-modal information dynamic interaction (CMIDI) mechanism uses thresholds to decide whether to supplement redundant information from the other modality, avoiding ineffective fusion computations when the differences between modalities are minor. A multimodal feature fusion module (MFFM) is placed after the encoder outputs to achieve effective dual-modal fusion while suppressing the redundant noise generated during extraction. We further present a feature refinement and enhancement module (FREM), which captures edge features of the image using the enlarged receptive field of dilated convolution kernels. In terms of lightweight design, we build on D-LinkNet by replacing the original residual blocks with an enhanced ghost basic block. Extensive experiments on the BJRoad, Porto, and TLCGIS datasets demonstrate that our network, with fewer parameters and FLOPs, outperforms other road-oriented semantic segmentation methods.
Article Sequence Number: 2000418
Date of Publication: 13 December 2024



SECTION I.

Introduction

Road information [1] is one of the foundational elements of geographic databases and a core component of electronic navigation maps. It plays a vital role in drone operations, where air platforms can either accurately strike roads to disrupt traffic or rapidly gather road network information in unfamiliar regions. When the wireless positioning systems of drones are disrupted, the unique topology of road networks can aid in auxiliary positioning and target search tasks. However, roads are often challenging to precisely identify in aerial images because they tend to appear as inconspicuous lines.

High-resolution remote sensing images (HRSIs) [2], [3], [4] offer exceptional spatial resolution, detailed spatial data, and complex surface features, enhancing our understanding of Earth’s terrain. These images are crucial for earthquake response, vehicle navigation, and disaster assessment. However, they also present challenges such as geometric complexity, diverse texture information, and limited spectral data, leading to spectral differences, spatial data loss, and registration inaccuracies between images. Seasonal changes and varying lighting conditions can create unique, nonuniform spectral features that affect road recognition accuracy. Additionally, shadows cast by trees and buildings can obscure data and disrupt road continuity, sometimes misclassified as road features [5]. The visual similarity between dirt roads and railways further complicates feature differentiation. Current research focuses on effectively extracting comprehensive feature information from remote sensing images while minimizing redundant data interference to improve segmentation accuracy. Historically, road extraction methodologies have predominantly depended on manually crafted techniques that adeptly differentiate road-related data from intricate backgrounds by leveraging spectral, geometric, chromatic, textural, and topological attributes, alongside machine learning algorithms for the identification of road features [6], [7], [8]. However, the accuracy of these methods usually falls short of the desired outcome.

Deep learning has transformed road extraction from remote sensing images, with convolutional neural networks (CNNs) [9] becoming central to this process. CNNs are now the backbone of computer-aided road extraction, with most methods employing encoder-decoder architectures. U-Net [10] has served as the foundation for numerous studies on road extraction from remote sensing images, addressing the challenge of automatic road detection across varying spatial resolutions. Subsequently, well-known models such as DeepLabv3 [11] and D-LinkNet [12] have been developed based on U-Net's symmetrical encoder-decoder architecture with skip connections, which effectively extract and reconstruct fine image details. The encoder processes and downsamples the input image through convolution operations, while the decoder restores the spatial dimensions of these features and generates a classification map matching the size of the input image [13], [14], [15], [16], [17], [18], [19]. This method effectively extracts semantic features of roads in complex scenes. Yuan et al. [20] introduced SCTransNet, a spatial channel cross-transformer network, which addresses these challenges by employing spatial channel cross-transformer blocks on top of the skip connections in a U-shaped architecture. Yao et al. [21] proposed an iterative semisupervised CNN framework based on active learning and superpixel segmentation techniques, dubbed SA-CNNs. While most foundation models are tailored to process RGB images for various visual tasks, Hong et al. [22] introduced SpectralGPT, the first universal remote sensing model designed to process spectral images using a novel 3-D generative pretrained transformer. This model handles images of varying sizes, resolutions, time series, and regions through progressive training, optimizing the use of extensive remote sensing data. However, relying solely on a single modality makes it difficult to detect roads from aerial or remote sensing images, especially when roads are heavily obscured by trees. Therefore, extracting roads from single-modal satellite remote sensing images remains highly challenging.

Many current studies have introduced other modal data such as GPS trajectories [23] or LiDAR [24] to address the difficulties in road feature extraction, helping HRSIs acquire more accurate road information. If an area contains a large number of GPS trajectories, it is very likely to be a road, which significantly improves the feasibility of trajectory-based road extraction. In addition, LiDAR data contains depth and distance information and distinguishes roads, buildings, and trees based on their different laser reflectivity characteristics. Qi et al. [25] introduced a dual enhancement module for channels, promoting interaction between the two modalities to complement the information missing from each single modality. Hong et al. [26] proposed a decoupled-and-coupled network called DC-Net for the HS-SR task, a novel progressive fusion framework spanning pixel-level to subpixel-level fusion and image-level to feature-level fusion. Wu et al. [27] proposed a deep learning framework for multimodal remote sensing data classification (CCR-Net), using CNNs as the backbone and featuring an advanced cross-channel reconstruction module. These approaches have greatly inspired our design and accelerated the development of feature extraction strategies for multimodal models.

Previous research methods can be broadly classified into two categories. One approach involves simple feature fusion, where remote sensing images and GPS trajectory/LiDAR images are concatenated and then fed into a semantic segmentation network. This can lead to redundant road segments being treated as noise, which significantly impacts model performance and hinders multimodal integration. The other approach first extracts features from individual modalities and then uses the differences between modalities to complement the extracted features. However, these methods can result in excess segmentation due to similar terrain features, such as rivers or light rail, creating redundant or erroneous information that cannot be corrected by the other modality. Furthermore, multimodal fusion introduces extensive computational complexity, which makes it challenging to deploy the model on mobile devices. To address these problems, we propose a multimodal fusion strategy, called the lightweight cross-modal information interaction network for road feature extraction (LCIRE-Net), which fully leverages the complementarity between pairs of modalities among HRSIs, GPS trajectories, and LiDAR. Specifically, LCIRE-Net designs two encoders for modality feature learning; subsequently, we propose a cross-modal information dynamic interaction (CMIDI) mechanism, which refines the different modal features through mutual information complementation via a progress propagator. To further enhance robustness, we integrate CMIDI into each layer of the encoder to enhance the features of both modalities layer by layer. Eventually, the outputs of the two encoders are fused and weighted through a multimodal feature fusion module (MFFM) to improve prediction accuracy. To address issues such as poor edge feature extraction, we adopt a feature refinement and enhancement module (FREM) between the encoder and decoder to expand the receptive field, allowing more detailed features to enter the decoder. Our most significant contributions are summarized as follows.

  1. We propose the CMIDI mechanism, which enhances complementary information at different scales between multimodal images, supplementing the information differences between features extracted from each downsampling and those extracted from another modality.

  2. Based on D-LinkNet, we replace the original residual blocks with an improved ghost basic block, reducing parameter computation and enhancing inference speed.

  3. An MFFM is designed after the encoder output to better fuse features from both modalities with minimal noise interference.

  4. We design the FREM for the connection between the encoder and decoder, which improves the extraction of edge details by refining the spatial elements of feature maps.

The rest of this article is organized as follows. Section II introduces the related works from the single-modal and multimodal perspectives in detail. Section III elaborates the proposed LCIRE-Net and details each newly developed module. Experimental results are presented in comparison with current SOTA methods in Section IV. Finally, Section V draws a conclusion with a possible future outlook.

SECTION II.

Related Works

A. Single-Modal-Based Road Feature Extraction Method

Based on the type of input data, previous research methods can be divided into three categories. We review the related work in each category below.

  1. Remote Sensing Images-Based Road Extraction: With the rapid development of satellite remote sensing imaging technology, it has become feasible to obtain a large number of HRSIs conveniently. Early works typically relied on handcrafted texture, contour features, and shallow models (deformable models [28] and Markov random fields [29]) to identify road features. However, these traditional methods often struggle to capture high-level semantic information, significantly limiting the model’s ability to accurately extract roads. In recent years, CNNs, known for their excellent representation learning, have gradually become the mainstream models in this field. Yang et al. [30] proposed a method in which recurrent units replace the traditional convolutional units in U-Net, enabling the preservation of detailed spatial information through multiple summations using dilated convolutions [31]. This approach is crucial for enhancing the model’s ability to capture details when extracting roads from high-resolution satellite images. Zhang et al. [32] pioneered an end-to-end road segmentation method that significantly improves the model’s perception of road edges and shapes by effectively leveraging the multilevel features of convolutional layers. This approach addresses the imbalance between CNN depth and spatial resolution. However, relying solely on visual data for road extraction remains challenging, especially in complex or occluded environments. Therefore, exploring additional data sources is crucial for enhancing accuracy.

  2. GPS Trajectories-Based Road Extraction: Studies have leveraged vehicle trajectories to identify road segments, under the assumption that dense GPS data indicates road presence. While this approach improves road extraction, it has limitations. For example, parking lots can be misclassified as roads due to high trajectory density [33], and GPS data may become unreliable in areas with poor signal, such as tunnels or mountainous regions. Communication delays can lead to unstable trajectories, complicating accurate road width measurement, which high-resolution imagery can more effectively provide. Previous research has focused on reducing GPS noise and uncertainty through methods such as cluster-based models, trajectory merging, kernel density estimation, and, more recently, neural network-based approaches. Ruan et al. [34] proposed a deep learning-based framework that infers road centerlines from trajectory data in spatial and transitional views. Due to the constraints of GPS trajectory information and the challenges of GPS noise reduction, these methods still face limitations in using GPS trajectories for road extraction.

  3. LiDAR-Based Road Extraction: Compared to aerial images, LiDAR data provides depth and distance information, offering unique characteristics based on the different reflectivity of objects such as buildings, trees, and roads. This makes the smoothness of road surfaces prominent, aiding in distinguishing road proposals from buildings and trees. Many researchers have designed algorithms to identify roads using LiDAR data. For example, after obtaining ground intensity images, Hu et al. [35] designed structural templates to search for roads and determined road width and direction based on LiDAR characteristics. Despite certain advancements, challenges remain in LiDAR-based road extraction due to the sparsity of LiDAR data and the interference of noisy points in complex scenes.

B. Multimodal-Based Road Feature Extraction Method

Each modality of remote sensing images, GPS trajectories, and LiDAR has its own advantages and disadvantages. Therefore, an effective method for road extraction research is to combine these single modalities to utilize the complementary useful information between them. Xu et al. [36] first segmented road primitives from optical images and LiDAR data, then used an iterative Hough transform algorithm to detect road stripes, and finally formed the road network structure through topological analysis. Parajuli et al. [37] developed a modular deep convolutional network called TriSeg, which involves using two SegNets to extract features from remote sensing images and LiDAR data separately, and another SegNet fusion module to estimate the final road map.

Single modalities such as remote sensing images or LiDAR data often lack sufficient detail to identify roads obscured by trees or buildings, leading to segmentation gaps. Combining visual information helps eliminate noise, reveals hidden roads, and accurately distinguishes false nonroad areas. Therefore, Xu et al. [38] integrated GPS trajectory maps and remote sensing images into neural networks such as U-Net, Res-UNet, LinkNet, and D-LinkNet for road or semantic segmentation to improve the accuracy of road segmentation and prediction. Liu et al. [18] input GPS trajectory maps and remote sensing images into different networks for feature extraction and fused modular features from multiple layers to predict the final roads. Despite these advancements, the fusion methods did not fully utilize the complementarity of different modalities. Bai et al. [39] designed a new model to accurately acquire road information. They introduced an OR operation-based fusion strategy to combine image and trajectory data to extract road information, avoiding the impact of trajectory noise on network training. Hong et al. [40] proposed HighDAN, a high-resolution domain adaptation network, to enhance AI model generalization across multiple cities. HighDAN effectively preserves the spatial topological structure of urban scenes through parallel high-to-low resolution fusion and uses adversarial learning to bridge the representation gap between remote sensing images from different cities.

Although these methods improve accuracy over single-modal approaches, they generally use a single downsampling module to extract features from one modality and fuse them with labels from another, leading to noise and inadequate information complementarity. We propose a CMIDI module to fully utilize complementary information between modalities and an MFFM to minimize noise interference, thus enhancing road feature extraction. Additionally, we have implemented lightweight techniques to ensure efficient deployment on mobile edge computing devices.

SECTION III.

Proposed Approach

A. Network Architecture

The overall network architecture of LCIRE-Net is shown in Fig. 1. Following the structure of most semantic segmentation networks, we adopt an encoder-decoder as the main framework of our network. Since road feature extraction here is a multimodal task, we design a parallel dual-branch downsampling structure in the encoder, with one branch taking remote sensing images as input and the other taking GPS trajectory/LiDAR images as input.

Fig. 1. Overall architecture of the proposed LCIRE-Net for road extraction. Our framework consists of three parts: 1) a dual-input encoder with the CMIDI mechanism for cross-modal scenario tasks; 2) the MFFM, employed to capture more complex semantic information; and 3) the FREM for edge feature enhancement. Moreover, the proposed LCIRE-Net is also suitable for RSI and LiDAR scenarios.

We designed the CMIDI mechanism to dynamically propagate the global context information and local detail information of the two different modalities, complementing the differences between them and thereby mutually improving and enhancing the features of each modality. The difference information produced by the CMIDI interaction is also used as a skip connection and cascaded with the upsampled feature maps in the decoder to obtain feature maps of the same size, with 1\times 1 convolutions used to adjust the channel numbers. We designed the MFFM to fuse the encoder outputs of the two modalities. Subsequently, to address the issue of poor edge detail feature extraction in the fused feature maps, we designed the FREM, which enhances the convolution kernel's field of view through dilated convolutions and extracts road features from different directions using the bidirectional lightweight convolution module (BLCM), equivalent to a secondary feature extraction operation. Finally, upsampling convolutions are used in the decoder to gradually restore the image size. In summary, when dealing with noise or incomplete road information in a single modality, LCIRE-Net can fully exploit the complementarity between the different modalities to accurately segment road features.

In addition, we have illustrated the process of extracting roads from remote sensing images and LiDAR in LCIRE-Net in the lower right corner of Fig. 1, which is consistent with the principles of remote sensing images and GPS trajectories.

B. CMIDI Mechanism

For specific cross-modal feature learning, we set up a CMIDI mechanism based on a message-passing mechanism in the encoder. This module uses information from the two modalities to optimize the extracted road features mutually. It identifies the differential information between the features extracted at each downsampling layer and further dynamically complements and interacts with these differences. In this section, we use the refinement of features {F}_{\mathrm { HRSIs}} and {F}_{\mathrm { GPS/LiDAR}}~\in ~{R}^{C\times H\times W} as an example to demonstrate the working principle of our designed CMIDI module. C, H, and W represent the number of channels, height, and width of these features, respectively.

As shown in Fig. 2, we input {F}_{\mathrm { HRSIs}} and {F}_{\mathrm { GPS/LiDAR}} into their respective 3\times 3 convolutional layers in the parallel connection, extracting two mappings of local information with dimensions {R}^{C\times H\times W} . Subsequently, we aggregate the feature information from different positions to generate a feature map containing global information. Specifically, we select the extracted features from layer K (K=1,2,3,4 ), divide them into 2^{K-1} \times 2^{K-1} regions, with each region having dimensions C \times (H/2^{K-1}) \times (W/2^{K-1}) , and input them into a MaxPool layer of size (H/2^{K-1}) \times (W/2^{K-1}) , resulting in an information vector of size C \times 1\times 1 . Through the fully connected (FC) layer, we generate a global information vector from C output neurons, which is then copied H \times W times and reshaped into a global information map in {R}^{C\times H\times W} . Finally, we use deconvolution and 1\times 1 convolution to adjust the channels and feature map sizes, separating out two feature maps {G}_{1} and {G}_{2} with the same dimensions as {F}_{\mathrm { HRSIs}} and {F}_{\mathrm { GPS/LiDAR}}~\in ~{R}^{C\times H\times W} . These feature maps are elementwise multiplied with {F}_{\mathrm { HRSIs}} and {F}_{\mathrm { GPS/LiDAR}} , producing new enhanced feature maps that serve as input for the next downsampling layer. The refinement formulas are as follows:\begin{align*} F{}'_{\mathrm { HRSIs}}& ={\mathrm { Conv}}_{3\times 3}({F}_{\mathrm { HRSIs}}) \tag {1}\\ F{}'_{\mathrm { GPS/LiDAR}}& ={\mathrm { Conv}}_{3\times 3}({F}_{\mathrm { GPS/LiDAR}}) \tag {2}\\ G& =\text {FC}\big (\text {MaxPool}\big ({F}'_{\mathrm { HRSIs}} \oplus {F}'_{\mathrm { GPS/LiDAR}}\big)\big) \tag {3}\\ {G}_{1},{G}_{2}& ={\mathrm { Conv}}_{1\times 1}(G),\quad {G}_{1}={G}_{2}. \tag {4}\end{align*}
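As an illustration of how (1)-(4) could be realized, the following PyTorch sketch wires together the parallel 3\times 3 local mappings, the region-wise max pooling, and the FC-based global vector; the class name, the fixed region count, and the way the two local maps are combined before pooling are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInfoAggregation(nn.Module):
    """Sketch of the CMIDI local/global aggregation in (1)-(4); layer names are hypothetical."""

    def __init__(self, channels: int, regions: int = 2):
        super().__init__()
        # Eq. (1)-(2): parallel 3x3 local mappings for the two modalities.
        self.local_hrsi = nn.Conv2d(channels, channels, 3, padding=1)
        self.local_gps = nn.Conv2d(channels, channels, 3, padding=1)
        self.regions = regions                            # assumed 2^{K-1} regions per side at layer K
        self.fc = nn.Linear(2 * channels, channels)       # Eq. (3): FC over the pooled vector
        self.proj = nn.Conv2d(channels, 2 * channels, 1)  # Eq. (4): 1x1 conv yielding G1 and G2

    def forward(self, f_hrsi, f_gps):
        h_local = self.local_hrsi(f_hrsi)                 # F'_HRSIs
        g_local = self.local_gps(f_gps)                   # F'_GPS/LiDAR
        fused = torch.cat([h_local, g_local], dim=1)      # channelwise connection
        # Region-wise max pooling, then a global reduction to a per-channel information vector.
        pooled = F.adaptive_max_pool2d(fused, self.regions)
        vec = pooled.amax(dim=(2, 3))                     # (B, 2C) information vector
        g = self.fc(vec)[:, :, None, None]                # global information vector (B, C, 1, 1)
        g = g.expand(-1, -1, f_hrsi.size(2), f_hrsi.size(3)).contiguous()  # copy H x W times
        g1, g2 = self.proj(g).chunk(2, dim=1)             # separate G1 and G2
        return h_local, g_local, g1, g2
```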

Fig. 2. CMIDI mechanism structure, illustrating how the complementary differences between remote sensing image and GPS trajectory/LiDAR features are used to extract information and dynamically enhance the features of both modalities. Learnable fusion weights dynamically fuse local and global information to obtain cross-modal information. A threshold \theta is used to compare against the normal downsampling features: when \theta \geq 0.15 , the features are enhanced; if \theta \lt 0.15 , the original features are simply downsampled. \otimes denotes elementwise multiplication, and \oplus denotes channelwise concatenation.

Additionally, we designed an information balance constraint \theta in this module, similar to many dynamic complementary mechanisms based on gating devices. Specifically, the {G}_{1} and {G}_{2} feature maps obtained through the cascading and separation operations of the two modality feature maps are compared with the local information {F}'_{\mathrm { HRSIs}} and {F}'_{\mathrm { GPS/LiDAR}} extracted by the 3\times 3 convolutional layers. The two feature maps being compared are subtracted pixel by pixel, and the resulting pixel difference weights are mapped into the range 0–1 by the sigmoid function. These values, together with the change ratio of the original feature maps, are used as the value of \theta . We set a controllable threshold range for \theta . If \theta \geq 0.15 , we consider the difference significant and apply complementary enhancement to each downsampling layer through the CMIDI mechanism. If \theta \lt 0.15 , the differences in the information extracted by the parallel downsampling branches are not significant, so feature enhancement is not necessary. This avoids unnecessary computation and inference time caused by minor or ineffective enhancements. The gating principle is calculated as follows:\begin{align*} \theta & =\text {Sigmoid}\big ({G}_{1}-{F}'_{\mathrm { HRSIs}},\, {G}_{2}-{F}'_{\mathrm { GPS/LiDAR}}\big) \tag {5}\\ {F}''_{\mathrm { HRSIs}}& ={G}_{1} \otimes {F}_{\mathrm { HRSIs}},\quad \theta \geq 0.15 \tag {6}\\ {F}''_{\mathrm { GPS/LiDAR}}& ={G}_{2}\otimes {F}_{\mathrm { GPS/LiDAR}},\quad \theta \geq 0.15 \tag {7}\\ {F}''_{\mathrm { HRSIs}}& ={F}_{\mathrm { HRSIs}},\quad {F}''_{\mathrm { GPS/LiDAR}}={F}_{\mathrm { GPS/LiDAR}},\quad \theta \lt 0.15. \tag {8}\end{align*}
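Continuing the sketch above, the information balance gate in (5)-(8) might look roughly like the following; the 0.15 threshold comes from the text, while reducing the per-pixel differences to a single scalar score via averaging is an assumption we make for illustration.

```python
import torch

def cmidi_gate(h_local, g_local, g1, g2, f_hrsi, f_gps, threshold: float = 0.15):
    """Sketch of the CMIDI gating in (5)-(8); the scalar reduction of theta is assumed."""
    # Eq. (5): pixelwise differences squashed into (0, 1) by a sigmoid, then averaged to one score.
    theta = 0.5 * (torch.sigmoid(g1 - h_local).mean() + torch.sigmoid(g2 - g_local).mean())
    if theta >= threshold:
        # Eq. (6)-(7): the differences are significant, so enhance each modality
        # with the cross-modal guidance map before the next downsampling layer.
        return g1 * f_hrsi, g2 * f_gps
    # Eq. (8): the differences are minor; skip the enhancement and keep the original features.
    return f_hrsi, f_gps
```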

C. Multimodal Feature Fusion Module

To better fuse the features of the two modalities while producing less noise interference, we designed the MFFM. In Fig. 1, HRSIs and GPS trajectories/LiDAR are used as the fusion objects. The encoder extracts road information from the HRSIs and the GPS trajectory/LiDAR images as two feature extraction tasks, i.e., two parallel downsampling branches: one branch extracts features only from the HRSIs, and the other extracts features only from the GPS trajectory/LiDAR images. Finally, the two output results are fused. To fully utilize the complementarity of road features in HRSIs and GPS trajectory/LiDAR images, we concatenate the two downsampled features in the fusion module along the channel dimension to combine their feature information in the spatial direction. Subsequently, the features are passed to left and right adaptive fusion branches to generate weights. Each adaptive fusion branch consists of a 1\times 1 convolutional layer and a Softmax layer. The concatenated multidimensional information is divided into two groups of vectors, spatial and channel, denoted as {F}_{\mathrm { HRSIs}} and {F}_{\mathrm { GPS/LiDAR}} . Each group of vectors is assigned several channels or spatial positions, and the number of channels is reduced through a 1\times 1 convolutional layer to obtain weights {W}_{1} and {W}_{2} . Finally, the original vectors and {W}_{1} and {W}_{2} are multiplied pixelwise to perform the weighted operations. The output is normalized by the subsequent Softmax layer to obtain {S}_{\mathrm { HRSIs}} and {S}_{\mathrm { GPS/LiDAR}} . The Softmax function is computed as follows:\begin{align*} {S}_{\mathrm { HRSIs}}& =\frac {{e}^{{F}_{\mathrm { HRSIs}}\times {W}_{1}}} {{e}^{{F}_{\mathrm { HRSIs}} \times {W}_{1}}+{e}^{{F}_{\mathrm { GPS/LiDAR}} \times {W}_{2}}} \tag {9}\\ {S}_{\mathrm { GPS/LiDAR}}& =\frac {{e}^{{F}_{\mathrm { GPS/LiDAR}} \times {W}_{2}}} {{e}^{{F}_{\mathrm { HRSIs}} \times {W}_{1}}+{e}^{{F}_{\mathrm { GPS/LiDAR}} \times {W}_{2}}}. \tag {10}\end{align*}

{W}_{1} \in {R}^{4\times H\times W} assigns the weight to the road features of the remote sensing images, and {W}_{2} \in {R}^{1\times H\times W} assigns the weight to the road features of the GPS trajectory/LiDAR images. The normalized weights {S}_{\mathrm { HRSIs}} and {S}_{\mathrm { GPS/LiDAR}} are multiplied elementwise with the original dimensional vectors for adaptive weighting to obtain the output features {X}_{1} and {X}_{2} . The calculation process is as follows:\begin{align*} {X}_{1}& ={S}_{\mathrm { HRSIs}} \odot {F}_{\mathrm { HRSIs}} \tag {11}\\ {X}_{2}& ={S}_{\mathrm { GPS/LiDAR}} \odot {F}_{\mathrm { GPS/LiDAR}}. \tag {12}\end{align*}

Two maxpool layers are used to extract edge and texture information from the features and suppress background information to compensate for the lost feature information. The features of the remote sensing images and GPS trajectory/LiDAR images are reweighted and added to obtain the combined feature {X}_{1} + {X}_{2} to achieve multimodal feature fusion. Finally, after BN and ReLU layers, a 1\times 1 convolutional layer is used to change the channel dimension, which serves as the input for FREM. The formula is expressed as follows:\begin{align*} {\mathrm { Output}}_{\mathrm { Fusion}}& =\text {Conv}(\text {BN}(\text {ReLU}(\text {Concat} ({X}_{1}+{X}_{2}, \\ & \qquad \qquad \qquad \text {Maxpool}({F}_{\mathrm { HRSIs}}), \\ & \qquad \qquad \qquad \text {Maxpool}({F}_{\mathrm { GPS/LiDAR}})))). \tag {13}\end{align*}
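A compact PyTorch sketch of the fusion path in (9)-(13) is given below; the pooling kernel, the output channel count, and the exact Conv/BN/ReLU ordering follow our reading of the text and should be treated as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFFM(nn.Module):
    """Sketch of the multimodal feature fusion module in (9)-(13); layer names are hypothetical."""

    def __init__(self, channels: int, out_channels: int):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, 1)        # 1x1 layer producing weight W1
        self.w2 = nn.Conv2d(channels, channels, 1)        # 1x1 layer producing weight W2
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)  # edge/texture cues, size preserved
        self.bn = nn.BatchNorm2d(3 * channels)
        self.out = nn.Conv2d(3 * channels, out_channels, 1)

    def forward(self, f_hrsi, f_gps):
        # Eq. (9)-(10): softmax-normalized cross-modal weights (stable form of the exp ratio).
        logits = torch.stack([self.w1(f_hrsi), self.w2(f_gps)], dim=0)
        s_hrsi, s_gps = torch.softmax(logits, dim=0)
        # Eq. (11)-(12): adaptive elementwise weighting of each modality.
        x1, x2 = s_hrsi * f_hrsi, s_gps * f_gps
        # Eq. (13): weighted sum concatenated with max-pooled cues, then ReLU, BN, and 1x1 conv.
        fused = torch.cat([x1 + x2, self.pool(f_hrsi), self.pool(f_gps)], dim=1)
        return self.out(self.bn(F.relu(fused)))
```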

D. Lightweight Strategy in D-LinkNet

In recent years, designing a lightweight CNN architecture [41] with small parameters and FLOPs [42] has been an important research topic in the field of computer vision. This article proposes a lightweight CNN with fewer parameters on the encoding end. It is used for downsampling to extract road features and simplifies the redundancy of conventional network structures through a unified submodule design. By introducing the concept of depthwise separable convolutions from MobileNetV1, regular convolutions are decomposed into depthwise convolutions (DW-Convs) [43] and pointwise convolutions (PW-Conv) [44] for extracting spatial and channel features, respectively. This approach significantly reduces the computational load and parameters of the network, improving image processing speed.

As shown in Fig. 3, when the input is a four-channel image of 4\times 1024\times 1024 and a GPS trajectory/LiDAR image of 1\times 1024\times 1024 , it first passes through Stage 0 of the encoder, where two convolution layers replace a single relatively larger convolution layer to capture more details, changing the size to 64\times 256\times 256 . By adopting the inverted residual block (IRB) paradigm proposed by MobileNetV2 and combining it with the lightweight submodule concept of the ghost basic block from GhostNet, regular convolutions are replaced with alternating DW-Conv and PW-Conv combinations. This redefines a more lightweight ghost basic block (1), introduced in Stage 1, changing the feature map size to 128\times 128\times 128 . Subsequently, the original convolution is replaced by a 3\times 3 DW-Conv with a stride of 2 for the downsampling convolution operation, and a 1\times 1 PW-Conv layer is used for channel transformation. Meanwhile, the 3\times 3 DW-Conv is separated into 1\times 3 and 3\times 1 DW-Conv layers for parallel computation, yielding ghost basic block (2). The 1\times 1 PW-Conv performs submodule downsampling and channel stacking, with parallel low-rank branches designed to save computational costs. Stages 2 and 3 complete the downsampling and feature extraction of road features in the remote sensing images and GPS trajectory/LiDAR images, reducing the size of both image types and increasing their dimensions to 256\times 64\times 64 and 512\times 32\times 32 , respectively. After completing this model pruning operation, we refer to the lightweight version of D-LinkNet as TDNet.
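To make the block-level lightweighting concrete, here is a minimal sketch of a ghost-style basic block built from a DW-Conv followed by a PW-Conv with an IRB-style shortcut; the exact kernel arrangement and the use of BN/ReLU are assumptions rather than the paper's released configuration.

```python
import torch.nn as nn

class GhostBasicBlock(nn.Module):
    """Sketch of a ghost-style basic block from DW-Conv + PW-Conv; configuration is assumed."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise 3x3 convolution extracts spatial features channel by channel.
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        # Pointwise 1x1 convolution mixes channel information.
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # IRB-style identity shortcut when the spatial size and channel count are unchanged.
        self.use_skip = (stride == 1 and in_ch == out_ch)

    def forward(self, x):
        y = self.pw(self.dw(x))
        return x + y if self.use_skip else y
```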

Fig. 3. Lightweight structure in the encoder (TDNet). The downsampling layer's entire process, from input to output, is divided into Stages 0-4. Stage 0 employs a reduced max pooling technique. Stage 1 utilizes ghost basic block (1), which alternates between DW-Conv and PW-Conv. A lighter ghost basic block (2), designed with reduced convolution kernels and separated convolutions, is introduced in Stage 2. Stage 3 alternates between these two ghost basic blocks to select the most effective sampling structure.

E. Feature Refinement and Enhancement Module

Solely relying on CNN for downsampling HRSIs and GPS trajectory/LiDAR data and multimodal fusion modules makes it difficult to fully retain the spatial details of the images. The loss of spatial details significantly affects the quality of fused images. Expanding the receptive field of roads improves the accuracy of road edge feature information, thereby obtaining effective multiscale information. Therefore, we designed the FREM to refine the spatial details of feature mappings.

Most CNNs are typically built with square kernels that allow the network to learn feature mappings within a square window. However, such kernels cannot effectively capture the characteristics of linearly distributed strip features and inevitably include unrelated information from adjacent pixels, increasing unnecessary computational costs. To address these issues, we constructed a BLCM, which uses horizontal strip convolutions and vertical strip convolutions to convolve along the horizontal and vertical directions of the feature map, respectively. These directional convolutions capture features of different shapes and directions, enhancing the richness and accuracy of feature representation. To capture more comprehensive feature information and better adapt to edges and textures in different directions, we added depthwise separable convolutions and introduced channel attention (CA) after the horizontal and vertical convolutions. The horizontal branch consists of 1\times K horizontal strip convolutions, K \times 1 depth convolutions, and 1\times 1 PW-Convs, while the vertical branch consists of K \times 1 vertical strip convolutions, 1\times K depth convolutions, and 1\times 1 PW-Convs. By overlapping horizontal and vertical convolutions, BLCM improves its ability to extract features in multiple directional dimensions.
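The following is a minimal sketch of a BLCM-style block with horizontal and vertical strip-convolution branches followed by a simple channel attention; the strip kernel size K, the squeeze ratio, and the SE-style form of the CA are assumptions for illustration.

```python
import torch.nn as nn

class BLCM(nn.Module):
    """Sketch of the bidirectional lightweight convolution module; K and the CA form are assumed."""

    def __init__(self, ch: int, k: int = 7):
        super().__init__()
        p = k // 2
        # Horizontal branch: 1xK strip conv, Kx1 depthwise conv, 1x1 pointwise conv.
        self.horizontal = nn.Sequential(
            nn.Conv2d(ch, ch, (1, k), padding=(0, p)),
            nn.Conv2d(ch, ch, (k, 1), padding=(p, 0), groups=ch),
            nn.Conv2d(ch, ch, 1))
        # Vertical branch: Kx1 strip conv, 1xK depthwise conv, 1x1 pointwise conv.
        self.vertical = nn.Sequential(
            nn.Conv2d(ch, ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(ch, ch, (1, k), padding=(0, p), groups=ch),
            nn.Conv2d(ch, ch, 1))
        # Simple squeeze-and-excitation style channel attention applied after both branches.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, max(ch // 4, 1), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(ch // 4, 1), ch, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.horizontal(x) + self.vertical(x)   # overlap horizontal and vertical responses
        return y * self.ca(y)
```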

As shown in Fig. 4, FREM first takes the output features of the MFFM and then uses five dilated convolutions with different dilation rates and five BLCMs to obtain multiscale road features. We adopt shared weights for the BLCM across the different branches to reduce the number of network parameters. The outputs of the five branches are processed through different transposed convolution layers to obtain feature maps of the same dimensions. These feature maps are then concatenated to merge a large amount of contextual information into a larger receptive field, thereby restoring lost details. This process can be represented as follows:\begin{align*} {F}'_{1}& ={f}_{\mathrm { deconv1}}({f}_{\mathrm { BLCM}}({\text {Dilate}}_{1}(F))) \tag {14}\\ {F}'_{2}& ={f}_{\mathrm { deconv2}}({f}_{\mathrm { BLCM}}({\text {Dilate}}_{2}(F))) \tag {15}\\ {F}'_{3}& ={f}_{\mathrm { deconv3}}({f}_{\mathrm { BLCM}}({\text {Dilate}}_{3}(F))) \tag {16}\\ {F}'_{4}& ={f}_{\mathrm { deconv4}}({f}_{\mathrm { BLCM}}({\text {Dilate}}_{4}(F))) \tag {17}\\ {F}'_{5}& ={f}_{\mathrm { deconv5}}({f}_{\mathrm { BLCM}}({\text {Dilate}}_{5}(F))) \tag {18}\\ \text {Feature Map}& ={\text {Conv}}_{1\times 1}\big (\text {Switch}\big (\text {BN}\big (\text {Concat} \big ({F}'_{1},{F}'_{2},{F}'_{3},{F}'_{4},{F}'_{5}\big)\big)\big)\big) \tag {19}\end{align*} where {F}'_{i} represents the output features of each branch after dilated convolution, and {\text {Dilate}}_{i}(\cdot) represents the mapping function of the dilated convolution layer in each branch. {f}_{\mathrm { BLCM}}(\cdot) represents the mapping function of the BLCM block shared by all five branches, and {f}_{\mathrm { deconv}}(\cdot) represents the deconvolution operation of each branch, aimed at restoring the sizes of features at different scales. Feature Map denotes the concatenation of the branch outputs along the channel dimension. The concatenated features are then processed by the lightweight Switch activation function to eliminate the risk of gradient explosion during forward propagation, a BN layer is used for batch normalization, and finally a 1\times 1 convolution adjusts the number of channels as the output of FREM. Additionally, to reduce complexity during training, we enhance the plug-and-play capability of FREM by introducing a residual strategy. The entire process can be described as follows:\begin{align*} {F}''& ={\text {Conv}}_{1}(\text {Feature Map}) \tag {20}\\ \text {Out}& ={\text {Conv}}_{2}({F}'' \oplus F). \tag {21}\end{align*}

Fig. 4. Structure of FREM. To enlarge the convolution kernel's receptive field and obtain edge feature information, the fused feature map is processed using dilated convolutions at different dilation rates. The result is processed by the BLCM, and 1\times 1 convolutions are used to adjust the channels. The BLCM uses two parallel lightweight convolution branches for feature extraction throughout this procedure.

Here, {\text {Conv}}_{1}(\cdot) represents a single-layer convolution operation, and {\text {Conv}}_{2}(\cdot) refers to a 3\times 3 convolution operation.
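Putting the pieces together, a possible realization of the FREM branch structure in (14)-(21) is sketched below, reusing the BLCM class from the earlier sketch; the dilation rates, the use of ReLU in place of the Switch activation, the merging of the two 1\times 1 convolutions, and the absence of explicit deconvolutions (the dilated convolutions here already preserve spatial size) are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FREM(nn.Module):
    """Sketch of FREM per (14)-(21); dilation rates and the activation choice are assumptions."""

    def __init__(self, ch: int, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        # Five dilated 3x3 branches with increasing receptive fields.
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in dilations)
        self.blcm = BLCM(ch)                                 # BLCM weights shared by all branches
        self.bn = nn.BatchNorm2d(ch * len(dilations))
        self.reduce = nn.Conv2d(ch * len(dilations), ch, 1)  # plays the role of Conv_1
        self.out = nn.Conv2d(ch, ch, 3, padding=1)           # plays the role of Conv_2

    def forward(self, f):
        # Eq. (14)-(18): dilated branch -> shared BLCM (padding keeps H x W, so no deconv needed).
        feats = [self.blcm(branch(f)) for branch in self.branches]
        fused = F.relu(self.bn(torch.cat(feats, dim=1)))  # BN + activation of Eq. (19), ReLU for Switch
        f2 = self.reduce(fused)                           # 1x1 channel reduction of Eq. (19)-(20), merged
        return self.out(f2 + f)                           # Eq. (21): residual connection with the input
```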

SECTION IV.

Experiments and Results

A. Experimental Road Datasets

We selected three publicly available multimodal road datasets (BJRoad [45], Porto [46], and TLCGIS [47]) as experimental data for feature extraction and segmentation of roads in complex scenes using the semantic segmentation network. As shown in Fig. 5, these three datasets are publicly available and widely used for semantic segmentation in multimodal tasks.

Fig. 5. Examples from the three road datasets: the first three images in the first row are from BJR and the last three from PRD; the second row shows the corresponding GPS trajectory maps; the third and fourth rows show the two modalities of TRD.

The BJRoad datasets (BJR) contain 1350 HRSIs covering an area of approximately 100 km². They include about 50 million GPS trajectory records from 28 000 vehicles, with features such as latitude and longitude, speed, direction, sampling interval, and vehicle status, collected from various devices with different sampling intervals and measurement resolutions. The image resolution is 1024\times 1024 pixels, with each pixel representing an area of 0.5\times 0.5 m in the real world. We divided the dataset into training, validation, and test sets in a 7:2:1 ratio for the experiments.

The Porto road datasets (PRDs) are a multimodal collection consisting of remote sensing images and GPS trajectories of Porto, the largest port city in Portugal, covering an area of approximately 209 km². It includes GPS trajectory information collected from 442 taxis between 2013 and 2014. Due to the lack of specific details about the training and test sets, we split the complete images of the area into 6048 nonoverlapping subimages, each with a resolution of 512\times 512 pixels. We then divided the dataset into training and test sets in a 4:1 ratio.

The TLCGIS road datasets (TRDs) contain 5860 pairs of remote sensing and LiDAR images. The resolution of these images is 500\times 500 pixels, with each pixel representing a geographic length of 0.5 feet. We divided the dataset into training, test, and validation sets in a 5:4:1 ratio, resulting in 2640 training images, 2400 test images, and 240 validation images.

B. Configurations and Implementation Details

The experimental system environment is Linux CentOS 7, using the PyTorch deep learning framework. Two 24 GB NVIDIA GeForce RTX 4090 GPUs are used for model training and testing, respectively. The complexity of the network is measured by the number of parameters and floating-point operations (FLOPs). Before training, the training images are further augmented by horizontal or vertical flipping, grid distortion, and 90° rotation, expanding the number of training images to three times the original training set.
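The augmentation pipeline described above (horizontal/vertical flips, grid distortion, and 90° rotation) could be reproduced, for example, with the albumentations library as sketched below; the probabilities and the library choice are ours, not specified by the paper.

```python
import albumentations as A

# Candidate reproduction of the stated augmentations; probabilities are assumed.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.GridDistortion(p=0.5),
    A.RandomRotate90(p=0.5),
])

# Applied jointly to the stacked HRSI/GPS (or LiDAR) input and its road mask:
# augmented = train_transform(image=image, mask=mask)
```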

We selected the Adam optimizer [48] to minimize the loss function and find the optimal parameters. The batch size for training was set to 4, the initial learning rate was set to 0.0001, and the total number of training and validation epochs was set to 200. We introduced a custom update strategy for learning rate decay, which compares each new epoch's loss with that of the previous epoch. If the new loss is lower, the training loss is updated and the model weights are saved. If the loss does not improve for more than three epochs, the learning rate is divided by 5 for the next epoch. Training terminates if the updated learning rate falls below 5 \times {10}^{-8} or if the training loss does not improve for more than six epochs, to prevent overfitting.
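A rough training-loop sketch of the described decay and early-stopping policy is shown below; the helper train_one_epoch and the checkpoint path are hypothetical, and whether the division by 5 is applied once or repeatedly after three stagnant epochs is our interpretation.

```python
import torch

def train_with_custom_schedule(model, optimizer, train_one_epoch, max_epochs: int = 200):
    """Sketch of the loss-driven LR decay and early stopping; helper names are hypothetical."""
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        loss = train_one_epoch(model, optimizer)        # assumed helper returning the epoch loss
        if loss < best_loss:                            # loss improved: save weights, reset counter
            best_loss, stale = loss, 0
            torch.save(model.state_dict(), "best_lcire_net.pth")
        else:
            stale += 1
        if stale > 3:                                   # no improvement for more than three epochs
            for group in optimizer.param_groups:
                group["lr"] /= 5.0                      # divide the learning rate by 5
        if optimizer.param_groups[0]["lr"] < 5e-8 or stale > 6:
            break                                       # stopping criteria stated in the text
```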

For the BJRoad datasets, the original image resolution is 1024\times 1024 , matching the input size of the network, and only grayscale preprocessing is required. For the Porto datasets, the original image size is 512\times 512 , and we used bilinear interpolation to adjust the images to 1024\times 1024 pixels. For the TLCGIS datasets, the original image size is 500\times 500 ; during preprocessing, we used bicubic interpolation to generate 1024\times 1024 images. This processing ensures that all three datasets have a unified resolution before being input into the model, making them compatible with the designed network input size.
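For the resizing step, a simple helper along the following lines would cover all three datasets; the use of torch.nn.functional.interpolate and the align_corners setting are implementation choices on our part.

```python
import torch
import torch.nn.functional as F

def resize_to_network_input(image: torch.Tensor, mode: str) -> torch.Tensor:
    """Resize a (B, C, H, W) tensor to the 1024x1024 network input size."""
    # Porto (512x512): mode="bilinear"; TLCGIS (500x500): mode="bicubic"; BJRoad needs no resize.
    return F.interpolate(image, size=(1024, 1024), mode=mode, align_corners=False)
```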

C. Evaluation Metrics

We use four evaluation metrics to measure the model's extraction performance: Precision, Recall, IoU, and F1 -score. Specifically, Precision represents the ratio of correctly predicted road pixels to all pixels predicted as roads; Recall represents the ratio of correctly predicted road pixels to all road pixels in the ground-truth image. The F1 -score is the harmonic mean of Precision and Recall. IoU represents the overlap ratio between predicted road pixels and ground-truth road pixels. Higher values of these metrics indicate better performance of the road extraction model. The metrics are defined as follows:\begin{align*} \text {Precision} & = \frac {\mathrm {TP}}{\mathrm {TP + FP}} \tag {22}\\ \text {Recall} & = \frac {\mathrm {TP}}{\mathrm {TP + FN}} \tag {23}\\ {F_{1}} & = \frac {2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} = \frac {\mathrm {2TP}}{\mathrm {2TP + FN + FP}} \tag {24}\\ \text {IoU} & = \frac {\mathrm {TP}}{\mathrm {TP + FP + FN}} \tag {25}\end{align*} where TP (true positives) denotes road pixels correctly predicted as road, FP (false positives) denotes background pixels incorrectly predicted as road, FN (false negatives) denotes road pixels incorrectly predicted as background, and TN (true negatives) denotes background pixels correctly classified as background.
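For completeness, the four metrics in (22)-(25) can be computed from binary road masks as in the following sketch; the small epsilon guarding against empty masks is an implementation detail of ours.

```python
import torch

def road_metrics(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Compute Precision, Recall, F1, and IoU from binary road masks per (22)-(25)."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().float()          # road pixels correctly predicted as road
    fp = (pred & ~target).sum().float()         # background pixels predicted as road
    fn = (~pred & target).sum().float()         # road pixels predicted as background
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```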

D. Hybrid Loss Function

This article combines distribution-based BCE loss [49] with region-based Dice loss [50] to form a composite loss, aimed at optimizing the segmentation performance of the model. BCE loss retains the texture details of both modalities of the images, while Dice loss enhances the model's ability to capture high-frequency details and edge information. The hybrid loss function is formulated as follows:\begin{align*} {{L}_{\mathrm { BCE-Road}}}& = -\frac {1}{N} \sum _{i=1}^{N} \Big (\gamma \varphi _{\mathrm { true}}^{i} \log \big (\varphi _{\mathrm { pred}}^{i}\big) + \big (1-\varphi _{\mathrm { true}}^{i}\big) \log \big (1-\varphi _{\mathrm { pred}}^{i}\big)\Big) \tag {26}\\ {{L}_{\mathrm { Dice-Edge}}}& =1-\frac {2\sum {_{i}^{H\times W}\varphi _{\mathrm { true}}^{i}\varphi _{\mathrm { pred}}^{i}+1}}{{{\sum {_{i}^{H\times W}\left ({{ \varphi _{\mathrm { true}}^{i} }}\right)}}^{2}}+\sum {_{i}^{H\times W}{{\left ({{ \varphi _{\mathrm { pred}}^{i} }}\right)}^{2}}+1}} \tag {27}\\ {{L}_{\mathrm { Total}}}& ={L}_{\mathrm { Road}}+{L}_{\mathrm { Edge}} ={{L}_{\mathrm { Road}}}\left ({{ {{\delta }_{\mathrm { true}}},{{\delta }_{\mathrm { pred}}} }}\right)+{{L}_{\mathrm { Edge}}}\left ({{ {{\delta }_{\mathrm { true}}},{{\delta }_{\mathrm { pred}}} }}\right) \tag {28}\end{align*} where N is the number of pixels in the sample, \varphi _{\mathrm { true}}^{i} represents the binary class label of pixel i (0 or 1), and \varphi _{\mathrm { pred}}^{i} represents the predicted probability for that pixel, with values in the range (0,1). {\delta }_{\mathrm { true}} represents the sample label, and {\delta }_{\mathrm { pred}} represents the prediction result. {L}_{\mathrm { Road}} and {L}_{\mathrm { Edge}} are the loss functions for the overall road texture and edge details, respectively, improving the segmentation accuracy and edge information extraction to different degrees.
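A minimal PyTorch sketch of the BCE + Dice composite in (26)-(28) is given below; the equal weighting of the two terms, the smoothing constant, and the omission of the class weight γ are simplifying assumptions.

```python
import torch
import torch.nn as nn

class HybridRoadLoss(nn.Module):
    """Sketch of the BCE + Dice composite loss in (26)-(28); equal weighting is assumed."""

    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()   # numerically stable BCE applied to raw logits
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = (prob ** 2).sum(dim=(1, 2, 3)) + (target ** 2).sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.smooth) / (denom + self.smooth)
        # L_Total = L_Road (BCE) + L_Edge (Dice), both computed on the same prediction/label pair.
        return self.bce(logits, target) + dice.mean()
```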

E. Evaluation of Model Performance

We selected ten networks for experimental comparison: U-Net, LinkNet [15], D-LinkNet, DeepLabv3+ [16], Res-UNet [51], TransUNet [17], DeepDualMapper [52], SA-Gate [53], CMMPNet [18], and CMIPNet [19]. The first six networks are single-modal, while the latter four are multimodal. Performance metrics such as Precision, Recall, IoU, and F1 -score were evaluated on three public road datasets (BJRoad, Porto, and TLCGIS) to demonstrate the performance advantages of our designed LCIRE-Net over these networks for road feature extraction.

Table I provides quantitative results for various classical methods, concluding that multimodal models generally outperform single-modal models. Our proposed LCIRE-Net achieved the best IoU metrics on the three datasets (0.6623, 0.6757, and 0.6655). Additionally, it achieved the best F1 -scores on the Porto and TLCGIS datasets (0.7818 and 0.8087). Below is a detailed comparative analysis of the models on specific datasets.

  1. BJRoad Datasets: Among single-modal models, DeepLabv3 performed the worst with an IoU of 51.07%, and U-Net had the worst Recall at 68.91%. On the other hand, D-LinkNet achieved the best F1 -score at 77.83%, and LinkNet achieved the best Precision at 77.33%. These methods were originally designed for single-modal feature extraction tasks, but we applied them directly to the multimodal task of HRSIs and GPS trajectories, limiting their feature capture and multimodal interaction capabilities and resulting in IoU below 60%. Among multimodal models, DeepDualMapper, using a gated fusion module for image and trajectory features, achieved a competitive IoU of 61.41%. CMMPNet, which is based on D-LinkNet, achieved a significant advantage with an IoU of 64.22%. CMIPNet, using an improved BN layer constraint factor, achieved a highly competitive IoU of 65.21%. Our LCIRE-Net achieved the best IoU of 66.23% and was also the best model in terms of Recall. This demonstrates that our designed CMIDI mechanism effectively addresses the noise interference present in many cross-modal models, ultimately raising performance to a new height.

  2. Porto Datasets: Compared to D-LinkNet, LCIRE-Net’s Recall increased by 7.6%, indicating that our model based on image and GPS trajectory data effectively extracts the complete road network structure. In terms of IoU, although LCIRE-Net’s performance is slightly lower than the latest CMIPNet, we lead by a narrow margin in other metrics. Compared to single-modal road networks (D-LinkNet and Res-UNet), our model achieved the best metrics, showing that LCIRE-Net, supplemented by another modality, extracts road labels that better match ground truth. Although CMMPNet and CMIPNet performed best in Precision and Recall respectively, our network surpassed them in other metrics. Compared to other multimodal road segmentation methods (DeepDualMapper and SA-Gate), LCIRE-Net performed worse on some individual metrics across the three datasets but outperformed them in overall performance, demonstrating the superiority of our proposed strategy in identifying pixel categories. LCIRE-Net also has fewer parameters and FLOPs compared to other multimodal road segmentation networks. Additionally, we found that the precision metric of transformer-based single-modal road extraction outperformed all multimodal networks, which we attribute to GPS trajectory data possibly misinterpreting scenes like parking lots or open spaces as roads, causing interference and leading to worse segmentation results compared to single-modal transformer-based road extraction.

  3. TLCGIS Datasets: For single-modal networks, due to the dataset being largely occluded by trees or buildings, single remote sensing images have a significant impact on obtaining road occlusion information. As a result, single-modal networks generally performed poorly on this dataset. Among multimodal networks, DeepDualMapper and SA-Gate had considerable competitive IoU and Recall, with IoU of 63.67% and 64.41%, respectively. Our LCIRE-Net achieved the state-of-the-art IoU of 66.55% on this dataset, with an improvement of nearly 2%, significantly outperforming other multimodal road segmentation networks. Overall, the comparison of these performance metrics indicates that our LCIRE-Net is effective in extracting traffic roads from remote sensing images and LiDAR data.

TABLE I Comparison of Road Feature Extraction Performance Between Different Segmentation Models on Three Publicly Available Multimodal Road Datasets: BJRoad, Porto, and TLCGIS

Additionally, LCIRE-Net did not achieve the best precision scores (0.7029, 0.7144, and 0.6933) on the three datasets. Our analysis suggests that the incomplete and blank background areas in the remote sensing images caused invalid segmentation of some false region features. However, the complementary information from the other modality enabled LCIRE-Net to outperform other segmentation models overall. This confirms that our cross-modal information interaction strategy has good generalization and superiority in road feature extraction tasks.

F. Ablation Study

To demonstrate the effectiveness of each module in our proposed LCIRE-Net, we conducted ablation experiments using TDNet as the baseline model. The metrics significantly improved after incorporating the CMIDI mechanism for direct fusion, and the model achieved optimal performance after adding the MFFM fusion strategy and enhancing edge features with FREM. This showcases the innovation and effectiveness of our designed modules. Additionally, we visualized the ablation experiments using feature heatmaps to compare each module.

1) Effectiveness Analysis of LCIRE-Net:

As shown in Table II, the network with the CMIDI mechanism for cross-modal information interaction in the encoder outperformed the baseline on BJRoad and TLCGIS in all evaluation metrics, demonstrating the value of the complementary differences of the other modality in enhancing feature extraction. Removing FREM resulted in Recall dropping from 76.01% and 81.45% to 75.36% and 80.44%, indicating that multiscale feature fusion and dilated convolutions, which increase the receptive field of the convolution kernels, significantly improve segmentation accuracy for global roads and edge details. Without MFFM, Precision on BJRoad decreased by 1.77%, and IoU dropped by 0.12% and 1.58%, confirming that our design effectively aggregates road features.

TABLE II Ablation Results of LCIRE-Net on BJRoad, Porto, and TLCGIS Datasets

2) Visual Feature Heatmap Comparison:

Fig. 6 shows the feature heatmaps visualizing the last layer after introducing each module in the ablation experiments. Fig. 6(a) is the input remote sensing image, (b) is the corresponding GPS trajectory map, (c) is the segmentation heatmap of the baseline, and (d) is the feature heatmap after introducing the CMIDI mechanism. It can be seen that multimodal information extraction effectively aids road extraction in complex background environments with occlusions. (e) and (f) are the heatmaps after introducing the MFFM and FREM modules, respectively, indicating that the network learns richer background features, contains more detailed information, and achieves a more comprehensive semantic representation with better accuracy and continuity, especially for small roads. (g) is our final network model, LCIRE-Net, which extracts more effective edge road feature information, proving that our model has considerable capability in cross-modal complementation and edge feature extraction.

Fig. 6. Visualization of the feature maps of the last decoder. (a) Remote sensing image. (b) GPS trajectory. (c) Baseline. (d) Baseline + CMIDI. (e) Baseline + CMIDI + MFFM. (f) Baseline + CMIDI + MFFM + FREM. (g) LCIRE-Net.

3) Comparative Analysis of Different Fusion Strategies:

To test whether the single-modal information provided by multimodal data is reliably used for road extraction tasks, we conducted ablation experiments for verification. As shown in Table III, in single-modal scenarios, when only GPS trajectories were input into the model, poor performance was obtained on the BJRoad dataset, with an IoU of 52.38% and an F1 -score of 64.89%. When only remote sensing images were input, we achieved an IoU of 62.02% and an F1 -score of 68.64%, indicating that image information is more important than trajectory and LiDAR information. In multimodal scenarios, when both aerial images and GPS trajectories/LiDAR are used, two strategies emerge: early fusion, in which cross-modal information is fused directly before being input to the model, and late fusion, in which individual modal features are extracted first and then fused. The early fusion model achieved slightly better IoU (62.34%, 66.47%, and 66.02%) and F1 -score than the late fusion model (58.18%, 64.71%, and 65.12%). We attribute this to early fusion integrating the modal information before feature extraction, whereas late fusion extracts single-modal features first and only then fuses them. Compared to these two fusion strategies, our LCIRE-Net performs downsampling feature extraction of both modalities simultaneously, fully leveraging the complementary differences between remote sensing images and GPS trajectories/LiDAR, leading to consistently better IoU performance than single-modal models. This indicates that multimodal data are more effective for road extraction.

TABLE III Comparison of IoU and F1-Score Results of Different Fusion Strategies for Single-Modal and Multimodal Scenarios on BJRoad, TLCGIS, and PRDs
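To make the early and late fusion baselines in Table III concrete, the following minimal PyTorch sketch contrasts the two strategies. The 1 × 1 merge/fuse layers and channel counts are illustrative assumptions, not the exact configurations used in our experiments.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modalities at the input, then run a single encoder."""
    def __init__(self, encoder: nn.Module, img_ch: int = 3, aux_ch: int = 1):
        super().__init__()
        self.merge = nn.Conv2d(img_ch + aux_ch, img_ch, kernel_size=1)
        self.encoder = encoder

    def forward(self, img: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.merge(torch.cat([img, aux], dim=1)))

class LateFusion(nn.Module):
    """Run one encoder per modality and fuse the resulting feature maps."""
    def __init__(self, enc_img: nn.Module, enc_aux: nn.Module, channels: int):
        super().__init__()
        self.enc_img, self.enc_aux = enc_img, enc_aux
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, img: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.enc_img(img), self.enc_aux(aux)], dim=1))
```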

4) Analysis of the Effectiveness of Lightweight Design:

Table IV presents a comparative analysis of the parameter counts, FLOPs, and IoU of the light ghost basic blocks (1) and (2), implemented with spatial and channel operators, across the three datasets, demonstrating that our lightweight structural module offers the best cost-performance ratio. Specifically, TDNet with the IRB, using only 67% of the parameters, outperforms EfficientNetV2 [54] in IoU by 0.45%, 0.55%, and 0.15%, respectively. When equipped with the ShuffleNetV2 block, TDNet reduces parameters by 84% relative to ShuffleNetV2 [55] while still exceeding it on all three datasets. Compared with TDNet equipped with the lightest ShuffleNetV1 block, our model shows a notable advantage, with improvements of 0.92%, 0.45%, and 0.25%, respectively. Although our network exhibits a lower IoU than the Residual DSC Block with weights in the ResNet-34 network, it should be noted that our network here is only the baseline model; as highlighted in Table II, incorporating CMIDI, MFFM, and FREM significantly enhances the model's performance.

TABLE IV IoU Scores, Parameter Counts, and FLOPs of Various Lightweight Modules Integrated With Different CNN Models and TDNet on the Three Datasets
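For clarity, the following is a minimal sketch of a ghost-style block built from pointwise (PW) and depthwise (DW) convolutions in the spirit of GhostNet; the channel split and layer ordering of our enhanced ghost basic block may differ from this illustration.

```python
import torch
import torch.nn as nn

class GhostStyleBlock(nn.Module):
    """Sketch of a ghost-style block: a PW conv produces primary features, a
    cheap DW conv generates 'ghost' features from them, and the two halves are
    concatenated. Illustrative only."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        assert out_ch % 2 == 0, "illustrative split assumes an even channel count"
        primary = out_ch // 2
        self.pw = nn.Sequential(                              # channel mixing
            nn.Conv2d(in_ch, primary, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary),
            nn.ReLU(inplace=True),
        )
        self.dw = nn.Sequential(                              # cheap ghost features
            nn.Conv2d(primary, primary, kernel_size=3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(primary),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pw(x)
        return torch.cat([y, self.dw(y)], dim=1)
```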

5) Comparative Analysis of Intermediate Connection Modules in Encoder-Decoder:

To validate the effectiveness of the FREM design, Table V compares classic modules (FPN [56] and ASPP [57]) and the replacement of BLCM within our FREM in terms of parameter size and performance (IoU/F1). The modules SE [58], CBAM [59], and GAM [60] represent attention mechanisms. The results indicate that our approach outperforms both FPN and ASPP across all three datasets. Specifically, FREM with BLCM delivers the best performance compared with the versions without BLCM, highlighting the improvement in road extraction brought by BLCM. Among the attention modules, BLCM achieves the best performance on the Porto dataset, while GAM performs better than BLCM on BJRoad and TLCGIS. We attribute this to GAM's stronger feature extraction capacity, which, however, comes with significantly higher parameter counts and FLOPs and does not translate into improved performance in all cases. Our lightweight approach maintains competitive performance at a reduced computational cost, achieving optimal results on various metrics across the datasets. This confirms that FREM effectively aggregates road features and that BLCM serves as a plug-and-play module, demonstrating the generalization capability of our design.

TABLE V Different Strategies for Various Modules in the Intermediate Connections of the Encoder-Decoder, Compared by IoU/F1 on Three Datasets
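The receptive-field expansion discussed above can be illustrated with a minimal multi-branch dilated-convolution refinement; the branch count, dilation rates, and residual projection below are assumptions for illustration, not the exact FREM/BLCM structure.

```python
import torch
import torch.nn as nn

class DilatedRefinement(nn.Module):
    """Parallel dilated convolutions with increasing rates enlarge the receptive
    field to capture road edges; branch count and rates are assumed values."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(len(rates) * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        return x + self.project(torch.cat(feats, dim=1))  # residual refinement
```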

6) Selection of Loss Function:

Table VI summarizes the performance (IoU and F1) of LCIRE-Net with different loss functions across the three datasets, demonstrating that the hybrid loss function we selected is optimal. BCE and Dice individually yield generally favorable results, which led us to combine them into a composite loss for the final selection. Focal loss [61] yields below-average results, with poorer accuracy than comparable losses. BCE-Dice shows a slight advantage over Tversky [62] and Combo [63], potentially because the hybrid loss better extracts both global and detail features. Although WCE, as an improved version of BCE, performs better than BCE when used alone, combining it with Dice in the composite loss decreases effectiveness. Additionally, we performed an ablation study to analyze the effect of different weight ratios in the composite loss.

TABLE VI Comparison of IoU/F1 Road Segmentation Results on BJRoad, Porto, and TLCGIS Datasets Using LCIRE-Net With Different Loss Functions
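For reference, the composite BCE-Dice loss can be sketched as follows; the weighting arguments default to the 1:2 BCE:Dice ratio analyzed in the next subsection, and the smoothing constant is an illustrative choice.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Composite loss: weighted sum of binary cross-entropy and Dice loss.
    Default 1:2 weighting follows the ratio reported as best in Table VII."""
    def __init__(self, w_bce: float = 1.0, w_dice: float = 2.0, eps: float = 1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.w_bce, self.w_dice, self.eps = w_bce, w_dice, eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits and target share shape (B, 1, H, W); target is a binary mask.
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (union + self.eps)
        return self.w_bce * bce + self.w_dice * dice.mean()
```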

7) Analysis of the Proportional Impact of Hybrid Loss:

Table VII compares the performance of BCE-Dice at different weight ratios across the three datasets. A 1:2 ratio achieves the best results and is therefore selected as the final setting for our loss function. A 1:1 ratio maintains overall performance well; although it does not reach the optimal level, it comes closest to the 1:2 ratio. Analysis across the datasets reveals that a higher proportion of BCE yields mediocre performance, whereas a higher proportion of Dice generally yields better results. The best performance is observed on BJRoad, where three metrics reach their optimal values, and two metrics reach optimal values on the other two datasets; even when a metric is not optimal, our result is close to the best. We also observe that overly large weights tend to degrade performance. This is likely because the two losses target region overlap and pixel-wise distribution, respectively; as one weight grows, the influence of the other diminishes or vanishes, leading to suboptimal segmentation results and additional computational cost, which ultimately affects the model's computational load and processing time.

TABLE VII Comparison of Road Segmentation Results on BJRoad, Porto, and TLCGIS Datasets Using LCIRE-Net With Different Weightings of Loss Functions

G. Model Parameter and Inference Time

Table VIII presents a comparative analysis of ten segmentation networks, including traditional semantic segmentation networks (U-Net, LinkNet, and Res-UNet), single-modal networks for road feature extraction from remote sensing images (D-LinkNet and DeepLabv3+), Transformer-based road extraction networks (TransUNet), multimodal networks (DeepDualMapper, SA-Gate, and CMMPNet), and a lightweight multimodal remote sensing network (CMIPNet).

TABLE VIII Comparison of the Time Required by Various Road Segmentation Models to Process a Single Remote Sensing Image on a 4090 GPU Over the Three Test Sets, Along With the Parameters and FLOPs of Each Network Model

It is evident that multimodal networks generally have more parameters and FLOPs than single-modal networks. This is due to the substantial computational demands of multimodal fusion and the noise it introduces, which requires additional parameters for noise suppression. Owing to the strength of transformers, TransUNet exhibits the highest overall performance among the single-modal networks. Our network (LCIRE-Net) has fewer parameters than TransUNet and DeepDualMapper, with reductions of 49.15 and 12.79 M, respectively, and 229.46 and 23.22 G fewer FLOPs, yet its overall performance significantly surpasses that of the single-modal networks. Among the multimodal networks, the SA-Gate model, which is smaller than ours, performs less effectively in feature extraction because of its lightweight design and lack of feature enhancement. Compared with the latest CMMPNet and CMIPNet, our LCIRE-Net, which employs a parameter-shared D-LinkNet backbone, has fewer parameters than CMMPNet. While CMIPNet improves L1 sparse constraints through its CMIP mechanism, we use threshold-based differentiation in the CMIDI mechanism. Although the fusion module and the edge feature enhancement module add 8.13 M parameters to our model, it still requires fewer FLOPs than CMIPNet.

We evaluated the computation time of each network for processing a single image from the test sets. The comparison shows that traditional road segmentation networks generally outperform Transformer-based networks in processing time. Our LCIRE-Net has more parameters than U-Net or LinkNet because those networks are plain U-shaped architectures without feature enhancement modules. Compared with TransUNet, our network is 77%, 85%, and 76% faster across the three datasets, which we attribute to the heavy computations introduced by the Transformer in TransUNet, whereas our encoder uses a lightweight CNN for feature extraction. As demonstrated in the Performance Comparison (Section IV-E), this design significantly improves performance over both single-modal and multimodal networks. Additionally, our network achieves competitive inference time and speed compared with the latest multimodal networks (CMMPNet and CMIPNet).
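For reproducibility, parameter counts and per-image inference time can be measured with a simple recipe such as the one below (FLOPs additionally require a profiler such as thop or fvcore); the warm-up and repetition counts are illustrative assumptions.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def time_single_image(model, img, aux, device="cuda", warmup=10, runs=100):
    """Average per-image inference time in milliseconds for a two-input model."""
    model = model.to(device).eval()
    img, aux = img.to(device), aux.to(device)
    for _ in range(warmup):                  # warm-up to stabilize clocks/caches
        model(img, aux)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(img, aux)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3
```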

H. Analysis of Experimental Visualization Results

As shown in Fig. 7, the visualization results of the six networks on the BJRoad test set indicate that multimodal methods outperform single-modal ones. U-Net struggles with occlusions caused by trees or buildings, while D-LinkNet uses standard dilated convolutions to introduce context, extracting some occluded roads but still performing poorly overall. TransUNet, thanks to the strong attention mechanism of the Transformer, segments relatively complete road information. CMMPNet and CMIPNet, benefiting from complementary multimodal information, produce predictions close to the labels. Our LCIRE-Net produces predictions even closer to the true labels, with more complete segmentation details, because the CMIDI mechanism fully exploits the interactive information, significantly enhancing segmentation accuracy, while the FREM refines edge features. The red boxes highlight the distinct differences between our network and the other road extraction networks, and the blue boxes demonstrate our method's superior edge feature extraction. LCIRE-Net accurately identifies and segments detailed features and edge information. The visualization results confirm that LCIRE-Net effectively addresses road occlusions and discontinuities, providing more complete road extraction for image details and boundaries and demonstrating its ability to segment fine features.

Fig. 7. Visualization results of the BJRoad dataset using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.

Fig. 8 compares the six networks on the Porto dataset in scenarios where pseudo-road features, such as rivers or urban rail transit in the remote sensing images, interfere with extraction. The red boxes mainly highlight our network's advantage over the other segmentation networks. In the first and second rows, compared with the other multimodal road networks, our advantage lies mainly in edge feature extraction, demonstrating that LCIRE-Net accurately distinguishes small-scale targets from the background in complex real-world scenes. Compared with the labels, however, there is still room for improvement in segmenting road structure in densely interconnected urban areas. In the third row, some models mistakenly extract river features as roads and directly merge them with the GPS trajectory modality, failing to correct the erroneous features. Our network complements the modalities dynamically, potentially suppressing erroneous features at certain layers, minimizing the misclassification of road-like objects, and achieving more accurate predictions.

Fig. 8. Visualization results of PRDs using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.

Fig. 9 presents the predictions of the six networks on the TLCGIS dataset for segmenting fine features in occluded road scenes. The red and blue boxes have the same meaning as for the BJRoad dataset. In the first row, the single-modal remote sensing image barely reveals any road information, leading the model to detect no roads at all. In the second and third rows, although the multimodal approaches supplement information from the other modality, that information is not dynamically complemented during each downsampling stage. LiDAR data can help detect roads that are occluded or inconspicuous in the aerial images. LCIRE-Net enables the remote sensing branch to dynamically complement LiDAR information across multiple downsampling layers, accurately distinguishing small-scale targets from the background. Moreover, our method extracts occluded large and small objects simultaneously without interference, thereby avoiding redundant false roads and achieving clearer detailed contours than existing road extraction methods.

Fig. 9. Visualization results of TLCGIS using different methods. (a) Remote sensing image. (b) GPS trajectory. (c) U-Net. (d) D-LinkNet. (e) TransUNet. (f) CMMPNet. (g) CMIPNet. (h) LCIRE-Net.

Additionally, although our experiments in the Ablation Study (Section IV-F) show that the IoU of remote sensing images is significantly better than that of trajectories, we believe that GPS trajectories are equally crucial for the robustness of road extraction. In cities such as Chongqing and Chengdu in China and London in the U.K., which are frequently covered by fog and mist, remote sensing imagery can be severely degraded. Since the BJRoad dataset contains no foggy images, we used the fog effect renderer in Photoshop (PS) to generate foggy images and explore road feature extraction from foggy images and GPS trajectories. As shown in Fig. 10, LCIRE-Net can still generate road network maps that are nearly identical to the labels by fully exploiting the complementary information from the GPS trajectory. This demonstrates that our model also suits extreme weather conditions such as heavy fog or mist, further enhancing its generalization performance.

Fig. 10. The first row shows foggy remote sensing images synthesized from the BJRoad test set. Although the roads are severely obscured by heavy fog, the second row shows that LCIRE-Net can still generate road network maps close to the labels by fully using the complementary information of the GPS trajectory modality.
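The foggy images in Fig. 10 were produced with Photoshop's fog renderer; as a scriptable alternative, a simple atmospheric-scattering-style fog can be synthesized as sketched below. This is a substitute recipe, not the procedure used in our experiments, and the transmission and airlight values are illustrative.

```python
import numpy as np

def add_synthetic_fog(image: np.ndarray, transmission: float = 0.5,
                      airlight: float = 0.9) -> np.ndarray:
    """Blend an RGB image (float in [0, 1]) with a uniform airlight following
    the simple scattering model I = J * t + A * (1 - t)."""
    foggy = image * transmission + airlight * (1.0 - transmission)
    return np.clip(foggy, 0.0, 1.0)

# Example: fog_img = add_synthetic_fog(img, transmission=0.4)  # denser fog
```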

SECTION V.

Conclusion

In this study, we propose LCIRE-Net, a novel multimodal network with a parallel dual branch at the decoder end. The CMIDI mechanism dynamically complements the differential features of the two modalities during simultaneous downsampling, effectively addressing road occlusions that a single modality cannot handle well. The MFFM redefines how information from the two modalities is fused and, when applied to the decoder output features, balances global and local road information. This design focuses on effectively capturing long-range dependencies and recovering occluded details in road extraction from satellite remote sensing images. Since single-modal targets often have uneven distributions, we introduce the FREM, which provides valuable road structure information and, by exploiting the enlarged receptive field of dilated convolutions, enables the network to fully extract edge features.

Additionally, for model lightweighting, we designed a novel ghost basic block that alternates DW-Conv and PW-Conv to replace the original residual blocks, building on the SOTA GhostNet method. Extensive experiments on three public datasets (BJRoad, TLCGIS, and Porto) demonstrate the superiority and generalization of the proposed LCIRE-Net, and the Ablation Study validates the necessity and effectiveness of CMIDI, MFFM, and FREM. Although these modules significantly improve road feature segmentation and extraction, there is still room for optimization in computational cost, model size, and inference speed. In future work, we will focus on further lightweighting the model.
