Introduction
Semantic segmentation is typically regarded as a land-cover classification problem and aims at providing a category label for each pixel in an urban scene image. It plays a vital role in land-cover mapping [1], [2], urban planning, change detection [3], road extraction, and environmental protection. With the advancement of sensor technology, plenty of high-resolution (HR) urban scene images have been captured. Urban scene images with rich potential semantic content and abundant spatial details can provide data support for segmentation. However, large-scale variation, the unbalanced distribution of ground objects and their categories, interclass similarities, intraclass variations, and the difficulty of extracting comprehensive feature information challenge land-cover classification approaches to accurately identify and segment objects in HR urban scene images.
Earlier methods made some progress by employing traditional feature design and convolutional neural networks (CNNs), making data-driven multiscale feature learning feasible for urban scene images. Conventional methods often use handcrafted features (such as spectral, spatial, and textural features) and traditional machine learning methods (such as support vector machines and random forests) to segment urban scene images. However, because traditional methods depend on handcrafted features, they consistently achieve unsatisfactory performance. Thus, designing good feature extractors for multiscale stimuli is essential for solving urban scene image semantic segmentation problems. It requires feature extractors to use larger receptive fields to identify context at multiple scales.
Unsurprisingly, CNNs learn multiscale features through a stack of convolutional operators. This ability of CNNs leads to effective representations for solving several vision tasks (i.e., classification, object detection, and semantic segmentation). Compared with traditional methods, CNN-based approaches have shown tremendous success in semantic segmentation. In modern computer vision systems, CNNs are the most common choice of visual encoder. The most popular encoders include AlexNet [4], VGGNet [5], ResNet [6], InceptionNet-v4 [7], and InceptionResNet-v2 [7]. Recently, CNNs have been challenged by vision transformers (ViTs) [8], which have demonstrated good performance in several vision tasks. ViTs are designed either locally or globally and can gather information from a larger region using a larger window size. However, these methods are usually computationally heavy and require significant memory to capture the global context. Meanwhile, attention mechanisms are widely adopted for semantic segmentation due to their advantages in acquiring long-range context information. DANet [9] and CBAM [10] are state-of-the-art approaches that incorporate spatial and channel attention mechanisms to enhance the feature representation abilities of models.
With the widespread adoption of CNNs for vision tasks, significant advancements have been made in natural image semantic segmentation, producing spectacular results. The fully convolutional network (FCN) [11] is pioneering work in semantic segmentation, employing fully convolutional layers without fully connected layers for end-to-end dense pixel prediction. U-Net [12] introduces skip connections in an encoder–decoder architecture, and U-Net++ [13] employs dense skip connections to bridge the semantic gap between encoder and decoder feature maps. PSP-Net [14] leverages the pyramid pooling module (PPM), and DeepLabV3+ [15] uses atrous spatial pyramid pooling (ASPP), to steadily segment targets at multiple scales and capture object and image context.
The above-mentioned CNN-based semantic segmentation methods have three limitations. First, these methods construct a multiscale model using single-sized fixed convolutional kernels, dilated convolution with different dilation rates, or pooling grids without enabling correlation among different feature maps. Second, they underutilize feature interdependencies of the context contained in each layer of the encoder–decoder architecture, causing scale-related inaccuracies. Finally, these methods usually use deconvolution or bilinear upsampling techniques to recover image resolution, causing checkerboard artifacts and blurry edges at increased computational cost. Thus, for HR urban scene image semantic segmentation, it is essential to design good feature extractors that simultaneously identify local and global contexts, redesigned skip connections that prevent the loss of fine- and coarse-grained details at shallower and deeper layers, and upsampling techniques that fully fuse feature information at reduced computational cost.
To mitigate the above-mentioned limitations, this work proposes an alternative simple yet effective multiscale context-aware feature fusion network (MCN) for HR urban scene images. Concretely, MCN includes three fundamental components: multiscale feature enhancement (MFE) module, multilayer feature fusion (MLF) module, and pixel-shuffle decoder (PSD) module. MFE is exploited for the backbone network to identify the local and global context of ground objects while suppressing the background noise and capturing complementary features by enabling correlation among different levels of feature maps. MLF is introduced as skip connections where features from different MFE layers are merged before the supervision to produce a single high-level representation of the input data by leveraging the strength of each layer in capturing low-, mid-, and high-level features. The network can better segment the image into various semantic classes by merging these learned representations from all layers. PSD is used to enhance the resolution of the feature maps in the decoder and fully fuse feature information from various scales. Unlike other upsampling methods, PSD can better fix blurry edges and checkerboard artifacts with fewer parameters while improving network speed and accuracy.
The contributions of this work are as follows.
A novel MCN with three fundamental modules is proposed to solve the interclass similarities, intraclass variations, scale-related inaccuracies, and high computational complexity issues.
We adopt well-extracted multiscale information to identify local and global contexts simultaneously, redesign skip connections to leverage the strengths of each layer in capturing different levels of features, and employ an upsampling technique that fully fuses feature information at various receptive fields while reducing the number of parameters.
Extensive experiments are conducted on the ISPRS 2-D semantic labeling datasets and DeepGlobe to demonstrate the effectiveness of MCN, which yields notable performance gains over existing architectures with far fewer parameters.
Related Works
This section discusses similar techniques commonly employed in the semantic segmentation of HR urban scene images, including multiscale feature learning, skip connections, upsampling methods, and visual attention mechanisms.
A. Encoder–Decoder Architecture
One of the first semantic segmentation efforts using CNNs is the FCN [11]. However, during downsampling, FCN reduces spatial information by a large factor. Thus, during upsampling, it becomes difficult to reproduce fine details even with transposed convolution, which results in coarse output. To tackle this issue, Ronneberger et al. [12] introduced skip connections in the encoder–decoder module. However, due to the fixed receptive field of convolutional kernels, U-Net [12] struggles to extract multiscale features. Zhao et al. [14] introduced an effective PPM to capture multiscale features by applying pooling operations over different grids. However, the pooling-based approach (i.e., PPM) may lose pixel-level fine details because distinct pixels may use the same contextual information. Chen et al. [15] introduced DeepLabv3+, which utilizes a more effective ASPP module that deploys multiple parallel filters with different dilation rates to capture multiscale features without adding extra computational cost. However, it can only manage scale variation to some degree. In addition, sparse sampling leads to spatial information loss, and a larger dilation rate produces gridding artifacts.
1) Multiscale Feature Learning
The accurate semantic segmentation of urban scene imagery requires multiscale feature information of the region of interest. Extracting ground object features at various scales can help address interclass similarities and intraclass variances in diverse situations. AlexNet [4] stacks filters sequentially and achieves significant performance gains over traditional methods. VGGNet [5] stacks filters with smaller kernels to increase the network depth and receptive field. Although VGGNet provides a more robust multiscale feature representation than AlexNet, both networks stack filters directly and thus have a relatively fixed receptive field at each layer. Szegedy et al. [7] introduced InceptionNet-v4, which uses different filter sizes in parallel to increase receptive fields, and InceptionResNet-v2, which combines inception and residual connections to enhance the efficiency of multiscale feature learning. SegFormer [16] presents a hierarchical transformer encoder to extract multiscale features and employs a multilayer perceptron (MLP) decoder to aggregate information from different layers. In urban scene semantic segmentation, Xu et al. [17] proposed a network to solve the problem of existing backbones in extracting multiscale features due to a large downsampling factor. In [18], an ensemble learning paradigm is employed to adaptively fuse features from different scales, together with a pointwise convolution method to reduce the parameters while improving the model's accuracy. Extracting essential contextual information about ground objects requires CNN models to process features at various scales for effective semantic segmentation. In summary, to adequately utilize the rich spatial information in HR urban scene images and improve the robustness of feature extraction among diverse and complex ground scenes, we introduce kernels of different sizes (i.e., 3 × 3, 5 × 5, and 7 × 7) in our work.
2) Skip Connections
Skip connections were introduced to solve different problems in different architectures, such as ResNets [6] for degradation and U-Net [12] for encoder–decoder architectures to prevent the loss of fine-grained details (i.e., object boundaries). U-Net++ [13] introduces dense skip connections to replace the plain skip connections in U-Net to bridge the semantic gap between encoder–decoder feature maps. However, due to the skip connection scheme and the fixed receptive field of each layer, it is challenging for both U-Net and U-Net++ to model the global multiscale context of HR urban scene images. In urban scene semantic segmentation, 2DSegFormer [19] designed dilated residual connections as skip connections to further increase the receptive field of deep feature maps. MAResU-Net [20] redesigned the skip connections in U-Net based on a linear attention mechanism and ResNet. MSCA-Net [21] presents skip connections with atrous convolution to deal with the segmentation problems of multiscale urban scene images. MACU-Net [22] introduces skip connections with a channel attention mechanism to combine multiscale features. We found that skip connections are helpful for several reasons in HR urban scene image segmentation. First, residual connections [6] allow the network to learn more complex feature mappings and facilitate faster convergence without being affected by vanishing gradient problems. Second, skip connections [12] are essential for encoder–decoder-based architectures, as they allow the decoder to directly access the feature maps from corresponding levels of the encoder, thus preserving fine-grained details that might otherwise be lost as the spatial resolution decreases. This makes U-Net [12] particularly effective for tasks such as image segmentation, where precise localization of object boundaries is essential. In summary, we use a residual connection in MFE and redesign skip connections as MLF.
3) Upsampling Methods
CNNs are popular and highly performant choices for dense-level prediction. One commonly required component in CNNs is the enlargement of low-resolution feature maps for dense prediction. Interpolation and deconvolution are the most common upsampling methods for recovering spatial information from convolution or max-pooling layers. Interpolation-based upsampling methods include nearest neighbor, bicubic, and bilinear interpolation. These methods lack a "learnable" aspect, blur the images, and introduce aliasing distortions. Deconvolution upsamples low-resolution images using learnable kernels, improving upsampling during training. However, "uneven overlap" can easily occur during deconvolution, meaning that the convolutional kernel operates more in some places than in others, causing checkerboard artifacts. Deconvolution was first proposed in FCN [11] and has been used in later segmentation models, e.g., U-Net [12]. In contrast to interpolation and deconvolution, Shi et al. [23] introduced a parameter-free upsampling method that avoids checkerboard artifacts, i.e., pixel shuffle (PS), also known as subpixel convolution, for single-image super-resolution (SISR), which was later used in semantic segmentation tasks. PS provides a larger receptive field to capture more contextual information with minimal loss of information while maintaining the quality of the generated segmentation. In urban scene semantic segmentation, many methods based on PS have been proposed. Chen et al. [24] proposed an end-to-end semantic segmentation network by inserting a shuffling layer in the DeepLab architecture. They designed a field-of-view method to enhance the prediction while using an ensemble method to improve the model performance. Zhang et al. [25] proposed a network that simultaneously solves super-resolution semantic segmentation and super-resolution image reconstruction by using low-resolution images to generate an HR segmentation image. We found that methods with upsampling layers based on deconvolution or bilinear interpolation produce outputs distorted by checkerboard artifacts. In summary, the proposed work uses a PS operation in the decoder to improve the resolution of the output feature maps and leverage the advantage of MCN in capturing multiscale context. Using PS further improves the network speed and accuracy while alleviating the edge blur and artifacts caused by information loss.
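To make the channel-to-space rearrangement concrete, the following minimal PyTorch sketch (tensor shapes are illustrative, not the exact configuration used in MCN) shows how a pixel-shuffle layer trades channels for spatial resolution without any learnable parameters:

```python
import torch
import torch.nn as nn

# Pixel shuffle with upscale factor r = 2: rearranges (B, C*r^2, H, W) -> (B, C, H*r, W*r).
ps = nn.PixelShuffle(upscale_factor=2)

x = torch.randn(1, 64, 32, 32)   # low-resolution feature map with 64 channels (illustrative)
y = ps(x)                        # the rearrangement itself has no learnable weights
print(y.shape)                   # torch.Size([1, 16, 64, 64]): channels /4, resolution x2
```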
B. Visual Attention Mechanism
The attention mechanism can improve the saliency representation of important features while suppressing interference from redundant features. SPOL [26] analyzed the importance of shallow features and used global average pooling to suppress background noise. In urban scene semantic segmentation, SCAttNet [27] combines channel and spatial attention mechanisms to refine the feature map. Zhang et al. [28] proposed a network to adaptively recalibrate feature responses and simultaneously aggregate global information along the channel and spatial dimensions to improve feature representation. Li et al. [29] propose a dual-channel scale-aware segmentation network with position and channel attention. PGNet [30] uses transformer-based architecture to fully leverage the long-range dependencies and global contextual information to segment objects of varying sizes. Ding et al. [31] proposed a network that utilizes CNN encoder and global–local attention-based transformer decoder to model global and local information. Gao et al. [32] proposed a network that used a dual-branch encoder based on CNN transformer to model local and global semantic information and designed a multilayer dense connectivity network as a decoder to aggregate the dual-branch semantic information. MANet [33] uses a multiscale strategy and self-attention mechanisms to aggregate relevant contextual features. ABCNet [34] uses a spatial path and a contextual path to extract contextual and fine-grained information to increase the segmentation accuracy of the network. BIBED-Seg [1] proposed a block-in-block edge detection network using an attention mechanism. In summary, an attention mechanism can improve the object region features while suppressing interference from redundant features. As a result, we designed MFE and MLF for feature enhancement and correlation modeling using an attention mechanism.
One of the fundamental concepts underlying these strategies is the use of multilevel context to enhance segmentation prediction. Although these methods can prevent global contextual information loss, they are computationally expensive and redundant when collecting rich multiscale contextual information. The following sections show that MCN provides comparable or better results than benchmark methods with fewer parameters.
Methodology
Fig. 1 shows MCN's three-module architecture. The first module is MFE (see Fig. 2), which takes input from ResNet50, the backbone network. Initially, MFE utilizes 3 × 3, 5 × 5, and 7 × 7 convolution kernels to extract feature information at low, middle, and high levels, respectively. Second, MFE refines the extracted multiscale features using an attention mechanism (see Fig. 3), which helps to decrease the influence of redundant background feature information. Finally, it uses a similarity function to capture complementary features by establishing the correlation among different levels of feature maps. The second module is MLF (see Fig. 5), which merges the shallow and deep layers in the encoder–decoder network by using a larger receptive field and an attention mechanism to deal with various layers simultaneously and improve the multiscale representation capability at a finer-grained level. The third module is PSD (see Fig. 7), which alleviates blurry edges and checkerboard artifacts while upsampling with reduced parameters. The proposed MCN is an end-to-end segmentation network that uses hierarchical processing to refine feature information and obtain accurate semantic segmentation results.
Overview of the proposed segmentation method MCN, including MFE module for backbone network, MLF module as skip connections, and PSD module as a decoder.
Given an intermediate feature map, the operation process of the CAM is divided into two parts, including average-pooled features and max-pooled features, to capture both the overall importance and the most discriminative features of each channel. After pooling, the two 1-D vectors are sent to the multilayer perceptron network and added to generate 1-D channel attention. Then, the channel attention is multiplied by the input elements to obtain the refined feature map.
A. MFE Module
In recent years, many efforts have been made to improve the performance of CNNs, from convolutional operations and bottleneck layers to more efficient architectures. The most common way to enlarge the receptive field in CNNs is to stack several smaller kernels rather than use a single larger kernel. According to the theory of the effective receptive field (ERF) [35], the ERF is proportional to $O(k\sqrt{n})$, where $k$ is the kernel size and $n$ is the number of layers.
Feature maps are selected from the first layer of a trained MCN by convolutional filters of different sizes. It is indicated that smaller convolutional filters are more effective at extracting fine structures, such as the sharp corners of buildings and intricate patterns of vegetation. In comparison, coarse structures respond to larger filters. (From Top to Bottom) First and second rows demonstrate the Potsdam and Vaihingen train set, respectively.
To extract multiscale features from urban scene images for the input feature map, we use convolution kernels of different sizes, specifically 3 × 3, 5 × 5, and 7 × 7. In each convolution branch, we also introduce a 1 × 1 convolution to reduce the number of channels and control the calculation parameters. This allows us to efficiently extract multiscale features while minimizing computational cost. Overall, the process for extracting multiscale features can be described as follows:
\begin{equation*}
\left\{ {\begin{array}{c} {{X}_1 = {\vartheta }_{k = 3}\ \ ({\vartheta }_{k\ = \ 1}\left(X \right))}\\ {{X}_2 = {\vartheta }_{k = 5}\ \ ({\vartheta }_{k\ = \ 1}\left(X \right))}\\ {{X}_3 = {\vartheta }_{k = 7}\ \ ({\vartheta }_{k\ = \ 1}\left(X \right))} \end{array}} \right\} \tag{1}
\end{equation*}
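A minimal PyTorch sketch of (1), assuming each branch first applies a 1 × 1 channel-reduction convolution followed by a k × k convolution; the channel counts are illustrative assumptions:

```python
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 branches of (1), each preceded by a 1x1 reduction."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, kernel_size=1),                   # channel reduction
                nn.Conv2d(mid_ch, mid_ch, kernel_size=k, padding=k // 2),  # k x k context
            )
            for k in (3, 5, 7)
        ])

    def forward(self, x):
        # Returns X_1, X_2, X_3 of (1).
        return [branch(x) for branch in self.branches]
```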
At each convolutional layer, a group of filters expresses neighborhood spatial connectivity patterns along input channels. These filters are extremely useful in learning edges and particular textures in images, enabling CNNs to produce image representations that capture hierarchical patterns. These representations can be strengthened by explicitly modeling the interdependencies of their convolutional feature channels. Channel attention provides a weight for each channel to enhance those channels that are essential for feature learning. To improve the extracted multiscale features and mitigate the influence of overlapping background feature information, we utilize a channel attention module (CAM) to learn how to prioritize various features for calibration. This process of feature calibration is described as follows:
\begin{equation*}
\text{CA}\left({X_i^{\rm{^{\prime}}}} \right)\ = {\rm{\ \delta }}\left({{\rm{\Theta }}\left({\text{AP}\left({{X}_i} \right)} \right) + {\rm{\Theta }}\left({\text{MP}\left({{X}_i} \right)} \right)} \right) \tag{2}
\end{equation*}
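A sketch of the CAM in (2), following a CBAM-style shared-MLP attention over average- and max-pooled descriptors; the reduction ratio is an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA(X) = sigmoid(MLP(AvgPool(X)) + MLP(MaxPool(X))), as in (2), applied to refine X."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP (Theta in (2))
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # AP branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # MP branch
        attn = torch.sigmoid(avg + mx)                            # delta in (2)
        return x * attn                                           # refined feature map
```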
It is essential to correctly utilize information at various scales to achieve precise semantic segmentation of urban scene images. Due to the inherent distinctions between the three branch feature maps, simply adding or concatenating them could introduce redundant information or repetition in the final output. It is, therefore, essential to consider other techniques that can blend the feature maps efficiently while minimizing these issues. To capture the complementary features of the three branches, we introduce a cosine similarity function that captures the similarity of objects among different feature maps. The MFE employs cosine similarity to quantify the relevance of the three branch feature maps. The obtained similarity scores are then normalized and used as weights to combine the maps into a single multiscale representation. This combined representation is refined and integrated into the final feature vector through optimization. For the three branches of channel-attention-refined maps (i.e., $X_1^{\prime}$, $X_2^{\prime}$, and $X_3^{\prime}$), the similarity score $Z$ is computed as
\begin{equation*}
Z = \frac{X_1^{\prime} \cdot X_2^{\prime}}{\left\| X_1^{\prime} \right\| \left\| X_2^{\prime} \right\|} \otimes \frac{X_1^{\prime} \cdot X_3^{\prime}}{\left\| X_1^{\prime} \right\| \left\| X_3^{\prime} \right\|} \tag{3}
\end{equation*}
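The cross-branch similarity in (3) can be sketched as channel-wise cosine similarity computed at each spatial location; this is one plausible reading of (3), and the normalization axis is an assumption:

```python
import torch.nn.functional as F

def branch_similarity(x1, x2, x3, eps=1e-8):
    """Z in (3): cosine similarity of branch 1 with branches 2 and 3, combined element-wise."""
    s12 = F.cosine_similarity(x1, x2, dim=1, eps=eps).unsqueeze(1)  # (B, 1, H, W)
    s13 = F.cosine_similarity(x1, x3, dim=1, eps=eps).unsqueeze(1)  # (B, 1, H, W)
    return s12 * s13   # element-wise product, the "otimes" in (3)
```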
When a neural network uses different branches to extract features from an input, the resulting feature maps may differ from one another. This can make it challenging to fully understand the input by correlating these feature maps. To overcome this limitation, it is essential to account for these distinctions and integrate the feature maps more effectively; otherwise, the accuracy of the network's predictions may suffer. To address this issue, a gate mechanism is implemented in this study. This mechanism optimizes similar features and merges them with feature maps at various scales. The gate unit used in this work differs from previous work [39], which usually uses the sigmoid function to restrict values between 0 and 1. Instead, we employ a ReLU function, as it accelerates the training of MCN and avoids potential issues with gradient dispersion
\begin{equation*}
\left\{ {\begin{array}{c} {X_1^{{\rm{^{\prime\prime}}}} = \ \sigma \left({\left({Z \otimes {X}_1} \right) + {X}_1} \right)}\\ {X_2^{{\rm{^{\prime\prime}}}} = \ \sigma \left({\left({Z \otimes {X}_2} \right) + {X}_2} \right)}\\ {X_3^{{\rm{^{\prime\prime}}}} = \ \sigma \left({\left({Z \otimes {X}_3} \right) + {X}_3} \right)} \end{array}} \right\} \tag{4}
\end{equation*}
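A minimal sketch of the ReLU-gated refinement in (4), where the similarity map Z modulates each branch and a residual path preserves the original features:

```python
import torch.nn.functional as F

def gated_refinement(z, branches):
    """X_i'' = ReLU(Z * X_i + X_i) for each branch X_i, as in (4); Z broadcasts over channels."""
    return [F.relu(z * x + x) for x in branches]
```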
For the feature maps obtained by the different branches, the fused feature $X_c$ is obtained by concatenation
\begin{equation*}
{X}_c = \text{concat}\,(X_1^{\prime\prime}, X_2^{\prime\prime}, X_3^{\prime\prime}). \tag{5}
\end{equation*}
Convolutional kernels with trainable weights are repeatedly applied to feature maps to extract new features. During feature extraction, the input of a layer depends on the weights of the previous layer. Small changes in image batches or shallow feature maps accumulate and amplify with network depth, forcing subsequent layers to fit these distribution changes rather than the actual content. As a result, a neural network suffers from covariate shift, which decreases performance and training speed. Batch normalization (BN) mitigates covariate shift by normalizing the feature map along the channel direction, and it preserves the representation capacity of the network by rescaling and shifting the normalized feature map. Therefore, we use BN and LReLU to increase numerical stability and activate the output nonlinearly. Moreover, a 1 × 1 convolution is introduced to obtain the output of the MFE module
\begin{equation*}
\text{MFE}\left(X \right) = \sigma \left({\gamma \left( {\vartheta }_{k = 1}\left({{X}_c} \right) \right) + \beta } \right) \tag{6}
\end{equation*}
A residual connection is added to stabilize training. It takes the activation from one layer and feeds it to a layer deeper in the network, facilitating training and the learning of more complex features. Residual connections allow gradients to flow backward unhindered during backpropagation. Performance does not degrade even if some information is lost during feature extraction, as it still flows through the residual connection during forward propagation
\begin{equation*}
\ X_i^E = {\rm{\ MFE}}\left(X \right) + X \tag{7}
\end{equation*}
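Putting (5)–(7) together, a sketch of the MFE tail, assuming the three refined branches have equal channel counts and the 1 × 1 convolution restores the input channel dimension so that the residual addition is valid:

```python
import torch
import torch.nn as nn

class MFEFusion(nn.Module):
    """Concatenation (5), 1x1 conv + BN + LeakyReLU (6), and residual connection (7)."""
    def __init__(self, branch_ch, out_ch):
        super().__init__()
        # out_ch must match the channel count of the MFE input x for the residual addition.
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * branch_ch, out_ch, kernel_size=1),  # 1x1 conv on X_c
            nn.BatchNorm2d(out_ch),                           # gamma, beta in (6)
            nn.LeakyReLU(inplace=True),                       # sigma in (6)
        )

    def forward(self, x, refined_branches):
        x_c = torch.cat(refined_branches, dim=1)   # X_c in (5)
        return self.fuse(x_c) + x                  # X^E = MFE(X) + X, as in (7)
```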
B. MLF Module
Features from deeper layers are rich in semantic detail, while features from shallow layers are less semantic but contain more local information that helps define object boundaries more accurately. In U-Net [12] and similar architectures that use plain skip connections, these deep and shallow features are supervised directly, pushing the network to learn better representations. Unfortunately, this direct form of supervision is not very beneficial for urban scene semantic segmentation for the following reasons. First, the limited receptive field of shallow features can lead to less semantic information and introduce more noise, whereas a larger receptive field can yield more accurate segmentation results. Second, merging shallow and deep features through direct concatenation results in a large number of parameters and computations, whereas indirect supervision, in which features from different layers are merged prior to supervision, can benefit urban scene semantic segmentation at a lower computational cost. Despite significant advancements made by prior skip-connection-oriented approaches, two major challenges remain.
1) Deep Coarse Features
Due to the successive downscaling operations, the feature maps in the last few layers become severely coarse (i.e., 8 × 8 in ResNet50), resulting in a loss of spatial resolution. Despite their limited impact on classification accuracy, these coarse feature maps can significantly affect object localization. When the feature maps are too coarse, it becomes difficult to precisely localize objects since their exact position within the image is unclear. Even if the network correctly classifies an image containing an object, it will struggle to localize it if the feature maps are too coarse. To tackle this problem, techniques such as upsampling or deconvolution can be used to recover some of the lost spatial information, allowing for more precise object localization even with coarse feature maps. However, this comes at increased computational cost. Therefore, we utilize an alternative upsampling method (PSD) that recovers the lost spatial information more effectively at reduced computational cost.
2) Shallow and Deep Features
Without additional boundary information from the input data, it is hard to refine objects' complete and sharp boundaries. Shallow layers capture low-level features, such as edges, corners, and textures, which make object boundaries sharper. Deeper layers, on the other hand, capture higher level features that are more abstract and semantic, such as object categories. Thus, effectively combining both shallow and deep features is critical for precise semantic segmentation of HR urban scene images. An appropriate fusion strategy can lead to more effective feature maps that benefit from both shallow and deep features, as MLF does. Compared with plain skip connections, we design MLF (see Fig. 5) as skip connections where features of different layers (i.e., MFE-1, MFE-2, and MFE-3) are merged before the supervision to produce a single, high-level representation of the input data. Our approach leverages the strengths of each layer's features: shallower layers capture low-level features (i.e., edges and textures), middle layers capture mid-level features (i.e., shapes and patterns), and deeper layers capture high-level features (i.e., object categories). Combining these learned representations from all three MFE layers allows the network to better segment the image into various semantic classes, as shown in Fig. 6. MLF is helpful in several respects: improved segmentation accuracy, robustness to noise, better generalization, more flexible design, and reduced computational complexity. MLF improves the overall network performance by emphasizing low-level features embedded in shallow layers and high-level features embedded in deep layers.
Structure of the MLF architecture. A multiplicative mechanism correlates the different branches (MFE-1, MFE-2, and MFE-3) that separately learn representations of the input data. The gradients in one branch can be affected by the performance of other branches, as the error signals propagate through the network during training. Errors in one branch can be amplified when multiplied by the activations from another branch, which can lead to large gradients and incorrect final predictions. For instance, branches such as MFE-2 and MFE-3 depend on MFE-1, which conveys that the different branches have some level of interaction and interdependence. When MFE-1 learns better representations, it can improve the other branches' performance, ultimately leading to more accurate predictions when all the branches are fused together.
(From Top to Bottom) First and second rows demonstrate the Potsdam and Vaihingen train set, respectively. The figure displays visualizations of input and output features from three MLF modules, respectively. The input image in each MLF module represents low-, mid-, and high-level features from the MFE modules, while the output image represents the processed features.
For the input feature maps $X^E$ from the MFE layers, three cascaded 3 × 3 convolution blocks produce $X_1$, $X_2$, and $X_3$ as follows:
\begin{equation*}
\left\{ {\begin{array}{l} {\ {X}_1 = \ \sigma \left({\gamma \left({{\vartheta }_{k\ = \ 3}\ \left({{X}^E} \right)} \right) + \beta } \right)\ }\\ {\ {X}_2 = \sigma (\gamma ({\vartheta }_{k = 3}\ \ \left({{X}_1 + {X}^E} \right)) + \beta)}\\ {{X}_3 = \sigma (\gamma ({\vartheta }_{k = 3}\ \ \left({{X}_2 + {X}^E} \right)) + \beta)} \end{array}} \right\}. \tag{8}
\end{equation*}
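A sketch of the cascaded 3 × 3 convolutions in (8), each followed by BN and LeakyReLU, with the MFE output X^E re-injected into the second and third stages:

```python
import torch.nn as nn

def conv_bn_act(ch):
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(ch),
        nn.LeakyReLU(inplace=True),
    )

class MLFStages(nn.Module):
    """X_1, X_2, X_3 of (8): cascaded 3x3 conv blocks with X^E added back at each step."""
    def __init__(self, ch):
        super().__init__()
        self.stage1 = conv_bn_act(ch)
        self.stage2 = conv_bn_act(ch)
        self.stage3 = conv_bn_act(ch)

    def forward(self, x_e):
        x1 = self.stage1(x_e)
        x2 = self.stage2(x1 + x_e)
        x3 = self.stage3(x2 + x_e)
        return x1, x2, x3
```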
The presence of noise in feature data can severely hinder the accuracy of predictive models. Our approach differs from [10], which only focuses on one layer at a time. Instead, we deal with multiple layers simultaneously in a unique way. Before fusing features from different layers, we deploy CAM to filter out irrelevant features across various layers simultaneously and improve the quality of the processed data. By leveraging the interdependence of features, channel attention allows for efficient information processing and enhanced learning capabilities within the network. Our findings suggest that this method is highly effective in improving the accuracy of predictive modeling, mainly when dealing with complex data, such as urban scene images.
For all three feature maps, CAM is exploited to obtain the refined maps $X_i^{\prime}$
\begin{equation*}
\text{CA}\left({X_i^{\prime}} \right)\ = {\rm{\ \delta }}\left({{\rm{\Theta }}\left({\text{AP}\left({{X}_i} \right)} \right) + {\rm{\Theta }}\left({\text{MP}\left({{X}_i} \right)} \right)} \right) \tag{9}
\end{equation*}
After CAM, we employ three parallel 1 × 1 convolutions, whose outputs are multiplied to obtain the latent representation $Z$
\begin{equation*}
Z\ = {\vartheta }_{k = 1}{\rm{\ \ }}\left({X_1^{\rm{^{\prime}}}} \right) \otimes {\vartheta }_{k\ = \ 1}\ \left({X_2^{\rm{^{\prime}}}} \right) \otimes {\vartheta }_{k\ = \ 1}\ \left({X_3^{\rm{^{\prime}}}} \right) \tag{10}
\end{equation*}
Finally, the latent representation $Z$ is passed through a 1 × 1 convolution and multiplied by each branch feature $X_i$ to obtain the fused outputs $X_i^F$
\begin{equation*}
\left\{ {\begin{array}{c} {\ X_1^F = {X}_1 \otimes ({\vartheta }_{k = 1}\ \ \left(Z \right))\ }\\ {\ X_2^F = {X}_2 \otimes ({\vartheta }_{k = 1}\ \ \left(Z \right))}\\ {\ X_3^F = {X}_3 \otimes ({\vartheta }_{k = 1}\ \ \left(Z \right))} \end{array}} \right\} \tag{11}
\end{equation*}
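A sketch of the MLF fusion in (9)–(11), reusing the ChannelAttention sketch from the MFE section above; equal channel counts across branches and the exact placement of the 1 × 1 convolutions are assumptions where the text leaves them implicit:

```python
import torch.nn as nn

class MLFFusion(nn.Module):
    """Eqs. (9)-(11): CAM refinement, 1x1 projections, multiplicative latent Z, and reweighting."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.cam = nn.ModuleList([ChannelAttention(ch, reduction) for _ in range(3)])
        self.proj = nn.ModuleList([nn.Conv2d(ch, ch, kernel_size=1) for _ in range(3)])
        self.proj_z = nn.Conv2d(ch, ch, kernel_size=1)   # 1x1 conv applied to Z in (11)

    def forward(self, x1, x2, x3):
        refined = [cam(x) for cam, x in zip(self.cam, (x1, x2, x3))]   # X_i' in (9)
        z = (self.proj[0](refined[0])
             * self.proj[1](refined[1])
             * self.proj[2](refined[2]))                               # Z in (10)
        gate = self.proj_z(z)
        return [x * gate for x in (x1, x2, x3)]                        # X_i^F in (11)
```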
C. PSD Module
PS [23] was initially introduced for SISR, where the aim is to train a CNN that generates super-resolved images at the original resolution. Without adding extra parameters and computation costs, PS provides another way to fit semantic segmentation for large-scale urban scene images under memory limits. On this basis (see Fig. 7), we design the PSD module to leverage the advantage of MCN in capturing multiscale information. PSD upsamples by rearranging pixels of the feature map and reducing the number of channels by a factor of four, which significantly reduces the parameters of the subsequent convolution, followed by a composite function comprising three operations (3 × 3 convolution, BN, and PReLU) and concatenation with the corresponding MLF layers from the skip connection path. To achieve dense-level prediction, the high-level features generated at the last encoder stage are progressively upsampled and fused with the corresponding MLF features at each decoder stage $s$
\begin{equation*}
\ \bar{X}_{\text{De}\left(s \right)}^{\ 2h \times 2w \times c/4} = {\rm{\ PS}}\left(X_{\text{De}\left({s - 1} \right)}^{h \times w \times c} \oplus X_{\text{MLF}\left(s \right)}^{h \times w \times c}\right) \tag{12}
\end{equation*}
Sample images from land-cover classification datasets, Potsdam, Vaihingen, and DeepGlobe, respectively.
Upsampled feature maps are further convolved with a standard 3 × 3 convolution to lessen aliasing distortion, and a nonlinearity is added to generate the final feature map at decoder stage $s$
\begin{equation*}
X_{\text{De}\left(s \right)}^{2h \times 2w \times c/4} = \sigma \left( \gamma \left( {\vartheta }_{k = 3}\left( \bar{X}_{\text{De}\left(s \right)}^{\ 2h \times 2w \times c/4} \right) \right) + \beta \right) \tag{13}
\end{equation*}
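A sketch of one PSD decoder stage following (12) and (13): fuse the previous decoder output with the corresponding MLF features, pixel-shuffle to double the resolution, and apply a 3 × 3 convolution with BN and PReLU. The "⊕" in (12) is read here as channel concatenation, as described in the text, and the channel arithmetic follows from that choice:

```python
import torch
import torch.nn as nn

class PSDStage(nn.Module):
    """One decoder stage of (12)-(13): MLF fusion, pixel shuffle x2, then 3x3 conv + BN + PReLU."""
    def __init__(self, in_ch):
        super().__init__()
        # in_ch = channel count of the concatenated tensor; must be divisible by 4.
        self.shuffle = nn.PixelShuffle(upscale_factor=2)     # channels /4, resolution x2
        self.refine = nn.Sequential(                         # composite function in (13)
            nn.Conv2d(in_ch // 4, in_ch // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_ch // 4),
            nn.PReLU(),
        )

    def forward(self, x_prev, x_mlf):
        fused = torch.cat([x_prev, x_mlf], dim=1)   # fusion with the MLF skip features
        up = self.shuffle(fused)                    # Eq. (12)
        return self.refine(up)                      # Eq. (13)
```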
Experimental Results and Analysis
A. Semantic Segmentation Datasets
The performance of MCN is assessed using three publicly available datasets, Potsdam, Vaihingen, and DeepGlobe.
1) Potsdam Dataset
The Potsdam dataset consists of 38 HR aerial images, each with an average size of 6000 × 6000 pixels and a ground sampling distance (GSD) of 5 cm. Potsdam RGB images are annotated with six distinct landscape classes: impervious surface (road), building, low vegetation, trees, car, and clutter, where clutter class is not considered in the assessment. For Potsdam, we split images into training, validation, and testing sets with 23, 1, and 14 images, respectively.
2) Vaihingen Dataset
The Vaihingen dataset consists of 33 HR aerial images, each with an average size of 2494 × 2064 pixels and a GSD of 9 cm. Vaihingen IRRG images are annotated with six distinct landscape classes: impervious surface (road), building, low vegetation, trees, car, and clutter, where clutter class is not considered in the assessment. For Vaihingen, we split images into training, validation, and testing sets with 15, 1, and 17 images, respectively.
3) DeepGlobe Dataset
The DeepGlobe dataset consists of 803 HR satellite images, each with an average size of 2448 × 2448 pixels and a GSD of 50 cm. These images are annotated with seven distinct landscape classes: forest land, urban land, barren land, agriculture land, rangeland, water, and unknown, where the unknown class is not considered in the assessment. Following [36], we split images into training, validation, and testing sets with 454, 207, and 142 images, respectively.
B. Evaluation Metrics
We use the following metrics to evaluate the performance of MCN: overall accuracy (OA), mean intersection over union (mIoU), and mean F1-score (mF1), which are defined as follows:
\begin{align*}
\text{OA} &= \frac{{\text{TP} + \text{TN}}}{{\text{TP} + \text{FP} + \text{TN} + \text{FN}}}\ \tag{14}\\
\text{IoU} &= \frac{{\text{TP}}}{{\text{TP} + \text{FN} + \text{FP}}} \tag{15}\\
F1 &= 2 \times \frac{{\text{precision} \times \text{recall}}}{{\text{precision} + \text{recall}}} \tag{16}\\
\text{precision} &= \frac{{\text{TP}}}{{\text{TP} + \text{FP}}},\quad \text{recall} = \frac{{\text{TP}}}{{\text{TP} + \text{FN}}}\ \tag{17}
\end{align*}
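A sketch of (14)–(17) computed from a per-class confusion matrix, with OA taken as the global pixel accuracy and mIoU/mF1 as macro averages; this is an illustrative implementation, not the exact evaluation code used in the experiments:

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (C, C) confusion matrix, conf[i, j] = pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp

    oa = tp.sum() / conf.sum()                                    # (14): correct pixels / all pixels
    iou = tp / (tp + fp + fn + 1e-10)                             # (15), per class
    precision = tp / (tp + fp + 1e-10)                            # (17)
    recall = tp / (tp + fn + 1e-10)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)    # (16), per class
    return oa, iou.mean(), f1.mean()                              # OA, mIoU, mF1
```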
C. Implementation Details
The experiments were conducted on a single NVIDIA RTX 3090 GPU using the PyTorch framework. We chose U-Net [12] as the baseline model and employed ResNet50 [6] as the backbone network for MCN. In our work, we utilized only the first three bottleneck layers of the pretrained ResNet50 to reduce the number of trainable parameters. For optimization, we employed the Adam optimizer with AMSGrad [37], using a weight decay of 2 × 10−5. In addition, we applied a polynomial decay to the learning rate L, i.e., L × (1 − cur_iter/max_iter)^0.9, where the maximum number of iterations was set to 108. We also used 2×L for all bias parameters. The initial learning rate was set to 8.5 × 10−5/√2 for the ISPRS datasets and 8.5 × 10−4/√2 for the DeepGlobe dataset. We implemented a stepwise schedule to decrease the learning rate and improve the training process. For the ISPRS datasets, we reduced the learning rate by a factor of 0.85 after every 15 epochs; for the DeepGlobe dataset, we reduced it by a factor of 0.85 after every 4 epochs. During training and validation, we randomly sampled 5000 patches of size 256 × 256 from the ISPRS and DeepGlobe datasets. These patches were augmented by mirroring and flipping, each with a 50% probability. To improve the predictions, we employed test time augmentation (TTA) by averaging the predictions of overlapping TTA regions. To handle the imbalanced data in the ISPRS datasets, we utilized a cross-entropy loss function with median frequency balancing weights, as described in (18) and (19). For the DeepGlobe dataset, we employed a standard cross-entropy loss
\begin{align*}
&L = - \frac{1}{N}\mathop \sum \limits_{n = 1}^N \mathop \sum \limits_{c = 1}^C l_c^{\left(n \right)}\log \left({p_c^{\left(n \right)}} \right){W}_c \tag{18}\\
&{W}_c = \frac{\text{median}(\{ f_{c}|c\in C \})}{{{f}_c}}\ \tag{19}
\end{align*}
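A sketch of the weighted cross-entropy in (18) with median-frequency-balancing weights from (19). The per-class pixel counts below are hypothetical placeholders, and f_c is taken here as the global pixel frequency of each class, a simplification of (19):

```python
import torch
import torch.nn as nn

def median_frequency_weights(class_pixel_counts):
    """W_c in (19): median of the per-class frequencies divided by each class frequency."""
    freqs = class_pixel_counts / class_pixel_counts.sum()
    return torch.median(freqs) / freqs

# Hypothetical pixel counts for six classes (for illustration only).
counts = torch.tensor([3.1e7, 2.7e7, 2.4e7, 1.9e7, 1.6e6, 1.1e6])
weights = median_frequency_weights(counts)
criterion = nn.CrossEntropyLoss(weight=weights)   # realizes the weighted sum in (18)
# loss = criterion(logits, targets)  # logits: (N, C, H, W), targets: (N, H, W) class indices
```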
D. Ablation Study
To verify the effectiveness of MCN, we conducted extensive ablation experiments on ISPRS datasets using different settings. Tables I and II present the ablation experiments of different modules of MCN, while Tables III and IV present the ablation experiments of MCN using different upsampling methods.
1) Effectiveness of MFE Module
MFE is designed to extract rich spatial information among various and complex urban objects while suppressing background noise and enabling correlation among different levels of feature maps. Table I shows that, compared with the baseline, the MFE module increases the mF1 by 3.28/4.02%, OA by 3.22/2.20%, and mIoU by 5.20/5.79% on the ISPRS datasets, respectively. Fig. 9 shows the visualization results of the MFE module on the ISPRS datasets; it produces overall better segmentation results, although some pixels are still misclassified, such as those in the low vegetation category.
Qualitative comparisons of MFE, MLF, and PSD modules. (From Top to Bottom) First and second rows demonstrate the Potsdam and Vaihingen test dataset, respectively.
2) Effectiveness of MLF Module
MLF is designed as skip connections to prevent the loss of fine- and coarse-grained details at shallower and deeper layers. MLF makes the utmost use of low-level features embedded in shallow layers and high-level features embedded in deep layers, improving the overall network performance. Table I shows that, compared with the baseline, the MLF module increases the mF1 by 2.35/4.24%, OA by 2.48/2.12%, and mIoU by 4.00/6.52% on the ISPRS datasets, respectively. Furthermore, comparing MFE+MLF with the MLF module alone yields a further improvement of 0.24/1.12% in mF1, 0.24/0.80% in OA, and 1.12/1.81% in mIoU on the ISPRS datasets, respectively. Fig. 9 shows the visualization results of the MLF and MFE+MLF modules on the ISPRS datasets. Compared with the baseline method, MLF demonstrates noticeable enhancements in segmentation results. Similarly, MFE+MLF improves significantly over MLF by effectively reducing the number of misclassified pixels while generating sharp boundaries.
3) Effectiveness of PSD Module
PSD is designed for the decoder to improve the resolution of the feature maps and fully fuse feature information from different receptive fields while better fixing blurry edges and checkerboard artifacts with a reduced number of parameters. Table I shows that, compared with the baseline, the PSD module increases the mF1 by 1.42/3.05%, OA by 1.59/1.57%, and mIoU by 1.77/4.57% on the ISPRS datasets, respectively. Compared with the MFE+MLF and PSD modules, MFE+MLF+PSD further improves the semantic segmentation accuracy, with mF1, OA, and mIoU reaching 92.85/89.25%, 93.51/90.18%, and 86.81/80.86% on the ISPRS datasets, respectively. Fig. 9 shows the visualization results of the PSD and MFE+MLF+PSD modules on the ISPRS datasets. Compared with the baseline method, PSD demonstrates noticeable enhancements in segmentation results. Similarly, MFE+MLF+PSD exhibits significant improvements over MFE+MLF and PSD by correctly identifying the classes present while reducing the checkerboard artifacts and blurry edges.
4) MCN With Different Upsampling Methods
Deconvolution and bilinear interpolation are the most common upsampling methods for recovering spatial information from convolution or max-pooling layers. In contrast to both upsampling techniques, we designed PSD based on subpixel convolution, which can better fix blurry edges and checkerboard artifacts with fewer parameters while further improving network speed and accuracy. Table III presents the quantitative results on the ISPRS datasets. Compared with deconvolution, MCN with PSD improves the segmentation accuracy in terms of mF1 by 1.36/0.87%, OA by 2.02/0.68%, and mIoU by 2.39/1.43%. Compared with bilinear interpolation, MCN with PSD improves the segmentation accuracy in terms of mF1 by 0.87/0.29%, OA by 1.48/0.19%, and mIoU by 1.51/0.47%. Fig. 10 shows that MCN with PSD alleviates the edge blur and artifacts caused by information loss better than the other upsampling methods.
Qualitative comparisons of MCN with deconvolution, bilinear, and PSD-based upsampling methods. (From Top to Bottom) First and second rows demonstrate the Potsdam and Vaihingen test dataset, respectively.
5) Model Complexity
Considering that model complexity is a significant metric for assessing a framework, we report the training time per epoch, inference time, parameters, flops, and model size of different modules and upsampling methods in Tables II and IV, which demonstrate that the design of MCN is computationally efficient. Table II presents the model complexity of MCN using different modules. Tables I and II show that, compared with the baseline, MCN improves performance on the ISPRS datasets in terms of mF1 by 4.39/5.43%, OA by 4.94/3.19%, and mIoU by 7.14/8.07% while reducing the number of parameters by 66% and the model size by 65%. Table IV presents the model complexity of MCN using different upsampling methods. These results indicate that, compared with deconvolution, MCN with PSD reduces the number of parameters and model size by 18%.
Compared with deconvolution, MCN with PSD has the same flops but is 7 s faster and requires 11 s less training time. Compared with bilinear interpolation, MCN with PSD is 11 s faster and reduces the number of parameters by 13%, flops by 20%, model size by 16%, and training time by 4 s.
E. Comparison Methods
To conduct a quantitative comparison, we carefully selected a comprehensive set of benchmark methods specifically designed for semantic segmentation of urban scene imagery. Note that all experimental results in Tables V–VIII are obtained from the released source code or provided by the authors.
1) CNN-Based Context Aggregation Networks
Collaborative network with PS layer (ColNet) [25], class perception network (C-PNet) [38], ensemble full CNN-based network (EFCNet-UNet) [18], one-shot neural architecture search for a backbone network (RSBNet) [39], feature-selection network with hypersphere embedding (FSHRNet) [17], and deep feature enhancement method for land cover (EG-UNet) [40].
2) CNN-Based Attentional Networks
Lightweight attention network (LiANet) [41], segmentation network with spatial and channel attention (SCAttNet) [27], dual-channel scale-aware network with position and channel attention (DSPCANet) [29], attentive bilateral contextual network (ABCNet) [34], multiattention network (MANet) [33], and squeeze and excitation residual network (SERNet) [28].
3) Transformer and With or Without CNN-Based Context Enhancement Networks
Fusing swin transformer and CNN-based network (STransFuse) [32], wide-context transformer network (WiCoNet) [31], positioning guidance network (PGNet) [30], enhancing multiscale representations with transformer network (EMRT) [42], distilling segmenters from CNNs and transformers (DSCT) [43], a billion-scale foundation model (UperNet) [44], and foreground saliency enhancement network (RSSFormer) [45].
4) Segmentation Models Based on Redesigned Skip Connections
Multistage attention network (MAResU-Net) [20], semantic segmentation network using multiscale skip connection (MSCA-Net) [21], segmentation network for fine-resolution remotely sensed images (MACU-Net) [22] and 2-D transformer model (2DsegFormer) [19].
5) Segmentation Models Designed for Ultrahigh-Resolution Images (UHR)
Collaborative global–local network (GLNet) [36], progressive semantic segmentation network (MagNet) [45], integrating shallow and deep features network (ISDNet) [46], patch proposal network (PPN) [47], image segmentation via locality-aware contextual correlation network (LCC) [48], and one model is enough for image semantic segmentation (OME) [49].
F. Comparative Study
1) Results on Potsdam Dataset
Table V presents the experimental results of different methods on the Potsdam test set. Our model outperforms the existing land-cover classification methods on Potsdam with an OA of 93.51%, mF1 of 92.85%, and mIoU of 86.81%. Regarding details, our model ranks first in the F1-score for the building and low vegetation subclasses and second for the impervious surface and tree subclasses. Fig. 11 shows the visualization results of MANet [33], MAResU-Net [20], ABCNet [34], EG-UNet [40], U-Net [12], and the proposed MCN on the Potsdam test set. Compared with popular semantic segmentation methods, MCN better handles situations with shadows or complex textures and generates complete shapes of objects, such as buildings, trees, cars, and impervious surfaces, with clear boundaries separating objects.
Visualization results of different land-cover classification methods on ISPRS datasets. (From Left to Right) First four and second four columns demonstrate the Potsdam and Vaihingen test dataset, respectively.
2) Results on Vaihingen Dataset
Table VI presents the experimental results of different methods on the Vaihingen test set. Our model outperforms the existing land-cover classification methods on Vaihingen with an OA of 90.18%, mF1 of 89.25%, and mIoU of 80.86%. Regarding details, our model ranks first in the F1-score for the low vegetation and car subclasses and second for the building and impervious surface subclasses. As Table VI shows, MCN better handles highly imbalanced classes, such as car, owing to its increased receptive field. Notably, on Vaihingen, the F1-score for the car class is 87.40%. The visualization results in Fig. 11 compare the performance of MCN with five other semantic segmentation methods (MANet [33], MAResU-Net [20], ABCNet [34], EG-UNet [40], and U-Net [12]) on the Vaihingen test set. The findings indicate that MCN outperforms these methods in handling challenging scenarios involving shadows or intricate textures, generating precise shapes for objects, such as buildings, trees, low vegetation, and roads, and accurately distinguishing between objects with clear boundaries (i.e., cars).
3) Results on DeepGlobe Dataset
Tables VII and VIII present the experimental results of different methods on test sets of DeepGlobe. Our model on DeepGlobe outperforms the existing land-cover classification methods with an mIoU of 73.73%, OA of 90.56%, and mF1 of 89.60%. Regarding details, our model ranks first in the IoU score for all subclasses except the barren class. Fig. 12 shows the visualization results of MANet [33], MAResUNet [20], ABCNet [34], EG-UNet [40], U-Net [12], and proposed method MCN on the test set of DeepGlobe. DeepGlobe consists of classes with similar visual features and irregular shapes, making their classification challenging. However, our MCN accurately distinguishes between these land-cover categories, resulting in precise segmentation results. These visualized results further convincingly validate the effectiveness of our MCN on satellite images.
Visualization results of different land-cover classification methods on the DeepGlobe test dataset.
4) Discussion
Our model consistently outperforms CNN and transformer-based networks in terms of OA, mF1, and mIoU on well-established benchmark datasets, such as ISPRS and DeepGlobe, demonstrating competitive performance despite utilizing fewer parameters. Using MFE, MCN can effectively capture local and global contextual information. MFE enables the extraction and encoding of features from multiple levels of abstraction, allowing the model to perceive intricate details and holistic scene understanding. Using MLF, MCN can effectively combine features extracted from different levels of abstraction to improve the accuracy and performance of the model. Fusing these features leads to the generation of sharper boundaries, distinguishing between different objects or classes in an image. In addition, our model benefits from PSD. Incorporating PSD enhances the model's ability to generate precise and artifact-free segmentations while minimizing computational requirements.
5) ISPRS Datasets Performance Comparison
Table IX presents a comparison of various methods that employ the ResNet50 backbone. The comparison is based on two key aspects: overall accuracy and the number of parameters. Compared with these methods, MCN achieved an OA of 93.51% on Potsdam and 90.18% on Vaihingen datasets with 19.56 million parameters, maintaining both the number of parameters and high accuracy simultaneously.
Conclusion
This article proposes a novel MCN with three fundamental modules to address interclass similarities, intraclass variations, scale-related inaccuracies, and high computational complexity. MFE is introduced as a feature enhancement module for the backbone network to identify the local and global context of ground objects while suppressing the background noise and capturing complementary features by enabling correlation among different levels of feature maps. MLF is introduced as skip connections where features from different MFE layers are merged before the supervision to produce a single high-level representation of the input data by leveraging the strength of each layer in capturing different levels of features. PSD is introduced as a decoder, which can better fix blurry edges and checkerboard artifacts with fewer parameters while improving network speed and accuracy. Ablation studies and comparative experiments conducted on the ISPRS and DeepGlobe datasets demonstrate the effectiveness of the proposed method. On all three datasets (Potsdam, Vaihingen, and DeepGlobe), MCN achieves the best performance compared with the existing land-cover classification models with fewer parameters.