
Research Advances in Deep Learning for Image Semantic Segmentation Techniques




Abstract:

Image semantic segmentation represents a significant area of research within the field of computer vision. With the advent of deep learning, image semantic segmentation techniques that integrate deep learning have demonstrated superior accuracy compared to traditional image semantic segmentation methods. Recently, the Mamba architecture has demonstrated superior semantic segmentation performance compared to the Transformer architecture, and has consequently become a research focus in this field. Nevertheless, the specifics of the Mamba architecture have remained underexplored in the extant literature. This review provides a comprehensive overview of the latest research progress in deep learning techniques for semantic segmentation. It offers a systematic review of traditional convolutional neural network (CNN)-based architectures and focuses on a series of emerging architectures, including the Transformer architecture, the Mamba architecture, and cutting-edge approaches such as self-supervised learning strategies. For each category, a detailed account is provided of the principal algorithms and techniques employed, together with a report on the performance achieved using datasets commonly used in the field.
Published in: IEEE Access ( Volume: 12)
Page(s): 175715 - 175741
Date of Publication: 12 November 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Image semantic segmentation is a key technology in the field of computer vision. It serves as an indispensable cornerstone for pattern recognition and image understanding and can be understood as the problem of assigning semantic labels to pixel categories [1]. This technology enables the recognition and understanding of different objects and scenes within an image by categorizing each pixel, which provides a streamlined and reliable foundation for subsequent tasks. Consequently, it enables more accurate, efficient, and intelligent image analysis and applications. As a vital component of intelligent processing, semantic segmentation holds broad application prospects in various fields, including self-driving cars [2], [3], medical image diagnosis [4], assistive technology for the blind [5], video surveillance [6], and augmented reality [7], [8].

Over the past decades, image semantic segmentation techniques have been attracting a lot of attention as a core topic in the field of computer vision. With the emergence of research problems and challenges, many researchers have been actively involved in the field and proposed many landmark algorithms through continuous innovation and efforts [9], [10], [11], [12]. These algorithms have not only significantly improved the accuracy and quality of image semantic segmentation, but also offered a strong impetus to the development and innovation of the field. In this paper, we systematically examine and explore the deep learning-based image semantic segmentation algorithms proposed before 2024, elaborate on the working principles of these algorithms, and comprehensively summarize their performance. At the same time, we also discuss the current challenges and potential future research directions in the field, aiming to provide valuable references for the application practice in related fields.

Recently, the field of image semantic segmentation has seen a major breakthrough. With its excellent performance, the Mamba architecture has surpassed the Transformer architecture and has become a research focus in this field. However, existing reviews have not yet introduced and discussed its internal mechanisms. To this end, this article comprehensively surveys the latest research progress of deep learning technology in the field of image semantic segmentation. It not only reviews traditional architectures based on convolutional neural networks (CNNs), including the encoder-decoder structure, skip connection technology, multi-scale feature fusion strategy, and attention mechanism, but also focuses on a series of emerging architectures, including the Transformer architecture, the Mamba architecture, and self-supervised learning strategies. Each technology category has its own theoretical basis and implementation methods. They complement each other and jointly promote the development and improvement of image semantic segmentation technology. To more effectively illustrate the application and evolution of these deep learning models in the field of image semantic segmentation, we have organized the models of each category and presented them in Fig. 1, providing readers with a comprehensive understanding of the evolution of each model type.

FIGURE 1. Classification graphical representation of deep learning based image semantic segmentation algorithms.

Specifically, the Encoder-Decoder architecture provides robust support for pixel-level understanding of images through its precise spatial mapping capability. The Dilated Convolution technique effectively captures contextual information in an image by adjusting the receptive field of the convolution kernel. The multi-scale feature fusion structure enhances the model’s ability to recognize multi-scale objects by integrating features of different resolutions. The introduction of the attention mechanism enables the model to focus on key regions of the image, thereby further improving the accuracy of the segmentation. The Transformer architecture utilizes the self-attention mechanism to achieve the modeling of long-distance dependencies. The Mamba architecture provides an efficient long sequence modeling capability, which enables linear time complexity to be maintained when processing large-scale data; while the self-supervised learning model reduces the dependence on large amounts of labeled data by utilizing the intrinsic structural information of the image.

The rest of the article is organized as follows: in Section II, we provide an overview of the applications and advances in deep learning for image semantic segmentation. In Section III, we examine high-impact image semantic segmentation models, organized by category, up to 2024. In Section IV, we review and elaborate on the most popular current image segmentation datasets. In Section V, we enumerate the performance metrics commonly used when evaluating image semantic segmentation models and present quantitative results and experimental performance of the models. In Section VI, we summarize and outline the main findings and conclusions of this study.

SECTION II.

Image Semantic Segmentation Using Deep Learning

This section aims to provide a comprehensive analysis of the applications of image semantic segmentation, with a particular focus on the recent advances in deep learning for image semantic segmentation. Initially, we will examine the fundamental principles and architectural components of deep learning, and the ways in which it can enhance the performance of image semantic segmentation. Subsequently, we will investigate the diverse application areas of image semantic segmentation.

A. Deep Learning and Semantic Segmentation of Images

With the rapid changes in image acquisition technology, the detailed information of images captured by modern devices has become increasingly complex [13]. However, traditional image semantic segmentation techniques (threshold segmentation [14], edge detection [15], region segmentation [16], clustering methods [17]) can often only cope with simple image models and are very sensitive to interference from complex backgrounds and noise, making it difficult to deal effectively with multi-scale and multi-directional target objects. In this context, the application of deep learning techniques to the field of image segmentation, especially the wide application of convolutional neural networks (CNNs) [18], has revolutionized image semantic segmentation.

Since 2012, CNNs have demonstrated excellent performance in processing image data by virtue of their local connectivity, weight sharing, and related properties, and have become a key tool for constructing efficient classifiers. A CNN mainly consists of four basic structures: 1) the convolutional layer, which filters the input data and extracts local features through a sliding-window filter (convolutional kernel); 2) the activation function, which applies a nonlinear transformation to the output of the convolutional layer so that the model can capture complex features; 3) the pooling layer, which downsamples and filters the features to enhance the translation invariance of the model, thereby simplifying the model and helping to avoid overfitting; and 4) the fully connected layer, which integrates the local features extracted by the convolutional and pooling layers and produces outputs according to the number of classes. During network training, the convolutional and pooling layers work together to build multiple convolution groups, which are jointly responsible for extracting key features from the image. Finally, the extracted features are comprehensively analyzed through a series of fully connected layers to achieve accurate image classification.
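
To make these four components concrete, the following minimal sketch (our own illustrative PyTorch example, not drawn from any cited work; layer sizes are arbitrary) stacks a convolutional layer, an activation function, a pooling layer, and a fully connected classification layer:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal classifier illustrating the four basic CNN components."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 1) convolutional layer: local feature extraction
            nn.ReLU(inplace=True),                       # 2) activation: nonlinear transformation
            nn.MaxPool2d(kernel_size=2),                 # 3) pooling: downsampling, translation invariance
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # 4) fully connected layer: classification

    def forward(self, x):                # x: (N, 3, 32, 32)
        x = self.features(x)             # -> (N, 32, 8, 8)
        x = torch.flatten(x, 1)          # flatten for the fully connected layer
        return self.classifier(x)        # -> (N, num_classes)

logits = TinyCNN()(torch.randn(2, 3, 32, 32))  # torch.Size([2, 10])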

Deep learning technology [19], [20], [21] has become a research hotspot in the field of image semantic segmentation with its unique advantages, which are mainly reflected in the automatic feature learning ability, multi-scale information integration, high-precision segmentation ability, strong generalization ability, and end-to-end training ability. This technique not only simplifies the tedious feature design process in traditional methods, but also realizes the fine segmentation of image content through complex network structure and optimization algorithms, and at the same time shows good generalization performance on different datasets and tasks, which lays a solid foundation for the wide application of image semantic segmentation [22]. Currently, deep learning libraries widely provide pre-trained weights for the following CNN architectures: LeNet [18], AlexNet [23], VGGNet [24], MSRANet [25], GoogLeNet [26], Inception [27], ResNet [28], DenseNet [29], MobileNet [30], EfficientNet [31], ShuffleNet [32], SENet [33], SKNet [34]. As research deepens, more CNN-based network models have emerged. In Section III, we will explore the deep learning-based image semantic segmentation architecture to gain a deeper understanding of their key roles in image segmentation and computer vision.

B. Applications of Semantic Segmentation of Images

In this section, we discuss various application areas for semantic segmentation, including remote sensing, medical imaging, and video processing. For each domain, we emphasize the unique challenges and opportunities that arise.

1) Semantic Segmentation of Remote Sensing Images

Remote sensing [35], literally “sensing from a distance”, refers to the use of modern optical and electronic detection instruments to receive and record, without contact with the target object, the electromagnetic wave signals radiated (or reflected) by a distant object, and then process them into images that the human eye can directly interpret, thereby revealing the nature of the detected object and its patterns of change. Remote sensing technology can be categorized into spaceborne, airborne, and ground-based according to the height of the remote sensing platform, and the images collected by these remote sensing platforms are the data cornerstone of semantic segmentation technology [36], [37].

Combining artificial intelligence technology, remote sensing image data, and practical remote sensing applications to establish a satellite remote sensing application system based on artificial intelligence has become an inevitable trend in the development of satellite remote sensing technology [38], [39], [40]. In recent years, semantic segmentation methods based on deep learning have set off a revolutionary wave in the field of remote sensing image processing [41], [42], [43], [44], [45]: large-scale data samples are used to train deep convolutional neural network models and realize the automatic extraction of surface geomorphology information from high-resolution remote sensing images. At the same time, deep learning technology is used to explore the rich information behind the data samples, extract diverse and fine-grained semantic feature representations from remote sensing images of different resolutions, and make more accurate predictions on unknown remote sensing data.

With the development of aerospace and sensor technology, remote sensing satellites have moved from traditional low-resolution imagery to high-resolution imagery, and have also achieved a transition from a single data source to diversified, multimodal data fusion [46]. In addition, the combination of satellite remote sensing platforms and drone aerial photography technology has opened up new paths for fine scene analysis in low-altitude remote sensing [47], [48], [49]. With their flexible aerial photography capabilities, drones have greatly enriched the ability to capture tiny details of the earth’s surface, providing more accurate datasets for artificial intelligence processing. For remote sensing images collected by multiple platforms such as satellites, aircraft, and drones, pixel-level classification covers common land cover categories such as forests, crops, buildings, water bodies, grasslands, and roads, and plays an irreplaceable role in surface resource and environmental monitoring [50], crop yield estimation [51], [52], disaster monitoring [53], global change [54], smart cities [55], [56], military reconnaissance [57], and other fields, providing scientific decision-making information for governments and industry departments at all levels. In the future, it will be necessary to integrate knowledge and technology from different fields, focus on integration and innovation between disciplines, and promote overall progress in the field of remote sensing.

2) Medical Image Analysis

Medical image analysis mainly focuses on four aspects: lesion detection [58], image segmentation [59], image alignment [60], and image fusion [61], of which image segmentation techniques are crucial as they are able to recognize and segment medical images to assist in further diagnosis and intervention. Many current deep learning-based medical image segmentation can process various forms of medical images, mainly including computed tomography (CT) [62], magnetic resonance imaging (MRI) [63], and ultrasound imaging (UI) [64].

Medical image segmentation technology [65] divides an image into several regions based on the similarities or differences between regions in a two-dimensional slice image. It discovers lesions by identifying each highlighted region of interest (ROI). It plays a vital role in computer-aided diagnosis and intelligent medical care, greatly improving the efficiency and accuracy of diagnosis, such as detecting brain tumor boundaries in MRI images, pneumonia infection in X-rays, and cancer in biopsy sample images. Due to the complexity of the human anatomical structure and the integration of functions, a single image segmentation technique often struggles to achieve satisfactory segmentation results on general images. Therefore, the current research trend is more inclined to explore the organic integration of multiple segmentation algorithms, aiming to improve overall performance through complementary advantages. At the same time, given the limitations of fully automated image segmentation, human-computer interactive segmentation methods are gradually becoming a focus of research.

Many medical image segmentation problems are binary classification problems, which assign labels to specific semantic content in the image (such as tumors, organs, and blood vessels) [66], [67], [68]; the number of categories is generally far smaller than in natural image datasets (landscapes, people, animals, and cars), such as Pascal VOC [69] with 21 categories and Cityscapes [70] with 30 categories. There are many image modalities in the medical field, and the accurate annotation of each type of image requires deep expertise in a specific medical domain. In particular, the annotation of MRI and microscope images is especially complex [71]. Generally speaking, although medical image sets obtained through widely used scanning technologies such as ultrasound and X-rays may have simpler structures and clearer boundaries, which can simplify the annotation process to a certain extent, their total volume is still limited. This situation is partly due to patient privacy protection and strict medical regulations, which together hinder the large-scale collection and sharing of medical images.

In order to alleviate the problem of the insufficient scale of specific datasets, challenges are regularly held in the field of medical image segmentation (APIS [72], MICCAI [73], MSD [74], ACDC [75]). These competitions not only promote academic exchange, but also provide carefully annotated, publicly accessible medical image datasets as valuable resources. These challenge datasets not only promote the rapid development of semantic segmentation model research, but are also widely adopted as benchmark datasets for evaluating the performance of segmentation algorithms, playing a vital role in improving the accuracy and efficiency of medical image processing.

3) Video Scene Understanding

Applications such as autonomous driving, virtual reality (VR), augmented reality (AR), human-computer interaction, image search and editing all rely on an in-depth understanding of the complete scene, in which semantic segmentation techniques are particularly important in video understanding. In the face of high-resolution videos, traditional methods tend to reduce them to static image sequences for processing, a practice that ignores the temporal continuity inherent in videos, thus limiting the accuracy and practicality of segmentation [76].

The core challenge of video semantic segmentation focuses on how to effectively integrate contextual information in the temporal dimension to accurately capture the subtle and critical dynamic changes between video frames [77]. Although this task is crucial to improving segmentation quality, it is often accompanied by a sharp increase in computational complexity. In order to balance computational efficiency and segmentation quality, the research community is actively seeking innovative ways [78]. Among these, feature reuse and Image Warp have become key strategies for reducing computing costs [79]. These techniques significantly reduce unnecessary repeated calculations by intelligently reusing feature information from the previous frame or frames and appropriately adjusting this information to adapt to changes in the current frame, which not only improves processing speed but also maintains high segmentation accuracy. In addition, with the launch of large-scale video segmentation datasets such as Cityscapes and CamVid, researchers have obtained rich experimental resources for testing, verifying, and optimizing various video segmentation algorithms. These datasets not only contain rich video content, but also provide precise annotation information, providing a solid foundation for algorithm improvement.

The introduction of video optical flow technology has brought breakthrough progress to video semantic segmentation [80]. Optical flow technology can accurately track the motion trajectory of pixels or feature points in video frames, thereby effectively capturing and expressing temporal context information. By integrating optical flow information into the semantic segmentation model, the model can better understand the dynamic changes in the video and make more accurate and robust segmentation decisions. This fusion strategy not only improves the accuracy of segmentation, but also enhances the model’s adaptability to complex video scenes [81].

SECTION III.

Deep Learning Based Image Segmentation Techniques

Prior to the application of convolutional neural networks (CNNs), researchers used random forests and conditional random fields (CRFs) to build classifiers for semantic learning [82], [83]. Later, deep learning methods produced a new generation of segmentation models with significantly improved performance, which became the mainstream solution for semantic segmentation. In 2014, Long et al. [84] pioneered the fully convolutional network (FCN), which uses convolutional layers to replace the fully connected layers at the end of a traditional CNN. It can process images of any size and generate segmentation maps of the same size, truly realizing end-to-end semantic segmentation, and is a landmark contribution of deep learning in the field of image semantic segmentation. The FCN architecture is shown in Fig. 2.

FIGURE 2. The fully connected layer of the feature backbone in the FCN model is replaced by a convolutional layer to achieve end-to-end semantic segmentation of images. From [84].
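
A minimal sketch of the FCN idea is given below (our own illustrative re-implementation, not the authors’ code; the backbone, channel sizes, and the use of bilinear upsampling instead of learned deconvolution are simplifications): the fully connected classifier is replaced by a 1×1 convolution that produces per-class score maps, which are then upsampled to the input resolution so that the network accepts images of any size.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """FCN-style network: fully convolutional backbone, 1x1 conv classifier, upsampling."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(                      # stand-in for a VGG-style feature extractor
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)  # replaces the fully connected head

    def forward(self, x):                                   # x: (N, 3, H, W), any H and W
        score = self.classifier(self.backbone(x))           # coarse per-class score map (H/8, W/8)
        return F.interpolate(score, size=x.shape[-2:],      # upsample back to the input resolution
                             mode="bilinear", align_corners=False)

out = TinyFCN()(torch.randn(1, 3, 224, 224))                # torch.Size([1, 21, 224, 224])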

The FCN model is favored for its popularity and efficiency in image segmentation tasks, but it also faces some challenges and limitations: the segmentation results are not fine enough and are insensitive to details in the image; the relationships between pixels are not considered, so the spatial regularization step of conventional methods is not fully utilized and spatial consistency is lacking; and the network structure is complex with slow inference speed. Many subsequent advances in deep learning-based semantic segmentation of images have innovated and expanded on the seminal work of the FCN model. In this section, we explore representative models and their core technologies from the perspective of their working mechanisms and core concepts, focusing on seven aspects: (1) encoder-decoder architecture; (2) dilated convolution technique; (3) multi-scale feature fusion strategy; (4) attention mechanism; (5) Transformer architecture; (6) Mamba architecture; (7) self-supervised learning strategy.

A. Encoder-Decoder Based Model

When using a CNN for image semantic segmentation, its pooling layers can expand the receptive field, but at the same time they reduce the resolution of the feature map and lose spatial location information. This problem is addressed by introducing an encoder-decoder structure, the core idea of which is to convert an input sequence into an output sequence, hinging on two components: the encoder and the decoder. The encoder usually consists of an efficient backbone of pre-trained classification networks, such as VGG [24], Inception [27], and ResNet [28]. Its purpose is to reduce the spatial scale of the input image and extract feature maps with contextual semantic information. The decoder, usually implemented by bilinear interpolation, deconvolution, or up-sampling operations, transforms and accurately maps the feature maps back to the same resolution as the original image to construct an accurate pixel-level segmentation map. In the decoding process, in order to recover high-level semantic information from low-level features, there is usually a direct information connection between the decoder and the encoder, which is called a skip connection. Through skip connections, the decoder is able to utilize the features at the corresponding level in the encoder to better reconstruct the image and preserve semantic information, thus generating more accurate segmentation results.

Ronneberger et al. [85] used a symmetric encoder-decoder structure, and the U-Net architecture was obtained by following the principle of FCN with corresponding improvements, which solved problems such as slow inference speed. As shown in Fig. 3, the core features of U-Net lie in the unique U-shaped network structure and Skip Connection mechanism, which is especially suitable for semantic segmentation tasks in medical images. In the encoder, high-level semantic features are progressively extracted through successive convolution and pooling operations, which also reduce the spatial resolution. In the right half, which is the decoder, the spatial resolution is incrementally restored via up-sampling operations. Concurrently, multi-scale feature fusion is achieved through skip-connections, by merging feature information from corresponding layers in the encoder.

FIGURE 3. The U-Net model is particularly suitable for medical image segmentation tasks. From [85].

U-Net’s distinctive “U”-shaped design enables it to combine features from both the lower and upper layers of an image. The lower layers, rich in spatial and edge information, facilitate the accurate segmentation of local structures within the image. Meanwhile, the higher layers, rich in contextual semantic information, aid in understanding the relationships between the target and its surrounding environment. This multi-scale fusion mechanism enables U-Net to obtain good performance in image segmentation tasks; as an efficient and practical network structure it has been derived into many variants and has demonstrated strong application potential in many fields beyond medicine. However, the encoder and decoder features concatenated by U-Net’s symmetric skip connections can differ considerably in semantic level, and the direct splicing strategy increases the difficulty of network learning. Building on U-Net, Qin et al. [86] proposed U2-Net by introducing a new module, the ReSidual U-block (RSU), which can capture more contextual information from different scales and increase the depth of the network without significantly increasing the computational cost.
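
The skip-connection mechanism can be sketched as follows (a simplified two-level U-Net-style example, not the original implementation; channel counts are assumptions): encoder features are concatenated with upsampled decoder features at the matching resolution before further convolution.

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net sketch: encoder, decoder, and a concatenation skip connection."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(3, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)  # decoder upsampling
        self.dec1 = double_conv(64, 32)          # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                        # high-resolution, low-level features
        e2 = self.enc2(self.pool(e1))            # low-resolution, high-level features
        d1 = self.up(e2)                         # restore spatial resolution
        d1 = self.dec1(torch.cat([e1, d1], dim=1))  # skip connection: concatenate encoder features
        return self.head(d1)

out = TinyUNet()(torch.randn(1, 3, 64, 64))      # torch.Size([1, 2, 64, 64])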

The downsampling strategy of FCN does not fully consider the pixel-to-pixel contextual relationships and the overall consistency of objects, resulting in insufficiently fine segmentation predictions. To address shortcomings such as the lack of a reasonable mapping mechanism from the feature map back to the spatial resolution of the input image, Badrinarayanan et al. [87] proposed the SegNet network model based on FCN. As shown in Fig. 4, the encoder network uses the first 13 convolutional layers of VGG16 and discards the subsequent fully connected layers; it classifies and analyzes the low-level local pixel values of the image to obtain higher-order semantic information. After collecting this semantic information, the decoder uses unpooling to up-sample the feature map, and the up-sampled map is then convolved to make up for the loss of boundary detail caused by the pooling layers in the encoder, thus refining the geometry of objects.

FIGURE 4. The SegNet model uses a symmetric encoder-decoder structure without a fully connected layer. From [87].

The uniqueness of this network lies in its encoder-decoder design: up-sampling operations appear only in the decoder network and form an exact one-to-one correspondence with the pooling layers in the encoder. The advantage is that the up-sampling process directly reuses the index information generated during pooling, which reduces the learning cost, reduces the total number of parameters of the model, and yields a lightweight mechanism. However, SegNet is not without its flaws. It struggles with smaller objects, resulting in poor segmentation outcomes. Consequently, enhancing the accuracy of small object segmentation remains a challenging issue for future research.
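
The index-based up-sampling can be sketched as follows (illustrative only, using PyTorch’s MaxPool2d/MaxUnpool2d rather than the authors’ implementation): the encoder stores the argmax locations of each pooling operation, and the decoder reuses them to place values back at their original positions before convolutions refine the result.

import torch
import torch.nn as nn

class SegNetBlock(nn.Module):
    """One encoder/decoder pair sketching SegNet-style unpooling with stored indices."""
    def __init__(self, ch=16, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, return_indices=True)      # keep argmax positions
        self.unpool = nn.MaxUnpool2d(2)                       # reuse them for upsampling
        self.dec = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch, num_classes, 3, padding=1))

    def forward(self, x):
        f = self.enc(x)
        p, idx = self.pool(f)                         # downsample and remember where the maxima were
        u = self.unpool(p, idx, output_size=f.shape)  # sparse upsampling guided by the indices
        return self.dec(u)                            # convolution fills in the missing boundary detail

out = SegNetBlock()(torch.randn(1, 3, 64, 64))        # torch.Size([1, 2, 64, 64])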

Subsequently, Noh et al. [88] proposed a fully symmetric deep deconvolutional network (DeconvNet). This network consists of a convolutional part and a deconvolutional part. The convolutional part consists of VGG16 with the fully connected layers removed plus downsampling layers, which gradually reduce the size of the feature map and convert it to a lower-dimensional representation, extracting as many rich low-level and high-level features as possible. The deconvolutional part is the mirror image of the convolutional part, including deconvolution layers, upsampling layers, and fusion layers; it uses upsampling operations to recover the spatial dimensions of the feature map and fuses the features extracted in the convolutional part to minimize the loss of information. In this way, DeconvNet is able to keep the input and output the same size while retaining the fine structure and multi-scale information in the image.

The Global Convolutional Network (GCN) [89] uses a pre-trained ResNet as the feature extraction network and FCN as the segmentation framework, and a GCN structure is designed to generate a multi-scale semantic classification map for each category. In order to balance semantic and spatial information, the features are fed into the GCN module for processing before fusion; this module discards fully connected and pooling layers and processes the features with full convolution to avoid the loss of localization information. A larger convolution kernel is also used to create a dense connection between the feature map and the classification layer, thus enhancing translation invariance. A boundary refinement block based on a residual structure is also proposed to learn boundary information and thereby optimize the boundary prediction of segmented objects; unlike CRF post-processing, it can be integrated into the network for end-to-end training.

In order to solve the problem of FCN ignoring global context information in the segmentation task, Liu et al. [90] proposed the ParseNet model, which fuses global and local features through both early fusion and late fusion; its network architecture is shown in Fig. 5. Early fusion merges the features directly before classification, while late fusion classifies the features separately and then integrates the results; the two fusion schemes perform similarly without additional processing. ParseNet normalizes the features and learns scale parameters during fusion to ensure fusion stability, and ultimately achieved the best performance at that time on the PASCAL-Context dataset.

FIGURE 5. ParseNet has two main issues to consider in the process: the timing of fusion and the normalization of scales. From [90].

To address the computational complexity problem of FCN, Wu et al. [91] proposed FastFCN with the network architecture shown in Fig. 6, which introduces Joint Pyramid Upsampling (JPU), which replaces the time-consuming and memory-consuming massive convolutional operation by transforming the extraction task of high-resolution feature maps into a joint upsampling problem. The FastFCN method using JPU reduces the computational complexity by more than three times without loss of performance.

FIGURE 6. FastFCN proposes a novel joint upsampling module, joint pyramid upsampling (JPU), by formulating the task of high-resolution feature map extraction as a joint upsampling problem. From [91].

Subsequently, the widespread application of FCN architectures in the field of image segmentation has continued to deepen and has led to significant technological breakthroughs and developments. For example, in the field of panoptic segmentation, Li et al. [92] proposed the Panoptic FCN framework, which represents and predicts foreground objects (things) and background regions (stuff) in a unified fully convolutional pipeline. In the field of medical image segmentation, Christ et al. [93] solved the CT image segmentation problem by cascading FCNs. In the field of point cloud segmentation, Wen et al. [94] proposed a Direction Constrained Fully Convolutional Neural Network (D-FCN), which is specifically designed for airborne LiDAR point cloud classification tasks.

Several other studies in this category of models have also used transposed convolution or encoder-decoder structures for image segmentation, such as the High-Resolution Network (HRNet) [95], the Stacked Deconvolution Network (SDN) [96] for RGB-D segmentation, LinkNet [97], and W-Net [98]. The design advantage of this family lies in its high adaptability, which allows researchers to select appropriate encoder and decoder algorithms for the specific needs of the application domain, making it widely used in natural language processing, image processing, and other fields.

B. Models Based on Dilated Convolution Technique

In the process of semantic segmentation of images using CNNs, the receptive field of the feature map is increased by stacking convolutional and pooling layers, but this results in a reduction in the size of the feature map and loss of internal data structures and the spatial locations of pixels. Such problems created a bottleneck for semantic segmentation, which can be avoided by dilated convolution [99], which increases the receptive field while keeping the feature map size constant.

Atrous (dilated) convolution increases the receptive field by adding a hyperparameter to the standard convolution, the dilation rate, which specifies the spacing between the elements of the convolution kernel. This optimized convolution structure is used instead of traditional operations such as convolution and pooling to increase the receptive field while maintaining the feature map size and the original image size. The expanded receptive field helps to detect and segment large targets, and maintaining the feature map resolution helps to localize targets accurately. As shown in Fig. 7, different dilation rates produce different receptive field sizes, thereby acquiring multi-scale information. Currently, atrous convolution is available in most deep learning frameworks and has a very wide range of applications in semantic segmentation, object detection, and other fields.

FIGURE 7. Three different convolutions with kernel size = 3, stride = 1, and dilation rates of 1, 2, and 4, in that order. From [99].
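
In code, the dilation rate is a single argument of the convolution (an illustrative PyTorch sketch, not taken from [99]); with padding equal to the dilation rate for a 3×3 kernel, the spatial size is preserved while the receptive field grows:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
for rate in (1, 2, 4):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=rate, padding=rate)
    # effective kernel size = 3 + 2*(rate - 1) -> 3, 5, 7, so the receptive field grows,
    # while padding = rate keeps the output at the same 56x56 resolution
    print(rate, conv(x).shape)   # all print torch.Size([1, 64, 56, 56])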

Four versions of the DeepLab series were released from 2015 to 2018. DeepLab V1 [100] uses VGG16 as the backbone network for extracting features and then performs fine-grained classification of pixels. The network has two core components: dilated convolution and the CRF. First, the ordinary convolutions of VGG are replaced by dilated convolutions to expand the receptive field of the feature maps, and then boundary optimization is performed on the resulting segmentation maps by an added fully connected conditional random field (CRF). The basic framework of DeepLab is shown in Fig. 8. The DeepLab V1 paper proposed three variants, DeepLab-7×7, DeepLab-Msc, and DeepLab-Msc-LargeFOV, with DeepLab-Msc-LargeFOV achieving the best results among the different network structures.

FIGURE 8. DeepLab expands the receptive field by atrous (dilated) convolution and introduces a fully connected conditional random field (CRF) to optimize the segmentation output. From [100].

In 2015, Wang et al. [101] introduced a module for dense prediction that employs dilation convolution to gather multi-scale contextual information systematically without sacrificing resolution. This demonstrated that a simplified adaptive network could enhance the accuracy of image segmentation. Subsequently, the authors further proposed the dilated residual network (DRN) [102], which significantly outperforms the corresponding ResNet model for segmentation without increasing the depth and complexity of the model.

However, dilated convolution is prone to spatial gaps during operation, leading to problems such as information loss and information irrelevance. To improve the segmentation effect, Wang et al. [103] proposed two methods to refine convolution-related operations: dense upsampling convolution (DUC) captures the detailed information lost during bilinear upsampling, while hybrid dilated convolution (HDC) expands the receptive field to aggregate global information by forming a block from a serial set of dilated convolutions, reducing the “gridding” problem of dilated convolutions. EncNet, proposed by Zhang et al. [104], utilizes a context encoding module to obtain the global context information of an image and selectively emphasizes the category information associated with the scene. On this basis, a semantic segmentation framework is proposed that combines a pre-trained ResNet with a dilated convolution strategy and a multi-scale strategy.

C. Models Based on Multi-Scale Feature Fusion

In image segmentation tasks, multi-scale feature fusion improves accuracy. It improves the understanding of objects and scene content by capturing feature information at different scales and levels. Two common approaches for multi-scale feature fusion are parallel multi-branch networks and skip connections. Parallel multi-branch networks fuse features at different scales to obtain a more comprehensive feature representation. Skip connections fuse shallow features with deeper features by introducing short connections in the network. Both methods help the network better integrate contextual information, improve the model’s adaptability at different scales, and achieve more accurate semantic segmentation.

1) Skip Connection

In neural networks, there are two basic forms of skip connection: addition and concatenation. In the Deep Residual Network (ResNet) proposed by He et al. [28], information from earlier layers is passed to deeper layers by element-wise addition, with no additional parameters. The depth of the network is crucial for the performance of the model; however, when the number of layers increases beyond a certain depth, a degradation problem occurs, i.e., the model accuracy saturates or even declines. In addition, deep networks face the problem of vanishing or exploding gradients, which makes deep learning models difficult to train. To address these problems, ResNet introduces a residual learning mechanism, which effectively mitigates the degradation phenomenon, and also uses BN layers to accelerate training, which mitigates the effect of vanishing or exploding gradients.

Two years later, ResNet and Inception were combined to create the simpler ResNeXt model [105]. This model borrowed the “split-transform-aggregate” concept of Inception, introduced group convolution in the residual module, and added a new variable, “cardinality”, as another key dimension in addition to depth and width. Compared with the traditional ResNet, ResNeXt shows superior segmentation performance while maintaining a similar number of parameters. Huang et al. [29] introduced a dense convolutional network architecture called DenseNet, whose innovation is that each layer in the network establishes direct connections to all subsequent layers. This design strategy aims to maximize the reuse and efficient flow of feature information. Dense connectivity is not only regarded as one of the key means of improving the performance and efficiency of convolutional neural networks, but it also effectively mitigates the gradient vanishing problem, significantly enhances the propagation of features in the network, and encourages feature reuse. More importantly, this architecture greatly reduces the number of parameters of the model, thus improving its efficiency and generalization ability. To further optimize previous networks containing shortcut structures (e.g., DenseNet and ResNet), Wang et al. [106] designed CSPNet to achieve richer gradient combinations while reducing the computational effort.
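
The two fusion styles, addition and concatenation, can be contrasted in a short sketch (a schematic PyTorch illustration, not the original ResNet or DenseNet code; channel sizes and the growth rate are assumptions):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style skip connection: features are fused by element-wise addition."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))          # identity shortcut, no extra parameters

class DenseLayer(nn.Module):
    """DenseNet-style skip connection: input and new features are concatenated."""
    def __init__(self, ch, growth=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, growth, 3, padding=1))
    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)   # later layers see all earlier features

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])  (channels unchanged)
print(DenseLayer(64)(x).shape)      # torch.Size([1, 96, 32, 32])  (channels grow by 32)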

2) Pyramid Structure

In order to solve the problems such as convolutional layers requiring image fixed input size, He et al. [107] proposed Spatial Pyramid Pooling Structure (SPP) and verified its effectiveness in semantic segmentation.

Ghiasi and Fowlkes [108] proposed the Laplacian pyramid reconstruction and refinement model (LRR), which reconstructs features extracted from different convolutional layers through a Laplacian pyramid-based multi-resolution reconstruction architecture [109]. LRR represents a feature map as a set of basis functions and, after introducing boundary information using a cross-layer method [84], fuses high-level semantic information with low-level detail information to generate more accurate and detailed segmentation results.

Lin et al. [110] proposed RefineNet, a generic multi-path refinement network, to recover the lost resolution information. The architecture of this network is shown in Fig. 9 and includes three major modules: the Residual Convolution Unit (RCU), Multi-Resolution Fusion, and Chained Residual Pooling. It utilizes all the information available from the downsampling process to achieve high-resolution prediction through long-range residual connections. This approach can use the fine-grained features captured by the early convolutional layers to precisely adjust the high-level semantic features captured by the deeper layers, achieving more accurate segmentation results.

FIGURE 9. RefineNet is a general-purpose multipath optimization network that explicitly uses all available information from the downsampling process to achieve high-resolution predictions using long-range residual connections. From [110].

The PSPNet proposed by Zhao et al. [111] integrates contextual feature information from different regions through spatial pyramid pooling (SPP) on top of a traditional FCN with dilated convolution, and then up-samples and concatenates these features to form a feature map containing both local and global contextual information. This layered global prior information improves the model’s segmentation performance in complex scenes. It is worth noting that the number of pyramid levels and the size of each level depend on the size of the feature map fed into the SPP module. The PSPNet network structure is shown in Fig. 10.

FIGURE 10. PSPNet aggregates contexts from different regions, giving the model the ability to understand global contextual information. From [111].
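
A simplified pyramid pooling module might look like the following (our own sketch; the bin sizes 1, 2, 3, and 6 follow the commonly reported PSPNet configuration, while other details are assumptions and differ from the released code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several bin sizes, reduce, upsample, concatenate."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),                 # global / regional context
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                    for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)        # local features + multi-scale context

feat = torch.randn(1, 2048, 60, 60)
print(PyramidPooling()(feat).shape)                    # torch.Size([1, 4096, 60, 60])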

The Feature Pyramid Network (FPN) [112] contains four main components: a bottom-up network, a top-down network, lateral connections, and convolutional fusion. Among them, the bottom-up network uses a ResNet structure by default to extract semantic information. Since shallow feature maps are suitable for detecting small targets and deep feature maps are suitable for detecting large targets, FPN fuses the high-level semantic information of the deep layers into the shallow layers, where recognition is then carried out, which in turn improves multi-scale detection.
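
The lateral connections and top-down pathway can be sketched as follows (a simplified illustration; channel counts and the use of nearest-neighbor upsampling are assumptions, not the reference implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down pathway with lateral 1x1 connections, as in FPN."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels])

    def forward(self, feats):                 # feats: [C2, C3, C4, C5], high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):          # top-down: C5 -> C2
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # P2..P5

feats = [torch.randn(1, c, 64 // 2**i, 64 // 2**i) for i, c in enumerate((256, 512, 1024, 2048))]
outs = TinyFPN()(feats)
print([o.shape[-1] for o in outs])            # [64, 32, 16, 8], each with 256 channels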

Xiao et al. [113] proposed a new network with a hierarchical structure by using the feature pyramid network (FPN) as the backbone network, and added the pyramid pooling module (PPM) in the last layer to learn the differentiated data from different image datasets.

However, the computational burden brought by the traditional FPN reduces the semantic information, making the final feature combination unsatisfactory. Thus, Quyen et al. [114] proposed an Enhanced Feature Pyramid Network (EFPN), which, instead of fusing the four feature scales, processes the three lower scales separately to provide contextual information and uses the largest feature as a coarse information branch. Each contextual feature is connected to the coarse branch to generate an individual prediction. By deploying this architecture, individual predictions can efficiently segment specific target sizes. Finally, the score maps are fused together in order to collect salient weights from the different predictions.

Compared to DeepLab V1, DeepLab V2 [115] replaces VGG with ResNet and adds the Atrous Spatial Pyramid Pooling (ASPP) module. ASPP captures detail and contextual information at different resolutions in the image by employing dilated convolutions with different dilation rates in parallel, resulting in more accurate segmentation results. In addition, segmentation boundaries are improved by combining the DCNN with a fully connected CRF. Subsequently, Chen et al. [116] proposed DeepLab V3, which improved the ASPP module and discarded the CRF post-processing. Dilated convolutions in both cascaded and parallel modules were applied in ASPP, and 1×1 convolutional layers and batch normalization were added, which deepened the network, but the network did not retain enough shallow feature information. To solve this problem, a decoder module was introduced in DeepLab V3+ [117] to fuse low-level features with high-level features, thus improving segmentation boundary accuracy. More levels of depthwise separable convolution are also used to optimize the processing speed of the network. However, the ASPP module in the DeepLab family does not sample the dilation rates densely, leading to the loss of a large amount of information and thus limiting the range of the receptive field. For this reason, the combination of ASPP and DenseNet constitutes DenseASPP [118], which combines the outputs of each dilated convolution in a densely connected manner to obtain a larger feature receptive field.
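
A condensed ASPP sketch is shown below (illustrative only; the parallel 1×1 and dilated 3×3 branches plus image-level pooling follow the DeepLab V3 description, while the exact rates and channel widths are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions plus image-level pooling."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1)])           # 1x1 branch
        self.branches += [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)   # dilated branches
                          for r in rates]
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),               # image-level context
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [b(x) for b in self.branches]
        outs.append(F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                                  align_corners=False))
        return self.project(torch.cat(outs, dim=1))

print(ASPP()(torch.randn(1, 2048, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])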

He et al. [119] proposed the Adaptive Pyramid Context Network (APCNet), which constructs multiscale context vectors by multiple Adaptive Context Modules (ACMs), processed in parallel, with each ACM utilizing the global image representation as a guide to estimate the affinity coefficients of each local subregion.

D. Models Based on Attention Mechanisms

Semantic segmentation models based on attention mechanisms have received widespread attention in recent years with the rise of attention mechanisms. Unlike multi-scale fusion based methods, the self-attention mechanism can adaptively select and focus on important feature information to improve the performance and efficiency of the network. In 2014, the attention mechanism was first introduced into the field of computer vision. Later, Chen et al. [120] incorporated the attention mechanism into semantic segmentation networks in Attention to Scale, as shown in Fig. 11, where the attention model weights multi-scale features according to the target scale in the image. This attention model outperforms average pooling and max pooling, and allows the model to visualize the importance of features at different locations and scales.

FIGURE 11. The attention model learns to softly weight multi-scale features at each pixel location. From [120].

The physical structure of the convolutional kernel constrains the information flow to localized regions, limiting the understanding of complex scenes. For this reason, Zhao et al. [121] proposed the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint, connecting each position on the feature map to all other positions through adaptively learned attention masks to achieve bidirectional information propagation.

In order to capture richer contextual information, Fu et al. [122] proposed the Dual Attention Network (DANet), which removes the last two downsampling modules from the ResNet backbone, adds dilated convolutions for feature extraction, and introduces two parallel attention modules: a position attention module and a channel attention module. Combining the two attention mechanisms improves feature extraction and context awareness, and strengthens the relationships between important features and between pixels across different channels. The Object Context Network (OCNet) [123], proposed at the same time as DANet, also incorporated the idea of the attention mechanism for segmentation and proposed Object Context Pooling (OCP). It adopted a multi-scale feature fusion strategy, utilized the relationships and contextual information between objects in the image, and gradually aggregated information in a top-down manner, thereby improving segmentation in complex scenes. Later, Huang et al. [124] argued that long-distance dependencies are also conducive to capturing useful contextual information and proposed the cascaded model CCNet, which decomposes a complex segmentation task into multiple simple subtasks and gradually acquires the contextual information of surrounding pixels along criss-cross paths through the criss-cross attention (CCA) module, improving the segmentation of objects at different scales. Compared with CCNet, EMANet [125], with its proposed expectation-maximization attention mechanism (EMA), discards the computation of the attention map over the full image, which greatly reduces the complexity.
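
The position attention idea can be sketched as follows (a simplified, illustrative version of spatial self-attention over a feature map; the channel-reduction factor and residual weighting are assumptions rather than the exact DANet implementation):

import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """DANet-style position attention: every spatial location attends to every other location."""
    def __init__(self, ch=512, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // reduction, 1)
        self.key = nn.Conv2d(ch, ch // reduction, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))       # learned residual weight

    def forward(self, x):                               # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (N, HW, C')
        k = self.key(x).flatten(2)                      # (N, C', HW)
        attn = torch.softmax(q @ k, dim=-1)             # (N, HW, HW) pairwise affinities
        v = self.value(x).flatten(2)                    # (N, C, HW)
        out = (v @ attn.transpose(1, 2)).reshape(n, c, h, w)
        return self.gamma * out + x                     # residual fusion with the input

print(PositionAttention()(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 512, 32, 32])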

In 2020, Zhong et al. [126] proposed the squeeze-and-attention network (SANet) based on SENet [127], in which the SA module imposes pixel-group attention on conventional convolution by introducing an “attention” convolution channel, thereby effectively considering the interdependence between spatial channels and strengthening the feature representation. However, Yang et al. [128] considered the quadratic computational complexity of self-attention (SA) to be inefficient and proposed FocalNet, using a focal modulation module instead of SA. Focal modulation consists of three components: focal contextualization implemented by a stack of depthwise convolutional layers, gated aggregation, and an element-wise affine transformation. This lightweight design focuses more on multi-scale context than the self-attention mechanism, thus demonstrating performance superior to the best self-attention models at the time. Li et al. [129] introduced a new Outlook Attention mechanism in VOLO. Unlike self-attention, this mechanism focuses on encoding features and context at a finer granularity, which is crucial for recognition performance but often overlooked in traditional self-attention. VOLO uses a two-stage architectural design to implement fine-grained encoding of token representations and global information aggregation, thereby enhancing model performance.

Li et al. [130] chose to incorporate the attention mechanism into a spatial pyramid architecture and constructed a Pyramid Attention Network (PAN), which is capable of accurately capturing and extracting multi-level, dense feature information to optimize the pixel-level labeling task. Erisen [131] proposed SERNet-Former, whose encoder uses a unique efficient residual network, Efficient-ResNet, which improves the efficiency of feature fusion by means of attention-boosting gates (AbGs) and attention-boosting modules (AbMs); an additional attention-fusion network (AfN) inspired by the AbMs is also introduced in the decoder, improving the efficiency with which semantic information is converted during the up-sampling process.

In addition to the above-mentioned networks, many other semantic segmentation networks based on the attention mechanism have been proposed, such as APCNet [119], CBAM [132], SKNet [133], and Local Relation Net [134]; for optimizing the mathematical form of attention, A2Net [135] and CGNL [136] have also been derived.

E. Models Based on the Transformer Architecture

In the past few years, CNNs have been the dominant models in the field of computer vision (CV), whereas the Transformer [137], which is based on the self-attention mechanism, has made a big splash in the field of natural language processing (NLP). Although the Transformer model became a popular topic in NLP, its application in CV was relatively limited. In 2020, Google proposed the Vision Transformer (ViT) model [138], which applies the Transformer directly to images and became one of the pioneering models bringing the Transformer architecture to the computer vision field. As shown in Fig. 12, ViT consists of three main modules: Linear Projection (Patch + Position Embedding), Transformer Encoder, and MLP Head. Although ViT performs well on image classification tasks, its performance on small-scale datasets still needs to be improved, and direct training of ViT may produce unstable and sub-optimal results, especially when the model becomes wider and deeper. To alleviate the above problems, Gong et al. [139] introduced new loss functions, including patch-wise cosine loss, patch-wise contrastive loss, and patch-wise mixing loss, while keeping the original ViT architecture unchanged, to enhance the model’s ability to distinguish between features during feature extraction.

FIGURE 12. ViT divides the input image into multiple patches and then projects each patch into a fixed-length vector that is fed into the Transformer; the subsequent encoder operations are the same as in the original Transformer. From [138].
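
The linear projection of patches can be sketched as follows (illustrative; patch size 16 and embedding dimension 768 follow the common ViT-Base configuration, and the code is not the original implementation):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT-style linear projection of flattened patches, plus class token and position embedding."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A strided convolution is equivalent to splitting into patches and applying a linear layer.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                   # (N, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (N, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed  # (N, 197, 768)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                                           batch_first=True), num_layers=2)
print(encoder(tokens).shape)                                # torch.Size([2, 197, 768])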

Extending the Transformer from the NLP domain to the CV domain also faces the challenge of high computational complexity, which is especially inefficient for high-resolution image processing. Thus, Liu et al. [140] proposed the Swin Transformer, which adopts hierarchical feature representation and a shifted-window self-attention mechanism to effectively solve the computational efficiency problem of the traditional Transformer in visual tasks. Subsequently, after further optimization, a Swin Transformer V2 [141] with 3 billion parameters was successfully trained; it adopts residual post-normalization and a scaled cosine attention mechanism, introduces log-spaced continuous position bias, and reduces the dependence on labeled data through self-supervised pre-training, becoming one of the largest dense vision models to date. Recently, several novel and fruitful improvement strategies based on the powerful Swin Transformer model have been proposed. Pascal I [142] proposed an improved Swin Transformer model for brain tumor diagnosis, introducing the hybrid shifted-window multi-head self-attention module (HSW-MSA), a rescaling model, and ResMLP, aiming to improve classification accuracy, reduce memory consumption, and simplify training complexity.

Since 2020, ViTs such as the Swin Transformer have gradually become mainstream for vision tasks, known for their high accuracy, efficiency, and strong scalability. In 2022, Zhang et al. [143] introduced ConvNeXt, which starts from a ResNet-50 and modernizes its structure following the design of the Swin Transformer, obtaining a purely convolutional model that is comparable to top Vision Transformers on vision benchmarks while maintaining the simplicity and efficiency of ConvNets and demonstrating strong performance in various scenarios. Subsequently, Woo et al. [144] added a fully convolutional masked autoencoder (FCMAE) and global response normalization (GRN) to ConvNeXt, the latter strengthening inter-channel feature competition. The combination of these self-supervised learning techniques and architectural optimizations led to ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks.

As research progressed, more Transformer-based semantic segmentation models emerged, such as the SEgmentation TRansformer (SETR) [145] shown in Fig. 13. SETR is based on an encoder-decoder architecture whose encoder uses only Transformer layers, instead of stacked convolutions, for feature extraction: the input image is serialized into a sequence of image patches and global context is modeled with the Transformer encoder. Compared with traditional FCN-based approaches, SETR avoids layer-by-layer resolution reduction by keeping the feature resolution constant throughout the encoder.

FIGURE 13. SEgmentation TRansformer (SETR). (a) SETR consists of a standard Transformer; (b) progressive upsampling; (c) multi-level feature aggregation. From [145].

In computer vision, constructing a unified model is extremely challenging, but it conserves resources and promotes the development of the whole field. In this context, Cheng et al. [146] proposed MaskFormer, which extends the set prediction mechanism of DETR by creating a binary mask for each detected object and using a unified loss function for the mask classification problem, so the same model can be applied to both semantic and instance segmentation without modification. Inheriting the core idea of MaskFormer, Mask2Former [147] was further proposed; it uses masked attention in the Transformer decoder to extract local features by restricting cross-attention to the predicted mask region. Compared with dedicated architectures for the different segmentation tasks (panoptic, instance, semantic), Mask2Former reduces the research effort roughly threefold, saves computational resources, and validates the promise of unified architectures. Observing the low utilization caused by inconsistent mask predictions across decoder layers in Mask2Former, Zhang et al. [148] proposed a mask-piloted training method in MP-Former to remedy this defect of mask attention, stabilizing the optimization objective and improving training efficiency.
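The mask classification idea described above can be sketched as follows: each query outputs a class distribution and a soft binary mask, and a semantic map is obtained by marginalizing over the queries. This is only a schematic of the inference step, with illustrative shapes and tensor names; it is not the authors' code.

```python
# Schematic mask-classification inference: Q queries each predict a class
# distribution (incl. a "no object" class) and a soft binary mask.
import torch

Q, K, H, W = 100, 21, 128, 128
class_logits = torch.randn(Q, K + 1)                 # last column = "no object"
mask_logits = torch.randn(Q, H, W)

class_probs = class_logits.softmax(-1)[:, :-1]       # drop the "no object" column
mask_probs = mask_logits.sigmoid()
# Per-pixel class scores: sum over queries of P(class | query) * P(pixel in mask | query).
semantic = torch.einsum("qk,qhw->khw", class_probs, mask_probs)
prediction = semantic.argmax(0)                      # (H, W) semantic label map
```

Because the per-query masks are kept separate, the same outputs can also be read at the instance level, which is why a single head can serve both tasks.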

With continuous research, further improvements and innovative models have been proposed to overcome some of the problems that ViT encounters in semantic segmentation. Zhang et al. [149] explored the application of a plain ViT to semantic segmentation and proposed a new paradigm, SegViT, which introduces a lightweight Attention-to-Mask (ATM) module that transforms the similarity maps between learnable class tokens and spatial feature maps into semantic masks. SegViT was later upgraded to SegViTv2 [150], which adopts the Shrunk++ architecture and combines Edge-Aware Query Downsampling (EQD) with Query-Based Upsampling (QU) to halve the encoder computational cost while maintaining performance; for continual learning, catastrophic forgetting is avoided by freezing the parameters of old tasks. The SegViT v2 decoder is far cheaper than UPerNet [113], consuming only about 5% of its computation while demonstrating excellent performance on multiple ViT backbones.

To address the insufficient information mixing in ViT caused by residual connections, Dai [151] proposed Pixel-focused Attention (PFA), a token mixer inspired by biological vision that enhances the model's ability to process image details by attending finely to pixel-level information. Meanwhile, Convolutional GLU, a channel mixer with gated channel attention, is introduced to combine the advantages of GLU and SE, enhancing local modeling capability and robustness. Combining the two yields the novel vision model TransNeXt, which effectively improves the performance of vision Transformers.

Transformers perform well on visual tasks, but simply enlarging the receptive field incurs high cost and is easily influenced by irrelevant regions, while data-agnostic sparse attention may limit long-range relationship modeling. Xia et al. [152] therefore proposed the Deformable Attention Transformer (DAT++), which addresses the high computational cost and sensitivity to irrelevant regions of traditional Transformers in visual tasks by introducing a Deformable Multi-Head Attention (DMHA) module. DAT++ dynamically adjusts the focus of attention in a data-driven manner to effectively capture key information in images. In addition, by incorporating advanced convolutional techniques, DAT++ achieves breakthrough performance on several visual recognition tasks, demonstrating its potential as a foundation model for visual recognition.

The core dot-product self-attention of ViT makes it difficult to handle high-resolution feature maps efficiently because of its quadratic computational complexity. Rao et al. [153] therefore proposed HorNet, which uses Recursive Gated Convolution (gnConv) to achieve efficient higher-order spatial interactions. gnConv combines gated convolution with a recursive design, significantly reducing computational complexity without sacrificing translation equivariance. HorNet not only achieves performance superior to the Swin Transformer and ConvNeXt on several visual recognition tasks, but also successfully fuses the adaptivity of vision Transformers with the spatial hierarchy of CNNs.

A plain ViT performs poorly on dense prediction tasks because it lacks visual priors, so vision-specific inductive biases need to be introduced to improve performance, even though its simple structure facilitates multimodal pretraining. Chen et al. [154] therefore proposed ViT-Adapter, a simple yet powerful adapter that equips a plain ViT for dense prediction by introducing spatial priors and multi-scale feature extraction. The adapter consists of three key components: a spatial prior module that captures the local semantics of the input image, a spatial feature injector that injects this prior information into the ViT, and a multi-scale feature extractor that reconstructs the multi-scale features required for dense prediction.

The local attention mechanisms of traditional Transformer models (e.g., Neighborhood Attention, NA) reduce computational complexity but limit global perception and long-range dependency modeling. Hassani et al. [155] therefore introduced Dilated Neighborhood Attention (DiNA), which extends the scope of the local attention in NA to capture more global context and exponentially expands the receptive field without extra cost. Combining the local attention of NA with the sparse global attention of DiNA, the Dilated Neighborhood Attention Transformer (DiNAT) was introduced, achieving performance beyond the state-of-the-art models of the time while maintaining computational efficiency.

Existing Transformer vision models fail to adequately adapt to the differences between images and language, especially when dealing with extremely long pixel sequences, which hinders effective learning between pixels and object queries. Yu et al. [156] therefore reconsidered the relationship between pixel features and object queries from a clustering perspective and proposed a novel end-to-end segmentation framework, the k-Means Mask Transformer (kMaX-DeepLab), in which the cross-attention module is simplified through a mechanism inspired by k-means clustering.

Following a "minimalist" design philosophy, Hong et al. [157] proposed PlainSeg, a high-performance semantic segmentation model based on a plain Vision Transformer (ViT) that achieves efficient segmentation by minimizing architectural biases and simplifying the decoder. PlainSeg consists of three main components: a pre-trained plain ViT as the feature extractor, a lightweight Transformer decoder that classifies masks, and a refiner that recovers feature resolution. The model exploits the power of a plain ViT pre-trained with Masked Image Modeling (MIM) and improves performance with simple up-sampling and three convolutional layers; the authors found that high-resolution features are critical for high performance even with such simple up-sampling. The model also demonstrates the importance of optimizing randomly initialized parameters with a small learning rate. To further improve performance, a PlainSeg-Hier variant was proposed, which exploits multi-scale features and employs a single-layer deformable Transformer encoder to fuse them.

F. Models Based on Mamba Architecture

Combining the advantages of the two major vision foundation models in computer vision (CNN and Transformer), a new selective structured state space model, Mamba [158], was proposed, bringing unprecedented performance improvements to long-sequence modeling. Mamba simplifies the SSM architecture by merging the linear attention block and the multi-layer perceptron (MLP) block of traditional SSM designs into a single Mamba block. As shown in Fig. 14, the Mamba block uses an activation function instead of a multiplicative gate and integrates the SSM transformation into the main path of the MLP. The overall structure of Mamba consists of repeatedly stacked Mamba blocks interspersed with standard normalization layers and residual connections to keep the model efficient and stable. Mamba breaks through the limitations of convolutional neural networks via global receptive fields and dynamic weight allocation, and provides modeling capabilities comparable to Transformers without quadratic complexity. Unlike the Transformer's exhaustive context dependency, Mamba introduces a selection mechanism that focuses on key information, a feature that gives it great potential as a vision backbone and has led to its rapid adoption across a variety of computer vision tasks.

FIGURE 14. - Mamba’s integration of simplified blocks for H3 and MLP. From [158].
FIGURE 14.

Mamba’s integration of simplified blocks for H3 and MLP. From [158].

FIGURE 15. Self-supervised learning is categorized mainly into generative and contrastive learning.

Mamba-based models use different strategies to improve the performance and efficiency of image processing tasks. Chen et al. [159] used a nested Mamba-in-Mamba structure in MiM-ISTD, where global information is processed by outer Mamba blocks while local features within each visual block are further explored by inner Mamba blocks. Specifically, the input image is uniformly divided into multiple local regions ("visual sentences"), and each local region is further divided into smaller sub-regions ("visual words"). In this way, MiM-ISTD efficiently captures global and local information simultaneously, enabling efficient and accurate detection of small infrared targets. FractalMamba [160] uses fractal scanning curves to serialize image patches, improving the capture of image structure by maintaining high spatial proximity and adapting to different image resolutions. EfficientVMamba [161] reduces the complexity of image processing by integrating an efficient 2D scanning method, dual-path feature fusion, and an inverted EfficientNet block insertion strategy. MSVMamba [162] captures local and global information through multi-scale features and combines a selective mechanism so that the model can process images more flexibly. PlainMamba [163] improves on the non-hierarchical Mamba model and is computationally efficient. A core feature of LocalVMamba [164] is its windowed selective scanning mechanism, which divides the image into multiple windows and processes the patches within each window independently, enhancing the capture of local features.

U-Mamba [165], a pioneering work in medical image segmentation, was the first to incorporate the essence of the Mamba architecture: it addresses the limitations of traditional networks in handling long-range dependencies through a hybrid CNN-SSM module, and improves usability and adaptability through a self-configuring mechanism. Mamba-UNet [166] followed closely, combining the advantages of U-Net and Mamba: it uses pure visual Mamba (VMamba) [167] with visual state space (VSS) blocks as the core of its encoder-decoder structure and effectively retains spatial information at different scales through skip connections, further improving performance on fine-grained segmentation tasks. In addition, many methods [166], [168], [169], [170] directly replace the CNN blocks in the U-Net architecture with Mamba-like blocks, which not only improves segmentation accuracy but also extends the application scope of Mamba in medical image analysis. Meanwhile, VM-UNet [168], VM-UNet-V2 [171], Swin-UMamba [172], and Mamba-HUNet [170] adopt visual state space (VSS) blocks as their core building blocks and show significant advantages in capturing extensive context, improving computational efficiency, and adapting to different network structures, further validating the great potential of the Mamba architecture in medical image segmentation.

In the field of remote sensing imagery, Mamba methods stand out for their ability to efficiently process high-resolution images and to exploit rich spatial-spectral features for fine-grained scanning and analysis. SpectralMamba [173] introduces a gated spatial-spectral merging technique combined with a piecewise sequential scanning strategy, aiming to capture subtle variations and complex patterns with greater precision. Subsequently, Mamba-in-Mamba [174] combines a centralized Mamba cross-scanning (MCS) mechanism with a tokenized Mamba to enhance the generation and concentration of representative features. ChangeMamba [175] takes an alternative approach, incorporating a VSS architecture into its encoder together with spatio-temporal relationship modeling mechanisms to analyze multi-temporal features. RS-Mamba [176] introduces an omnidirectional selective scanning module, pushing the flexibility of Mamba technology to a new level.

The Mamba model has great potential in terms of data efficiency, high-resolution data, multimodal data, and contextual learning, and future research will further explore Mamba for computer vision tasks in various domains.

G. Models Based on Self-Supervised Learning

Although traditional supervised learning methods have achieved remarkable results, label acquisition often incurs high human and time costs. Faced with this challenge, self-supervised learning methods have been studied in depth in recent years. Self-supervised learning is a learning paradigm that requires no manual annotation: by exploiting information from the images themselves for training, it can tackle semantic segmentation in the absence of labeled data. Current self-supervised learning is mainly categorized into generative learning and contrastive learning, as shown in Fig. 15(a) and (b), respectively. The core of generative learning is to learn the data generation process so that the model can generate new data samples, which is widely used in image and video generation, text generation, and other fields. The core of contrastive learning is to learn feature representations by comparing the similarity between samples, which is very useful for feature extraction and similarity learning.
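As an illustration of the contrastive branch, the following sketch computes an InfoNCE-style loss between the embeddings of two augmented views of the same batch of images; the other images in the batch serve as negatives. The function name and temperature value are illustrative and not taken from any specific paper.

```python
# InfoNCE-style contrastive loss between two augmented views (illustrative sketch).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) projected embeddings of two views of the same B images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))          # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)     # pull positives together, push negatives apart

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```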

Given the characteristics of semantic segmentation tasks, Wang et al. [177] proposed a self-supervised equivariant attention mechanism (SEAM), which uses the equivariance and consistency regularization of image transformations to narrow the gap between weakly supervised and fully supervised semantic segmentation. They introduced a pixel correlation module (PCM) to further refine the class activation maps (CAMs), and combined a siamese network with an equivariant cross regularization (ECR) loss to integrate the PCM with self-supervisory signals, thereby improving the prediction consistency and segmentation performance of the model. Araslanov et al. [119] proposed a novel unsupervised domain adaptation (UDA) framework called self-supervised augmentation consistency (SAC). It abandons complex adversarial training and style transfer, adopting simple image augmentations to enforce consistency of model predictions. Through self-supervised learning, SAC trains on the target domain with co-evolving pseudo-labels, without additional training rounds, achieving efficient domain adaptation.

In 2021, Facebook AI proposed DINO [178], a deep learning model for self-supervised visual learning and one of the first to explore self-supervised representation learning based on the Transformer architecture, learning visual feature representations through self-supervised training on unlabeled images. Two and a half years later, the Meta AI team proposed DINOv2 [179], an advanced version of DINO with technical innovations that fully exploit the potential of self-supervised learning. Specifically, an automated data pipeline was introduced to optimize dataset quality, and a ViT model with 1 billion parameters was trained and then distilled, through unsupervised self-distillation, into a series of smaller models for different computational budgets; these models outperformed the best available general-purpose features on most benchmarks at both the image and pixel level. A remaining problem with Transformer-based models is that the best-performing detection and segmentation models are still not unified, which hinders cooperation between the detection and segmentation tasks in terms of both tasks and data. Li et al. [180] extended DINO into Mask DINO by adding a mask prediction branch that works in parallel with DINO's original box prediction branch, using DINO's content query embeddings and a high-resolution pixel embedding map to perform mask classification. The simplicity, efficiency, and scalability of Mask DINO allow segmentation to benefit from detection pretraining on large-scale detection datasets.

EVA, proposed by Fang et al. [181], is a vision-centric foundation model pre-trained in a self-supervised manner by predicting masked visual features aligned with image-text representations, which allows the model to scale up to 1 billion parameters without relying on large amounts of labeled data. In addition, EVA can be used not only as a pure visual encoder but also as a multimodal hub connecting images and text, improving the training stability and efficiency of multimodal foundation models.

SECTION IV.

Image Segmentation Dataset

In this section, we present a number of widely used 2D datasets in the field of image segmentation, along with a description of the characteristics of each dataset. The annotations of these datasets are pixel-level labels, which allow the performance and accuracy of various semantic segmentation models to be compared fairly.

  1. PASCAL Visual Object Classes (PASCAL VOC) [69]. PASCAL VOC is an international computer vision challenge whose organizers provide some of the most widely recognized image test datasets in the field; the most commonly used is PASCAL VOC 2012, which has four main super-classes: Person, Animal, Vehicle, and Indoor. The dataset supports training for five vision tasks: image classification, object detection, image segmentation, action classification, and human body layout. For the image segmentation task, there are 21 object categories (including the background). The dataset is divided into training and validation sets of 1464 and 1449 images, respectively.

  2. PASCAL Context [182]. PASCAL Context is an extension of PASCAL VOC 2010, with 459 labeled categories and 10,103 images, of which 4,998 are used for training and 5,105 for validation. Many of the object categories in this dataset are very sparse, so the 59 most frequent categories are usually selected as semantic labels and the remaining categories are treated as background.

  3. Microsoft Common Objects in Context (MS COCO) [183]. MS COCO is an everyday-scene image dataset for segmentation and detection in which objects are precisely localized. It contains 81 categories (including background) and a total of 328,000 images with up to 2.5 million labeled instances. The training set contains more than 82,000 images, the validation set 40,500 images, and the test set more than 80,000 images.

  4. Cityscapes Dataset (Cityscapes) [70]. Cityscapes covers daytime street scenes from 50 different cities across different seasons and focuses on the semantic understanding of urban street scenes. Using more than 30 category labels, it provides high-quality fine annotations for 5,000 images and coarse annotations for 20,000 images. For the finely annotated portion, the training set contains 2,975 images, the validation set 500 images, and the test set 1,525 images.

  5. Cambridge-driving Labeled Video Database (CamVid) [184]. CamVid is the first video collection with semantic labels for object categories and complete metadata. It is filmed from the perspective of a driving car, which increases the number and heterogeneity of the observed targets, and comprises more than 700 annotated images. Each pixel is precisely labeled with one of 32 semantic categories.

  6. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [185]. ILSVRC is one of the best-known international computer vision challenges and provides the ImageNet dataset. The dataset contains more than 14 million images covering more than 20,000 categories, of which more than 1 million images carry explicit category labels and object location annotations.

  7. Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [186]. KITTI is one of the largest datasets in the field of autonomous driving. It is captured by on-board sensors such as optical cameras and LiDAR, contains real image data collected from urban, rural, and highway scenes, and is mainly used to evaluate road segmentation, object detection, object tracking, and related technologies in the in-vehicle environment.

  8. Sift Flow [187]. Sift Flow focuses on eight types of outdoor scenes, including streets, mountains, cities, beaches, and buildings, and contains 2,688 images of 256×256 pixels, each with pixel-level annotations covering 33 semantic categories.

  9. Stanford Background Dataset (SBD) [188]. SBD was built by Stanford University from outdoor scene images extracted from public datasets (LabelMe, MSRC, PASCAL VOC, and Geometric Context). It contains 715 images, each containing at least one foreground object and an annotated horizon position within the image (which need not be visible).

  10. NYU Depth Dataset V2 (NYUDv2) [189]. NYUDv2 consists of video sequences of various indoor scenes and contains 1,449 densely labeled RGB-D images, 464 indoor scenes from 3 cities, and 407,024 unlabeled frames; each object is given a class and an instance number.

  11. SUN-RGBD [190]. SUN-RGBD was acquired by four different RGB-D image sensors and contains 10,335 indoor images of different scenes. It covers more than 700 object types, with a total of 146,617 2D polygon annotations and 58,657 3D bounding boxes.

  12. ADE20k [191]. ADE20k contains more than 25,000 images of complex everyday scenes with a wide range of annotations for scenes, objects, and object parts. Other datasets can also be used for image segmentation, such as the Berkeley Segmentation Dataset (BSDS) [192], PASCAL Part [193], Youtube-Objects [194], and SYNTHIA [195].

SECTION V.

Model Performance Data Analysis

In image segmentation tasks, it is crucial to properly evaluate the performance of the model. In this section, we first parse the commonly used performance evaluation metrics in image segmentation models and then provide the performance evaluation results of the models on different datasets.

A. Performance Evaluation Metrics for Image Semantic Segmentation Algorithms

Pixel Accuracy (PA) is the ratio of the number of correctly predicted pixels to the total number of pixels in the image. For K+1 classes (K foreground categories and 1 background category), let P_{ii} denote the number of correctly classified pixels of class i and P_{ij} the number of pixels of class i that are predicted as class j. The pixel accuracy is:\begin{equation*} PA= \frac {\sum _{i=0}^{K}P_{ii}}{\sum _{i=0}^{K}\sum _{j=0}^{K}P_{ij}} \tag {1}\end{equation*}


PA is commonly used to estimate the overall segmentation effect, but it carries limited information: it tends to mask categories with poor results and does not reflect the segmentation accuracy of individual categories. For this reason, Class Pixel Accuracy (CPA) is used to assess the segmentation accuracy of each class. For the segmentation result of the i-th class, CPA is calculated as:\begin{equation*} CPA= \frac {P_{ii}}{\sum _{j=0}^{K}P_{ij}} \tag {2}\end{equation*}


Mean Pixel Accuracy (MPA), sometimes referred to as mean accuracy, is the average of the CPAs over all classes and is commonly used to estimate the overall segmentation effect; its results are more reliable than PA. The formula for MPA is:\begin{equation*} MPA= \frac {1}{K+1}\sum _{i=0}^{K} \frac {P_{ii}}{\sum _{j=0}^{K}P_{ij}} \tag {3}\end{equation*}


IoU is used to measure the degree of overlap between the segmentation results and the true labels for each category, and it should be noted that the process needs to exclude the pixel blocks that are labeled as background. Let A and B denote the ground truth and the predicted results, respectively, and the IoU is calculated as:\begin{equation*} IoU=J(A,B)=\frac {|A \cap B|}{|A \cup B|} \tag {4}\end{equation*}


However, when used to evaluate performance over an entire dataset, IoU gives priority to label categories that cover larger areas and neglects categories that occupy smaller areas but are equally important. Accordingly, to assess overall performance more accurately, the Mean Intersection over Union (MIoU), the mean IoU over all categories, is used; this ensures that each category contributes equally to the evaluation. MIoU is a widely used standard evaluation metric in semantic segmentation and serves as the primary metric in most model comparisons. MIoU is calculated as:\begin{equation*} MIoU=\frac {1}{K+1}\sum _{i=0}^{K}\frac {P_{ii}}{\sum _{j=0}^{K}P_{ij}+\sum _{j=0}^{K}P_{ji}-P_{ii}} \tag {5}\end{equation*}


The Frequency Weighted Intersection over Union (FWIoU) is a refinement of MIoU: it weights the IoU of each category by that category's frequency of occurrence. The formula for FWIoU is:\begin{equation*} FWIoU=\frac {1}{\sum _{i=0}^{K}\sum _{j=0}^{K}P_{ij}}\sum _{i=0}^{K}\frac {\left (\sum _{j=0}^{K}P_{ij}\right )P_{ii}}{\sum _{j=0}^{K}P_{ij}+\sum _{j=0}^{K}P_{ji}-P_{ii}} \tag {6}\end{equation*}

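The region-based metrics in Eqs. (1)-(6) can all be computed from a single confusion matrix. The sketch below assumes a matrix P in which P[i, j] counts the pixels of true class i predicted as class j; it is an illustrative NumPy implementation, not taken from any particular benchmark toolkit, and it omits handling of classes that are absent from an image.

```python
# PA, MPA, MIoU, and FWIoU (Eqs. (1)-(6)) from a confusion matrix P,
# where P[i, j] = number of pixels of true class i predicted as class j.
import numpy as np

def segmentation_metrics(P):
    P = P.astype(float)
    tp = np.diag(P)                        # P_ii
    gt = P.sum(axis=1)                     # sum_j P_ij  (pixels of true class i)
    pred = P.sum(axis=0)                   # sum_j P_ji  (pixels predicted as class i)
    pa = tp.sum() / P.sum()                # Eq. (1)
    cpa = tp / gt                          # Eq. (2), one value per class
    mpa = cpa.mean()                       # Eq. (3)
    iou = tp / (gt + pred - tp)            # Eq. (4) per class
    miou = iou.mean()                      # Eq. (5)
    fwiou = ((gt / P.sum()) * iou).sum()   # Eq. (6)
    return {"PA": pa, "MPA": mpa, "MIoU": miou, "FWIoU": fwiou}

# Toy confusion matrix with two foreground classes and one background class:
P = np.array([[50, 2, 3],
              [4, 30, 1],
              [2, 1, 7]])
print(segmentation_metrics(P))
```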

In semantic segmentation tasks, precision, recall, and the F1 score are also frequently employed to evaluate model accuracy. Precision measures the proportion of true positives among all samples predicted as positive (i.e., among both true positives and false positives), reflecting how reliable the model's positive predictions are. Precision is calculated as:\begin{equation*} Precision=\frac {TP}{TP+FP} \tag {7}\end{equation*}


The recall metric represents the ratio between the correctly predicted positive samples and the total number of labeled positive samples, thereby measuring the model’s ability to predict positive samples. It can also be used to reflect the sensitivity of the model to some extent, which is also known as the true positive rate (TPR). The recall is calculated using the following formula:\begin{equation*} Recall=Sensitivity=TPR=\frac {TP}{TP+FN} \tag {8}\end{equation*}


Here, TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives. The F1 score is the harmonic mean of precision and recall and is used to comprehensively assess model performance. The formula for the F1 score is:\begin{equation*} F1 \ Score=\frac {2 \ast Precision \ast Recall}{Precision+Recall} \tag {9}\end{equation*}


The Dice similarity coefficient (DSC), also referred to as Dice, is a set similarity measure defined as twice the area of overlap between the predicted region and the labeled region divided by the total number of pixels in the two regions. Dice is closely related to the intersection over union (IoU) and is computed as:\begin{equation*} Dice = \frac {2|A \cap B|}{|A|+|B|} \tag {10}\end{equation*}


When applied to Boolean data, with the foreground as a positive class, the Dice coefficient is essentially the same as the F1 score and is calculated as:\begin{equation*} Dice = \frac {2TP}{2TP+FP+FN} \tag {11}\end{equation*}

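For a single foreground class, Eqs. (7)-(11) can be computed directly from a predicted and a ground-truth binary mask, as in the illustrative sketch below; the toy masks are arbitrary examples.

```python
# Precision, recall, F1, and Dice (Eqs. (7)-(11)) for one foreground class.
import numpy as np

def binary_mask_metrics(pred, gt):
    """pred, gt: boolean arrays of the same shape, foreground = True."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)                            # Eq. (7)
    recall = tp / (tp + fn)                               # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (9)
    dice = 2 * tp / (2 * tp + fp + fn)                    # Eq. (11), equal to F1 here
    return {"precision": precision, "recall": recall, "F1": f1, "Dice": dice}

pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True
gt = np.zeros((4, 4), dtype=bool);   gt[1:3, 0:3] = True
print(binary_mask_metrics(pred, gt))   # precision = recall = F1 = Dice = 2/3
```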

B. Model Performance Evaluation Results

In evaluating deep learning-based image semantic segmentation algorithms, this paper uses MIoU as the representative metric. MIoU, as a measure of segmentation accuracy, has been widely used in the literature and effectively reflects the average segmentation performance of a model across categories. Indeed, numerous leading venues (e.g., CVPR, ICCV, ECCV) use MIoU as the primary evaluation metric when reporting image semantic segmentation results, and it has become a standard criterion for evaluating segmentation algorithms thanks to its comprehensive and objective nature. The following tables summarize the performance of the reviewed algorithms on commonly used public datasets.

Semantic segmentation models based on encoding and decoding structures efficiently capture and reconstruct detailed information of an image through an encoder-decoder architecture. Models with this architecture are particularly suited to handle tasks requiring fine pixel-level classification, as they are able to capture global contextual information in the encoder and progressively recover the spatial resolution of the image through the decoder to achieve accurate semantic segmentation.

As shown in Table 1, although FCN achieves good results on most datasets, it still has limitations when dealing with small objects and complex scenes. ParseNet improves the understanding of complex scenes and segmentation accuracy by introducing global context information. FastFCN improves efficiency and accuracy by optimizing the network structure and the training process, especially on the ADE20K and Pascal Context datasets. Panoptic FCN achieves a more comprehensive understanding of images by combining panoptic and semantic segmentation and reaches an MIoU of 86.6% on the COCO dataset, showing its advantages in complex scenes and multi-object recognition. UNet performs better on the Cityscapes dataset, while SegNet achieves more stable results on both the Cityscapes and PASCAL VOC datasets. DeconvNet and GCN were proposed in 2015 and 2017, respectively: the former achieves pixel-by-pixel classification by introducing deconvolution layers, and the latter obtains a deeper understanding of image structure through graph convolution operations; GCN performs particularly well on the Cityscapes and PASCAL VOC datasets. SDN, proposed in 2017, achieves more accurate segmentation through adaptive decomposition operations and reaches an MIoU of 83.5%, one of the best results among the encoder-decoder based models in Table 1.

TABLE 1. Encoder-Decoder Based Model Performance

In the field of semantic segmentation, dilated convolution is an important innovation: by introducing dilation into the convolution kernel, it enlarges the receptive field and captures a wider range of contextual information, significantly improving performance on complex scenes and high-resolution images. The performance of models based on dilated convolution is shown in Table 2.
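The effect of dilation can be seen in a two-line comparison: a 3×3 kernel with dilation rate r samples a (2r+1)×(2r+1) neighborhood while keeping the same nine weights, so the receptive field grows without extra parameters or loss of resolution. The snippet below is an illustrative PyTorch sketch with arbitrary channel counts and input size.

```python
# A 3x3 convolution with dilation 2 covers a 5x5 neighborhood with the same parameter count.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)    # 5x5 effective kernel
print(standard(x).shape, dilated(x).shape)   # same spatial size, larger context for `dilated`
print(sum(p.numel() for p in standard.parameters()) ==
      sum(p.numel() for p in dilated.parameters()))                  # True: identical parameters
```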

TABLE 2. Model Performance Based on Dilated Convolution Technique

DeepLab V1 is based on the VGG-16 backbone and captures more contextual information by introducing atrous convolution, improving segmentation accuracy. On the Cityscapes and PASCAL VOC datasets, DeepLab V1 achieves 63.1% and 71.6% MIoU, respectively, showing good performance on complex urban streetscapes and everyday object recognition. DRN increases the receptive field by introducing dilated convolution, improving resolution without adding parameters; on the Cityscapes dataset, DRN uses ResNet-101 as the backbone and achieves 70.9% MIoU, showing its effectiveness on complex urban streetscapes. DUC-HDC introduces dense upsampling convolution and hybrid dilated convolution to enhance segmentation capability; on the Cityscapes dataset, it uses ResNet-101 as the backbone and achieves an MIoU of 77.6%, demonstrating a significant advantage in processing high-resolution images and complex scenes. EncNet introduces a context encoding module that captures global scene context and selectively highlights class-dependent feature maps; on the Cityscapes dataset, EncNet uses R-101-D8 as the backbone and achieves 78.55% MIoU, while on the COCO dataset it achieves 85.9% MIoU, highlighting the benefit of explicitly modeling global context.

Multi-scale feature fusion is an effective strategy for improving the performance of semantic segmentation models: it allows a model to consider image features at different resolutions simultaneously, leading to a better understanding of both the global and local context of a scene. The performance of models based on multi-scale feature fusion is shown in Table 3.

TABLE 3. Model Performance Based on Multi-Scale Feature Fusion

The LRR model utilizes a stochastic reshaping technique combined with a ResNet backbone, achieving 77.5% and 69.7% MIoU on the PASCAL VOC and Cityscapes datasets, respectively; the reshaping technique enhances the generalization ability of the model by varying the dimensions of the input images. By introducing multi-scale feature fusion and a channel attention mechanism, RefineNet achieves 84.2%, 40.7%, and 33.6% MIoU on the Cityscapes, ADE20k, and COCO datasets, respectively. The backbone of RefineNet is ResNet-152, and it enhances feature representation by refining the network structure, especially in terms of preserving edge information. PSPNet uses R-101-D8 as the backbone and captures contextual information at different scales through its pyramid pooling module, achieving 78.34%, 78.47%, 43.57%, and 41.95% MIoU on the Cityscapes, PASCAL VOC, ADE20k, and COCO datasets, respectively, which indicates good adaptability to scenes of different resolutions and complexities. FPN captures multi-scale information by constructing a feature pyramid and achieves 75.8% and 39.35% MIoU on the Cityscapes and ADE20k datasets, respectively; its backbone is R-101, and it enhances the multi-scale expression of features through top-down paths and lateral connections. EFPN is an improved version of FPN that optimizes the feature fusion method rather than using simple addition, and enhances the network structure to improve feature extraction and representation; EFPN uses VGG-16 as the backbone and achieves 82.3%, 86.4%, and 53.9% MIoU on the Cityscapes, PASCAL VOC, and Pascal Context datasets, respectively.

Through continuous iteration, the DeepLab series has introduced techniques such as dilated convolution and attention mechanisms. DeepLab V2 and V3 achieve good results on the Cityscapes, PASCAL VOC, ADE20k, Pascal Context, and COCO datasets. As an important version of the series, DeepLab V3+ achieves MIoU of 80.65%, 78.62%, 45.47%, and 47.3% on the Cityscapes, PASCAL VOC, ADE20k, and Pascal Context datasets, respectively, by introducing depthwise separable convolutions and an improved decoder structure. DenseASPP (densely connected atrous spatial pyramid pooling) uses DenseNet-161 as the backbone and achieves 80.6% MIoU on the Cityscapes dataset; it enhances context capture through dense connections among its atrous pyramid pooling layers. APCNet (Adaptive Pyramid Context Network) uses R-101-D8 as the backbone and achieves 79.64% and 45.54% MIoU on the Cityscapes and ADE20k datasets, respectively; it uses adaptive pyramid context modules to better integrate features at different scales.

Attention-based models improve segmentation accuracy and robustness in semantic segmentation tasks by giving the network the ability to selectively focus on specific regions of the image. Table 4 shows the performance of models based on the attention mechanism.

TABLE 4. Model Performance Based on the Attention Mechanism

PSANet uses pixel-level semantic similarity to enhance segmentation performance: it strengthens local features and improves accuracy by adaptively aggregating relationships between pixels, achieving 79.69%, 77.91%, and 43.8% MIoU on the Cityscapes, PASCAL VOC, and ADE20k datasets, respectively. DANet uses a dual attention mechanism, combining channel attention and spatial attention, to enhance the expressiveness of features; this design makes DANet more effective with complex backgrounds and foreground objects, achieving 80.52%, 76.51%, 44.17%, and 37.9% MIoU on the Cityscapes, PASCAL VOC, ADE20k, and COCO datasets, respectively. OCNet captures multi-scale object context through its context pooling blocks, which indirectly enhances the model's ability to focus on relevant regions, achieving 82.5%, 45.5%, and 56.2% MIoU on the Cityscapes, ADE20K, and PASCAL-Context datasets, respectively. CCNet uses a criss-cross attention mechanism that aggregates context for each pixel along its criss-cross path, improving segmentation accuracy and achieving 81.4% MIoU on the Cityscapes dataset. EMANet uses an expectation-maximization attention mechanism to capture compact feature representations and improve segmentation accuracy, achieving 87.7% MIoU on the PASCAL VOC dataset. SANet uses an aggregated attention mechanism based on global information to enhance feature expression by explicitly modeling the relationships between feature channels, achieving 86.1% MIoU on the PASCAL VOC dataset. FocalNet uses a focal self-attention mechanism that combines fine-grained local attention with coarse-grained global attention to capture long-range global information efficiently, achieving 57.3% and 67.3% MIoU on the ADE20K and COCO datasets, respectively. VOLO (Vision Outlooker) is a newer architecture that introduces a new attention mechanism, outlook attention, which encodes fine-level features and context into token representations more effectively than conventional self-attention, which tends to focus on global dependencies and neglects fine-grained features; it achieves 84.3% and 54.3% MIoU on the Cityscapes and ADE20K datasets, respectively, which also demonstrates its good transferability. SERNet-Former uses Efficient-ResNet as the backbone and improves segmentation performance by combining it with Attention-Boosting Gates (AbGs) and Attention-Fusion Networks (AfNs), achieving 84.62% and 87.35% MIoU on the CamVid and Cityscapes datasets, respectively, which shows the effectiveness of the proposed modules for semantic segmentation tasks.

Models based on the Transformer architecture show great potential in semantic segmentation tasks. By fusing multi-scale features with attention mechanisms, these models not only significantly improve performance and segmentation accuracy but also enhance robustness. This is largely due to the Transformer's unique design, which effectively captures global dependencies in images. Specifically, by combining multi-scale feature maps with attention, Transformer-based models better capture objects at different scales: the multi-scale feature maps cover both fine-grained and coarse-grained features, allowing the model to understand the semantic information of the image more comprehensively, while the attention mechanism automatically focuses on important regions, further improving the capture of semantic information. The performance of models based on the Transformer architecture is shown in Table 5.

TABLE 5. Model Performance Based on the Transformer Architecture

As shown in Table 5, Swin V1-L and Swin V2-G demonstrate the application of the Swin Transformer to semantic segmentation. Swin V1-L, based on the original Swin Transformer architecture, strikes a good balance between computational efficiency and model performance; it employs techniques such as the shifted-window attention mechanism and relative position encoding, and achieves an MIoU of 53.5% on the ADE20K dataset. Swin V2-G builds on V1 with several important improvements: it introduces residual post-normalization (res-post-norm), which places the LayerNorm layer at the end of the residual branch to improve training stability and reduce the accumulation of activation values, and it uses scaled cosine attention to better handle long sequences and mitigate the concentration problem of the attention maps. These optimizations raise the MIoU to 59.9%. However, because of the huge number of parameters in V2-G, its training requires substantial computational resources and time.

ConvNeXt and ConvNeXt V2-H demonstrate another line of purely convolutional architectures. ConvNeXt captures features at different scales with spatial pyramid pooling (SPP) and improves accuracy by combining features at different scales, reaching an MIoU of 54.0% on the ADE20K dataset. ConvNeXt V2-H is a further iteration that incorporates self-supervised learning and architectural improvements, in particular the fully convolutional masked autoencoder (FCMAE) framework and a global response normalization (GRN) layer, which raise the MIoU to 57.0%.

MaskFormer reaches an MIoU of 44.5% on the ADE20K dataset, while Mask2Former introduces the masked attention module, which extracts localized features by constraining cross-attention to the masked region. This design lets the model focus more effectively on the local region of the target and alleviates the slow convergence of Transformer models, raising the MIoU to 57.3%. MaskFormer, although it also uses a Transformer decoder, is not specifically optimized for attention within the mask region, and its performance is therefore considerably lower than Mask2Former's.

SegViT and SegViT-v2 demonstrate the application of BEiT-based architectures to semantic segmentation. SegViT achieves an MIoU of 58.14% on the ADE20K dataset, while SegViT-v2 improves this figure to 58.2%. Both effectively improve feature representations by combining self-supervised pre-training with the Transformer architecture.

SETR takes advantage of the Transformer architecture and captures global context through self-attention, achieving MIoU of 47.71%, 81.64%, and 53.74% on the ADE20K, Cityscapes, and Pascal Context datasets, respectively. SegFormer uses a hierarchical Transformer encoder as its backbone and captures global features through self-attention, achieving MIoU of 49.62% and 82.25% on the ADE20K and Cityscapes datasets, respectively. DiNAT combines dilated neighborhood attention with the Transformer architecture and achieves MIoU of 58.1%, 84.5%, and 68.3% on the ADE20K, Cityscapes, and COCO datasets, respectively. HorNet uses recursive gated convolution to model high-order spatial interactions and achieves an MIoU of 57.9% on the ADE20K dataset. kMaX-DeepLab effectively improves segmentation accuracy by combining deep convolutional networks with its k-means-inspired attention, achieving MIoU of 55.2% and 83.5% on the ADE20K and Cityscapes datasets, respectively. TransNeXt effectively improves feature representation by combining the Transformer architecture with convolutional components, achieving an MIoU of 54.7% on the ADE20K dataset.

ViT-Adapter-L achieves 61.5% and 85.2% MIoU on the ADE20K and Cityscapes datasets, respectively, by combining the Vision Transformer with an adapter module. FocalNet-L achieves 57.3% and 67.3% MIoU on the ADE20K and COCO datasets, respectively, by combining the focal attention mechanism with the Transformer architecture. SwinMTL achieves 76.41% MIoU on the Cityscapes dataset by combining the Swin Transformer with multi-task learning. PlainSeg achieves 58.14%, 67.25%, and 53.02% MIoU on the ADE20K, Pascal Context, and COCO datasets, respectively, by combining a BEiT backbone with a simple segmentation head. Depth Anything achieves 84.8% and 59.4% MIoU on the Cityscapes and ADE20K datasets, respectively, by combining depth estimation with semantic segmentation. PatchDiverse achieves 54.5% and 83.6% MIoU on the ADE20K and Cityscapes datasets, respectively, by combining diverse patch sampling with the Transformer architecture. SPNet achieves 45.6% and 82.0% MIoU on the ADE20K and Cityscapes datasets, respectively, by combining spatial pyramid pooling with convolutional networks.

The performance of the Mamba model in image semantic segmentation is impressive, especially in improving segmentation accuracy while reducing model complexity. By leveraging selective structured state space models, Mamba efficiently captures long-range dependencies while maintaining linear computational complexity; its linear scalability with sequence length overcomes the scalability challenges that CNNs and Transformers face due to their high computational requirements, making Mamba-based models well suited to real-time and large-scale applications.

Table 6 shows the performance of the Mamba-based models. The Mamba model and its variants perform differently on the ADE20K dataset; all of them use convolution combined with selective state space models (Conv+SSM) as the backbone. In terms of mean intersection over union (MIoU), the key metric for semantic segmentation performance, these models score between roughly 47% and 52%. Specifically, VMamba tops the list with an MIoU of 51.6%, indicating that it can accurately distinguish different objects and boundaries on ADE20K images. It is followed by LocalVMamba with 51.0%, also a very good score, indicating that the model performs well in capturing local features. The remaining Mamba-based models perform somewhat weaker, but as a new lightweight architecture family, Mamba-based models still have promising prospects.

TABLE 6. Model Performance Based on Mamba

Self-supervised learning designs pre-training tasks to extract information from large amounts of unlabeled data. In semantic segmentation, the model learns image global consistency, local texture details, and cross-scale contextual relationships to construct a deep understanding of the scene. It helps the model to quickly and accurately capture key segmentation regions in subsequent supervised fine-tuning to improve segmentation accuracy.

Table 7 shows the performance of models based on self-supervised learning. The SEAM model uses ResNet38 as the backbone and achieves 64.5% MIoU on the PASCAL VOC dataset, indicating that SEAM can effectively handle the semantic segmentation of everyday objects. The SAC model uses ResNet-101 as the backbone and achieves 53.8% MIoU on the Cityscapes dataset; although this result is average for the complex urban street-scene segmentation task, it suggests that the model may need deeper feature extraction and more contextual information when dealing with more complex scenes. DINOv2 is an architecture based on the Vision Transformer (ViT-L/14) and achieves 53.1%, 80.9%, and 86.0% MIoU on the ADE20K, Cityscapes, and PASCAL VOC datasets, respectively; the results show that DINOv2 effectively uses the Transformer architecture to capture global context, and its high performance on Cityscapes and PASCAL VOC highlights its advantages in complex scenes and everyday object recognition. Mask DINO integrates object detection and image segmentation into a single Transformer-based model, uses ResNet-50 as the backbone for the segmentation task, introduces a mask prediction branch together with a series of key components to improve performance, and achieves 48.7% and 80.0% MIoU on the ADE20K and Cityscapes datasets, respectively. The EVA model is based on a CMask R-CNN backbone and achieves 62.3% and 53.4% MIoU on the ADE20K and COCO datasets, respectively; the results show that EVA performs well on datasets with rich categories and complex backgrounds, especially ADE20K.

TABLE 7. Model Performance Based on Self-Supervised Learning

SECTION VI.

Conclusion

This paper has provided an in-depth and comprehensive review of recent research progress in deep learning methods for semantic image segmentation. It first reviewed the close connection between the basic concepts of image segmentation and deep learning techniques and traced the evolution of early segmentation methods. It then discussed the various types of deep learning methods that have emerged in recent years, systematically classifying deep learning-based image segmentation algorithms into seven categories: encoder-decoder architectures, dilated convolution techniques, multi-scale feature fusion strategies, attention mechanisms, the Transformer architecture, the Mamba architecture, and self-supervised learning strategies. It also listed various strategies for improving the efficiency of deep learning networks by optimizing the network architecture and analyzed the considerations behind these design choices. In addition, the current mainstream benchmark datasets, including PASCAL VOC, MS COCO, Cityscapes, and ADE20k, were presented, along with the key metrics used to evaluate model performance. Experimental validation on these standard datasets shows that the reviewed algorithms demonstrate excellent performance.

With the rapid development of image acquisition technology, image types are becoming more and more diversified, bringing more challenges to image segmentation. In reviewing the development of the image segmentation field, we find that it is crucial to choose appropriate segmentation methods for different application scenarios. Sometimes, combining multiple segmentation methods can achieve better results. With the continuous progress of technology, the application scope of image segmentation in computer vision tasks is gradually expanding, and the accuracy and speed of segmentation continue to improve. However, there are still some challenges, such as a lack of datasets, insufficient segmentation accuracy for small targets, high computational complexity of algorithms, and inability to realize real-time interactive segmentation. These challenges will be the key direction of future research, which has important research value and practical significance.
