Introduction
Medical image segmentation is vital for extracting regions of interest from image data, enabling clinicians to interpret medical images more precisely. The main goal of segmenting these data is to delineate the anatomical structures required for diagnosing and treating various diseases, benefiting a large number of patients. However, manual pixel-level annotation of medical images is costly because of the vast number of patients, the level of detail across different imaging modalities, and the limited number of available specialists [1]. Conversely, precise and efficient automated processing of medical images can significantly reduce the time, cost, and potential errors associated with human-based processing [2].
Given the complexity and variability of medical imaging data, deep learning networks have emerged as the standard approach: they constitute the state of the art in medical image segmentation and outperform non-deep methods. These networks require large amounts of training data and powerful computational resources to fit their huge number of parameters. One of the first deep networks applied to image segmentation was the fully convolutional network (FCN) [3].
Nowadays, the majority of medical image segmentation algorithms follow an encoder-decoder structure. These algorithms first encode the input image into a latent space using a stack of convolution layers; the decoder then uses convolution layers to learn the locations of regions of interest within the image. Dense feature maps are propagated horizontally from each encoder layer to the corresponding decoder layer (a skip connection), injecting spatial information into the deeper layers and yielding significantly more accurate output segmentation maps. This design is known as U-Net [4]. U-Net and its variants have arguably been the most influential leap forward in segmentation algorithms in the recent past. Attention U-Net [5] and MultiResUNet [6] are two variants derived from the original U-Net architecture. Attention U-Net applies an attention mechanism between the skip connection and the output of the previous decoder layer to aggregate low-level encoder features with high-level decoder features, whereas MultiResUNet uses MultiRes blocks in the encoding and decoding layers together with a modified skip-connection path, consisting of a chain of convolutional layers with residual connections, to alleviate the disparity between encoder and decoder features.
In image segmentation, the aggregation of multi-scale features is particularly important for capturing context and spatial information at different scales, which can lead to more accurate and robust segmentation results. The U-Net architecture, commonly used for biomedical image segmentation, employs skip connections to combine features from the encoder and decoder at different resolutions, enabling the model to maintain fine-grained details while also incorporating high-level semantic information. The PolyRes-Network adopts the U-Net layout with two branches, an encoder branch and a decoder branch, and places an MLR-block at each level of both branches. The MLR-block enables the model to identify more useful local and global spatial information in the encoding phase. In the decoding phase, an attention mechanism between the skip connection and the output of the corresponding decoder MLR-block generates more accurate segmentation feature maps at each level of the decoder branch. Finally, the multi-scale feature aggregation (MSFA) block aggregates the outputs of the different attention gates together with the output of the final MLR-block to generate the final segmentation feature maps of the network. Although many studies have tackled the segmentation problem, they often lack a seamless transfer of features from the encoder to the decoder; in particular, they rarely combine attention-equipped skip connections with the transfer of features from different decoder layers to a multi-scale feature block that generates the final segmentation feature maps. Experimentally, this architecture enables our model to output more accurate segmentation maps than state-of-the-art methods.
The main contributions of this work are:
We propose a novel architecture, PolyRes-Net, for semantic image segmentation. The proposed architecture introduces an MLR block in each layer of the encoder and decoder branches of the network, incorporates an attention mechanism in the skip connections, and utilizes a multi-scale feature aggregation layer to produce the final segmentation feature maps.
PolyRes-Net relies on preserving low-level features alongside high-level features all the way to the network’s final output. This is achieved by establishing connections at two levels: from the encoder to the decoder through skip connections, and from the various decoding layers to the final segmentation feature maps through a multi-scale feature aggregation layer.
We demonstrate that the proposed method performs better than alternative algorithms by conducting experiments on a variety of datasets. Using the Dice coefficient (DSC) and Jaccard index as evaluation metrics, we test on four distinct medical imaging datasets: Kvasir-SEG, CVC-ClinicDB, the 2018 Data Science Bowl, and the ISIC-2018 skin lesion segmentation challenge dataset.
An extensive evaluation of PolyRes-Net across the four datasets shows a significant improvement over most existing state-of-the-art models. PolyRes-Net can therefore serve as a new baseline for medical image segmentation tasks.
The remainder of the paper is organised as follows: Section II discusses previous studies that applied deep learning in the medical domain. Section III describes the proposed PolyRes-Network. Section IV evaluates PolyRes-Network against other state-of-the-art methods and demonstrates its efficiency. Finally, Section V concludes the paper and discusses its limitations and future developments.
Related Work
Systems for computer-aided diagnosis (CAD) have drawn interest as an essential tool for making quick clinical decisions about specific diseases [7]. Among the most challenging tasks for these systems is medical image segmentation, which involves extracting features from medical images to aid in the diagnosis of patient cases. FCN [3] and U-Net [4] have gained significant popularity among semantic segmentation approaches for medical images.
The U-shaped architecture is commonly used for the segmentation task and consists of an encoder-decoder path. In this setup, the encoder increases the number of channels (filters) while reducing the spatial dimensions at each layer. The decoder, in turn, reduces the channels while increasing the spatial dimensions. Finally, the spatial dimensions are restored so that a prediction can be made for each pixel of the input image [8].
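To make this encoder-decoder pattern concrete, the following is a minimal PyTorch sketch of a generic U-shaped network (not the architecture proposed in this paper); the depth, channel widths, and double-convolution block are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the usual building block of U-shaped networks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Two-level U-shape: the encoder halves the spatial size and doubles the
    channels at each level; the decoder reverses this and fuses encoder features
    through skip connections before a 1x1 convolution predicts every pixel."""

    def __init__(self, in_ch=3, base=32, out_ch=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)          # concatenation doubles the channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, kernel_size=1)  # per-pixel prediction

    def forward(self, x):
        e1 = self.enc1(x)                                    # high resolution, few channels
        e2 = self.enc2(self.pool(e1))                        # lower resolution, more channels
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # skip connection from enc2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1)) # skip connection from enc1
        return self.head(d1)
```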
Training deep neural networks is a challenging process. Accuracy can often be increased by making the network deeper; on the other hand, greater depth can hamper training and lead to the degradation problem. UNet++ [9] established a densely connected encoder-decoder network with deep supervision by adding a number of nested, dense skip pathways. The deep residual learning framework [10] was proposed to ease training and tackle degradation. The Deep Residual U-Net [11] architecture combines U-Net [4] with deep residual learning [10]. ResUnet++ [12], an enhanced version of the ResUnet architecture, makes use of squeeze-and-excitation blocks, residual blocks, attention blocks, and Atrous Spatial Pyramid Pooling (ASPP). The squeeze-and-excitation (S&E) block models the relationships between channels and produces a global information map in order to emphasize pertinent features and suppress irrelevant ones; such S&E blocks were also included by FED-Net [13] in its modified U-Net architecture.
In the U-shaped design, high-level semantic feature maps from the decoder and matching low-level feature maps from the encoder are combined via skip connections. This can create a semantic gap between high-level and low-level features, and various strategies have been employed to bridge it. Attention U-Net [5] modifies the feature maps transmitted over the skip connections by means of an attention block: a gating mechanism built from the output of the previous decoder block excludes irrelevant spatial features flowing through the skip connections and retains only the pertinent information. Furthermore, MultiResUNet [6] proposed building each U-shaped block from a chain of three successive convolutional layers whose outputs are concatenated (the MultiRes block), together with residual paths along the skip connections.
Atrous (dilated) convolutions are introduced by the DeepLab architecture [14] to extract denser features, better preserving information for objects of different scales. DeepLabV3 [19] then shows impressive progress compared with the earlier DeepLab versions. In comparison with the synthesis paths of FCN and U-Net, the DeepLabV3 architecture employs a synthesis path with fewer convolutional layers, and it uses a skip connection between the analysis and synthesis paths, akin to the U-Net design. The polyp segmentation architecture introduced by PraNet [18] uses a parallel partial decoder (PPD) to aggregate features from high-level layers and creates a global map that serves as the initial guidance area for subsequent components; a reverse attention (RA) module then relates the feature maps to one another.
HRENet [20] developed an adaptive feature aggregation (AFA) module to dynamically aggregate the multi-level features and send them to the decoder block, as well as an informative context enhancement (ICE) module to intensify the lower-level encoder features under the direction of hard-region attention. Other efforts, such as MSRFNet [21] and PolypSeg [22], incorporate multi-scale information into their modules. PolypSeg [22] employs an adaptive scale context module (ASCM) and a semantic global context module (SGCM): the ASCM addresses size variations within polyps and boosts feature-representation capability, while the SGCM strengthens the fusion between high-level and low-level features and removes noise from the low-level features to increase segmentation accuracy. In a comparable manner, MSRFNet [21] combines cross-scale fusion modules, which convey both high-resolution and low-resolution characteristics, with an additional shape-stream network that refines polyp boundaries.
TGANet [23] proposed a text-guided attention architecture that uses several feature enhancement modules coupled with the encoder blocks to address the varying size and number of polyps for robust polyp segmentation; an auxiliary task is learnt alongside the primary task and used as label attention in the decoder blocks to supplement the size-based and number-based feature representations. DCSAU-Net [24] proposes an encoder-decoder architecture whose primary feature conservation (PFC) strategy in the encoder integrates long-range spatial information at the low-level semantic layer while lowering the number of parameters and the computation required. The rich primary features derived from this layer are then passed to a compact split-attention (CSA) block, which uses a multi-path attention structure to improve the feature representation of the different channels. To efficiently extract features from the combined features, the decoder concatenates the encoded features from each downsampling layer with the matching upsampled features via skip connections, after which the same CSA block is applied.
EANet [25] introduces three modules, the dynamic scale-aware context (DSC) module, the edge attention preservation (EAP) module, and the multi-level pairwise regression (MPR) module, to exploit the complementary relationship between edge detection and object segmentation. The DSC module learns the relevant receptive fields dynamically and adaptively based on the scale of the segmented target object, thereby capturing multi-scale contextual information. The EAP module extracts the segmented object’s edge information, suppressing low-level background noise while keeping edge-related information intact. The MPR module models edge and region information to enhance the features retrieved from layers of various depths, both deeper and shallower.
Segmentation quality can suffer when the input image is encoded into a low-resolution representation by a series of high-to-low resolution convolutions and the high-resolution representation is then recovered from this encoding without preserving high-resolution information along the way. To produce reliable segmentation maps, it is therefore essential to maintain a high-resolution representation of the image (for example, through skip connections). According to [26], [27], [28], and [29], multi-scale fusion can also help exchange high- and low-resolution features during segmentation rather than deriving the segmentation maps from low-resolution representations alone.
Building on previous work, this paper introduces a novel U-Net-based segmentation architecture, the Multilevel Residual Network (PolyRes-Net), consisting of three main modules: the Multi-Level Residual (MLR) module, the Attention module, and the Multi-Scale Feature Aggregation (MSFA) module. The MLR module extracts global and local feature maps using two main branches and multiple levels of residual layers, providing the model with more accurate features and recovering ones that would otherwise be missed. An attention mechanism is employed between the up-sampled features in the decoder and the corresponding encoder features carried by the skip connections. The MSFA module plays a pivotal role in generating the output segmentation feature maps: it aggregates the features from the different attention gates with the output of the last MLR block, ultimately producing the final segmentation feature maps.
Proposed Method
Our proposed method, shown in Fig. 1, is composed of two paths: an encoder path consisting of five MLR-blocks and a decoder path consisting of four MLR-blocks. In the decoding stage, attention gates are applied between each skip connection and the output of the corresponding decoder layer; they combine the high-resolution representation from the encoding path with the low-resolution representation in the decoding path and attend to the most important features before passing richer features up to the next decoder layer. A detailed description of our proposed PolyRes-Network model is presented in this section.
PolyRes network architecture, which consists of three main components: the MLR-block, the attention block, and the multi-scale feature aggregation block.
Our model depends on multi-scale feature aggregation at different levels: firstly, from the encoding stage to the decoding stage through skip connections and attention blocks; secondly, from the different decoding stages through the MSFA block, which outputs the final segmentation feature map. This architecture enables the model to transfer and retain the most important features from the encoder to the decoder layers through the skip connections and attention gates, and from the different decoder layers at different scales into the final segmentation feature maps. The feature aggregation performed through the skip connections, attention gates, and MSFA block allows the model to correct some misaligned predictions and improve the resulting segmentation feature maps.
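To make this dataflow concrete before the components are described in detail, the sketch below wires five encoder blocks, attention-gated skip connections, four decoder blocks, and a final aggregation head together in PyTorch. It is only a schematic: the MLR-block is replaced by a plain double convolution, the gate is a simple sigmoid mask, the channel widths are assumptions, and the exact fusion inside the decoder may differ from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyResSketch(nn.Module):
    """Schematic PolyRes-Net dataflow (hedged sketch): five encoder blocks,
    attention-gated skips feeding four decoder blocks, and an aggregation head
    that fuses the gate outputs with the last decoder block output."""

    def __init__(self, in_ch=3, widths=(32, 64, 128, 256, 512), n_classes=1):
        super().__init__()

        def block(i, o):   # plain double-conv stand-in for the MLR-block
            return nn.Sequential(nn.Conv2d(i, o, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(o, o, 3, padding=1), nn.ReLU(inplace=True))

        w = widths
        self.enc = nn.ModuleList([block(in_ch, w[0])] +
                                 [block(w[i], w[i + 1]) for i in range(4)])
        self.pool = nn.MaxPool2d(2)
        # One transposed conv, one gate, and one decoder block per decoder level.
        self.up = nn.ModuleList([nn.ConvTranspose2d(w[i + 1], w[i], 2, stride=2)
                                 for i in range(3, -1, -1)])
        self.gate = nn.ModuleList([nn.Conv2d(2 * w[i], w[i], 1)
                                   for i in range(3, -1, -1)])
        self.dec = nn.ModuleList([block(2 * w[i], w[i])
                                  for i in range(3, -1, -1)])
        # Aggregation head: fuse upsampled gate outputs with the last decoder output.
        self.msfa = nn.Conv2d(sum(w[:4]) + w[0], n_classes, 1)

    def forward(self, x):
        skips, gates = [], []
        for i, enc in enumerate(self.enc):
            x = enc(x if i == 0 else self.pool(x))
            skips.append(x)
        d = skips[-1]                                   # deepest encoder features
        for lvl, (up, gate, dec) in enumerate(zip(self.up, self.gate, self.dec)):
            skip = skips[3 - lvl]
            g = up(d)                                   # decoder features at the skip scale
            att = skip * torch.sigmoid(gate(torch.cat([skip, g], dim=1)))  # gated skip (schematic)
            gates.append(att)
            d = dec(torch.cat([att, g], dim=1))         # decoder block stand-in
        size = d.shape[-2:]
        fused = torch.cat([F.interpolate(a, size=size, mode="bilinear", align_corners=False)
                           for a in gates] + [d], dim=1)
        return self.msfa(fused)                         # final segmentation feature maps
```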
PolyRes-Network Model
A. Encoder
The encoder branch consists of five MLR-blocks, as mentioned previously, with a gradually increasing number of filters F that controls the number of feature maps generated by each block:\begin{equation*} F = U \times \alpha \tag {1}\end{equation*}
The MLR-block is the main unit of our network. As shown in Fig. 2, it consists of two main branches, a left and a right branch, each of which mainly consists of two consecutive convolutional layers whose filter counts are derived from F as\begin{align*} FLFilters & = F \times 0.333 \tag {2}\\ SLFilters& = F \times 0.5 \tag {3}\end{align*}
MLR-block architecture, illustrating the two main branches (left and right); each branch comprises two consecutive convolutional layers.
The two main branches capture different shapes of information: the left branch (LBranch) and the right branch each consist of two consecutive convolutional layers, arranged with distinct residual structures so that the block can extract both local and global features.
The MLR-block relies on transferring knowledge at different levels. Firstly, within each of the two main branches, the features computed by the FLFilters and SLFilters layers are concatenated. Secondly, the features computed by the two branches are concatenated and fed forward through two consecutive convolutional layers.
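Since the exact kernel sizes and residual wiring of the MLR-block are not fully spelled out here, the following PyTorch sketch should be read as an approximation of the structure just described: two branches, each with two consecutive convolutional layers whose widths follow Eqs. (2) and (3), concatenation inside each branch, concatenation of the two branches, and two further convolutional layers. The 3x3/5x5 kernels and the 1x1 residual shortcut are assumptions.

```python
import torch
import torch.nn as nn

class MLRBlockSketch(nn.Module):
    """Hedged sketch of the MLR-block. Each of the two branches applies two
    consecutive convolutions sized per Eqs. (2)-(3) and concatenates their
    outputs; the two branch outputs are then concatenated and refined by two
    further convolutions. Kernel sizes and the residual shortcut are assumed."""

    def __init__(self, in_ch, filters):                # `filters` is F from Eq. (1)
        super().__init__()
        fl = max(1, int(filters * 0.333))              # first-layer filters, Eq. (2)
        sl = max(1, int(filters * 0.5))                # second-layer filters, Eq. (3)

        def branch(k):                                 # two consecutive convs, kernel size k
            return nn.ModuleList([
                nn.Sequential(nn.Conv2d(in_ch, fl, k, padding=k // 2), nn.ReLU(inplace=True)),
                nn.Sequential(nn.Conv2d(fl, sl, k, padding=k // 2), nn.ReLU(inplace=True)),
            ])

        self.left = branch(3)                          # assumed kernel size
        self.right = branch(5)                         # assumed kernel size (wider context)
        self.fuse = nn.Sequential(                     # two consecutive convs on the fused features
            nn.Conv2d(2 * (fl + sl), filters, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.shortcut = nn.Conv2d(in_ch, filters, 1)   # assumed 1x1 residual projection

    @staticmethod
    def _run(branch, x):
        a = branch[0](x)
        b = branch[1](a)
        return torch.cat([a, b], dim=1)                # intra-branch concatenation

    def forward(self, x):
        fused = self.fuse(torch.cat([self._run(self.left, x),
                                     self._run(self.right, x)], dim=1))
        return fused + self.shortcut(x)                # residual addition (assumption)
```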
B. Decoder
The decoder branch consists of four MLR-blocks, as mentioned previously, with a gradually decreasing number of filters and an attention layer applied before each decoding block. The attention mechanism focuses on a subset of its input features by identifying which parts of the feature maps deserve greater weight, thereby improving the quality of the features that drive the results.
Inspired by the success of the attention mechanism in Attention U-Net [5], our architecture uses an attention layer in the decoder between the skip connection and the previous MLR-block output so that the model can focus on the essential areas of the feature maps, as shown in Fig. 3. The input to each MLR-block in the decoder branch is therefore computed as\begin{equation*} input = \mathrm {atten}(\text {skip connection},\ \text {previous block output}) \tag {4}\end{equation*}
The attention block outputs the most important features by computing attention between the skip connection and the MLR-block output.
The attention block takes two inputs: the skip connection and the previous MLR-block output. Each input is fed into its own convolutional layer; the transformed features are then combined, following the gating scheme of Attention U-Net [5], to produce attention coefficients that re-weight the skip-connection features before they are passed to the next MLR-block.
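A minimal sketch of such a gate is given below, following the additive attention formulation of Attention U-Net [5]; the intermediate channel width and the absence of normalisation layers are simplifying assumptions, and the decoder features are assumed to be already upsampled to the skip connection's spatial size.

```python
import torch
import torch.nn as nn

class AttentionGateSketch(nn.Module):
    """Additive attention gate in the style of Attention U-Net [5]: both inputs
    are projected by 1x1 convolutions, summed, passed through ReLU and a second
    1x1 convolution with a sigmoid, and the resulting mask re-weights the skip
    features (cf. Eq. (4))."""

    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_skip = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.w_gate = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, skip, gate):
        # `gate` is the previous MLR-block output, assumed already at the skip's resolution.
        alpha = self.psi(self.relu(self.w_skip(skip) + self.w_gate(gate)))  # attention coefficients
        return skip * alpha

# Example with assumed widths: a decoder level with 128-channel skip and gate features.
# gate = AttentionGateSketch(skip_ch=128, gate_ch=128, inter_ch=64)
```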
C. Multi-Scale Feature Aggregation
Our model utilizes the multi-scale feature aggregation (MSFA) layer to make segmentation decisions, incorporating information not only from the output of the last decoding layer but also from the output of each attention gate. Aggregating the outputs of the attention gates, rather than of the decoding layers alone, provides the network with more accurate features: the attention gates combine the skip connections with the previous decoder-layer output to merge high- and low-level representations, focusing on the most important features at each scale. The MSFA layer then aggregates these features from every scale, together with the final decoder-layer output, enhancing the resulting segmentation maps of the network.
The MSFA, as illustrated in Fig. 4, takes five inputs: four of them are the feature maps produced by the attention gates at the four decoder scales, and the fifth is the output of the final MLR-block in the network.
Multi-scale feature aggregation: features from different scales, obtained from the various attention blocks, are aggregated together with the final MLR-block output to generate the final segmentation feature maps.
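A possible realisation of the MSFA layer in PyTorch is sketched below: the four attention-gate outputs are upsampled to the resolution of the final MLR-block output, concatenated with it, and projected to the segmentation map by a convolution. The aggregation operator (concatenation followed by a 1x1 convolution) is an assumption; the paper's exact fusion may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFASketch(nn.Module):
    """Hedged sketch of multi-scale feature aggregation: upsample the four
    attention-gate outputs to the final decoder resolution, concatenate them
    with the last MLR-block output, and fuse with a 1x1 convolution."""

    def __init__(self, gate_channels, last_channels, n_classes=1):
        super().__init__()
        self.fuse = nn.Conv2d(sum(gate_channels) + last_channels, n_classes, kernel_size=1)

    def forward(self, gate_feats, last_feat):
        size = last_feat.shape[-2:]
        ups = [F.interpolate(g, size=size, mode="bilinear", align_corners=False)
               for g in gate_feats]                            # bring every scale to full resolution
        return self.fuse(torch.cat(ups + [last_feat], dim=1))  # final segmentation feature maps

# Example (assumed channel widths): gate outputs with 256, 128, 64, and 32 channels
# plus a 32-channel final MLR-block output, producing a single-channel mask.
# msfa = MSFASketch([256, 128, 64, 32], 32, n_classes=1)
```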
Results
We used four publicly available biomedical imaging datasets to evaluate our model: KvasirSEG [31], CVC-ClinicDB [30], the 2018 Data Science Bowl [32], and the ISIC-2018 Challenge [33], [34]. Each of these datasets has a different number of images with varying sizes, consisting of the images and their corresponding ground-truth masks. Firstly, we resize the images and their masks to a fixed common resolution.
Further preprocessing was applied to optimize the data for use with the PolyRes-Net model. The pixel values were normalized to a range of [0, 1], which facilitated faster convergence during training. To enhance the model’s ability to generalize and reduce the risk of overfitting, various data augmentation techniques were used, including random rotations, flips, cropping, zooming, and the addition of Gaussian noise. These techniques helped to simulate different conditions and perspectives, making the model more robust. In datasets with class imbalances, a weighted loss function was utilized to ensure that the model focused adequately on less represented classes. Additionally, image enhancement methods like histogram equalization and CLAHE were applied as needed, particularly in cases where medical images had low contrast, to improve the clarity of key features. These preprocessing actions were essential in preparing the data, ultimately contributing to the strong performance of the model across different datasets.
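The paper does not specify an augmentation library; the following sketch uses Albumentations (an assumption) to assemble most of the described operations: resizing, random cropping as a mild zoom, flips, rotations, CLAHE, Gaussian noise, and scaling of pixel values to [0, 1]. The target resolution, per-transform probabilities, and 8-bit mask convention are illustrative choices, not values taken from the paper.

```python
import albumentations as A

# Hedged sketch of the preprocessing/augmentation pipeline described above.
train_transform = A.Compose([
    A.Resize(288, 288),          # resize slightly above the assumed 256x256 target size
    A.RandomCrop(256, 256),      # random cropping acts as cropping/zooming
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=30, p=0.5),   # random rotations
    A.CLAHE(p=0.2),              # contrast enhancement for low-contrast images
    A.GaussNoise(p=0.2),         # additive Gaussian noise
])

def preprocess(image, mask):
    """Apply the same spatial transforms to image and mask, then scale to [0, 1].
    Assumes 8-bit inputs (pixel and mask values in 0..255)."""
    out = train_transform(image=image, mask=mask)
    return out["image"] / 255.0, out["mask"] / 255.0
```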
PolyRes-Net was implemented in PyTorch and run on a Tesla T4 GPU. It was trained with the binary cross-entropy (BCE) loss function and evaluated with standard computer vision metrics for medical image segmentation: the mean Dice coefficient (mDSC), mean intersection over union (mIoU), recall, and precision.\begin{align*} BCE(y, \hat {y})& = -[y \cdot \log (\hat {y}) + (1 - y) \cdot \log (1 - \hat {y})] \tag {5}\\ mDSC & = \dfrac {1}{N} \sum _{i=1}^{N} \dfrac {2 \times TP}{2 \times TP + FP + FN} \tag {6}\\ mIoU & = \dfrac {1}{N} \sum _{i=1}^{N} \dfrac {TP}{TP + FP + FN} \tag {7}\\ Precision & = \dfrac {TP}{TP + FP} \tag {8}\\ Recall & = \dfrac {TP}{TP + FN} \tag {9}\end{align*}
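For reference, a direct PyTorch translation of Eqs. (6)-(9) for a single binary prediction/mask pair is shown below; the 0.5 threshold and the epsilon guarding empty masks are implementation choices, not values taken from the paper, and Eq. (5) corresponds to torch.nn.BCELoss.

```python
import torch

def binary_seg_metrics(pred, target, threshold=0.5, eps=1e-7):
    """Compute Dice, IoU, precision, and recall for one binary prediction/mask
    pair (Eqs. (6)-(9)); `pred` holds probabilities, `target` holds {0, 1} labels."""
    p = (pred > threshold).float()
    t = target.float()
    tp = (p * t).sum()
    fp = (p * (1 - t)).sum()
    fn = ((1 - p) * t).sum()
    dice = (2 * tp + eps) / (2 * tp + fp + fn + eps)
    iou = (tp + eps) / (tp + fp + fn + eps)
    precision = (tp + eps) / (tp + fp + eps)
    recall = (tp + eps) / (tp + fn + eps)
    return {"dice": dice.item(), "iou": iou.item(),
            "precision": precision.item(), "recall": recall.item()}

# mDSC and mIoU are obtained by averaging the per-image values over the N test images,
# e.g. mdsc = sum(m["dice"] for m in per_image_metrics) / len(per_image_metrics).
```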
The model was trained for 150 epochs using the Adam optimizer [36] and a cosine annealing warm restarts scheduler [37] with an initial learning rate of 0.001, $\eta _{min} = 0.0008$, $T_{0} = 8$, and $T_{mult} = 1$. The cosine annealing warm restarts scheduler consists of two parts: cosine annealing and warm restarts. Cosine annealing refers to using the cosine function as the learning-rate annealing function, while warm restarts periodically reset the learning rate to the same or a higher value to escape local minima and encourage exploration during training. In practice, the cosine function has been shown to outperform alternatives such as simple linear annealing. With warm restarts, the learning rate is periodically raised back up; its impact on our model can be seen in Fig. 6. Our scheduler is configured with an initial learning rate (LR) of 0.001, which decays to 0.0008 before a warm restart returns it to 0.001 every eight epochs.
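These hyper-parameters map directly onto PyTorch's Adam optimizer and CosineAnnealingWarmRestarts scheduler; the sketch below only shows the configuration, with a stand-in module in place of PolyRes-Net and the training-loop body omitted.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Hyper-parameters reported in the text; `model` is a placeholder for PolyRes-Net.
model = torch.nn.Conv2d(3, 1, kernel_size=1)
optimizer = Adam(model.parameters(), lr=0.001)                  # initial learning rate
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=8, T_mult=1, eta_min=0.0008)

for epoch in range(150):
    # ... one training pass over the data loader would go here ...
    scheduler.step()   # anneal the LR, restarting it every T_0 = 8 epochs
```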
Comparison of loss across the evaluated datasets: CVC-ClinicDB, Bowl-2018, KvasirSeg, and ISIC-2018.
Learning curve of the PolyRes-Net loss over training epochs: (A) CVC-ClinicDB dataset; (B) 2018 Data Science Bowl dataset.
Results on CVC-ClinicDB: For the CVC-ClinicDB dataset, as shown in Tables 3 and 5 and Fig. 5, PolyRes-Net outperforms all SOTA methods, reporting the highest mIoU and mDSC of 0.8560 and 0.9102, respectively. Our method outperforms the most competitive method, DCSAU-Net [24], by 0.30% in mIoU and 0.40% in mDSC. This improved performance is due to PolyRes-Net’s architecture, which uses the MLR-block’s dual-branch structure to efficiently extract a variety of relevant features that enable more precise segmentation.
Results on KvasirSEG: As shown in Tables 3 and 5 and Fig. 5, PraNet [18] outperforms PolyRes-Net on this dataset, reporting the highest mIoU and mDSC of 0.8330 and 0.8920, respectively. This result suggests that PraNet’s attention mechanism may be more effective here than that of PolyRes-Net. Nevertheless, PolyRes-Net still achieves competitive performance on this dataset, which highlights its robustness across different datasets.
Results on the 2018 Data Science Bowl: As shown in Tables 3 and 5 and Fig. 5, PolyRes-Net outperforms all SOTA methods, reporting the highest mIoU and mDSC of 0.8532 and 0.9180, respectively. Our method outperforms the most competitive method, DCSAU-Net [24], by 0.20% in mIoU and 0.50% in mDSC. This enhancement may be attributed to PolyRes-Net’s multi-scale feature aggregation and efficient feature extraction, which allow it to capture both fine-grained features and global context from the images.
Results on the ISIC-2018 Challenge: As shown in Tables 4 and 5 and Fig. 5, PolyRes-Net outperforms all SOTA methods, reporting the highest mIoU and mDSC of 0.8214 and 0.8925, respectively. Our method outperforms the most competitive method, MultiResUNet [6], by 1.00% in mIoU and 0.75% in mDSC. Because PolyRes-Net integrates skip connections, attention mechanisms, and multi-scale feature aggregation effectively, it captures both local and global information in the images, producing significantly better segmentation results.
Conclusion and Future Work
In this paper, we introduce PolyRes-Network, which adopts a U-shaped architecture with encoder and decoder branches. The MLR-block is the core component employed within these branches to extract features. It leverages a dual-branch structure, with a distinct arrangement of residual layers in each branch; this design facilitates the extraction of different shapes of important features by allowing each branch to focus on capturing specific characteristics. The model employs skip connections to concatenate feature maps from the corresponding encoder and decoder layers, combining low-level and high-level features so that the network captures both detailed information and global context. An attention mechanism is implemented in the decoder branch, where each MLR-block receives as input the output of the attention between the skip connection and the previous MLR-block. The model constructs its final segmentation output using the multi-scale feature aggregation (MSFA) layer, which aggregates the outputs of the attention layers at different scales and thereby enhances the accuracy of the segmentation feature maps.
The experimental results demonstrate the promising performance of our proposed method, surpassing several commonly used methods. However, challenges arise in achieving optimal results with the KvasirSEG dataset. The results highlight the superior performance of the proposed PolyRes-Net compared to state-of-the-art segmentation methods. Specifically, PolyRes-Net achieves the highest DSC scores of 91.02%, 91.80%, and 89.25% on CVC ClinicDB, 2018 Data Science Bowl, and ISIC-2018 skin lesion segmentation challenge dataset, respectively. Additionally, the highest JC scores are 85.60%, 85.32%, and 82.14% for the same datasets, further underscoring the efficacy of the proposed model.
We identify opportunities for improvement, particularly in the MSFA layer, where different aggregation algorithms could be explored. Changes to the skip-connection construction, such as incorporating CNN layers to bridge the gap between encoder and decoder features, may also help. Additionally, using a transformer in the attention layer may yield more accurate results than a CNN. Furthermore, integrating a backbone model may prove beneficial, providing an avenue to further improve the model’s segmentation performance, especially when dealing with diverse modalities of medical images.