Introduction
A. Background
Medical image segmentation involves partitioning an image into multiple regions based on the spatial and pixel-level semantic information of the image. The shape, size, and location of the salient features in medical images often vary depending on the dataset, making it challenging to train a model to perform the semantic segmentation task. Deep learning (DL)-based medical image segmentation has addressed some of these challenges and is useful in many clinical applications, such as identifying or isolating anatomical structures, diagnosis and treatment planning in radiology, image-guided interventions and surgeries, quantitative analysis, and computer-aided diagnosis. With that goal, Li et al. proposed a deep convolutional neural network (CNN)-based framework for liver tumor segmentation [1], which was expanded upon with further computed tomography (CT) studies by Vivanti et al. [2]. In 2014, Menze et al. developed the multi-modal brain tumor segmentation benchmark [3], and in 2017, targeting the same challenge, Cherukuri et al. demonstrated segmentation of postoperative hydrocephalic scans [4]. Long et al. [5] developed fully convolutional networks for semantic segmentation.
Recent advancements in medical image segmentation are formulated on encoder-decoder-based DL architectures [6], [7]. Such DL approaches have been very effective in learning the spatial and semantic information from the image. However, DL models with denser layers often lose low-level spatial information and tend to learn more complex high-level semantic information. Encoder-decoder-based models address this challenge; we explore them further in this paper and improve upon them with our proposed architecture.
B. Literature Review
1) U-Net and its Variants
The U-Net architecture was proposed by Ronneberger et al. in 2015 [8]. A detailed review of U-Net and its variants was published by Siddique et al. in 2021 that outlines the theory and current applications of these methods [9]. A brief summary, as relevant to our proposed EMED-UNet architecture, is presented here.
The U-Net architecture contains an encoder (contracting path) and a decoder (expanding path). U-Net has been very successful in part due to the incorporation of skip connections [10], [11], [12], [13], which retrieve full-resolution spatial features from the encoder side and merge them with the coarser semantic features on the decoder side.
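To make this mechanism concrete, the following minimal PyTorch sketch (our own simplification, not the reference implementation of [8]) shows one decoder step in which upsampled decoder features are concatenated with the corresponding encoder features before further convolution:

import torch
import torch.nn as nn

class SkipDecoderBlock(nn.Module):
    """One U-Net decoder step: upsample, concatenate the encoder skip, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # recover spatial resolution
        x = torch.cat([x, skip], dim=1)  # merge full-resolution encoder detail
        return self.conv(x)              # fuse spatial and semantic features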
Since 2015, numerous variants of U-Net have emerged. For example, Res-UNet [14], Dense-UNet [15], UNet++ [16], and UNet3+ [17] were developed to improve segmentation performance compared to the base U-Net model. The UNet++ architecture proposed by Zhou et al. [16] solved the problem of finding the optimal depth of U-Net for the best result and also redesigned the skip connections for flexible fusion of features, unlike the restrictive same-scale feature-map fusion of U-Net. UNet++ was further improved by Lei et al. [18] for congestive heart failure diagnosis, and Zou et al. later employed a Multi-scale Residual UNet++ [19] for the same purpose. The U-Net architecture was extended to 3D image segmentation as 3D U-Net [12], where all operations, including convolution, max-pooling, and up-sampling, are performed in 3D. Adding attention-based learning, Oktay et al. proposed the Attention U-Net [20], which gave the model the ability to focus only on certain key features. Further, attention-gated networks [21] were introduced to leverage salient regions of medical images.
Recently, COVID TV-UNet was proposed by Saeedizadeh et al. [22] in 2021 for segmenting COVID-19 chest CT images. The approach was later extended by the same group for retinal pigment epithelium layer detection [23], in order to develop a shape-preserving retinal optical coherence tomography (OCT) image alignment method.
Introducing transformer and HarDNet structures, Shen and Xu proposed a dual-encoder image segmentation network [24], addressing the loss of inter-channel information interaction during decoder upsampling and thereby strengthening the local connections between adjacent chunks. In [25], Chen et al. proposed an unsupervised tilewise autoencoder (T-AE) pretraining architecture for transferable knowledge, together with a fine-tuning sub-framework based on a reconstruction-network-regularized segmentation model.
Substantial work has been done with COVID-19 CT imaging in segmenting lesions and detecting lung infection from CT scans. In 2021, Das proposed an adaptive activation function-based U-Net model [26] together with an ensemble-learning approach over deep CNN features, called EL-CNN-DF. In 2022, Aswathy et al. presented the Cascaded 3D U-Net [27] for the COVID-19 CT segmentation task: a cascade of two 3D U-Nets serving different purposes, where the first extracts the lung parenchyma from the CT volume and the second extracts the infected 3D modules. Malik et al. [28] proposed CDC-Net, a multi-classification CNN model for detecting the chest-related infections of COVID-19, pneumothorax, pneumonia, lung cancer, and tuberculosis; the architecture involves residual connections and dilated convolutions. Yin et al. proposed another segmentation framework, SD-UNet [29], targeting the CT imaging modality to detect lung infections; they modified U-Net with Squeeze-and-Attention and dense ASPP (atrous spatial pyramid pooling) modules to fuse global and multi-scale features.
Later, Naeem et al. proposed the concept of deeply learned vector formation [30], based mainly on a CNN with auto-correlation, gradient computation, scaling, filtering, and localization, coupled with state-of-the-art content-based image retrieval methods.
2) Works on Efficiency
To decrease computational costs, many research groups have focused on developing lightweight networks. These networks aim to make models deployable on resource-constrained hardware and in real-time applications. For example, InceptionNet, proposed by Szegedy et al. [31], has a bottleneck layer (i.e., a $1\times 1$ convolution that reduces the channel depth before the larger convolutions are applied), which lowers the computational cost of the module.
C. Our Previous Work
We presented the first iteration of EMED-UNet at the IEEE Region-10 Symposium (TENSYMP) 2022 Conference [39]. To strengthen that preliminary work, we have further improved our model architecture, incorporated a deep supervision technique for performance improvement, and validated the EMED-UNet model on additional datasets. In addition, instead of using large convolution kernels to widen the receptive field, the improved architecture relies on dilated convolutions, which achieve the same effect with far fewer parameters.
D. Motivation and Contributions
Despite its wide adoption in the medical imaging community, the U-Net architecture [8] has significant limitations, particularly when models must be deployed in real-time systems such as mobile devices. Though U-Net and its variants have shown high accuracy, the primary challenge is that U-Net models generally have complex neural architectures that employ a large number of filters, require a large number of trainable parameters, and perform many floating-point operations (FLOPS). This results in a high computational cost and a significant demand for onboard memory.
Another challenge faced by the U-Net architecture is that it limits the type and size of the kernel used for the convolutions in the architecture. Specifically, only one type of kernel is used throughout the U-Net architecture in all the convolutions: the standard $3\times 3$ kernel. As a result, the network's receptive field is fixed by this single kernel size and cannot adapt to salient regions of differing sizes.
To address these limitations, we present the EMED-UNet network architecture. The proposed architecture consists of an Efficient Feature Extraction (EFE) module, which is employed at each encoder and decoder in the network. The EFE module reduces the number of parameters at the bottleneck layers, thereby reducing overall computational complexity and memory usage. The module contains dilated convolution layers, which increase the receptive field without increasing the kernel size. The design of EMED-UNet consists of multiple encoders and decoders (i.e., multiple U-Nets), all capturing information at different receptive fields. A specific encoder-decoder pair (i.e., one U-Net) focuses on capturing feature maps at a specific receptive field during convolution in the EFE module. Therefore, using multiple encoder-decoder pairs, EMED-UNet can capture features at multiple receptive fields, allowing multiple gradients to back-propagate simultaneously and multiple outputs to be collected at the end. Furthermore, we implemented a deep supervision technique in the EMED-UNet to enable a collaborative learning experience between different U-Nets. Inspired by the UNet++ [16] proposed by Zhou et al., we redesigned the skip connections to achieve a flexible fusion of the saliency maps extracted by different U-Nets at varying resolution levels in the EMED-UNet.
We evaluated the EMED-UNet architecture on four medical imaging datasets: Montgomery County (MC), Shenzhen CXR dataset, COVID-19 CT lesion segmentation dataset, and the Brain Tumor Segmentation (BraTS) dataset. Our results demonstrate that the proposed model is able to match, or in some cases, outperform, the accuracy of U-Net while significantly reducing the number of parameters, FLOPS, and memory usage. With our EMED-UNet network architecture, we make the following four key contributions:
We propose the EMED-UNet architecture, which contains multiple U-Nets, each capturing information at different receptive fields. Thus, the model can learn features at multiple receptive fields, and multiple segmented outputs (one from each U-Net) can be collected at the end. A deep supervision technique is also incorporated to enable collaborative learning among the multiple U-Nets.
To reduce computational complexity and memory usage, we introduce an Efficient Feature Extractor (EFE) module. The EFE contains dilated convolutions that increase the receptive field, per the network’s requirement, without a significant increase in the number of parameters. It also contains multiple convolutional bottleneck layers, which contribute to parameter reduction.
We evaluated the EMED-UNet network architecture on four different medical image segmentation datasets and demonstrated its advantage over the standard U-Net and its variants in terms of reduced computational complexity and memory size.
We present two studies to independently showcase the strengths of the EMED-UNet network and of the EFE module. First, we removed the EFE module and added a ResNet backbone to study the strength of the EFE module. Second, we studied the effects of the ResNet backbone on both the proposed network and the U-Net network, comparing the learning abilities of the two networks and thereby assessing the strength of the network structure itself.
The upcoming sections of this paper are organized as follows: Section II presents the EMED-UNet network architecture in multiple subsections, covering the motivation, the proposed module, technical specifications of the proposed EMED-UNet architecture, network connectivity, and deep supervision-enhanced collaborative learning. Datasets, experiments, results, and the two ablation studies are presented in Section III. Section IV provides the inferences from the work and possible future directions. Conclusions are discussed in Section V.
Proposed Architecture
A. Motivation Behind the New Architecture
The base U-Net model contains a large number of filters in each of its convolution layers. Due to this, the total number of parameters in the model grows considerably (the U-Net model exceeds 31 million parameters). A large number of filters demands a substantially high number of matrix multiplications, resulting in high FLOPS: the base U-Net model requires 386 G FLOPS, and the overall model size of the architecture is 373 MB.
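To see where these costs originate, note that a standard convolution layer with kernel size $K$, $C_{in}$ input channels, and $C_{out}$ output channels holds \begin{equation*} K \times K \times C_{in} \times C_{out} + C_{out}\end{equation*} weights, including biases. For instance, a single $3\times 3$ convolution mapping 512 to 1024 channels already contains $3 \times 3 \times 512 \times 1024 \approx 4.7$ million parameters, so a handful of such deep layers accounts for a large share of U-Net's 31 million parameters.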
Any CNN model, particularly those developed to analyze medical images, should be created such that its receptive field can cover the entire salient region of the input image. In medical imaging datasets that contain images from various modalities, the salient regions of the input images are of varying sizes. Therefore, the receptive field should be such that all the salient regions are covered. This requires multiple sets of kernels in the convolution layers of the models, all with different receptive fields. However, the base U-Net model carries a single type of kernel in its entire encoder-decoder structure, which is not feasible if the images in the dataset have salient parts of varying size and shape and are located in different spatial locations. Therefore, a deep learning model that can extract feature maps at multiple receptive fields is desirable.
We carried out a study to analyze the effect of different types of kernels in the convolution box of the U-Net architecture. Two types of kernels were applied in the convolution box of the base U-Net architecture: the standard $3\times 3$ kernel and a larger $5\times 5$ kernel.
The challenges described above motivated our development of an efficient architecture that can learn all types of features based on multiple receptive fields during the convolution operation while reducing computational complexity and memory usage.
B. EFE Module
We propose an EFE module to improve memory efficiency and reduce computational complexity in the EMED-UNet architecture. As shown in Fig. 2, the EFE module consists of three parallel convolutional paths whose outputs are concatenated along the channel dimension.
Top: EMED-UNet architecture, with a comparison of how the features extracted by EMED-UNet are equivalent to the features extracted by different U-Nets using different spatial extents. Bottom: The EMED-UNet network with different configurations of $\{n, K_{min}, K_{max}\}$.
The EFE module comprises three hyper-parameters to achieve optimal outputs in terms of accuracy and computational complexity. The hyper-parameters are as follows:
$K$: Denotes the receptive field to be developed in the module using the $3\times 3$ convolutions.
$\lambda _{1}$: The reduction factor applied to limit the number of $1\times 1$ filters (the depth dimension of the $1\times 1$ filters).
$\lambda _{2}$: The reduction factor applied to limit the number of $3\times 3$ dilated filters (the depth dimension of the $3\times 3$ filters).
There are three parallel paths in the EFE module. One path applies a single $1\times 1$ convolution; the second applies a $1\times 1$ bottleneck convolution followed by a $3\times 3$ dilated convolution; and the third applies a pooling operation followed by a $1\times 1$ convolution. The outputs of the three paths are concatenated along the channel dimension, as formalized in Eqs. (3)-(6) below.
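The receptive-field arithmetic behind the dilated path is standard: a $3\times 3$ kernel with dilation rate $d$ has an effective extent of \begin{equation*} K_{eff} = 3 + 2(d-1) = 2d + 1\end{equation*} so a dilation rate of 2 reproduces the coverage of a $5\times 5$ kernel and a rate of 3 that of a $7\times 7$ kernel, while each filter retains only nine weights.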
The EFE module is used in every encoder (before the final down-sampling layer) and decoder (before the final up-sampling layer) node. The following analysis describes the step-by-step operations performed by the EFE module and derives its final output. First, consider an input feature map $Z \in R^{H\times W\times F}$ with $F$ channels; the reduction factors yield \begin{align*} F_{1} &= \frac {1}{\lambda _{1}} \times F \tag{1}\\ F_{2} &= \frac {1}{\lambda _{2}} \times F \tag{2}\end{align*}
Using the reduced channel counts, the three parallel paths and the concatenated output are computed as \begin{align*} Z_{1} &= ZW_{1}, \hspace {0.5cm} \therefore Z_{1} \in R^{H\times W\times F_{1}} \tag{3}\\ Z_{2} &= (ZW_{2})W_{3}, \hspace {0.5cm} \therefore Z_{2} \in R^{H\times W\times F_{2}} \tag{4}\\ Z_{3} &= (ZM)W_{4}, \hspace {0.5cm} \therefore Z_{3} \in R^{H\times W\times F_{1}} \tag{5}\\ Z_{out} &= ZW_{1} \oplus (ZW_{2})W_{3} \oplus (ZM)W_{4} \tag{6}\end{align*} where $W_{1}$, $W_{2}$, $W_{3}$, and $W_{4}$ denote the convolution kernels of the respective paths, $M$ denotes the parameter-free pooling operation, and $\oplus$ denotes concatenation along the channel dimension.
The number of output channels after concatenation is therefore \begin{align*} F_{out} &= 2 \times F_{1} + F_{2} \tag{7}\\ F_{out} &= 2 \times \frac {1}{\lambda _{1}} \times F + \frac {1}{\lambda _{2}} \times F \tag{8}\end{align*}
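The following PyTorch sketch illustrates one possible realization of the EFE module, reconstructed from Eqs. (1)-(8); it is a minimal illustration rather than our released code, and it assumes max-pooling for the operator $M$ and an $F_{2}$-channel bottleneck before the dilated convolution:

import torch
import torch.nn as nn

class EFEModule(nn.Module):
    """Efficient Feature Extraction module, reconstructed from Eqs. (1)-(8).

    Three parallel paths over an F-channel input:
      path 1: 1x1 conv                      -> F/lambda1 channels (Eq. 3)
      path 2: 1x1 bottleneck + dilated 3x3  -> F/lambda2 channels (Eq. 4)
      path 3: pooling (assumed max) + 1x1   -> F/lambda1 channels (Eq. 5)
    The outputs are concatenated along the channel axis (Eq. 6).
    """
    def __init__(self, F, K=5, lambda1=4, lambda2=1):
        super().__init__()
        F1, F2 = F // lambda1, F // lambda2          # Eqs. (1)-(2)
        d = (K - 1) // 2                             # dilation for a KxK receptive field (K odd)
        self.w1 = nn.Conv2d(F, F1, kernel_size=1)
        self.w2 = nn.Conv2d(F, F2, kernel_size=1)    # bottleneck width is an assumption
        self.w3 = nn.Conv2d(F2, F2, kernel_size=3, dilation=d, padding=d)
        self.m = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # assumed form of M
        self.w4 = nn.Conv2d(F, F1, kernel_size=1)

    def forward(self, z):
        z1 = self.w1(z)                              # Eq. (3)
        z2 = self.w3(self.w2(z))                     # Eq. (4)
        z3 = self.w4(self.m(z))                      # Eq. (5)
        return torch.cat([z1, z2, z3], dim=1)        # Eq. (6): 2*F1 + F2 channels

With $F = 64$, $\lambda_{1} = 4$, and $\lambda_{2} = 1$, the module outputs $2\times 16 + 64 = 96$ channels, matching Eq. (8).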
In place of the EFE module, the base U-Net model had two consecutive $3\times 3$ convolution layers, each producing $F$ output channels, which require substantially more parameters than the bottlenecked parallel paths of the EFE module.
The above analysis describes the reduction in the number of channels in the model after applying the EFE module. As the EFE module is present in every encoder and decoder node, this will eventually result in an overall parametric reduction in the architecture.
Fig. 1 demonstrates that the proposed EMED-UNet contains multiple embedded U-Nets. The feature extraction in each U-Net is done using unique kernels. Therefore, using different spatial extents, the features at different receptive fields are extracted by different U-Nets. In this way, the Multiple Encoder-Decoder structures of EMED-UNet can extract information at multiple receptive fields. Fig. 1 describes EMED-UNets with different configurations based on three parameters:
$n$: The number of U-Nets, i.e. the number of encoder-decoder pairs, in the architecture.
$K_{min}$: The minimum receptive field formed by any convolution layer in the entire architecture.
$K_{max}$: The maximum receptive field formed by any convolution layer in the entire architecture.
C. Technical Specifications of the EMED-UNet Architecture
For example, in Fig. 1, the EMED-UNet with specifications $\{n, K_{min}, K_{max}\} = \{2, 3, 7\}$ contains two embedded U-Nets: one extracting features at a $3\times 3$ receptive field and the other at a $7\times 7$ receptive field.
D. Connectivity of the Network
The EMED-UNet network architecture has different connectivity for varying configurations based on three variables: $n$, $K_{min}$, and $K_{max}$, as defined in the previous subsection.
E. Collaborative Learning Using Deep Supervision
The EMED-UNet architecture contains multiple U-Nets; therefore, several segmentation outputs are generated, one at the end of every decoder. This motivates us to apply deep supervision in the model by back-propagating through all the decoder ends and making all the U-Nets learn collaboratively. Deep supervision ensembles the U-Net sub-components into one model, leveraging enhanced collaborative learning between them at multiple receptive fields: each U-Net learns features within the context of the features learned by the other encoder-decoder sub-components.
At the time of output collection, we select the best of the outputs generated at the multiple decoder ends. All the EMED-UNet results that follow are generated after applying deep supervision. Fig. 3 shows the deeply supervised EMED-UNet network producing multiple output masks while back-propagating the loss simultaneously in all the U-Nets, and provides an overall bird's-eye view of the unfolding of the EMED-UNet architecture.
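As a sketch of how such supervision can be wired up (our own illustration, assuming an unweighted average of per-head losses rather than the exact weighting used in our experiments):

import torch

def deep_supervision_loss(outputs, target, loss_fn):
    """Average the segmentation loss over every decoder head.

    outputs: list of predicted masks, one per embedded U-Net.
    target:  the shared ground-truth mask.
    loss_fn: any per-output loss, e.g. the dice loss of Eq. (14).
    Back-propagating this total makes all U-Nets learn collaboratively.
    """
    return torch.stack([loss_fn(o, target) for o in outputs]).mean()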
Overview of the deeply supervised EMED-UNet architecture, showing the unfolding of the encoder-decoders of the EMED-UNet and the EFE module. The images shown in the figure are from the BraTS dataset, as an illustrative example.
Experiments
A. Datasets
1) Montgomery County - Chest X-Ray Database
The Montgomery County chest X-ray dataset [40], [41] was created for the diagnosis of tuberculosis (TB). It was a collaborative work of the National Library of Medicine (NLM) with the Department of Health and Human Services, Montgomery County, Maryland, USA. The dataset contains 138 chest X-rays, of which 58 are from infected individuals and the rest from non-infected individuals. This relatively small dataset makes it challenging for the model to learn all the critical features from a limited number of samples. The images were resized to
2) Tuberculosis Chest X-Ray Dataset (Shenzhen)
The Shenzhen tuberculosis dataset [40], [41], [42] is a chest X-ray dataset created to detect tuberculosis. The dataset resulted from the collaborative work of NLM and Shenzhen No. 3 People's Hospital, Guangdong Medical College, Shenzhen, China. It contains 662 images, 336 of which are from infected cases and 326 from normal cases. The 566 lung masks for this dataset were provided by the Computer Engineering Department, Faculty of Informatics and Computer Engineering, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine. The images are resized to
3) COVID-19 CT Lesion Segmentation Dataset
This is a large lung CT scan dataset [43], [44], [45] developed by combining data from three public datasets on COVID-19. It contains 2,729 images with their corresponding ground-truth lesion masks. Each patient's CT scan consists of multiple slices, and every image in the dataset is one such slice. The scans were converted to NumPy arrays and then to PNG files for further processing. The resolution of every image in the dataset is
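A minimal sketch of that conversion step, assuming each slice is already available as a NumPy array (the file name and normalization are illustrative):

import numpy as np
from PIL import Image

def slice_to_png(ct_slice: np.ndarray, out_path: str) -> None:
    """Min-max normalize one CT slice to 8-bit and save it as a PNG."""
    lo, hi = float(ct_slice.min()), float(ct_slice.max())
    scaled = (ct_slice - lo) / (hi - lo + 1e-8) * 255.0
    Image.fromarray(scaled.astype(np.uint8)).save(out_path)

# slice_to_png(volume[i], f"patient01_slice{i:03d}.png")  # hypothetical usage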
4) Brain Tumor Segmentation (BraTS) Dataset
This dataset focuses on the evaluation of state-of-the-art methods for the segmentation of brain tumors in multimodal magnetic resonance imaging (MRI) scans. It contains volumes with multimodal scans: native T1, post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 Fluid-Attenuated Inversion Recovery (T2-FLAIR). The dataset contains over 66,348 slices. We converted the images to the resolution of
B. Implementation
1) EMED-UNet
As described in Section II, the EMED-UNet model is configured by the hyperparameters $\{n, K_{min}, K_{max}, \lambda _{1}, \lambda _{2}\}$. We evaluated the following four configurations: \begin{align*} \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,7,4,1\} \tag{9}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,5,4,1\} \tag{10}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,7,4,2\} \tag{11}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,5,2,1\} \tag{12}\end{align*}
The effect of change in parameters with different configurations mentioned above is shown in Table 2.
2) Comparison of Models
We compared the results of the EMED-UNet architecture with two U-Net models, each with a different receptive field: one U-Net uses the standard $3\times 3$ kernels throughout, while the other uses larger kernels that produce a wider receptive field.
We took the Train:Val:Test split ratio of 70:10:20 in the case of MC, SHZ, and C-19 CT datasets. The EMED-UNet was trained at a learning rate of
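For reference, a 70:10:20 split of a sample list can be obtained as follows (a generic sketch using scikit-learn, not our exact pipeline):

from sklearn.model_selection import train_test_split

samples = list(range(138))  # stand-in for the (image, mask) pairs of a dataset

# Carve off the 20% test set first, then take 1/8 of the remainder
# (0.125 * 0.80 = 0.10 of the whole) as the validation set.
train_val, test = train_test_split(samples, test_size=0.20, random_state=42)
train, val = train_test_split(train_val, test_size=0.125, random_state=42)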
Table 3 shows the hyperparametric settings while training the model. The model was trained on a Paramshavak Supercomputer, the specifications of which are outlined in Table 4.
Every model mentioned in the results section is trained once on each dataset. Training and testing on one dataset are done independently of all other datasets.
C. Performance Metrics and Loss Functions
The Performance Metrics and Loss Functions used in evaluating the model are as follows:
Dice-coefficient:
The Dice Coefficient [46] is a similarity metric used to measure the similarity between two images. The mathematical formulation of the Dice-coefficient metric function is:
\begin{equation*} DC(A,B) = \frac {2 \times \mid A \cap B \mid }{\mid A\mid + \mid B\mid } \tag{13}\end{equation*}
where $A$ is the prediction and $B$ is the ground truth.
Intersection over Union:
The Intersection over Union (IoU) [47], also known as Jaccard Index, is a standard metric to evaluate segmentation performance. This metric shows the overlap ratio between the segmented output and the ground truth.
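In the notation of Eq. (13), it is computed as \begin{equation*} IoU(A,B) = \frac {\mid A \cap B \mid }{\mid A \cup B \mid }\end{equation*}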
Precision:
Precision shows, out of all the predicted masks, how many of the predicted objects matched or overlapped with the ground-truth annotation.
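In terms of true positives (TP) and false positives (FP), it is computed as \begin{equation*} Precision = \frac {TP}{TP + FP}\end{equation*}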
Dice Loss:
The Dice Loss [46] is one of the most commonly used loss functions in segmentation tasks. It is based on the Dice coefficient metric. A useful characteristic of this function is that it can compute the loss at a global as well as a local scale.
\begin{equation*} DL(y,\hat {p}) = 1 - \frac {2y\hat {p} + 1}{y + \hat {p} + 1} \tag{14}\end{equation*}
Above is the mathematical formulation of the dice loss, where $y$ is the ground truth and $\hat {p}$ is the predicted output.
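The following NumPy sketch implements Eqs. (13) and (14), together with the IoU and precision metrics above, for binary masks (a minimal illustration, not our evaluation code):

import numpy as np

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Eq. (13): 2|A intersect B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """|A intersect B| / |A union B|."""
    inter = np.logical_and(a, b).sum()
    return inter / (np.logical_or(a, b).sum() + 1e-8)

def precision(pred: np.ndarray, gt: np.ndarray) -> float:
    """TP / (TP + FP)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    return tp / (tp + fp + 1e-8)

def dice_loss(y: np.ndarray, p: np.ndarray) -> float:
    """Eq. (14), with +1 smoothing in numerator and denominator."""
    return 1.0 - (2.0 * (y * p).sum() + 1.0) / (y.sum() + p.sum() + 1.0)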
Metrics Comparing Efficiency:
The metrics that compare the efficiency of the models are the number of parameters and the number of floating-point operations (FLOPS). The overall inference time of a system depends on the computing device (CPU, GPU cores, level of parallel processing, and more); therefore, we use the FLOPS count to evaluate the efficiency of the models. On a given device, the FLOPS count is proportional to the inference time, so a lower count indicates a more efficient model.
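For a single convolution layer, the standard estimate of the operation count is \begin{equation*} FLOPS \approx 2 \times K^{2} \times C_{in} \times C_{out} \times H_{out} \times W_{out}\end{equation*} which makes explicit why reducing the channel counts $C_{in}$ and $C_{out}$, as the bottleneck layers of the EFE module do, directly shrinks the dominant cost term.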
D. Results
Table 5 compares the performance of the U-Net base model, its other variants, other state-of-the-art models, and the EMED-UNet model on the Montgomery County dataset. This is a relatively small dataset with a total of 138 images. As shown in the table, the EMED-UNet model with parameters {2,3,5,2,1} performed better than the U-Net base model, its other variants, PSPNet, the other compared models, and its EMED-UNet counterparts, with an IoU of 0.9195 and a dice coefficient of 0.9578, along with a significant decrease in the number of trainable parameters from 31.043 (in UNet(
Table 6 also shows the comparison between the U-Net base model and other state-of-the-art models on the Shenzhen dataset. The precision, IoU, and dice coefficient increased from 0.9655, 0.8825, and 0.9370 in the case of U-Net(
Table 7 demonstrates the comparison between the U-Net base model, its other variants, and the EMED-UNet network on the COVID-19 CT Lesion Segmentation dataset. The EMED-UNet{2,3,7,4,2} outperforms the U-Net(
Table 8 shows that the EMED-UNet models outperform all the other models in terms of both accuracy (DC) and efficiency (FLOPs and Params (M)). The EMED-UNet{2,3,7,4,1} leads with a DC of 0.74 using only 1.84 M parameters and 0.38 G FLOPs. The EMED-UNet{2,3,5,2,1} turns out to be the lightest model while still giving a decent accuracy of 0.272; it could therefore be useful where resources are severely constrained.
The above results demonstrate that the embedded U-Net structure of the EMED-UNet network learns more parameter-efficiently than the dense architecture of UNet++, requiring fewer parameters to learn the salient features of medical images.
Fig. 4 shows the Dice Coefficient vs. FLOPS on four medical imaging datasets. It can be observed that the EMED-UNet model consistently outperformed the U-Net and its other variants in terms of efficiency. EMED-UNet model achieved a gain in accuracy on the MC, Shenzhen, and BraTS datasets, while it lost some accuracy on the COVID-19 CT dataset.
Dice Coefficient vs. FLOPS on Montgomery County, Shenzhen CXR, Covid-19 CT Lesion segmentation, and the BraTS datasets.
Fig. 5 demonstrates a visual comparison between U-Net and EMED-UNet segmentation outputs. Both models perform well overall, but in some edge cases EMED-UNet performs better: as shown in Fig. 5, U-Net falsely predicts foreground ('1' in the binary mask) in some places where EMED-UNet does not. The accuracy of the results can be verified from the numerical results reported in the tables above.
Qualitative comparison between output samples generated with the U-Net and EMED-UNet models on all the datasets. The EMED-UNet configuration used is {2,3,7,4,2} for the Shenzhen and C-19 datasets, and {2,3,5,2,1} for the Montgomery County dataset.
E. Ablation Studies
We conducted two ablation studies: first, to compare the EFE module with another backbone; second, to compare the proposed network architecture (the Multi-Encoder-Decoder network) with the base U-Net architecture. We chose the strong residual backbone [48] and evaluated its effect on the proposed Multi-Encoder-Decoder network by replacing the EFE module with a ResNet module and comparing the efficiency and accuracy in both cases. For the purposes of this section, we refer to EMED-UNet without the EFE module as MED-UNet; the MED-UNet with the residual module is denoted MED-UNet(res).
The residual module used in place of the EFE module consists of two convolution layers, wrapped by a shortcut connection that adds the module's input to its output, following the standard residual design [48].
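A minimal sketch of such a residual replacement block (our own reconstruction of a standard block in the style of [48], not the exact module used in our experiments):

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block used in place of the EFE module (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # 1x1 projection so the shortcut matches the output width.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))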
1) Ablation Study - 1
In the MED-UNet, the configuration reduces to $\{n, K_{min}, K_{max}\}$, since the reduction factors $\lambda _{1}$ and $\lambda _{2}$ apply only to the EFE module.
Results of Ablation study. The Model ‘Res-XX’ denotes the residual backbone on the multi-encoder-decoder network, and ‘EFE-XX’ denotes the multi-encoder-decoder network with the EFE module, which is basically the EMED-UNet architecture.
Fig. 6 and Table 8 describe the results of the ablation study. The EMED-UNet{2,3,7,4,2} uses about 81% fewer parameters than the MED-UNet(res), which in turn achieves only a negligible accuracy gain on the Shenzhen dataset and a small gain on the COVID-19 CT dataset. This demonstrates that the EFE module strikes a good balance between accuracy and efficiency. The detailed comparison is shown in Table 8.
2) Ablation Study - 2
In this study, we tested the effects of different backbones on both the MED-UNet and the U-Net architectures. We considered the ResNet backbones. The MED-UNet network consists of multiple embedded U-Nets, with each U-Net extracting information at different receptive fields. We demonstrate that the MED-UNet network outperforms U-Net in terms of accuracy by using a common backbone in both networks.
The residual module used in MED-UNet(res) is the same one used in Ablation Study 1. In the case of U-Net, we replaced its standard double-convolution blocks with the same residual module, so that both networks share a common backbone.
Again, we conducted the experiment on the COVID-19 CT LS dataset. The results in Table 8 show the strength of the Multi-Encoder-Decoder model architecture: the MED-UNet{2,3,7}(res) outperforms the U-Net(
Discussion
The ablation studies demonstrate the impact of the EFE module on overall model efficiency. Using the reduction factors as controls, we can find the best trade-off between accuracy and efficiency. In Ablation Study 1, we observed that the EMED-UNet{2,3,7,4,2}, with 6.35 M parameters, performs only marginally below the MED-UNet{2,3,7}(res), with 36 M parameters, in terms of accuracy. However, a different configuration of EMED-UNet, {2,3,7,4,1} with 10.39 M parameters, slightly surpasses MED-UNet{2,3,7}(res), reaching a dice coefficient of 0.8018 on the C-19 dataset. Thus, by compromising on efficiency, one can obtain better accuracy.
As the reduction factors in the EFE module increase, the model loses some accuracy because the number of channels in the feature maps is reduced. However, this accuracy can be recovered through the structure of the EMED-UNet network itself: with multiple encoders and decoders, the network captures the salient features of the images at multiple receptive fields. The strength of the network is shown in the study described in Section III-E. Thus, the multi-encoder-decoder network combined with the EFE module, i.e. the EMED-UNet, is the best fit.
Based on this detailed study of the proposed network, we can see that a major improvement in results arises from the network structure itself. This work might be further extended by observing the effect of a transformer backbone on the same network structure while testing multiple network configurations.
Conclusion
The proposed EMED-UNet network architecture has been demonstrated to be very effective at segmenting biomedical images, with improved accuracy and significantly reduced parameters, FLOPS, and model size compared to the standard U-Net base model. By employing the EFE module, our network decreases the computational complexity and memory required for the semantic segmentation task. By introducing multiple encoder-decoders that extract features at multiple receptive fields, and by incorporating collaborative learning through the deep supervision technique, the EMED-UNet can learn all the salient features in the image. We evaluated our multi-encoder-decoder network on four medical image segmentation datasets and observed that it matches or exceeds the accuracy of the standard U-Net base, measured using multiple performance metrics, at a much lower parameter cost, with significantly fewer FLOPS and lower memory consumption. By iterating through different parameters of the model, one can find the best configuration for the segmentation task on any medical image modality while maintaining a good trade-off between accuracy and efficiency.