Journals & Magazines >IEEE Access >Volume: 11

EMED-UNet: An Efficient Multi-Encoder-Decoder Based UNet for Medical Image Segmentation

The figure illustrates the schematic of the proposed deeply supervised EMED-UNet framework, highlighting the sequential unfolding of its Encoder-Decoders and the proposed...

Abstract:

Many current and state-of-the-art deep learning models for accurate image segmentation are based on the U-Net architecture, a convolutional neural network designed for bi...Show More

Metadata

Abstract:

Many current and state-of-the-art deep learning models for accurate image segmentation are based on the U-Net architecture, a convolutional neural network designed for biomedical applications. Despite its widespread adoption in the medical imaging community, U-Net has two major limitations. First, due to its deep structure and the large number of filters used, the number of parameters and Floating-Point Operations Per Second (FLOPS) are high. This results in high computational complexity and demands a large memory size, making real-time implementation and deployment of U-Net models challenging. Second, the base U-Net model only uses a single kernel type (i.e.,

$3\times 3$ ) throughout the network for all convolution operations. Feature extraction using a single spatial extent throughout the network is suitable if the size, location, and shape of the salient regions remain static in all the dataset images, which is not necessarily the case in medical imaging. To address these two limitations, we propose an Efficient Multi-Encoder-Decoder based UNet (EMED-UNet), a novel architecture for efficient medical image segmentation. We evaluated our network on four medical imaging datasets: Montgomery County, Shenzhen CXR, COVID-19 CT LS, and the BraTS (Brain Tumor Segmentation) dataset. EMED-UNet outperforms U-Net and its variants in terms of accuracy, with around 77% reduction in parameters, a 60% reduction in FLOPS, and a 79.2% reduction in memory usage (all as compared to U-Net). The results demonstrate that EMED-UNet is a lightweight and accurate model for image segmentation that substantially improves upon the U-Net base model and is more feasible to deploy, given its decreased computational cost.

The figure illustrates the schematic of the proposed deeply supervised EMED-UNet framework, highlighting the sequential unfolding of its Encoder-Decoders and the proposed...

Published in: IEEE Access ( Volume: 11)

Page(s): 95253 - 95266

Date of Publication: 28 August 2023

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2023.3309158

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

A. Background

Medical image segmentation involves partitioning an image into multiple regions based on the spatial and pixel-level semantic information of the image. The shape, size, and location of the salient features in medical images often vary depending on the dataset, making it challenging to train a model to perform the semantic segmentation task. Deep learning (DL)-based medical image segmentation has addressed some of these challenges and is useful in many clinical applications, such as identifying or isolating an anatomical structure, diagnosis and treatment planning in radiology, image-guided interventions, and surgeries, quantitative analysis, computer-aided diagnosis, and many more. With that goal, Li et al. proposed a deep convolution neural network (CNN)-based framework for liver tumor segmentation [1], which was expanded upon with further computed tomography (CT) studies by Vivanti et al. [2]. In 2014, Menze et al. developed the multi-modal brain tumor segmentation benchmark [3], and in 2017 targeting the same challenge, Cherukuri et al. demonstrated segmentation to postoperative hydrocephalic scans [4]. Long et al. [5] developed fully convolutional networks for semantic segmentation.

Recent advancements in medical image segmentation are formulated on encoder-decoder-based DL architectures [6], [7]. Such DL approaches have been very effective in learning the spatial and semantic information from the image. However, DL models with denser layers often lose low-level spatial information and tend to learn more complex high-level semantic information. This challenge is addressed by encoder-decoder-based models, which we explore further in this paper and is further improved using our proposed architecture.

B. Literature Review

1) U-Net and its Variants

The U-Net architecture was proposed by Ronneberger et al. in 2015 [8]. A detailed review of U-Net and its variants was published by Siddique et al. in 2021 that outlines the theory and current applications of these methods [9]. A brief summary, as relevant to our proposed EMED-UNet architecture, is presented here.

The U-Net architecture contains an encoder (contracting path) and a decoder (expanding path). U-Net has been very successful in part due to the incorporation of skip connections [10], [11], [12], [13], which can retrieve full-scale spatial resolution from the encoder side and merge them with the semantic and coarser features at the decoder side.

Since 2015, numerous variants of U-Net have emerged. For example, Res-UNet [14], Dense-UNet [15], UNet++ [16], and UNet3+ [17] were developed to improve segmentation performance compared to the base U-Net model. The U-Net++ network architecture proposed by Zhao et al. [12] solved the problem of finding the optimal depth of U-Net for the best result and also provided a redesign of the skip connections for the flexible fusion of features, unlike the restrictive fusion of the same-scale feature maps of U-Net. UNet++ was further improved by Lei et al. [18] for congestive heart failure diagnosis. Later, Zou et al. employed a Multi-scale Residual UNet++ [19] for the same purpose. Implementation of the U-Net architecture for 3D image segmentation was developed as 3D U-Net [12], where all operations and functions, including convolution, max-pooling, and up-sampling, were performed in 3D. Adding attention-based learning, Oktay et al. proposed the Attention U-Net [20], which gave the model the ability to focus only on certain key features. Further, attention-gated networks [21] were introduced to leverage salient regions of medical images.

Recently, COVID TV-UNet was proposed by Saeedizadeh et al. [22] in 2021 for segmenting COVID-19 chest CT images. It was further extended by Saeedizadeh et al. for retinal pigment epithelium layer detection [23] in order to develop a shape-preserving retinal optical CT image alignment method.

Introducing transformers and HarDNet Structures, Shen and Xu proposed a dual-encoder image segmentation network [24], solving the limitation of losing the interaction of inter-channel information during the upsampling of the decoder. Thus, they brought more strength to the local connection between the adjacent chunks. In [25], Chen et al. proposed an unsupervised tilewise autoencoder (T-AE) pretraining architecture for transferable knowledge. They also proposed a sub-framework based on a reconstruction network regularized segmentation model for fine-tuning.

Substantial work has been done with COVID-19 CT imaging in segmenting lesions and detecting lung infection from the CT scan. Das, in 2021, proposed an Adaptive Activation Function-based U-Net model [26] and ensemble learning with the CNN deep features with a novel approach called EL-CNN-DF. In 2022, the Cascaded 3D U-Net [27] concept was presented by Aswathy et al. working on the COVID-19 CT segmentation task. They proposed a cascade of two 3D U-Nets, both fulfilling different purposes. The first 3D U-Net extracts the lung parenchyma from the CT volume and the second works on extracting the infected 3D modules. Malik et al. in [28] proposed CDC-Net, a multi-classification CNN model for the detection of chest-related infections of COVID-19, pneumothorax, pneumonia, lung cancer, and tuberculosis. The proposed architecture involved residual connections and dilated convolutions. Yin et al. proposed another segmentation framework: SD-UNet [29] targeting CT imaging modality to detect lung infections. They modified the U-Net with Squeeze-and-Attention and dense ASPP (atrous spatial pyramid pooling) modules in order to fuse the global and multi-scale features.

Later came the concept of deeply learned vectors’ formation [30], proposed by Naeem et al., mainly based on implementing a CNN with auto-correlation, gradient computation, scaling, filter, and localization coupled with state-of-the-art content-based image retrieval methods.

2) Works on Efficiency

To decrease computational costs, many research groups have focused on developing lightweight networks. These networks aim to make the models deployable on resource-constrained hardware and for real-time applications. For example, InceptionNet proposed by Szegedy et al. [31] has a bottleneck layer (i.e., $1 \times 1$ convolutions) which is used for dimensional reduction to decrease the number of parameters significantly. In this network, a kernel of the same size could not be used in every convolution throughout the network if the salient regions in every image vary in sizes. Therefore, they showcased the importance of using different spatial extents (i.e. kernel sizes) in the convolution layers. Based on this network, Singh et al. developed Inception U-Net [32] with the aim of automating the selection of the numerous layers in the U-Net architecture. It contains the inception layers in the nodes of the standard U-Net. Cahall et al. [33] expanded on this to use the dilated convolution in the Inception U-Net to capture both the local structural as well as global contextual information and proposed the Dilated Inception U-Net. Following that, Zhang et al. [34] proposed the Dense Inception U-Net integrating the Inception-Res module and the densely connected convolution module. In addition, Beheshti and Johnsson proposed a lightweight U-Net, the “Squeeze U-Net” [35] in 2020, with reduced memory requirement and improved efficiency. As another approach towards creating lightweight models for medical image segmentation came the knowledge distillation technique [36]. They used a technique of distilling knowledge from another well-trained segmentation network. Finally, in 2023, He et al. in [37] proposed an efficient hierarchical hybrid vision Transformer (H2Former), getting the aggregating potential of CNNs, multi-scale channel attention, and Transformers. More transformer-based Medical Image Analysis techniques are elaborated in [38]. This states that the transformers can play a good role in enhancing the current state-of-the-art techniques developed for medical image analysis.

C. Our Previous Work

We presented the first iteration of EMED-UNet as part of the IEEE Region-10 Symposium (TENSYMP) 2022 Conference [39]. To strengthen that preliminary work, we have further improved our model architecture, incorporated a deep supervision technique in our model for performance improvement, and validated the EMED-UNet model with additional datasets. In addition, instead of using large $5\times 5$ or $7\times 7$ convolutions, we used dilated $3\times 3$ convolutions to reduce the number of trainable parameters and FLOPS in the model. In our initial work, we also observed that certain mid-level features need to be extracted before passing the images into the EMED-UNet network. Therefore, two convolution layers were added as pre-processing steps before the input layers of the EMED-UNet network. We also improvised the model accuracy by applying Deep Supervision to the EMED-UNet model.

D. Motivation and Contributions

Despite its wide adoption in the medical imaging community, the U-Net architecture [8] has significant limitations, particularly when it comes to the need to deploy models in real-time systems such as mobile devices. Though U-Net and its variants have shown high accuracy, the primary challenge is that the U-Net models generally have complex neural architectures that employ a large number of filters and require a high amount of trainable parameters and many Floating Point Operations Per Second (FLOPS). This results in a high computational cost and requires a significant amount of onboard memory.

Another challenge faced by the U-Net architecture is that it limits the type and size of the kernel used for the convolutions in the architecture. Specifically, there is only one type of kernel used throughout the U-Net architecture in all the convolutions: the standard $3\times 3$ kernel. Using only one kernel type throughout the model is suitable if the size and shape of the salient regions in all the images in the dataset remain static, which is generally not in the case of medical imaging. In addition, as there is only one encoder-decoder pair in the U-Net model with only one outlet to collect the output, a single effective receptive field is formed over the input image. Due to this, the feature extraction happens at a single effective receptive field generated over the input at the end of the network.

To address these limitations, we present the EMED-UNet network architecture. The proposed architecture consists of an Efficient Feature Extraction (EFE) module, which is employed at each encoder and decoder in the network. The EFE module reduces the number of parameters at the bottleneck layers, thereby reducing overall computational complexity and memory usage. The module consists of dilated convolution layers, which increase the receptive field without increasing the kernel size. The design of EMED-UNet consists of multiple encoders and decoders (i.e. multiple U-Nets), all capturing information at different receptive fields. A specific encoder-decoder pair (i.e. one U-Net) focuses on capturing the feature maps at a specific receptive field during convolution in the EFE module. Therefore, using multiple encoder-decoder pairs, EMED-UNet can capture features at multiple receptive fields. This allows multiple gradients to back-propagate simultaneously and multiple outputs to collect at the end. Furthermore, we implemented a deep supervision technique in the EMED-UNet to enable a collaborative learning experience between different U-Nets. Inspired by the UNet++ [16] proposed by Zhou et al., we redesigned the skip connections to achieve a flexible fusion of the saliency maps extracted by different U-Nets at varying resolution levels in the EMED-UNet.

We evaluated the EMED-UNet architecture on four medical imaging datasets: Montgomery County (MC), Shenzhen CXR dataset, COVID-19 CT lesion segmentation dataset, and the Brain Tumor Segmentation (BraTS) dataset. Our results demonstrate that the proposed model is able to match, or in some cases, outperform, the accuracy of U-Net while significantly reducing the number of parameters, FLOPS, and memory usage. With our EMED-UNet network architecture, we make the following four key contributions:

We propose the EMED-UNet architecture, which contains multiple U-Nets, each capturing information at different receptive fields. Thus, the model can learn features of multiple receptive fields and multiple segmented outputs (one from each U-Net) and can be collected at the end. A deep supervision technique is also incorporated to enable collaborative learning among multiple U-Nets.
To reduce computational complexity and memory usage, we introduce an Efficient Feature Extractor (EFE) module. The EFE contains dilated convolutions that increase the receptive field, per the network’s requirement, without a significant increase in the number of parameters. It also contains multiple convolutional bottleneck layers, which contribute to parameter reduction.
We evaluated the EMED-UNet network architecture on four different medical image segmentation datasets and proved its proficiency over the standard U-Net and its variants in terms of reduced computational complexity and memory size.
We present two studies to independently showcase the strength of the EMED-UNet network and the EFE module. At first, we removed the EFE module and added the ResNet backbone to study the strength of the EFE module. In the second study, we studied the effects of the ResNet backbone on the proposed network and the U-Net network, comparing the learning abilities of both networks. Hence, studying the strength of the network.

The upcoming sections of this paper are organized as follows: In Section II, the EMED-UNet Network Architecture in multiple subsections, covering the motivation, the proposed module, technical specifications of the proposed EMED-UNet architecture, network connectivity, and the deep supervision enhanced collaborative learning. Datasets, Experiments, Results, and the two ablation studies are presented in Section III. Section IV provides the inferences from the work and possible future works. Conclusions are discussed in Section V.

SECTION II.

Proposed Architecture

A. Motivation Behind the New Architecture

The base U-Net model contains a large number of filters in each of the convolution layers. Due to this, the total number of parameters in the model increases by a considerable amount (i.e. the number of parameters in the U-Net model goes over 31 million). A large number of filters demands a substantially high number of matrix multiplications, resulting in high FLOPS. Specifically, there are 386 G FLOPS in the base U-Net model, and the overall model size of the architecture is 373 MB.

Any CNN model, particularly those developed to analyze medical images, should be created such that its receptive field can cover the entire salient region of the input image. In medical imaging datasets that contain images from various modalities, the salient regions of the input images are of varying sizes. Therefore, the receptive field should be such that all the salient regions are covered. This requires multiple sets of kernels in the convolution layers of the models, all with different receptive fields. However, the base U-Net model carries a single type of kernel in its entire encoder-decoder structure, which is not feasible if the images in the dataset have salient parts of varying size and shape and are located in different spatial locations. Therefore, a deep learning model that can extract feature maps at multiple receptive fields is desirable.

We carried out a study to analyze the effect of different types of kernels in the convolution box of U-Net architecture. Two types of kernels were applied in the convolution box of the base U-Net architecture: $3 \times 3$ and $5 \times 5$ . Table 1 shows that the U-Net ($5\times5$ ) performed better when the experiment was conducted using the Shenzhen dataset, but U-Net ($3\times3$ ) outperformed it when the experiment was conducted on the Montgomery County dataset. From Table 1, we conclude that in some cases, the U-Net with the $3 \times 3$ kernels work better, and in some, the $5 \times 5$ one. Thus, exactly what kernel to use depends on the resolution of the images, along with the type of latent space to be learned. This demonstrates that a model created for the medical imaging segmentation task should be able to focus on multiple receptive fields. A fixed kernel convolution throughout the network cannot learn all types of salient regions if the size and location of salient features in the image are not common in all the images.

TABLE 1 Effect of $3 \times3$ and $5 \times5$ Kernels on the U-Net Architecture Performance

$Table 1- Effect of $3 \times3$ and $5 \times5$ Kernels on the U-Net Architecture Performance$

The challenges described above motivated our development of an efficient architecture that can learn all types of features based on multiple receptive fields during the convolution operation while reducing computational complexity and memory usage.

B. EFE Module

We propose an EFE module to improve memory efficiency and computational complexity in the EMED-UNet architecture. As shown in Fig. 2, the EFE module consists of $1\times 1$ convolutions, $3\times 3$ dilated convolutions (with varying dilation rate), and a $2\times 2$ MaxPooling2D layer. The $1\times 1$ convolution reduces the depth dimension of the feature maps and filters only the channels suitable for the next $3\times 3$ convolution to apply on. The purpose of using dilated $3\times 3$ convolutions is to increase the receptive field during the convolution operation. For example, $3\times 3$ convolutions with a dilation rate equal to 2 will have a receptive field of $5\times 5$ . Those with a dilation rate equal to three will have a receptive field of $7\times 7$ .

$FIGURE 1. - Top: EMED-UNet architecture with comparison on how features extracted by EMED-UNet are equivalent to the features extracted by different U-Nets, using different spatial extents. Bottom: The EMED-UNet network with different configurations of {n, $K_{min}$ , $K_{max}$ } is shown.$

FIGURE 1.

Top: EMED-UNet architecture with comparison on how features extracted by EMED-UNet are equivalent to the features extracted by different U-Nets, using different spatial extents. Bottom: The EMED-UNet network with different configurations of {n,$K_{min}$ ,$K_{max}$ } is shown.

Show All

FIGURE 2.

The EFE module is used in both the encoders and decoders.

Show All

The EFE module comprises three hyper-parameters to achieve optimal outputs in terms of accuracy and computational complexity. The hyper-parameters are as follows:

$K$ : Denotes the receptive field to be developed in the module using the $3\times 3$ convolutions.
$\lambda _{1}$ : The reduction factor to be applied to limit the number of $1\times 1$ filters(depth dimension of the $1\times 1$ filters).
$\lambda _{2}$ : The reduction factor to be applied to limit the number of $3\times 3$ dilated filters (depth dimension of the $3\times 3$ filters).

There are three parallel paths in the EFE module. One path goes with the $2\times 2$ MaxPooling2D layer followed by a $1\times 1$ convolution. The second path contains a $1\times 1$ convolution bottleneck followed by the $3\times 3$ dilated convolution, and the third path contains only the $1\times 1$ convolution bottleneck. All three parallel paths concatenate at the end as the output of the EFE module. K, $\lambda _{1}$ , and $\lambda _{2}$ are the three controls to the EFE module and directly impact the model’s receptive field and the total number of trainable parameters.

The EFE module is used in every encoder (before the final down-sampling layer) and decoder (before the final up-sampling layer) node. The following analysis describes the step-by-step operations performed by the EFE module and drives the final output. First, consider $F_{1}$ and $F_{2}$ to be the reduced number of filters after the reduction caused by the factors $\lambda _{1}$ and $\lambda _{2}$ , respectively. ‘F’ is the input number of filters to the EFE module based on the Network structure as described in Fig. 1. Therefore, \begin{align*} F_{1} &= \frac {1}{\lambda _{1}} \times F \tag{1}\\ F_{2} &= \frac {1}{\lambda _{2}} \times F \tag{2}\end{align*} View Source Now, as described in Fig. 2, $Z_{1}$ , $Z_{2}$ , and $Z_{3}$ are the respective outputs of the three parallel paths. M is the weight matrix of the $2\times 2$ MaxPooling2D operation. $W_{1}$ , $W_{2}$ , $W_{3}$ , and $W_{4}$ are the weight matrices of the convolution layers, also described in Fig. 2. Let’s assume the input image matrix is Z. Then \begin{align*} Z_{1} &= ZW_{1}, \hspace {0.5cm} \therefore Z_{1} \in R^{H\times W\times F_{1}} \tag{3}\\ Z_{2} &= (ZW_{2})W_{3}, \hspace {0.5cm} \therefore Z_{2} \in R^{H\times W\times F_{2}} \tag{4}\\ Z_{3} &= (ZM)W_{4}, \hspace {0.5cm} \therefore Z_{3} \in R^{H\times W\times F_{1}} \tag{5}\\ Z_{out} &= ZW_{1} \oplus (ZW_{2})W_{3} \oplus (ZM)W_{4} \tag{6}\end{align*} View Source Finally, $Z \in R^{H\times W\times (F_{out})}$ , where $F_{out}$ denotes the number of channels in the output of the module. The expression for $F_{out}$ is as follows:\begin{align*} F_{out} &= 2 \times F_{1} + F_{2} \tag{7}\\ F_{out} &= 2 \times \frac {1}{\lambda _{1}} \times F + \frac {1}{\lambda _{2}} \times F \tag{8}\end{align*} View Source where $\lambda _{1}$ and $\lambda _{2}$ can be any number from the sequence {2,4,6,8,..}.

In place of the EFE module, the base U-Net model had two $3\times 3$ convolution layers with an “F” number of filters. The final output will be as $Z_{out} \in R^{H \times W \times (2F)}$ . The final number of channels in the U-Net module is significantly more than that of the EFE module.

The above analysis describes the reduction in the number of channels in the model after applying the EFE module. As the EFE module is present in every encoder and decoder node, this will eventually result in an overall parametric reduction in the architecture.

Fig. 1 demonstrates that the proposed EMED-UNet contains multiple embedded U-Nets. The feature extraction in each U-Net is done using unique kernels. Therefore, using different spatial extents, the features at different receptive fields are extracted by different U-Nets. In this way, the Multiple Encoder-Decoder structures of EMED-UNet can extract information at multiple receptive fields. Fig. 1 describes EMED-UNets with different configurations based on three parameters:

$n$ : Number of U-Nets or a number of encoder-decoder pairs in the architecture.
$K_{min}$ : Minimum receptive field formed by any convolution layer in the entire architecture.
$K_{max}$ : Maximum receptive field formed by any convolution layer in the entire architecture.

C. Technical Specifications of the EMED-UNet Architecture

For example, in Fig. 1, the EMED-UNet with specifications: ${n, K_{min}, K_{max}} = {2, 3, 7}$ contains two U-Net models (two encoder-decoder pairs), one with the receptive field of $3\times 3$ ($K$ =3 and therefore dilation rate = 1) in its EFE module, and the other with the receptive field of $5\times 5$ ($K=5$ and therefore dilation rate $= 2$ ) in its EFE module. The single-node 7,0,0 contains only the EFE module with a receptive field of $7\times 7$ ($K=7$ and, therefore, dilation rate $= 3$ ). This node is not a part of any U-Net structure (no up-sampling or down-sampling operations involved), but it is connected with the other U-Nets using skip connections. Thus, the variables $K_{min}=3$ and $K_{max}=7$ means the convolutions with receptive field set of {3,5,7} are used in the architecture. In this way, the feature learning in the model will be more effective as feature maps are extracted using convolutions focusing on different receptive fields.

D. Connectivity of the Network

The EMED-UNet network architecture has different connectivity for varying configurations based on three variables: $n$ , $K_{min}$ , and $K_{max}$ . Again, consider the EMED-UNet with parameters ${n, K_{min}, K_{max}} = {2, 3, 7}$ described in Fig. 1, showing the connectivity of the network. Let us denote any node in the network as $X_{K, i, j}$ (denoted in the figure as $K, i, j$ in the circle), where K is the parameter of the EFE module present in that node, $i$ is the level of down-sampling of the node, and j denotes the number of skip connections coming as inputs to the node. Note that in the case of up-sampling, $i = i - 1$ , and in the case of down-sampling, $i = i + 1$ . For example, if we consider any node $X_{3, 1, 3}$ , the EFE module will have a parameter $K = 3$ . The node is part of the outer U-Net sub-network (denoted by the light brown shade), reached after down-sampling the input four times and up-sampling thrice. It can also be verified that the number of skip connection inputs given to the node is three.

E. Collaborative Learning Using Deep Supervision

The EMED-UNet architecture contains multiple U-Nets. Therefore, there are several segmentation outputs generated at the end of every decoder. This motivates us to apply deep supervision in the model by back-propagating through all the decoder ends and to make all the U-Nets learn collaboratively. This feature ensembles the U-Net sub-components into one model to leverage enhanced collaborative learning between them at multiple receptive fields. The learning will happen in such a way that it learns features within the context of features learned by other encoder-decoder sub-components.

At the time of output collection, we are collecting the best output of those generated at multiple ends. All the upcoming results of EMED-UNet are generated after applying deep supervision. Fig. 3 shows the deeply supervised EMED-UNet network, getting multiple output masks while back-propagating the loss simultaneously in all the U-Nets. Fig. 3 describes the overall bird’s eye view and unfolding of the EMED-UNet architecture.

FIGURE 3.

Overview of the deeply supervised EMED-UNet architecture, showing the unfolding of Encoder-Decoders of the EMED-UNet and the EFE module.The images shown in the figure are of the BraTS Dataset just to create a case example.

Show All

SECTION III.

Experiments

A. Datasets

1) Montgomery County - Chest X-Ray Database

Montgomery County chest X-ray dataset [40], [41] was created for the diagnosis of Tuberculosis (TB) disease. It was a collaborative work of the National Library of Medicine (NLM) with the Department of Health and Human Services, Montgomery County, Maryland, USA. The dataset contains 138 chest X-rays, out of which 58 are from infected individuals and the rest are from non-infected individuals. This relatively small data set created a challenge to make the model learn all the critical features using a limited number of samples. The images were resized to $512 \times 512 \times 3$ before passing them into the model to train.

2) Tuberculosis Chest X-Ray DATA(Shenzen)

The Shenzhen Tuberculosis dataset [40], [41], [42] is a chest X-ray dataset created to detect Tuberculosis disease. The dataset resulted from the collaborative work of NLM and Shenzhen No.3 People’s Hospital, Guangdong Medical College, Shenzhen, China. It contains 662 images, 336 of which are from infected cases and 326 from normal cases. The 566 lung masks for this dataset were proposed by the Computer Engineering Department, Faculty of Informatics and Computer Engineering, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine. Lung masks are available here. The images are resized to $512 \times 512 \times 3$ during processing.

3) COVID-19 CT Lesion Segmentation Dataset

This is a large lung CT scan dataset [43], [44], [45] which was developed by combining data from three public datasets on COVID-19. It contains 2,729 images with their corresponding ground truth lesion masks. Each CT scan file of a patient consists of multiple slices, and thus every image is a slice from the CT scans of the patients. These scans are converted to NumPy arrays and then to png files for further processing. The resolution of every image in the dataset is $512\times 512$ . We converted the images to the resolution of $256\times 256$ before feeding the images to train in the model. The data set is publicly available at Kaggle.

4) Brain Tumor Segmentation (BraTS) Dataset

This dataset focuses on the evaluation of state-of-the-art methods for the segmentation of brain tumors in multimodal magnetic resonance imaging (MRI) scans. It contains volumes with multimodal scans: T1, post-contrast T1-weighted T1Gd, T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR). The dataset contains over 66,348 slices. We converted the images to the resolution of $96\times 96$ before training.

B. Implementation

1) EMED-UNet

As described in Section – II, the EMED-UNet model uses the hyperparameters: n, $K_{min}, K_{max}, \lambda _{1}$ , and $\lambda _{2}$ . We iterated over different model configurations to find the optimal set of parameters. Increasing the values of $\lambda _{1}$ and $\lambda _{2}$ will reduce the number of channels in the model by a significant amount. Therefore, the model’s performance will start deprecating after a critical value of $\lambda _{1}$ and $\lambda _{2}$ . The critical values were found to be $\lambda _{1} = 16$ and $\lambda _{2} = 4$ . The following are the settings where good performance was achieved with the model:\begin{align*} \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,7,4,1\} \tag{9}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,5,4,1\} \tag{10}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,7,4,2\} \tag{11}\\ \{n, K_{min}, K_{max}, \lambda _{1},\lambda _{2}\} &= \{2,3,5,2,1\} \tag{12}\end{align*} View Source

The effect of change in parameters with different configurations mentioned above is shown in Table 2.

TABLE 2 Effect of Choosing Different Configurations of EMED-UNet on Parameters, FLOPS, and, Model Size in EMED Network

2) Comparison of Models

We compared the results of the EMED-UNet architecture with two U-Net models, each resulting in a different receptive field. One U-Net contains the $3 \times 3$ filters in every convolution layer, and the other has $5 \times 5$ filters in each of its convolution layers. Table 2 describes the set of filters used throughout the architectures. Detailed information on the U-Net implementation is presented in the later sections.

We took the Train:Val:Test split ratio of 70:10:20 in the case of MC, SHZ, and C-19 CT datasets. The EMED-UNet was trained at a learning rate of $1e^{-5}$ with an automatic reduction in learning during validation, extending to the minimum until $1e^{-7}$ . The Adam optimization was used in the training to optimize learning. In the pre-processing stage, images were resized to $512\times 512$ .

Table 3 shows the hyperparametric settings while training the model. The model was trained on a Paramshavak Supercomputer, the specifications of which are outlined in Table 4.

TABLE 3 Training Parameters

TABLE 4 High-Performance Computing Specifications of the Paramshavak Supercomputer

Every model mentioned in the results section is trained once on each dataset. Training and testing on one data set are done independently from all other data sets.

C. Performance Metrics and Loss Functions

The Performance Metrics and Loss Functions used in evaluating the model are as follows:

Dice-coefficient:
The Dice Coefficient [46] is a similarity metric used to measure the similarity between two images. The mathematical formulation of the Dice-coefficient metric function is:\begin{equation*} DC(A,B) = \frac {2 \times \mid A \cap B \mid }{\mid A\mid + \mid B\mid } \tag{13}\end{equation*} View Source where A is the prediction, and B is the ground truth.
Intersection over Union:
The Intersection over Union (IoU) [47], also known as Jaccard Index, is a standard metric to evaluate segmentation performance. This metric shows the overlap ratio between the segmented output and the ground truth.
Precision:
Precision shows, out of all the predicted masks, how many of those objects had matched or overlapped with the ground truth annotation.
Dice Loss:
The Dice Loss [46] is also one of the most commonly used loss functions in segmentation tasks. It is based on the dice coefficient metric. The good characteristic of this function is that it can compute loss at global as well as local scale.\begin{equation*} DL(y,\hat {p}) = 1 - \dfrac {2y\hat {p} + 1}{y + \hat {p} + 1} \tag{14}\end{equation*} View Source Above is the mathematical formulation of the dice loss where y is the ground truth and $\hat {p}$ is the predicted output.
Metrics Comparing Efficiency:
The metrics that compare the efficiency of the models are the number of parameters and FLOPS (Floating Point Operations Per Second). The overall inference time of a system depends on the computing device (CPU, GPU cores, level of parallel processing, and more). Therefore, we use the number of FLOPS for evaluating the efficiency of the models. FLOPS are proportional to the inference time of any system, which is directly proportionate to the system’s efficiency.

D. Results

Table 5 compares the performance of the U-Net base model, its other variants, other state-of-the-art models, and the EMED-UNet model on the Montgomery County dataset. This is a relatively small dataset where the total number of images is 138. As shown in the table, the EMED-UNet model with parameters {2,3,5,2,1} performed better than the U-Net base model, its other variants, PSPNet, other compared models, and its other EMED-UNet counterparts with an IoU of 0.9195 and dice coefficient of 0.9578, with a significantly decreasing in the number of trainable parameters from 31.043 (in UNet($3\times 3$ )) million to 4.65M and FLOPS from 386 G to 65.8 G. It also significantly outperforms UNet++ in terms of efficiency and stays inline in terms of accuracy.

TABLE 5 Results of Segmentation on Montgomery County Dataset

Table 6 also shows the comparison between the U-Net base model and other state-of-the-art models on the Shenzen dataset. The precision, IoU, and dice coefficient increased from 0.9655, 0.8825, and 0.9370 in the case of U-Net($5\times 5$ ) to 0.9654, 0.9021, and 0.9478 in EMED-UNet with configuration {2,3,7,4,2}. Thus, on a larger dataset with more training images, the EMED-UNet performs better than U-Net. The UNet++ achieves slightly better precision, but it is outperformed by EMED-UNet{2,3,7,4,2} in terms of Dice Coefficient, IoU, and Loss, with much better efficiency.

TABLE 6 Results of Segmentation on Shenzhen CXR Dataset

Table 7 demonstrates the comparison between the U-Net base model, its other variants, and the EMED-UNet network on the COVID-19 CT Lesion Segmentation dataset. The EMED-UNet{2,3,7,4,2} outperforms the U-Net($3\times 3$ ) in terms of both accuracy and efficiency. But, the U-Net($5\times 5$ ) performs better in terms of accuracy. Still, an EMED-UNet configuration like EMED-UNet{2,3,7,4,1}, where the model loses some efficiency, can outperform the U-Net($5\times 5$ ), with around 4% improvement in the dice coefficient, 4.3% increment in the IoU, and also 1% increment in the precision. Similarly, a significant increment is observed in the accuracy and efficiency when compared to UNet++.

TABLE 7 Results of Segmentation on C-19 CT LS Dataset

Table 8 shows that the performance of the EMED-UNet models outperforms all the other models in terms of both accuracy (DC) and efficiency (FLOPs and Params(M)). The EMED-UNet{2,3,7,4,1} outperforms the other models with a DC of 0.74 with only 1.84 M parameters and 0.38G FLOPs. It turns out that the EMED-UNET{2,3,5,2,1} is the lightest model, which also gives a decent accuracy of 0.272. Thus, it could be useful for specific purposes.

TABLE 8 Results of Segmentation on BraTS Dataset. Note: The Input Image Size Used is $96\times96.$

$Table 8- Results of Segmentation on BraTS Dataset. Note: The Input Image Size Used is $96\times96.$$

TABLE 9 Results of Studies From Section-IV Subsections-E and F

The above results prove that the embedded U-Net structure of the EMED-UNet Network is efficient in learning parameters as compared to the dense architecture of U-Net++, as it takes fewer parameters for learning the salient features of the medical images.

Fig. 4 shows the Dice Coefficient vs. FLOPS on four medical imaging datasets. It can be observed that the EMED-UNet model consistently outperformed the U-Net and its other variants in terms of efficiency. EMED-UNet model achieved a gain in accuracy on the MC, Shenzhen, and BraTS datasets, while it lost some accuracy on the COVID-19 CT dataset.

FIGURE 4.

Dice Coefficient vs. FLOPS on Montgomery County, Shenzhen CXR, Covid-19 CT Lesion segmentation, and the BraTS datasets.

Show All

Fig. 5 demonstrates a visual output comparison between U-Net and EMED-UNet segmentation results. In terms of accuracy, both the U-Net and EMED-UNet are performing well, but in some edge cases, EMED-UNet performs better. For example, as shown in Fig. 5, the U-Net predicted the ‘1’ (in binary image) in some places, while EMED-UNet has a false prediction. The accuracy of the results can be verified from the numerical results in Table 4.

FIGURE 5.

Qualitative results comparison between output samples generated with U-Net and the EMED-UNet models on all the datasets. The EMED-UNet configuration used is {2,3,7,4,2} for the Shenzhen and C-19 Datasets, and {2,3,5,2,1}, on the Montgomery County dataset.

Show All

E. Ablation Studies

We conducted two ablation studies: First, to compare the EFE module with another backbone; Second, to compare the proposed Network architecture (Multi-Encoder-Decoder network) with the base U-Net architecture. We chose the strong residual backbone [48] and evaluated its effects on the proposed Multi-Encoder-Decoder network. We replace the EFE module with a ResNet module and compare the efficiency and accuracy in both cases. For the purposes of this section, let us refer to EMED-UNet without the EFE module as MED-UNet. The MED-UNet with the residual module is denoted by MED-UNet(res).

The residual module used in place of the EFE module consists of two convolution layers, one $1\times 1$ convolution (to match the third dimension) and the addition operation. The residual module takes three inputs: input feature maps, the number of filters (F), the parameter ‘K’, and the type of filter used in convolution (i.e., the $3\times 3$ filter or $5\times 5$ ). For example, the node $X_{5,1,2}$ has the number of filters $F = 128$ and $K=5$ , and the input to that residual module is the output of the node $X_{5,2,0}$ . Further, the two convolution layers used have the kernel size of $3\times 3$ with a dilation rate of 2 to maintain the receptive field of 5 (as $K=5$ ).

1) Ablation Study - 1

In the MED-UNet, the configuration will be {$n$ ,$K_{min}$ ,$K_{max}$ }. There are no reduction factors involved as there is no EFE module present. We used the configurations: MED-UNet{2,3,7}(res) (with residual module), EMED-UNet{2,3,7,4,2} (with EFE), and EMED-UNet{2,3,7,4,1} (with EFE). In Fig. 6, the Res-SHZ refers to the MED-UNet with a residual backbone (i.e. MED-UNet(res)), and the EFE-SHZ refers to the original EMED-UNet, both on the Shenzhen Dataset and the Covid-19 CT Lesion Segmentation dataset.

FIGURE 6.

Results of Ablation study. The Model ‘Res-XX’ denotes the residual backbone on the multi-encoder-decoder network, and ‘EFE-XX’ denotes the multi-encoder-decoder network with the EFE module, which is basically the EMED-UNet architecture.

Show All

Fig. 6 and Table 8 describe the results of the ablation study. The MED-UNet(res) used 81% more parameters as compared to the EMED-UNet{2,3,7,4,2}. It achieves a negligible accuracy gain on the Shenzhen dataset and a small gain on the COVID-19 CT dataset. This demonstrates that the EFE modules work well in finding the balance between accuracy and efficiency. The detailed results comparison is shown in Table 8.

2) Ablation Study - 2

In this study, we tested the effects of different backbones on both the MED-UNet and the U-Net architectures. We considered the ResNet backbones. The MED-UNet network consists of multiple embedded U-Nets, with each U-Net extracting information at different receptive fields. We demonstrate that the MED-UNet network outperforms U-Net in terms of accuracy by using a common backbone in both networks.

The residual module used in MED-UNet is the same as the one used in the ablation study. Here we are considering the same model of MED-UNet(res) with a residual module as the one we considered in the ablation study - 1. In the case of U-Net, we use regular $3\times 3$ kernels in the residual module rather than using dilated convolutions. The U-Net model with residual backbone is denoted by U-Net($3\times 3$ )(res).

Again, we conducted the experiment on the COVID-19 CT LS dataset. The results in Table 8 show the strength of the Multi-Encoder-Decoder model architecture. It is observed that the MED-UNet{2,3,7,4,2}(res) outperforms the U-Net($3\times 3$ )(res) with the dice coefficient of 0.7992, with U-Net having a dice coefficient of 0.7353. This shows the strength of the EMED-UNet Network architecture.

SECTION IV.

Discussion

The ablation study demonstrates the impact of the EFE module on the overall model efficiency. Using the reduction factors as controls, we can find the best trade-off between accuracy and efficiency. In the ablation study, we observed that the EMED-UNet{2,3,7,4,2} consisting of 6.35M parameters nearly underperforms compared to MED-UNet{2,3,7}(res), consisting of 36M parameters, in terms of accuracy. However, if we consider a different configuration of EMED-UNet like {2,3,7,4,1} consisting of 10.39M parameters, it slightly crosses MED-UNet{2,3,7} with the residual module with a dice coefficient of 0.8018, on the C-19 dataset. Thus, by compromising on efficiency, one can get better accuracy.

While increasing the value of the reduction factors in the EFE module, the model is losing accuracy as we reduce the number of channels in the feature maps. However, we can recover the accuracy using the proposed structure of the EMED-UNet network. The network consisting of multiple encoders and decoders can capture the salient features from the images using multiple receptive fields. The strength of the EMED-UNet network is shown in the study described in section E. Thus, the EMED-UNet network with the EFE module, i.e. the EMED-UNet network, is the best fit.

Based on the detailed study of the proposed network, we can see that a major improvement in results is arising due to the network structure. This work might be further uplifted by observing the effect of a transformer backbone on the same network structure while testing multiple network configurations.

SECTION V.

Conclusion

The proposed EMED-UNet network architecture was demonstrated as very effective at segmenting biomedical images with improved accuracy and significantly reduced parameters, FLOPS, and model size as compared to the standard U-Net base model. By employing the EFE module, our network decreases the computational complexity and memory required for the semantic segmentation task. The EMED-UNet can learn all the salient features in the image by introducing multiple encoder-decoders to extract features at multiple receptive fields, incorporating collaborative learning using the deep supervision technique. We evaluated our multi-encoder-decoder network on four medical image segmentation datasets and observed that it matches or exceeds the accuracy of the standard U-Net base, measured using multiple performance metrics, on a very low parametric cost, significantly fewer FLOPS, and low memory consumption. Iterating through different parameters in the model, one can find the best configuration for the segmentation task on any medical image modality that maintains a good trade-off between accuracy and efficiency.

References is not available for this document.

EMED-UNet: An Efficient Multi-Encoder-Decoder Based UNet for Medical Image Segmentation

Abstract:

Metadata

Abstract: