Introduction
Image segmentation is a computer vision task that categorizes an input image or video frame into a pre-defined number of classes by generating non-intersecting, easily interpretable regions of the input that are beneficial for further processing. Image segmentation is considerably more complex than other computer vision tasks, such as image classification, because image classification categorizes an input by processing the entire image [1], whereas image segmentation generates an output for every single pixel.
Image segmentation has numerous real-life applications, including video surveillance [2], augmented reality [3], and driverless cars [4]. Its most noteworthy application is in medicine, where it provides a detailed illustration of the human body for anatomical analysis, detects illnesses, and identifies the severity of a disease, to name a few uses [5]. Medical image segmentation is directly associated with a person's health and life; hence, it must be highly accurate to help prevent or cure an illness [6], [7], [9], [10].
Based on the input specifications, image segmentation can broadly be divided into two distinct groups: binary and multiclass segmentation. Binary segmentation has two categories, namely background and foreground; some applications of medical image segmentation belong to this group [9], [10]. Multiclass segmentation, on the other hand, may have more than two classes, as in semantic segmentation for autonomous driving applications [11].
Owing to the notable performance of deep learning (DL) methods, artificial intelligence (AI) systems have been shown to outperform humans in image classification tasks [13], [14]. While a person can compete with an AI system in image classification, this is impossible in image segmentation because pixel-by-pixel classification is prohibitively tedious and not feasible given the enormous quantity of data in modern medical images. Therefore, generating precisely segmented medical images using AI techniques has become a research hotspot [16].
Because DL methods are critical for medical image segmentation, extensive research has been conducted in this domain. The most popular DL model architecture in this field is U-Net [17]. Since its introduction in 2015, researchers have proposed DL-based networks that achieve state-of-the-art performance [9], [16], [18]–[22]. However, some of these models [16], [18]–[20], [22] perform complex computations, which makes them unusable on machines with limited computational resources; in addition, these computationally expensive models require extremely long training times. Conversely, some efficient models [9], [21] cannot attain state-of-the-art performance and cannot generate accurate medical image segmentations. These problems should be addressed to ensure further progress in medical image segmentation. Considering the existing shortcomings, we propose herein an accurate and efficient deep convolutional neural network (DCNN) model, called AEDCN-Net, that alleviates the current issues by reducing the number of trainable parameters and the training/inference time while improving segmentation accuracy. The contributions of this study are fourfold:
The AEDCN-Net benefits from bottleneck, atrous, and asymmetric convolution-based skip connections in the encoding path and the nearest-neighbor interpolation method in the decoding path, which significantly reduce the number of trainable model parameters.
Owing to its carefully designed architecture, AEDCN-Net is, on average, 40% faster than the existing computationally expensive methods that achieve state-of-the-art performance in medical image segmentation.
Although AEDCN-Net demands fewer trainable parameters and less training time, it achieves superior accuracy and generates more precisely segmented medical images compared with its counterparts.
To the best of our knowledge, no previously proposed model has outperformed the existing methods in both computational efficiency and segmentation accuracy. Therefore, the proposed model can serve as a benchmark for further studies in the medical image segmentation domain.
The rest of this paper is structured as follows: Section II provides detailed information on the existing methods in medical image segmentation; Section III explains the proposed methodology in detail; Section IV presents the experimental details; Section V discusses the results of the experiments and a qualitative comparison of the considered models; and finally, Section VI concludes this study and presents future research directions.
Related Work
This section summarizes the currently available methods used in medical image segmentation. Based on their characteristics, they can broadly be categorized into computationally expensive and powerful models, and lightweight and efficient models.
A. Computationally Expensive and Powerful Models for Semantic Segmentation
After the introduction of convolutional neural network (CNN) models in computer vision tasks, considerable progress has been observed in medical image segmentation accuracy. The most notable DL-based network is a fully convolutional encoder-decoder architecture for biomedical image segmentation, called U-Net [17]. The existing DL-based methods attaining state-of-the-art performance in medical image segmentation have model architectures similar to U-Net [23]; they are essentially enhanced U-Net variants. For example, Zhou et al. proposed a novel encoder-decoder architecture that uses blocks of nested, dense skip connections [20]. These pathways reduce the semantic gap between the feature maps of the encoder and decoder sub-networks, which helped the model significantly outperform the existing methods. Isensee et al. developed a robust and self-adapting framework on the basis of the original U-Net architecture [19]. The network benefits from the leaky rectified linear unit activation function and instance normalization to achieve a performance better than that of the original U-Net. Li et al. improved the U-Net architecture with residual connections by increasing the network depth and adding strong dropouts to extract finer features, which enabled state-of-the-art performance in fundus image segmentation [18]. Similarly, Jha et al. developed a ResUNet++ model architecture using a conditional random field and test-time augmentation, which achieved superior performance compared with the existing DL-based networks on various polyp segmentation datasets [22]. Although these models exhibit superior accuracy and precision in medical segmentation, they require an enormous number of trainable parameters; therefore, they are computationally expensive.
B. Efficient and Lightweight Models for Medical Image Segmentation
To devise efficient DL-based models, Mehta et al. introduced a lightweight network that employs group point-wise and depth-wise dilated separable convolutions to achieve state-of-the-art performance in semantic segmentation [21]. Similarly, [24] and [25] used compression techniques, such as vector quantization, to increase the speed of semantic segmentation models. Punn et al. also presented an inception U-Net architecture [26] inspired by [27]. This network illustrates the model's perception of target segmentation images using activation maximization and filter-map visualization techniques and attains superior accuracy. Gadosey et al. developed a modified version of U-Net for devices with low computational power based on bottleneck layers [28]. They used depth-wise separable convolutions throughout the network. In addition, the model benefited from a weight standardization algorithm combined with the group normalization method. These modifications made the model computationally efficient and lightweight. Similarly, Olimov et al. presented a fast U-Net (FU-Net) model relying on bottleneck convolution layers in the encoding and decoding paths of the model, which allowed medical image segmentation on devices with limited computational power and memory [9]. Although these models address the problem of efficient computation, they do not provide highly accurate segmented images.
Proposed Methodology
This section presents AEDCN-Net in detail. Figure 1 shows an overview of the proposed methodology. AEDCN-Net has three distinct stages: data preprocessing, data learning, and inference.
A. Data Preprocessing
In data preprocessing, raw medical images are prepared for training with the DCNN model. First, the images are resized to match the network input size and converted to grayscale. Then, the pixel values are standardized to zero mean and unit variance as follows: \begin{equation*} X_{std} = \frac {X - \displaystyle \frac {1}{M}\displaystyle \sum _{i=1}^{M} x_{i}} {\sqrt {\displaystyle \frac {1}{M}\displaystyle \sum _{i=1}^{M} {\left ({x_{i} - \displaystyle \frac {1}{M}\displaystyle \sum _{i=1}^{M} x_{i} }\right)}^{2} }} \tag{1}\end{equation*}
In (1), $X$ denotes the input image, $x_{i}$ is the value of the $i$-th pixel, $M$ is the total number of pixels, and $X_{std}$ is the standardized output with zero mean and unit standard deviation.
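For concreteness, the following minimal NumPy sketch implements the per-image standardization in (1); the small epsilon term is our addition to guard against division by zero and is not part of the original formula.

import numpy as np

def standardize(image: np.ndarray) -> np.ndarray:
    """Per-image standardization as in (1): zero mean, unit variance."""
    mean = image.mean()                    # (1/M) * sum of all pixel values
    std = image.std()                      # population standard deviation
    return (image - mean) / (std + 1e-8)   # epsilon avoids division by zero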
Most medical image databases suffer from data scarcity [9]. To alleviate this issue, after completing the data standardization, we applied data augmentation based on the characteristics of the medical image data. The augmentation techniques should be chosen carefully based on the dataset's image characteristics; otherwise, they can degrade the performance of the DCNN model in the data learning stage. Augmentation is part of the preprocessing stage: it is pre-computed only once before training, and every epoch in the learning phase uses the same augmented images (see the sketch after this list). We used the following data augmentation techniques:
Horizontally flipping the images;
Randomly shifting the image dimensions in the range of an integer value $x$;
Zooming the images in the range of a random integer value $x$;
Randomly changing the angle of the images by an integer value of $y$.
In the proposed method, we used
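Below is a hedged Python sketch of this pre-computed augmentation using NumPy and SciPy; the shift range, rotation angle, and random seed are hypothetical placeholders (the paper's actual values of $x$ and $y$ are not reproduced here), and the zoom operation is omitted for brevity.

import numpy as np
from scipy.ndimage import rotate

def augment_once(images, shift_px=10, angle_deg=15, rng=None):
    """Pre-compute augmented copies once before training.

    shift_px and angle_deg are hypothetical placeholders for the paper's
    x and y values. Every epoch reuses the returned images unchanged.
    """
    if rng is None:
        rng = np.random.default_rng(42)   # assumed seed for reproducibility
    augmented = []
    for img in images:
        augmented.append(np.fliplr(img))                       # horizontal flip
        dy, dx = rng.integers(-shift_px, shift_px + 1, size=2)
        augmented.append(np.roll(img, (dy, dx), axis=(0, 1)))  # random shift
        angle = int(rng.integers(-angle_deg, angle_deg + 1))
        augmented.append(rotate(img, angle, reshape=False, mode="nearest"))
    return np.asarray(augmented)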
B. Data Learning
After obtaining the preprocessed medical images from the first stage of the proposed methodology, we trained the DCNN model on them. Figure 2 shows the AEDCN-Net model architecture, which is similar to the original U-Net; however, several modifications enhance the performance of the proposed architecture. Specifically, it comprises atrous-asymmetric convolution (ATAS) blocks, max-pooling, concatenation, and upsampling operations. The ATAS blocks are responsible for learning useful features from the preprocessed medical images. Table 1 presents details of the ATAS blocks.
DCNN model architecture of the proposed method containing atrous-asymmetric convolution (ATAS) blocks for encoding and decoding paths.
Table 1 shows the two branches of the ATAS blocks, namely the main and secondary branches. First, a raw medical image enters the main branch, passing through bottleneck, atrous, and asymmetric convolution operations.
1) Bottleneck Convolution
The bottleneck convolutional layer exploits convolution filters measuring $1 \times 1$ to reduce the number of channels before the spatial convolution is applied, which lowers both the number of weights and the number of floating-point operations (FLOPs). Equation (2) compares a regular convolution with its bottleneck counterpart: \begin{align*} W_{conv}=&c^{l-1} \times x\times y \times c^{l} \\ W_{bnck}=&c^{l-1} \times \frac {c^{l}}{b} + \frac {c^{l-1}}{b} \times x\times y \times c^{l} \\ FLOPs_{conv}=&H \times W \times c^{l-1} \times x\times y \times c^{l} \\ FLOPs_{bnck}=&H \times W \times \left ({c^{l-1} \times \frac {c^{l}}{b} + \frac {c^{l-1}}{b} \times x\times y \times c^{l}}\right) \tag{2}\end{align*}
In (2), $W_{conv}$ and $W_{bnck}$ denote the numbers of weights of the regular and bottleneck convolutions, $FLOPs_{conv}$ and $FLOPs_{bnck}$ their respective numbers of FLOPs, $c^{l-1}$ and $c^{l}$ the numbers of input and output channels of layer $l$, $x \times y$ the kernel size, $b$ the bottleneck reduction factor, and $H$ and $W$ the height and width of the feature map.
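As an illustration, a Keras sketch of a bottleneck convolution consistent with (2) might look as follows; the reduction factor b = 4 and the kernel size are assumed placeholders, not the paper's settings.

from tensorflow.keras import layers

def bottleneck_conv(x, out_channels, kernel_size=3, b=4):
    """1x1 channel reduction followed by the spatial convolution, so the
    weight count follows W_bnck in (2). b=4 is an assumed reduction factor."""
    x = layers.Conv2D(out_channels // b, 1, padding="same", use_bias=False)(x)
    x = layers.Conv2D(out_channels, kernel_size, padding="same", use_bias=False)(x)
    return x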
2) Atrous Convolution
The atrous convolution uses an atrous factor $a$ that inserts gaps between the kernel elements, enlarging the receptive field without increasing the number of parameters: \begin{equation*} (A \ast _{a} F)(p) = \sum _{c+a b=p} A(c)F(b) \tag{3}\end{equation*}
In (3), $A$ is the input feature map, $F$ is the filter, $a$ is the atrous factor, and $p$, $c$, and $b$ index the output, input, and filter positions, respectively; a regular convolution corresponds to $a = 1$.
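In frameworks such as TensorFlow/Keras, the atrous convolution in (3) corresponds to the dilation_rate argument of Conv2D; the filter count, kernel size, and atrous factor below are illustrative assumptions.

from tensorflow.keras import layers

# dilation_rate=a spreads the kernel taps apart, enlarging the receptive
# field with no extra parameters; filters=64, kernel_size=3, and a=2 are
# illustrative choices only.
atrous = layers.Conv2D(filters=64, kernel_size=3, dilation_rate=2,
                       padding="same", use_bias=False)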
3) Asymmetric Convolution
Equation (4) represents the regular convolution operation between an image $I$ and a filter $F_{t}$: \begin{align*} \Psi _{conv}=&I \ast F_{t} \\=&\sum _{x=1}^{X}\sum _{y=1}^{Y}\sum _{c=1}^{C} I\,(h-x, w-y, c) \,F_{t}\,(x, y, c) \tag{4}\end{align*}
In (4), $h$ and $w$ index the spatial position of the output, and $X$, $Y$, and $C$ denote the filter height, filter width, and number of channels, respectively. The asymmetric convolution factorizes this operation into two consecutive one-dimensional convolutions, first along the height and then along the width, as shown in (5): \begin{align*} \Psi _{th}=&I \ast F_{th} \\=&\sum _{x=1}^{X}\sum _{c=1}^{C} I\,(h-x, y, c) \,F_{th}\,(x, 1, c) \\ \Psi _{tw}=&\Psi _{th} \ast F_{tw} \\=&\sum _{y=1}^{Y}\sum _{tw=1}^{TW} \Psi _{th}\,(x, w-y, tw) \,F_{tw}\,(1, y, tw) \\ \hat {\Psi }_{tac}=&((I \ast F_{th}) \ast F_{tw}) \tag{5}\end{align*}
In (5), $F_{th}$ and $F_{tw}$ are the one-dimensional filters of sizes $X \times 1$ and $1 \times Y$, respectively, $TW$ is the number of intermediate channels, and $\hat{\Psi}_{tac}$ is the output of the asymmetric convolution, which approximates the regular convolution in (4) with fewer parameters.
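A minimal Keras sketch of the factorization in (5) follows; the kernel size k = 3 is an assumed example.

from tensorflow.keras import layers

def asymmetric_conv(x, filters, k=3):
    """Factorized convolution as in (5): a (k x 1) pass along the height,
    then a (1 x k) pass along the width, replacing one (k x k) kernel."""
    x = layers.Conv2D(filters, (k, 1), padding="same", use_bias=False)(x)  # F_th
    x = layers.Conv2D(filters, (1, k), padding="same", use_bias=False)(x)  # F_tw
    return x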
4) Model Architecture
We progressively increased the number of filters in the encoding path. The first convolution layer contained 64 filters that have a size of
Moreover, the max-pooling operation decreases the spatial dimensions of the images by a factor of two, reducing the computational complexity. The upsampling operation recovers the original image size as training progresses by enlarging the output of the ATAS block in the decoding path by a factor of two. In the proposed model architecture, we used the nearest-neighbor interpolation method to recover the original image size, as in [9]. We chose this operation because it has no trainable parameters and thus reduces the number of parameters to train, which is consistent with our objective of developing an accurate and efficient DCNN model. Finally, the concatenation operation connects the output of the ATAS blocks in the encoding path to the corresponding output of the upsampling operation in the decoding path. This concatenation helps alleviate the feature loss resulting from the max-pooling and upsampling operations.
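The following Keras sketch illustrates one decoder step under these design choices (parameter-free nearest-neighbor upsampling plus skip-connection concatenation); it is a simplified illustration, not the exact AEDCN-Net decoder.

from tensorflow.keras import layers

def decoder_step(x, skip):
    """One decoder step: nearest-neighbor upsampling by a factor of two
    (no trainable weights), then concatenation with the encoder skip."""
    x = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    return layers.Concatenate(axis=-1)([x, skip])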
In the end, the output of the ATAS blocks passed through a
5) Loss Function
We used the sum of two loss functions, namely the cross-entropy loss and the dice loss, as the objective for minimization. The loss function is formulated as follows:\begin{align*}&\hspace {-.5pc} L_{f} =\left ({\frac {1}{M}\sum _{i=1}^{M}-y_{i}\log (\hat {y}_{i})}\right) \\&+ \left ({-\frac {2}{N}\sum _{n=1}^{N} \frac {\sum _{p=1}^{P}y_{p}^{n}\hat {y}_{p}^{n}}{\sum _{p=1}^{P}y_{p}^{n} + \sum _{p=1}^{P}\hat {y}_{p}^{n}}}\right) \tag{6}\end{align*}
In (6), $M$ is the number of pixels, $y_{i}$ and $\hat{y}_{i}$ are the ground-truth label and predicted probability of the $i$-th pixel, $N$ is the number of classes, $P$ is the number of pixels per image, and $y_{p}^{n}$ and $\hat{y}_{p}^{n}$ are the ground-truth and predicted values of pixel $p$ for class $n$.
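A TensorFlow sketch of a combined loss in the spirit of (6) is shown below for the binary case; note that we use the common 1 − dice form and a small smoothing constant, both of which are our assumptions rather than the paper's exact formulation.

import tensorflow as tf

def combined_loss(y_true, y_pred, smooth=1e-6):
    """Binary cross-entropy plus dice loss, in the spirit of (6).

    The 1 - dice form and the smoothing constant are common conventions
    assumed here, not the paper's exact formulation.
    """
    ce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return ce + (1.0 - dice)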
C. Inference
After completing the data learning stage and obtaining a trained DCNN model, we can employ it to generate segmented medical images in the inference stage. In this step, the raw data pass through the same preprocessing operations as in the training stage, except for data augmentation. A test set of a dataset or real-life medical images is resized, transformed into grayscale, and standardized using (1). For standardization,
Experiments and Results
This section describes the conducted experiments and their results and presents a comparison of the performances of the proposed method and the existing state-of-the-art models.
A. Experiment Datasets
For the experiments, we employed four publicly available and widely used medical image datasets: the 2018 Liver Tumor Segmentation (LiTS) challenge dataset containing abdominal computed tomography (CT) scans [31], the 2018 Data Science Bowl (DSB) challenge dataset containing a large number of segmented nuclei images [32], the Kvasir-SEG dataset containing polyp images [33], and the International Skin Imaging Collaboration (ISIC) 2018: Skin Lesion Analysis Toward Melanoma Detection challenge dataset containing dermoscopic images [34]. Real-life medical image datasets often suffer from limited data for training and validation [35], [36]; therefore, we used datasets with both limited (2018 LiTS: 331) and ample (ISIC 2018: 2594) training images to test the performance of the proposed method from different angles. Table 2 presents the details of these datasets.
B. Baseline Models
We selected five recent medical image segmentation DCNN models that attain state-of-the-art performance as baselines for comparison with the proposed method: FU-Net [9], nnU-Net [19], UNet++ [20], ESPNetv2 [21], and ResUNet++ [22]. A detailed summary of these models is provided in Section II; hence, their specifications are not repeated here.
C. Training Setup
We implemented the baseline and proposed methods using Python 3.6.9 and TensorFlow 2.4.0. We initialized the weight parameters from a standard normal distribution (mean 0, standard deviation 1) to follow the standards of the WIB-ReLU activation function [29]. We did not use bias parameters because they are canceled out when batch normalization is used. We used the combined cross-entropy and dice loss as the function for minimization (refer to Section III-B5) and the Adam optimizer [37] with learning rate
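The stated setup could be expressed in TensorFlow as in the sketch below; the learning rate shown is Adam's default and only a placeholder, since the paper's value is not given here.

import tensorflow as tf
from tensorflow.keras import layers, initializers

# N(0, 1) weight initialization and no bias terms (batch normalization
# cancels them), as stated above; the learning rate is Adam's default,
# used here only as a placeholder.
init = initializers.RandomNormal(mean=0.0, stddev=1.0)
conv = layers.Conv2D(64, 3, padding="same", use_bias=False,
                     kernel_initializer=init)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)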
D. Evaluation Metrics
We assessed the performance of the baseline and proposed methods using several evaluation metrics, including pixel accuracy (PA), dice coefficient (DC), and mean intersection over union (mIoU). The formulas of these evaluation metrics are as follows:\begin{align*} PA=&\frac {1}{M}\sum _{i=1}^{M}\frac {\sum _{p=1}^{P} \mathbb {1}(\hat {y}_{p} = y_{p})}{\sum _{p=1}^{P} y_{p}} \\ DC=&\frac {2 \times TP}{2 \times TP + FP + FN} \\ mIoU=&\frac {TP}{TP + FP + FN} \tag{7}\end{align*}
Equation (7) shows the computation of the considered evaluation metrics, where $M$ is the number of test images, $P$ is the number of pixels in an image, $\mathbb{1}(\cdot)$ is the indicator function, and $TP$, $FP$, and $FN$ denote the numbers of true-positive, false-positive, and false-negative pixels, respectively.
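For reference, the DC and mIoU metrics in (7) can be computed from binary masks as in the following NumPy sketch.

import numpy as np

def dice_and_miou(y_true: np.ndarray, y_pred: np.ndarray):
    """Compute DC and mIoU for binary masks from TP/FP/FN counts, as in (7)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    dc = 2 * tp / (2 * tp + fp + fn)
    miou = tp / (tp + fp + fn)
    return dc, miou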
Discussion
This section discusses the results of the conducted experiments in terms of computational and memory efficiency and shares the results of the ablation studies. Moreover, it presents a qualitative comparison of the baseline and proposed methods and enumerates the limitations of the proposed method.
A. Experiment Results
Table 3 summarizes the experimental results of the considered models on the test sets of the aforementioned datasets. As the table shows, the proposed model achieved high training and inference speed and significantly outperformed the existing computationally expensive models, such as ResUNet++ and nnU-Net, by achieving nearly
In terms of the accuracy-related metrics, the proposed model considerably outperformed the baseline networks on the datasets with a limited number of medical images, such as 2018 LiTS and 2018 DSB, primarily because the computationally expensive models with a large number of trainable parameters overfit and could not generalize well to the unseen test data. However, in the experiments on datasets with 1000 or more images, such as Kvasir-SEG and ISIC 2018, the nnU-Net and ResUNet++ models attained better performance than the lightweight models owing to their greater numbers of computations and parameters. AEDCN-Net still largely outperformed the lightweight models and achieved at least the second-best result in terms of the PA, DC, and mIoU metrics on the considered datasets.
B. Computational and Memory Efficiency
We also compared the considered models in terms of trainable parameters, model size, and FLOPs. Table 4 presents the evaluation results.
As Table 4 shows, AEDCN-Net required nearly seven and 15 times fewer trainable parameters than the lightweight and computationally expensive models, respectively. Moreover, the size of the proposed model was considerably smaller than that of the baseline networks. Finally, AEDCN-Net was computationally efficient, requiring the fewest FLOPs to produce the medical image segmentation.
C. Ablation Studies
Table 5 analyzes the effect of the different components of the proposed method on the accuracy-related evaluation metrics and the number of trainable parameters. To reduce the computational cost of the experiments, we conducted the ablation studies on the datasets with the fewest and the most images.
As shown in Table 5, the asymmetric convolution operation with the kernel sizes of
D. Qualitative Comparison of the Considered Models
After training and evaluating the models on the considered datasets, we show herein the segmented images generated by the baseline and proposed methods. Figure 3 depicts the input medical images, the ground-truth masks, and the segmentation masks generated by the considered methods. The most efficient baseline model, FU-Net, failed to generalize well on the test images; its inferior performance was particularly noticeable in the segmented images from the 2018 DSB dataset. In addition, nnU-Net produced lower-quality segmentation masks on the Kvasir-SEG and ISIC 2018 datasets. Notably, the proposed method produced more detailed and precise segmented medical images than the baseline methods on all considered datasets.
Comparison of the segmentation results: (a) input images; (b) ground truth masks; and the corresponding segmented masks using (c) FU-Net, (d) nnU-Net, (e) UNet++, (f) ESPNetv2, (g) ResUNet++, and (h) AEDCN-Net. The test set of the 2018 DSB dataset had no ground truth mask; therefore, there is no image in the second row and the second column of the figure.
E. Limitations of the Proposed Method
The results of the experiments on four medical image datasets and the comparison with the existing state-of-the-art models showed that the proposed AEDCN-Net outperformed the baseline models in terms of speed, memory efficiency, and accuracy. However, the proposed method has several limitations. First, some datasets used in the experiments have a limited number of training images, which cannot fully demonstrate the performance difference between the proposed method and the more powerful and computationally expensive networks. Second, the datasets considered in the experiments exhibit only binary (foreground and background) outputs. Although the proposed method can easily be employed for multiclass segmentation by slightly altering the activation function in the final output layer, this change can increase the computational complexity.
Conclusion and Future Work
This study investigated medical image segmentation using DL-based techniques. Based on an extensive literature review, we found that the currently available state-of-the-art methods in this field are computationally inefficient and slow, while the lightweight and efficient models cannot generate precisely segmented images. Therefore, we proposed the AEDCN-Net model, which benefits from carefully designed preprocessing and a computationally efficient DCNN that uses skip-connection-based bottleneck, atrous, and asymmetric convolution operations in the encoding path and the nearest-neighbor interpolation upsampling technique in the decoding path. In the experiments on four open-source medical image datasets, the proposed method showed superior performance in terms of computational efficiency, memory, and accuracy compared with the counterpart models. Moreover, AEDCN-Net significantly outperformed the efficient models, achieving better results across several evaluation metrics.
For the future directions of AEDCN-Net enhancement, we will work on increasing the accuracy of the proposed model and attempt to interpret the predicted segmented medical images based on the severity level of illness.