Introduction
Convolutional Neural Networks (CNNs) have significantly advanced the field of image segmentation in recent years [1], [2], [3], [4]. The U-Net architecture, in particular, has proven highly effective for medical semantic segmentation by connecting a series of CNN layers [5]. The contracting path of the U-Net extracts spatial information from the input, while the expanding path reconstructs the segmented output using this spatial information via skip connections.
Several variations of the U-Net have been proposed to further improve its performance. The UNET++ features a nested encoder-decoder design with dense connections between layers, allowing it to capture detailed spatial features that may be lost during downsampling [6]. The KiU-NET combines the U-Net and Kite-Net architectures in parallel, enabling it to detect small, indistinct anatomical structures more effectively [7].
CNN-based architectures are highly effective at learning local representations but struggle to learn global features due to their localized receptive fields [8]. Extracting global features with local receptive fields requires stacking many convolutional operations, which is computationally inefficient. Furthermore, the increased number of parameters makes optimization significantly more difficult. These limitations can lead to inadequate semantic segmentation, particularly for images containing multiple objects with diverse boundaries.
To overcome these challenges, dilated convolution and deformable convolution have been proposed to expand the receptive fields and capture a wider range of information [9], [10]. Dilated convolution enables a convolutional network to expand its receptive field by introducing different dilation rates. Deformable convolution adds flexibility by applying learned 2D offsets to the regular sampling grid of the kernel. Since these offsets are learned directly from the data, deformable convolution networks perform effectively in vision tasks requiring fine localization.
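As a concrete illustration, the sketch below (assuming PyTorch and torchvision; layer sizes are illustrative and not drawn from any model in this paper) contrasts a standard convolution with its dilated and deformable counterparts:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 16, 64, 64)  # (batch, channels, height, width)

# Standard 3x3 convolution: a fixed 3x3 receptive field.
standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation rate 2): the same 9 samples are spread
# over a 5x5 window, enlarging the receptive field without extra parameters.
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

# Deformable 3x3 convolution: a side branch predicts a 2D offset for each of
# the 9 kernel positions (2 * 3 * 3 = 18 offset channels), learned from data.
offset_pred = nn.Conv2d(16, 18, kernel_size=3, padding=1)
deformable = DeformConv2d(16, 16, kernel_size=3, padding=1)

y_standard = standard(x)                      # (1, 16, 64, 64)
y_dilated = dilated(x)                        # (1, 16, 64, 64)
y_deformable = deformable(x, offset_pred(x))  # (1, 16, 64, 64)
```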
However, even with these improvements, the receptive fields of convolutional layers remain constrained by the kernel window, and these methods still incur significant computational complexity [11]. While such enhanced CNNs can indeed capture global features, they often suffer from high computational cost and accuracy limitations. There is a growing demand to capture relationships between distant parts of an input image without sacrificing accuracy, which calls for new model architectures.
To address these issues, researchers have explored combining the CNN-based U-Net architecture with transformers, which have shown great promise in Natural Language Processing (NLP) tasks. Vision Transformers (ViT) split the input image into patches and compute the connections between them, enabling the extraction of global spatial features through the use of Multi-Head Self Attention (MHSA) layers [6], [13], [16]. The UNet Transformer (UNETR) incorporates a ViT backbone as its encoder, allowing it to learn global spatial features and compress spatial information effectively.
While numerous deep learning models have been proposed for organ segmentation, comparatively fewer have been developed specifically for tumor segmentation. This is due in part to the challenges posed by the small size and unpredictable locations of tumors in medical images. Many existing models have therefore focused on organ segmentation, which inherently includes tumor information.
To address these issues, we propose the Organ UNETR (OrgUNETR), a modified version of the UNETR architecture designed specifically for kidney and prostate tumor segmentation. Our model leverages both organ and tumor information to improve tumor segmentation performance compared to a baseline model that uses only tumor information. We achieve this by predicting organs and tumors through distinct channels, each with its own loss function. Through backpropagation, the model learns to locate both organs and tumors more accurately.
We further optimize the OrgUNETR architecture by replacing the MHSA layers with Squeeze and Excitation (SE) layers [17], [18], [19], [20], [21], [22]. The SE layers efficiently compute attention among feature channels, reducing the model’s computational cost while maintaining comparable performance. They also enhance the model’s ability to prioritize important features, making them particularly effective for medical segmentation tasks.
To validate the effectiveness of our approach, we compare the performance of OrgUNETR to a baseline model trained only on tumor information. The proposed model is evaluated on two tumor segmentation datasets, KiTS19 and Prostate158, to assess the generality of its performance.
The main contributions of this paper are as follows:
We demonstrate that including organ information enables more accurate tumor prediction compared to a baseline model, as organ and tumor information are inherently related.
By substituting MHSA layers with SE layers, we reduce the computational cost of OrgUNETR while maintaining tumor segmentation accuracy, making it more practical for real-world applications.
Related Works
A. Tumor Segmentation Task
Tumor segmentation is an essential process in medical diagnosis, facilitating the precise identification of anatomical structures [23]. It involves classifying every pixel, distinguishing tumor regions from surrounding tissue. The objective of tumor segmentation is to accurately delineate tumor boundaries across medical images such as CT scans, Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) scans. This task is critical not only for diagnosing diseases but also for strategic treatment planning, enabling personalized medicine approaches.
CT and MRI scans are favored for segmentation tasks because they provide detailed images that facilitate the precise identification of abnormalities such as tumors, making them crucial for accurate disease diagnosis [24]. Traditionally, tumor segmentation from these scans has been performed manually by physicians, a process that is not only time-consuming but also demands significant human resources [25]. To conserve these resources and enhance the efficiency of the diagnostic process, deep learning models have been introduced [26]. Such models are trained on medical image datasets that include detailed annotations of tumor locations, enabling them to learn and subsequently perform the tumor segmentation task autonomously. The integration of deep learning into tumor segmentation significantly improves efficiency [27]. By providing tumor location information, the model assists physicians in making diagnostic decisions. Deep learning models therefore streamline the diagnostic process and improve the accuracy of tumor detection and segmentation, thereby enhancing the efficiency and effectiveness of medical treatment.
There have been numerous attempts to develop deep learning models for the tumor segmentation task. For example, the Swin U-Net Transformer (SwinUNETR) detects tumors by using input features at different resolutions extracted by its encoder, which employs shifted windows to compute self-attention [28]. Likewise, nnFormer successfully segments tumors through the cooperation of convolution and self-attention; by employing both local and global self-attention mechanisms, it achieves more precise predictions [29].
B. U-Net Architecture
The U-Net architecture, initially proposed for biomedical image segmentation, has become a widely adopted CNN-based model known for its effectiveness [5]. The U-Net consists of two primary paths: the contracting path and the expansive path. The contracting path is responsible for capturing the contextual information of the input image by gradually reducing its spatial dimensions while simultaneously increasing the depth of the feature channels. On the other hand, the expansive path, which is symmetrical to the contracting path, focuses on reconstructing the segmented image. It utilizes a series of upsampling operations to expand the feature maps back to the original input dimensions. During this reconstruction process, the expansive path receives feature maps from the corresponding levels of the contracting path via skip connections. These skip connections play a crucial role in transferring detailed spatial information, enabling more precise localization of features in the segmented output. By employing this dual-path architecture, the U-Net effectively captures both the global context and local details, making it highly suitable for biomedical image segmentation tasks.
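To make the dual-path mechanism concrete, the following minimal sketch (assuming PyTorch; a single contracting and expansive step with illustrative channel counts) shows how a skip connection carries detailed spatial features from the contracting path to the expansive path:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # contracting path: halve H, W
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expansive path: double H, W
        # After concatenating the skip feature, channels are 16 + 16 = 32.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, 2, 1)                    # per-pixel class logits

    def forward(self, x):
        skip = self.enc(x)                   # detailed spatial features
        z = self.bottleneck(self.down(skip))
        z = self.up(z)
        z = torch.cat([z, skip], dim=1)      # skip connection restores localization
        return self.head(self.dec(z))

logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> (1, 2, 64, 64)
```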
Several variants of the U-Net have been introduced to further enhance its performance. One notable example is the UNET++, which features a nested encoder-decoder design with dense connections between the layers. These dense connections allow the UNET++ to capture fine-grained spatial features that might otherwise be lost during the downsampling process. By preserving these detailed features, the UNET++ is able to generate more accurate segmentation results, particularly in scenarios where small or intricate structures are present.
Another variant, the KiU-NET, combines the strengths of the U-Net and Kite-Net architectures by arranging them in parallel. This unique configuration enables the KiU-NET to effectively detect small and indistinct anatomical structures. The Kite-Net component of the KiU-NET expands the feature maps before downsampling them, which is in contrast to the traditional U-Net approach. By employing these two complementary networks, the KiU-NET is able to capture both the global context and the fine details of the input image, resulting in improved segmentation performance for tiny and blurred objects.
Despite the remarkable success of CNN-based architectures like the U-Net and its variants, their ability to learn global features is inherently limited by the localized nature of their receptive fields. This limitation can lead to suboptimal semantic segmentation results, particularly when dealing with images containing multiple objects with diverse boundaries. To address this issue, researchers have explored various techniques, such as dilated convolutions and deformable convolutions, which aim to enlarge the receptive fields of the CNN layers. However, these approaches still face constraints in terms of the kernel window size and computational complexity, limiting their effectiveness in capturing long-range dependencies.
C. Vision Transformer
Transformers, which were originally introduced in the field of machine translation, have revolutionized the way sequence-to-sequence tasks are approached. These models replace the traditional recurrent and convolutional operations with self-attention mechanisms, enabling them to effectively capture long-range dependencies and achieve state-of-the-art performance [30], [31]. The success of transformers quickly spread beyond machine translation, finding applications in a wide range of NLP tasks. As a result, transformers have become the go-to architecture for many NLP applications, such as text classification, question answering, and language generation.
Another notable work in this direction is the Vision Transformer (ViT) [14], [32], [34], which represents a significant departure from traditional CNN-based architectures. Unlike CNNs, which primarily focus on local features, the ViT model excels at capturing global context by comparing patches of the input image. The ViT divides the image into fixed-size patches and linearly projects them into a sequence of embeddings. These embeddings are then processed by a stack of transformer layers, which utilize self-attention mechanisms to model the relationships between the patches. By attending to the entire sequence of patches, the ViT is able to capture long-range dependencies and global context, making it particularly effective for tasks beyond image classification, such as object detection and semantic segmentation.
D. UNETR for 3D Medical Segmentation
Building upon the success of transformers in computer vision, researchers have begun exploring their potential for medical image segmentation tasks. One notable example is the UNETR [35], [36], [37], which is specifically designed for 3D medical image segmentation. The UNETR architecture draws inspiration from the ViT and incorporates its self-attention mechanisms into the U-Net framework.
The UNETR consists of two main components: an encoder and a decoder. The encoder is responsible for extracting rich feature representations from the input 3D medical image. It achieves this by first dividing the image into uniform 3D patches and projecting them into a sequence of embeddings using a linear projection layer. These embeddings are then processed by a series of transformer layers, which utilize MHSA and multi-layer perceptron (MLP) blocks to capture the relationships between the patches. By attending to the entire sequence of patches, the encoder is able to extract global contextual information and compress the spatial dimensions of the input image.
The decoder of the UNETR is designed to reconstruct the segmented output from the compressed feature representations generated by the encoder. It consists of a series of 3D deconvolution and convolution layers that progressively upsample the feature maps to the original input dimensions. To ensure that the decoder has access to the rich feature representations learned by the encoder, skip connections are employed between corresponding levels of the encoder and decoder. These skip connections allow the decoder to leverage both the global context captured by the encoder and the local details preserved in the higher-resolution feature maps.
One of the key advantages of the UNETR architecture is its ability to capture long-range dependencies and global context, which is particularly important for medical image segmentation tasks. By incorporating self-attention mechanisms, the UNETR is able to effectively model the relationships between different regions of the input image, enabling it to generate more accurate and coherent segmentation results. Moreover, the UNETR is designed to handle 3D medical images directly, without the need for slice-by-slice processing, which is common in CNN-based approaches. This allows the UNETR to exploit the inherent 3D structure of medical images and capture valuable volumetric information.
E. Squeeze and Excitation Network
The SE network is an architectural unit that aims to improve the representational power of CNNs by explicitly modeling the interdependencies between the channels of its convolutional features. The SE block is designed to adaptively recalibrate the feature maps generated by a CNN, allowing the network to emphasize informative features and suppress less useful ones.
The SE block consists of two main operations: squeeze and excitation. The squeeze operation aims to aggregate the spatial information of each feature map into a single numeric value, effectively capturing the global context of the feature map. This is typically achieved through global average pooling, which computes the average value of each feature map across its spatial dimensions. The resulting vector, often referred to as the channel descriptor, provides a compact representation of the global distribution of the feature map.
The excitation operation, on the other hand, aims to capture the interdependencies between the channels of the feature maps. It takes the channel descriptor as input and generates a set of channel-wise weights through a small neural network. This neural network typically consists of a dimensionality reduction layer, followed by a non-linearity (e.g., ReLU) and a dimensionality increasing layer. The output of the excitation operation is a set of channel-wise weights that can be used to scale the original feature maps.
The scaled feature maps are obtained by element-wise multiplication of the original feature maps with the channel-wise weights generated by the excitation operation. This allows the SE block to adaptively recalibrate the feature maps, emphasizing the channels that are most informative for the task at hand and suppressing the less relevant ones. By doing so, the SE block enhances the representational power of the CNN and enables it to capture more discriminative features.
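A minimal sketch of this recalibration (assuming PyTorch; the reduction ratio r is illustrative) is shown below:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # dimensionality reduction
            nn.ReLU(),
            nn.Linear(channels // r, channels),  # expansion back to C channels
            nn.Sigmoid(),                        # channel-wise weights in (0, 1)
        )

    def forward(self, x):                        # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                   # squeeze: global average pooling -> (B, C)
        w = self.fc(s)                           # excitation: descriptor -> channel weights
        return x * w[:, :, None, None]           # recalibrate feature maps per channel

out = SEBlock(32)(torch.randn(2, 32, 16, 16))    # same shape as the input
```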
Materials and Methods
A. OrgUNETR: Incorporating Organ Context for Enhanced Tumor Segmentation
The proposed OrgUNETR architecture is designed to incorporate organ context information for improved tumor segmentation. Figure 1 presents an overview of the OrgUNETR model. The input to the model is a 3D CT scan, which is first processed by a patch embedding layer. This layer divides the input image into uniform 3D patches and projects them into a sequence of tokens.
The overall architecture of OrgUNETR, detailing its layers and connections. A 3D CT scan is divided into uniform 3D patches and projected into a token sequence by linear projection. The sequence is used as the input to the SE blocks. The encoded feature maps of different SE blocks are extracted and integrated by the decoder. The final output matches the spatial dimensions of the input and contains two channels, one for organ segmentation and one for tumor segmentation.
The encoder of the OrgUNETR model consists of a series of SE blocks, which are connected successively. The architecture employs 2, 4, and 6 SE blocks in the encoder, with each block downsampling the spatial dimensions by a factor of two. This allows the encoder to compress the spatial information while extracting relevant features at different scales.
At various stages of the network, the extracted features are upsampled using deconvolution layers and further enhanced by convolution layers, followed by batch normalization and ReLU activation functions. The upsampled features are then concatenated with the corresponding features from earlier blocks via skip connections. This process is repeated until the feature maps reach the same spatial dimensions as the input image.
One of the key challenges in tumor segmentation is the small size and unpredictable location of tumors within the CT scans. To address this issue, the OrgUNETR model incorporates both organ and tumor information for precise tumor localization. The output of the model consists of two channels, one for organ segmentation and the other for tumor segmentation. By sharing weights between the tumor prediction channel and the organ prediction channel, the model leverages organ information during the tumor prediction process. This dual-channel approach effectively incorporates organ context, enabling more accurate segmentation of tumor locations.
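The following sketch (assuming PyTorch; `features`, `organ_mask`, and `tumor_mask` are hypothetical placeholders) illustrates the dual-channel idea: a shared head produces organ and tumor logits, each channel receives its own loss, and both losses backpropagate through the shared weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

head = nn.Conv3d(16, 2, kernel_size=1)          # channel 0: organ, channel 1: tumor

features = torch.randn(1, 16, 32, 32, 32)       # decoder output (placeholder)
organ_mask = torch.randint(0, 2, (1, 32, 32, 32)).float()
tumor_mask = torch.randint(0, 2, (1, 32, 32, 32)).float()

logits = head(features)                         # (1, 2, 32, 32, 32)
loss_organ = F.binary_cross_entropy_with_logits(logits[:, 0], organ_mask)
loss_tumor = F.binary_cross_entropy_with_logits(logits[:, 1], tumor_mask)

# Both losses backpropagate through the shared head (and, in the full model,
# the shared decoder), so tumor prediction benefits from organ context.
# The 0.65/0.35 weighting follows Eq. (6) introduced later.
loss = 0.65 * loss_organ + 0.35 * loss_tumor
loss.backward()
```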
To achieve a balance between computational efficiency and performance, the OrgUNETR model replaces the MHSA layers, commonly used in transformer-based architectures, with SE layers. Unlike self-attention mechanisms, which require computing attention maps across all patch sequences, SE layers modulate the feature channels based on global information obtained through a global average pooling layer. By replacing self-attention layers with SE layers, the OrgUNETR model reduces computational complexity while maintaining segmentation performance.
B. Preprocessing and Patch Embedding
Preprocessing the input CT scans is a crucial step in the OrgUNETR pipeline. Directly processing every pixel in a 3D CT scan would result in high computational complexity, making it impractical for real-world applications. To alleviate this issue, the OrgUNETR model adopts a patch embedding layer, inspired by the original UNETR architecture.
The patch embedding layer divides the input image into uniform, non-overlapping 3D patches and flattens each patch into a token vector. This tokenization, however, discards information about where each patch is located within the original volume.
To address this issue, a learnable positional encoding vector $E_{pos}$ is added to the sequence of projected patch tokens:\begin{equation*}z_{tokens}=\left [{{ p_{1}E;p_{2}E;\cdots ;p_{N}E }}\right ]+E_{pos}, \tag {1}\end{equation*}
where $p_{i}$ denotes the $i$-th flattened patch, $N$ is the number of patches, $E$ is the linear projection matrix, and $E_{pos}$ is the learnable positional encoding.
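A minimal sketch of Eq. (1) (assuming PyTorch; the volume, patch, and embedding sizes are illustrative) is given below:

```python
import torch
import torch.nn as nn

D = H = W = 64                      # input volume
P = 16                              # patch size -> N = (64/16)**3 = 64 patches
N, patch_dim, embed_dim = (D // P) ** 3, P ** 3, 128

E = nn.Linear(patch_dim, embed_dim)                  # linear projection matrix E
E_pos = nn.Parameter(torch.zeros(1, N, embed_dim))   # learnable positional encoding

x = torch.randn(1, 1, D, H, W)                       # single-channel CT volume
# Rearrange into N flattened patches of size P^3.
patches = (x.reshape(1, D // P, P, H // P, P, W // P, P)
             .permute(0, 1, 3, 5, 2, 4, 6)
             .reshape(1, N, patch_dim))

z_tokens = E(patches) + E_pos                        # Eq. (1): [p_1 E; ...; p_N E] + E_pos
```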
C. Squeeze and Excitation Blocks
The SE blocks form a crucial component of the OrgUNETR architecture. These blocks are designed to adaptively recalibrate the feature maps by explicitly modeling the interdependencies between channels [17], [18], [19], [20], [21], [22]. By doing so, the SE blocks enable the model to prioritize informative features and suppress less relevant ones.
In the OrgUNETR model, each SE block consists of an SE layer followed by an MLP layer. A normalization layer is appended after each layer to stabilize the weight values during training. The SE layer assesses the significance of each feature channel by generating a channel-wise attention vector.
The operation of the SE block can be represented by the following equation:\begin{equation*} S_{l}=LN(MLP(LN(v_{l-1}\times S_{l-1}))), \tag {2}\end{equation*}
where $S_{l}$ is the output of the $l$-th SE block, $v_{l-1}$ is the channel-wise attention vector produced by the SE layer from $S_{l-1}$, and $LN$ denotes layer normalization.
The channel-wise attention vector is computed in several steps. First, a squeeze operation aggregates the spatial information of each feature channel into a single value through global average pooling, producing a compact channel descriptor.
Next, the channel descriptor undergoes a dimensionality reduction operation, typically implemented using a fully connected layer with a smaller number of neurons compared to the number of channels. This step helps to reduce the computational complexity and prevent overfitting. The reduced channel descriptor is then passed through a non-linear activation function, such as ReLU, to introduce non-linearity into the attention mechanism.
Finally, the activated channel descriptor is expanded back to the original number of channels using another fully connected layer. This expansion operation generates the channel-wise attention vector, which is applied to the original feature maps through element-wise multiplication to produce the recalibrated feature maps.
The recalibrated feature maps are further processed by the MLP layer, which learns to combine the attended features effectively. The output of the MLP layer is then normalized using layer normalization to stabilize the training process.
By employing SE blocks, the OrgUNETR model can adaptively recalibrate the feature maps, emphasizing informative channels and suppressing less relevant ones. This mechanism enhances the model’s ability to capture discriminative features and improves its segmentation performance. Moreover, by using a single attention vector for each SE block, the computational complexity is significantly reduced compared to the MHSA layers used in transformer-based architectures.
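A minimal sketch of the SE block in Eq. (2) (assuming PyTorch; the token shape, reduction ratio, and MLP width are illustrative) is shown below:

```python
import torch
import torch.nn as nn

class OrgSEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.se = nn.Sequential(                 # excitation network
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )
        self.ln1 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels), nn.ReLU(),
            nn.Linear(4 * channels, channels),
        )
        self.ln2 = nn.LayerNorm(channels)

    def forward(self, s):                        # s: (B, N, C) token sequence
        v = self.se(s.mean(dim=1))               # squeeze over tokens -> (B, C)
        s = v.unsqueeze(1) * s                   # v_{l-1} x S_{l-1}
        return self.ln2(self.mlp(self.ln1(s)))   # Eq. (2)

out = OrgSEBlock(128)(torch.randn(2, 64, 128))   # same shape as the input
```

Because a single attention vector per block replaces the full token-by-token attention map of MHSA, the cost of this layer grows linearly rather than quadratically with the number of tokens.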
D. Metrics
To train and evaluate the performance of the OrgUNETR model, we employ a combination of the Dice coefficient and Cross-Entropy Loss, which are widely used metrics in segmentation tasks [40], [41], [42].
The Dice coefficient measures the overlap between the predicted segmentation and the ground truth. It is calculated using the following equation:\begin{equation*} S_{dice}=\frac {2\times \left ({{ P_{true}\times P_{pred} }}\right)}{P_{true}+P_{pred}}, \tag {3}\end{equation*}
where $P_{true}$ denotes the ground-truth segmentation and $P_{pred}$ the predicted segmentation.
In addition to the Dice coefficient, we also employ the Cross-Entropy Loss, which quantifies the dissimilarity between the predicted probabilities and the ground truth labels. The Cross-Entropy Loss is calculated as follows:\begin{equation*}CE=-\sum \sum {T\times \log (p_{pred})}, \tag {4}\end{equation*}
where $T$ is the ground-truth label and $p_{pred}$ is the predicted probability for each pixel.
To combine the Dice coefficient and Cross-Entropy Loss, we introduce the DiceCELoss, a weighted sum of the two metrics:\begin{equation*} DiceCELoss=\alpha \times S_{dice}+\beta \times CE, \tag {5}\end{equation*}
where $\alpha$ and $\beta$ are weighting coefficients that balance the two terms (the Dice term contributes as a loss, i.e., via $1-S_{dice}$, since higher overlap should lower the loss).
Given the critical importance of accurately segmenting both the organ and the tumor, we further extend the DiceCELoss by introducing a weighted variant:\begin{equation*} { DL}_{total}=0.65\times {DL}_{organ}+0.35\times {DL}_{tumor}, \tag {6}\end{equation*}
where ${DL}_{organ}$ and ${DL}_{tumor}$ denote the DiceCELoss computed on the organ and tumor output channels, respectively.
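A minimal sketch of Eqs. (3)-(6) (assuming PyTorch; the tensors are placeholders, and the Dice term enters the loss as $1-S_{dice}$ as noted above):

```python
import torch

def dice_score(pred, target, eps=1e-6):          # Eq. (3), soft overlap
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def dice_ce_loss(pred, target, alpha=1.0, beta=1.0, eps=1e-6):
    ce = -(target * torch.log(pred + eps)
           + (1 - target) * torch.log(1 - pred + eps)).mean()   # Eq. (4), binary form
    return alpha * (1 - dice_score(pred, target)) + beta * ce   # Eq. (5)

organ_pred = torch.rand(1, 32, 32, 32)           # per-voxel probabilities (placeholders)
tumor_pred = torch.rand(1, 32, 32, 32)
organ_gt = torch.randint(0, 2, (1, 32, 32, 32)).float()
tumor_gt = torch.randint(0, 2, (1, 32, 32, 32)).float()

# Eq. (6): weighted sum over the organ and tumor channels.
total = (0.65 * dice_ce_loss(organ_pred, organ_gt)
         + 0.35 * dice_ce_loss(tumor_pred, tumor_gt))
```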
E. Kidney Tumor Segmentation Dataset
To evaluate the performance of the OrgUNETR model on kidney tumor segmentation, we utilize the KiTS19 dataset. This dataset serves as a cornerstone for our study, providing CT scans accompanied by annotations for both the right and left kidneys, as well as kidney tumors.
The KiTS19 dataset comprises 544 CT scans, which were annotated by medical students under the supervision of expert radiologists. Each CT scan is resized to a consistent resolution before being fed to the model.
It is important to note that the resizing process can introduce a challenge: small tumor pixels may merge with adjacent pixels, potentially leading to the elimination of tumor pixels in some cases. To mitigate this issue, we carefully examine the resized CT scans and exclude 54 scans that lack tumor pixels after resizing. This step ensures that the training dataset contains sufficient tumor information for the model to learn from. The KiTS19 dataset is publicly accessible and can be downloaded from the official repository (https://github.com/neheller/kits19) with the consent of the organizers.
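A minimal sketch of this filtering step (assuming NumPy; the label convention of 0 for background, 1 for kidney, and 2 for tumor follows KiTS19, and `resized_cases` is a placeholder):

```python
import numpy as np

TUMOR_LABEL = 2  # assumed KiTS19 convention: 0 background, 1 kidney, 2 tumor

# Placeholder for (scan, label) pairs after resizing.
resized_cases = [(np.zeros((64, 64, 64)), np.random.randint(0, 3, (64, 64, 64)))
                 for _ in range(3)]

# Keep only cases whose resized label volume still contains tumor voxels.
kept = [(scan, lab) for scan, lab in resized_cases if (lab == TUMOR_LABEL).any()]
```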
F. Prostate Tumor Segmentation Dataset
In addition to the KiTS19 dataset, we also utilize the Prostate158 dataset to evaluate the performance of the OrgUNETR model on prostate tumor segmentation. The Prostate158 dataset is a comprehensive collection of high-quality 3 Tesla MRI scans specifically designed for prostate segmentation tasks.
The dataset includes scans of both anatomical zones and cancerous lesions within the prostate, making it a valuable resource for prostate MRI image analysis. The inclusion of both healthy and cancerous tissue annotations enables the development of models that can accurately segment the prostate gland and identify tumors simultaneously [44].
The Prostate158 dataset consists of 139 training samples and 19 validation samples, each acquired at a consistent native resolution.
To normalize the intensity values of the MRI scans, we apply min-max scaling to each scan, bringing the pixel values into a consistent range. This normalization step helps to mitigate the influence of variations in scanner settings and acquisition protocols, making the dataset more suitable for training deep learning models.
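A minimal sketch of this per-scan normalization (assuming NumPy), mapping each scan's intensities into [0, 1]:

```python
import numpy as np

def min_max_scale(scan: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Map the scan's intensity range [min, max] onto [0, 1].
    lo, hi = scan.min(), scan.max()
    return (scan - lo) / (hi - lo + eps)

normalized = min_max_scale(np.random.randn(32, 256, 256))  # values now in [0, 1]
```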
The Prostate158 dataset is accompanied by expert annotations, which serve as ground truth labels for training and validating the segmentation models. These annotations were carefully curated by experienced radiologists, ensuring their reliability and accuracy.
The Prostate158 dataset is publicly available and can be accessed from the official repository (https://zenodo.org/record/6481141) with the consent of the organizers. This dataset has been widely used in the research community for developing and evaluating prostate segmentation algorithms, contributing to the advancement of prostate cancer diagnosis and treatment planning.
Results
A. Segmentation Results With KiTS19 Dataset
To evaluate the performance of our proposed OrgUNETR model on kidney tumor segmentation, we conduct experiments using the KiTS19 dataset. The dataset consists of 490 CT volumes, each annotated with both kidney and kidney tumor labels. We partition the dataset into training and validation sets using a 70:30 ratio, ensuring a fair evaluation of the model’s generalization ability.
During training, we employ the AdamW optimizer [45] with a learning rate of 0.0001. To enhance the model’s robustness and prevent overfitting, we apply data augmentation techniques, specifically random rotation of the input images within a range of 0 to 10 degrees [46]. The loss function used for training is a combination of Dice Loss and Cross-Entropy Loss, referred to as DiceCELoss.
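A minimal sketch of this training configuration (assuming PyTorch for the optimizer and SciPy for the rotation; `model` and the tensors are placeholders, and plain cross-entropy stands in for the full DiceCELoss):

```python
import random
import numpy as np
import torch
from scipy.ndimage import rotate

model = torch.nn.Conv3d(1, 2, kernel_size=1)                # stand-in for OrgUNETR
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW, lr = 0.0001

def augment(volume, label):
    # Random in-plane rotation between 0 and 10 degrees, as described above;
    # the same angle is applied to the scan (linear interp.) and its label (nearest).
    angle = random.uniform(0.0, 10.0)
    vol = rotate(volume, angle, axes=(-2, -1), reshape=False, order=1)
    lab = rotate(label, angle, axes=(-2, -1), reshape=False, order=0)
    return vol, lab

volume = np.random.randn(1, 1, 32, 32, 32).astype(np.float32)  # CT patch (placeholder)
target = np.random.randint(0, 2, (1, 32, 32, 32))              # voxel labels (placeholder)

vol_np, lab_np = augment(volume, target)
vol = torch.as_tensor(vol_np, dtype=torch.float32)
lab = torch.as_tensor(lab_np, dtype=torch.long)

loss = torch.nn.functional.cross_entropy(model(vol), lab)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```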
To assess whether our proposal, training models with organ information to enhance accuracy, generalizes across state-of-the-art models, we evaluate it on KiTS19 using UNETR, SwinUNETR, nnFormer, and U-Net [5], [28], [29], [36], each in its original form and with our proposed modification. For each architecture, we compare the conventional model trained solely with tumor information against the proposed variant trained with both tumor and organ information, using the Dice score metric, which measures the overlap between the predicted segmentation and the ground truth.
The results presented in Table 1 indicate that models trained with both organ and tumor information yield improved Dice scores compared to the conventional models trained only with tumor information. Overall, the proposed models outperform their conventional counterparts. The proposed UNETR outperforms the conventional UNETR by 34.9%, and the proposed nnFormer surpasses the conventional nnFormer by 14.9%. Particularly noteworthy is SwinUNETR, where our proposed modification achieves a Dice score of 0.4786, an increase of 103%. The proposed U-Net demonstrates a 47.0% increase in accuracy compared to the conventional U-Net. These results clearly show that training models with organ information, which is explicitly related to the tumor, enhances tumor segmentation ability.
The results across the various models indicate that simultaneously training on organ and tumor information is applicable beyond the models discussed in this paper and can be extended to other deep learning models. Furthermore, this strategy can be generalized into a broader methodology: to detect a target precisely, information related to that target should be provided during training.
The results of our OrgUNETR experiments on the KiTS19 dataset are presented in Figure 2. Our OrgUNETR model achieves a Dice score that is 49.04% higher than the baseline model, demonstrating the significant impact of incorporating organ information in tumor localization. The dual-channel approach enables the model to leverage the contextual information provided by the organ labels, leading to more accurate tumor segmentation.
Dice score comparison for tumor segmentation on the KiTS19 dataset. (a) depicts the comparison of the OrgUNETR versus UNETR model using the validation dataset, and (b) presents the Dice scores of OrgUNETR versus UNETR on the training dataset. In both (a) and (b), the orange line shows the Dice score from OrgUNETR, while the blue lines show the Dice score from UNETR. The bold lines represent the application of a moving average to enhance clarity.
Figure 3 illustrates the training and validation loss curves for both OrgUNETR and the baseline model. Our model demonstrates a substantial reduction in DiceCELoss compared to the baseline model, with a decrease of 37.85%. This observation suggests that training the model with organ information enhances its learning capacity by providing additional supervision regarding organ location. The lower validation loss achieved by OrgUNETR indicates its superior performance and generalization ability compared to the baseline model.
Comparison of DiceCELoss for tumor segmentation on the KiTS19 dataset. (a) illustrates the comparison between the OrgUNETR and UNETR models using the validation dataset, while (b) displays the loss for OrgUNETR compared to UNETR on the training dataset. In both (a) and (b), the orange line represents the loss for OrgUNETR, whereas the blue lines indicate the loss for UNETR. Bold lines signify the use of a moving average for clarity.
One of the key contributions of our work is the replacement of MHSA layers with SE layers in the OrgUNETR architecture. By adopting SE layers that perform channel-wise attention, we achieve a notable reduction in computational complexity while maintaining segmentation accuracy. Specifically, our model exhibits a 13.9% reduction in computational cost compared to the original UNETR architecture, making it more efficient and suitable for practical applications.
To further illustrate the impact of incorporating organ information on tumor segmentation, we present a visual comparison of the segmentation results obtained by OrgUNETR and the baseline model in Figure 4. The first row shows the ground truth segmentation, while the second and third rows display the tumor predictions of OrgUNETR and the baseline model, respectively. The pink pixels represent the tumor regions, while the grayscale pixels correspond to the background.
Tumor prediction from OrgUNETR and the baseline model on the KiTS19 dataset. The first row indicates the ground truth. The second row illustrates the tumor prediction of OrgUNETR. The third row shows the tumor prediction performed by the baseline model. The pink pixels represent tumor pixels, whereas the grayscale pixels indicate the background.
In the third column of Figure 4, we observe a significant difference between the predictions of OrgUNETR and the baseline model. For instance, in the second row, OrgUNETR accurately predicts the tumor in the right kidney, whereas the baseline model incorrectly predicts a non-existent tumor in the left kidney. This observation highlights the inferior performance of the baseline model in detecting tumors accurately from CT scans.
Similarly, in the fifth column, OrgUNETR correctly shows the absence of a tumor in the left kidney, while the baseline model incorrectly predicts the presence of tumors in the left kidney. These examples demonstrate the effectiveness of incorporating organ information in improving tumor segmentation accuracy.
Overall, the experimental results on the KiTS19 dataset strongly support the efficacy of our proposed OrgUNETR model in kidney tumor segmentation. By leveraging organ information through a dual-channel approach and employing SE layers for efficient attention mechanisms, OrgUNETR achieves superior performance compared to the baseline model, both in terms of Dice score and visual quality of the segmentation results.
B. Segmentation Results With Prostate158 Dataset
To further validate the effectiveness of our proposed OrgUNETR model, we conduct experiments on the Prostate158 dataset, which consists of high-quality 3 Tesla MRI scans specifically designed for prostate segmentation tasks. The dataset includes annotations for both anatomical zones and cancerous lesions within the prostate, making it a comprehensive resource for evaluating prostate tumor segmentation models.
The Prostate158 dataset comprises 139 training samples and 19 validation samples of high-resolution MRI scans.
Table 2 presents a comparative analysis of the Dice scores achieved by various models on the Prostate158 dataset. For U-Net, the proposed model surpasses the conventional model by 11.4%. For UNETR and SwinUNETR, the proposed models show superior performance by 22.8% and 22.1%, respectively. In contrast to the other models, nnFormer performs better in its conventional form. Nevertheless, the proposed models overall outperform their conventional counterparts. The experiments on the Prostate158 dataset confirm that our approach is applicable across various architectures.
For OrgUNETR specifically, the training and validation sets are split in a ratio of 70:30. We train the OrgUNETR model using the AdamW optimizer with a learning rate of 0.0001. To enhance the model's robustness, we employ data augmentation techniques, specifically random rotation of the input images within a range of 0 to 30 degrees.
The performance of OrgUNETR is evaluated using the Dice score metric, and we compare it against a baseline model that focuses solely on tumor localization using a single channel. Figure 5 presents the comparison of Dice scores between OrgUNETR and the baseline model on the Prostate158 dataset. The left plot shows the Dice scores on the validation set, while the right plot displays the Dice scores on the training set.
Dice score comparison for tumor segmentation on the Prostate 158 dataset. (a) depicts the comparison of the OrgUNETR versus UNETR model using the validation dataset, and (b) presents the Dice scores of OrgUNETR versus UNETR on the training dataset. In both (a) and (b), the orange line shows the Dice score from OrgUNETR, while the blue lines show the Dice score from UNETR. The bold lines represent the application of a moving average to enhance clarity.
Our OrgUNETR model achieves a Dice score that is 32.69% higher than the baseline model, demonstrating the significant impact of incorporating organ information in prostate tumor segmentation. Interestingly, we observe that the training Dice score of the baseline model surpasses that of OrgUNETR. However, when evaluated on the validation set, OrgUNETR consistently outperforms the baseline model. This observation suggests that OrgUNETR is more effective in generalizing to unseen data and is less prone to overfitting compared to the baseline model.
Figure 6 illustrates the training and validation loss curves for both OrgUNETR and the baseline model on the Prostate158 dataset. Our model demonstrates a substantial reduction in DiceCELoss compared to the baseline model, with a decrease of 43.39%. This observation underscores the benefit of incorporating organ information into the segmentation process, leading to more accurate tumor predictions. It is worth noting that the training DiceCELoss of the two models differs by only 3.44%, indicating that both models train similarly. However, the superior performance of OrgUNETR on the validation set highlights its ability to generalize well and make accurate predictions on unseen data.
Comparison of DiceCELoss for tumor segmentation on the Prostate 158 dataset. (a) illustrates the comparison between the OrgUNETR and UNETR models using the validation dataset, while (b) displays the loss for OrgUNETR compared to UNETR on the training dataset. In both (a) and (b), the orange line represents the loss for OrgUNETR, whereas the blue lines indicate the loss for UNETR. Bold lines signify the use of a moving average for clarity.
To provide a qualitative assessment of the segmentation results, we present representative examples in Figure 7. The first row shows the ground truth segmentation, while the second and third rows display the tumor predictions of OrgUNETR and the baseline model, respectively. The pink pixels represent the tumor regions, while the grayscale pixels correspond to the background.
Tumor prediction from OrgUNETR and the baseline model on the Prostate158 dataset. The first row indicates the ground truth. The second row illustrates the tumor prediction of OrgUNETR. The third row shows the tumor prediction performed by the baseline model. The pink pixels represent tumor pixels, whereas the grayscale pixels indicate the background.
In the third column of Figure 7, we observe significant differences between the predictions of OrgUNETR and the baseline model. OrgUNETR accurately predicts a large tumor in the middle of the prostate, closely resembling the ground truth. In contrast, the baseline model not only predicts the tumor in the middle of the prostate but also incorrectly identifies additional tumor regions. These examples demonstrate the superior performance of OrgUNETR in accurately segmenting prostate tumors by leveraging organ information.
The experimental results on the Prostate158 dataset further validate the effectiveness of our proposed OrgUNETR model in tumor segmentation tasks. By incorporating organ information through a dual-channel approach, OrgUNETR achieves significant improvements in Dice score and visual quality of the segmentation results compared to the baseline model. The model’s ability to generalize well to unseen data and its robustness to overfitting make it a promising tool for prostate tumor segmentation in clinical practice.
C. Additional Experiments
To thoroughly evaluate the robustness and performance of our OrgUNETR model, we conducted an extensive series of experiments across various hyperparameter configurations. Initially, the number of channels in the decoder layer was set to 16, and the learning rate was fixed at 0.0001. We then explored different embedding dimensions for the input patches, specifically 8, 16, and 32, to analyze their impact on model performance.
Our experimental setup employed the KiTS19 and Prostate158 datasets, which are well-regarded benchmarks for assessing the efficacy of medical image segmentation models. The performance of our model was quantified using the Dice score, a standard metric for evaluating the accuracy of segmentation models.
Table 3 presents the Dice scores achieved by OrgUNETR across the different embedding dimensions. Notably, even with varying hyperparameters, our model consistently demonstrated robust performance. For instance, with an embedding dimension of 16, OrgUNETR attained a Dice score of 0.2137 on the KiTS19 dataset, which is the highest performance observed for this dataset. On the Prostate158 dataset, the model achieved its peak performance with an embedding dimension of 32, recording a Dice score of 0.2195. It is important to highlight that while the highest Dice scores were observed at embedding dimensions of 16 and 32, the scores at dimension 8 also showed commendable performance, indicating the model’s stability and efficiency across different configurations.
The small gap in Dice scores between embedding dimensions 16 and 32 further underscores the model's robustness. Despite the variation in embedding dimensions, performance remains consistently high, demonstrating the effectiveness and reliability of OrgUNETR in handling complex medical image segmentation tasks. This consistent performance, irrespective of the embedding dimension, attests to the sound design of our model.
In conclusion, the experimental results confirm that OrgUNETR performs exceptionally well across different hyperparameter settings. The consistent Dice scores across varying embedding dimensions indicate that our model maintains high performance regardless of specific parameter adjustments. This robustness highlights the potential of OrgUNETR as a reliable tool for medical image segmentation, capable of delivering accurate and consistent results.
Conclusion
In this study, we introduced OrgUNETR, an enhanced version of the UNETR architecture specifically designed for tumor segmentation in medical images. The proposed model incorporates organ context information to improve the accuracy and robustness of tumor localization. By leveraging the fact that tumors typically exist within specific organs, OrgUNETR effectively addresses the challenges posed by the small size and unpredictable locations of tumors in CT and MRI scans.
One of the key contributions of OrgUNETR is its ability to simultaneously segment both the organ and the tumor using a dual-channel approach. This approach significantly improves tumor segmentation performance by allowing the model to learn the inherent relationships between organs and tumors. The experimental results on the KiTS19 and Prostate158 datasets demonstrate the effectiveness of incorporating organ information, with OrgUNETR achieving substantial improvements in Dice score compared to a baseline model that focuses solely on tumor segmentation.
On the KiTS19 dataset, which consists of CT scans of the kidney and kidney tumors, OrgUNETR achieved a remarkable 40.54% increase in Dice score for tumor segmentation when organ information was included. Similarly, on the Prostate158 dataset, which contains MRI scans of the prostate gland and prostate tumors, OrgUNETR outperformed the baseline model by 32.69% in terms of Dice score. These results provide strong evidence for the benefits of leveraging organ context in tumor segmentation tasks.
In addition to the performance gains, we also optimized the computational efficiency of OrgUNETR by replacing the MHSA layers with SE layers. The SE layers efficiently compute channel-wise attention, reducing the computational complexity of the model while maintaining its segmentation accuracy. By substituting MHSA layers with SE layers, we achieved a 13.9% reduction in computational cost, making OrgUNETR more practical for real-world applications with limited computational resources.
The superior performance of OrgUNETR can be attributed to its ability to capture both local and global contextual information. The encoder of OrgUNETR, which consists of a series of SE blocks, effectively compresses spatial information while extracting relevant features at different scales. The decoder, on the other hand, reconstructs the segmented output by integrating the extracted features through skip connections and upsampling operations. This architecture enables OrgUNETR to generate precise and coherent segmentation results, even for challenging cases with small and irregularly shaped tumors.
Furthermore, the inclusion of organ information in the segmentation process helps to mitigate the issue of false positives, where the model incorrectly identifies non-tumor regions as tumors. By learning the relationships between organs and tumors, OrgUNETR is able to distinguish between normal anatomical structures and abnormal growths more effectively. This is particularly important in clinical settings, where accurate tumor detection and delineation are crucial for treatment planning and patient management.
The experimental results also highlight the generalization ability of OrgUNETR. Despite the variations in tumor size, shape, and location across different patients and imaging modalities, OrgUNETR consistently outperforms the baseline model. This robustness is essential for the practical deployment of the model in real-world scenarios, where it may encounter a wide range of tumor characteristics.
ACKNOWLEDGMENT
(Sanghyuk Roy Choi and Jungro Lee contributed equally to this work.)