Introduction
Organ localization serves as an essential prerequisite for many computed tomography (CT) image analysis tasks such as organ segmentation [1]–[13], lesion detection [5], and image registration [1], [2], [4]. Accurate estimation of the target organs' position and extent helps subsequent algorithms focus on regions of interest (RoIs) and contributes to better performance. Taking the widely studied topic of organ segmentation as an example, organ localization is commonly used as the first step to extract RoIs from the original CT images [9]–[13]. It not only improves the accuracy of the subsequent segmentation but also reduces the computational cost in both time and memory. Furthermore, organ localization is widely used in picture archiving and communication systems (PACS) for efficient image retrieval and visualization navigation [2], [4], [8], [14]. With the aid of organ localization, partial data in the organ RoIs can be preferentially retrieved and transmitted, largely reducing the bandwidth usage and server load of hospital networks.
Accurate automatic localization of multiple organs in CT images, however, is a challenging problem for the following reasons. Firstly, organ boundaries in CT images can be blurred by neighboring low-contrast soft tissues and imaging noise. Secondly, the appearance (e.g., location, size, and shape) of the target organs can vary greatly across patients. Thirdly, incomplete organ structures truncated by the scan range, as well as abnormalities such as lesions, cysts, and artificial structures, also hinder automatic methods from achieving high accuracy.
To address this challenging problem, many automatic methods have been proposed in the past decade. These methods can be broadly divided into two categories: classical machine learning methods and convolutional neural network (ConvNet) based methods (i.e., deep learning methods). In the early stage, many classical machine learning methods were proposed to solve the problem of organ localization. They design various hand-crafted features such as intensity, gradient [2], [4], and Haar-like features [3], [15] to estimate the centroid and extent of the target organs through different classifiers such as AdaBoost [3], random forests [2], [4], and probabilistic boosting-trees [15]. In recent years, owing to the great success of deep learning in computer vision, many ConvNet-based methods for organ localization have been proposed [5]–[8], [16], [17]. The powerful representation ability of the hierarchical features, which are learned automatically through end-to-end training, makes this methodology popular and yields state-of-the-art performance compared with the earlier classical machine learning methods.
According to the dimensionality of the convolution kernel, the ConvNet-based methods can be further divided into 2D ConvNet-based methods [5]–[8], [16] and 3D ConvNet-based methods [17], [18]. 2D ConvNet-based methods take CT slices extracted along orthogonal directions as input and predict the presence/absence status of the organs in each slice. The final bounding boxes of the target organs are then assembled from these binary statuses. This strategy is straightforward but less efficient, because the model must be run for every CT slice while most adjacent slices present almost identical content. Limited use of inter-slice context also hinders 2D ConvNet-based methods from achieving compelling performance. To address these limitations, 3D ConvNets (such as 3D Faster R-CNN [18] and [17]) have been exploited for organ localization. These methods directly predict the presence probability and geometric parameters of the organ bounding boxes using a 3D region proposal network (RPN). Compared with 2D ConvNets, 3D ConvNets can take full advantage of the spatial context information in the volumetric CT image within one forward pass, thus achieving higher localization accuracy and faster processing speed. However, due to the customized network components and label assignment strategy in the RPN, these 3D ConvNet-based methods (or 3D RPN-based methods) are more difficult to implement than the earlier 2D ConvNet-based methods.
In this work, we propose a novel 3D fully convolutional network (FCN) [19], called the triple-branch FCN, to handle the problem of 3D organ localization in CT images. This triple-branch FCN is fully implemented in a 3D manner, so it can take full advantage of the spatial context information in the CT image and achieve higher organ localization accuracy than 2D ConvNet-based methods. Compared with other 3D ConvNet-based methods (i.e., 3D RPN-based methods), the proposed method has a simpler network structure; only a few basic ConvNet operations (such as convolution, pooling, and softmax regression) are involved in the triple-branch FCN. It is therefore easier to implement and more flexible to deploy in other applications than the previous 3D RPN-based methods. To further improve the localization accuracy of the proposed network, we also propose to use a density enhancement filter to enhance structures with specific densities in the CT image, and to concatenate these enhanced images with the original CT image into a three-channel image that is fed into the 3D triple-branch FCN for training and testing.
In summary, the major contributions of this work are threefold:
We design a novel 3D FCN architecture, called the triple-branch FCN, to perform accurate multi-organ localization in CT images. Benefiting from its 3D implementation, the proposed network can take full advantage of the spatial context in CT images to perform accurate organ localization. Only a few basic ConvNet components, such as convolution, pooling, and softmax regression, are involved in this network, so it is easy to implement and flexible enough to serve as a preprocessing module in other applications.
We propose to use a density enhancement filter to emphasize structural information in the CT image and to concatenate the enhanced images with the original CT image into a three-channel image, which is taken as input by the proposed triple-branch FCN to further improve its performance. As our experimental results show, taking this enhanced three-channel image as input largely improves the final localization accuracy of the proposed network.
We conduct extensive experiments on a challenging public dataset with 201 clinical abdominal/torso CT images to evaluate the performance of the proposed method for the localization of 11 body organs (or anatomical structures). The experimental results show that the proposed method achieves higher localization accuracy in comparison to the current state-of-the-art methods.
The rest of this paper is structured as follows. Section II presents the dataset, preprocessing, and augmentation strategies. Section III gives a detailed description of the proposed method. In Section IV, we conduct extensive experiments to evaluate the performance of the proposed method. Some special issues are discussed in Section V. Finally, we conclude this work in Section VI.
Materials
In this work, we conduct experiments using a public CT image dataset and a public annotation set based on it.
A. Image Dataset
The image data used in this work comes from the public Liver Tumor Segmentation (LiTS) Challenge dataset.1 It consists of 201 clinical contrast-enhanced abdominal/torso CT images collected from several clinical sites around the world. 131 images are used for training and the remaining 70 are used for testing. From the training set, we randomly select 13 images (approximately 10%) for validation. The in-plane spatial resolution of these CT images ranges from
B. Annotations
Based on the CT images from the LiTS dataset, a public annotation set for body organ localization [17] is established. Localization information of 11 organs (or anatomical structures) is included in this annotation set: left/right lungs, heart, liver, pancreas, spleen, left/right kidneys, left/right femoral heads, and bladder. Each instance of the target organs is annotated by six coordinates
C. Data Preprocessing and Augmentation
In the data preprocessing stage, we resample the CT images to a uniform size of
Before feeding the preprocessed CT images to the deep ConvNets, we conduct data augmentation to mitigate over-fitting in the training stage. Firstly, we shift the CT image along
Method
Figure 1 gives a schematic representation of the proposed method. Firstly, the input CT image is extended to a three-channel image that contains both the original image information and the density-enhanced organ structures. Secondly, this three-channel image is fed to a ConvNet (called the backbone network) to extract a 3D feature map. Thirdly, the extracted feature map volume is synchronously fed to three subsequent sibling ConvNets (called branch networks) to predict the presence probability curves of the target organs along the axial, coronal, and sagittal directions, respectively. Finally, these presence probability curves are binarized with a fixed threshold of 0.5, and the final organ bounding boxes are composed from the largest 1D non-zero component of each of these three binary curves. Together, the backbone network and the branch networks constitute the proposed triple-branch FCN.
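To make this final decoding step concrete, the following minimal sketch (illustrative code of our own, not the original implementation; the function name curves_to_box is hypothetical) binarizes the three predicted curves at 0.5 and keeps the largest non-zero run along each axis as the box extent:

```python
import numpy as np

def curves_to_box(prob_axial, prob_coronal, prob_sagittal, threshold=0.5):
    """Decode three 1D presence probability curves into one 3D bounding box.

    Each curve is binarized at `threshold`; the largest connected run of ones
    along each axis gives the box extent in that direction. Returns a tuple of
    six coordinates, or None if the organ is absent along any direction.
    """
    def largest_run(prob):
        binary = np.asarray(prob) >= threshold
        best = None
        i, n = 0, len(binary)
        while i < n:
            if binary[i]:
                j = i
                while j < n and binary[j]:
                    j += 1
                if best is None or (j - i) > (best[1] - best[0] + 1):
                    best = (i, j - 1)
                i = j
            else:
                i += 1
        return best

    runs = [largest_run(p) for p in (prob_axial, prob_coronal, prob_sagittal)]
    if any(r is None for r in runs):
        return None  # organ not detected along at least one direction
    (a0, a1), (c0, c1), (s0, s1) = runs
    return a0, a1, c0, c1, s0, s1
```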
A. Three-Channel Image Generation Using Density Enhancement Filter
The original CT image has a large dynamic range of intensity and a wide scanning extent, resulting in a complex background and a number of irrelevant anatomical structures, both of which make it difficult for deep ConvNets to learn distinctive features for accurate organ localization. To make the target organs more distinct from the background structures and to decrease the optimization complexity in feature space, we additionally compute two hand-crafted features, i.e., the enhanced density map and its gradient map, and concatenate these two feature maps with the original CT image to compose a three-channel image, which is taken as input by the subsequent triple-branch FCN. This procedure is illustrated in Figure 2 and the generated three-channel image is visualized in Figure 3.
Generation of the three-channel image using density enhancement filter and gradient operator.
Visualization of the three-channel image. (a): Original CT image, (b): Enhanced density map, (c): Gradient map.
According to the CT imaging principle, the Hounsfield unit (HU) directly reflects the density of the materials in the CT image. Most target organs present uniform HU values concentrated in a narrow interval (approximately −50 to 250 HU, see Figure 4). Therefore, we propose the following density filter to enhance organs whose density lies in this interval and to suppress non-relevant structures such as bone, muscle, and fat:\begin{equation*} I_{d}=\exp\left[-\left(\frac{I_{0}-w_{l}}{w_{w}}\right)^{2}\right]\tag{1}\end{equation*} where $I_{0}$ is the original HU value, and $w_{l}$ and $w_{w}$ denote the window level and window width of the interval to be enhanced.
Intensity distribution in the region of the organ bounding boxes in our dataset. The apex on the left (P1) corresponds to the regions of left/right lungs that are filled with air (HU ranges from −1000 to −700). The two close apexes in the middle (P2 and P3) correspond to the regions of the other 9 organs. Specifically, P3 is the apex corresponding to the target organs, while P2 is mainly caused by the surrounding soft-tissues and fats that usually have lower HU than the central organs. The band filled with orange color represents the window we choose to enhance (with the window level of 100 and window width of 150).
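As a concrete illustration, Eq. (1) with the window level and width from Figure 4 can be implemented as a simple voxel-wise operation (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def density_enhance(hu_image, window_level=100.0, window_width=150.0):
    """Density enhancement filter of Eq. (1): voxels whose HU value is close
    to the window level w_l are mapped towards 1, while bone, muscle, fat, and
    other non-relevant structures are suppressed towards 0.
    The defaults (100/150 HU) follow the window shown in Figure 4."""
    hu = hu_image.astype(np.float32)
    return np.exp(-((hu - window_level) / window_width) ** 2)
```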
Organ boundaries are the main basis for localization. According to our previous experimental results [17], organs with sharper boundaries (e.g., the lungs and heart) generally achieve higher localization accuracy than organs with low-contrast boundaries and variable shapes (e.g., the liver, spleen, and bladder). Therefore, we hypothesize that emphasizing the edge information of the target organs could further improve the localization accuracy. To this end, we additionally calculate the gradient map of the aforementioned enhanced density map using a Sobel operator and combine it with the original CT image.
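A possible composition of the three-channel input is sketched below (illustrative only; in particular, combining the per-axis Sobel responses into a gradient magnitude is our assumption, and density_map is taken to be the output of Eq. (1)):

```python
import numpy as np
from scipy import ndimage

def three_channel_image(hu_image, density_map):
    """Stack the original CT volume, the enhanced density map (Eq. 1), and the
    Sobel gradient of the density map into a three-channel image (3, D, H, W)."""
    grads = [ndimage.sobel(density_map, axis=ax) for ax in range(density_map.ndim)]
    gradient = np.sqrt(sum(g ** 2 for g in grads))  # gradient magnitude
    return np.stack([hu_image.astype(np.float32), density_map, gradient], axis=0)
```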
B. Feature Extraction Using 3D Backbone Network
The proposed triple-branch FCN is mainly composed of two parts: the backbone network for feature extraction and the branch networks for organ presence probability curve prediction.
Feature extraction is essential to ConvNet-based methods for organ localization. For volumetric CT images, there are two common ways to extract features: using 2D ConvNets to process each CT slice, or using 3D ConvNets to process the CT volume directly. Compared with the 2D manner, using 3D ConvNets not only takes full advantage of the spatial context information but also decreases the processing time, resulting in more distinctive features for accurate organ localization and faster processing. Therefore, in this work, we use a 3D version of AlexNet [20] (truncated before all fully connected layers) as the backbone network for feature extraction. In principle, other standard networks such as VGG-16 [21] and ResNet-34 [22] can also serve as the backbone. Deeper and more complex backbone networks can bring higher localization accuracy but also larger memory footprints. We conduct an ablation experiment in Section IV-F to evaluate the impact of different backbone networks in the proposed method. In practice, one can also design customized backbone networks for specific applications.
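For illustration only, a 3D backbone of this kind can be written as an ordinary stack of 3D convolution and pooling layers; the PyTorch-style sketch below uses assumed channel counts and kernel sizes and is not the exact 3D AlexNet configuration of our Caffe implementation:

```python
import torch.nn as nn

class Backbone3D(nn.Module):
    """Illustrative 3D AlexNet-style backbone truncated before the fully
    connected layers; the layer hyper-parameters here are assumptions."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=3, stride=2, padding=1),
            nn.Conv3d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=3, stride=2, padding=1),
            nn.Conv3d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):        # x: (N, 3, D, H, W) three-channel CT volume
        return self.features(x)  # 3D feature map volume for the branch networks
```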
C. Organ Localization Using Triple-Branch Networks
Since the 3D bounding boxes of the target organs can be inferred from their presence status in all CT slices extracted along the axial, coronal, and sagittal directions, the problem of organ localization can be interpreted as estimating the presence probability curves of the target organs along these three directions separately. To achieve this, we design three sibling ConvNets (called branch networks) with asymmetric convolution kernels that mold the feature map volume output by the backbone network into three 1D feature vectors, which indicate the presence probability of the organs along the axial, coronal, and sagittal directions, respectively. As shown in Table 1, the structural parameters of these branch networks are specially designed so that the lengths of their output presence probability curves exactly equal the width, height, and length of the input CT image, respectively. Specifically, given an input CT image in size of
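The sketch below conveys the idea of one branch in simplified form: the two non-target axes are collapsed by average pooling and the resulting 1D signal is resized to the corresponding input dimension. The real branch networks use the specially designed asymmetric kernels of Table 1 and softmax regression rather than the pooling, interpolation, and sigmoid shown here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchNet(nn.Module):
    """Illustrative branch network predicting a per-position organ presence
    probability curve along one chosen axis of the backbone feature volume."""
    def __init__(self, in_channels=256, num_organs=11, axis=2):
        super().__init__()
        self.axis = axis  # 2: axial, 3: coronal, 4: sagittal (of an N,C,D,H,W tensor)
        self.conv = nn.Conv1d(in_channels, num_organs, kernel_size=3, padding=1)

    def forward(self, feat, out_length):
        # Collapse the two non-target spatial axes (higher axis first so
        # dimension indices stay valid after each reduction).
        reduce_dims = sorted(d for d in (2, 3, 4) if d != self.axis)
        curve = feat.mean(dim=reduce_dims[1]).mean(dim=reduce_dims[0])  # (N, C, L')
        # Resize the curve so its length matches the corresponding input dimension.
        curve = F.interpolate(curve, size=out_length, mode='linear', align_corners=False)
        return torch.sigmoid(self.conv(curve))  # (N, num_organs, out_length)
```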
D. Label Assignment and Objective Function
In the training stage, each CT image is labeled with three binary curves with fixed lengths of
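For reference, the construction of such ground-truth presence curves from an axis-aligned bounding box can be sketched as follows (the helper name and the six-coordinate box format are our assumptions):

```python
import numpy as np

def box_to_curves(box, image_shape):
    """Build three binary ground-truth presence curves from one bounding box.

    `box` holds the start/end indices of the box along each of the three axes;
    positions covered by the box along an axis are labeled 1, all others 0.
    """
    curves = []
    for (lo, hi), length in zip((box[0:2], box[2:4], box[4:6]), image_shape):
        curve = np.zeros(length, dtype=np.uint8)
        curve[lo:hi + 1] = 1
        curves.append(curve)
    return curves  # one binary curve per direction
```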
E. Implementation Details
The proposed method is implemented in Caffe [23] deep learning framework with the aid of Insight Segmentation and Registration Toolkit (ITK). All experiments are conducted on a workstation equipped with one NVIDIA GTX1080 Ti graphic card (11 GBytes of memory) and one Intel® Core
Experiments
A. Metrics
To evaluate the localization accuracy of different methods, we adopt the intersection over union (IoU) between the predicted bounding box and the ground-truth bounding box as the major metric:\begin{equation*} IoU=\frac {|V^{*}\cap V|}{|V^{*}\cup V|}\tag{2}\end{equation*} where $V^{*}$ and $V$ denote the regions enclosed by the two bounding boxes.
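For axis-aligned 3D boxes stored as six coordinates, Eq. (2) reduces to a simple per-axis overlap computation; a minimal sketch (box format assumed) is:

```python
def box_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes, each given as
    (z0, z1, y0, y1, x0, x1), following Eq. (2)."""
    intersection, vol_a, vol_b = 1.0, 1.0, 1.0
    for i in range(3):
        a0, a1 = box_a[2 * i], box_a[2 * i + 1]
        b0, b1 = box_b[2 * i], box_b[2 * i + 1]
        intersection *= max(0.0, min(a1, b1) - max(a0, b0))  # per-axis overlap
        vol_a *= a1 - a0
        vol_b *= b1 - b0
    union = vol_a + vol_b - intersection
    return intersection / union if union > 0 else 0.0
```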
B. Multiple Body Organ Localization
We first report the results of the proposed method for the localization of the 11 body organs (or anatomical structures). There are 618 instances of the 11 target organs in total in the ground truth of the testing set. 5 false negatives and 3 false positives appear in the final results. The mean wall and centroid distances (with standard deviation) are 4.36(7.98)
Result visualization of the proposed method. The ground-truth and the predicted bounding boxes are drawn with solid line and dashed line respectively. Organs are distinguished by different colors annotated in the right-bottom legend. Best viewed in color.
C. Comparison With Other Methods
In this section, we directly compare the proposed method with four other state-of-the-art methods (A: De Vos et al. (2017) [8], B: Humpire et al. (2018) [7], C: 3D Faster R-CNN [18], and D: Xu et al. (2019) [17]) for organ localization on the same dataset. For a fair and reasonable comparison, the comparative methods are fully implemented and trained following their original publications. We specifically fine-tune the hyper-parameters of these comparative methods to optimize their performance on our dataset, and some outliers in their results are excluded.
The results of this quantitative comparison are shown in Table 2 and Figure 7. As can be seen, the methods that directly utilize the 3D context information of CT volumes (Methods C, D, and the proposed method) generally achieve higher IoU than the methods using 2D CT slices (Methods A and B). This indicates that processing the CT image in a 3D manner makes full use of the spatial context information and thus improves the localization accuracy. Furthermore, the proposed method achieves the highest global IoU, outperforming the other competitors on most target organs except the right lung (87.43%, lower than the 87.78% of Method B), the pancreas (58.26%, lower than the 58.56% of Method D), and the left/right femoral heads (74.86%/75.35%, lower than the 79.77%/77.26% of Method D). These results demonstrate the effectiveness of our designs for accurate organ localization. The wall and centroid distance results of our method in Table 3 are also better than those of the other four methods.
Boxplot of the IoU of different methods for 11 body organ localization. Best viewed in color.
Regarding efficiency, the average processing time of the proposed network is 0.5 seconds, which is comparable to that of the other two 3D ConvNets (0.4 seconds for Method C and 0.3 seconds for Method D) and 3 to 7 times faster than that of the 2D ConvNets (1.5 seconds for A and 3.7 seconds for B), demonstrating the high efficiency of processing CT images using 3D networks. Due to the computation of the enhanced density map and the gradient map, the proposed method needs extra preprocessing time (approximately 3.2 seconds per CT image). The preprocessing, however, is conducted entirely on the CPU and could therefore be further accelerated by parallel computing on the GPU.
D. Effectiveness of Density Enhancement
To improve the localization accuracy, we propose to use the density enhancement filter to extend the original CT image to a three-channel image, which is taken as input by the subsequent triple-branch FCN. To verify the effectiveness of this density enhancement, we compare the localization accuracy of the proposed method when taking the original CT image as input with and without the enhanced density map and the gradient map. The experimental results are shown in Table 4. It can be seen that the density enhancement filter effectively improves the localization accuracy of the proposed method. This result verifies our hypothesis that emphasizing the regions and edges of the target organs helps the deep network learn more essential features for accurate organ localization.
E. Impact of Different Gradient Operators
In this section, we conduct an experiment to investigate the effect of using different gradient operators to generate the gradient map in the proposed method. Three gradient operators, i.e., Sobel, Laplacian of Gaussian (LoG), and Canny, are included in this experiment. As shown in Table 5, the global IoU of the proposed method with these three gradient operators is 76.44%, 75.46%, and 75.61%, respectively. Using more elaborate edge operators (such as LoG and Canny) to obtain more detailed edges does not contribute to higher localization accuracy compared with the simple first-order Sobel operator. We attribute this result to the relatively large receptive field of the deep ConvNets, which loses sight of fine detail and makes the network insensitive to detailed edges.
F. Impact of Different Backbone Networks
As mentioned in Section III-B, the backbone network used for feature extraction is essential to the final performance of the proposed method. Different backbone networks can lead to different results. To investigate this impact, we successively evaluate the localization accuracy (measured by IoU) of the proposed method using three standard architectures as the backbone: AlexNet [20], VGG-16 [21], and ResNet-34 [22]. Due to the large memory footprints of VGG-16 and ResNet-34, this experiment is conducted using down-sampled CT images in sizes of
G. Impact of the Triple-Branch Joint Training
The triple-branch structure of the proposed method makes it convenient to predict the presence probability curves in the three directions synchronously. It also helps improve the final localization accuracy through joint training. In the training stage, each of the branch networks influences the shared backbone network, forcing the backbone to learn more essential features for accurate localization. To demonstrate this effect, we train the three branch networks separately with non-shared backbone networks and fuse their outputs to generate the final results. In this experiment, the non-joint training manner leads to a drop in IoU (from 76.44% to 76.08%), indicating that joint training has a positive impact on the final localization accuracy. Although the improvement in accuracy is modest, it is more convenient and time-saving to train the three branch networks synchronously in one model rather than training three models individually.
Discussions
The wall and centroid distance errors of the method by Humpire et al. [7] in Table 3 are worse than those reported in their original publication. We attribute this performance discrepancy to the different characteristics of the datasets used in [7] and in our work. Compared with the CT data used in [7], which was collected from a single medical center, our image data (i.e., the LiTS challenge dataset) comes from several clinical sites around the world using different scanners and protocols; it therefore exhibits larger data variations and is more challenging for accurate organ localization. For example, the CT slice thickness of the dataset in [7] is distributed in a range of [1.00, 2.00]
As shown in Table 4, the global IoU of the proposed method without the enhanced density map and gradient map as input is 73.36%, which is slightly higher than that of the method in [17] (73.01%, Xu et al. [17] in Table 2). Since both [17] and the proposed method are based on 3D ConvNets, this result indicates that the proposed 3D triple-branch FCN has almost the same performance as the 3D ConvNet used in [17]. However, the proposed 3D triple-branch FCN is composed entirely of basic ConvNet operations such as convolution, pooling, and softmax regression, and no customized components (e.g., the layers used for anchor generation and label assignment in 3D Faster R-CNN and [17]) are involved. It is thus easy to implement and flexible enough to serve as a preprocessing module in other applications.
Data augmentation plays an important role in training the proposed method. To demonstrate this, we turn data augmentation off in the training stage and use only the 118 original CT images to train the proposed network. In this experiment, the global IoU drops from 76.44% to 66.16%, indicating the positive effect of data augmentation in the proposed method. Because the augmentation strategies are fully based on random numbers and performed on-line during training, the batch data fed into the model is always changing, even for the same CT images. Consequently, there is no exact count of the total training samples after data augmentation.
In the proposed method, we resample the original CT images by bilinear interpolation. When we replace the bilinear interpolation with bicubic interpolation, the global IoU of the proposed method changes only slightly, from 76.44% to 76.38%, while the preprocessing time increases substantially from 3.2 seconds to 19.1 seconds. Considering the trade-off between accuracy and efficiency, we use bilinear interpolation in the proposed method to resample the original CT images.
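As a point of reference, the two interpolation settings can be compared with an off-the-shelf resampling routine such as scipy.ndimage.zoom, where order=1 corresponds to (tri)linear interpolation and order=3 to a cubic spline, a close stand-in for the bicubic interpolation discussed above (an illustrative sketch, not our actual ITK-based preprocessing):

```python
from scipy import ndimage

def resample_volume(volume, target_shape, order=1):
    """Resample a CT volume to target_shape; order=1 is (tri)linear, order=3 cubic."""
    zoom_factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return ndimage.zoom(volume, zoom_factors, order=order)
```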
When we train the proposed triple-branch FCN without the enhanced density map and the gradient map, the final global IoU decreases from 76.44% to 73.36%, which is still slightly higher than that of the current state-of-the-art method (73.01%, Xu et al. [17] in Table 2). This demonstrates that the density and edge enhancement helps improve the localization accuracy of the proposed network. Looking into Table 2, we find that this improvement mainly comes from three organs, i.e., the liver (from 77.83% to 86.99%), the spleen (from 70.01% to 85.75%), and the bladder (from 58.23% to 66.40%). Each of these three organs has relatively low background contrast and a variable shape. However, the IoU of the pancreas (which has fuzzy boundaries in CT images) is almost unchanged (from 58.56% to 58.26%). This result indicates that the density and edge enhancement helps improve the localization accuracy of organs with low background contrast and variable shapes, but has little effect on organs with fuzzy boundaries.
Due to the CT imaging principle, inhomogeneous contrast in a CT image usually reflects a non-uniform density distribution of the materials, which means they correspond to different tissues. Therefore, the proposed density enhancement filter is competent to handle inhomogeneous contrast by enhancing the target organs with specific densities and suppressing the others. However, there are two special cases of inhomogeneous contrast in which the density enhancement filter may not work. Firstly, lesion regions within the target organs usually have densities different from those of normal tissues, so it is hard to enhance these lesion regions together with the normal tissues. Secondly, some artificial structures, such as artificial femoral heads, introduce metal artifacts in the CT image that interfere with the intensity distribution. These two problems cannot be solved by the enhanced density map alone but can be relieved by taking more information into account. To this end, we combine the original CT image and the gradient map with the enhanced density map to compose the three-channel input image.
In the proposed method, two kinds of hand-crafted features, i.e., the enhanced density map and the gradient map, are combined with the original CT image to compose the three-channel image fed to the triple-branch network. This can be regarded as a kind of early fusion strategy [27]–[29] in machine learning, which concatenates different features at the feature level. According to the experimental results, combining these two hand-crafted features with the original CT image improves the global IoU of the proposed model by a significant margin of 3.08% (from 73.36% to 76.44% in Table 4), demonstrating the effectiveness of the early fusion strategy.
Conclusion
In this work, we present an automatic method for multiple-organ localization in CT images using a novel 3D triple-branch FCN. The method is fully implemented in a 3D manner, so it can fully utilize the spatial context information in the CT image to perform accurate organ localization. The core component, i.e., the triple-branch FCN, is built entirely from basic ConvNet components; it is therefore easy to implement and flexible enough to serve as a preprocessing module in other applications. To further improve its performance, we also propose to use a density enhancement filter to enhance structures with specific densities in the CT image and to concatenate these enhanced images with the original CT image into a three-channel image fed into the 3D triple-branch FCN for training and testing. Experimental results on a public clinical abdominal/torso CT dataset show that the proposed method provides higher localization accuracy than the current state-of-the-art methods.