Introduction
X-ray computed tomography (CT) plays an important role in medical imaging and is widely used in modern clinical practice. Because patients' exposure to radiation carries a potential risk of genetic damage or cancer [1], [2], there is increasing concern about the radiation dose delivered to patients, which has led to the well-known guiding principle of as low as reasonably achievable (ALARA). However, decreasing the radiation dose introduces extra noise and artifacts into the reconstructed image and degrades the diagnostic information. Therefore, many methods for removing noise and improving the quality of low-dose CT (LDCT) images have been proposed in the past decades. They can generally be divided into three categories: 1) sinogram domain filtering [3]–[6], 2) iterative reconstruction [7]–[10], and 3) image processing [11]–[15].
Sinogram domain filtering methods treated the 2-D sinogram signals as 2-D images and applied image processing methods to remove noise from them, such as the penalized weighted least-squares (PWLS) algorithm [4], bilateral filtering [5], and structural adaptive filtering [16]. In [6], Liu et al. attempted to remove noise through a 3-D representation-based feature decomposition of the projected attenuation component and the noise component, using a well-designed composite dictionary containing atoms with discriminative features. The sinogram filtration methods performed well when the characteristics of the noise in the data domain were well known. However, these methods lost structure and spatial resolution in the reconstructed images when small edges were filtered out. Moreover, the raw data were not always accessible on commercial scanners.
A wide range of iterative reconstruction (IR) algorithms have been proposed for LDCT reconstruction over the past decade and adopted by many major CT vendors [17]. The scanner geometry and the physical properties of the imaging process [7] were first simulated. Then statistical noise models [18] and prior information in the image domain, such as total variation (TV) and its variants [9], [19], as well as dictionary learning [10], [20], [21], were combined with the system model in an objective function whose optimum gave the reconstructed image. These algorithms improved the image quality considerably, but they still suffered from loss of details and residual artifacts. In addition, they incurred a high computational cost due to their iterative nature.
Different from the sinogram filtration and the iterative reconstruction methods, image processing algorithms could be directly applied to LDCT images. For example, the well-known non-local means (NLM) method [22], which calculated the weighted average of similar neighbourhoods, was adapted by Ma et al. for CT image denoising [11]. The block-matching 3D (BM3D) algorithm [23] grouped similar 2D CT image patches into 3D arrays, applied a 3D transform and filtered the corresponding coefficients so as to remove noise efficiently in various image denoising tasks [13], [24]. Stemming from sparse representation, Chen et al. [12] proposed to adapt a patch-based K-SVD algorithm [25] to suppress mottled noise and streak artifacts in abdomen CT images. Although these methods improved the image quality effectively, over-smoothing and residual errors still existed in the resulting images because the noise was non-uniformly distributed in CT images.
Deep learning (DL) methods have been shown to provide superior performance in many image-based applications [26] and have attracted increasing attention in the field of medical image processing. Chen et al. [27] proposed a noise reduction method for low-dose CT by training a deep convolutional neural network (CNN) patch by patch. They then improved this original work by integrating an autoencoder, a deconvolution network and residual blocks into the so-called residual encoder-decoder convolutional neural network (RED-CNN) [28]. Kang et al. [29] proposed a wavelet network by combining a deep CNN with a directional wavelet approach, leading to greater noise reduction and shorter reconstruction time. Gholizadeh-Ansari et al. [30] proposed a deep residual network with dilated convolution that can pass the signal to the higher layers through identity mappings, and achieved good results with fewer layers and lower computational cost. Chen et al. [31] unfolded an existing iterative framework for CT reconstruction into a deep-learning network in which the key parameters were learned from training samples. In [32], Liu et al. adopted stacked sparse denoising autoencoders to construct a low-dose CT restoration network, which was capable not only of suppressing noise but also of recovering structural details. Yin et al. [33] proposed a domain progressive 3D residual convolution network (DP-ResNet) for the LDCT imaging procedure, containing a sinogram domain network (SD-net), filtered back projection (FBP) and an image domain network (ID-net), which could improve the performance of noise removal significantly.
Very recently, as the generative adversarial network (GAN) has become one of the most impressive variants of CNN for computer vision and pattern recognition [34], there have been increasing attempts to use GANs for image noise removal [35]–[37]. Specifically for LDCT restoration, Wolterink et al. [38] proposed to use a conditional GAN (CGAN) [39], in which the generator was a network of seven consecutive convolutional layers with small convolutional kernels, while the discriminator was a network that differentiated real routine-dose CT images from denoised low-dose CT images by optimizing the cross-entropy loss. Nie et al. [40] employed a 3D fully convolutional network (FCN) with residual connections as the generator and added an image gradient difference term to the loss function. In [41], the authors revealed that using mean squared error (MSE) as the loss function would overlook subtle image features critical for human perception. To solve this problem, Yang et al. [42] proposed to replace the JS divergence with the Wasserstein distance to measure the differences between the distributions of generated images and ground truths. They also introduced the distances between features extracted by a pre-trained VGG-19 network [43] as the perceptual loss of the GAN. With the same Wasserstein adversarial loss and perceptual loss, Shan et al. [44] proposed a 2D Conveying Path-based Convolutional Encoder-decoder (CPCE) network and extended it to a 3D model, resulting in better performance in noise suppression and structure preservation. Liao et al. [45] incorporated a feature pyramid network (FPN)-based discriminator and a differentially modulated focus map into the least squares GAN (LSGAN) [46], outperforming other methods in correcting cone-beam artifacts in the image. Yi et al. [47] introduced a sharpness loss in addition to the adversarial loss and perceptual loss to ensure the final sharpness of the image and the faithful reconstruction of low-contrast regions. Du et al. [48] attempted to inject visual attention knowledge into the learning process of the GAN to provide a powerful prior on the noise distribution, so that the network would not only pay special attention to noisy regions and surrounding structures but also explicitly assess the local consistency of the recovered regions.
Although these GAN-based denoising methods can provide convincing performance, they suffer from the following drawbacks: 1) noise is transferred into the decoding blocks through the shortcut connections from the corresponding encoding blocks, leaving a large amount of noise in the generated image and even producing false lesions from the noise; 2) only the result of the final decoding block of the generator is sent to the discriminator, ignoring the impact of the results of the previous decoding blocks on the final result; and 3) the Wasserstein distance is useful for stabilizing GAN training but not sufficient for improving the image quality [49].
To overcome the shortcomings of the existing LDCT image denoising methods, in this work we propose a generative adversarial network with a novel architecture. As shown in Fig. 1, we modify the U-Net [50] as the generator by adding inception-residual blocks on each of the shortcut connections and replacing the concatenation connection with residual mapping. For the discriminator, we propose a novel architecture that combines the discriminating results of multiple CNNs, each of which independently distinguishes the output of one deconvolutional layer from the corresponding downsampling layer of the real normal-dose CT images, so that the adversarial training is sensitive to noise and artifacts at different scales. To further improve the performance of the proposed network, we define a loss function consisting of the following parts: 1) the least square loss as the adversarial loss, which can improve the stability of the training process; 2) the Euclidean distances between features extracted by the VGG-19 network as the perceptual loss, to make the generated image more similar to the true image in terms of human visual perception; 3) the MSE between the generated image and the ground truth as the content loss, to keep the generated image close to the true image at the pixel level; and 4) the MSE between the removed noise and the real noise as the noise loss, to make sure the noise is removed more precisely. The experimental results illustrate that our proposed network performs better than the state-of-the-art methods on two different public CT image datasets with respect to various evaluation criteria.
The rest of this paper is organized as follows. We introduce some background knowledge in Section II. The architecture of our proposed network is described in Section III. The experimental results are then presented and discussed in Section IV. Finally, we conclude our work in Section V.
Background
A. Noise Reduction Model
Given a normal-dose CT (NDCT) image $I_{ND}$ and its corresponding LDCT image $I_{LD}$, the two can be related as \begin{equation*} I_{LD}=N(I_{ND}),\tag{1}\end{equation*}
\begin{equation*} I_{ND} = N^{-1}(I_{LD}),\tag{2}\end{equation*}
\begin{equation*} \widehat {I_{ND}}=\mathcal {F}(I_{LD};\Theta),\tag{3}\end{equation*}
\begin{equation*} \mathcal {F}(I_{LD};\Theta)=\arg \min _{\theta }\left \|{ f(I_{LD};\theta)-I_{ND} }\right \|^{2}_{2},\tag{4}\end{equation*}
B. Least Square Generative Adversarial Network (LSGAN)
In contrast to the original GAN, which employs a sigmoid cross-entropy loss and therefore suffers from vanishing gradients and unstable training, LSGAN uses the least square loss with \begin{align*} \min _{D}V_{LSGAN}(D)=&\frac {1}{2}\mathrm {E}_{x\sim p_{data}(x)}[(D(x)-b)^{2}] \\&+\,\frac {1}{2}\mathrm {E}_{z\sim p_{z}(z)}[(D(G(z))-a)^{2}] \\ \min _{G}V_{LSGAN}(G)=&\frac {1}{2}\mathrm {E}_{z\sim p_{z}(z)}[(D(G(z))-c)^{2}],\tag{5}\end{align*}
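The two objectives in Eq. (5) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the expectations are replaced by batch means, and the default targets a = 0, b = 1, c = 1 are an assumption (the common 0-1 coding, which matches the targets that appear in Eqs. (6) and (7)).

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator objective of Eq. (5): push D(x) toward b and D(G(z)) toward a."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=1.0):
    """Generator objective of Eq. (5): push D(G(z)) toward c."""
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_fake - c) ** 2)

# A discriminator that is undecided (score 0.5 everywhere) is penalized on both terms.
undecided = lsgan_d_loss([0.5, 0.5], [0.5, 0.5])
```

Unlike the sigmoid cross-entropy loss, the quadratic penalty keeps growing with the distance between a score and its target, so samples far from the target still produce a gradient.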
Inspired by the model of noise reduction with neural networks and the advantages of LSGAN, we make use of LSGAN to remove noise from the LDCT image. The details of the proposed LSGAN architecture are discussed in the following section.
Network Architecture
A. Objective
As discussed in the above section, we take advantage of the LSGAN as the basic structure of our proposed network. Fig. 1 shows the overall architecture of the proposed method, where the generator \begin{align*} \min _{D}\mathcal {L}_{A}(D;G)=&\sum _{k=1}^{N} \min _{D}\mathcal {L}_{A}(D_{k};G) \\=&\sum _{k=1}^{N} \mathrm {E}_{x_{n}\sim p(x_{n})}\left [{ \left \|{ D_{k}(x_{n})-1 }\right \|^{2} }\right] \\&+\, \mathrm {E}_{x_{l}\sim p(x_{l})}\left [{ \left \|{ D_{k}(G(x_{l})) }\right \|^{2} }\right],\tag{6}\\ \min _{G}\mathcal {L}_{A}(G;D)=&\sum _{k=1}^{N} \min _{G}\mathcal {L}_{A}(G;D_{k}) \\=&\sum _{k=1}^{N} \mathrm {E}_{x_{l}\sim p(x_{l})}\left [{ \left \|{ D_{k}(G(x_{l}))-1 }\right \|^{2} }\right],\tag{7}\end{align*}
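As a rough sketch of Eqs. (6) and (7), the adversarial losses sum one least-squares term per sub-discriminator. The code below assumes that the scores of each sub-discriminator $D_k$ have already been computed and are passed in as one array per $k$; the function names and this calling convention are hypothetical, and the expectations are approximated by batch means.

```python
import numpy as np

def joint_d_loss(real_scores, fake_scores):
    """Eq. (6): sum over k of E[(D_k(x_n) - 1)^2] + E[D_k(G(x_l))^2]."""
    total = 0.0
    for d_real, d_fake in zip(real_scores, fake_scores):
        d_real = np.asarray(d_real, dtype=float)
        d_fake = np.asarray(d_fake, dtype=float)
        total += np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    return float(total)

def joint_g_loss(fake_scores):
    """Eq. (7): sum over k of E[(D_k(G(x_l)) - 1)^2]."""
    return float(sum(np.mean((np.asarray(s, dtype=float) - 1.0) ** 2)
                     for s in fake_scores))

# Two sub-discriminators (N = 2) that both reject the generated images completely.
g_loss_rejected = joint_g_loss([[0.0, 0.0], [0.0]])
```

Because the terms are simply summed, a generated image has to be convincing to every sub-discriminator for the generator loss to vanish.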
The training image pairs of the proposed network. (a) is the input noisy image, (b) is the ground truth image without noise, and (c) is the “fake” de-noised image generated by the proposed network.
It is widely recognized that the adversarial loss alone is far from sufficient for noise removal. Therefore, we follow the theory proposed in [43] and apply VGG-19 to calculate the perceptual loss for better training performance. The perceptual loss addresses the situation in which two images look the same to human beings but are quite different mathematically [51]. The perceptual loss is calculated as follows:\begin{equation*} \mathcal {L}_{p}=\frac {1}{N_{i}} \left \|{ \phi ^{(i)}(x_{n})-\phi ^{(i)}(G(x_{l})) }\right \|_{1},\tag{8}\end{equation*}
Together with the above two loss functions, the traditional mean square error (MSE) between the generated image \begin{equation*} \mathcal {L}_{c}=\frac {1}{N_{x}}(x_{n}-G(x_{l}))^{2},\tag{9}\end{equation*}
Furthermore, we propose to measure the difference between the noise map of LDCT image and the noise removed by the generator, which is computed as:\begin{equation*} \mathcal {L}_{n}=\frac {1}{N_{x}}(\left |{x_{n}-x_{l} }\right |-\left |{G(x_{l})-x_{l} }\right |)^{2},\tag{10}\end{equation*}
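The content loss of Eq. (9) and the noise loss of Eq. (10) can be sketched with plain arrays as follows. This is an illustration under assumptions, not the authors' code: $N_x$ is taken to be the number of pixels (so each loss is a mean over the image), and the function names are hypothetical.

```python
import numpy as np

def content_loss(x_n, g_xl):
    """Eq. (9): mean squared error between the NDCT image and the generated image."""
    x_n = np.asarray(x_n, dtype=float)
    g_xl = np.asarray(g_xl, dtype=float)
    return float(np.mean((x_n - g_xl) ** 2))

def noise_loss(x_n, x_l, g_xl):
    """Eq. (10): mismatch between the real noise map and the noise removed by G."""
    x_n = np.asarray(x_n, dtype=float)
    x_l = np.asarray(x_l, dtype=float)
    g_xl = np.asarray(g_xl, dtype=float)
    real_noise = np.abs(x_n - x_l)       # noise map of the LDCT image
    removed_noise = np.abs(g_xl - x_l)   # what the generator took away
    return float(np.mean((real_noise - removed_noise) ** 2))

x_n = np.zeros((2, 2))                    # toy "clean" image
x_l = np.array([[1.0, -1.0], [2.0, 0.0]]) # toy "noisy" image
identity_loss = noise_loss(x_n, x_l, x_l) # generator that removes nothing
```

One observation that follows directly from the formula: because Eq. (10) compares absolute values, it is also zero for the mirrored output $2x_l - x_n$, so the noise loss constrains the magnitude of the removed noise while the content loss in Eq. (9) constrains its sign.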
Given the above losses, the objective of our proposed network is:\begin{align*}\mathcal {L}=&\arg \min _{G}\min _{D}(\lambda _{1}\mathcal {L}_{A}(D;G)+\lambda _{2}\mathcal {L}_{A}(G;D)+\lambda _{3}\mathcal {L}_{p}(G) \\&\qquad \qquad \qquad \qquad \qquad {+\,\lambda _{4}\mathcal {L}_{c}(G)+\lambda _{5}\mathcal {L}_{n}(G)),} \tag{11}\end{align*}
B. Generator
Fig. 3 shows the architecture of the proposed generator, where the U-Net [50] is used as the basic structure since it can recover fine-grained details well in the generated image. However, the noise in each convolution layer is transferred to the deconvolution layer through the shortcut connection without any “filtration”, so that much noise remains in the generated image. To solve this problem, we apply an inception-residual block to each shortcut connection and change the connection mode from concatenation to residual mapping so as to filter out as much noise as possible.
1) Inception-Residual Block
The inception-residual block was proposed by Szegedy et al. in [52] and is well known for its ability to reflect the inception structure of an image. It combines the advantages of both the inception architecture [53] and residual connections [54], so that it can reflect multi-scale visual features and remove noise features while retaining high computational efficiency. In this work, we propose to use 4 inception-residual blocks with different sizes to process 4 different convolution layers before connecting them to the corresponding deconvolution layers. Fig. 4(a) shows the Inception-ResNet-A block used on the first two convolution-deconvolution shortcut connections, where the parameters are modified to fit the sizes of the first two convolution layers in the proposed U-Net structure. Fig. 4(b) shows the Inception-ResNet-B block used on the last two convolution-deconvolution shortcut connections, where we also adjust the parameters to make the block consistent with the sizes of the last two convolution layers in the U-Net structure. The reason for using different blocks on different shortcut connections is that the noise decreases as the U-Net goes deeper, so applying an overly complicated inception block to a relatively simple input is unnecessary and would over-smooth the image.
The structure of the proposed inception-residual blocks used in the generator. (a) is the Inception-ResNet-A module used on the shortcut connection routines for the first two convolution layers, (b) is the Inception-ResNet-B module used on the shortcut connection routines for the last two convolution layers.
2) Residual Mapping
In a traditional U-Net, each convolution layer is concatenated with the corresponding deconvolution layer, which usually leads to model degradation and parameter explosion, reducing the training efficiency and accuracy. Inspired by the outstanding denoising performance of ResNet50 and its variants [54], we change the mode of the shortcut connection so that the proposed generator can produce an image with less noise. As shown in Fig. 5, the convolution layer processed by the inception-residual block is added directly to the corresponding deconvolution layer, which is mathematically expressed as follows:\begin{equation*} SC_{i}=Conv_{K-i}+Deconv_{i},\quad i=1,2,3,4\tag{12}\end{equation*}
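The difference between the two connection modes can be illustrated on plain feature maps. The sketch below is not the authors' code: it assumes channel-last tensors of shape (batch, height, width, channels), uses hypothetical function names, and leaves out the inception-residual block that processes the convolution features before the addition in Eq. (12).

```python
import numpy as np

def concat_skip(conv_feat, deconv_feat):
    """Traditional U-Net shortcut: stack the two feature maps along the channel axis."""
    return np.concatenate([conv_feat, deconv_feat], axis=-1)

def residual_skip(conv_feat, deconv_feat):
    """Residual mapping of Eq. (12): element-wise sum, which requires equal shapes."""
    if conv_feat.shape != deconv_feat.shape:
        raise ValueError("residual mapping needs feature maps of identical shape")
    return conv_feat + deconv_feat

conv = np.ones((1, 8, 8, 16))
deconv = np.full((1, 8, 8, 16), 2.0)
concat_shape = concat_skip(conv, deconv).shape      # channel count doubles
residual_shape = residual_skip(conv, deconv).shape  # channel count unchanged
```

Since the sum keeps the channel count unchanged, the layer that follows the shortcut needs fewer input weights than with concatenation, which is in line with the parameter growth mentioned above.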
The structure of the proposed residual mapping used for connecting the convolution layer and the corresponding deconvolution layer. (a) is the concatenation connection used in the traditional U-Net structure, (b) is the residual mapping connection used in the modified U-Net structure as generator.
Some key parameters of the layers in the proposed generator are listed in Table 1. With the proposed inception-residual blocks on the shortcut connections and the residual mapping mode of connection, the modified U-Net performs well in removing noise. The experiments comparing the contributions of the different parts of the generator are described in Sec. IV.
C. Discriminator
Multi-layer convolutional neural networks (CNNs) are usually used as the discriminators of traditional LSGANs, where the inputs are both the generated images and the ground truth images. Considering that convolution only preserves the image structure [55], differentiating the generated image from the ground truth with a single CNN would fail to discriminate image details, resulting in a loss of details in the generated image. To solve this problem, we propose a multi-level joint discriminator consisting of several sub-discriminators. As shown in Fig. 6, every individual sub-discriminator is a \begin{equation*} D = \frac {1}{N}(D_{1}+D_{2}+\cdots +D_{N}),\tag{13}\end{equation*}
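Eq. (13) amounts to averaging the outputs of the sub-discriminators. A minimal sketch, assuming the scores are collected in an array of shape (N, batch) with one row per sub-discriminator (the function name and this layout are hypothetical):

```python
import numpy as np

def joint_discriminator_score(sub_scores):
    """Eq. (13): D = (D_1 + D_2 + ... + D_N) / N, evaluated per sample."""
    scores = np.asarray(sub_scores, dtype=float)
    return scores.mean(axis=0)

# Three sub-discriminators scoring a batch of two images.
joint = joint_discriminator_score([[1.0, 0.0],
                                   [0.5, 0.5],
                                   [0.0, 1.0]])
```

With equal weights, no single scale dominates the joint decision.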
D. Network Training
The pair-wise 2D slices of the recovered image
Experiments
To evaluate the effectiveness of the proposed method, we apply it to both simulated low-dose CT and official simulated low-dose CT images and compare its performance with several state-of-the-art image reconstruction algorithms, including K-SVD [25], BM3D [23], KAIST-NET [29], RED-CNN [28], WGAN-VGG [42] and SAGAN [47]. For quantitative analysis, the peak signal to noise ratio (PSNR) and the structured similarity index (SSIM) are used as the evaluation metrics. The PSNR is commonly used to measure the pixel-wise differences between two images, whereas the SSIM reflects the differences perceived by the human visual system. For qualitative analysis, a reader study is performed on 10 groups of images in terms of artifact reduction, noise suppression, contrast retention, lesion discrimination and overall quality. In the following, we introduce the selection of datasets, the parameter settings and the implementation of the experiments, and discuss the experimental results of comparing the proposed method with state-of-the-art methods with respect to different evaluation metrics.
A. Experimental Datasets
1) Simulated Dataset
For the simulated noisy data, 4036 normal-dose CT images with the size of \begin{equation*} N \sim \text {Poisson}(N_{0}\text {exp}(-y)),\tag{14}\end{equation*}
In our experiments, the blank scan flux
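The noise model of Eq. (14) can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a sinogram of line integrals $y$, the noisy photon counts are converted back to line integrals with the usual negative logarithm, and counts are clamped to at least one photon to avoid taking the logarithm of zero. The function name, the clamp and the example flux values are assumptions; forward projection and FBP reconstruction are outside the scope of the sketch.

```python
import numpy as np

def add_poisson_noise(sinogram, n0, rng=None):
    """Simulate low-dose projections: N ~ Poisson(N0 * exp(-y)), then y_noisy = -log(N / N0)."""
    rng = np.random.default_rng() if rng is None else rng
    sinogram = np.asarray(sinogram, dtype=float)
    counts = rng.poisson(n0 * np.exp(-sinogram))  # detected photons per ray, Eq. (14)
    counts = np.maximum(counts, 1)                # assumption: guard against log(0)
    return -np.log(counts / n0)

rng = np.random.default_rng(0)
clean = np.ones(1000)                             # toy sinogram with constant attenuation
high_flux = add_poisson_noise(clean, n0=1e6, rng=rng)
low_flux = add_poisson_noise(clean, n0=1e3, rng=rng)
```

A smaller blank scan flux $N_0$ yields fewer detected photons and therefore a noisier sinogram, which is how different noise levels can be simulated.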
2) Official Simulated Clinical Data
For the official simulated clinical data, we use the dataset authorized by Mayo Clinics for “the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge”, which contains 2378 full and quarter dose
B. Parameter Setting
In our experiments, several parameter combinations are evaluated and the parameters are empirically set as follows. The base learning rate is set to
All the experiments are performed in Python with the TensorFlow and Keras libraries on an Intel Xeon Silver 4110 2.1GHz PC with 32G RAM and an NVIDIA TITAN Xp graphics processing unit card with 12G RAM.
C. Comparator Methods
Several state-of-the-art methods are compared with our proposed method, including K-SVD [25], BM3D [23], KAIST-NET [29], RED-CNN [28], WGAN-VGG [42] and SAGAN [47]. Dictionary learning [25] and BM3D [23] are two of the most popular image-based denoising methods and have already been widely applied to LDCT. KAIST-NET [29] is one of the most recently proposed CNN-based LDCT denoising methods and can be considered a deepened variant of the lightweight CNN model [27]. RED-CNN [28] is a successful attempt to apply the U-net structure [50] to medical image denoising. It replaces the pooling/unpooling layers of U-net with convolution/deconvolution pairs. WGAN-VGG [42] and SAGAN [47] are both state-of-the-art image reconstruction methods based on the GAN structure. WGAN-VGG adopts the WGAN structure to generate the de-noised image and combines the perceptual loss calculated by the VGG network with the WGAN adversarial loss to keep the image content after de-noising. SAGAN utilizes a U-net structure with long skip connections as the generator of the GAN and proposes a sharpness detection network to calculate the sharpness loss as a complement to the perceptual loss and adversarial loss. KAIST-NET, RED-CNN, WGAN-VGG and SAGAN are fine-tuned using our training samples, and their hyper-parameters are adjusted according to our experimental experience.
D. Implementation of Experiments
The experiments are implemented as follows:
1) the test LDCT images in the different databases are processed by the proposed method, as well as by the comparator state-of-the-art methods, to produce reconstructed NDCT images;
2) a blind reader study is performed on 10 groups of images for qualitative analysis. The images processed by the different methods are sent to two radiologists, who independently score every image in terms of artifact reduction, noise suppression, contrast retention, lesion discrimination and overall quality on a five-point scale (1 = unacceptable and 5 = excellent). The mean and standard deviation values of the scores from the two radiologists are calculated as the final evaluation results;
3) for quantitative analysis, the resulting de-noised images are first compared with the ground truth to compute the root mean squared error (RMSE):
\begin{equation*} \text {RMSE}=\sqrt {\frac {1}{m\times n}\sum _{i=0}^{m-1}\sum _{j=0}^{n-1}[I_{r}(i,j)-G(i,j)]^{2}},\tag{15}\end{equation*}
where $m$ and $n$ represent the width and height of the image. Then we calculate the peak signal to noise ratio (PSNR) as follows:
\begin{equation*} \text {PSNR}=10\cdot \log _{10}\left({\frac {\text {MAX}_{I_{r}}^{2}}{\text {MSE}}}\right),\tag{16}\end{equation*}
where $\text {MAX}_{I_{r}}$ is the maximum possible pixel value of the image. It is set to 255 in our experiments since the pixels of the images are represented using 8 bits per sample. MSE represents the mean squared error, i.e., the square of the RMSE in Eq. (15). PSNR is used to evaluate the performance of the proposed method in removing noise, while the RMSE is used to assess the ability of the proposed method to preserve small nodules without discarding them as noise;
4) the structured similarity index (SSIM) is taken into account to evaluate the performance of the proposed method in reconstructing images $I_{r}$ that are perceptually similar to the ground truth images $G$. SSIM is mathematically defined as:
\begin{align*} \text {SSIM}(I_{r},G)=\frac {(2\mu _{I_{r}}\mu _{G}+C_{1})(2\sigma _{I_{r}G}+C_{2})} {(\mu _{I_{r}}^{2}+\mu _{G}^{2}+C_{1})(\sigma _{I_{r}}^{2}+\sigma _{G}^{2}+C_{2})}, \\ {}\tag{17}\end{align*}
where $\mu _{I_{r}}$, $\mu _{G}$, $\sigma _{I_{r}}$, $\sigma _{G}$, and $\sigma _{I_{r}G}$ represent the local means, standard deviations and cross-covariance for images $I_{r}$ and $G$, respectively. $C_{1}=(k_{1}L)^{2}$ and $C_{2}=(k_{2}L)^{2}$ are variables to stabilize the division with a weak denominator, where $L$ is the dynamic range of the pixel values, which is set to 255, and $k_{1}$ and $k_{2}$ are set to 0.01 and 0.03 in our experiments.
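The three metrics of Eqs. (15) to (17) can be sketched in a few lines. The code is illustrative only and makes two assumptions: images are 8-bit, so the peak value and dynamic range are 255, and the SSIM statistics are computed once over the whole image instead of over local windows, which is a simplification of the local statistics in Eq. (17). The function names are hypothetical.

```python
import numpy as np

def rmse(ref, gen):
    """Eq. (15): root mean squared error between two images."""
    ref, gen = np.asarray(ref, dtype=float), np.asarray(gen, dtype=float)
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def psnr(ref, gen, max_val=255.0):
    """Eq. (16): 10 * log10(MAX^2 / MSE); infinite for identical images."""
    ref, gen = np.asarray(ref, dtype=float), np.asarray(gen, dtype=float)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(ref, gen, L=255.0, k1=0.01, k2=0.03):
    """Eq. (17) with whole-image statistics (simplification: no sliding window)."""
    ref, gen = np.asarray(ref, dtype=float), np.asarray(gen, dtype=float)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_r, mu_g = ref.mean(), gen.mean()
    var_r, var_g = ref.var(), gen.var()
    cov = np.mean((ref - mu_r) * (gen - mu_g))
    num = (2 * mu_r * mu_g + c1) * (2 * cov + c2)
    den = (mu_r ** 2 + mu_g ** 2 + c1) * (var_r + var_g + c2)
    return float(num / den)

ref = np.arange(16, dtype=float).reshape(4, 4)
shifted = ref + 5.0   # constant intensity offset of 5 grey levels
```

For the toy pair above, `rmse(ref, shifted)` is 5 and `ssim_global(ref, shifted)` falls below 1 only through the luminance term, because a constant offset leaves the variances and the covariance unchanged.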
E. Examinations of Design Options
Tables 3 and 4 show the performance of the proposed network with different modules on the testing data from the two datasets. Fig. 7 shows an example LDCT image processed by the network with different strategies. The comparisons demonstrate that the inception-residual blocks (IRB), the residual mapping (RM), the multi-level joint discriminator (MLJD) and the combination of adversarial loss, perceptual loss, MSE loss and noise loss (CL) can significantly improve the image restoration in terms of PSNR and SSIM. Fig. 8 shows the absolute differences between the images processed by the different methods (Fig. 7(b) to (g)) and the normal-dose CT image (Fig. 7(a)), where it can be observed more clearly that the proposed method provides the smallest difference, indicating that it preserves the most details and suppresses the most noise and artifacts.
Results of removing noise from the low-dose lung CT image by the proposed network with different modules. (a) NDCT, (b) LDCT, (c) U-Net+VGG, (d) U-Net(IRB)+VGG, (e) U-Net(IRB+RM)+VGG, (f) U-Net(IRB+RM)+MLJD, (g) U-Net(IRB+RM)+MLJD+CL.
Absolute differences between the NDCT image and the de-noised images in Fig. 7. Brighter color represents larger difference. (a)LDCT, (b) U-Net+VGG, (c) U-Net(IRB)+VGG, (d) U-Net(IRB+RM)+VGG, (e) U-Net(IRB+RM)+MLJD, (f) U-Net(IRB+RM)+MLJD+CL.
Fig. 9 shows the pixel-wise similarity between the generated NDCT images and the ground truth NDCT images over the training process of the proposed method. It demonstrates that the proposed network converges after about 500 training iterations.
Pixel-wise similarity between the generated NDCT images and the ground truth NDCT images over the training process of the proposed method.
F. Comparisons With Other Models
1) Simulated Data
Two representative slices from the testing set are used to demonstrate the performance of our proposed method. Fig. 10 shows the de-noising results of applying different methods on a
Results of removing noise from the simulated low-dose lung CT image with noise level
Zoomed ROI images of the red rectangles in Fig. 10. (a) NDCT, (b) LDCT, (c) K-SVD, (d) BM3D, (e) KAIST-Net, (f) RED-CNN, (g) WGAN-VGG, (h) SAGAN, (i) the proposed network. The arrows indicate three regions containing features revealed differently by the competing algorithms.
Absolute differences between the NDCT image and the de-noised images in Fig. 10. Brighter color represents larger difference. (a) LDCT, (b) K-SVD, (c) BM3D, (d) KAIST-Net, (e) RED-CNN, (f) WGAN-VGG, (g) SAGAN, (h) the proposed network.
Fig. 13 shows the de-noising results from another
Results of removing noise from the simulated low-dose abdominal CT image with noise level
Zoomed ROI images of the red rectangles in Fig. 13. (a) NDCT, (b) LDCT, (c) K-SVD, (d) BM3D, (e) KAIST-Net, (f) RED-CNN, (g) WGAN-VGG, (h) SAGAN, (i) the proposed network. The arrows indicate three regions containing features revealed differently by the competing algorithms.
To further show the merits of the proposed method, the mean and standard deviation values of the subjective quality scores (as described in Step 2 in Section IV-D) of the images produced by the different methods are shown in Table 5. It can be seen that K-SVD and BM3D provide good noise suppression scores but low artifact reduction, contrast retention and lesion discrimination scores. KAIST-NET and RED-CNN give high scores in both noise suppression and artifact reduction, but lower scores in contrast retention. WGAN-VGG and SAGAN give good results in noise suppression, contrast retention and lesion discrimination but relatively low scores in artifact reduction. The proposed method provides scores of 3.63 ± 0.27, 3.53 ± 0.22, 3.23 ± 0.20 and 3.26 ± 0.24 for noise suppression, artifact reduction, contrast retention and lesion discrimination over the testing dataset. Given the subjective scores of the NDCT images, which are 3.65 ± 0.25, 3.58 ± 0.23, 3.27 ± 0.24 and 3.28 ± 0.26 respectively, the proposed method performs closest to the ground truth NDCT images. Compared with the state-of-the-art methods, which already provide good performance, our proposed method further improves the image quality, so that the recovered images have less noise and fewer artifacts while preserving more contrast, and different lesion regions can be better discriminated. Therefore, the overall quality score of the images from the proposed method (3.31 ± 0.23) is closer to that of the ground truth NDCT images (3.40 ± 0.25) than those of the comparators.
For quantitative evaluation, Table 6 shows the mean and standard deviation values of the PSNR, SSIM and RMSE obtained by applying the different methods to all the testing images with different noise levels. According to Eq. 15 and Eq. 17, the RMSE and SSIM values of the NDCT images, i.e., the ground truth images, are 0 and 1 respectively. Using these values as benchmarks, we find that the proposed method provides the PSNR, SSIM and RMSE values closest to those of the ground truth images at all noise levels, which confirms our previous qualitative observations. The p-values indicate that the higher PSNR and SSIM values and the lower RMSE values are statistically significant, which means that the better performance of the proposed method holds over the whole testing dataset.
2) Official Simulated Clinical Data
Fig. 15 shows a representative
Results of removing noise from the MAYO clinical low-dose CT image. (a) NDCT, (b) LDCT, (c) K-SVD, (d) BM3D, (e) KAIST-Net, (f) RED-CNN, (g) WGAN-VGG, (h) SAGAN, (i) the proposed network.
Zoomed ROI images of the red rectangles in Fig. 15. (a) NDCT, (b) LDCT, (c) K-SVD, (d) BM3D, (e) KAIST-Net, (f) RED-CNN, (g) WGAN-VGG, (h) SAGAN, (i) the proposed network. The arrows indicate three regions containing features revealed differently by the competing algorithms.
Table 7 shows the mean and standard deviation values of the subjective quality scores given by two experienced radiologists to the LDCT images processed by the different methods. The proposed method gets the highest scores in artifact reduction, contrast retention and lesion discrimination and the second highest score in noise suppression, and therefore achieves a higher overall score than the other methods.
To quantitatively evaluate our proposed method, Table 8 summarizes the mean and standard deviation values of the PSNR, SSIM and RMSE of all the testing LDCT images restored by the different methods. The proposed method achieves the best SSIM and the second best PSNR and RMSE (very close to the best ones, from RED-CNN), which accords with the aforementioned qualitative observations.
G. Discussions
The main target of the proposed method is to restore the LDCT image as close to the NDCT image as possible. As described in the above sections, the GAN method is adopted to recover the image by using a modified U-Net with inception-residual blocks in the short-cut connections as generator, proposing a multi-level joint discriminator consisting of multiple CNNs, and defining a novel loss function by integrating least square loss, VGG loss, MSE loss and noise loss. Compared with the state-of-the-art CT image de-noising methods, this modified GAN network is effective for better visual image quality with more structure details and less noise and artifacts.
The experimental results have demonstrated that, since the noise is non-uniformly distributed in CT images, the traditional image-patch-based de-noising methods, represented by K-SVD and BM3D, are likely to suffer from streaking artifacts adjacent to the high attenuation regions. In contrast, the deep learning based methods can provide higher image quality because they learn the structures and contents from a large amount of image data. Comparing the results of KAIST-NET and RED-CNN with those of WGAN-VGG, SAGAN and the proposed method, we find that the GAN-based methods can avoid the over-smoothing problem that usually occurs in MSE-based deep neural networks. This can be mainly attributed to the capability of the adversarial loss and the VGG loss to preserve visual details. However, the WGAN-VGG and SAGAN methods preserve the detailed structures at the expense of turning some noise into lesion-like structures, especially when the noise is adjacent to the lesion regions, so that the generated images look natural but are severely distorted for medical diagnosis. One reason for this phenomenon is that the generators in WGAN-VGG and SAGAN cannot eliminate the noise as it propagates through the network. The CNN generator in WGAN-VGG is likely to capture the noise features together with the structure features. The U-Net used in SAGAN performs better than the traditional CNN by using the multi-scale sampling strategy, but the noise still passes through the shortcut connections. The proposed method adds inception-residual blocks on each shortcut connection of the U-Net, so the noise can be strongly reduced while the visual structures are well preserved. Another reason for the “false lesion” phenomenon is that the discriminators used in both WGAN-VGG and SAGAN only compare the generated image with the ground truth image at one scale, missing noise that is easily confused with lesions or structures of tiny size.
The proposed method improves on these previous works by simultaneously computing the differences between the output of every deconvolutional layer and the correspondingly down-sampled ground truth image, and using them as the loss of the whole network, which further helps to avoid mimicking noise as lesions. Moreover, all the methods are evaluated on lung images and abdominal images from different datasets, and the p-values indicate that the differences in PSNR, SSIM and RMSE are statistically significant, which suggests that the better performance of the proposed method is robust.
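The multi-level comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the per-level discriminator is replaced by a plain MSE, the ground truth is down-sampled by 2x2 average pooling, and each coarser level is assumed to be exactly half the size of the previous one. The helper names `downsample2x` and `multi_level_mse` are hypothetical.

```python
import numpy as np


def downsample2x(img):
    """Halve height and width by averaging non-overlapping 2x2 blocks."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    h2, w2 = (h // 2) * 2, (w // 2) * 2      # drop an odd last row/column
    img = img[:h2, :w2]
    return img.reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))


def multi_level_mse(outputs, target):
    """Compare generator outputs at several scales with the ground truth.

    outputs: list of 2-D arrays ordered from full resolution to coarsest,
             standing in for the outputs of the deconvolutional layers.
    target:  full-resolution ground truth (NDCT) image.
    Returns one MSE value per level; MSE is a stand-in for the score of
    the per-level discriminator.
    """
    ref = np.asarray(target, dtype=np.float64)
    losses = []
    for level, out in enumerate(outputs):
        if level > 0:
            ref = downsample2x(ref)          # match the next coarser level
        out = np.asarray(out, dtype=np.float64)
        if out.shape != ref.shape:
            raise ValueError(f"level {level}: {out.shape} != {ref.shape}")
        losses.append(float(np.mean((out - ref) ** 2)))
    return losses
```

Evaluating every scale separately means that a small noise pattern which is averaged away at full resolution can still contribute to the loss at another level, which is the intuition behind comparing at multiple levels.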
Although the proposed method can generate images quite close to the ground truth NDCT image, some noise still remains in the generated image. The main reason is that the noise present in the input image is so visually similar to real structures that its features are extracted and transferred into the output image by the network, even with the proposed inception-residual blocks in the U-Net. In fact, this is a common problem for all deep learning based de-noising methods. How to learn the noise model and distinguish it from the structure model is a question that needs deeper research. Furthermore, our proposed network is an image-domain post-processing de-noising method, so it suffers from the information already lost in the input image during FBP reconstruction. A possibly better way to exploit the capability of deep neural networks to learn data patterns is to design a network that maps the sinogram of the LDCT image to one close to that of the NDCT image. This could be an interesting direction for our future research.
Another problem of the proposed method, which is also common to deep learning based methods, is that the generalization capability of the network is not as high as that of model-based image processing methods. This is mainly reflected in two aspects: 1) the network needs to be re-trained for data with different types of noise, even though we have trained the proposed network with a variety of medical images with different noise levels; and 2) the hyper-parameters, including the kernel sizes of the convolutional layers and the coefficients of the different parts of the loss function, need to be adjusted carefully when the dataset is changed.
H. Running Time
Table 9 shows the average running time of different methods used for recovering the LDCT images. On average, it takes our proposed method 1.84 seconds to process one
Conclusion
In conclusion, we have proposed an LSGAN network with a novel architecture for low-dose CT image de-noising. We incorporated the inception-residual block and residual mapping into the U-Net structure and applied it as the generator of the GAN. We proposed a novel multi-level joint discriminator to distinguish the output of each deconvolutional layer in the generator from the corresponding down-sampled ground truth image. The least squares adversarial loss, VGG-19 based perceptual loss, MSE based pixel loss and the noise loss were combined as the loss for optimizing the whole network. Experimental results on both simulated and official simulated clinical images have illustrated that the proposed method can effectively remove noise and artifacts from the image while preserving the structures and eliminating the false lesions. Therefore, the proposed method outperformed the state-of-the-art methods in both visual quality and quantitative assessments.
ACKNOWLEDGMENT
The authors would like to thank H. Wang, Q. Hu, and Y. Wang for helpful discussions and fruitful feedback along the way. Jianning Chi would also like to thank the editor, associate editor, and referees for comments and suggestions which greatly improved this paper.