Introduction
According to the WHO (World Health Organization), breast cancer has surpassed lung cancer as the most frequently diagnosed cancer and is the fifth leading cause of cancer mortality worldwide. In 2020, there were an estimated 2.3 million new cases of breast cancer among women alone and 685,000 deaths from the disease globally [1]. Breast cancer mortality has decreased by 40% in high-income countries since health authorities introduced regular mammography screening in the 1980s for age groups deemed to be at risk, in contrast to the situation in low- and middle-income countries [2]. Reducing breast cancer mortality globally therefore requires early diagnosis and treatment of the disease. Since the pathogenesis of breast cancer is still unknown and no proven preventive measures currently exist, early diagnosis remains the best medical course of action [3].
Currently, the early diagnosis of breast cancer requires specialized radiologists and mammologists, which makes mammography screening programs costly and harder to implement in low-income countries with a shortage of radiologists [2]. Mammography screening can also produce false positives, which cause patients and their families unneeded worry and anxiety, additional imaging tests, and occasionally needle biopsies [4]. In contrast, deep learning-based AI-assisted technology can streamline radiologists' evaluation of screening mammography images, increasing their effectiveness and precision. Because of this, deep learning has become a common technique for building mammography computer-aided detection/diagnosis (CAD) schemes [5].
Two views of each breast are typically obtained during a mammogram: a top-down view known as craniocaudal (CC) and a lateral view known as mediolateral oblique (MLO). To decide whether a mammogram is abnormal, radiologists search for certain findings, the most frequent being masses, calcifications, architectural distortions, and asymmetric densities [6]. The radiologist refers to as many views as possible when looking for abnormalities to determine whether a suspicious lesion is present. Figure 1 shows an example of a benign and a malignant breast lesion, whose border features are distinctly different: most benign lesions have smooth, lobulated borders, whereas malignant lesions have irregular, burr-like borders [7]. Comparing multiple views of the same breast can improve the detection rate of lesions and reduce the incidence of false positive results [8], [9]. Therefore, we consider the correlation between different views of the same breast and construct a global-local analysis method that uses different views.
The global-local analysis method combines features of the whole image with features of small localized patches for classification. Global analysis helps to detect abnormalities such as architectural distortion and asymmetric density of breast structures, and local analysis helps to detect abnormalities such as masses and calcifications. There are two mainstream global-local analysis methods. One extracts global features from the whole image and local features from within the region of interest, and then combines the extracted global and local features for classification [10], [11]. The other first trains a patch classifier and later fine-tunes it for whole-mammogram classification [12]. Although the performance of these two approaches has been comparable to that of medical experts, they still have shortcomings: the local features depend heavily on the accuracy of region-of-interest localization, which may yield suboptimal solutions. In response, many scholars have proposed improved global-local analysis methods. Petrini et al. [13], [14] proposed an architecture consisting of multiple CNN (convolutional neural network) paths, where each path extracts features from a different view; the outputs of all paths are then concatenated and sent to a fully connected layer for classification. Chen et al. [15] proposed a pure transformer model with local and global blocks to learn the dependencies between different views of the breast. Although these improved models use the complementary information of different views to learn inter-view dependencies, they do not explore cross-view information. In our view, the global features of the same breast are similar across views, while the local features should differ. When extracting local features, applying cross-attention between the features of different views helps to find the interrelationship between lesions and extract a more effective representation.
Cross-attention between different views can be achieved with cross transformers. The cross-attention mechanism of a cross transformer guides the transformer to learn the associations between different features during training and achieves an effective fusion of those features [16]. Cross transformers are now widely used in many fields. In time-domain speech enhancement, Wang et al. [16] used cross transformers to fuse local features extracted by local transformers with global features extracted by global transformers to obtain a better contextual feature representation, where the Q (query), K (key), and V (value) fed to the cross transformer come from different features. In few-shot object detection, Han et al. [17] achieved asymmetric batch cross-attention across branches by aggregating K and V from different features. In hyperspectral and multispectral image fusion, Wang et al. [18] added the cross-attention idea to the traditional transformer self-attention mechanism, achieving information fusion between two modalities by exchanging K between the modal features. In face recognition, Li et al. [19] used cross transformers that remove noisy race-related information while retaining useful identity features by exchanging V between different features.
In this paper, we propose the Local Cross-View Transformers and Global Representation Collaborating for Mammogram Classification (LCVT-GR) model. The model is trained end-to-end on images from different views. It analyzes the global representation and the local representation of mammogram images in parallel using a global-local parallel analysis method. When generating the local representation, a cross-view transformer is used to exchange information between the features of different views. Finally, a classifier fuses the local and global representations for the final prediction. The innovations of this paper are:
An improved global-local parallel analysis method for multi-view mammography images is proposed to analyze the global representation and local representation of the mammary gland in parallel.
A new local cross-view transformer is proposed to learn the dependencies between different views and achieve the information fusion between different views.
Materials and Methods
A. Data Collection
Two open-source digital mammography image datasets, the Mini Digital Dataset for Screening Mammography (Mini-DDSM) [20] and the Chinese Mammography Database (CMMD) [21], were used in this study. The Mini-DDSM includes 3904 breasts (1952 patients) from multiple centers: 1342 breasts were biopsy-confirmed benign, 1358 were biopsy-confirmed malignant, and 1204 were normal. The CMMD includes 2601 breasts (1775 patients) from multiple centers: 556 breasts were biopsy-confirmed benign, 1316 were biopsy-confirmed malignant, and 729 were normal. Each breast in both datasets has paired CC and MLO views, giving a total of 7808 mammogram images for Mini-DDSM and 5202 for CMMD. Figure 2 and Figure 3 show examples of the four views of a patient's left and right breast in the Mini-DDSM and CMMD datasets, respectively. In this study, the mammogram images from these two open-source datasets were classified, with normal and benign breasts considered positive cases and malignant breasts considered negative cases. We divided each dataset into a training set and a test set in the ratio of 80:20; the resulting numbers of images and breasts in each dataset are given in Table 1.
Four views for the right and left breast mammography images from the public datasets of Mini-DDSM.
Four views for the right and left breast mammography images from the public datasets of CMMD.
The Mini-DDSM used in this study was derived from Kaggle's processed Mini-DDSM 16-bit PNG (portable network graphics) images. For the CMMD dataset, we converted all raw DICOM images to lossless 8-bit JPEG (Joint Photographic Experts Group) images for subsequent processing.
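A minimal sketch of such a DICOM-to-8-bit conversion is shown below, assuming pydicom and Pillow; the file paths are placeholders, and note that Pillow's standard JPEG writer is lossy, so a truly lossless export as described here would require a codec with lossless-JPEG support (or PNG as a stand-in).

```python
import numpy as np
import pydicom
from PIL import Image


def dicom_to_8bit(dicom_path: str, out_path: str) -> None:
    """Read a DICOM mammogram and save it as an 8-bit image."""
    ds = pydicom.dcmread(dicom_path)
    arr = ds.pixel_array.astype(np.float32)

    # Min-max scale the 12/16-bit pixel values into the 0-255 range.
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-6)
    img8 = (arr * 255.0).round().astype(np.uint8)

    # Output format is inferred from the file extension.
    Image.fromarray(img8).save(out_path, quality=100)


if __name__ == "__main__":
    dicom_to_8bit("example_case/1-1.dcm", "example_case/1-1.jpg")  # hypothetical paths
```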
B. Data Pre-Processing
This section describes the data preprocessing pipeline in detail, together with the corresponding visualizations.
1) Breast Region Segmentation
The original images in the open-source Mini-DDSM dataset are large, with dimensions in the range of 495-2746 pixels. Directly resizing such large mammogram images may lose information about some lesions, leaving the model unable to learn from them [22].
As shown in Figure 2, the original mammogram images in the Mini-DDSM dataset frequently contain unwanted view-label markings, which can reduce classification accuracy.
As shown in Figure 3, the original mammogram images in the CMMD dataset contain a significant amount of redundant regions. Lesions are only present in the breast region, which accounts for less than half of the mammogram image. Besides being useless for classifying lesions as benign or malignant, redundant regions also interfere with model training and raise computational costs [23].
To solve the above problems, we use the BRS (breast region segmentation) module to pre-process the original mammogram images. Figure 4 illustrates this module on the Mini-DDSM dataset. The module first replaces pixel values larger than 254 with 0, then detects the edges of the tissue to obtain the coordinates of the four corners of the ROI (region of interest) box. Based on these coordinates, the breast region is cropped from the original image. Finally, the breast region is resized to a fixed size. A minimal code sketch of this procedure is given after Figure 4.
Processing of the BRS module on the Mini-DDSM dataset. (a) The original image; (b) ROI rectangle; (c) The cropped image; (d) The resized image.
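The following is a minimal sketch of the BRS steps described above, assuming an 8-bit grayscale input and OpenCV/NumPy; the contour-based ROI detection is one possible realization of the edge-detection step, and the target size (TARGET_H, TARGET_W) is a placeholder since the fixed size is not restated here.

```python
import cv2
import numpy as np

TARGET_H, TARGET_W = 1024, 512  # assumed output size, not taken from the paper


def breast_region_segmentation(img: np.ndarray) -> np.ndarray:
    """img: 8-bit grayscale mammogram of shape (H, W)."""
    work = img.copy()
    work[work > 254] = 0                      # suppress saturated label/marker pixels

    # Binarize the tissue and take the largest connected region as the breast.
    _, mask = cv2.threshold(work, 10, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)    # corners of the ROI box

    cropped = img[y:y + h, x:x + w]           # crop the breast region from the original
    return cv2.resize(cropped, (TARGET_W, TARGET_H), interpolation=cv2.INTER_AREA)
```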
2) Data Augmentation
To improve the robustness and generalization of the model, we augmented the training set images with four data augmentation methods, each applied with probability 0.5: a) flipping the image vertically; b) flipping the image horizontally; c) an affine transformation with a rotate value of 20, a translate_percent value of 0.1, a shear value of 20, and a scale range of 0.8 to 1.2; and d) an elastic transformation with an alpha value of 10 and a sigma value of 15. All images in the dataset were also normalized. The final training set contains eight times as many images as before augmentation: 49,952 training images for Mini-DDSM and 33,312 for CMMD. Figure 5 depicts examples of augmented mammography images for these four methods.
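A sketch of this augmentation pipeline using Albumentations is given below; the library actually used is not stated, single values from the text are interpreted as symmetric ranges (e.g. a rotate value of 20 becomes ±20 degrees), and the normalization constants are assumed.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.VerticalFlip(p=0.5),                                    # a) vertical flip
    A.HorizontalFlip(p=0.5),                                  # b) horizontal flip
    A.Affine(rotate=(-20, 20), translate_percent=(-0.1, 0.1),
             shear=(-20, 20), scale=(0.8, 1.2), p=0.5),       # c) affine transform
    A.ElasticTransform(alpha=10, sigma=15, p=0.5),            # d) elastic transform
    A.Normalize(mean=0.5, std=0.5),                           # assumed statistics
    ToTensorV2(),
])
```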
C. Proposed LCVT-GR Overall Architecture
The overall architecture of LCVT-GR is shown in Figure 6. The inputs to the model are the images of the two views of a breast (CC view and MLO view). A shared backbone network first extracts feature maps from the two input images; these features are then processed in parallel by the local cross-view transformers module (LCVTM) and the global representation module (GRM), and a classifier fuses the resulting local and global representations for the final prediction.
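The simplified PyTorch sketch below illustrates this parallel global-local forward pass, assuming a timm EfficientNetV2-S backbone (the backbone named in the ablation section) and placeholder LCVTM/GRM modules defined elsewhere; the constructor arguments feat_dim and local_dim are hypothetical, and this is not the authors' exact implementation.

```python
import timm
import torch
import torch.nn as nn


class LCVTGR(nn.Module):
    def __init__(self, lcvtm: nn.Module, grm: nn.Module, feat_dim: int, local_dim: int):
        super().__init__()
        # Shared backbone producing spatial feature maps U_CC and U_MLO.
        self.backbone = timm.create_model(
            "tf_efficientnetv2_s", pretrained=True, num_classes=0, global_pool="")
        self.lcvtm = lcvtm            # local cross-view transformers module
        self.grm = grm                # global representation module
        self.classifier = nn.Linear(local_dim + feat_dim, 1)  # benign/malignant logit

    def forward(self, x_cc: torch.Tensor, x_mlo: torch.Tensor) -> torch.Tensor:
        u_cc = self.backbone(x_cc)    # (B, C, H, W)
        u_mlo = self.backbone(x_mlo)  # (B, C, H, W)
        local_repr = self.lcvtm(u_cc, u_mlo)   # cross-view local representation
        global_repr = self.grm(u_cc, u_mlo)    # global representation
        fused = torch.cat([local_repr, global_repr], dim=1)
        return self.classifier(fused)
```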
D. Local Cross-View Transformers Module
LCVTM models the local semantic relationship between the two views to generate a local representation. To better learn the dependencies between the two views, LCVTM adopts the idea of cross transformers, and its attention mechanism follows the cross-shaped window self-attention method of CSWin [24]. Since our LCVT-GR model takes the information of two views as input, the self-attention is replaced by cross-view attention so that information can interact between the CC and MLO features, as described below.
1) Cross-Shaped Window Self-Attention
Transformers use a self-attention mechanism to model contextual information and capture long-range dependencies. However, this pixel-pair-based modeling requires a significant amount of computation, typically quadratic in the input feature size [25]. The computational cost is therefore very high when the input feature map resolution is high. To address this, the Swin Transformer [26] uses local-window self-attention and broadens the receptive field through shifted windows. However, this still leaves the attention area of a token within a transformer block constrained. To expand the attention area more effectively, CSWin [24] proposed the cross-shaped window self-attention mechanism, which computes self-attention within the horizontal and vertical stripes that form a cross-shaped window, giving the token within each transformer block a wider receptive field and stronger contextual modeling capability.
The schematic diagram of cross-shaped window self-attention is shown in Figure 7. Cross-shaped window self-attention is built on multi-head self-attention: the input feature is first linearly projected into K heads, and the heads are split into two groups that compute attention within horizontal stripes and vertical stripes, respectively. In LCVTM, the outputs of the two cross-attention branches are pooled and concatenated to form the local representation:
\begin{align*}
\mathrm{LCVTM}\left( \mathrm{U}_{\mathrm{CC}},\mathrm{U}_{\mathrm{MLO}} \right) &= \mathrm{Concat}\left( \mathrm{GAP}\left( \mathrm{A} \right),\mathrm{GAP}\left( \mathrm{B} \right) \right) \tag{1}\\
\mathrm{A} &= \mathrm{Concat}\left( \mathrm{A}_{1},\ldots,\mathrm{A}_{\mathrm{k}},\ldots,\mathrm{A}_{\mathrm{K}} \right)\mathrm{W}^{\mathrm{O}}, \quad \mathrm{k}=1,\ldots,\mathrm{K} \tag{2}\\
\mathrm{B} &= \mathrm{Concat}\left( \mathrm{B}_{1},\ldots,\mathrm{B}_{\mathrm{k}},\ldots,\mathrm{B}_{\mathrm{K}} \right)\mathrm{W}^{\mathrm{O}}, \quad \mathrm{k}=1,\ldots,\mathrm{K} \tag{3}
\end{align*}
where GAP denotes global average pooling, $\mathrm{W}^{\mathrm{O}}$ is the output projection matrix, and A and B are multi-head cross-view attention outputs whose heads $\mathrm{A}_{\mathrm{k}}$ are defined in the next subsection.
2) Cross-View Attention Module
For the cross-attention of the two views, the feature maps $\mathrm{U}_{\mathrm{CC}}, \mathrm{U}_{\mathrm{MLO}} \in \mathrm{R}^{\mathrm{H\times W\times C}}$ are first partitioned into non-overlapping horizontal and vertical stripes of width sw:
\begin{align*}
\left[ \mathrm{U}_{\mathrm{CC}}^{1},\ldots,\mathrm{U}_{\mathrm{CC}}^{\mathrm{m}},\ldots,\mathrm{U}_{\mathrm{CC}}^{\mathrm{M}} \right] &= \mathrm{U}_{\mathrm{CC}}, \quad \mathrm{U}_{\mathrm{CC}}^{\mathrm{m}} \in \mathrm{R}^{\left( \mathrm{sw\times W} \right)\times \mathrm{C}}, \quad \mathrm{M=H/sw} \tag{4}\\
\left[ \mathrm{U}_{\mathrm{CC}}^{1},\ldots,\mathrm{U}_{\mathrm{CC}}^{\mathrm{z}},\ldots,\mathrm{U}_{\mathrm{CC}}^{\mathrm{Z}} \right] &= \mathrm{U}_{\mathrm{CC}}, \quad \mathrm{U}_{\mathrm{CC}}^{\mathrm{z}} \in \mathrm{R}^{\left( \mathrm{sw\times H} \right)\times \mathrm{C}}, \quad \mathrm{Z=W/sw} \tag{5}\\
\left[ \mathrm{U}_{\mathrm{MLO}}^{1},\ldots,\mathrm{U}_{\mathrm{MLO}}^{\mathrm{m}},\ldots,\mathrm{U}_{\mathrm{MLO}}^{\mathrm{M}} \right] &= \mathrm{U}_{\mathrm{MLO}}, \quad \mathrm{U}_{\mathrm{MLO}}^{\mathrm{m}} \in \mathrm{R}^{\left( \mathrm{sw\times W} \right)\times \mathrm{C}}, \quad \mathrm{M=H/sw} \tag{6}\\
\left[ \mathrm{U}_{\mathrm{MLO}}^{1},\ldots,\mathrm{U}_{\mathrm{MLO}}^{\mathrm{z}},\ldots,\mathrm{U}_{\mathrm{MLO}}^{\mathrm{Z}} \right] &= \mathrm{U}_{\mathrm{MLO}}, \quad \mathrm{U}_{\mathrm{MLO}}^{\mathrm{z}} \in \mathrm{R}^{\left( \mathrm{sw\times H} \right)\times \mathrm{C}}, \quad \mathrm{Z=W/sw} \tag{7}
\end{align*}
where the superscripts m and z index the horizontal and vertical stripes, respectively.
Assuming that the Q (query), K (key), and V (value) projections of the k-th head all have dimension $\mathrm{d}_{\mathrm{k}}$, the output of the k-th cross-view attention head is defined as
\begin{align*}
\mathrm{A}_{\mathrm{k}} &= \begin{cases} {\mathrm{H\text{-}CAttn}}_{\mathrm{k}}\left( \mathrm{U}_{\mathrm{CC}},\mathrm{U}_{\mathrm{MLO}} \right) = \left[ \mathrm{A}_{\mathrm{k}}^{1},\ldots,\mathrm{A}_{\mathrm{k}}^{\mathrm{m}},\ldots,\mathrm{A}_{\mathrm{k}}^{\mathrm{M}} \right], & \mathrm{k}=1,\ldots,\mathrm{K}/2 \\ {\mathrm{V\text{-}CAttn}}_{\mathrm{k}}\left( \mathrm{U}_{\mathrm{CC}},\mathrm{U}_{\mathrm{MLO}} \right) = \left[ \mathrm{A}_{\mathrm{k}}^{1},\ldots,\mathrm{A}_{\mathrm{k}}^{\mathrm{z}},\ldots,\mathrm{A}_{\mathrm{k}}^{\mathrm{Z}} \right], & \mathrm{k}=\mathrm{K}/2+1,\ldots,\mathrm{K} \end{cases} \tag{8}\\
\mathrm{A}_{\mathrm{k}}^{\mathrm{m}} &= \mathrm{CVAM}\left( \mathrm{Q}_{\mathrm{MLO}}^{\mathrm{m}},\mathrm{K}_{\mathrm{CC}}^{\mathrm{m}},\mathrm{V}_{\mathrm{CC}}^{\mathrm{m}} \right) \tag{9}\\
\mathrm{A}_{\mathrm{k}}^{\mathrm{z}} &= \mathrm{CVAM}\left( \mathrm{Q}_{\mathrm{MLO}}^{\mathrm{z}},\mathrm{K}_{\mathrm{CC}}^{\mathrm{z}},\mathrm{V}_{\mathrm{CC}}^{\mathrm{z}} \right) \tag{10}\\
\mathrm{Q}_{\mathrm{MLO}}^{\mathrm{m}} &= \mathrm{U}_{\mathrm{MLO}}^{\mathrm{m}}\mathrm{W}_{\mathrm{k}}^{\mathrm{Q}}, \quad \mathrm{K}_{\mathrm{CC}}^{\mathrm{m}} = \mathrm{U}_{\mathrm{CC}}^{\mathrm{m}}\mathrm{W}_{\mathrm{k}}^{\mathrm{K}}, \quad \mathrm{V}_{\mathrm{CC}}^{\mathrm{m}} = \mathrm{U}_{\mathrm{CC}}^{\mathrm{m}}\mathrm{W}_{\mathrm{k}}^{\mathrm{V}} \tag{11}\\
\mathrm{Q}_{\mathrm{MLO}}^{\mathrm{z}} &= \mathrm{U}_{\mathrm{MLO}}^{\mathrm{z}}\mathrm{W}_{\mathrm{k}}^{\mathrm{Q}}, \quad \mathrm{K}_{\mathrm{CC}}^{\mathrm{z}} = \mathrm{U}_{\mathrm{CC}}^{\mathrm{z}}\mathrm{W}_{\mathrm{k}}^{\mathrm{K}}, \quad \mathrm{V}_{\mathrm{CC}}^{\mathrm{z}} = \mathrm{U}_{\mathrm{CC}}^{\mathrm{z}}\mathrm{W}_{\mathrm{k}}^{\mathrm{V}} \tag{12}
\end{align*}
where H-CAttn and V-CAttn denote cross-view attention within horizontal and vertical stripes, $\mathrm{W}_{\mathrm{k}}^{\mathrm{Q}}$, $\mathrm{W}_{\mathrm{k}}^{\mathrm{K}}$, and $\mathrm{W}_{\mathrm{k}}^{\mathrm{V}}$ are the projection matrices of the k-th head, queries are taken from the MLO view while keys and values come from the CC view, and CVAM is the cross-view attention operation defined in (13).
\begin{align*}
\mathrm{CVAM}\left( \mathrm{Q}_{\mathrm{MLO}}^{\mathrm{m}},\mathrm{K}_{\mathrm{CC}}^{\mathrm{m}},\mathrm{V}_{\mathrm{CC}}^{\mathrm{m}} \right) = \mathrm{softmax}\left( \frac{\mathrm{Q}_{\mathrm{MLO}}^{\mathrm{m}}\left( \mathrm{K}_{\mathrm{CC}}^{\mathrm{m}} \right)^{\mathrm{T}}}{\sqrt{\mathrm{d}_{\mathrm{k}}}} \right)\mathrm{V}_{\mathrm{CC}}^{\mathrm{m}} \tag{13}
\end{align*}
Figure 8 illustrates how the output of the local cross-view transformers module is obtained from the per-head cross-view attention results.
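The following is a minimal single-head PyTorch sketch of the horizontal-stripe cross-view attention in Eqs. (4)-(13): queries come from the MLO feature map and keys/values from the CC feature map, with attention computed independently within each horizontal stripe of width sw. Head splitting, the vertical-stripe branch, and the output projection $\mathrm{W}^{\mathrm{O}}$ are omitted for brevity, so this is an illustration rather than the full module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HStripeCrossViewAttention(nn.Module):
    def __init__(self, channels: int, sw: int = 1):
        super().__init__()
        self.sw = sw
        self.q = nn.Linear(channels, channels)   # W^Q applied to the MLO view
        self.k = nn.Linear(channels, channels)   # W^K applied to the CC view
        self.v = nn.Linear(channels, channels)   # W^V applied to the CC view

    def forward(self, u_cc: torch.Tensor, u_mlo: torch.Tensor) -> torch.Tensor:
        b, c, h, w = u_cc.shape                   # assumes H is divisible by sw
        m = h // self.sw                          # number of horizontal stripes, M = H / sw

        # Partition both views into stripes of shape (sw * W) x C, Eqs. (4) and (6).
        def to_stripes(x: torch.Tensor) -> torch.Tensor:
            x = x.permute(0, 2, 3, 1)                       # (B, H, W, C)
            return x.reshape(b * m, self.sw * w, c)         # (B*M, sw*W, C)

        cc, mlo = to_stripes(u_cc), to_stripes(u_mlo)

        q, k, v = self.q(mlo), self.k(cc), self.v(cc)       # Eq. (11)
        attn = F.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)  # Eq. (13)
        out = attn @ v                                       # (B*M, sw*W, C)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)   # back to (B, C, H, W)
```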
3) Global Representation Module
Although LCVTM adopts CSWin's cross-shaped window self-attention method, which effectively expands the attention region, it lacks the global information of the image. We therefore designed the GRM component to extract the global information of the image. GRM combines the features of the two views and pools them into a global representation:
\begin{equation*} \mathrm{GRM}\left( \mathrm{U}_{\mathrm{CC}},\mathrm{U}_{\mathrm{MLO}} \right) = \mathrm{GMP}\left( \mathrm{Concat}\left( \mathrm{U}_{\mathrm{CC}},\mathrm{U}_{\mathrm{MLO}} \right) \right) \tag{14}\end{equation*}
The global representation generated by GRM is extracted from the whole image and compensates for LCVTM's lack of global information. The effectiveness of the GRM component is demonstrated in the ablation experiment section.
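A direct reading of Eq. (14) as a PyTorch module is sketched below: the two view feature maps are concatenated along the channel dimension and reduced by a global pooling step. Taking GMP to be global max pooling is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalRepresentationModule(nn.Module):
    def forward(self, u_cc: torch.Tensor, u_mlo: torch.Tensor) -> torch.Tensor:
        x = torch.cat([u_cc, u_mlo], dim=1)   # (B, 2C, H, W), Concat in Eq. (14)
        x = F.adaptive_max_pool2d(x, 1)       # global max pooling over H x W (assumed GMP)
        return x.flatten(1)                   # (B, 2C) global representation
```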
Results and Discussion
A. Implementation Details
We use the AdamW optimizer [28] to train the models, minimizing the binary cross-entropy (BCE) loss with a learning rate of 0.0001 and a weight decay of 0.01. The batch size is 8 for the multi-view model and 16 for the single-view model, and all models are trained for 20 epochs. We use the OneCycleLR scheduler to control the learning rate during training, with a maximum learning rate of 0.0001 and 0.1 as the fraction of the cycle spent increasing the learning rate. All experiments are implemented in PyTorch and performed on an NVIDIA GTX 1080 Ti GPU (12 GB).
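The sketch below shows one way to set up this optimization configuration in PyTorch (learning rate 1e-4, weight decay 0.01, BCE loss, OneCycleLR with 10% of the cycle spent increasing the learning rate); `model` and `train_loader` are assumed to be defined elsewhere, and the scheduler is stepped once per batch as is conventional for OneCycleLR.

```python
import torch


def build_training(model, train_loader, epochs: int = 20):
    criterion = torch.nn.BCEWithLogitsLoss()                       # BCE loss on logits
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-4, pct_start=0.1,                     # 10% warm-up phase
        epochs=epochs, steps_per_epoch=len(train_loader))
    return criterion, optimizer, scheduler
```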
B. Evaluation Metrics
Following [14], [27], we evaluate the classification results with two metrics: the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR). Both are common metrics for assessing the performance of radiologists, and they enable the assessment of model performance and the comparison of differences between models. The AUC-ROC reflects the balance between the model's TPR (true positive rate, also called recall) and FPR (false positive rate) at different probability thresholds; the higher the AUC-ROC, the better the model distinguishes positive from negative cases. The TPR and FPR are defined as:
\begin{align*} \mathrm{TPR} &= \frac{\mathrm{TP}}{\mathrm{TP+FN}} \tag{15}\\ \mathrm{FPR} &= \frac{\mathrm{FP}}{\mathrm{FP+TN}} \tag{16}\end{align*}
Similarly, the AUC-PR reflects the balance between precision and recall at different probability thresholds, where precision is defined as:
\begin{equation*} \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP+FP}} \tag{17}\end{equation*}
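Both metrics can be computed from the true labels and predicted probabilities with scikit-learn, as in the generic recipe below (this is not tied to the authors' evaluation code).

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score


def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """y_true: binary labels; y_prob: predicted probabilities of the positive class."""
    auc_roc = roc_auc_score(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    auc_pr = auc(recall, precision)          # area under the precision-recall curve
    return {"AUC-ROC": auc_roc, "AUC-PR": auc_pr}
```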
C. Results and Discussion
Table 2 shows the results of comparing our model with several two-view mammogram image classification models as well as traditional CNN classification models on the Mini-DDSM and CMMD test datasets. When training the competing models, the batch size is no longer fixed at 8 or 16 but is set as large as possible to make full use of the GPU memory; the batch size used for each model is given in Table 2. PHResNet18 [10] is a two-view breast cancer classification method based on parameterized hypercomplex neural networks proposed by Lopez et al.; it uses ResNet18 as the backbone and models the correlations between different views through the algebraic properties of hypercomplex numbers. The breast-wide model [14] is a two-branch, two-view breast cancer classification model with ResNet22 as the backbone, proposed by Wu et al.; each branch extracts features from one view, and the extracted features are finally aggregated for prediction. The two-views-classifier [13] is a two-view breast cancer classification model based on three-stage transfer learning proposed by Petrini et al.; it first trains a patch classifier on natural images, then trains a one-view classifier from the patch classifier weights, and finally trains a two-view classifier from the one-view classifier weights. Since we only compare against the network structure of the two-views-classifier, the transfer learning procedure described in that paper is not used.
We used bootstrap resampling (2000 bootstrap repetitions) to estimate 95% CIs (confidence intervals) and report the mean, lower, and upper values of the 95% CI for both AUC-ROC and AUC-PR in Table 2. In addition, as shown in Fig. 9 and Fig. 10, we plot the ROC and PR curves of our model and the competing models on the Mini-DDSM and CMMD datasets to visualize classification performance. The test results show that our model significantly outperforms the competing models. On the Mini-DDSM test set, AUC-ROC reaches 85.85% and AUC-PR reaches 65.76%, an average improvement of 9.24% in AUC-ROC and 20.6% in AUC-PR over the competing models. On the CMMD test set, AUC-ROC reaches 87.12% and AUC-PR reaches 89.03%, an average improvement of 6.58% in AUC-ROC and 6.99% in AUC-PR.
The ROC curves of comparison results of different models on different datasets. (a) Mini-DDSM; (b) CMMD.
The PR curves of comparison results of different models on different datasets. (a) Mini-DDSM; (b) CMMD.
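The 95% confidence intervals above were estimated by bootstrap resampling; a generic sketch of this procedure (2000 repetitions, percentile interval) is shown below and can be applied to any metric function such as `roc_auc_score`.

```python
import numpy as np


def bootstrap_ci(y_true, y_prob, metric_fn, n_boot: int = 2000, seed: int = 0):
    """y_true, y_prob: NumPy arrays; metric_fn: e.g. sklearn.metrics.roc_auc_score."""
    rng = np.random.default_rng(seed)
    scores = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample cases with replacement
        if len(np.unique(y_true[idx])) < 2:       # skip degenerate resamples
            continue
        scores.append(metric_fn(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), float(lower), float(upper)
```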
The variability of features can be assessed qualitatively using t-SNE (t-distributed stochastic neighbor embedding) [29] plots. For the training and test sets of the Mini-DDSM and CMMD datasets, we plotted the features from both the local and global analysis using t-SNE. As shown in Fig. 11, our model extracts discriminative, well-separated features for classification.
t-SNE plots of the (a) Mini-DDSM training set; (b) Mini-DDSM test set; (c) CMMD training set; (d) CMMD test set.
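A sketch of this visualization with scikit-learn and Matplotlib is shown below; the `features` and `labels` arrays are assumed to have been produced by the model's feature extraction step.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Embed high-dimensional features into 2-D with t-SNE and color by class."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for cls in np.unique(labels):
        pts = embedded[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=f"class {cls}")
    plt.title(title)
    plt.legend()
    plt.show()
```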
D. Ablation Studies
By comparing LCVT-GR with several two-view mammogram image classification models as well as traditional CNN classification models, we have shown that LCVT-GR achieves the best classification performance, which also shows that our strategy of using the global-local parallel analysis method to mimic manual analysis is effective. In addition, to validate the soundness of the LCVT-GR design, we examine its structure through three ablation experiments.
1) Key Components
To evaluate the effectiveness of the model components, we conducted ablation experiments on each component of LCVT-GR; Table 3 shows the results on both the Mini-DDSM and CMMD datasets. LTM denotes a variant of LCVTM in which no interaction between the two views is performed. On the Mini-DDSM dataset, single-view classification using the backbone network (tf_efficientnetv2_s) achieves 84.04% AUC-ROC and 61.5% AUC-PR (first row of Table 3).
As shown in the second and third rows of Table 3, using Backbone+LCVTM or Backbone+GRM for multi-view classification improves the AUC-ROC or AUC-PR of the test results, indicating that the LCVTM and GRM components are effective. Accordingly, when Backbone+LCVTM+GRM (LCVT-GR) is used for multi-view classification, the gains of the two components combine, raising AUC-ROC to 85.85% and AUC-PR to 65.76% (sixth row of Table 3). In addition, comparing the fourth, fifth, and sixth rows of Table 3 shows that applying cross-attention between the two views when extracting the local representation improves AUC-ROC by 1% and AUC-PR by 1.05%. The multi-view classification models also give better predictions than the single-view model, with a maximum improvement of 1.99% in AUC-ROC and 3.09% in AUC-PR. The ablation results on the Mini-DDSM and CMMD datasets follow the same pattern. We therefore conclude that all components of LCVT-GR are effective and that its structure is well chosen.
2) Dynamic Stripe Width
In Table 4, we examine the trade-off between the stripe width (sw) and model performance on the Mini-DDSM dataset. We find that the computational cost (FLOPs) increases as the stripe width increases, but model performance does not increase with it. We conclude that setting sw to 1 attains better model performance at the lowest computational cost.
3) Number of Heads
In Table 5, we examine the trade-off between the number of heads (K) and model performance on the Mini-DDSM dataset. We find that performance improves noticeably as K increases at first, but degrades once K becomes too large. The value of K affects model performance but does not change the computational cost (FLOPs). With sw set to 1, setting K to 4 gives the best model performance. Based on these ablation results, the LCVT-GR model uses sw = 1 and K = 4 by default in all experiments in this paper.
In summary, the first ablation experiment demonstrates that a) the key components of LCVT-GR are all effective, b) classification results are better when the information of the two views interacts than when it does not, and c) the two-view structure performs better than the single-view structure. This is in line with our initial assumptions: the images of the two views of a breast often contain complementary information, so the classification performance of the two-view model is better than that of the single-view model, and the information interaction between the two views captures more dependencies, which is important for improving the detection rate of lesions. In the second and third experiments, we ablated the two model variables sw and K and determined that the model achieves a good trade-off between computational cost and performance when sw = 1 and K = 4. These three ablation experiments show that the structure of LCVT-GR is well founded and effective.
Conclusion
In this paper, we propose a new multi-view mammography image classification method that uses a two-view global-local parallel analysis approach to extract global and local information from mammography images. Global analysis helps to detect abnormalities such as architectural distortion and asymmetric density of breast structures, and local analysis helps to detect abnormalities such as masses and calcifications. To better learn the dependencies between the two views and realize the exchange of information between their features, we employ the cross-transformer concept when extracting local information.
To validate the effectiveness of our method, we conducted comparison and ablation experiments on two publicly available datasets, Mini-DDSM and CMMD. The comparison experiments show that our method outperforms existing advanced methods, with clear improvements in both the AUC-ROC and AUC-PR metrics. The ablation experiments show that our model architecture is well founded and effective and achieves a good trade-off between computational cost and model performance.
Our proposed method can help build high-performance and robust deep learning-based mammography CAD systems to improve efficiency and reduce cost in the early diagnosis of breast cancer. Since the public datasets used in this study are still relatively small, we will further validate our model on larger datasets in future work.
Conflicts of Interest
None declared.