
Local Cross-View Transformers and Global Representation Collaborating for Mammogram Classification



Abstract:


When analyzing screening mammography images, radiologists compare multiple views of the same breast to help improve the detection rate of lesions and reduce the incidence of false-positive results. Therefore, for a deep learning-based mammography computer-aided detection/diagnosis (CAD) system to meet radiologists' requirements for accuracy and generality, the deep learning model needs to mimic manual analysis and consider the correlation between different views of the same breast. In this paper, we propose the Local Cross-View Transformers and Global Representation Collaborating for Mammogram Classification (LCVT-GR) model. The model is trained end-to-end on images from different views. In this model, the global and local representations of mammogram images are analyzed in parallel using the global-local parallel analysis method. To validate the effectiveness of our method, we conducted comparison experiments and ablation experiments on two publicly available datasets, Mini-DDSM and CMMD. The results of the comparison experiments show that our method achieves better results than existing advanced methods, with notable improvements in both the AUC-ROC and AUC-PR metrics. The results of the ablation experiments show that our model architecture is sound and effective and achieves a good trade-off between computational cost and model performance.
Published in: IEEE Access ( Volume: 12)
Page(s): 74596 - 74606
Date of Publication: 17 May 2024
Electronic ISSN: 2169-3536


SECTION I.

Introduction

According to the WHO (World Health Organization), breast cancer has surpassed lung cancer as the most frequently diagnosed cancer and is the fifth leading cause of cancer mortality worldwide. Globally, there were 2.3 million new cases of breast cancer in women in 2020 alone, and 685,000 people died from the disease [1]. Breast cancer mortality has decreased by 40% in high-income countries since health authorities introduced regular mammography screening in the 1980s for age groups deemed to be at risk, in contrast to the situation in low- and middle-income countries [2]. Reducing breast cancer mortality globally therefore requires early diagnosis and treatment of the disease. Since the pathogenesis of breast cancer is still unknown and there are currently no proven preventive measures, early diagnosis remains the best medical course of action [3].

Currently, the early diagnosis of breast cancer requires specialized radiologists and mammologists, which makes mammography screening programs costly to implement, especially in low-income countries with a shortage of radiologists [2]. Mammography screening can also produce false positives, which cause patients and their families unnecessary worry and anxiety, lead to additional imaging tests, and occasionally result in needle biopsies [4]. In contrast, deep learning-based AI-assisted technology can streamline radiologists' evaluation of screening mammography images, increasing their effectiveness and precision. Because of this, deep learning has become a common technique for creating mammography computer-aided detection/diagnosis (CAD) schemes [5].

Two views of each breast are typically obtained during a mammogram: a top-down view known as craniocaudal (CC) and a lateral view known as mediolateral oblique (MLO). To determine whether a mammogram is abnormal, radiologists search for certain abnormalities, the most frequent being masses, calcifications, structural distortions, and asymmetric densities [6]. The radiologist refers to as many views as possible when looking for abnormalities to determine whether a suspicious lesion is present. Figure 1 shows examples of benign and malignant breast lesions, where the border features of a malignant lesion are distinctly different from those of a benign lesion. Most benign lesions have smooth, lobulated borders, whereas malignant lesions have irregular, burr-like borders [7]. Comparing multiple views of the same breast can help improve the detection rate of lesions and reduce the incidence of false-positive results [8], [9]. Therefore, we consider the correlation between different views of the same breast and construct a global-local analysis method that uses different views.

FIGURE 1. Examples of benign and malignant breast lesions.

The global-local analysis method combines features of the whole image with features of small localized patches for classification. Global analysis helps to detect abnormalities such as distortion and asymmetric density of breast structures, while local analysis helps to detect abnormalities such as masses and calcifications. There are two mainstream global-local analysis methods: one extracts global features from the whole image and local features from within the region of interest, and then combines the extracted global and local features for classification [10], [11]; the other first trains a patch classifier and later fine-tunes it for whole-mammogram classification [12]. Although the performance demonstrated by these two methods has been comparable to that of medical experts, they still have shortcomings: the local features depend too heavily on the accuracy of region-of-interest localization, which may lead to suboptimal solutions. In response, many scholars have proposed improved global-local analysis methods. Petrini et al. [13], [14] proposed an architecture consisting of multiple CNN (convolutional neural network) paths, where each path extracts features from a different view, the features output by all paths are concatenated, and the result is finally sent to a fully connected layer for classification. Chen et al. [15] proposed a pure transformer model with local and global blocks to learn the dependencies between different views of the breast. Although these improved models use the complementary information of different views to learn inter-view dependencies, they do not explore cross-view information. In our view, the global features of different views of the same breast are similar, whereas the local features differ. When extracting local features, applying cross-attention between features from different views helps capture the interrelationships between lesions and extract more effective representation information.

Cross-attention between different views can be achieved with cross transformers. The cross-attention mechanism of a cross transformer guides the transformer to learn the associations between different features during training and achieves an effective fusion of those features [16]. Cross transformers have already been widely used in many fields. In time-domain speech enhancement, Wang et al. [16] used cross transformers to fuse local features extracted by local transformers and global features extracted by global transformers to obtain a better contextual feature representation, where the Q (query) and the K (key) and V (value) input to the cross transformer come from different features. In few-shot object detection, Han et al. [17] achieved asymmetric batched cross-attention across branches by aggregating K and V from different features. In hyperspectral and multispectral image fusion, Wang et al. [18] added the cross-attention idea to the traditional transformer self-attention mechanism, achieving information fusion between two modalities by exchanging the K of the different modal features. In face recognition, Li et al. [19] used cross transformers that remove noisy race-related information while retaining useful identity features by exchanging the V of different features.

In this paper, we propose the Local Cross-View Transformers and Global Representation Collaborating for Mammogram Classification (LCVT-GR) model. The model is trained end-to-end on images from different views. In this model, the global representation and local representation of mammogram images are analyzed in parallel using the global-local parallel analysis method. When generating the local representation, a cross-view transformer is used to exchange information between the features of different views. Finally, the classifier fuses the local representation and the global representation for the final prediction. The innovations of this paper are:

  1. An improved global-local parallel analysis method for multi-view mammography images is proposed to analyze the global representation and local representation of the mammary gland in parallel.

  2. A new local cross-view transformer is proposed to learn the dependencies between different views and achieve information fusion between them.

SECTION II.

Materials and Methods

A. Data Collection

Two open-source digital mammography image datasets, the Mini Digital Database for Screening Mammography (Mini-DDSM) [20] and the Chinese Mammography Database (CMMD) [21], were used in this study. The Mini-DDSM includes 3904 breasts (1952 patients) from multiple centers: 1342 breasts were biopsy-confirmed benign, 1358 were biopsy-confirmed malignant, and 1204 were normal. The CMMD includes 2601 breasts (1775 patients) from multiple centers: 556 breasts were biopsy-confirmed benign, 1316 were biopsy-confirmed malignant, and 729 were normal. Each breast in both datasets has paired CC and MLO view images, resulting in a total of 7808 mammogram images for Mini-DDSM and 5202 for CMMD. Figure 2 and Figure 3 show examples of the four views of a patient's left and right breasts in the Mini-DDSM and CMMD datasets, respectively. In this study, the mammogram images from these two datasets were classified, with normal and benign breasts considered positive cases and malignant breasts considered negative cases. We divided each dataset into a training set and a test set in a ratio of 80:20; the resulting numbers of images and breasts in each dataset are given in Table 1.
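As an illustration of a breast-level 80:20 split of this kind, the sketch below partitions the data so that the CC and MLO images of one breast never end up in different subsets. It assumes a hypothetical pandas DataFrame with `breast_id` and `label` columns and is not the authors' released code.

```python
# Minimal sketch of a breast-level 80:20 split, assuming one row per image
# and hypothetical columns `breast_id` and `label` (malignant = 1, else 0).
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_breast(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    # Split at the breast level so paired CC/MLO images stay in the same subset.
    breasts = df[["breast_id", "label"]].drop_duplicates("breast_id")
    train_ids, test_ids = train_test_split(
        breasts["breast_id"], test_size=test_size,
        stratify=breasts["label"], random_state=seed)
    return df[df["breast_id"].isin(train_ids)], df[df["breast_id"].isin(test_ids)]
```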

TABLE 1. Summary of the Number of Images and the Number of Breasts in Each Dataset and Subset

FIGURE 2. Four views of the right and left breast mammography images from the public Mini-DDSM dataset.

FIGURE 3. Four views of the right and left breast mammography images from the public CMMD dataset.

The Mini-DDSM images used in this study were taken from the processed Mini-DDSM release on Kaggle as 16-bit PNG (Portable Network Graphics) images. For the CMMD dataset, we converted all raw DICOM images to lossless 8-bit JPEG (Joint Photographic Experts Group) images for subsequent processing.
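A minimal sketch of the DICOM-to-8-bit conversion step, assuming pydicom and Pillow are available; the intensity rescaling and the JPEG quality setting are illustrative choices (a quality-100 JPEG is near-lossless rather than strictly lossless), not the authors' exact pipeline.

```python
# Sketch: convert a CMMD DICOM file to an 8-bit JPEG.
import numpy as np
import pydicom
from PIL import Image

def dicom_to_jpeg(dcm_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(dcm_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale the full dynamic range to 0-255 before casting to 8 bits.
    pixels = (pixels - pixels.min()) / max(pixels.max() - pixels.min(), 1e-6) * 255.0
    Image.fromarray(pixels.astype(np.uint8)).save(out_path, quality=100)
```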

B. Data Pre-Processing

This section describes the data pre-processing pipeline in detail, along with the relevant visualizations.

1) Breast Region Segmentation

Images in the open-source Mini-DDSM dataset range in size from (495–2746) × (1088–3481) pixels, and CMMD images are 1914 × 2294 pixels. Although higher-resolution mammogram images contain more information, training deep learning models on the high-resolution originals poses the following challenges:

  1. Because the original mammogram images are very large, direct resizing may lose information about some lesions, preventing the model from learning from them [22].

  2. As shown in Figure 2, the original Mini-DDSM mammogram images frequently contain unwanted view-label text, which reduces classification accuracy.

  3. As shown in Figure 3, the original CMMD mammogram images contain a significant amount of redundant area. Lesions are present only in the breast region, which accounts for less than half of each mammogram image. Besides being useless for classifying lesions as benign or malignant, redundant regions also interfere with model training and raise computing costs [23].

To solve these problems, we use a BRS (breast region segmentation) module to pre-process the original mammogram images. Figure 4 illustrates how this module processes the Mini-DDSM dataset. The module first sets pixel values larger than 254 to 0 and then detects the edges of the tissue to obtain the coordinates of the four corners of the ROI (region of interest) box. Based on these coordinates, the breast region is cropped from the original image. Finally, the breast region is resized to a fixed size of 640 × 640 pixels and used as the input to the model. The module processes the CMMD dataset in the same way; the only difference is that CMMD images have no white label areas, so pixel values larger than 254 do not need to be replaced to prevent white areas from interfering with breast-tissue detection.
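The following OpenCV sketch illustrates the BRS steps described above (zeroing near-white label pixels, locating the tissue contour, cropping, and resizing), assuming an 8-bit grayscale input and the OpenCV 4.x API; the threshold value and helper name are our own illustrative choices, not the authors' code.

```python
# Sketch of the BRS idea: suppress labels, find the breast contour, crop, resize.
import cv2
import numpy as np

def breast_region_crop(img: np.ndarray, out_size: int = 640,
                       remove_white: bool = True) -> np.ndarray:
    work = img.copy()
    if remove_white:                      # Mini-DDSM: zero out white label areas
        work[work > 254] = 0
    # Threshold the tissue and take the largest contour as the breast region.
    _, mask = cv2.threshold(work, 10, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    roi = img[y:y + h, x:x + w]
    return cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_AREA)
```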

FIGURE 4. Processing of the BRS module on the Mini-DDSM dataset. (a) The original image; (b) ROI rectangle; (c) The cropped image; (d) The resized image.

2) Data Augmentation

To improve the robustness and generalization of the model, we augmented the training set images with four data augmentation methods, each applied with probability 0.5: a) flipping the images vertically; b) flipping the images horizontally; c) affine transformation with a rotate value of 20, a translate_percent value of 0.1, a shear value of 20, and a scale range of 0.8 to 1.2; and d) elastic transformation with an alpha value of 10 and a sigma value of 15. All images in the dataset were also normalized. The final training set contains eight times as many images as before augmentation: 49,952 training images for Mini-DDSM and 33,312 for CMMD. Figure 5 shows sample images produced by these four augmentation methods.
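A sketch of such an augmentation pipeline using the Albumentations library, assuming that library matches the quoted parameter names; the exact composition is illustrative rather than the authors' configuration.

```python
# Sketch of the four augmentations plus normalization, each with p=0.5.
import albumentations as A

train_transform = A.Compose([
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.Affine(rotate=(-20, 20), translate_percent=0.1,
             shear=(-20, 20), scale=(0.8, 1.2), p=0.5),
    A.ElasticTransform(alpha=10, sigma=15, p=0.5),
    A.Normalize(),
])
```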

FIGURE 5. Sample augmented images from different augmentation methods. (1–2): Image flipping in the vertical and horizontal directions; (3–7): Affine transformation with a scale from 0.8 to 1.2; (8): Elastic transformation.

C. Proposed LCVT-GR Overall Architecture

The overall architecture of LCVT-GR is shown in Figure 6. The input to the model is the pair of images of the two views of a breast (CC view and MLO view). A backbone model first extracts the features $U_{CC}$ and $U_{MLO}$ of these two views; the backbone used in this study is tf_efficientnetv2_s. The features extracted by the backbone are then passed in parallel through the Local Cross-View Transformers Module (LCVTM) and the Global Representation Module (GRM) to generate the local and global representations. Finally, the local and global representations are concatenated and fed into the MLP (multi-layer perceptron) classification layer to generate the prediction.
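A conceptual PyTorch sketch of this forward pass is given below, assuming a timm backbone and LCVTM/GRM modules as described in the following subsections; the classifier width and other details are placeholders, not the published implementation.

```python
# Sketch of the LCVT-GR forward pass: shared backbone, parallel LCVTM/GRM, MLP head.
import timm
import torch
import torch.nn as nn

class LCVTGR(nn.Module):
    def __init__(self, lcvtm: nn.Module, grm: nn.Module, rep_dim: int):
        super().__init__()
        # Shared backbone extracts U_CC and U_MLO feature maps from the two views.
        self.backbone = timm.create_model("tf_efficientnetv2_s", pretrained=True,
                                          num_classes=0, global_pool="")
        self.lcvtm, self.grm = lcvtm, grm
        self.classifier = nn.Sequential(nn.Linear(rep_dim, 256), nn.ReLU(),
                                        nn.Linear(256, 1))

    def forward(self, x_cc: torch.Tensor, x_mlo: torch.Tensor) -> torch.Tensor:
        u_cc = self.backbone(x_cc)            # B x C x H x W feature map
        u_mlo = self.backbone(x_mlo)
        local_rep = self.lcvtm(u_cc, u_mlo)   # local cross-view representation
        global_rep = self.grm(u_cc, u_mlo)    # global representation
        return self.classifier(torch.cat([local_rep, global_rep], dim=1))
```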

FIGURE 6. The overall architecture of our proposed model LCVT-GR.

D. Local Cross-View Transformers Module

LCVTM models the local semantic relationship between the two views to generate a local representation. To better learn the dependencies between the two views, LCVTM adopts the idea of cross transformers, and its attention mechanism follows the cross-shaped window self-attention method of CSWin [24]. Since LCVT-GR takes the information of two views as input, we design a Cross-View Attention Module (CVAM) to enable interaction between the $U_{CC}$ and $U_{MLO}$ features.

1) Cross-Shaped Window Self-Attention

The transformer uses a self-attention mechanism to model contextual information and capture long-range dependencies. However, this pixel-pair-based modeling requires a significant amount of computation, typically quadratic in the input feature size [25], so the computational cost can be very high when the input feature map resolution is relatively high. The Swin Transformer [26] addresses this issue with local-window self-attention, broadening the receptive field through shifted windows. However, this still leaves each token with a constrained attention area within a transformer block. To expand the attention area more effectively, CSWin [24] proposed the cross-shaped window self-attention mechanism, which performs self-attention within the horizontal and vertical stripes that form a cross-shaped window, giving each token a wider receptive field within each transformer block and stronger contextual modeling capability.

The schematic diagram of cross-shaped window self-attention is shown in Figure 7. Cross-shaped window self-attention builds on multi-head self-attention: the input feature $X \in \mathbb{R}^{(H \times W) \times C}$ is first linearly projected onto $K$ heads $\{h_1, \ldots, h_K\}$, which are then divided equally into two parallel groups, each with $C/2$ channels. The two groups apply self-attention in different directions, with the first group performing horizontal-stripe self-attention and the second group performing vertical-stripe self-attention, in parallel. Finally, the outputs of the two groups are concatenated. Unlike the original cross-shaped window self-attention, we use CVAM to perform cross-attention between the two views ($U_{CC}$ and $U_{MLO}$) input to LCVTM. The horizontal CVAM exchanges the Q generated by the first group of $U_{CC}$ heads and the first group of $U_{MLO}$ heads, and the vertical CVAM exchanges the Q generated by the second group of $U_{CC}$ heads and the second group of $U_{MLO}$ heads. In this way, CVAM implements cross-attention between the two views by exchanging the Q of the horizontal-stripe self-attention and the Q of the vertical-stripe self-attention of both views. This follows the local co-occurrence module proposed in [27], which also achieves information interaction by exchanging the Qs generated by different features. Assuming that, after CVAM, the output of $U_{CC}$ is $A$ and the output of $U_{MLO}$ is $B$, the output of LCVTM is defined as:
\begin{align*}
\mathrm{LCVTM}(U_{CC}, U_{MLO}) &= \mathrm{Concat}\big(\mathrm{GAP}(A), \mathrm{GAP}(B)\big) \tag{1}\\
A &= \mathrm{Concat}(A_1, \ldots, A_k, \ldots, A_K)\,W^{O}, \quad k = 1, \ldots, K \tag{2}\\
B &= \mathrm{Concat}(B_1, \ldots, B_k, \ldots, B_K)\,W^{O}, \quad k = 1, \ldots, K \tag{3}
\end{align*}


FIGURE 7. The diagram of cross-shaped window self-attention.

$W^{O} \in \mathbb{R}^{C \times C}$ denotes the projection matrix, and the output dimension is set to $C$. GAP stands for global average pooling and LN for layer normalization.

2) Cross-View Attention Module

For cross-attention between the $U_{CC}$ and $U_{MLO}$ views, $U_{CC}$ is uniformly divided into non-overlapping horizontal stripes of equal height $[U_{CC}^{1}, \ldots, U_{CC}^{m}, \ldots, U_{CC}^{M}]$ and non-overlapping vertical stripes of equal width $[U_{CC}^{1}, \ldots, U_{CC}^{z}, \ldots, U_{CC}^{Z}]$, and the same operation is applied to $U_{MLO}$. Here sw is the dynamic stripe width used when dividing the stripes; sw can be adjusted to balance the learning capacity and the computational complexity of the model.
\begin{align*}
[U_{CC}^{1}, \ldots, U_{CC}^{m}, \ldots, U_{CC}^{M}] &= U_{CC}, \quad U_{CC}^{m} \in \mathbb{R}^{(sw \times W) \times C}, \quad M = H/sw \tag{4}\\
[U_{CC}^{1}, \ldots, U_{CC}^{z}, \ldots, U_{CC}^{Z}] &= U_{CC}, \quad U_{CC}^{z} \in \mathbb{R}^{(sw \times H) \times C}, \quad Z = W/sw \tag{5}\\
[U_{MLO}^{1}, \ldots, U_{MLO}^{m}, \ldots, U_{MLO}^{M}] &= U_{MLO}, \quad U_{MLO}^{m} \in \mathbb{R}^{(sw \times W) \times C}, \quad M = H/sw \tag{6}\\
[U_{MLO}^{1}, \ldots, U_{MLO}^{z}, \ldots, U_{MLO}^{Z}] &= U_{MLO}, \quad U_{MLO}^{z} \in \mathbb{R}^{(sw \times H) \times C}, \quad Z = W/sw \tag{7}
\end{align*}


Assuming that the projected Q (query), K (key), and V (value) dimensions of the $k^{\mathrm{th}}$ head are $d_k$, the output of the $k^{\mathrm{th}}$ head of $U_{CC}$ after cross-attention is defined as:
\begin{align*}
A_k &= \begin{cases} \mathrm{H\text{-}CAttn}_k(U_{CC}, U_{MLO}) = [A_k^{1}, \ldots, A_k^{m}, \ldots, A_k^{M}], & k = 1, \ldots, K/2\\ \mathrm{V\text{-}CAttn}_k(U_{CC}, U_{MLO}) = [A_k^{1}, \ldots, A_k^{z}, \ldots, A_k^{Z}], & k = \frac{K}{2}+1, \ldots, K \end{cases} \tag{8}\\
A_k^{m} &= \mathrm{CVAM}(Q_{MLO}^{m}, K_{CC}^{m}, V_{CC}^{m}) \tag{9}\\
A_k^{z} &= \mathrm{CVAM}(Q_{MLO}^{z}, K_{CC}^{z}, V_{CC}^{z}) \tag{10}\\
Q_{MLO}^{m} &= U_{MLO}^{m} W_k^{Q}, \quad K_{CC}^{m} = U_{CC}^{m} W_k^{K}, \quad V_{CC}^{m} = U_{CC}^{m} W_k^{V} \tag{11}\\
Q_{MLO}^{z} &= U_{MLO}^{z} W_k^{Q}, \quad K_{CC}^{z} = U_{CC}^{z} W_k^{K}, \quad V_{CC}^{z} = U_{CC}^{z} W_k^{V} \tag{12}
\end{align*}


$W_k^{Q} \in \mathbb{R}^{C \times d_k}$, $W_k^{K} \in \mathbb{R}^{C \times d_k}$, and $W_k^{V} \in \mathbb{R}^{C \times d_k}$ denote the projection matrices of the Q, K, and V of the $k^{\mathrm{th}}$ head, respectively, and $d_k$ denotes the channel dimension of the $k^{\mathrm{th}}$ head, with value $C/K$. The output of the horizontal-stripe cross-attention of the $k^{\mathrm{th}}$ head is denoted $\mathrm{H\text{-}CAttn}_k(X_1, X_2)$, and the output of the vertical-stripe cross-attention of the $k^{\mathrm{th}}$ head is denoted $\mathrm{V\text{-}CAttn}_k(X_1, X_2)$. The Q and the K, V of CVAM come from the features of different views, and CVAM is computed as:
\begin{equation*}
\mathrm{CVAM}(Q_{MLO}^{m}, K_{CC}^{m}, V_{CC}^{m}) = \mathrm{softmax}\left(\frac{Q_{MLO}^{m}\,(K_{CC}^{m})^{\mathrm{T}}}{\sqrt{d_k}}\right) V_{CC}^{m} \tag{13}
\end{equation*}


Figure 8 shows how the output $A_k$ of the $k^{\mathrm{th}}$ head of $U_{CC}$ after cross-attention is obtained; the output of the $k^{\mathrm{th}}$ head of $U_{MLO}$ after cross-attention is derived similarly.
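As a toy illustration of Eq. (13), the following sketch computes the CVAM attention for one group of heads, assuming the stripe tensors have already been partitioned and projected as in Eqs. (4)–(12); it omits the stripe splitting and head merging.

```python
# Sketch of Eq. (13): Q from the MLO view, K and V from the CC view.
import math
import torch

def cvam(q_mlo: torch.Tensor, k_cc: torch.Tensor, v_cc: torch.Tensor) -> torch.Tensor:
    # Attention weights express how each MLO stripe token attends to CC stripe tokens.
    d_k = q_mlo.size(-1)
    attn = torch.softmax(q_mlo @ k_cc.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return attn @ v_cc
```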

FIGURE 8. The diagram for calculating $A_k$.

3) Global Representation Module

Although LCVTM adopts CSWin's cross-shaped window self-attention method, which effectively expands the attention region, LCVTM lacks the global information of the image. Therefore, we design the GRM component to extract the global information of the image. GRM combines the features $U_{CC}$ and $U_{MLO}$ of the two views extracted by the backbone and performs feature dimensionality reduction with GMP (generalized mean pooling):
\begin{equation*}
\mathrm{GRM}(U_{CC}, U_{MLO}) = \mathrm{GMP}\big(\mathrm{Concat}(U_{CC}, U_{MLO})\big) \tag{14}
\end{equation*}


The global representation generated by GRM is extracted from the whole image and compensates for LCVTM's lack of global information. The effectiveness of the GRM component is demonstrated in the ablation experiment section.
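A minimal sketch of the GRM in Eq. (14) with generalized mean (GeM) pooling, assuming $U_{CC}$ and $U_{MLO}$ are B × C × H × W feature maps; the learnable exponent p and the clamping epsilon are common GeM defaults, not values reported in the paper.

```python
# Sketch of Eq. (14): concatenate the two view features and apply GeM pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, u_cc: torch.Tensor, u_mlo: torch.Tensor) -> torch.Tensor:
        x = torch.cat([u_cc, u_mlo], dim=1)                # concatenate along channels
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)  # GeM over spatial dims
        return x.flatten(1)                                # B x 2C global representation
```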

SECTION III.

Results and Discussion

A. Implementation Details

We use the AdamW optimizer [28] to train the models to minimize the binary cross-entropy (BCE) loss, with a learning rate of 0.0001 and a weight decay of 0.01. The batch size is 8 for the multi-view model and 16 for the single-view model, and all models are trained for 20 epochs. We use OneCycleLR to dynamically adjust the learning rate during training, with a maximum learning rate of 0.0001 and 0.1 of the cycle spent increasing the learning rate. All experiments are implemented in PyTorch and performed on an NVIDIA GTX 1080 Ti GPU (12GB).
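The training configuration described above can be set up roughly as follows, assuming `model` and `train_loader` are defined elsewhere; this is a sketch, not the authors' training script.

```python
# Sketch of the training loop: AdamW, BCE loss, OneCycleLR stepped per batch.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

criterion = torch.nn.BCEWithLogitsLoss()          # binary cross-entropy on logits
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = OneCycleLR(optimizer, max_lr=1e-4, epochs=20,
                       steps_per_epoch=len(train_loader), pct_start=0.1)

for epoch in range(20):
    for x_cc, x_mlo, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x_cc, x_mlo).squeeze(1), y.float())
        loss.backward()
        optimizer.step()
        scheduler.step()                          # OneCycleLR advances every batch
```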

B. Evaluation Metrics

Following [14] and [27], we evaluate the classification results with two metrics: the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR). These are common metrics used to assess the performance of radiologists, and they allow us to assess model performance and compare differences between models. The AUC-ROC reflects the balance between the model's TPR (true positive rate, also called recall) and FPR (false positive rate) at different probability thresholds; the higher the AUC-ROC, the better the model distinguishes between positive and negative cases. The TPR and FPR are computed as:
\begin{align*}
\mathrm{TPR} &= \frac{TP}{TP + FN} \tag{15}\\
\mathrm{FPR} &= \frac{FP}{FP + TN} \tag{16}
\end{align*}

where TP, FN, FP, and TN denote the numbers of true positive, false negative, false positive, and true negative samples, respectively. The AUC-PR reflects the balance between the model's recall and precision (positive predictive value) at different probability thresholds. Precision is defined as:
\begin{equation*}
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}
\end{equation*}
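Breast-level AUC-ROC and AUC-PR can be computed with scikit-learn as sketched below, assuming `y_true` and `y_prob` are NumPy arrays of ground-truth labels and predicted probabilities; average precision is used here as a standard estimator of the area under the PR curve.

```python
# Sketch of the two evaluation metrics used in this paper.
from sklearn.metrics import average_precision_score, roc_auc_score

auc_roc = roc_auc_score(y_true, y_prob)            # area under the ROC curve
auc_pr = average_precision_score(y_true, y_prob)   # area under the PR curve
```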

C. Results and Discussions

Table 2 compares our model with several two-view mammogram image classification models as well as traditional CNN classification models on the Mini-DDSM and CMMD test sets. When training the competing models, the batch size is no longer 8 or 16 but is set as large as possible to make full use of GPU memory; the batch size used for each model is given in Table 2. PHResNet18 [10] is a two-view breast cancer classification method based on parameterized hypercomplex neural networks proposed by Lopez et al.; it uses ResNet18 as a backbone and models the correlations between different views using hypercomplex algebraic properties. The breast-wide-model [14] is a two-branch, two-view breast cancer classification model with ResNet22 as the backbone, proposed by Wu et al.; each branch extracts features from one view, and the extracted features are finally aggregated for prediction. The two-views-classifier [13] is a two-view breast cancer classification model based on three-stage transfer learning, proposed by Petrini et al.; it first trains a patch classifier on natural images, then trains a one-view classifier initialized with the patch classifier weights, and finally trains a two-view classifier initialized with the one-view classifier weights. Since we only compare against the network structure of the two-views-classifier in our experiments, the transfer learning procedure described in that paper is not used.

TABLE 2. Testing Results Show Breast-Level Estimates of AUC-ROC and AUC-PR on Mini-DDSM and CMMD

We used bootstrap resampling (2000 bootstrap repetitions) to estimate 95% CIs (confidence intervals) in our tests, and Table 2 gives the mean, lower, and upper values of the 95% CI for both the AUC-ROC and AUC-PR metrics. In addition, as shown in Fig. 9 and Fig. 10, we plotted the ROC curves and PR curves of our model and the competing models on the Mini-DDSM and CMMD datasets, respectively, to visualize the classification performance. The test results show that our model significantly outperforms the competing models. On the Mini-DDSM test set, the AUC-ROC reaches 85.85% and the AUC-PR reaches 65.76%, an average improvement of 9.24% in AUC-ROC and 20.6% in AUC-PR. On the CMMD test set, the AUC-ROC reaches 87.12% and the AUC-PR reaches 89.03%, an average improvement of 6.58% in AUC-ROC and 6.99% in AUC-PR.
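A sketch of the bootstrap CI estimation (2000 resamples, 95% interval) is shown below, assuming the same `y_true` and `y_prob` arrays as above; the resampling details are illustrative rather than the authors' exact procedure.

```python
# Sketch: percentile bootstrap CI for a threshold-free metric such as AUC-ROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_prob, metric=roc_auc_score,
                 n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # need both classes present
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(scores)), float(lo), float(hi)
```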

FIGURE 9. The ROC curves of comparison results of different models on different datasets. (a) Mini-DDSM; (b) CMMD.

FIGURE 10. The PR curves of comparison results of different models on different datasets. (a) Mini-DDSM; (b) CMMD.

The variability of features can be assessed qualitatively with t-SNE (t-distributed stochastic neighbor embedding) [29] plots. For the training and test sets of both the Mini-DDSM and CMMD datasets, we plotted the features produced by the local and global analyses using t-SNE. As shown in Fig. 11, our model extracts discriminative, well-separated features for classification.
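The t-SNE visualization can be reproduced roughly as follows, assuming `features` holds the N × D fused representations and `labels` the benign/malignant targets; perplexity and plotting choices are illustrative.

```python
# Sketch: 2-D t-SNE embedding of the fused local and global representations.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=5)
plt.title("t-SNE of LCVT-GR representations")
plt.show()
```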

FIGURE 11. t-SNE plots of the (a) Mini-DDSM training set; (b) Mini-DDSM test set; (c) CMMD training set; (d) CMMD test set.

D. Ablation Studies

By comparing LCVT-GR with several two-view mammogram image classification models as well as traditional CNN classification models, we have shown that LCVT-GR achieves the best classification performance, which also indicates that our strategy of using the global-local parallel analysis method to mimic manual analysis is effective. In addition, to validate the soundness of the LCVT-GR design, we examine the structure of LCVT-GR through three ablation experiments.

1) Key Components

To evaluate the effectiveness of the model components, we conducted ablation experiments on each component of LCVT-GR; Table 3 shows the results on both the Mini-DDSM and CMMD datasets. LTM denotes a variant of the LCVTM module in which no interaction between the two views is performed. On the Mini-DDSM dataset, single-view classification using the backbone network (tf_efficientnetv2_s) alone achieves 84.04% AUC-ROC and 61.5% AUC-PR (first row of Table 3).

TABLE 3. Ablation Study of Key Components of Our LCVT-GR Model

TABLE 4. Ablation on Stripe Width (sw) in the Mini-DDSM Dataset

As shown in the second and third rows of Table 3, using Backbone+LCVTM or Backbone+GRM for multi-view classification improves the AUC-ROC or AUC-PR of the test results, indicating that the LCVTM and GRM components are effective. Accordingly, when Backbone+LCVTM+GRM (LCVT-GR) is used for multi-view classification, the effects of the two components add up, raising the AUC-ROC to 85.85% and the AUC-PR to 65.76% (sixth row of Table 3). In addition, comparing the fourth, fifth, and sixth rows of Table 3 shows that applying cross-attention between the two views when extracting the local representation improves the AUC-ROC by 1% and the AUC-PR by 1.05%. Using the multi-view classification model gives better predictions than using the single-view classification model, with a maximum improvement of 1.99% in AUC-ROC and 3.09% in AUC-PR. The ablation results on the Mini-DDSM and CMMD datasets follow the same pattern. We therefore conclude that all components of LCVT-GR are effective and that the structure of LCVT-GR is optimal.

2) Dynamic Stripe Width

In Table 4, we examine the trade-off between stripe width (sw) and model performance on the Mini-DDSM dataset. We find that the computational cost (FLOPs) increases as the stripe width increases, but the performance of the model does not increase with it. We conclude that setting sw to 1 attains better model performance at the lowest computational cost.

3) Number of Heads

In Table 5, we also examine the trade-off between the number of heads (K) and model performance on the Mini-DDSM dataset. We find that performance improves significantly as K increases at first, but degrades once K becomes large enough. The value of K affects model performance but does not change the computational cost (FLOPs). With sw set to 1, setting K to 4 leads to the best model performance. Based on these ablation results, the LCVT-GR model uses sw = 1 and K = 4 by default in all experiments in this paper.

TABLE 5. Ablation on the Number of Heads in the Mini-DDSM Dataset

In summary, the first ablation experiment demonstrates that a) the key components of LCVT-GR are all effective, b) classification results are better when the two views interact than when they do not, and c) the two-view structure performs better than the single-view structure. This is in line with our initial assumptions: the images of the two views of a breast often contain complementary information, so the classification performance of the two-view model is better than that of the single-view model, and the information interaction between the two views captures more dependencies, which is important for improving the detection rate of lesions. In the second and third experiments, we ablated the two model variables sw and K, respectively, and determined that the model achieves a good trade-off between computational cost and performance when sw = 1 and K = 4. Together, these three ablation experiments demonstrate that the structure of LCVT-GR is sound and effective.

SECTION IV.

Conclusion

In this paper, we propose a new multi-view mammography image classification method that uses a two-view global-local parallel analysis method to extract global and local information from mammography images. Global analysis helps to detect abnormalities such as distortion and asymmetric density of breast structures, and local analysis helps to detect abnormalities such as masses and calcifications. To better learn the dependencies between the two views and exchange information between their features, we employ the cross-transformer concept when extracting local information.

To validate the effectiveness of our method, we conducted comparison experiments and ablation experiments on two publicly available datasets, Mini-DDSM and CMMD. The results of the comparison experiments show that our method achieves better results than existing advanced methods, with notable improvements in both the AUC-ROC and AUC-PR metrics. The results of the ablation experiments show that our model architecture is sound and effective and achieves a good trade-off between computational cost and model performance.

Our proposed method will help to build high-performance and robust deep learning-based mammography CAD systems to improve efficiency and reduce cost in the early diagnosis of breast cancer. Since the public datasets used in this study are still relatively small, we will further validate our model on larger datasets in future work.

Conflicts of Interest

None declared.
