Introduction
Digital X-ray radiography, with its advantages of low radiation dose, fast imaging speed, and high image definition, is a common imaging method widely used in clinical diagnosis [1]. During chest X-ray imaging, many factors, e.g., the patient's position, breathing status, exposure, irradiation field, center line, and projection angle, affect the imaging quality and directly influence the doctor's diagnosis. High-quality chest X-ray images help clinicians accurately diagnose diseases, while low-quality X-ray images are likely to cause misdiagnosis or missed diagnosis [2]. At present, the quality judgment of chest X-ray images mainly depends on the subjective evaluation of radiologists. The results depend on the skill level and experience of the evaluators, and the evaluation process demands high concentration and a heavy workload. An image quality assessment method based on machine recognition can evaluate chest X-ray image quality objectively, avoid subjective uncertainties, and effectively improve efficiency.
Image Quality Assessment (IQA) is one of the basic technologies in image processing [3]. It is widely used in algorithm design and analysis, system performance evaluation, etc., and supports tasks such as image super-resolution, image restoration, and the evaluation and debugging of imaging product parameters.
According to the evaluation method, IQA can be divided into subjective quality assessment and objective quality assessment. The subjective IQA method judges image quality through observers' subjective scoring. Its advantage is that it truly reflects people's subjective visual experience, and the results are direct, accurate, and reliable. Mean Opinion Score (MOS) and Differential Mean Opinion Score (DMOS) are the main forms of subjective scores [4]. However, this method is usually affected by objective factors such as the observation environment and the number of experimenters, and thus has many limitations. The objective IQA method establishes a mathematical model based on the human visual system and evaluates or scores the images to be tested. It has the advantages of batch processing and reproducible results, which makes it easier to apply in various scenarios.
Objective image quality assessment methods can be divided into Full-Reference Image Quality Assessment (FR-IQA), Reduced-Reference Image Quality Assessment (RR-IQA) and No-Reference Image Quality Assessment (NR-IQA) according to their processing methods. Among them, FR-IQA algorithms rely entirely on the reference image and evaluate the image by detecting the difference between the distorted image and its corresponding original undistorted image. There are already many FR-IQA methods, such as SSIM [5], MS-SSIM [6], VIF [7], IFC [8], FSIM [9], etc. Due to the existence of reference images, FR-IQA can usually accurately measure the quality of distorted images. However, their usefulness is limited due to the difficulty in obtaining reference images in practical applications. RR-IQA algorithms utilize the representative information of the reference image and its distorted image (e.g., some statistical features of the reference image, such as power spectrum). Image quality evaluation is conducted by measuring the similarity of these features in the reference image and the distorted image [10], [11], [12], [13]. NR-IQA algorithms evaluate the image quality without using the original reference image. In terms of practicality, NR-IQA methods are widely used because no information about the reference image is required [14], [15], [16], [17]. So far, designing an effective NR-IQA method remains a challenging research topic.
Since there is often no gold-standard image in medical imaging, quality assessment mainly relies on NR-IQA. Early NR-IQA methods mainly used handcrafted features, including many local and global features, but their limited feature expression ability resulted in low performance. With the rapid development of deep learning, deep features are increasingly used in NR-IQA, and their performance far exceeds that of handcrafted features.
Although NR-IQA has made much progress, the complexity of real clinical image data and the many factors causing low quality mean that directly applying it to the quality assessment of chest X-rays still faces many problems. Firstly, the most commonly used NR-IQA datasets, such as LIVE, TID2008, and TID2013, contain either natural distortions or artificially simulated distortions, whose data distribution and characteristics are far from real medical imaging data. Secondly, the prediction accuracy of these algorithms is mainly measured by the correlation with subjective human scoring, which cannot meet the requirements of the medical domain for medical image quality [18]. Finally, most existing methods assess quality based on chest X-ray image features [3], [19], [20]: they semantically segment the diagnostic regions with image segmentation algorithms and then judge image quality with classification algorithms. However, such methods require a large amount of annotated chest X-ray data, which imposes a huge annotation workload on doctors and challenges current data-driven deep learning methods.
At present, chest X-ray quality assessment methods that rely only on medical image feature learning still depend heavily on expensive, expert-annotated datasets. Collating these data requires heavy work in data collection, sampling, and manual labeling, which makes it difficult to scale up. This expensive data collation process limits the size of the dataset, which in turn hinders effective improvement of the algorithm's performance. In recent years, with the continuous development of transfer learning, a number of unsupervised multimodal pre-training models using image-text pairs have emerged and achieved excellent performance in downstream tasks such as image classification and image segmentation [21], [22]. Considering that each patient's chest X-ray image has a corresponding diagnosis report and quality control report, it is possible to construct X-ray image-text pairs by applying natural language processing to these reports and obtain knowledge descriptions related to chest X-ray image quality. The model can then be fine-tuned from the CLIP pre-trained model to obtain better performance. In addition, since text features have a certain generalization ability, they can effectively improve the generalization performance of the model.
A single chest X-ray image may have several quality problems at the same time; for example, one image may simultaneously exhibit uneven clavicles and abnormal external objects. There are therefore multiple quality judgment rules, and each rule relates to a local area of the image, which makes this a typical multi-label image feature learning problem. The fine-grained contrastive learning of local image features used in DualCoOp [23] can further improve performance in this setting.
Therefore, inspired by the large multimodal pre-training models CLIP [21] and ALIGN [22], we propose a chest X-ray quality assessment method that organically combines image-text contrastive learning with medical domain knowledge fusion. First, an algorithm framework for chest X-ray quality assessment is presented. It achieves cross-domain transfer learning by fusing large-scale real clinical chest X-ray images with diagnosis report text based on Contrastive Language-Image Pre-training (CLIP) and model fine-tuning. While improving the prediction accuracy of the algorithm, the cost of massive data labeling is avoided, and the local visual patch features of the X-ray image are aligned with multiple text features so that the visual features contain more fine-grained image information. The contributions of this paper can be summarized as follows:
A medical image quality assessment method integrating medical domain knowledge is proposed. The text data from chest X-ray image diagnosis reports and quality control reports are converted into triplet information through knowledge extraction and knowledge fusion, which serves as medical domain knowledge guidance for the medical images during model training.
Exploiting large-scale medical images together with their corresponding diagnostic reports and quality control reports, cross-domain transfer learning is accomplished by fine-tuning the pre-trained model on contrastive text-image pairs, which effectively helps overcome the problem of insufficient training data and avoids heavy manual data labeling.
By aligning local visual patch features of the X-ray image to multiple text features, the visual features contain more fine-grained image information, thereby addressing the problem that global visual information cannot reflect the local image quality relevant to disease diagnosis.
Related Work
A. Factors Affecting the Disqualification of Chest X-Ray Images
According to the working experience and technical guidelines of radiologists [18], high-quality chest X-ray images should meet the following conditions, as illustrated in Figure 1: (1) The image should have no artifacts or severe image noise. Bad cases include abnormal external objects not belonging to the body (except those that cannot be removed), motion blur, post-processing artifacts, and equipment artifacts; no granular noise should be visible to the naked eye in soft-tissue density areas. (2) The position of the patient should be correct and appropriate, and 80% of the scapula should be moved outside the lung field. (3) Both sides of the thorax should be symmetrical, the two clavicles should be flat, and the sternoclavicular joints should be symmetrical. (4) The irradiation field should be properly selected, and no chest tissue should be hidden by occluding cuts (including both lung apexes, the lateral edge of the ribs above the diaphragm, and both costophrenic angles). (5) The chest projection should be in the center of the image. The upper part of the chest X-ray should include the lung apices on both sides, with a 3–5 cm empty exposure area visible above the soft-tissue shadow of the shoulder; the lower part should include the bilateral costophrenic angles and about 1–3 cm below them; the outer sides of the chest should include the outer edge of the ribs and the soft tissue of the chest wall. (6) The bilateral lung fields, trachea and adjacent bronchi, heart and aortic border, diaphragm and bilateral costophrenic angles, and the lung fields and mediastinum behind the heart shadow should be clearly displayed. (7) The boundaries between lung field and mediastinum, lung field and chest wall, and lung field and shoulder soft tissue should be distinguishable. The lung field texture should be clear and sharp, mediastinal soft tissue should be vaguely distinguishable, and lung texture overlapping with the heart shadow should be clearly displayed. (8) There should be no image blur caused by patient movement, and no visible breathing or heartbeat motion blur or diaphragm ghosting in the diagnostic area.
According to the above requirements, we identified 13 common problems in chest X-ray image quality assessment. As shown in Figure 2, we provide 13 examples of unqualified chest X-ray images with comparisons, including (a) overlapping of scapula and lung field, (b) misalignment of clavicles on both sides, (c) inconsistent clavicle height on both sides, (d) asymmetrical sternoclavicular joints, (e) unclear lung apex, (f) lung apex not included, (g) too little empty exposure above the shoulders, (h) diaphragm not included, (i) costophrenic angle not included, (j) asymmetrical ribcage on both sides, (k) asymmetrical lung fields on both sides, (l) existence of abnormal objects, and (m) motion blur. The left part of each image is the original image, and the right part is the unqualified image. Figure 2-(n) contains multiple disqualification factors, namely overlapping of scapula and lung field, misalignment of clavicles on both sides, inconsistent clavicle height on both sides, asymmetrical sternoclavicular joints, unclear lung apex, asymmetrical ribcage on both sides, asymmetrical lung fields on both sides, and existence of abnormal objects, which are marked in different colors.
Examples of disqualified chest X-ray images: (a) overlapping of scapula and lung field, (b) misalignment of clavicles on both sides, (c) inconsistent clavicle height on both sides, (d) asymmetrical sternoclavicular joints, (e) unclear lung apex, (f) lung apex not included, (g) too little empty exposure above the shoulders, (h) diaphragm not included, (i) costophrenic angle not included, (j) asymmetrical ribcage on both sides, (k) asymmetrical lung fields on both sides, (l) existence of abnormal objects, (m) motion blur; (n) multiple disqualification factors. Marks on the images: lake blue indicates that the scapula overlaps with the lung field; sky blue indicates that the clavicles on both sides are not in the same position and height; yellow indicates asymmetry of the sternoclavicular joints on both sides; dark red indicates that the apex of the lung is not clear; gray indicates thoracic asymmetry; green indicates lung field asymmetry; pink indicates abnormal objects.
B. No-Reference Image Quality Assessment Methods
No-Reference Image Quality Assessment (NR-IQA) is an important type of objective quality evaluation method. Since no original reference image is needed, it has broad application prospects. Feichtenhofer et al. proposed a sharpness measurement method based on the statistical analysis of local edge gradients for the no-reference assessment of images distorted by blur and noise [24]. Mittal et al. did not use subjective opinion scores for training, but performed NR-IQA by calculating locally normalized luminance coefficients in the spatial domain, which is more competitive than peak signal-to-noise ratio and structural similarity [16]. Min et al. proposed an NR-IQA framework based on pseudo reference images, which generates a pseudo reference image as a new reference for the distorted image [25].
In recent years, with the continuous development of deep learning, a variety of NR-IQA methods based on deep learning have been proposed. Kang et al. first proposed using a CNN to effectively measure the distortion degree of local image regions, and the experimental results showed that its performance surpassed traditional methods [26]. Gu et al. proposed a vector regression framework for NR-IQA, which estimated the trust score vector with CNNs and improved the evaluation performance by using object-oriented pooling [27]. Gao et al. used two objective quality metrics, peak signal-to-noise ratio and structural similarity, instead of subjective scoring to train the network, addressing the lack of annotated CT data [28]. End-to-end deep learning NR-IQA methods integrate feature extraction and fitting/regression into a unified framework and optimize them simultaneously, and they are the current mainstream NR-IQA scheme. The main ideas include using a GAN to restore the distorted image, computing the loss and distance between the restored and distorted images to generate a quality score, and using rank learning to compensate for the lack of large-scale IQA datasets and improve model accuracy. In addition, attention mechanisms are used to increase the weight of the region of interest when assessing image quality, and prior knowledge is learned through meta-learning to mitigate the small scale of no-reference image datasets.
High-quality medical images are a prerequisite for radiologists to accurately diagnose and treat diseases. Considering that deep learning has achieved remarkable results in object detection, organ segmentation, and image classification, no-reference quality assessment of medical images with deep learning has gradually attracted more attention and become a research hotspot. For example, Eck et al. proposed a CT quality assessment algorithm that realizes NR-IQA while ensuring the lesion detection rate [29]. Mortamet et al. proposed a method for fully automatic assessment of 3D MRI quality by analyzing the air background pattern in MRI [30]. Esses et al. used neural networks to assess the image quality of whole T2-weighted liver MRI images [31]. Wang et al. used a two-step convolutional neural network to evaluate the image quality of regions of interest on liver MRI [32]. Xiao-Qian et al. used neural networks to grade DR chest radiographs into four levels: excellent, good, medium, and poor [33]. Wu et al. realized automatic quality assessment of prenatal fetal ultrasound images with a two-step convolutional neural network [34]. Li et al. used an improved AlexNet to score the quality of CT images, demonstrating the potential of CNNs for no-reference quality assessment of medical images [35]; however, the lack of high-quality scoring datasets remained unsolved. Due to the particularity of pathological images, medical image quality cannot be determined by the whole image alone, and the quality of the local regions used for disease diagnosis is often more important. At present, most research works are based on the overall structure of medical images; the scale at which they assess quality is relatively coarse, and they do not analyze the local regions of the images that matter for disease diagnosis.
Currently, there are only a few research works on no-reference medical image quality assessment based on deep learning. One of the main reasons is that the network needs a large amount of manually scored data for training, and such manual quality scoring and annotation of massive medical images is time-consuming and labor-intensive. At the same time, the current mainstream data-driven deep learning methods have achieved satisfactory results in many cases. To address these problems, many scholars have incorporated prior knowledge into the machine learning process to improve model performance. For instance, logical rules [36], [37] or algebraic equations [38], [39] have been added as constraints on loss functions, and the feature representation of neural networks has been enhanced by using the association information between instances in the form of a knowledge graph [40]. These methods have achieved better performance in image classification tasks [41], [42]. A growing body of results shows that data- and knowledge-driven methods are playing an important role in more and more fields.
At the same time, the two-stage training paradigm of "pre-training and fine-tuning" has gradually become one of the mainstream learning schemes in deep learning. Pre-training on large-scale datasets can significantly improve model performance and generalization ability. In fact, large-scale data not only helps define the approximation of the target problem, but is also necessary to ensure asymptotic convergence [43]. In the field of vision-language pre-training (VLP), CLIP [21] and ALIGN [22] collected millions of image-text pairs for learning visual representations from natural language supervision, which has been proved to be transferable to various downstream tasks, such as vision-and-language [44], image [45], and video tasks [46]. They directly align visual and linguistic features through an image-text contrastive (ITC) loss, which can also be extended to large-scale datasets with high training efficiency.
The multi-label image recognition task obtains a more comprehensive understanding of an image by identifying the multiple semantic labels present in it [47], [48], [49], [50]. Prompt learning provides an effective way to transfer pre-trained vision-language models to other tasks and has achieved success in many of them [51], [52], [53], [54], [55], [56]. However, these methods mainly focus on matching a single label per image and cannot handle the case where an image has multiple labels. DualCoOp [23] proposes a fine-grained contrastive learning method that effectively improves the ability to recognize multiple objects in multi-label recognition tasks. Considering that each chest X-ray image may have multiple quality problems, as shown in Figure 2 (n), we draw on its design to identify multiple chest X-ray image quality factors.
To summarize, NR-IQA is an important kind of objective quality assessment and has significant application potential in medical image quality assessment, and deep learning based NR-IQA has become a research hotspot. However, existing research suffers from the lack of high-quality labeled data; global feature evaluation of medical images cannot reflect the quality of the local information that doctors focus on for disease diagnosis; and end-to-end data-driven learning methods lack medical domain knowledge fusion. These issues cause a bottleneck in quality assessment. In this paper, we combine image-text contrastive learning with medical domain knowledge. Using large-scale chest X-ray images together with diagnostic reports and quality control reports, a pre-trained model is fine-tuned on contrastive text-image pairs to achieve cross-domain transfer learning. The local visual patch features of X-ray images are aligned with multiple text features to ensure that the visual features contain more fine-grained image information.
Chest X-Ray Quality Assessment With Medical Knowledge
Our proposed framework contains the chest X-ray quality control knowledge triplets, a text encoder, and an image encoder, as illustrated in Figure 1. First, in the training phase, the knowledge rules of chest X-ray standardized quality control [18] are converted into triplets of the form <head entity, relation, tail entity> (e.g., <scapula, overlap, lung field>) that describe the corresponding X-ray images.
Image-text contrastive learning (ITC) is applied to the text encoder and image encoder to obtain text and image features, respectively. The standard vision Transformer (ViT) model [60] is used as the image encoder; the chest X-ray image is split into patches and encoded into a sequence of visual patch features, while the triplet descriptions are encoded into text features by the text encoder.
In the model inference phase, only the input images need to be fed into the model to obtain quality assessments for different aspects of the images. Combining knowledge graph triplets with ITC has two significant advantages: (1) redundant information in the X-ray diagnosis report and X-ray quality control report can be removed, and (2) medical domain knowledge can be naturally integrated with the image-text contrastive learning model, thus effectively improving model performance.
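To make the training and inference flows concrete, the following is a minimal sketch in PyTorch-style Python. The encoder objects, the prompt dictionary, and all function names are hypothetical placeholders used for illustration, not the authors' implementation.

```python
import torch

def train_step(image, pos_texts, neg_texts, image_encoder, text_encoder, itc_loss):
    """One hypothetical training step: the diagnosis/quality-control reports have
    already been converted (entity extraction + alignment) into positive triplet
    descriptions `pos_texts`, e.g. "<scapula, overlap, lung field>", and manually
    negated counterparts `neg_texts`."""
    img_feats = image_encoder(image)   # patch-level visual features from the ViT
    t_pos = text_encoder(pos_texts)    # positive text features
    t_neg = text_encoder(neg_texts)    # negative text features
    return itc_loss(img_feats, t_pos, t_neg)   # fine-grained image-text contrastive loss

@torch.no_grad()
def infer(image, image_encoder, text_encoder, quality_prompts):
    """Inference needs only the image: each quality rule is scored by comparing
    the image features with its pre-defined positive/negative prompt pair."""
    img_feats = image_encoder(image)                       # (num_patches, dim)
    scores = {}
    for name, (pos_prompt, neg_prompt) in quality_prompts.items():
        s_pos = (img_feats @ text_encoder(pos_prompt).T).mean()
        s_neg = (img_feats @ text_encoder(neg_prompt).T).mean()
        scores[name] = torch.softmax(torch.stack([s_pos, s_neg]), dim=0)[0].item()
    return scores   # estimated probability that each disqualification factor is present
```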
Chest X-Ray Quality Assessment Triplet
A. Joint Entity Relation Extraction
Entities (e.g., mediastinum, scapula, lung field, trachea, etc.) and attributes (e.g., clear, unclear, overlapping, aligned, etc.) are identified from the chest X-ray diagnosis report and the quality control report, respectively. In this paper, the joint entity-relation extraction method based on parameter sharing [57] is applied to the chest X-ray diagnosis report to obtain triplets of the form <head entity, relation, tail entity>.
Given a sentence $x_{j}$ from the annotated report corpus $D$, let $T_{j}$ denote the set of triplets $(h,r,t)$ it contains, where $h$, $r$, and $t$ are the head entity, relation, and tail entity, and $R$ is the set of all relations. The likelihood of the triplets is factorized as \begin{align*} &\hspace {-0.5pc}\prod _{j=1}^{|D|} \left [{ \prod _{(h,r,t) \in T_{j} } p((h,r,t)|x_{j}) }\right] \tag {1}\\ &=\prod _{j=1}^{|D|} \left [{ \prod _{h \in T_{j} } p(h|x_{j}) \prod _{(r,t) \in T_{j}|h} p((r,t)|h,x_{j}) }\right] \tag {2}\\ &=\prod _{j=1}^{|D|} \left [{ \prod _{h \in T_{j} } p(h|x_{j}) \prod _{r \in T_{j}|h} p_{r}(t|h,x_{j}) \prod _{r \in R \setminus T_{j}|h} p_{r}(t_{\varnothing }|h,x_{j}) }\right] \tag {3}\end{align*} where $T_{j}|h$ denotes the relations associated with head entity $h$ in $T_{j}$ and $t_{\varnothing}$ denotes a null tail entity for relations not paired with $h$.
The entity extraction model first finds all possible head entities in the sentence. Then, for each found head entity, the relation extraction model finds all of its correlated relations and the corresponding tail entities. Through formulas (1), (2), and (3), multiple entity-relation triplets can be extracted from one sentence.
In this paper, the pre-trained language representation model BERT [58], based on a multi-layer bidirectional Transformer, is used to extract features of the sentence $x_{j}$:\begin{align*} h_{0}&=SW_{h} + W_{p} \tag {4}\\ h_{\alpha} &= Trans(h_{\alpha - 1}), \alpha \in \left [{ 1,N }\right] \tag {5}\end{align*} where $S$ is the one-hot matrix of sub-word indices of $x_{j}$, $W_{h}$ is the sub-word embedding matrix, $W_{p}$ is the positional embedding matrix, $Trans(\cdot)$ denotes a Transformer encoder layer, and $N$ is the number of Transformer layers. The output of the last layer is used as the token encoding of the sentence.
Head entity identification adopts a binary tagging method that directly decodes the above encoding results to identify all possible head entities. We use a linear layer with a sigmoid activation function $\sigma(\cdot)$ to determine whether each token is the beginning or the end token of a head entity, as shown in the following formulas:\begin{align*} P_{i}^{start\_{}h}&=\sigma (W_{start}X_{i}+b_{start}) \tag {6}\\ P_{i}^{end\_{}h}&=\sigma (W_{end}X_{i}+b_{end}) \tag {7}\end{align*} where $X_{i}$ is the encoding of the $i$-th token.
After identifying the head entity, the relation and the tail entity are jointly recognized. Here, a group of relation-specific tail entity taggers is implemented. The structure of the tail entity recognition layer is similar to that of the head entity recognition layer; the main difference lies in the input data. The input of the head entity recognition layer is the output of the encoding layer, while the input of the tail entity recognition layer jointly considers the token encoding $X_{i}$ and the feature $\nu _{sub}^{k}$ of the $k$-th identified head entity:\begin{align*} P_{i}^{start\_{}t}&=\sigma (W_{start}^{r} (X_{i}+\nu _{sub}^{k})+b_{start}^{r}) \tag {8}\\ P_{i}^{end\_{}t}&=\sigma (W_{end}^{r} (X_{i}+\nu _{sub}^{k})+b_{end}^{r}) \tag {9}\end{align*}
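As a concrete illustration of Eqs. (4)-(9), the following is a minimal PyTorch sketch of the cascade tagging model. It assumes a BERT encoder from the Hugging Face transformers library and a fixed relation set; the checkpoint name, hidden size, and the way the head-entity feature is averaged into the token encodings are illustrative assumptions rather than the exact implementation of [57].

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CascadeTripletTagger(nn.Module):
    """Sketch of the parameter-sharing joint extraction model: BERT encodes the
    report sentence (Eqs. (4)-(5)), a binary tagger marks head-entity start/end
    tokens (Eqs. (6)-(7)), and relation-specific taggers mark tail entities
    conditioned on the detected head entity (Eqs. (8)-(9))."""

    def __init__(self, num_relations, bert_name="bert-base-chinese", hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)       # encoding layers
        self.head_start = nn.Linear(hidden, 1)                 # Eq. (6)
        self.head_end = nn.Linear(hidden, 1)                   # Eq. (7)
        self.tail_start = nn.Linear(hidden, num_relations)     # Eq. (8), one tagger per relation
        self.tail_end = nn.Linear(hidden, num_relations)       # Eq. (9)

    def forward(self, input_ids, attention_mask, head_span=None):
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        p_head_start = torch.sigmoid(self.head_start(x)).squeeze(-1)   # P_i^{start_h}
        p_head_end = torch.sigmoid(self.head_end(x)).squeeze(-1)       # P_i^{end_h}

        p_tail_start = p_tail_end = None
        if head_span is not None:
            # v_sub: averaged token encodings of one detected head entity (assumed pooling)
            s, e = head_span
            v_sub = x[:, s:e + 1].mean(dim=1, keepdim=True)
            p_tail_start = torch.sigmoid(self.tail_start(x + v_sub))   # P_i^{start_t} per relation
            p_tail_end = torch.sigmoid(self.tail_end(x + v_sub))       # P_i^{end_t} per relation
        return p_head_start, p_head_end, p_tail_start, p_tail_end
```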
B. Entity Alignment
After the triplets have been extracted from the chest X-ray diagnosis report and the quality control report, entity alignment is performed to identify entities in the two triplet sets that refer to the same anatomical structure or attribute.
Without loss of generality, we use the BERT-based interaction model [59] to align the two triplet sets extracted from the diagnosis report and the quality control report.
The pre-trained BERT model is used to encode the name/description of each entity. For a given entity $e$, the representation of the [CLS] token used for downstream classification is taken and mapped by a multi-layer perceptron (MLP):\begin{equation*} C(e)=MLP(CLS(e)) \tag {10}\end{equation*}
For the two entities $h_{qi}$ and $h_{i}$ to be compared, the cosine similarities between the embeddings of the neighbors of $h_{qi}$ (rows) and the neighbors of $h_{i}$ (columns) form the similarity matrix $S$:\begin{equation*} S_{ij}=\frac {C(h_{qi}) \cdot C(h_{i})}{\| C(h_{qi}) \| \cdot \| C(h_{i}) \|} \tag {11}\end{equation*}
Then the matrix is aggregated along the row direction and the column direction respectively. In the row-direction aggregation, a maximum pooling operation is carried out for each row; for the $i$-th row vector of $S$, the row maximum is\begin{equation*} S_{i}^{max}=\max _{j=0}^{n} \{ S_{i0},S_{i1}, \cdots, S_{in} \} \tag {12}\end{equation*}
The Gaussian kernel functions are then used to perform a one-to-many mapping of $S_{i}^{max}$ and to aggregate the row maxima into the row-direction similarity feature:\begin{align*} K_{l}(S_{i}^{max})&=exp \left [{ -\frac {(S_{i}^{max}-\mu _{l})^{2}}{2\delta _{l}^{2}} }\right] \tag {13}\\ K^{r}(S_{i})&= \left [{ K_{1}(S_{i}^{max}), \cdots, K_{l}(S_{i}^{max}), \cdots, K_{L}(S_{i}^{max}) }\right] \tag{13.1}\\ \phi ^{r}(N(h_{qi}),N(h_{i})) &=\frac {1}{|N(h_{qi})|}\sum _{i=1}^{|N(h_{qi})|}logK^{r}(S_{i}) \tag{13.2}\end{align*} where $\mu_{l}$ and $\delta_{l}$ are the mean and width of the $l$-th kernel, $L$ is the number of kernels, and $N(\cdot)$ denotes the set of neighboring entities.
Column-direction aggregation follows the same steps. Then, the two aggregated vectors are combined according to Equation (14) to obtain the neighbor-view interactive similarity vector:\begin{align*} \phi (N(h_{qi}),N(h_{i}))=\phi ^{r}(N(h_{qi}),N(h_{i}))\oplus \phi ^{c} (N(h_{qi}),N(h_{i})) \\ \tag {14}\end{align*}
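The neighbor-view interaction of Eqs. (10)-(14) can be sketched as follows; the kernel means and widths, the embedding dimensions, and the interpretation of the aggregation operator as concatenation are illustrative assumptions.

```python
import torch

def entity_embed(cls_vec, mlp):
    """C(e) = MLP(CLS(e)), Eq. (10): map the BERT [CLS] vector of an entity
    name/description through a multi-layer perceptron."""
    return mlp(cls_vec)

def neighbor_view_similarity(nq, nc, mu, sigma):
    """Eqs. (11)-(14): cosine-similarity matrix between the neighbor embeddings
    of two candidate entities, row/column max pooling, Gaussian-kernel mapping,
    and log-kernel aggregation.

    nq: (m, d) neighbor embeddings of h_qi;  nc: (n, d) neighbor embeddings of h_i
    mu, sigma: (L,) kernel means and widths (illustrative values below)."""
    nq = nq / nq.norm(dim=-1, keepdim=True)
    nc = nc / nc.norm(dim=-1, keepdim=True)
    S = nq @ nc.T                                                        # Eq. (11)

    def aggregate(s_max):                                                # s_max: (k,)
        K = torch.exp(-(s_max[:, None] - mu) ** 2 / (2 * sigma ** 2))    # Eq. (13)
        return torch.log(K.clamp_min(1e-10)).mean(dim=0)                 # log-kernel mean

    phi_row = aggregate(S.max(dim=1).values)    # Eq. (12), row direction
    phi_col = aggregate(S.max(dim=0).values)    # column direction
    return torch.cat([phi_row, phi_col])        # Eq. (14), aggregation read as concatenation

# Illustrative kernel parameters and a usage example (assumed, not from the paper)
mu = torch.linspace(-1.0, 1.0, steps=11)
sigma = torch.full((11,), 0.1)
phi = neighbor_view_similarity(torch.randn(4, 32), torch.randn(6, 32), mu, sigma)
```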
For the two entities to be aligned, their attributes and corresponding attribute values are also compared. The vector averages of the attribute and attribute-value embeddings are used as the attribute representations of the two entities.
Similar to the construction of the entity similarity matrix, the entity attribute similarity matrix can be obtained, and finally the attribute similarity vector is derived. On this basis, the entity similarity vector, the neighbor similarity vector, and the attribute similarity vector are concatenated to obtain the similarity vector of a pair of triplets, and the similarity score between the entities is then calculated with a multi-layer perceptron.
In the process of entity triplet alignment, candidate aligned entities are first retrieved according to the highest cosine similarity between the entity embeddings $C(e)$; the final alignment is then determined by the similarity score computed with the multi-layer perceptron described above.
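For illustration, the two-stage alignment decision could look roughly like the sketch below, where the number of retrieved candidates, the similarity-vector construction, and the scoring MLP are all hypothetical placeholders.

```python
import torch

def align_entities(query_embs, cand_embs, similarity_vectors, scorer_mlp, top_k=5):
    """Sketch of triplet alignment: retrieve candidates by cosine similarity of the
    entity embeddings C(e), then re-score the retrieved pairs with an MLP applied to
    the concatenated similarity vectors (entity + neighbor + attribute views).

    query_embs: (Q, d), cand_embs: (C, d)
    similarity_vectors: callable (qi, cj) -> concatenated similarity vector."""
    q = query_embs / query_embs.norm(dim=-1, keepdim=True)
    c = cand_embs / cand_embs.norm(dim=-1, keepdim=True)
    cos = q @ c.T                                   # (Q, C) cosine similarities
    topk = cos.topk(top_k, dim=-1).indices          # candidate set per query entity

    matches = []
    for qi in range(q.size(0)):
        scores = torch.stack([scorer_mlp(similarity_vectors(qi, cj)).squeeze()
                              for cj in topk[qi].tolist()])
        matches.append(topk[qi][scores.argmax()].item())
    return matches                                  # aligned candidate index per query entity
```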
Image-Text Cross-Modal Contrastive Learning
After the triplets are obtained from the chest X-ray diagnosis report and quality control report and entity alignment is completed, a positive-sample triplet description is obtained for the corresponding chest X-ray image. Negative-sample triplet descriptions are then constructed by changing the entity relations/attributes. The image and text encoding features are obtained with feature extractors of the respective modalities, and the improved fine-grained image-text contrastive function is used for cross-modal feature alignment.
A. Multi-Modal Feature Extraction
For a given chest X-ray image and its corresponding positive/negative sample triplets, image features and text features are extracted by the image encoder and text encoder, respectively. The image encoder is the standard vision Transformer (ViT) model [60]: the chest X-ray image is split into fixed-size patches, which are linearly embedded and encoded into a sequence of visual patch features. The text encoder encodes the positive and negative triplet descriptions into the corresponding text features.
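A minimal sketch of the two encoders is given below, using the Hugging Face transformers CLIP implementation as a stand-in; the checkpoint name and the use of projected patch-level hidden states as fine-grained visual features are assumptions, since the paper fine-tunes its own encoders.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Assumed stand-in checkpoint; the actual model is fine-tuned on chest X-ray data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def encode(image: Image.Image, pos_text: str, neg_text: str):
    """Return patch-level visual features and positive/negative text features."""
    inputs = processor(text=[pos_text, neg_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        # drop the [CLS] token, keep the N patch tokens as fine-grained visual features
        patch_feats = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])
        text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    return patch_feats.squeeze(0), text_feats[0], text_feats[1]
```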
B. Cross-Modal Feature Alignment
Traditional image-text cross-modal contrastive learning methods, such as CLIP shown in Figure 4 (a), generally align the global feature of the image with the text features of the positive and negative samples. Denoting the resulting positive and negative similarity scores by $f_{pos}$ and $f_{neg}$ and the temperature by $\pi$, the contrastive loss is\begin{equation*} Loss_{itc}=-\log \frac {exp(f_{pos}/\pi)}{exp(f_{pos}/\pi) +exp(f_{neg}/\pi)} \tag {15}\end{equation*}
The overall framework of our proposed method. In the training phase, through entity extraction, entity relationship extraction and entity alignment, the text data from chest X-ray diagnosis report and quality control report are converted to the knowledge graph triplets for chest X-ray image quality assessment. The obtained triples (e.g., <scapula, overlap, lung field>) are treated as positive samples, whereas the corresponding negative samples (e.g., <shoulder blades, no overlap, lung fields>) are manually constructed. Image-text contrastive learning (ITC) is applied to the text encoder and image encoder to obtain text and image features respectively (Tpos, Tneg, I). The image-text contrastive loss is adopted to maximize the feature similarity of positive pairs. In the model inference phase, only input images are needed to feed into the model to obtain quality assessments for different aspects of the images. Such combination of knowledge graph triplets and ITC can effectively improve the accuracy of chest X-ray image quality assessment.
Illustration of the different image-text cross-modal contrastive learning methods. (a) Image-text cross-modal contrastive learning. (b) Fine-grained image-text cross-modal contrastive learning, which is more suitable for the situation in this paper where each image has multiple quality failure factors corresponding to multiple different regions of the image, as shown in Figure 2 (n).
Specifically, instead of aligning the global feature of an image with multiple text features, we align all visual patch features of the image with the text features, since these visual features contain more fine-grained image information. As shown in Figure 4 (b), each visual feature in the sequence is compared with the positive and negative text features; let $f_{pos(i)}$ and $f_{neg(i)}$ denote the similarities between the $i$-th visual patch feature and the positive and negative text features, respectively, and let $N$ be the number of patches. The similarities are aggregated with softmax weights computed from the positive similarities:\begin{align*} f_{pos}&=\sum _{i}^{N} Softmax(f_{pos(i)})\cdot f_{pos(i)} \tag {16}\\ f_{neg}&=\sum _{i}^{N} Softmax(f_{pos(i)})\cdot f_{neg(i)} \tag {17}\\ Softmax(f_{pos(i)})&=\frac {exp(f_{pos(i)})}{\sum _{j} exp(f_{pos(j)})} \tag {18}\end{align*}
Finally, the image-text contrastive (ITC) loss is again used to optimize the model parameters:\begin{equation*} Loss_{itc}=-\log \frac {exp(f_{pos}/\pi)}{exp(f_{pos}/\pi) + exp(f_{neg}/\pi)} \tag {19}\end{equation*}
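A minimal PyTorch sketch of Eqs. (16)-(19) is given below, assuming the per-patch similarities are computed as dot products between the visual patch features and the text features; the variable names and the temperature value are illustrative.

```python
import torch

def fine_grained_itc_loss(patch_feats, t_pos, t_neg, tau=0.07):
    """Eqs. (16)-(19): per-patch similarities with the positive text feature are
    turned into softmax weights, which aggregate both the positive and the negative
    similarities; the aggregated scores feed an InfoNCE-style contrastive loss.

    patch_feats: (N, D) visual patch features;  t_pos, t_neg: (D,) text features.
    tau: temperature (pi in the paper)."""
    f_pos_i = patch_feats @ t_pos                 # (N,) per-patch positive similarities
    f_neg_i = patch_feats @ t_neg                 # (N,) per-patch negative similarities

    w = torch.softmax(f_pos_i, dim=0)             # Eq. (18)
    f_pos = (w * f_pos_i).sum()                   # Eq. (16)
    f_neg = (w * f_neg_i).sum()                   # Eq. (17)

    logits = torch.stack([f_pos, f_neg]) / tau
    return -torch.log_softmax(logits, dim=0)[0]   # Eq. (19), contrastive loss
```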
Experiment and Analysis
A. Data Preparation
The experimental data used in this study come from the large-scale chest X-ray dataset ChestX-ray8 [61] released by NIH, which contains 112,120 frontal-view X-ray images from 30,805 patients. A total of 1000 qualified images are selected from the dataset as the negative set, while 4846 disqualified images are selected as the positive set. The 13 disqualification factors are: overlapping of scapula and lung field, misalignment of clavicles on both sides, inconsistent clavicle height on both sides, asymmetrical sternoclavicular joints, unclear lung apex, lung apex not included, too little empty exposure above the shoulders, diaphragm not included, costophrenic angle not included, asymmetrical ribcage on both sides, asymmetrical lung fields on both sides, existence of abnormal objects, and motion blur. In total, 5846 chest X-ray images are included in the experimental dataset. The sample distribution over the image quality factors is shown in Table 1.
In the experiment, 70% of the data (4092 chest X-ray images) are randomly selected as the training set, 20% (1169 images) as the validation set, and 25% (1461 images) as the test set.
In the data preprocessing of the training and inference phases, the visual encoder adopts the same image resolution as in [19].
B. Model Training Details
For each label in the chest X-ray image, we use an SGD optimizer with an initial learning rate of 0.002, decayed by the cosine annealing rule. The model is trained for 1000 epochs, and the best-performing checkpoint over the last 200 epochs is selected. The entire model training is done on a server with 8 NVIDIA RTX 3090 GPU cards.
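The optimizer configuration described above corresponds roughly to the following PyTorch setup; the model, data, momentum value, and multi-label criterion shown here are placeholders used only to illustrate the SGD plus cosine-annealing schedule.

```python
import torch

# Placeholder model and data loader; only the optimization schedule is illustrated.
model = torch.nn.Linear(512, 13)
train_loader = [(torch.randn(8, 512), torch.randint(0, 2, (8, 13)).float())]

optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
criterion = torch.nn.BCEWithLogitsLoss()

for epoch in range(1000):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine-annealing decay of the learning rate per epoch
```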
C. Evaluation Metric
FR-IQA methods mostly use SSIM, MS-SSIM, VIF, IFC, FSIM, and other quality assessment metrics. NR-IQA cannot use these metrics due to the lack of reference images.
In our experiment, precision, recall, F1-Score, and false positive rate (FPR) are used to evaluate the performance of our proposed algorithm. The confusion matrix is defined as in Figure 5.
The definitions of the evaluation metrics are as follows.
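In terms of the confusion matrix entries, i.e., true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), the metrics take their standard forms:\begin{align*} Precision&=\frac {TP}{TP+FP}, \quad Recall=\frac {TP}{TP+FN},\\ F1\text{-}Score&=\frac {2\cdot Precision \cdot Recall}{Precision+Recall}, \quad FPR=\frac {FP}{FP+TN}\end{align*}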
D. Experimental Results and Analysis
From the experimental results shown in Table 2, it can be observed that among the 13 influencing factors, 10 F1-Scores are higher than 0.95 and 3 are between 0.92 and 0.95. The result on the existence of abnormal objects class is comparatively lower; more elaborately designed methods could be adopted in the future to further improve it. Overall, the performance of our proposed model is satisfactory, and it can be used for chest X-ray image quality assessment.
To further verify the effectiveness of the proposed method, we carried out the following ablation study. On the same training data, we experimented with (a) image features only, (b) global image features + text features, and (c) fine-grained image features + text features. As shown in Table 3, the results show that using the vision-text model achieves better performance than using image features alone, and the fine-grained vision-text model achieves the best performance.
Conclusion
In this paper, we propose a chest X-ray quality assessment method that combines image-text contrastive learning with medical domain knowledge. Specifically, it integrates large-scale real clinical chest X-rays with diagnostic report text and fine-tunes the pre-trained model on contrastive text-image pairs. It achieves cross-domain transfer learning and saves doctors the huge workload of labeling multi-modal medical data. The integration of knowledge graph triplet information into the deep learning model provides a new solution for knowledge- and data-driven machine learning methods. The proposed method can be extended to other tasks such as multi-lesion segmentation on medical images and prediction of disease progression. The experimental results and analysis show that the proposed method achieves good performance.