Introduction
Digital X-ray radiography, with its advantages of low radiation dose, fast imaging speed, and high image definition, is a common imaging method widely used in clinical diagnosis [1]. During chest X-ray imaging, many factors, e.g., the patient's position, breathing status, exposure, irradiation field, center line, and projection angle, affect the imaging quality and directly influence the doctor's diagnosis. High-quality chest X-ray images help clinicians accurately diagnose diseases, while low-quality X-ray images are likely to cause misdiagnosis or missed diagnosis [2]. At present, the quality judgment of chest X-ray images mainly depends on the subjective evaluation of radiologists. The results depend on the skill level and experience of the evaluators, and the evaluation process demands high concentration and a heavy workload. An image quality assessment method based on machine recognition can evaluate chest X-ray image quality objectively, avoid subjective uncertainties, and effectively improve efficiency.
Image Quality Assessment (IQA) is one of the basic technologies in image processing [3]. It is widely used in algorithm design and analysis, system performance evaluation, etc., and supports tasks such as image super-resolution, image restoration, and the evaluation and debugging of imaging product parameters.
According to the evaluation method, IQA can be divided into subjective quality assessment and objective quality assessment. The subjective IQA method judges image quality through observers' subjective scoring. Its advantage is that it truly reflects people's subjective visual experience, and the results are direct, accurate, and reliable. Mean Opinion Score (MOS) and Differential Mean Opinion Score (DMOS) are the main forms of subjective scores [4]. However, this method is usually affected by objective factors such as the observation environment and the number of experimenters, and thus has many limitations. The objective IQA method establishes a mathematical model based on the human visual system and evaluates or scores the images to be tested. It has the advantages of batch processing and reproducible results, which makes it easier to apply in various scenarios.
Objective image quality assessment methods can be divided into Full-Reference Image Quality Assessment (FR-IQA), Reduced-Reference Image Quality Assessment (RR-IQA) and No-Reference Image Quality Assessment (NR-IQA) according to their processing methods. Among them, FR-IQA algorithms rely entirely on the reference image and evaluate the image by detecting the difference between the distorted image and its corresponding original undistorted image. There are already many FR-IQA methods, such as SSIM [5], MS-SSIM [6], VIF [7], IFC [8], FSIM [9], etc. Due to the existence of reference images, FR-IQA can usually accurately measure the quality of distorted images. However, their usefulness is limited due to the difficulty in obtaining reference images in practical applications. RR-IQA algorithms utilize the representative information of the reference image and its distorted image (e.g., some statistical features of the reference image, such as power spectrum). Image quality evaluation is conducted by measuring the similarity of these features in the reference image and the distorted image [10], [11], [12], [13]. NR-IQA algorithms evaluate the image quality without using the original reference image. In terms of practicality, NR-IQA methods are widely used because no information about the reference image is required [14], [15], [16], [17]. So far, designing an effective NR-IQA method remains a challenging research topic.
Since there is often no gold-standard image in medical imaging, quality assessment mainly relies on NR-IQA. Early NR-IQA methods mainly used handcrafted features, including many local and global features, but their limited feature expression ability resulted in low performance. With the rapid development of deep learning, deep features are increasingly used in NR-IQA, and their performance far exceeds that of handcrafted features.
Although NR-IQA has made much progress, the complexity of real clinical image data and the many factors causing low quality mean that directly applying it to the quality assessment of chest X-rays still faces many problems. Firstly, the most commonly used NR-IQA datasets, such as LIVE, TID2008, and TID2013, contain either natural distortions or artificially simulated distortions, whose data distribution and characteristics are far from real medical imaging data. Secondly, the prediction accuracy of these algorithms is mainly measured by the correlation with subjective human scoring, which cannot meet the requirements of the medical domain for medical image quality [18]. Finally, most existing methods assess quality based on chest X-ray image features [3], [19], [20]: they semantically segment the diagnostic regions with image segmentation algorithms and then judge image quality with classification algorithms. However, such methods require a large amount of annotated chest X-ray data, which imposes a huge annotation workload on doctors and challenges current data-driven deep learning methods.
At present, chest X-ray quality assessment methods that rely only on medical image feature learning still depend heavily on expensive, expert-annotated datasets. Collating these data requires heavy work in data collection, sampling, and manual labeling, which makes it difficult to scale up. This expensive data collation process limits the size of the dataset, which in turn hinders effective improvement of the algorithm's performance. In recent years, with the continuous development of transfer learning, a number of unsupervised multimodal pre-training models using image-text pairs have emerged and achieved excellent performance in downstream tasks such as image classification and image segmentation [21], [22]. Considering that each patient's chest X-ray image has a corresponding diagnosis report and quality control report, it is possible to construct X-ray image-text pairs by applying natural language processing to these reports and obtain knowledge descriptions related to chest X-ray image quality. The model can then be fine-tuned from the CLIP pre-trained model to obtain better performance. In addition, since text features have a certain generalization ability, they can effectively improve the generalization performance of the model.
A single chest X-ray image may have several quality problems at the same time; for example, one image may simultaneously exhibit uneven clavicles and abnormal external objects. There are therefore multiple quality judgment rules, and each rule relates to a local area of the image, which makes this a typical multi-label image feature learning problem. The fine-grained contrastive learning of local image features used in DualCoOp [23] can further improve performance in this setting.
Therefore, inspired by the large multimodal pre-training models CLIP [21] and ALIGN [22], we propose a chest X-ray quality assessment method that organically combines image-text contrastive learning with medical domain knowledge fusion. First, an algorithm framework for chest X-ray quality assessment is presented. It achieves cross-domain transfer learning by fusing large-scale real clinical chest X-ray images with diagnosis report text based on Contrastive Language-Image Pre-training (CLIP) and model fine-tuning. While improving the prediction accuracy of the algorithm, the cost of massive data labeling is avoided, and the local visual patch features of the X-ray image are aligned with multiple text features so that the visual features contain more fine-grained image information. The contributions of this paper can be summarized as follows:
A medical image quality assessment method integrating medical domain knowledge is proposed. The text data from chest X-ray image diagnosis reports and quality control reports are converted into triplet information through knowledge extraction and knowledge fusion, which serves as medical domain knowledge guidance for the medical images during model training.
Exploiting large-scale medical images together with their corresponding diagnostic reports and quality control reports, cross-domain transfer learning is accomplished by fine-tuning the pre-trained model on contrastive text-image pairs, which effectively helps overcome the problem of insufficient training data and avoids heavy manual data labeling.
By aligning local visual patch features of the X-ray image to multiple text features, the visual features contain more fine-grained image information, thereby addressing the problem that global visual information cannot reflect the local image quality relevant to disease diagnosis.
Related Work
A. Factors Affecting the Disqualification of Chest X-Ray Images
According to the working experience and technical guidelines of radiologists [18], high-quality chest X-ray images should meet the following conditions, as illustrated in Figure 1: (1) The image should have no artifacts or severe image noise. Bad cases include abnormal external objects not belonging to the body (except those that cannot be removed), motion blur, post-processing artifacts, and equipment artifacts; no granular noise should be visible to the naked eye in soft-tissue density areas. (2) The position of the patient should be correct and appropriate, and 80% of the scapula should be moved outside the lung field. (3) Both sides of the thorax should be symmetrical, the two clavicles should be flat, and the sternoclavicular joints should be symmetrical. (4) The irradiation field should be properly selected, and no chest tissue should be hidden by occluding cuts (including both lung apexes, the lateral edge of the ribs above the diaphragm, and both costophrenic angles). (5) The chest projection should be in the center of the image. The upper part of the chest X-ray should include the lung apices on both sides, with a 3–5 cm empty exposure area visible above the soft-tissue shadow of the shoulder; the lower part should include the bilateral costophrenic angles and about 1–3 cm below them; the outer sides of the chest should include the outer edge of the ribs and the soft tissue of the chest wall. (6) The bilateral lung fields, trachea and adjacent bronchi, heart and aortic border, diaphragm and bilateral costophrenic angles, and the lung fields and mediastinum behind the heart shadow should be clearly displayed. (7) The boundaries between lung field and mediastinum, lung field and chest wall, and lung field and shoulder soft tissue should be distinguishable. The lung field texture should be clear and sharp, mediastinal soft tissue should be vaguely distinguishable, and lung texture overlapping with the heart shadow should be clearly displayed. (8) There should be no image blur caused by patient movement, and no visible breathing or heartbeat motion blur or diaphragm ghosting in the diagnostic area.
According to the above requirements, we identified 13 common problems in chest X-ray image quality assessment. As shown in Figure 2, we provide 13 examples of unqualified chest X-ray images with comparisons, including (a) overlapping of scapula and lung field, (b) misalignment of clavicles on both sides, (c) inconsistent clavicle height on both sides, (d) asymmetrical sternoclavicular joints, (e) unclear lung apex, (f) lung apex not included, (g) too little empty exposure above the shoulders, (h) diaphragm not included, (i) costophrenic angle not included, (j) asymmetrical ribcage on both sides, (k) asymmetrical lung fields on both sides, (l) existence of abnormal objects, and (m) motion blur. The left part of each image is the original image, and the right part is the unqualified image. Figure 2-(n) contains multiple disqualification factors, namely overlapping of scapula and lung field, misalignment of clavicles on both sides, inconsistent clavicle height on both sides, asymmetrical sternoclavicular joints, unclear lung apex, asymmetrical ribcage on both sides, asymmetrical lung fields on both sides, and existence of abnormal objects, which are marked in different colors.
Examples of disqualified chest X-ray images: (a) overlapping of scapula and lung field, (b) misalignment of clavicles on both sides, (c) inconsistent clavicle height on both sides, (d) asymmetrical sternoclavicular joints, (e) unclear lung apex, (f) lung apex not included, (g) too little empty exposure above the shoulders, (h) diaphragm not included, (i) costophrenic angle not included, (j) asymmetrical ribcage on both sides, (k) asymmetrical lung fields on both sides, (l) existence of abnormal objects, (m) motion blur; (n) multiple disqualification factors. Marks on the images: lake blue indicates that the scapula overlaps with the lung field; sky blue indicates that the clavicles on both sides are not in the same position and height; yellow indicates asymmetry of the sternoclavicular joints on both sides; dark red indicates that the apex of the lung is not clear; gray indicates thoracic asymmetry; green indicates lung field asymmetry; pink indicates abnormal objects.
B. No-Reference Image Quality Assessment Methods
No-Reference Image Quality Assessment (NR-IQA) is an important type of objective quality evaluation method. Since no original reference image is needed, it has broad application prospects. Feichtenhofer et al. proposed a sharpness measurement method based on the statistical analysis of local edge gradients for the no-reference assessment of images distorted by blur and noise [24]. Mittal et al. did not use subjective opinion scores for training, but performed NR-IQA by calculating locally normalized luminance coefficients in the spatial domain, which is more competitive than peak signal-to-noise ratio and structural similarity [16]. Min et al. proposed an NR-IQA framework based on pseudo reference images, which generates a pseudo reference image as a new reference for the distorted image [25].
In recent years, with the continuous development of deep learning, a variety of NR-IQA methods based on deep learning have been proposed. Kang et al. first proposed using a CNN to effectively measure the distortion degree of local image regions, and the experimental results showed that its performance surpassed traditional methods [26]. Gu et al. proposed a vector regression framework for NR-IQA, which estimated the trust score vector with CNNs and improved the evaluation performance by using object-oriented pooling [27]. Gao et al. used two objective quality metrics, peak signal-to-noise ratio and structural similarity, instead of subjective scoring to train the network, addressing the lack of annotated CT data [28]. End-to-end deep learning NR-IQA methods integrate feature extraction and fitting/regression into a unified framework and optimize them simultaneously, and they are the current mainstream NR-IQA scheme. The main ideas include using a GAN to restore the distorted image, computing the loss and distance between the restored and distorted images to generate a quality score, and using rank learning to compensate for the lack of large-scale IQA datasets and improve model accuracy. In addition, attention mechanisms are used to increase the weight of the region of interest when assessing image quality, and prior knowledge is learned through meta-learning to mitigate the small scale of no-reference image datasets.
High-quality medical images are a prerequisite for radiologists to accurately diagnose and treat diseases. Considering that deep learning has achieved remarkable results in object detection, organ segmentation, and image classification, no-reference quality assessment of medical images with deep learning has gradually attracted more attention and become a research hotspot. For example, Eck et al. proposed a CT quality assessment algorithm that realizes NR-IQA while ensuring the lesion detection rate [29]. Mortamet et al. proposed a method for fully automatic assessment of 3D MRI quality by analyzing the air background pattern in MRI [30]. Esses et al. used neural networks to assess the image quality of whole T2-weighted liver MRI images [31]. Wang et al. used a two-step convolutional neural network to evaluate the image quality of regions of interest on liver MRI [32]. Xiao-Qian et al. used neural networks to grade DR chest radiographs into four levels: excellent, good, medium, and poor [33]. Wu et al. realized automatic quality assessment of prenatal fetal ultrasound images with a two-step convolutional neural network [34]. Li et al. used an improved AlexNet to score the quality of CT images, demonstrating the potential of CNNs for no-reference quality assessment of medical images [35]; however, the lack of high-quality scoring datasets remained unsolved. Due to the particularity of pathological images, medical image quality cannot be determined by the whole image alone, and the quality of the local regions used for disease diagnosis is often more important. At present, most research works are based on the overall structure of medical images; the scale at which they assess quality is relatively coarse, and they do not analyze the local regions of the images that matter for disease diagnosis.
Currently, there are only a few research works on no-reference medical image quality assessment based on deep learning. One of the main reasons is that the network needs a large amount of manually scored data for training, and such manual quality scoring and annotation of massive medical images is time-consuming and labor-intensive. At the same time, the current mainstream data-driven deep learning methods have achieved satisfactory results in many cases. To address these problems, many scholars have incorporated prior knowledge into the machine learning process to improve model performance. For instance, logical rules [36], [37] or algebraic equations [38], [39] have been added as constraints on loss functions, and the feature representation of neural networks has been enhanced by using the association information between instances in the form of a knowledge graph [40]. These methods have achieved better performance in image classification tasks [41], [42]. A growing body of results shows that data- and knowledge-driven methods are playing an important role in more and more fields.
At the same time, the two-stage training paradigm of "pre-training and fine-tuning" has gradually become one of the mainstream learning schemes in deep learning. Pre-training on large-scale datasets can significantly improve model performance and generalization ability. In fact, large-scale data not only helps define the approximation of the target problem, but is also necessary to ensure asymptotic convergence [43]. In the field of vision-language pre-training (VLP), CLIP [21] and ALIGN [22] collected millions of image-text pairs for learning visual representations from natural language supervision, which has been proved to be transferable to various downstream tasks, such as vision-and-language [44], image [45], and video tasks [46]. They directly align visual and linguistic features through an image-text contrastive (ITC) loss, which can also be extended to large-scale datasets with high training efficiency.
The multi-label image recognition task obtains a more comprehensive understanding of an image by identifying the multiple semantic labels present in it [47], [48], [49], [50]. Prompt learning provides an effective way to transfer pre-trained vision-language models to other tasks and has achieved success in many of them [51], [52], [53], [54], [55], [56]. However, these methods mainly focus on matching a single label per image and cannot handle the case where an image has multiple labels. DualCoOp [23] proposes a fine-grained contrastive learning method that effectively improves the ability to recognize multiple objects in multi-label recognition tasks. Considering that each chest X-ray image may have multiple quality problems, as shown in Figure 2 (n), we draw on its design to identify multiple chest X-ray image quality factors.
To summarize, NR-IQA is an important kind of objective quality assessment and has significant application potential in medical image quality assessment, and deep learning based NR-IQA has become a research hotspot. However, existing research suffers from the lack of high-quality labeled data; global feature evaluation of medical images cannot reflect the quality of the local information that doctors focus on for disease diagnosis; and end-to-end data-driven learning methods lack medical domain knowledge fusion. These issues cause a bottleneck in quality assessment. In this paper, we combine image-text contrastive learning with medical domain knowledge. Using large-scale chest X-ray images together with diagnostic reports and quality control reports, a pre-trained model is fine-tuned on contrastive text-image pairs to achieve cross-domain transfer learning. The local visual patch features of X-ray images are aligned with multiple text features to ensure that the visual features contain more fine-grained image information.
Chest X-Ray Quality Assessment With Medical Knowledge
Our proposed framework contains the chest X-ray quality control knowledge triplets, a text encoder, and an image encoder, as illustrated in Figure 1. First, in the training phase, the knowledge rules of chest X-ray standardized quality control [18] are converted into triplets of the form <head entity, relation, tail entity> (e.g., <scapula, overlap, lung field>) that describe the corresponding X-ray images.
Image-text contrastive learning (ITC) is applied to the text encoder and image encoder to obtain text and image features, respectively. The standard vision Transformer (ViT) model [60] is used as the image encoder; the chest X-ray image is split into patches and encoded into a sequence of visual patch features, while the triplet descriptions are encoded into text features by the text encoder.
In the model inference phase, only the input images need to be fed into the model to obtain quality assessments for different aspects of the images. Combining knowledge graph triplets with ITC has two significant advantages: (1) redundant information in the X-ray diagnosis report and X-ray quality control report can be removed, and (2) medical domain knowledge can be naturally integrated with the image-text contrastive learning model, thus effectively improving model performance.
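To make the training and inference flows concrete, the following is a minimal sketch in PyTorch-style Python. The encoder objects, the prompt dictionary, and all function names are hypothetical placeholders used for illustration, not the authors' implementation.

```python
import torch

def train_step(image, pos_texts, neg_texts, image_encoder, text_encoder, itc_loss):
    """One hypothetical training step: the diagnosis/quality-control reports have
    already been converted (entity extraction + alignment) into positive triplet
    descriptions `pos_texts`, e.g. "<scapula, overlap, lung field>", and manually
    negated counterparts `neg_texts`."""
    img_feats = image_encoder(image)   # patch-level visual features from the ViT
    t_pos = text_encoder(pos_texts)    # positive text features
    t_neg = text_encoder(neg_texts)    # negative text features
    return itc_loss(img_feats, t_pos, t_neg)   # fine-grained image-text contrastive loss

@torch.no_grad()
def infer(image, image_encoder, text_encoder, quality_prompts):
    """Inference needs only the image: each quality rule is scored by comparing
    the image features with its pre-defined positive/negative prompt pair."""
    img_feats = image_encoder(image)                       # (num_patches, dim)
    scores = {}
    for name, (pos_prompt, neg_prompt) in quality_prompts.items():
        s_pos = (img_feats @ text_encoder(pos_prompt).T).mean()
        s_neg = (img_feats @ text_encoder(neg_prompt).T).mean()
        scores[name] = torch.softmax(torch.stack([s_pos, s_neg]), dim=0)[0].item()
    return scores   # estimated probability that each disqualification factor is present
```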
Chest X-Ray Quality Assessment Triplet
A. Joint Entity Relation Extraction
Entities (e.g., mediastinum, scapula, lung field, trachea, etc.) and attributes (e.g., clear, unclear, overlapping, aligned, etc.) are identified from the chest X-ray diagnosis report and the quality control report, respectively. In this paper, the joint entity-relation extraction method based on parameter sharing [57] is applied to the chest X-ray diagnosis report to obtain triplets of the form <head entity, relation, tail entity>.
Given a sentence $x_{j}$ from the annotated report corpus $D$, let $T_{j}$ denote the set of triplets $(h,r,t)$ it contains, where $h$, $r$, and $t$ are the head entity, relation, and tail entity, and $R$ is the set of all relations. The likelihood of the triplets is factorized as \begin{align*} &\hspace {-0.5pc}\prod _{j=1}^{|D|} \left [{ \prod _{(h,r,t) \in T_{j} } p((h,r,t)|x_{j}) }\right] \tag {1}\\ &=\prod _{j=1}^{|D|} \left [{ \prod _{h \in T_{j} } p(h|x_{j}) \prod _{(r,t) \in T_{j}|h} p((r,t)|h,x_{j}) }\right] \tag {2}\\ &=\prod _{j=1}^{|D|} \left [{ \prod _{h \in T_{j} } p(h|x_{j}) \prod _{r \in T_{j}|h} p_{r}(t|h,x_{j}) \prod _{r \in R \setminus T_{j}|h} p_{r}(t_{\varnothing }|h,x_{j}) }\right] \tag {3}\end{align*} where $T_{j}|h$ denotes the relations associated with head entity $h$ in $T_{j}$ and $t_{\varnothing}$ denotes a null tail entity for relations not paired with $h$.
The entity extraction model first finds all possible head entities in the sentence. Then, for each found head entity, the relation extraction model finds all of its correlated relations and the corresponding tail entities. Through formulas (1), (2), and (3), multiple entity-relation triplets can be extracted from one sentence.
In this paper, the pre-trained language representation model BERT [58], based on a multi-layer bidirectional Transformer, is used to extract features of the sentence $x_{j}$:\begin{align*} h_{0}&=SW_{h} + W_{p} \tag {4}\\ h_{\alpha} &= Trans(h_{\alpha - 1}), \alpha \in \left [{ 1,N }\right] \tag {5}\end{align*} where $S$ is the one-hot matrix of sub-word indices of $x_{j}$, $W_{h}$ is the sub-word embedding matrix, $W_{p}$ is the positional embedding matrix, $Trans(\cdot)$ denotes a Transformer encoder layer, and $N$ is the number of Transformer layers. The output of the last layer is used as the token encoding of the sentence.
Head entity identification adopts a binary tagging method that directly decodes the above encoding results to identify all possible head entities. We use a linear layer with a sigmoid activation function $\sigma(\cdot)$ to determine whether each token is the beginning or the end token of a head entity, as shown in the following formulas:\begin{align*} P_{i}^{start\_{}h}&=\sigma (W_{start}X_{i}+b_{start}) \tag {6}\\ P_{i}^{end\_{}h}&=\sigma (W_{end}X_{i}+b_{end}) \tag {7}\end{align*} where $X_{i}$ is the encoding of the $i$-th token.
After identifying the head entity, the relation and the tail entity are jointly recognized. Here, a group of relation-specific tail entity taggers is implemented. The structure of the tail entity recognition layer is similar to that of the head entity recognition layer; the main difference lies in the input data. The input of the head entity recognition layer is the output of the encoding layer, while the input of the tail entity recognition layer jointly considers the token encoding $X_{i}$ and the feature $\nu _{sub}^{k}$ of the $k$-th identified head entity:\begin{align*} P_{i}^{start\_{}t}&=\sigma (W_{start}^{r} (X_{i}+\nu _{sub}^{k})+b_{start}^{r}) \tag {8}\\ P_{i}^{end\_{}t}&=\sigma (W_{end}^{r} (X_{i}+\nu _{sub}^{k})+b_{end}^{r}) \tag {9}\end{align*}
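As a concrete illustration of Eqs. (4)-(9), the following is a minimal PyTorch sketch of the cascade tagging model. It assumes a BERT encoder from the Hugging Face transformers library and a fixed relation set; the checkpoint name, hidden size, and the way the head-entity feature is averaged into the token encodings are illustrative assumptions rather than the exact implementation of [57].

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CascadeTripletTagger(nn.Module):
    """Sketch of the parameter-sharing joint extraction model: BERT encodes the
    report sentence (Eqs. (4)-(5)), a binary tagger marks head-entity start/end
    tokens (Eqs. (6)-(7)), and relation-specific taggers mark tail entities
    conditioned on the detected head entity (Eqs. (8)-(9))."""

    def __init__(self, num_relations, bert_name="bert-base-chinese", hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)       # encoding layers
        self.head_start = nn.Linear(hidden, 1)                 # Eq. (6)
        self.head_end = nn.Linear(hidden, 1)                   # Eq. (7)
        self.tail_start = nn.Linear(hidden, num_relations)     # Eq. (8), one tagger per relation
        self.tail_end = nn.Linear(hidden, num_relations)       # Eq. (9)

    def forward(self, input_ids, attention_mask, head_span=None):
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        p_head_start = torch.sigmoid(self.head_start(x)).squeeze(-1)   # P_i^{start_h}
        p_head_end = torch.sigmoid(self.head_end(x)).squeeze(-1)       # P_i^{end_h}

        p_tail_start = p_tail_end = None
        if head_span is not None:
            # v_sub: averaged token encodings of one detected head entity (assumed pooling)
            s, e = head_span
            v_sub = x[:, s:e + 1].mean(dim=1, keepdim=True)
            p_tail_start = torch.sigmoid(self.tail_start(x + v_sub))   # P_i^{start_t} per relation
            p_tail_end = torch.sigmoid(self.tail_end(x + v_sub))       # P_i^{end_t} per relation
        return p_head_start, p_head_end, p_tail_start, p_tail_end
```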
B. Entity Alignment
After the triplets have been extracted from the chest X-ray diagnosis report and the quality control report, entity alignment is performed to identify entities in the two triplet sets that refer to the same anatomical structure or attribute.
Without loss of generality, we use the BERT-based interaction model [59] to align the two triplet sets extracted from the diagnosis report and the quality control report.
The pre-trained BERT model is used to encode the name/description of each entity. For a given entity $e$, the representation of the [CLS] token used for downstream classification is taken and mapped by a multi-layer perceptron (MLP):\begin{equation*} C(e)=MLP(CLS(e)) \tag {10}\end{equation*}
For the two entities $h_{qi}$ and $h_{i}$ to be compared, the cosine similarities between the embeddings of the neighbors of $h_{qi}$ (rows) and the neighbors of $h_{i}$ (columns) form the similarity matrix $S$:\begin{equation*} S_{ij}=\frac {C(h_{qi}) \cdot C(h_{i})}{\| C(h_{qi}) \| \cdot \| C(h_{i}) \|} \tag {11}\end{equation*}
Then the matrix is aggregated along the row direction and the column direction respectively. In the row-direction aggregation, a maximum pooling operation is carried out for each row; for the $i$-th row vector of $S$, the row maximum is\begin{equation*} S_{i}^{max}=\max _{j=0}^{n} \{ S_{i0},S_{i1}, \cdots, S_{in} \} \tag {12}\end{equation*}
The Gaussian kernel functions are then used to perform a one-to-many mapping of $S_{i}^{max}$ and to aggregate the row maxima into the row-direction similarity feature:\begin{align*} K_{l}(S_{i}^{max})&=exp \left [{ -\frac {(S_{i}^{max}-\mu _{l})^{2}}{2\delta _{l}^{2}} }\right] \tag {13}\\ K^{r}(S_{i})&= \left [{ K_{1}(S_{i}^{max}), \cdots, K_{l}(S_{i}^{max}), \cdots, K_{L}(S_{i}^{max}) }\right] \tag{13.1}\\ \phi ^{r}(N(h_{qi}),N(h_{i})) &=\frac {1}{|N(h_{qi})|}\sum _{i=1}^{|N(h_{qi})|}logK^{r}(S_{i}) \tag{13.2}\end{align*} where $\mu_{l}$ and $\delta_{l}$ are the mean and width of the $l$-th kernel, $L$ is the number of kernels, and $N(\cdot)$ denotes the set of neighboring entities.
Column-direction aggregation follows the same steps. Then, the two aggregated vectors are combined according to Equation (14) to obtain the neighbor-view interactive similarity vector:\begin{align*} \phi (N(h_{qi}),N(h_{i}))=\phi ^{r}(N(h_{qi}),N(h_{i}))\oplus \phi ^{c} (N(h_{qi}),N(h_{i})) \\ \tag {14}\end{align*}
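The neighbor-view interaction of Eqs. (10)-(14) can be sketched as follows; the kernel means and widths, the embedding dimensions, and the interpretation of the aggregation operator as concatenation are illustrative assumptions.

```python
import torch

def entity_embed(cls_vec, mlp):
    """C(e) = MLP(CLS(e)), Eq. (10): map the BERT [CLS] vector of an entity
    name/description through a multi-layer perceptron."""
    return mlp(cls_vec)

def neighbor_view_similarity(nq, nc, mu, sigma):
    """Eqs. (11)-(14): cosine-similarity matrix between the neighbor embeddings
    of two candidate entities, row/column max pooling, Gaussian-kernel mapping,
    and log-kernel aggregation.

    nq: (m, d) neighbor embeddings of h_qi;  nc: (n, d) neighbor embeddings of h_i
    mu, sigma: (L,) kernel means and widths (illustrative values below)."""
    nq = nq / nq.norm(dim=-1, keepdim=True)
    nc = nc / nc.norm(dim=-1, keepdim=True)
    S = nq @ nc.T                                                        # Eq. (11)

    def aggregate(s_max):                                                # s_max: (k,)
        K = torch.exp(-(s_max[:, None] - mu) ** 2 / (2 * sigma ** 2))    # Eq. (13)
        return torch.log(K.clamp_min(1e-10)).mean(dim=0)                 # log-kernel mean

    phi_row = aggregate(S.max(dim=1).values)    # Eq. (12), row direction
    phi_col = aggregate(S.max(dim=0).values)    # column direction
    return torch.cat([phi_row, phi_col])        # Eq. (14), aggregation read as concatenation

# Illustrative kernel parameters and a usage example (assumed, not from the paper)
mu = torch.linspace(-1.0, 1.0, steps=11)
sigma = torch.full((11,), 0.1)
phi = neighbor_view_similarity(torch.randn(4, 32), torch.randn(6, 32), mu, sigma)
```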
For the two entities to be aligned, their attributes and corresponding attribute values are also compared. The vector averages of the attribute and attribute-value embeddings are used as the attribute representations of the two entities.
Similar to the construction of the entity similarity matrix, the entity attribute similarity matrix can be obtained, and finally the attribute similarity vector is derived. On this basis, the entity similarity vector, the neighbor similarity vector, and the attribute similarity vector are concatenated to obtain the similarity vector of a pair of triplets, and the similarity score between the entities is then calculated with a multi-layer perceptron.
In the process of entity triplet alignment, candidate aligned entities are first retrieved according to the highest cosine similarity between the entity embeddings $C(e)$; the final alignment is then determined by the similarity score computed with the multi-layer perceptron described above.
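For illustration, the two-stage alignment decision could look roughly like the sketch below, where the number of retrieved candidates, the similarity-vector construction, and the scoring MLP are all hypothetical placeholders.

```python
import torch

def align_entities(query_embs, cand_embs, similarity_vectors, scorer_mlp, top_k=5):
    """Sketch of triplet alignment: retrieve candidates by cosine similarity of the
    entity embeddings C(e), then re-score the retrieved pairs with an MLP applied to
    the concatenated similarity vectors (entity + neighbor + attribute views).

    query_embs: (Q, d), cand_embs: (C, d)
    similarity_vectors: callable (qi, cj) -> concatenated similarity vector."""
    q = query_embs / query_embs.norm(dim=-1, keepdim=True)
    c = cand_embs / cand_embs.norm(dim=-1, keepdim=True)
    cos = q @ c.T                                   # (Q, C) cosine similarities
    topk = cos.topk(top_k, dim=-1).indices          # candidate set per query entity

    matches = []
    for qi in range(q.size(0)):
        scores = torch.stack([scorer_mlp(similarity_vectors(qi, cj)).squeeze()
                              for cj in topk[qi].tolist()])
        matches.append(topk[qi][scores.argmax()].item())
    return matches                                  # aligned candidate index per query entity
```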
Image-Text Cross-Modal Contrastive Learning
After the triplets are obtained from the chest X-ray diagnosis report and quality control report and entity alignment is completed, a positive-sample triplet description is obtained for the corresponding chest X-ray image. Negative-sample triplet descriptions are then constructed by changing the entity relations/attributes. The image and text encoding features are obtained with feature extractors of the respective modalities, and the improved fine-grained image-text contrastive function is used for cross-modal feature alignment.
A. Multi-Modal Feature Extraction
For a given chest X-ray image and its corresponding positive/negative sample triplets, image features and text features are extracted by the image encoder and text encoder, respectively. The image encoder is the standard vision Transformer (ViT) model [60]: the chest X-ray image is split into fixed-size patches, which are linearly embedded and encoded into a sequence of visual patch features. The text encoder encodes the positive and negative triplet descriptions into the corresponding text features.
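A minimal sketch of the two encoders is given below, using the Hugging Face transformers CLIP implementation as a stand-in; the checkpoint name and the use of projected patch-level hidden states as fine-grained visual features are assumptions, since the paper fine-tunes its own encoders.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Assumed stand-in checkpoint; the actual model is fine-tuned on chest X-ray data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def encode(image: Image.Image, pos_text: str, neg_text: str):
    """Return patch-level visual features and positive/negative text features."""
    inputs = processor(text=[pos_text, neg_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        # drop the [CLS] token, keep the N patch tokens as fine-grained visual features
        patch_feats = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])
        text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    return patch_feats.squeeze(0), text_feats[0], text_feats[1]
```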
B. Cross-Modal Feature Alignment
Traditional image-text cross-modal contrastive learning methods, such as CLIP shown in Figure 4 (a), generally align the global feature of the image with the text features of the positive and negative samples. Denoting the resulting positive and negative similarity scores by $f_{pos}$ and $f_{neg}$ and the temperature by $\pi$, the contrastive loss is\begin{equation*} Loss_{itc}=-\log \frac {exp(f_{pos}/\pi)}{exp(f_{pos}/\pi) +exp(f_{neg}/\pi)} \tag {15}\end{equation*}
The overall framework of our proposed method. In the training phase, through entity extraction, entity relationship extraction and entity alignment, the text data from chest X-ray diagnosis report and quality control report are converted to the knowledge graph triplets for chest X-ray image quality assessment. The obtained triples (e.g., <scapula, overlap, lung field>) are treated as positive samples, whereas the corresponding negative samples (e.g., <shoulder blades, no overlap, lung fields>) are manually constructed. Image-text contrastive learning (ITC) is applied to the text encoder and image encoder to obtain text and image features respectively (Tpos, Tneg, I). The image-text contrastive loss is adopted to maximize the feature similarity of positive pairs. In the model inference phase, only input images are needed to feed into the model to obtain quality assessments for different aspects of the images. Such combination of knowledge graph triplets and ITC can effectively improve the accuracy of chest X-ray image quality assessment.
Illustration of the different image-text cross-modal contrastive learning methods. (a) Image-text cross-modal contrastive learning. (b) Fine-grained image-text cross-modal contrastive learning, which is more suitable for the situation in this paper where each image has multiple quality failure factors corresponding to multiple different regions of the image, as shown in Figure 2 (n).
Specifically, instead of aligning the global feature of an image with multiple text features, we align all visual patch features of the image with the text features, since these visual features contain more fine-grained image information. As shown in Figure 4 (b), each visual feature in the sequence is compared with the positive and negative text features; let $f_{pos(i)}$ and $f_{neg(i)}$ denote the similarities between the $i$-th visual patch feature and the positive and negative text features, respectively, and let $N$ be the number of patches. The similarities are aggregated with softmax weights computed from the positive similarities:\begin{align*} f_{pos}&=\sum _{i}^{N} Softmax(f_{pos(i)})\cdot f_{pos(i)} \tag {16}\\ f_{neg}&=\sum _{i}^{N} Softmax(f_{pos(i)})\cdot f_{neg(i)} \tag {17}\\ Softmax(f_{pos(i)})&=\frac {exp(f_{pos(i)})}{\sum _{j} exp(f_{pos(j)})} \tag {18}\end{align*}
Finally, the image-text contrastive (ITC) loss is again used to optimize the model parameters:\begin{equation*} Loss_{itc}=-\log \frac {exp(f_{pos}/\pi)}{exp(f_{pos}/\pi) + exp(f_{neg}/\pi)} \tag {19}\end{equation*}
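A minimal PyTorch sketch of Eqs. (16)-(19) is given below, assuming the per-patch similarities are computed as dot products between the visual patch features and the text features; the variable names and the temperature value are illustrative.

```python
import torch

def fine_grained_itc_loss(patch_feats, t_pos, t_neg, tau=0.07):
    """Eqs. (16)-(19): per-patch similarities with the positive text feature are
    turned into softmax weights, which aggregate both the positive and the negative
    similarities; the aggregated scores feed an InfoNCE-style contrastive loss.

    patch_feats: (N, D) visual patch features;  t_pos, t_neg: (D,) text features.
    tau: temperature (pi in the paper)."""
    f_pos_i = patch_feats @ t_pos                 # (N,) per-patch positive similarities
    f_neg_i = patch_feats @ t_neg                 # (N,) per-patch negative similarities

    w = torch.softmax(f_pos_i, dim=0)             # Eq. (18)
    f_pos = (w * f_pos_i).sum()                   # Eq. (16)
    f_neg = (w * f_neg_i).sum()                   # Eq. (17)

    logits = torch.stack([f_pos, f_neg]) / tau
    return -torch.log_softmax(logits, dim=0)[0]   # Eq. (19), contrastive loss
```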
Experiment and Analysis
A. Data Preparation
The experimental data used in this study come from the large-scale chest X-ray dataset ChestX-ray8 [61] released by NIH, which contains 112,120 frontal-view X-ray images from 30,805 patients. A total of 1000 qualified images are selected from the dataset as the negative set, while 4846 disqualified images are selected as the positive set. The 13 disqualification factors are: overlapping of scapula and lung field, misalignment of clavicles on both sides, inconsistent clavicle height on both sides, asymmetrical sternoclavicular joints, unclear lung apex, lung apex not included, too little empty exposure above the shoulders, diaphragm not included, costophrenic angle not included, asymmetrical ribcage on both sides, asymmetrical lung fields on both sides, existence of abnormal objects, and motion blur. In total, 5846 chest X-ray images are included in the experimental dataset. The sample distribution over the image quality factors is shown in Table 1.
In the experiment, 70% of the data (4092 chest X-ray images) are randomly selected as the training set, 20% (1169 images) as the validation set, and 25% (1461 images) as the test set.
In the data preprocessing of the training and inference phases, the visual encoder adopts the same image resolution as in [19].
B. Model Training Details
For each label in the chest X-ray image, we use an SGD optimizer with an initial learning rate of 0.002, decayed by the cosine annealing rule. The model is trained for 1000 epochs, and the best-performing checkpoint over the last 200 epochs is selected. The entire model training is done on a server with 8 NVIDIA RTX 3090 GPU cards.
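The optimizer configuration described above corresponds roughly to the following PyTorch setup; the model, data, momentum value, and multi-label criterion shown here are placeholders used only to illustrate the SGD plus cosine-annealing schedule.

```python
import torch

# Placeholder model and data loader; only the optimization schedule is illustrated.
model = torch.nn.Linear(512, 13)
train_loader = [(torch.randn(8, 512), torch.randint(0, 2, (8, 13)).float())]

optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
criterion = torch.nn.BCEWithLogitsLoss()

for epoch in range(1000):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine-annealing decay of the learning rate per epoch
```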
C. Evaluation Metric
FR-IQA methods mostly use SSIM, MS-SSIM, VIF, IFC, FSIM, and other quality assessment metrics. NR-IQA cannot use these metrics due to the lack of reference images.
In our experiment, precision, recall, F1-Score, and false positive rate (FPR) are used to evaluate the performance of our proposed algorithm. The confusion matrix is defined as in Figure 5.
The definitions of the evaluation metrics are as follows.
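In terms of the confusion matrix entries, i.e., true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), the metrics take their standard forms:\begin{align*} Precision&=\frac {TP}{TP+FP}, \quad Recall=\frac {TP}{TP+FN},\\ F1\text{-}Score&=\frac {2\cdot Precision \cdot Recall}{Precision+Recall}, \quad FPR=\frac {FP}{FP+TN}\end{align*}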
D. Experimental Results and Analysis
From the experimental results shown in Table 2, it can be observed that among the 13 influencing factors, 10 F1-Scores are higher than 0.95 and 3 are between 0.92 and 0.95. The result on the existence of abnormal objects class is comparatively lower; more elaborately designed methods could be adopted in the future to further improve it. Overall, the performance of our proposed model is satisfactory, and it can be used for chest X-ray image quality assessment.
To further verify the effectiveness of the proposed method, we carried out the following ablation study. On the same training data, we experimented with (a) image features only, (b) global image features + text features, and (c) fine-grained image features + text features. As shown in Table 3, the results show that using the vision-text model achieves better performance than using image features alone, and the fine-grained vision-text model achieves the best performance.
Conclusion
In this paper, we propose a chest X-ray quality assessment method that combines image-text contrastive learning with medical domain knowledge. Specifically, it integrates large-scale real clinical chest X-rays with diagnostic report text and fine-tunes the pre-trained model on contrastive text-image pairs. It achieves cross-domain transfer learning and saves doctors the huge workload of labeling multi-modal medical data. The integration of knowledge graph triplet information into the deep learning model provides a new solution for knowledge- and data-driven machine learning methods. The proposed method can be extended to other tasks such as multi-lesion segmentation on medical images and prediction of disease progression. The experimental results and analysis show that the proposed method achieves good performance.