I. Introduction
Medical image segmentation is one of the most critical tasks in medical image analysis. In clinical practice, accurate segmentation results are often obtained manually or semi-automatically. Extracting the desired object accurately remains challenging, especially when the target organ has highly complex tissue structures. Recent research shows that deep learning is a promising approach for automatic medical image segmentation, since expert knowledge can be learned and encoded by a deep model. A summary of existing solutions is shown in Figure 1(a): (1) one shared encoder followed by two separate decoders [1]; (2) two separate encoders followed by one shared decoder [2]; (3) two separate encoders followed by a modality interaction model [3].

However, two inherent issues in creating high-quality medical image datasets severely limit the application of these methods: the difficulty of obtaining high-quality images, and the high cost of data annotation [4], [5]. These two issues have substantially limited the performance of medical image segmentation models. Since improving the quantity and quality of the medical images themselves is challenging, it may be more feasible to use complementary, easy-to-access information to compensate for their quality defects. We therefore turn our attention to the written medical notes that accompany medical images. Medical text records are routinely generated during patient care, so no extra cost is needed to access the corresponding text data. Because these text records and the image data are naturally complementary, the text information can compensate for quality deficiencies in the medical image data.
On the other hand, expert segmentation annotation is often expensive and time-consuming, especially for new diseases such as COVID-19, for which high-quality annotations are even more difficult to obtain [4], [6], [7]. To address the issue of under-annotated data, some approaches go beyond traditional supervised learning and train on both labeled data and the more widely available unlabeled data, e.g., semi-supervised learning [5], [8] and weakly-supervised learning [9]. However, the performance of these approaches is largely determined by the credibility of the pseudo labels, because the number of pseudo labels is much larger than that of ground-truth labels. The critical question is therefore how to improve the quality of the pseudo labels. To address this issue, we develop a model that can be trained with the medical texts written by domain experts. By learning this additional expert knowledge from the text, we can improve the quality of the pseudo labels.
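To illustrate why pseudo-label credibility matters, consider a common semi-supervised heuristic (a generic baseline, not the method proposed in this paper): keep only high-confidence pixel predictions as pseudo labels and exclude uncertain pixels from the loss. The function name, the confidence threshold, and the use of −1 as an ignore marker below are illustrative assumptions:

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Confidence-thresholded pseudo-labeling (illustrative sketch).

    probs: (H, W, C) array of per-pixel class probabilities
           (e.g., softmax output of a segmentation model).
    Returns a (H, W) label map where low-confidence pixels are
    marked -1 so a training loss can ignore them.
    """
    pseudo = probs.argmax(axis=-1)       # hard pseudo label per pixel
    confidence = probs.max(axis=-1)      # confidence of that label
    pseudo[confidence < threshold] = -1  # drop uncertain pixels
    return pseudo

# Toy example: a 2x2 "image" with 2 classes.
probs = np.array([[[0.95, 0.05], [0.60, 0.40]],
                  [[0.20, 0.80], [0.99, 0.01]]])
labels = select_pseudo_labels(probs, threshold=0.9)
# Only the two confident pixels (0.95 and 0.99) keep their labels:
# labels == [[0, -1], [-1, 0]]
```

Raising the threshold makes the retained pseudo labels more credible but leaves fewer unlabeled pixels usable, which is exactly the trade-off that additional expert knowledge (here, from text) can help resolve.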
Figure 1. Comparison of current medical image segmentation models and our proposed LViT model.