I. Introduction
Medical image segmentation is one of the most critical tasks in medical image analysis. In clinical practice, accurate segmentation results are often obtained manually or semi-automatically. Extracting the desired object accurately remains challenging, especially when the target organ has highly complex tissue structures. Recent research shows that deep learning is a promising approach for automatic medical image segmentation, since expert knowledge can be learned and encoded by a deep model. A summary of existing solutions is shown in Figure 1(a): (1) one shared encoder followed by two separate decoders [1]; (2) two separate encoders followed by one shared decoder [2]; (3) two separate encoders followed by a modality interaction model [3].

However, two inherent issues in creating high-quality medical image datasets severely limit the application of these methods: the difficulty of obtaining high-quality images, and the high cost of data annotation [4], [5]. These two issues have substantially limited the performance of medical image segmentation models. Since improving the quantity and quality of the medical images themselves is challenging, it may be more feasible to use complementary, easy-to-access information to compensate for their quality defects. We therefore turn our attention to the written medical notes that accompany medical images. Medical text records are routinely generated during patient care, so no extra cost is needed to access the corresponding text data. Because these text records and the image data are naturally complementary, the text information can compensate for quality deficiencies in the medical image data.
On the other hand, expert segmentation annotation is often expensive and time-consuming, especially for new diseases such as COVID-19, for which high-quality annotations are even more difficult to obtain [4], [6], [7]. To address the issue of under-annotated data, some approaches go beyond traditional supervised learning and train on both labeled data and the more widely available unlabeled data, e.g., semi-supervised learning [5], [8] and weakly-supervised learning [9]. However, the performance of these approaches is largely determined by the credibility of the pseudo labels, because the number of pseudo labels is much larger than that of ground-truth labels. The critical question is therefore how to improve the quality of the pseudo labels. To address this issue, we develop a model that can be trained with the medical texts written by domain experts. By learning this additional expert knowledge from the text, we can improve the quality of the pseudo labels.
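To illustrate why pseudo-label credibility matters, consider a common semi-supervised heuristic (a generic baseline, not the method proposed in this paper): keep only high-confidence pixel predictions as pseudo labels and exclude uncertain pixels from the loss. The function name, the confidence threshold, and the use of −1 as an ignore marker below are illustrative assumptions:

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Confidence-thresholded pseudo-labeling (illustrative sketch).

    probs: (H, W, C) array of per-pixel class probabilities
           (e.g., softmax output of a segmentation model).
    Returns a (H, W) label map where low-confidence pixels are
    marked -1 so a training loss can ignore them.
    """
    pseudo = probs.argmax(axis=-1)       # hard pseudo label per pixel
    confidence = probs.max(axis=-1)      # confidence of that label
    pseudo[confidence < threshold] = -1  # drop uncertain pixels
    return pseudo

# Toy example: a 2x2 "image" with 2 classes.
probs = np.array([[[0.95, 0.05], [0.60, 0.40]],
                  [[0.20, 0.80], [0.99, 0.01]]])
labels = select_pseudo_labels(probs, threshold=0.9)
# Only the two confident pixels (0.95 and 0.99) keep their labels:
# labels == [[0, -1], [-1, 0]]
```

Raising the threshold makes the retained pseudo labels more credible but leaves fewer unlabeled pixels usable, which is exactly the trade-off that additional expert knowledge (here, from text) can help resolve.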
Figure 1. Comparison of current medical image segmentation models and our proposed LViT model.