I. Introduction
In the field of medicine, extending vision-language models (VLMs) to few-shot medical image classification is a critical task. In real-world scenarios, medical datasets containing only a few dozen images are common, since annotated medical images are difficult to obtain and it is rarely feasible to collect examples of every possible disease [1]. This makes few-shot medical image classification particularly important. However, little research has specifically addressed VLMs for this crucial task. Existing studies indicate that directly applying VLMs such as CLIP to the medical domain may not yield satisfactory results, for at least two reasons. First, CLIP and many other VLMs are pre-trained on natural image-text pairs and therefore lack medical domain knowledge, which leads to subpar performance [2], [3]. Second, the category texts of medical images often contain highly abstract medical terminology, which poses a further challenge for existing VLMs [1], [4]. Therefore, it is necessary to adapt the CLIP model to make it suitable for few-shot medical image classification tasks.