I. Introduction
Medical image analysis plays a critical role in healthcare by providing automated solutions for the diagnosis and treatment of various medical conditions. According to [1], the AI in Healthcare market is expected to grow from around USD 14.6 billion to USD 102.7 billion by 2028, as the generation of vast and intricate healthcare datasets continues to grow. The primary objective of medical image analysis is to accurately detect and classify diseases from medical imaging data, ultimately improving patient outcomes and reducing the workload of medical professionals.

In this work, we focus on coronavirus disease 2019 (COVID-19), a highly infectious respiratory illness caused by the SARS-CoV-2 virus [2]. To improve the accuracy and efficiency of COVID-19 diagnosis, we propose ViTMed, a Vision Transformer model that classifies Computed Tomography (CT) scan images. Unlike traditional convolutional neural networks (CNNs), which process images with convolutions, ViTMed treats an image as a sequence of patches: each patch is linearly projected into an embedding, and the resulting sequence is processed by multi-head self-attention layers. This approach has been shown to outperform CNN-based approaches on some datasets [3].

The limited availability of medical images is a persistent challenge, which we mitigate with techniques such as data augmentation. Another challenge is capturing the salient features of the images for accurate classification; we therefore study various algorithms and models to identify the approach that yields the most accurate results. The contributions of this paper are summarized as follows: