I. Introduction
Multi-modal medical images are widely adopted in disease screening and diagnosis because they provide complementary soft-tissue characteristics and diagnostic information. For instance, commonly acquired magnetic resonance (MR) sequences include T1-weighted, T2-weighted, post-contrast T1-weighted (T1Gd), and fluid-attenuated inversion recovery (FLAIR) images, each of which is considered a distinct modality that highlights specific anatomy and pathology. Clinically, a combination of multiple modalities is often used to characterize pathological changes and assist clinicians in making accurate diagnoses. However, obtaining complete multi-modal images for each patient can be challenging due to factors such as limited scanning time, motion- or artifact-induced image corruption, and the use of different imaging protocols [1]. Simply discarding incomplete data is undesirable, as it often contains valuable information, while re-scanning patients to acquire the missing sequences is infeasible due to the high cost of data acquisition. Therefore, multi-modal image synthesis (also known as data imputation) has been explored to generate missing modalities from the limited available data, which has the potential to benefit downstream data analysis (e.g., segmentation [2], registration [3]), improve diagnostic accuracy for diseases such as Alzheimer's disease [4], and assist in surgical planning [5].