I. Introduction
With the rapid expansion of online platforms and social media, disinformation has emerged as a pervasive and concerning phenomenon. Disinformation, a term that encompasses both malinformation (e.g., cyberbullying) and misinformation (e.g., fake news), has caused severe problems for online users. While cyberbullying leads to emotional and psychological consequences [14], [24], [43], the proliferation of fake news also harms individuals by inducing stress and anxiety [11], and can even damage society by fostering polarization [4]. The prevalence of disinformation highlights the urgent need for effective detection methods to mitigate its harmful impact and create safer online environments [9].

Previous studies on disinformation detection often rely on textual features of the post [10], [18], [22] or on user and social media features [8], [13], [27], [35]. This line of work may fall short when facing adversarial attacks, as demonstrated in Figure 1, which pose additional challenges for accurate detection. By intentionally manipulating input data, adversarial attacks can deceive models into making incorrect or unexpected predictions.

Existing literature utilizes masked language modeling (MLM) [17] or perturbs inflections [36] to create adversarial samples. However, MLM may generate tokens that reverse the semantics of a whole sentence (e.g., good in "The food is good" may be replaced by bad) and requires heuristics to avoid such cases. In addition, perturbing inflections can only cover a restricted range of replacements in order to preserve semantic consistency.
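To make the MLM failure mode concrete, the following is a minimal sketch of MLM-based token replacement using the Hugging Face transformers fill-mask pipeline. The model choice and top-k setting are illustrative assumptions for exposition, not the setup of [17].

```python
# Minimal sketch: MLM-based token replacement for adversarial sample
# generation. The model ("bert-base-uncased") and top_k value are
# illustrative assumptions, not the configuration used in [17].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Mask the sentiment-bearing token of the example from the text.
masked = "The food is [MASK]."

# Inspect the top-k candidate replacements proposed by the MLM.
for candidate in fill_mask(masked, top_k=10):
    # Candidates such as "great" preserve the original semantics, but
    # antonyms such as "bad" may also rank highly, reversing the
    # polarity of the whole sentence -- the failure mode that requires
    # extra heuristics to filter out.
    print(f"{candidate['token_str']:>12s}  score={candidate['score']:.3f}")
```

Because the MLM scores candidates only by contextual fit, semantics-reversing substitutions can receive high probability, which is why heuristic filtering (e.g., excluding antonyms of the masked token) is needed before such candidates are used as adversarial samples.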