1 Introduction
Deep neural networks (DNNs) have demonstrated superior performance in many artificial intelligence applications, such as pattern recognition and natural language processing [1], [2], [3]. However, researchers have recently found that even a highly accurate DNN can be vulnerable to carefully crafted adversarial examples that are intentionally designed to mislead it into making incorrect predictions [4], [5], [6], [7], [8]. For example, an attacker can make imperceptible modifications to a panda image (from $\mathsf{I}_1$ to $\mathsf{I}_2$ in Fig. 1) that mislead a state-of-the-art DNN model [9] into classifying it as a monkey.

This phenomenon poses a high risk when DNNs are applied to safety- and security-critical applications, such as driverless cars, face-recognition ATMs, and Face ID security on mobile phones [10]. For example, researchers have recently shown that even a state-of-the-art public Face ID system can be fooled by a carefully crafted sticker on a hat [11]. Thus, there is an urgent need to understand how adversarial examples are processed during prediction and to identify the root cause of their incorrect predictions [10], [12]. Such an understanding is valuable for developing adversarially robust solutions [13], [14], [15]. A recent survey identifies two important questions that require analysis [10]: (1) why similar images (e.g., adversarial and normal panda images) diverge into different predictions, and (2) why images from different classes (e.g., adversarial panda images and normal monkey images) merge into the same prediction.
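To make the notion of an imperceptible modification concrete, the sketch below illustrates a gradient-sign (FGSM-style) perturbation in PyTorch. It is only an illustration under stated assumptions: the pretrained ResNet-50 classifier, the perturbation budget epsilon, the random placeholder image, and the ImageNet label index are hypothetical choices and are not part of this work.

import torch
import torch.nn.functional as F
from torchvision import models

def fgsm_perturb(model, image, label, epsilon=0.007):
    # Compute the loss gradient w.r.t. the input image and take a small step
    # in its sign direction; this increases the loss while bounding the
    # per-pixel change by epsilon, keeping the modification visually subtle.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage: perturb a (placeholder) panda image.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
panda = torch.rand(1, 3, 224, 224)     # stand-in for a real, preprocessed image
true_label = torch.tensor([388])       # ImageNet class index for "giant panda"
adv_panda = fgsm_perturb(model, panda, true_label)
print(model(adv_panda).argmax(dim=1))  # may no longer predict "giant panda"

Even though adv_panda differs from panda by at most epsilon per pixel, the perturbation is constructed to push the input across the model's decision boundary, which is exactly the behavior the two questions above ask to explain.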