1 Introduction
Deep neural networks (DNNs) have become the state of the art in a wide range of artificial intelligence applications, such as image classification and language translation [26, 29, 59, 60]. However, researchers have recently found that DNNs are generally vulnerable to maliciously generated adversarial examples, which are intentionally designed to mislead a DNN into making incorrect predictions [34, 37, 53, 63]. For example, an attacker can modify an image of a panda (I_A in Fig. 1) slightly, even imperceptibly to human eyes, and the resulting adversarial example (I_B in Fig. 1) misleads a state-of-the-art DNN [21] into classifying it as something else entirely (e.g., a monkey), because the DNN detects a monkey’s face in the top right corner of the adversarial example (Fig. 11A). This phenomenon poses a high risk when applying DNNs to safety- and security-critical applications, such as driverless cars, facial recognition ATMs, and Face ID security on mobile phones [1]. Hence, there is a growing need to understand the inner workings of adversarial examples and to identify the root cause of the incorrect predictions.
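To make the attack concrete, the sketch below illustrates one common way such adversarial examples can be generated: the fast gradient sign method, which nudges every pixel by a small amount in the direction that increases the classification loss. This is a minimal illustrative sketch only, not the specific attack studied here; it assumes PyTorch and torchvision, uses a pretrained ResNet-50 as a stand-in for the attacked DNN, and the input file panda.jpg and the perturbation budget eps are hypothetical.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained classifier standing in for the attacked state-of-the-art DNN.
model = models.resnet50(weights="DEFAULT")
model.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# Benign example I_A (hypothetical file); keep pixel values in [0, 1].
image = preprocess(Image.open("panda.jpg").convert("RGB")).unsqueeze(0)
image.requires_grad_(True)

# Label predicted for the benign image.
logits = model(normalize(image))
label = logits.argmax(dim=1)

# Fast gradient sign method: take the gradient of the loss with respect to
# the input pixels and step in its sign direction, within a small budget eps
# so the change stays imperceptible to human eyes.
loss = torch.nn.functional.cross_entropy(logits, label)
loss.backward()

eps = 2.0 / 255.0  # per-pixel perturbation budget (illustrative value)
adversarial = (image + eps * image.grad.sign()).clamp(0.0, 1.0)

# The adversarial example I_B is often assigned a different (wrong) label.
with torch.no_grad():
    adv_label = model(normalize(adversarial)).argmax(dim=1)
print("benign label:", label.item(), "adversarial label:", adv_label.item())
```

Even this single gradient step frequently flips the prediction, which is why understanding where inside the network the benign and adversarial examples diverge is so important.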