I. Introduction
The framework of Markov decision processes (MDPs) has been successfully applied to many types of systems control [1], [2]. Thanks to its simplicity and generality, it can model a wide range of dynamical processes. When the underlying process is unknown, reinforcement learning (RL) techniques have shown great potential for learning to control it. Indeed, over the last decade we have witnessed a surge of interest in RL: by exploiting modern methods in deep learning [3], researchers have reached performance that sometimes surpasses that of humans in games such as Go, Dota, and Atari [4], [5], [6], [7]. RL has also been increasingly used in industrial applications, from temperature control in buildings [8] to health-care [9], financial trading [10], and more. RL-based systems are, however, vulnerable to AI cyber-attacks (e.g., leveraging data poisoning or adversarial samples), and as recently pointed out by Gartner and Microsoft [11], [12], only a small fraction of companies have the right tools in place to secure their ML systems.