I. Introduction
Reinforcement Learning (RL) provides a framework for solving a wide variety of problems in which the system model is unavailable or intractable [2]. The core idea of RL is that an agent can learn the optimal action to take from experience gained while interacting with the environment. Specifically, the agent observes its current state and chooses an action, after which the environment transitions to the next state. After each action, the agent receives a reward from the environment, which indicates the value of the chosen action. The agent’s objective is to maximize the cumulative reward collected along the trajectory of states and actions. Because it can tackle complex systems, RL has found success in applications as diverse as autonomous driving [3], [4], robotics [5], and smart grids [6], among many others.
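
The interaction loop described above can be illustrated with a minimal sketch. The environment (`ChainEnv`), its reward values, and the helper `run_episode` are hypothetical and chosen only for illustration; they are not taken from the cited works. The sketch shows an agent observing a state, choosing an action, receiving a reward, and accumulating that reward over a trajectory.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: states 0..n_states-1 on a line.

    Reaching the last state ends the episode with reward +1.0;
    every other step costs -0.1 (illustrative values only).
    """

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        move = 1 if action == 1 else -1
        # Keep the state inside the chain.
        self.state = min(self.n_states - 1, max(0, self.state + move))
        done = self.state == self.n_states - 1
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return the cumulative reward and the trajectory."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []  # list of (state, action, reward) tuples
    for _ in range(max_steps):
        action = policy(state)                       # agent chooses an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    random_total, _ = run_episode(ChainEnv(), lambda s: rng.choice([0, 1]))
    greedy_total, _ = run_episode(ChainEnv(), lambda s: 1)
    print("random policy:", random_total)
    print("always-right policy:", greedy_total)
```

In this toy setting the always-right policy reaches the goal in the fewest steps, so no other policy collects a higher cumulative reward. An RL algorithm would have to discover such a policy from experience alone, without access to the transition rule inside `step`.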