I. Introduction
Reinforcement learning (RL) aims to learn an optimal policy from interactions with the environment. RL is formalized in the framework of the Markov decision process (MDP), where the learner gains decision-supporting knowledge about the underlying structure of the environment from a sequence of observations [1]. In online mode, the learner updates its knowledge after each observation using the temporal difference (TD), whereas in batch mode it learns in a single step once enough observations have been collected. Q-learning is a typical RL algorithm in which the learner accumulates knowledge about the underlying Q-function (or action-value function) to determine the optimal policy [2].
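To make the online mode concrete, the following is a minimal sketch of a single tabular Q-learning update driven by the TD error. It assumes discrete states and actions and a Q-function stored in a dictionary; the function name `q_learning_update`, the step size `alpha`, and the discount factor `gamma` are illustrative choices and are not taken from the cited works.

```python
def q_learning_update(q, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.9):
    """Apply one online TD update to a tabular Q-function.

    q is a dict mapping (state, action) pairs to value estimates;
    unseen pairs are treated as 0.0 (an assumption of this sketch).
    Returns the TD error used for the update.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    current = q.get((state, action), 0.0)
    # Temporal difference between the bootstrapped target and the estimate.
    td_error = reward + gamma * best_next - current
    # Move the estimate a fraction alpha toward the target.
    q[(state, action)] = current + alpha * td_error
    return td_error


def greedy_action(q, state, n_actions):
    """Return the action with the highest estimated Q-value in a state."""
    return max(range(n_actions), key=lambda a: q.get((state, a), 0.0))
```

Calling `q_learning_update` once per observed transition corresponds to the online mode described above, and `greedy_action` reads a policy off the accumulated Q-function. A batch variant would instead collect transitions first and fit the Q-function to all of them at once.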