I. INTRODUCTION
Sequential decision making is of vital importance to the design and operation of real-world complex systems in transportation, energy, and healthcare applications, where agents interact with an uncertain environment and make decisions over time to maximize some form of cumulative reward. Such dynamic information-gathering and decision-making problems can be modeled as Markov Decision Processes (MDPs) [1]. Recently, reinforcement learning (RL) has become a popular approach for solving MDPs. In standard RL, one seeks a policy that maximizes the expected total discounted reward, which we refer to as risk-neutral RL. However, maximizing the expected reward does not necessarily avoid rare but undesirable outcomes; in situations where maintaining reliable performance is important, we aim instead to evaluate and control risk.
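To make the risk-neutral objective concrete, it can be stated in the following standard form, where the notation ($\pi$ for a policy, $r_t$ for the reward at step $t$, and $\gamma \in (0,1)$ for the discount factor) is chosen here for illustration:
\[
\max_{\pi} \; \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right].
\]
Risk-sensitive formulations typically replace the expectation above with a risk measure of the return distribution, so that the tails of the distribution, and not only its mean, influence the chosen policy.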