I. Introduction
There has been growing interest in applying reinforcement learning (RL) to complex, high-stakes robotics problems, including the control of autonomous vehicles and healthcare assistant robots [1], [2]. As one would expect, before any RL policy is deployed, such safety-critical applications require accurate policy evaluation, because inaccurate estimates could lead to false optimism or even catastrophic consequences. A main difficulty for policy evaluation in these settings, however, is that we are crucially interested in assessing certain rare but high-stakes events in safety-critical systems. The de facto standard evaluation method is vanilla Monte Carlo (MC), which, in a multitude of practical settings of interest, requires a prohibitively large amount of testing before the evaluation can be deemed statistically valid [3]. For example, [1] shows that, in order to demonstrate the safety of self-driving cars, the number of test miles that needs to be driven is on the order of hundreds of billions. Consequently, since such large-volume testing is high-stakes, costly, or time-consuming, a growing body of work has been devoted to accelerating the estimation of rare-event probabilities as the policy evaluation metric [4], [5]. However, the existing literature typically either fails to exploit the sequential, interactive nature of the task by restricting attention to specific failure-causing initial conditions [6], [7], or focuses only on small state/action spaces due to the curse of dimensionality [8], [9].
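To make the sample-complexity issue concrete, the following is a standard relative-error argument for crude MC (a generic illustration with our own notation $p$, $N$, $\varepsilon$, not a result taken from [1]–[3]). Writing the failure probability as $p$ and the crude MC estimator over $N$ independent test episodes as
\[
\hat{p}_N = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\text{episode } i \text{ ends in failure}\},
\qquad
\frac{\sqrt{\operatorname{Var}(\hat{p}_N)}}{p} = \sqrt{\frac{1-p}{Np}} \approx \frac{1}{\sqrt{Np}},
\]
achieving a relative error of $\varepsilon$ requires roughly $N \approx 1/(p\varepsilon^{2})$ episodes. For instance, $p = 10^{-7}$ and $\varepsilon = 0.1$ already demand on the order of $10^{9}$ test episodes, which illustrates why the required testing volume grows inversely with the rarity of the event.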
Figure: The interactions among the environment transition model, the agent, and the environment adversary introduced by our proposed accelerated policy evaluation method. The reward signal is used to update the adversary policy; the subscript denotes the time step.