
Scalable Safety-Critical Policy Evaluation with Accelerated Rare Event Sampling


Abstract:

Evaluating rare but high-stakes events is one of the main challenges in obtaining reliable reinforcement learning policies, especially in large or infinite state/action spaces where limited scalability dictates a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could cause unexpected catastrophic failures during deployment. This paper proposes the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. APE treats the environment's stochasticity (nature) as an adversarial agent and, through adaptive importance sampling, learns toward the zero-variance sampling distribution for policy evaluation. Moreover, APE scales to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of APE in the tabular setting. Our empirical studies show that APE estimates the rare event probability with smaller bias while using orders of magnitude fewer samples than baselines in both single-agent and multi-agent environments.
Date of Conference: 23-27 October 2022
Date Added to IEEE Xplore: 26 December 2022
Conference Location: Kyoto, Japan


I. Introduction

There has been growing interest in applying reinforcement learning (RL) to complex, high-stakes robotics problems, including controlling autonomous vehicles and healthcare assistant robots [1], [2]. Before any RL policy is deployed, such safety-critical applications require an accurate policy evaluation, because inaccurate estimates can lead to false optimism or even catastrophic consequences. A main difficulty for policy evaluation in these settings is that we are crucially interested in assessing rare but high-stakes events in safety-critical systems. The de facto standard evaluation method is vanilla Monte Carlo (MC), which, in many practical settings of interest, requires a prohibitively large number of test episodes before the evaluation can be deemed statistically valid [3]. For example, [1] shows that demonstrating the safety of self-driving cars would require logging on the order of hundreds of billions of test miles. Consequently, when such large-volume testing is risky, costly, or time-consuming, a growing body of work has sought to accelerate the estimation of the rare event probability used as the policy evaluation metric [4], [5]. However, existing literature typically either fails to exploit the sequential, interactive nature of tasks by confining attention to specific failure-causing initial conditions [6], [7], or only handles small state/action spaces due to the curse of dimensionality [8], [9].
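To see why vanilla MC scales so poorly here, consider a back-of-the-envelope bound (our illustration, not a derivation from the paper): the relative error of an MC estimate of a small failure probability p shrinks only like 1/sqrt(np), so the required number of episodes grows inversely with p.

```latex
\hat{p}_n \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{failure in episode } i\},
\qquad
\operatorname{Var}(\hat{p}_n) \;=\; \frac{p(1-p)}{n},
\qquad
\frac{\sqrt{\operatorname{Var}(\hat{p}_n)}}{p} \;\approx\; \frac{1}{\sqrt{n\,p}} .
```

Keeping the relative error below a tolerance \epsilon therefore requires roughly n \gtrsim 1/(\epsilon^2 p); with an illustrative (hypothetical) failure probability p = 10^{-7} and \epsilon = 0.1, this already demands on the order of 10^9 independent test episodes.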

Figure: The interactions among the environment transition model, the agent, and the environment adversary introduced by the proposed accelerated policy evaluation method. The reward signal is used to update the adversary policy; the subscript denotes the time step.
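As a minimal sketch of the sampling loop this figure describes (our own generic illustration under assumed interfaces, not the authors' implementation), an adversarial proposal distribution q can replace the nominal disturbance model p when rolling out episodes, while the accumulated likelihood ratio p/q reweights each episode so the failure-probability estimate remains unbiased:

```python
import numpy as np

def importance_sampled_episode(env, agent_policy, adversary, nominal_logpdf, horizon=200):
    """Roll out one episode in which environment disturbances are drawn from an
    adversarial proposal q instead of the nominal model p; the running
    log-likelihood ratio log(p/q) keeps the rare-event estimate unbiased.
    All interfaces (env, agent_policy, adversary, nominal_logpdf) are
    hypothetical placeholders, not the paper's API."""
    state = env.reset()
    log_weight = 0.0      # accumulates log p(disturbance) - log q(disturbance)
    failed = False
    for _ in range(horizon):
        action = agent_policy(state)                    # fixed policy under evaluation
        disturbance = adversary.sample(state, action)   # biased toward failure modes
        log_weight += (nominal_logpdf(state, action, disturbance)
                       - adversary.logpdf(state, action, disturbance))
        state, done, failed = env.step(action, disturbance)
        if done:
            break
    # This episode's contribution to the failure-probability estimate:
    return float(failed) * np.exp(log_weight)

# Averaging many such returns gives an unbiased estimate of the failure
# probability; variance drops sharply when q concentrates mass on failures.
```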

