I. Introduction
Robots are increasingly used for safety-critical applications such as autonomous driving [1] and surgery [2]. These tasks, characterized by complex cost functions and (possibly unknown) dynamics, are challenging for classical controllers [3]. This motivates learning-based controllers, especially reinforcement learning (RL) algorithms, whose ability to adapt to complex reward signals and unknown dynamics has led to superior performance in various domains [4]. However, a significant limitation of RL is its lack of safety guarantees [3]. This lack of guarantees makes RL undesirable for deployment in safety-critical scenarios, despite its promising results.