I. Introduction
Deep reinforcement learning (deep RL) has demonstrated impressive performance on sequential decision problems ranging from the game of Go [1], [2] to robot control [3]–[5]. In practice, however, deep RL still requires significant engineering effort to tune algorithms and reward functions before an effective control policy can be found. Conditions such as sparse rewards and sensitive dynamics compound this difficulty: they obscure the informative learning signal and thereby degrade the accuracy of policy gradient estimates. Engineers often mitigate these issues by designing denser, smoother reward functions or by incorporating hand-designed controllers based on prior knowledge.
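To make this failure mode concrete, consider a minimal sketch of a Monte Carlo policy gradient (REINFORCE-style) estimate; the notation ($\pi_\theta$ for the policy, $R(\tau)$ for the trajectory return, $N$ rollouts) is illustrative and not drawn from a specific cited formulation:
\[
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \,\middle|\, s_t^{(i)}\right) \right) R\!\left(\tau^{(i)}\right).
\]
Under a sparse reward, $R(\tau^{(i)})$ is zero for most sampled trajectories, so the estimate is dominated by a few rare successful rollouts and becomes noisy; sensitive dynamics similarly cause $R(\tau)$ to change sharply under small policy perturbations, further inflating the variance of the estimator.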