I. Introduction
Deep learning has been applied to a number of problems in recent years and has exhibited exceptional performance in particularly challenging domains including computer vision [1], [2], [3], [4], natural language processing [5], and speech recognition [6]. However, despite these impressive gains in accuracy metrics, these models exhibit poor robustness to even minor shifts in the data distribution [7], [8], [9]. These minor distribution shifts, which are likely to occur in the real-world, pose a significant challenge to the deployment of these systems in the real world.