1. Introduction
Automatic speech recognition (ASR) is becoming increasingly more integral in our day-to-day lives enabling virtual assistants and smart speakers like Siri, Google Now, Cortana, Amazon Echo, Google Home, Apple HomePod, Microsoft Invoke, Baidu Duer and many more. While recent breakthroughs have tremendously improved ASR performance [1], [2] these models still suffer considerable degradation from reasonable variations in reverberations, ambient noise, accents and Lombard reflexes that humans have little or no issue recognizing.