1. INTRODUCTION
Data augmentation techniques enhance model performance by adding variation to the training data. They are widely applied to improve automatic speech recognition (ASR) performance [1]–[4]. In [1], the authors used speed perturbation to create new speech utterances by changing the frequency components and the number of time frames of speech recordings. This additional training data decreased the word error rate (WER) by 3.2% relative on the LibriSpeech task with 960 hours of LibriSpeech data. In [2], reverberation was added to the speech to make it more realistic. The authors of [3] proposed augmenting the data by adding noise to speech, reducing WER by 21.3% relative on their self-constructed 100-sentence evaluation set. More recently, a common technique has been to mask or remove information in the spectrogram domain. For instance, SpecAugment [5] removes speech information in T consecutive time frames or F consecutive frequency bins at random positions. At the time of publication, this augmentation not only improved ASR accuracy but also achieved a state-of-the-art WER of 5.8% on the LibriSpeech 960-hour dataset.
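To make the spectrogram-masking idea concrete, the following is a minimal sketch of time and frequency masking in the spirit of SpecAugment [5]. It is an illustration under stated assumptions, not the implementation of [5]: the function name mask_spectrogram, its parameters, the use of one mask per axis, and the choice of zero as the fill value are all hypothetical, and the spectrogram is assumed to be a plain list of frames, each a list of frequency-bin values.

```python
import random


def mask_spectrogram(spec, max_time_width, max_freq_width,
                     rng=None, mask_value=0.0):
    """Return a masked copy of `spec` (list of frames, each a list of bins).

    One block of consecutive time frames and one block of consecutive
    frequency bins are overwritten with `mask_value`. Names and defaults
    here are illustrative assumptions, not taken from [5].
    """
    rng = rng or random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0]) if num_frames else 0
    out = [list(frame) for frame in spec]  # copy so the input is untouched

    # Time mask: width t is drawn from [0, max_time_width], and the start
    # t0 is chosen so that the block [t0, t0 + t) fits in the utterance.
    t = rng.randint(0, min(max_time_width, num_frames))
    t0 = rng.randint(0, num_frames - t)
    for i in range(t0, t0 + t):
        out[i] = [mask_value] * num_bins

    # Frequency mask: same procedure along the frequency axis.
    f = rng.randint(0, min(max_freq_width, num_bins))
    f0 = rng.randint(0, num_bins - f)
    for frame in out:
        for j in range(f0, f0 + f):
            frame[j] = mask_value

    return out
```

A call such as mask_spectrogram(spec, max_time_width=5, max_freq_width=3) would hide up to 5 consecutive frames and up to 3 consecutive bins of a toy spectrogram. Because the mask widths and positions are redrawn on every call, the same utterance yields a different masked variant each time it is seen during training.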