I. Introduction
Automatic speech recognition (ASR) has advanced considerably with the development of deep neural networks, particularly Transformers [1], [2], [3]. As models grow larger and their architectures become more complicated, ASR accuracy improves but predictions become harder to explain, owing to the black-box nature of neural networks. This inevitably hampers our trust in the models and, as a result, their application in safety-critical areas such as voice control in smart cars. It is therefore important to pay attention to model explainability, which can help us better understand, diagnose, and improve our models.