1. INTRODUCTION
End-to-end spoken language understanding (E2E SLU) has recently shown promising results [1], [2], [3]. These models outperform previous SLU approaches that cascade automatic speech recognition and natural language understanding (ASR-NLU) for intent classification of spoken utterances. The main advantage of E2E SLU models is their ability to understand the speaker's intent without first transcribing the speech into text. This allows the models to fully exploit additional information, such as emotion and nuance, conveyed by the acoustic signal. Recently, leveraging large-scale pre-trained language models (PLMs) such as BERT [4] has enhanced SLU performance [5], [6] by drawing on their richly learned textual representations. However, these methods exploit only limited textual information, as they only explicitly align the representation of a spoken utterance with that of its transcript. Hence, existing E2E SLU methods can be further improved with respect to effective learning of PLM-based speech and text representations.
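For concreteness, the following is a minimal PyTorch sketch of the kind of explicit speech-text alignment objective such PLM-based methods employ: the pooled utterance embedding from an acoustic encoder is pulled toward a frozen PLM embedding of the ground-truth transcript. The module and variable names here are hypothetical, and the toy encoder merely stands in for whatever acoustic model a given system uses.

    # Minimal sketch of explicit speech-text representation alignment.
    # Only the alignment objective is the point; architectures are placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        """Toy stand-in for an acoustic encoder (e.g., a pretrained speech model)."""
        def __init__(self, n_mels=80, dim=768):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)

        def forward(self, feats):            # feats: (batch, frames, n_mels)
            out, _ = self.rnn(feats)
            return out.mean(dim=1)           # mean-pool over frames -> (batch, dim)

    def alignment_loss(speech_emb, text_emb):
        """Pull the utterance embedding toward the frozen PLM transcript embedding."""
        return F.mse_loss(speech_emb, text_emb.detach())

    # Usage: text_emb would be a frozen PLM embedding of the transcript,
    # e.g. BERT's pooled [CLS] vector (random placeholder here).
    encoder = SpeechEncoder()
    feats = torch.randn(4, 200, 80)          # 4 utterances, 200 frames of 80-dim features
    text_emb = torch.randn(4, 768)           # placeholder for BERT transcript embeddings
    loss = alignment_loss(encoder(feats), text_emb)
    loss.backward()

In practice the alignment term is trained jointly with the intent classification loss, so the speech encoder inherits textual knowledge from the PLM while still optimizing the end task.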