1 Introduction
Voice user interfaces (VUIs), such as Amazon Alexa, chatbots, and social robots, are becoming an essential part of everyday life [5, 15]. For these systems to carry out effective dialogue, they must be able to determine the intent behind a user's spoken utterance. In this paper, intent recognition is defined as it is commonly understood in the NLP community: the task of taking a written or spoken input and determining which of several classes it matches, in order to best respond to or guide the interaction. This is distinct from the broader meaning of the term in HRI, where it refers to inferring a user's goals from actions observed through sensors or visual cues. Intent recognition in this sense is essential to building complex conversational experiences in HRI, which remains a key challenge. While rule-based parsing is a common approach for some interactions, it is not effective in more advanced and novel dialogue contexts [27]. To improve the user experience of such systems, state-of-the-art intent recognition models are trained on large labeled datasets customized to specific applications [8, 9, 26].
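To make the task definition above concrete, the following is a minimal sketch of intent recognition framed as multi-class text classification over (transcribed) utterances. The toolkit (scikit-learn), the example utterances, and the intent labels are illustrative assumptions, not the models or datasets used in the cited work.

```python
# Illustrative sketch only: intent recognition as supervised text classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled utterances; real systems rely on large,
# application-specific labeled datasets.
utterances = [
    "turn on the kitchen lights",
    "switch off the lamp",
    "what's the weather like today",
    "will it rain tomorrow",
    "set a timer for ten minutes",
    "remind me to call mom at noon",
]
intents = [
    "control_device",
    "control_device",
    "get_weather",
    "get_weather",
    "set_reminder",
    "set_reminder",
]

# Bag-of-words features plus a linear classifier: one of many possible
# model choices for mapping an input to one of several intent classes.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(utterances, intents)

# Assign a new transcribed utterance to its most likely intent class.
print(classifier.predict(["please turn the hallway light on"])[0])  # -> control_device
```

In a deployed VUI, the classifier's prediction would then drive the dialogue manager's choice of response or action; the simple pipeline above stands in for whatever model family a given application adopts.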