I. Introduction
The problem we consider in this paper involves audio recordings of the dialogue between a voice assistant (such as Alexa on an Echo Dot) and a “user”, collected in controlled lab experiments.
The user's voice is synthesized; no recording of an actual conversation between a customer and a voice assistant is involved.
The first few seconds of each dialogue are recorded. Fig. 1 shows the lab setup, and Fig. 2 shows the waveforms of two example audio recordings. In our use case, such a dialogue usually follows this pattern: wake word (user), question/request (user), acknowledgement of receiving the question (voice assistant), actual answer/content (voice assistant). For example, the first plot in Fig. 2 corresponds to this dialogue:

User: Alexa, add bananas to my shopping list.
Alexa: You already have bananas on your shopping list.
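The four-part turn structure above can be sketched as a simple data model. This is only an illustrative sketch, not code from our system; the names `Turn`, `follows_pattern`, and the placeholder acknowledgement text are assumptions introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # "user" or "assistant"
    role: str     # one of the four roles in the usual pattern
    text: str

# The usual dialogue pattern described above, in order.
EXPECTED_ROLES = ["wake_word", "request", "acknowledgement", "answer"]

def follows_pattern(dialogue):
    """Return True if the dialogue's turns match the usual four-part pattern."""
    return [t.role for t in dialogue] == EXPECTED_ROLES

# The example dialogue from Fig. 2, segmented into turns
# (the acknowledgement is often non-verbal, e.g. an earcon or pause).
example = [
    Turn("user", "wake_word", "Alexa,"),
    Turn("user", "request", "add bananas to my shopping list."),
    Turn("assistant", "acknowledgement", "(earcon / brief pause)"),
    Turn("assistant", "answer", "You already have bananas on your shopping list."),
]

print(follows_pattern(example))  # True
```

In practice a recording may truncate or reorder these turns, which is why checking for this pattern is a useful first-pass sanity check on each clip.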