Abstract:
This paper proposes a novel, fully neural network-based voice activity detection (VAD) method that estimates whether each segment of the input signal is speech or non-speech, even in very low signal-to-noise ratio (SNR) environments. Our innovation is to improve context-awareness of speech variability by introducing multiple auxiliary networks into the neural VAD framework. While previous studies reported that phonetic-aware auxiliary features extracted from a phoneme recognition network can improve VAD performance, none examined other effective auxiliary features for enhancing noise robustness. Thus, this paper presents a neural VAD that uses auxiliary features extracted not only from the phoneme recognition network but also from a speech enhancement network and an acoustic scene classification network. The last two networks are expected to improve context-awareness even in extremely low SNR environments, since they can extract awareness of the de-noised speech and of the noisy environment, respectively. In addition, we expect that combining these multiple auxiliary features yields synergistic improvements in VAD performance. Experiments verify the superiority of the proposed method in very low SNR environments.
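The abstract does not specify how the auxiliary features are fused with the input features. A minimal sketch of one plausible reading, frame-level concatenation of the three auxiliary embeddings with the acoustic features feeding a recurrent VAD classifier, is given below; the class name, layer sizes, and LSTM-plus-sigmoid head are illustrative assumptions, not the paper's exact architecture.

    # Hypothetical sketch (PyTorch): a frame-wise VAD classifier that fuses the
    # acoustic features with auxiliary embeddings taken from three pre-trained
    # networks (phoneme recognition, speech enhancement, acoustic scene
    # classification). All dimensions and the concatenation fusion are assumptions.
    import torch
    import torch.nn as nn

    class MultiAuxiliaryVAD(nn.Module):
        def __init__(self, feat_dim=40, phn_dim=128, enh_dim=128, scene_dim=128, hidden=256):
            super().__init__()
            # Fuse acoustic features and auxiliary embeddings along the feature axis.
            fused_dim = feat_dim + phn_dim + enh_dim + scene_dim
            self.rnn = nn.LSTM(fused_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)  # frame-wise speech / non-speech logit

        def forward(self, feats, phn_emb, enh_emb, scene_emb):
            # feats:     (batch, frames, feat_dim)   input acoustic features
            # phn_emb:   (batch, frames, phn_dim)    phoneme-recognition embedding
            # enh_emb:   (batch, frames, enh_dim)    speech-enhancement embedding
            # scene_emb: (batch, frames, scene_dim)  scene-classification embedding
            fused = torch.cat([feats, phn_emb, enh_emb, scene_emb], dim=-1)
            hidden, _ = self.rnn(fused)
            return torch.sigmoid(self.head(hidden)).squeeze(-1)  # (batch, frames)

    # Usage example: score 200 frames of one utterance with random stand-in inputs.
    vad = MultiAuxiliaryVAD()
    probs = vad(torch.randn(1, 200, 40), torch.randn(1, 200, 128),
                torch.randn(1, 200, 128), torch.randn(1, 200, 128))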
Date of Conference: 02-06 September 2019
Date Added to IEEE Xplore: 18 November 2019