
ADA-VAD: Unpaired Adversarial Domain Adaptation for Noise-Robust Voice Activity Detection



Abstract:

Voice Activity Detection (VAD) is becoming an essential front-end component in various speech processing systems. As those systems are commonly deployed in environments with diverse noise types and low signal-to-noise ratios (SNRs), an effective VAD method should robustly detect speech regions within noisy background signals. In this paper, we propose adversarial domain adaptive VAD (ADA-VAD), a deep neural network (DNN) based VAD method that is highly robust to audio samples with various noise types and low SNRs. The proposed method trains DNN models for the VAD task in a supervised manner. Simultaneously, to mitigate the performance degradation caused by background noise, an adversarial domain adaptation method is adopted to reduce the domain discrepancy between noisy and clean audio streams in an unsupervised manner. The results show that ADA-VAD achieves on average 3.6%p and 7%p higher AUC than models trained with manually extracted features on the AVA-speech dataset and on a speech database synthesized with an unseen noise database, respectively.
Date of Conference: 23-27 May 2022
Date Added to IEEE Xplore: 27 April 2022
Conference Location: Singapore, Singapore
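
The excerpt does not include the paper's architecture or training details, so the following is only a minimal sketch of the general idea described in the abstract: a shared encoder is trained with a supervised frame-level VAD loss while a domain discriminator, connected through a gradient reversal layer, adversarially aligns representations of unpaired clean and noisy audio. All module names, feature shapes, and hyperparameters (e.g. AdaVadSketch, the GRU encoder, 64-dimensional features) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of adversarial domain adaptation for frame-level VAD
# (illustrative assumptions throughout; not the authors' implementation).
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lam on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class AdaVadSketch(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        # Shared encoder over acoustic frame features (assumed 64-dim).
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # Supervised head: per-frame speech / non-speech logit.
        self.vad_head = nn.Linear(hidden, 1)
        # Adversarial head: predicts clean vs. noisy domain per frame.
        self.domain_head = nn.Linear(hidden, 1)

    def forward(self, feats, lam=1.0):
        h, _ = self.encoder(feats)                        # (B, T, hidden)
        vad_logits = self.vad_head(h).squeeze(-1)         # (B, T)
        dom_logits = self.domain_head(
            GradientReversal.apply(h, lam)).squeeze(-1)   # (B, T)
        return vad_logits, dom_logits


# One hypothetical training step with placeholder tensors: the VAD loss uses
# labeled frames, the domain loss uses unpaired clean (0) vs. noisy (1) batches.
model = AdaVadSketch()
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

clean_feats = torch.randn(8, 100, 64)                    # labeled batch (placeholder)
vad_labels = torch.randint(0, 2, (8, 100)).float()
noisy_feats = torch.randn(8, 100, 64)                    # unlabeled noisy batch

vad_logits, dom_clean = model(clean_feats)
_, dom_noisy = model(noisy_feats)
dom_logits = torch.cat([dom_clean, dom_noisy])
dom_labels = torch.cat([torch.zeros(8, 100), torch.ones(8, 100)])

loss = bce(vad_logits, vad_labels) + bce(dom_logits, dom_labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Because the gradient reversal layer flips the discriminator's gradient before it reaches the encoder, minimizing this single combined loss pushes the encoder toward features that are useful for VAD yet indistinguishable between the clean and noisy domains, which is the aligning effect the abstract describes.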


1. INTRODUCTION

The purpose of voice activity detection (VAD), also known as speech activity detection, is to find speech segments in audio recordings. It has been established as an essential pre-processing stage in various applications such as automatic speech recognition and speaker verification. As those systems are commonly deployed in environments with diverse noise types and low signal-to-noise ratios (SNRs), a crucial aspect of VAD is its robustness to background noise. Recently, several DNN-based learning approaches have shown improved performance, robustness, and generality over conventional statistical methods [1], [2], [3], [4], [5]. For instance, a recent study proposed a VAD method based on the long short-term memory (LSTM) network [1], [2] that uses the contextual information of the audio. Another work proposed a boosted deep neural network (bDNN) [5] that uses multi-resolution stacking (MRS). In addition, an adaptive context attention model (ACAM) [3] has been proposed to encourage the model to focus on crucial parts of the input features. Note that all these models are trained with manually extracted features such as the multi-resolution cochleagram (MRCG) and the mel-spectrogram. DNN-based VAD methods generally perform well on audio streams from clean environments. However, for recordings in low-SNR environments, the performance of both conventional and DNN-based approaches degrades drastically [6]. Moreover, performance degradation due to unseen background noise has long been a difficult problem in VAD [7], [8], [9].
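To make the conventional pipeline above concrete, here is a minimal sketch of a frame-level VAD built on manually extracted features: log-mel-spectrogram frames fed to a small LSTM that outputs per-frame speech probabilities. The window and hop sizes, mel-band count, and decision threshold are assumed values for illustration and do not correspond to any of the cited systems.

```python
# Sketch of a conventional "hand-crafted features + DNN" VAD
# (assumed hyperparameters; not a cited system's configuration).
import torch
import torch.nn as nn
import torchaudio

# 16 kHz audio, 25 ms windows, 10 ms hop, 64 mel bands (assumed settings).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()


class LstmVad(nn.Module):
    def __init__(self, n_mels=64, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, logmel):                          # (B, T, n_mels)
        h, _ = self.lstm(logmel)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (B, T) speech prob.


waveform = torch.randn(1, 16000)                        # 1 s of placeholder audio
features = to_db(melspec(waveform)).transpose(1, 2)     # (1, frames, 64)
probs = LstmVad()(features)
speech_mask = probs > 0.5                               # simple per-frame decision
```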
