1. INTRODUCTION
The purpose of voice activity detection (VAD), also known as speech activity detection, is to find speech segments in audio recordings. VAD has been established as an essential pre-processing stage in applications such as automatic speech recognition and speaker verification. As those systems are commonly deployed in environments with diverse noise types and low signal-to-noise ratios (SNRs), a crucial requirement for VAD is robustness to background noise. Recently, several DNN-based learning approaches have shown improved performance, robustness, and generality over conventional statistical methods [1], [2], [3], [4], [5]. For instance, a recent study proposed a VAD method based on the long short-term memory (LSTM) network [1], [2], which exploits the contextual information of audio. Another work proposed a boosted deep neural network (bDNN) [5] that uses multi-resolution stacking (MRS). In addition, an adaptive context attention model (ACAM) [3] has been proposed to encourage the model to focus on the crucial parts of the input features. Note that all these models are trained on hand-crafted features such as the multi-resolution cochleagram (MRCG) and the mel-spectrogram. DNN-based VAD methods generally perform well on audio streams recorded in clean environments. However, for recordings in low-SNR environments, their performance degrades drastically [6]. Moreover, performance degradation due to unseen background noises has long been a difficult problem in VAD [7], [8], [9].
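The frame-level formulation shared by these methods can be sketched as follows. This is a hypothetical NumPy illustration, not the implementation of any cited model: a single-layer LSTM is scanned over per-frame features (e.g. mel-spectrogram frames), and a sigmoid output head emits a per-frame speech probability. The weights here are random for the sake of a runnable example; a real VAD would learn them from labelled speech/non-speech frames.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_vad_probs(frames, W, U, b, w_out, b_out):
    """frames: (T, F) feature matrix; returns (T,) per-frame speech probabilities."""
    T, F = frames.shape
    H = U.shape[1]                  # hidden state size
    h = np.zeros(H)                 # hidden state (carries temporal context)
    c = np.zeros(H)                 # cell state
    probs = np.empty(T)
    for t in range(T):
        # Compute all four gate pre-activations at once, shape (4H,).
        z = W @ frames[t] + U @ h + b
        i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
        g = np.tanh(z[3 * H:4 * H])            # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
        probs[t] = sigmoid(w_out @ h + b_out)  # frame-level speech score
    return probs

# Random weights and synthetic "features" purely for illustration.
rng = np.random.default_rng(0)
T, F, H = 50, 40, 16                # 50 frames, 40 mel bins (assumed sizes)
W = rng.normal(0, 0.1, (4 * H, F))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
w_out = rng.normal(0, 0.1, H)
b_out = 0.0

p = lstm_vad_probs(rng.normal(size=(T, F)), W, U, b, w_out, b_out)
speech_mask = p > 0.5               # threshold into speech / non-speech frames
```

Because the hidden state `h` is carried across frames, each decision conditions on the preceding context, which is the property the LSTM-based approach above relies on; attention-based variants such as ACAM instead learn to weight which context frames matter.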