1. INTRODUCTION
The purpose of voice activity detection (VAD), also known as speech activity detection, is to find speech segments in audio recordings. VAD has been established as an essential pre-processing stage in applications such as automatic speech recognition and speaker verification. As those systems are commonly deployed in environments with diverse noise types and low signal-to-noise ratios (SNRs), a crucial requirement for VAD is robustness to background noise. Recently, several DNN-based learning approaches have shown improved performance, robustness, and generality over conventional statistical methods [1], [2], [3], [4], [5]. For instance, a recent study proposed a VAD method based on the long short-term memory (LSTM) network [1], [2], which exploits the contextual information of audio. Another work proposed a boosted deep neural network (bDNN) [5] that uses multi-resolution stacking (MRS). In addition, an adaptive context attention model (ACAM) [3] has been proposed to encourage the model to focus on the crucial parts of the input features. Note that all these models are trained on hand-crafted features such as the multi-resolution cochleagram (MRCG) and the mel-spectrogram. DNN-based VAD methods generally perform well on audio streams recorded in clean environments. However, for recordings in low-SNR environments, their performance degrades drastically [6]. Moreover, performance degradation due to unseen background noises has long been a difficult problem in VAD [7], [8], [9].
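The frame-level formulation shared by these methods can be sketched as follows. This is a hypothetical NumPy illustration, not the implementation of any cited model: a single-layer LSTM is scanned over per-frame features (e.g. mel-spectrogram frames), and a sigmoid output head emits a per-frame speech probability. The weights here are random for the sake of a runnable example; a real VAD would learn them from labelled speech/non-speech frames.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_vad_probs(frames, W, U, b, w_out, b_out):
    """frames: (T, F) feature matrix; returns (T,) per-frame speech probabilities."""
    T, F = frames.shape
    H = U.shape[1]                  # hidden state size
    h = np.zeros(H)                 # hidden state (carries temporal context)
    c = np.zeros(H)                 # cell state
    probs = np.empty(T)
    for t in range(T):
        # Compute all four gate pre-activations at once, shape (4H,).
        z = W @ frames[t] + U @ h + b
        i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
        g = np.tanh(z[3 * H:4 * H])            # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
        probs[t] = sigmoid(w_out @ h + b_out)  # frame-level speech score
    return probs

# Random weights and synthetic "features" purely for illustration.
rng = np.random.default_rng(0)
T, F, H = 50, 40, 16                # 50 frames, 40 mel bins (assumed sizes)
W = rng.normal(0, 0.1, (4 * H, F))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
w_out = rng.normal(0, 0.1, H)
b_out = 0.0

p = lstm_vad_probs(rng.normal(size=(T, F)), W, U, b, w_out, b_out)
speech_mask = p > 0.5               # threshold into speech / non-speech frames
```

Because the hidden state `h` is carried across frames, each decision conditions on the preceding context, which is the property the LSTM-based approach above relies on; attention-based variants such as ACAM instead learn to weight which context frames matter.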