I. Introduction
Speech enhancement (SE) techniques have been a topic of great interest to researchers over the past decades. Over the years, a variety of traditional statistical model-based SE techniques have been extensively studied, e.g., spectral subtraction methods [1], Wiener filtering [2], and minimum mean-square error (MMSE) estimation of the spectral amplitude [3][4][5]. In recent years, deep learning research has impacted a wide range of speech enhancement work in both traditional and new contexts, e.g., regression models based on deep neural networks (DNN) [6], convolutional neural network (CNN)-based speech enhancement approaches [7], and long short-term memory (LSTM) networks, which exploit their time-series modeling ability to enhance speech [8]. According to these state-of-the-art studies, given enough hidden layers, a DNN can learn a complicated transform function and approximate any mapping from input to output arbitrarily well. Increasing the number of hidden layers increases the capacity of the DNN for function approximation. However, training a DNN requires speech patterns with large variations across different noisy environments; the severe interference effects encountered in real-world conditions slow the learning process and degrade generalization to new inputs at unknown signal-to-noise ratios (SNR) [9][10]. Furthermore, even though context features were employed as input to the network, residual noise appeared in the enhanced output due to the DNN's frame-by-frame conversion of speech. Xu et al. [11] proposed a separable denoising autoencoder (SDAE), which consists of two pre-trained autoencoders that represent speech and noise separately, with magnitudes of Fourier coefficients as input and output. Huang et al. extended the SDAE with a recurrent structure as well as a discriminative term, yielding deep recurrent neural networks (DRNN) [12]. Meanwhile, Liu et al. conducted various tests to determine the SDAE's generalization power [13].
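To make the classical baseline concrete, the spectral subtraction idea cited in [1] can be sketched as follows: estimate an average noise magnitude spectrum from a noise-only segment, subtract it frame by frame from the noisy magnitude spectrum, and resynthesize with the noisy phase. This is a minimal illustration, not the exact algorithm of [1]; the over-subtraction factor `alpha`, the spectral floor `beta`, and the non-overlapping framing are simplifying assumptions made here for brevity.

```python
import numpy as np

def spectral_subtraction(noisy, noise, frame_len=256, alpha=2.0, beta=0.02):
    """Minimal magnitude spectral subtraction sketch.

    noisy: 1-D noisy signal; noise: 1-D noise-only segment used to
    estimate the average noise magnitude spectrum. Returns an enhanced
    signal (non-overlapping frames, noisy phase reused).
    """
    # Average noise magnitude spectrum from the noise-only segment.
    n_noise_frames = len(noise) // frame_len
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise[i * frame_len:(i + 1) * frame_len]))
         for i in range(n_noise_frames)], axis=0)

    out = np.zeros(len(noisy), dtype=float)
    for i in range(len(noisy) // frame_len):
        frame = noisy[i * frame_len:(i + 1) * frame_len]
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Over-subtract the noise estimate, but floor the result at
        # beta * noise_mag so magnitudes never go negative (negative
        # values are one source of "musical noise" artifacts).
        clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
        out[i * frame_len:(i + 1) * frame_len] = np.fft.irfft(
            clean_mag * np.exp(1j * phase), n=frame_len)
    return out
```

Because the subtraction operates only on magnitudes and reuses the noisy phase, residual noise and musical artifacts remain; these weaknesses are part of what motivated the data-driven approaches discussed above.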
When the network is not exposed to certain noise types during training, it performs worse on mixtures containing those noise types. Unseen speakers and mixing weights were also found to yield poor performance. In practice, however, many more sources of variation are encountered, such as the relative contributions of the different sources, the frequency response of the microphones, and the degree of reverberation [14].