1. Introduction
Speech enhancement aims to recover clean speech from noisy speech [1]. It is an essential branch of speech signal processing and has been widely studied over the past few decades. It is used in hearing aids, voice recorders, and smart speakers, as well as in the front end of tasks such as speech recognition [2] and speaker recognition [3]. In recent years, a large number of deep-learning-based speech enhancement methods have been proposed [4]–[8], showing stronger robustness than traditional signal-processing-based methods.

These methods can generally be divided into time-domain methods and frequency-domain methods. Time-domain methods [9], [10] use a neural network to map the noisy speech waveform directly to the clean speech waveform and usually require no preprocessing. Frequency-domain methods generally use the short-time Fourier transform (STFT) to convert the noisy speech from the time domain to the frequency domain, and then use a neural network to map the magnitude spectrum of the noisy speech to a mask [11] or to the magnitude spectrum of the clean speech [5]. Compared with the raw, unstructured time-domain samples, the magnitude spectrum has a clearer geometric structure, which makes it easier to compute losses and analyze frequency components. As the SNR of the noisy speech decreases, a correct phase becomes increasingly important for speech intelligibility and quality [12]. However, since mapping the phase spectrum is difficult (it has no obvious geometric structure), time-domain methods, which avoid explicit phase estimation, are also widely used.
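To make the frequency-domain pipeline described above concrete, the following minimal sketch applies an STFT, estimates a magnitude mask, and resynthesizes the waveform with the unmodified noisy phase. The estimate_mask function and the parameter values (16 kHz sampling rate, 512-point frames) are illustrative assumptions standing in for a trained network; this is not the method proposed in this paper.

```python
import numpy as np
from scipy.signal import stft, istft

def estimate_mask(noisy_mag):
    # Placeholder for a trained network that maps the noisy magnitude
    # spectrum to a mask in [0, 1] (e.g., an ideal-ratio-mask estimate).
    return np.clip(noisy_mag / (noisy_mag + 1.0), 0.0, 1.0)

def enhance(noisy, fs=16000, n_fft=512, hop=256):
    # STFT: time domain -> complex time-frequency representation.
    _, _, spec = stft(noisy, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag, phase = np.abs(spec), np.angle(spec)

    # Apply the estimated mask to the noisy magnitude; the noisy phase is
    # reused, the usual simplification in magnitude-only methods.
    enhanced_mag = estimate_mask(mag) * mag

    # Inverse STFT: back to the time domain with the noisy phase.
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs,
                        nperseg=n_fft, noverlap=n_fft - hop)
    return enhanced
```

Reusing the noisy phase is exactly the simplification that the last paragraph above refers to: it keeps the mapping problem restricted to the well-structured magnitude spectrum, at the cost of phase errors that matter more as the SNR drops.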