I. Introduction
Recently, monaural Speech Enhancement (SE) has seen a significant leap in performance, thanks mainly to the adoption of supervised-learning techniques based on Deep Neural Networks (DNNs). In fact, DNN models have been shown to significantly improve intelligibility measures such as the Short-Time Objective Intelligibility measure (STOI) [1], a feat that was not possible with classical non-supervised approaches unless very constrained situations were considered [2]. In the supervised approach, the model learns a direct mapping from noisy input features to clean output features by minimizing a loss function between the ground-truth clean example and the output of the model.
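As an illustration, the supervised formulation described above can be written as a generic training objective. The notation is an assumption introduced here and is not taken from the original text: $\mathbf{x}_n$ denotes the noisy input features of the $n$-th training example, $\mathbf{y}_n$ the corresponding ground-truth clean features, $f_{\theta}$ the DNN with parameters $\theta$, $N$ the number of training examples, and $\mathcal{L}$ an unspecified loss function (for instance, a mean squared error).

```latex
\begin{equation*}
  \hat{\theta} = \arg\min_{\theta} \; \frac{1}{N} \sum_{n=1}^{N}
  \mathcal{L}\bigl( f_{\theta}(\mathbf{x}_n),\, \mathbf{y}_n \bigr)
\end{equation*}
```

Under this sketch, the enhanced output for a new noisy input $\mathbf{x}$ would simply be $f_{\hat{\theta}}(\mathbf{x})$.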