Introduction
Speech enhancement (SE) improves the intelligibility and overall perceived quality of degraded speech signals. SE systems are commonly used as preprocessors for automatic speech recognition (ASR) so that recognition remains accurate in noise. Since SE produces clean versions of noisy speech, the ASR system itself does not need to be modified for robustness. A variety of SE strategies have been documented in the literature, including statistical models [1], spectral subtraction [2], subspace methods [3], and acoustic approaches [4]. Deep neural networks (DNNs) have transformed numerous applications and remain at the forefront of technological advancement. Their ability to tackle complex tasks [5] makes them valuable across multiple domains, such as image processing [6], [7], [8], [9], bioinformatics [10], [11], [12], and video-audio processing [13], [14]. DNNs model complex nonlinearities well and therefore perform better on SE in severe, highly non-stationary noisy backgrounds. SE has made large progress due to the recent growth of DNNs [15]. The study [16] compressed models to reduce DNN size, regression-based DNNs were proposed for SE [17], DNN-based SE was examined across different speech datasets [18], a time-frequency training objective was proposed for SE [19], and generalized Gaussian distributions were introduced for regression-based DNN SE [20]. Recurrent neural network (RNN)-based approaches, such as deep RNNs, use information from previous frames in conjunction with the current frame [21]. Through its gates, a long short-term memory (LSTM) network controls how much information from earlier time frames influences the current frame. LSTM [22] has outperformed feed-forward DNN (FDNN)-based models [23]. Convolutional neural networks (CNNs) have recently been applied to enhance noisy speech. A CNN estimates the target speech from the time-frequency (T-F) representation of noisy speech [24]. To construct the temporal envelope in the time-frequency domain, the study [25] proposes embedded CNNs. A study [26] proposes a fully connected and CNN-based model to perform SE at low SNRs. Another study [27] proposes a UNET with architectural changes at several stages. A study [28] proposes a cooperative attention-based SE that combines local and global attention in a self-adaptive way. A convolutional transformer neural network is proposed to learn local and global features [29]. A low-complexity Swin transformer is proposed in [30] for SE. Multi-dimensional CNNs and GRUs with multi-headed attention are proposed in [31]. A study [32] proposes a residual GRU for augmented speech enhancement.
To encode information from speech spectrograms, CNN structures determine which parts of the input are crucial for speech and learn hidden neural representations. However, CNNs tend to overlook spatial information, such as the positions of pitch and formant components, even though these provide crucial cues for human perception of speech. This information is often distorted, since ideal speech enhancement is difficult to achieve. Different neural architectures capture spatial cues (variations across the time-frequency representation) and temporal cues (variations across time) depending on the characteristics of the input spectrograms. The convolutional layer performs the crucial task of extracting features from the data (e.g., spectrograms). Although CNNs perform very well, they have several limitations: the pooling operation used after convolutional layers discards valuable information, learning requires large amounts of training data, the layers reduce the spatial resolution, and the network output may not change even when the inputs are altered slightly. To alleviate these limitations, Capsule Networks (CapsNet) were proposed [33]. CapsNet retains spatial information and other important features, preventing the information loss that occurs during pooling. The CapsNet structure with a dynamic routing protocol was proposed by Sabour, Frosst, and Hinton [33] to take spatial information into account, and it has achieved notable success.
This study implements and examines the applicability of CapsNet to single-channel SE. It is expected that the spatial information in speech spectrograms will improve SE performance (noise reduction without loss of speech quality). The CapsNet-based SE systems take spectrograms as inputs to preserve a large portion of the speech content. To prevent local-to-global information loss, a recurrent structure of window-level capsules is applied to the spectrograms; this recurrent structure manages input sequences of variable-length feature frames. To acquire temporal dependencies in the sequences, recurrent layers are added to the structure, and utterance-level dynamic routing is applied to the recurrent outputs of the capsules. The spatiotemporal information encoded in the input speech spectrograms is thus fully utilized by CapsNet. The CapsNet in this study is intended to minimize the loss of frequency (spatial) information that occurs in the typical CNN paradigm, where a local loss propagates to higher layers and thereby becomes a loss of global information. The objectives of this study are to develop the CapsNet technique, which includes the ability to process input sequences of varying lengths; the use of dynamic routing to capture the spatial information in the spectrograms; and the inclusion of recurrent connections to capture the temporal information in the speech spectrograms. The contributions of this study are given as follows:
Considering the hierarchical structure and temporal relationships within speech signals, this study proposes a CapsNet-based speech enhancement model that provides expressive and context-aware feature representations while keeping spatial information in capsules.
The pooling of recurrent hidden states is replaced with utterance-level dynamic routing to obtain robust speech features from the outputs of the capsules.
To prevent local-to-global loss of information, a recurrent structure of window-level capsules is applied to the spectrograms, which manages input sequences of variable-length feature frames.
To extract temporal dependencies in the sequences, two types of recurrent layers, LSTM and GRU, are added on top of the CapsNet structure.
The remainder of this paper is organized as follows: Section II provides a literature survey. Section III covers CNN and CapsNet. Section IV describes the CNN+RNN system as a baseline. The CapsNet-based speech enhancement system is discussed in Section V. Section VI describes the detailed system implementation. System evaluations are presented in Section VII. Section VIII discusses the findings. Finally, Section IX concludes the study.
Related Studies
Deep learning (DL) has grown rapidly in recent years, although various models still face challenges, for which possible solutions are surveyed in [34]. A signal-to-noise ratio (SNR)-aware CNN model for SE is proposed, with a multi-task learning framework and an SNR-adaptive framework [35]. The target speech is estimated by a learning system that uses an autoencoder CNN to improve noisy speech [36]. A phase-and-harmonics-aware CNN is proposed to design a two-stream framework for SE [37]. A multi-objective learning CNN is proposed to reduce noise in speech spectrograms and improve perceived speech quality [38]. For cochlear implant users, three CNN variants for enhancing noisy speech are proposed; to address the delays in real-time CNN-based speech enhancement, causal versions of the three CNNs are proposed [39]: a spectral-subtraction-style CNN, a basic CNN, and a Wiener-style CNN. Speech enhancement is also treated as a sequence-to-sequence mapping, and a CNN is proposed that uses long-term contexts to identify the target speaker; the model integrates context systematically using dilated convolutions, which greatly expand the receptive fields, and includes gating mechanisms and residual learning [40]. A causal convolutional RNN framework for progressive learning is proposed, which leverages CNNs and RNNs to reduce the number of parameters and the computational load [41]. An end-to-end paradigm for speech enhancement based on CNN and RNN is proposed, which is entirely data-driven and makes no assumptions about the noise type or stationarity; the model leverages local characteristics in the spectral and temporal domains [42]. A time-domain SE based on a fully convolutional neural network that analyzes complex spectrograms is suggested to improve the target speech signals [43]; Conv1D, frequency-dilated Conv2D, residual learning, and skip connections are modules of the proposed model.
Convolutional recurrent neural networks (CRNs) with convolutional encoder-decoder (CED) architectures can improve single-channel SE. CRNs model temporal dynamics by inserting recurrent nets (such as LSTMs) between the encoder and decoder. Because the LSTM is fully connected, it disregards the internal structure of the CRN's feature maps and the local structure captured by the convolutional mappings. CRNs must also limit the feature size of the LSTM's input, since the fully connected LSTM otherwise requires many trainable parameters. In a recent study, the fully connected LSTM was replaced with a convolutional LSTM (ConvLSTM), and the resulting network was called a fully convolutional recurrent network (FCRN) [44]. The ConvLSTM maintains the hierarchical organization of the input feature maps. This is shown to help capture the harmonic structures of speech and lets the network handle high-dimensional input features with fewer trainable parameters than a standard LSTM. A novel CRN based on complex spectral mapping is suggested, which results in a causal approach for speaker-independent and noise-independent SE [45]. To decrease the number of trainable parameters and the computational cost without compromising performance, the CRN also includes a grouping technique. The real and imaginary parts of the noisy speech spectrograms are treated as two separate input channels. A CRN addressing real-time monaural SE is proposed in which a CED and LSTMs are incorporated into the CRN architecture [46], leading to a causal SE naturally suited for real-time speech processing. Motivated by multitask learning, a gated CRN is proposed for complex spectral mapping by incorporating gated linear units (GLUs) into the analysis, which leads to another efficient causal SE [47]. The study in [48] presented a novel convolutional fusion network (CFN) for optimizing model performance, inter-channel dependencies, information utilization, and parameter count in single-channel SE. First, a novel group convolutional fusion unit (GCFU) composed of a conventional CNN and a depth-wise separable CNN is utilized to reconstruct the signal. To exploit the inter-channel dependencies inside the network, the whole input sequence is fed concurrently to two convolution networks in parallel, and their outputs are rearranged and concatenated. To further enhance the model's performance, multiple layers within the encoder and decoder are connected using an intra-skip connection technique.
CapsNet vs. CNN
A convolutional layer efficiently learns hidden representations that encompass feature-level and time-level information cues. It can recognize patterns specified by kernel sizes, with various parts of the input data sharing the same kernels. When processing speech spectrograms as inputs, a convolutional layer learns spectrogram patterns representing information such as harmonic structures and low-energy zones [49]. Because a shared kernel is applied across the input spectrogram, the pooling layer keeps only the maximum (or minimum) value and ignores the other parts of the pooling window. As a result, detailed instantiation information is discarded. To concurrently capture feature representations within the feature maps, the dynamic multi-pooling CNN can apply a variety of pooling operations to different sections of the input feature maps [50]. The loss of information nevertheless remains a challenge for all pooling operations. The recently proposed CapsNet can pass information to the upper layers while maintaining spatial information in the form of a vector direction. The output of the convolutional layers is a sequence of varying length and must be compressed into a fixed-size, global utterance-level representation. Recurrent layers capture the temporal information in the speech sequence. One approach is to represent a speech utterance using the hidden states of the top recurrent layer from the last (and/or first) timestep. The first or last timestep, however, often corresponds to silence frames. Additionally, contextual information is lost over long distances. To solve such problems, an attention mechanism is usually applied to extract feature sequences. Another approach applies a gating mechanism to the CNN to capture relevant features, creating gate values that depend on input windows of a defined size across the feature map. Here, an utterance-level representation is instead derived using the dynamic routing protocol by simultaneously taking into account the weights applied to timesteps and feature dimensions. Further, the entire sequence is examined during the routing procedure, ensuring adequate capture of long-distance contexts.
The insensitivity to spatial cues is one of the drawbacks of CNNs. The neurons in the convolutional layers produce scalar outputs that solely indicate the probability that a feature pattern (such as a formant) matches the kernel. Detailed instantiation factors, such as formant positions, are not taken into account. Usually, a shared kernel is applied to different regions of the spectrogram, as shown in Fig. 1. Whenever the feature pattern matches the kernel, the output is activated (highlighted in red), irrespective of whether the activation occurred at the beginning or end of the feature pattern. Further, the max-pooling layer in a CNN keeps only the most active neurons, which makes it difficult for the upper layers to obtain information about spatial correlations. The max-pooling operation selects the active outputs at various spectrogram positions, as indicated in Fig. 1. For different speech utterances, the final CNN outputs therefore remain the same, and the subsequent layers are unable to discriminate between the two sequences. CapsNets offer two advantages over the CNN architecture (the capsules and the dynamic routing protocol) that make them worth considering. (i) The neurons in a convolutional layer produce scalar outputs; in contrast, capsules are groups of neurons that produce vector outputs containing instantiation information, including the pose and orientation of the patterns in the speech spectrograms. Since a neuron's output is a scalar quantity, it does not convey spatial information. (ii) The convolution layers in a CNN perform the important function of detecting feature maps, but valuable information is lost in the max-pooling layers [51]. Hence, the dynamic routing protocol replaces the max-pooling layer. The protocol couples capsules at different positions from lower to higher layers and enables the capsules in higher layers to capture important spatial relationships and features, overcoming the information loss observed with max-pooling. The spatial information differentiates the capsule outputs, as illustrated in Fig. 2: the capsules produce different outputs when the feature pattern occurs at different positions, and these outputs are marked in orange and blue. The position-aware capsule outputs are then routed dynamically to the higher layers.
CNN’s sense at different spectrogram positions. A shared kernel is used on separate segments of the input spectrogram. The active outputs are marked in yellow. Despite the fact that the active outputs derive from separate segments of the spectrogram, max pooling yields the same result (each indicated in yellow).
Capsule responses at different spectrogram positions. Different capsule outputs are produced at distinct segments, indicated in orange and blue. The capsule outputs are then sent to the higher layer through dynamic routing. The distinct final outputs (highlighted in orange and blue) support the SE.
Baseline System
To demonstrate the capability of CapsNet for SE, a baseline system comprised of CNN layers and recurrent layers is first implemented. The CNN module learns a neural representation from the input speech spectrogram, in which a convolutional layer learns feature maps from the spectrograms, whereas a pooling layer compresses the size of feature maps. Recurrent layers, such as the GRU or LSTM, are added to the CNN module so that the model can learn temporal cues. A fully connected layer at the end performs spectral mapping from noisy speech to clean speech. The CNN and RNN modules are described below.
A. Convolutional Neural Network Module
The CNN module in the baseline consists of convolutional and max-pooling layers. The three-dimensional input feature maps (width, height, and channel) of each convolutional layer are transformed into new three-dimensional feature maps, as shown in Fig. 3(A). A kernel, which is a three-dimensional matrix with a width, height, and channel number, is applied to the input by multiplying the kernel and the feature maps element by element, and the same kernel is shared across many segments of the input. The element-wise products are passed through a non-linear activation function to obtain scalar-valued outputs. A single neuron consists of a kernel and a corresponding non-linear activation function; multiple kernels are used to obtain multiple outputs. The max-pooling layer reduces the size of the input feature map by taking the maximum value over a certain window, set by the hyperparameters of the layer.
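A minimal PyTorch sketch of such a convolutional module is given below; the layer count, channel number, and kernel sizes are illustrative placeholders rather than the exact configuration of the baseline.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Illustrative CNN front-end: Conv2d layers followed by max pooling.

    Input: (batch, 1, time, freq) magnitude spectrogram segments.
    The channel count and kernel sizes are placeholders, not the values of Table 1.
    """
    def __init__(self, channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(3, 3), padding=1),  # learn local T-F patterns
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),                       # compress the frequency axis
            nn.Conv2d(channels, channels, kernel_size=(3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )

    def forward(self, spec):            # spec: (B, 1, T, F)
        return self.features(spec)      # (B, C, T, F // 4)

# Example: a batch of 4 spectrogram segments, 100 frames x 128 frequency bins
feats = ConvModule()(torch.randn(4, 1, 100, 128))
print(feats.shape)  # torch.Size([4, 16, 100, 32])
```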
Architectures of CNN and capsule networks. The neurons in the CNN, which produce scalar outputs, are replaced with capsules (groups of neurons) that output vectors representing the feature patterns. The max-pooling layer is replaced with a dynamic routing protocol to route the capsules to the upper layers.
B. Recurrent Layers
Recurrent layers are added to the CNN module so that the baseline can capture temporal information. The recurrent layers can be made up of simple recurrent units (SRUs) [52], gated recurrent units (GRUs) [32], or long short-term memory (LSTM) units [53]. In the baseline, bi-directional GRUs (BiGRUs) are added to obtain the temporal cues. The outputs of the CNN module are routed through the BiGRUs in the forward and backward directions. To obtain the output, the final state of the forward GRU and the first state of the backward GRU are concatenated and passed to the fully connected layer.
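The following PyTorch sketch illustrates this BiGRU summarization, concatenating the forward GRU's final state with the backward GRU's state at the first frame; the feature and hidden sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hedged sketch of the recurrent block: a bidirectional GRU whose final forward
# state and first backward state (both returned in h_n) are concatenated before
# the fully connected layer. Sizes are illustrative, not the paper's settings.
gru = nn.GRU(input_size=512, hidden_size=64, batch_first=True, bidirectional=True)
fc = nn.Linear(2 * 64, 64)

x = torch.randn(4, 100, 512)                    # (batch, time, flattened CNN features)
outputs, h_n = gru(x)                           # h_n: (num_directions, batch, hidden)
summary = torch.cat([h_n[0], h_n[1]], dim=-1)   # forward final state + backward state at frame 0
utterance_repr = torch.relu(fc(summary))        # (batch, 64)
```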
Proposed CapsNet-Based SE Systems
Figure 4 shows the proposed CapsNet-based framework for SE, which is made up of one CNN module, several capsule layers, and one fully connected layer. The structure of the CNN module is similar to the baseline. The CapsNet makes the CNN more sensitive to the instantiation parameters so that it can recognize spatial information. Even though the capsule architecture captures spatial information well, it has a higher computational cost than the CNN architecture. Further, the kernels of the convolutional layers are shared throughout the input spectrograms, which enables the CNN module to transfer optimized weights from one part of the spectrogram to another part of the same spectrogram. Each capsule in the CapsNet is made up of a group of neurons, and the output of each neuron encodes a different attribute of the same feature. Keeping a group instead of a single neuron provides both probability and instantiation information, which allows the network to learn all the features of the speech segments. The CNN's output features serve as the inputs to the capsules. As shown in Fig. 3(B), the convolutional layers that make up the first layer of capsules are called primary capsules. To create a capsule, neurons positioned at the same coordinates in the width and height dimensions of the output spectrogram feature map are combined, as in the sketch below. The dynamic routing protocol is employed to identify the hierarchical connections among the acquired features across the various layers.
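As an illustration of how primary capsules can be formed from a convolutional feature map, the sketch below groups the channels at every time-frequency position into capsule vectors; the capsule dimension of 8 is an assumption, not the paper's setting.

```python
import torch

def to_primary_capsules(feature_map, caps_dim=8):
    """Group the channel axis of a conv feature map into capsule vectors.

    Neurons sharing the same (time, frequency) position are combined into
    capsules of `caps_dim` neurons each; the capsule dimension is assumed
    for illustration.
    """
    b, c, t, f = feature_map.shape
    assert c % caps_dim == 0, "channels must split evenly into capsules"
    caps = feature_map.view(b, c // caps_dim, caps_dim, t, f)
    # -> (batch, capsules_per_position * positions, caps_dim)
    return caps.permute(0, 1, 3, 4, 2).reshape(b, -1, caps_dim)

primary = to_primary_capsules(torch.randn(4, 16, 100, 32))
print(primary.shape)   # torch.Size([4, 6400, 8])
```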
The system architecture of the CNN-LSTM-RecCapsNet and CNN-GRU-RecCapsNet. The output feature maps from the lower convolutional and pooling layers are first divided into windows for the capsule component, followed by a shared-weight capsule applied to the windows. Lastly, the outputs are combined and routed to the next layer of utterance-level capsules.
During dynamic routing, the prediction vector $\hat{u}_{j\mid i}$ from capsule $i$ in the lower layer to capsule $j$ in the upper layer is obtained by a learned linear transformation of the lower-layer capsule output $u_{i}$: \begin{equation*} \hat {u}_{j\mid i} =\mathbf {W}_{ij}u_{i}+b_{ij} \tag{1}\end{equation*}
The coupling coefficients $c_{ij}$ are computed by a softmax over the routing logits $d_{ij}$, the total input $s_{j}$ of an upper-layer capsule is the weighted sum of the prediction vectors, the squashing non-linearity produces the capsule output $v_{j}$, and the logits are updated by the agreement between $\hat{u}_{j\mid i}$ and $v_{j}$: \begin{align*} c_{ij}&=\frac {\exp (d_{ij})}{\sum _{k}\exp (d_{ik})} \tag{2}\\ s_{j}&=\sum _{i}c_{ij}\hat {u}_{j\mid i} \tag{3}\\ v_{j}&=\frac {\left \|{ s_{j} }\right \|^{2}}{1+\left \|{ s_{j} }\right \|^{2}}\frac {s_{j}}{\left \|{ s_{j} }\right \|} \tag{4}\\ d_{ij}&\leftarrow d_{ij}+\hat {u}_{j\mid i}\cdot v_{j} \tag{5}\end{align*}
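To make Eqs. (1)-(5) concrete, the sketch below implements routing-by-agreement over precomputed prediction vectors in PyTorch. It is an illustrative implementation, not the paper's code; the tensor shapes, capsule counts, and the three routing iterations are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Eq. (4): shrink s_j to length in (0, 1) while keeping its direction."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement over prediction vectors u_hat (Eqs. (2)-(5)).

    u_hat: (batch, num_lower, num_upper, caps_dim) prediction vectors
           computed beforehand as in Eq. (1).
    """
    b, n_low, n_up, d = u_hat.shape
    logits = torch.zeros(b, n_low, n_up, device=u_hat.device)       # d_ij
    for _ in range(num_iters):
        c = F.softmax(logits, dim=-1)                    # Eq. (2): coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)         # Eq. (3): weighted sum over lower capsules
        v = squash(s)                                    # Eq. (4): upper-capsule outputs
        logits = logits + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # Eq. (5): agreement update
    return v                                             # (batch, num_upper, caps_dim)

v = dynamic_routing(torch.randn(2, 6400, 10, 8))
print(v.shape)   # torch.Size([2, 10, 8])
```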
A. Recurrent CapsNet
The temporal features in speech signals provide significant information such as envelope, periodicity, and structure. Recurrent connections are incorporated into the dynamic routing process to allow the model to capture temporal features, thereby introducing a recurrent capsule (RecCaps) structure in the proposed SE system. The prediction vector at time step $t$ is computed from the current lower-level capsule output $u_{t,i}$ and the routed capsule output $o_{t-1}$ of the previous time step: \begin{equation*} \hat {u}_{t,j\mid i}=\mathbf {W}_{ij}^{u}u_{t,i}+\mathbf {W}_{ij}^{o}o_{t-1}+b_{ij} \tag{6}\end{equation*}
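The recurrence of Eq. (6) can be sketched in PyTorch as follows. This is a simplified illustration rather than the authors' implementation: the per-capsule weights, the bias, and all dimensions are assumed for demonstration, and $o_{t-1}$ is taken as the routed output of each upper-level capsule from the previous time step.

```python
import torch
import torch.nn as nn

class RecCapsPrediction(nn.Module):
    """Minimal sketch of Eq. (6): the prediction at time t combines the current
    lower-level capsule u_{t,i} with the previous routed output o_{t-1}.
    All dimensions are illustrative assumptions."""
    def __init__(self, in_dim=8, out_dim=16, num_upper=10):
        super().__init__()
        self.W_u = nn.Linear(in_dim, num_upper * out_dim, bias=False)              # W^u_{ij}
        self.W_o = nn.Parameter(torch.randn(num_upper, out_dim, out_dim) * 0.01)   # W^o_{ij}
        self.b = nn.Parameter(torch.zeros(num_upper, out_dim))                     # b_{ij}
        self.num_upper, self.out_dim = num_upper, out_dim

    def forward(self, u_t, o_prev):
        # u_t: (batch, num_lower, in_dim); o_prev: (batch, num_upper, out_dim)
        from_current = self.W_u(u_t).view(*u_t.shape[:2], self.num_upper, self.out_dim)
        from_previous = torch.einsum('bjd,jde->bje', o_prev, self.W_o)   # per-capsule recurrence
        return from_current + from_previous.unsqueeze(1) + self.b        # (B, lower, upper, out_dim)

pred = RecCapsPrediction()(torch.randn(2, 6400, 8), torch.zeros(2, 10, 16))
print(pred.shape)   # torch.Size([2, 6400, 10, 16])
```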
Systems Implementation
The proposed speech enhancement system contains a CNN module, CapsNet, and GRU/LSTM layers, as demonstrated in Fig. 4. The CNN module is common to the baseline and the proposed SE systems; it is applied to the spectrograms to derive neural representations for the upper layers. Table 1 provides the architectural details of the CNN module. To capture the correlation information across frequencies and timesteps, two distinct convolutional layers of kernel size
Algorithm for Speech Enhancement
Input: noisy spectrogram segments (number of samples, sequence frames, and channels)
Feature extraction and pooling: apply the CNN module (convolution and max pooling) to the input spectrograms and group the resulting feature maps into primary capsules
Output of each lower-layer capsule: prediction vector $\hat{u}_{j\mid i}$ computed according to Eq. (1)
Dynamic Routing
for all capsules $i$ in the lower layer and capsules $j$ in the upper layer: $d_{ij}\leftarrow 0$
for $r$ routing iterations do
for all capsules $i$ in the lower layer: $c_{ij}\leftarrow \mathrm{softmax}(d_{ij})$ (Eq. (2))
for capsules $j$ in the upper layer: $s_{j}\leftarrow \sum_{i}c_{ij}\hat{u}_{j\mid i}$ (Eq. (3))
for capsules $j$ in the upper layer: $v_{j}\leftarrow \mathrm{squash}(s_{j})$ (Eq. (4))
for capsules $i$, $j$: $d_{ij}\leftarrow d_{ij}+\hat{u}_{j\mid i}\cdot v_{j}$ (Eq. (5))
Return the upper-layer capsule outputs $v_{j}$
LSTM/GRU Initialization
Initialize moments: set the Adam optimizer moment estimates to zero
Create LSTM/GRU models (recurrent branch stacked on the CNN/capsule outputs)
While the stopping criterion is not met: train and validate the models
Test the models on the held-out noisy mixtures
A. Proposed SE System Configuration
The recurrent capsule network (RecCapsNet)-based SE system is developed by stacking RecCaps on the CNN module, denoted as the CNN-RecCapsNet. Regarding the CapsNet module, the initial step involves dividing the feature map from the CNN module into overlapping windows. Afterward, eight convolutional layers with a kernel size of
To further improve the system’s ability to learn long-term temporal dependencies, a GRU layer is placed atop the CNN module and parallel to the capsule branch, labeled as the CNN-GRU-RecCapsNet. Additionally, a CNN-LSTM-RecCapsNet is formulated by replacing the GRU layer with an LSTM layer. The outputs of the GRU layers and capsule elements are sent to distinct sets of fully connected dense layers. The outcomes of the two branches are combined using heuristic weights.
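The two-branch fusion can be illustrated as below; the weighting values are placeholders, since the exact heuristic weights are not specified here.

```python
import torch

# Hedged sketch of the branch fusion: the dense-layer outputs of the GRU/LSTM
# branch and the capsule branch are combined with heuristic weights.
# The 0.6/0.4 values are placeholders, not the paper's weights.
def fuse_branches(recurrent_out, capsule_out, w_rec=0.6, w_caps=0.4):
    return w_rec * recurrent_out + w_caps * capsule_out

enhanced = fuse_branches(torch.randn(4, 257), torch.randn(4, 257))  # e.g. per-frame magnitude estimates
```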
B. Baseline System Configuration
The first baseline system has five convolutional layers with a GRU layer above the CNN module, named CNN-GRU. The second baseline system is constructed by stacking an LSTM on top of the CNN module, named CNN-LSTM. Both the GRU and LSTM layers are bidirectional, with 64 neurons in each direction. The final forward GRU/LSTM states and the first backward GRU/LSTM states are concatenated and fed into a fully connected dense layer of 64 neurons with ReLU activation and a dropout rate of 0.5. The cross-entropy criterion is used as the objective function during network training.
C. Influence of Routing Number
The dynamic routing protocol is critical and is required to ensure that groups of connected lower-layer capsules are routed to the upper layers. The routing iteration can be repeated as needed; typically, a routing number of 3 is used.
D. Hyperparameters Tuning
To select the appropriate number of capsules, channels, convolutional layers, and neurons in the LSTM/GRU layers, this study performed hyperparameter tuning and examined the best configuration for speech enhancement. Three configurations, given in Table 2, are examined. The performance analysis of the SE system with these configurations is demonstrated in Figure 6. It can be observed that configuration 2 yields better SE performance (STOI: 91.6%, PESQ: 2.912, Covl: 3.82, and SNRSeg: 9.85dB at 5dB babble noise). Only marginal performance improvements are noticed with configuration 3, at the cost of additional model complexity and memory size. Configuration 1 shows the lowest SE performance. As a result, configuration 2, which offers reasonable performance with low computational complexity and better SE results, is adopted for the experiments.
System Evaluation
Experiments are conducted by using the publicly available LibriSpeech [54] and DEMAND [55] databases, whereas the performance of the CapsNet-based SE systems is evaluated by the PESQ [56] and STOI [57] objective measures.
A. Database
Clean utterances of both genders from the LibriSpeech database are used in the experiments, with a defined recipe of training and testing sets. The LibriSpeech database contains around 1000 hours of 16kHz English speech collected from LibriVox audiobooks that have been carefully segmented and aligned. To evaluate the SE models in noisy settings, various noise sources are selected from the DEMAND database. The DEMAND (Diverse Environments Multichannel Acoustic Noise Database) offers a collection of recordings that enable systems to be tested in a variety of contexts using real-world noise. The database includes 15 recordings, all produced with a 16-channel microphone array whose inter-microphone distances range from 5 cm to 21.8 cm. To produce the noisy stimuli, three SNRs are considered: −5dB, 0dB, and 5dB. The training set contains utterances of both genders mixed with all noise types except Factory2 and Cafeteria2, which are held out and treated as unseen noises; all other noises are included in both training and testing.
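For illustration, the sketch below shows one common way to mix a clean utterance with a noise recording at a target SNR (for example −5, 0, or 5 dB); it is a generic recipe, not necessarily the exact mixing procedure used in the experiments.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Create a noisy mixture at the requested SNR.

    A random noise segment is cut to the utterance length and scaled so that
    the clean-to-noise power ratio matches snr_db.
    """
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    noise_seg = noise[start:start + len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise_seg ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise_seg

# Example with synthetic signals (the real experiments use LibriSpeech + DEMAND audio)
clean = np.random.randn(16000)            # 1 s at 16 kHz
noise = np.random.randn(10 * 16000)
noisy = mix_at_snr(clean, noise, snr_db=0)
```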
B. Evaluation Measures
Two measures, the PESQ (Perceptual Evaluation of Speech Quality) and the STOI (Short-Time Objective Intelligibility), are used to evaluate the proposed SE systems. PESQ assesses speech quality, whereas STOI assesses speech intelligibility.
PESQ, an ITU-T P.862 recommendation, assigns a value between −0.5 and 4.5 to perceptually score speech quality. In contrast to other objective measures that accept both positive and negative loudness variations equally, the PESQ treats both independently. Since positive and negative loudness variations impact the perceived quality differently, a positive difference indicates the addition of a spectral component, such as a noise signal, and a negative difference shows the removal or substantial attenuation of a spectral component. STOI measures speech intelligibility and assigns values between 0 and 1. In the short-time speech segments, STOI reveals a correlation between the temporal envelopes of clean and distorted speech. It differs from many objective measures by analyzing relatively short speech segments (10–20 milliseconds) instead of the complete speech utterance.
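Assuming the third-party pesq and pystoi Python packages, the two measures can be computed roughly as follows; the signals here are placeholders for real clean/enhanced pairs.

```python
import numpy as np
from pesq import pesq      # pip install pesq   (ITU-T P.862 implementation)
from pystoi import stoi    # pip install pystoi

fs = 16000
clean = np.random.randn(3 * fs)                        # placeholder clean utterance
enhanced = clean + 0.05 * np.random.randn(3 * fs)      # placeholder enhanced output

pesq_score = pesq(fs, clean, enhanced, 'wb')           # higher is better
stoi_score = stoi(clean, enhanced, fs, extended=False) # 0 .. 1, higher is better
print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.3f}")
```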
C. Network Training
An optimal weight initialization is important for CapsNet to converge [58]. In the experiments, the Glorot-Uniform (Xavier) [59] initializer is used to initialize both the CNN module and the capsule layer. The batch size is set to 32. The Adam optimizer is used with its default parameters, which have been shown to work well for deep learning problems.
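A hedged sketch of this training setup (Xavier-uniform initialization, batch size 32, Adam with default hyperparameters) is shown below with a stand-in model and toy data; the loss function is a placeholder, not the paper's objective.

```python
import torch
import torch.nn as nn

def init_glorot(module):
    """Glorot/Xavier-uniform initialization for convolutional and linear weights."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Stand-in network; the real SE model is the CNN + capsule + GRU/LSTM stack.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 128))
model.apply(init_glorot)

optimizer = torch.optim.Adam(model.parameters())   # Adam with its default hyperparameters
criterion = nn.MSELoss()                           # placeholder loss for this sketch
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 1, 100, 128), torch.randn(64, 128)),
    batch_size=32, shuffle=True)

for noisy, target in loader:                       # one pass over the toy data
    optimizer.zero_grad()
    loss = criterion(model(noisy), target)
    loss.backward()
    optimizer.step()
```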
Results and Discussions
A. Speech Enhancement in Seen Noises
This section first presents the results of CapsNet-based SE systems trained on the LibriSpeech and seen noises from the DEMAND datasets. Five non-stationary noises are selected to showcase the proposed SE systems.
Table 3 shows the average STOI results for five example noises at −5dB, 0dB, and 5dB SNR. The three proposed SE systems, which combine the recurrent CapsNet with LSTM/GRU layers, perform better. It can be seen that CNN-LSTM-RecCapsNet achieves significantly higher STOI than CNN-RecCapsNet and CNN-GRU-RecCapsNet, indicating that it is the better architecture for speech enhancement among the three. CNN-LSTM-RecCapsNet consistently performs best across the five noises at all SNRs. The proposed CNN-LSTM-RecCapsNet yielded the highest STOI values (79.1%) with cafeteria and factory noise at low SNR (−5dB). The STOI with babble noise is increased from 55.6% to 77.9% with CNN-LSTM-RecCapsNet, a 22.3% improvement in STOI at −5dB SNR. Similarly, the STOI with street noise is increased from 70.5% to 84.6% with CNN-LSTM-RecCapsNet, a 14.1% improvement in STOI at 0dB SNR. Further, the STOI with factory noise is increased from 54.9% to 78.0% with CNN-GRU-RecCapsNet, a 23.1% improvement in STOI at −5dB SNR. The average STOI of the three CapsNets across all noise levels and the five noises is 81.92%, 83.12%, and 84.69%, suggesting the viability of CapsNet for speech enhancement. The average STOI of noisy speech increased from 66.01% to 81.92%, 83.12%, and 84.69% with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving improvements of 15.91%, 17.11%, and 18.68% in STOI. CNN-LSTM-RecCapsNet and CNN-GRU-RecCapsNet outperform CNN-RecCapsNet, indicating the benefit of the GRU/LSTM layers in the proposed SE.
Table 4 shows the average PESQ results for the five noises at −5dB, 0dB, and 5dB SNR. The RecCapsNets are clearly more effective at increasing perceptual speech quality (PESQ). The better PESQ scores imply that less spatial information of the speech spectrograms is lost when the capsules use dynamic routing to reach the upper layers. The three proposed SE systems, which combine the recurrent CapsNet with LSTM/GRU layers, perform better in terms of PESQ. CNN-LSTM-RecCapsNet outperforms CNN-RecCapsNet and CNN-GRU-RecCapsNet in terms of perceptual speech quality. RecCapsNet with stacked LSTM/GRU layers outperforms CNN-RecCapsNet, indicating the significance of these layers in the proposed SE system for improving the perceived quality of noisy speech. In terms of PESQ, CNN-LSTM-RecCapsNet consistently performs best across the five noises at all SNRs. The proposed CNN-LSTM-RecCapsNet yielded the highest PESQ value (2.13) with factory noise at low SNR (−5dB). The PESQ with cafeteria noise is increased from 1.53 to 2.65 with CNN-LSTM-RecCapsNet, a 42.26% improvement in PESQ at 0dB SNR. Similarly, the PESQ with exhibition noise is increased from 2.03 to 2.92 with CNN-LSTM-RecCapsNet, a 30.47% improvement in PESQ at 5dB SNR. Further, the PESQ with babble noise is increased from 1.35 to 2.01 with CNN-GRU-RecCapsNet, a 32.83% improvement in PESQ at −5dB SNR. The average PESQ of the three CapsNets across all noise levels and the five noises is 2.36, 2.48, and 2.55, suggesting the viability of CapsNet for speech enhancement. The average PESQ of noisy speech is increased from 1.69 to 2.36, 2.48, and 2.55, thereby improving PESQ by 28.38%, 31.85%, and 33.72% with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, respectively. The average PESQ and STOI results for all noises from the LibriSpeech corpus across the three SNRs are given in Table 5.
B. Speech Enhancement in Unseen Noises
This section presents the results of the proposed SE systems using LibriSpeech and two noises (cafeteria2 and factory2) from the DEMAND dataset. Table 6 lists the STOI results for the two unseen noises at −5dB, 0dB, and 5dB SNRs, where the proposed CapsNet SE systems perform significantly better in improving speech intelligibility in unseen noisy environments. All three versions of CapsNet improved the STOI by considerable margins. For example, the STOI with cafeteria2 noise is increased from 54.5% with noisy speech to 70.4%, 71.9%, and 73.8% with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving 15.9%, 17.4%, and 19.3% improvements in STOI at −5dB SNR. Similarly, the STOI with factory2 noise is increased from 63.7% with noisy speech to 73.2%, 73.8%, and 77.9% with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving 9.5%, 10.1%, and 14.2% improvements in STOI at 0dB SNR.
Table 7 gives PESQ findings for two unseen noises at −5dB, 0dB, and 5dB SNRs where the proposed CapsNet SE performs significantly better in improving the perceptual speech quality in unseen noisy environments. Three CapsNet architectures increased the PESQ scores by considerable proportions. For example, the PESQ with cafeteria2 noise is increased from 1.62 with noisy speech to 2.22, 2.43, and 2.49 with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving 27.02%, 33.33%, and 34.93% improvements in PESQ scores at 0dB SNR. Further, the PESQ with factory2 noise is increased from 1.71 with noisy speech to 2.77, 2.82, and 2.88 with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving 36.9%, 37.23%, and 40.62% improvements in PESQ scores at 5dB SNR. The comparison of STOI and PESQ in seen and unseen noises is provided in Fig. 7. In summary, the proposed CNN-LSTM-RecCapsNet and CNN-GRU-RecCapsNet perform the best in terms of both STOI and PESQ.
C. Model Comparison
The proposed CapsNet-based SE systems are measured against different baselines: a feed-forward DNN (FDNN), a unidirectional LSTM, CNN-GRU, a temporal CNN (TCNN), a gated recurrent network (GRN), a convolutional fusion network (CFN), and a fully convolutional network (FCN). The FDNN [21] is a 5-layer model with 2048 units in each layer and ReLU activation. The unidirectional LSTM model implemented in [60] is a 4-layer model with 1024 units in each layer. In the CNN-GRU model [61], the CNN is constructed of four frequency-dilated convolution layers, and a three-layer GRU with 256 units is stacked on top of the CNN to capture the temporal dynamics of speech. TCNN [62] is composed of an encoder, a decoder, and a temporal convolutional module (TCM). Two-dimensional causal convolutional layers are employed in the encoder-decoder modules, whereas the TCM is composed of one-dimensional causal and dilated convolutional layers; batch normalization and parametric ReLU activation follow each encoder layer. The GRN [40] uses four stacked convolutional layers to extract spatial features, followed by time-dilated temporal convolutions. CFN [48] reconstructs the speech signal using group convolutional fusion units consisting of conventional and depth-wise separable CNNs. FCN [63] has 8 convolutional layers, each followed by batch normalization and Leaky ReLU activation. The other benchmarks include MCBNet [64], PL-CRNN [41], DTLN [65], DCCRN [66], DNN-TGSA [67], DeepResGRU [32], DeepXi [68], CRN-BLSTM [69], and CNN [84].
This section first compares the FDNN with the three proposed CapsNet models. As illustrated in Table 8, the FDNN improves the STOI by 9.6% with cafeteria noise and 11.3% with factory noise at −5dB SNR, and improves the PESQ by 0.25 with babble noise and 0.46 with factory noise at −5dB SNR over the unprocessed mixtures. Going from the FDNN to the LSTM improves both metrics considerably: the LSTM improves the STOI and PESQ by 17.7% and 0.38 (20.76%) with babble noise at −5dB SNR. Unlike the FDNN, the LSTM model incorporates temporal context through its recurrent connections. Moving from the LSTM to CNN-GRU yields further gains over the unprocessed mixtures; for example, at 0dB SNR the STOI with babble noise is improved by 15.1% and the PESQ by 0.58 (25.21%). Further, the temporal CNN yields results comparable to, and slightly better than, CNN-GRU over the noisy mixtures in babble and cafeteria noises; at −5dB SNR the STOI and PESQ with cafeteria noise improve by 15.2% and 0.56 (29.94%). The gated recurrent network (GRN) leverages long-term contextual information and treats SE as a sequence-to-sequence mapping, considerably improving the STOI and PESQ across the noisy mixtures; for example, for the 5dB babble noisy mixture, the GRN improves the PESQ and STOI by 0.51 (20%) and 5.6%. The CFN and FCN also improve the STOI and PESQ over the unprocessed mixtures; taking the 0dB average over both noises, the CFN improves the STOI by 14.21% and the PESQ by 0.71 (30.21%), whereas the FCN improves the PESQ by 0.42 (20.79%) and the STOI by 7.68%. In addition, taking the average over both noises, the CNN improves the STOI by 11.9% and the PESQ by 0.55 (25.11%) over the noisy speech.
The proposed CapsNet-based SE systems consistently outperform the benchmarks in all scenarios except at −5dB SNR (babble and cafeteria), where CFN marginally outperforms the proposed CNN-RecCapsNet. In the other noisy scenarios, the CapsNets improve the STOI and PESQ substantially. Take, for example, babble noise at −5dB SNR: the STOI with the three CapsNets increases by 22.5%, 23.1%, and 24.2%, respectively, whereas the PESQ improves by 0.52, 0.61, and 0.68 over the unprocessed mixtures. The three CapsNets also yield substantially better results than TCNN, CFN, and CNN-GRU. The STOI improves from 75.6%, 73.5%, and 72.4% with TCNN, CFN, and CNN-GRU to 79.1% with the proposed CNN-LSTM-RecCapsNet, while the CNN-GRU-RecCapsNet achieves 3.9%, 0.5%, and 5.7% improvements in STOI at −5dB babble noise. The average STOI improves from 76.8% with GRN to 83.3% with CNN-RecCapsNet. The average PESQ improves from 2.20 and 2.35 with FDNN and LSTM to 2.60 and 2.66 with CNN-GRU-RecCapsNet and CNN-LSTM-RecCapsNet, respectively. In contrast to CNN-RecCapsNet, the LSTM/GRU layers on top of the CNN module improve the system's ability to learn long-term temporal dependencies, thereby notably increasing perceptual quality and intelligibility (as can be observed in Table 7). Further, the loss of spatial information from the speech spectrograms is reduced with CapsNet, which is important for improving the human perception of speech. The overall improvements of the proposed SE systems and the benchmarks over the unprocessed mixtures are demonstrated in Fig. 8.
The overall improvements of the CapsNet-based SE systems and the benchmarks over the unprocessed mixtures.
For visual comparison, spectrograms are plotted to examine the noise reduction. Figure 9 shows sample spectrograms of an utterance degraded by babble noise at 0dB SNR; the marked low-frequency regions indicate the presence of residual noise. The FDNN does not degrade the speech signal severely, but it is unable to remove the noise components observed in the low-frequency regions. The LSTM reduces the noise further, yet some noise remains visible. The spectrograms produced with the CapsNet and recurrent layers appear closest to the clean spectrogram and introduce negligible distortions.
Visual Representation of CapsNet-based and Benchmark SE Systems. The sample spectrograms of utterance were degraded by babble noise at 0dB SNR. The improved spectrograms with the CapsNet and recurrent layers seem the closest to the clean spectrogram, introducing negligible distortions.
D. Parameters Efficiency
Compared to the benchmarks, the proposed SE systems have greater parameter efficiency. The number of learnable parameters in the CapsNets and the different benchmark models is shown in Fig. 10. The parameter count of the CNN-RecCapsNet system (0.71 million) decreases significantly with the recurrent structure. The integration of recurrent layers (GRU/LSTM) raises the number of parameters, but the learnable parameters remain lower than the benchmarks: CNN-GRU-RecCapsNet (1.52 million) has about 12% fewer parameters than CNN-LSTM-RecCapsNet (1.73 million). The LSTM and FDNN have larger parameter counts than the three proposed CapsNets. Among the benchmarks, GRN (2.49 million) offers the best parameter efficiency due to the sharing of weights in its convolution computations. We further report the multiply-accumulate operations (MACs) and memory size of the proposed models. MACs are often used as a metric of computational complexity; networks with more MACs are computationally more intensive. CNN-RecCapsNet shows the lowest MACs and memory size (0.528G/s and 4.57MB). With the integration of the LSTM layer, CNN-LSTM-RecCapsNet shows higher MACs and memory (0.697G/s and 8.95MB), whereas CNN-GRU-RecCapsNet requires fewer MACs (0.61G/s) and less memory (6.85MB) than CNN-LSTM-RecCapsNet. As computational resources are often constrained in real-world applications, it is necessary to establish an appropriate trade-off between the model's enhancement performance and its parameter efficiency. Further, this section provides complete details of the computational complexity (MACs, FLOPs, and parameters).
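Parameter counts and MACs of this kind can be estimated, for example, with a simple parameter sum and the third-party thop profiler, as sketched below; the model here is a stand-in, not one of the proposed networks.

```python
import torch
import torch.nn as nn
# pip install thop  -- a third-party profiler commonly used to estimate MACs
from thop import profile

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1))   # stand-in network

params = sum(p.numel() for p in model.parameters() if p.requires_grad)
macs, _ = profile(model, inputs=(torch.randn(1, 1, 100, 128),))
print(f"trainable parameters: {params / 1e6:.2f} M, MACs: {macs / 1e9:.3f} G")
```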
E. Cross Database Analysis
A speech dataset is typically composed of various utterances spoken by distinct speakers. The utterances are recorded in controlled situations to obtain clean recordings acceptable for speech applications, but the controlled context differs from one dataset to another, which can introduce distinct characteristics into the speech; for example, the quality of speech recorded by the same individual with different microphones might differ considerably. Therefore, the proposed SE systems are further evaluated on the IEEE and TIMIT datasets. Tables 10 and 11 show the average PESQ and STOI results across all noises and SNRs on the IEEE and TIMIT speech datasets. The TIMIT speech corpus was created to provide speech data for acoustic-phonetic experiments and for designing and evaluating SE and ASR systems; it provides phonetically rich sentence recordings [70] stored as 16-bit, 16-kHz waveform files. The IEEE speech corpus [71] contains 720 utterances. To produce the noisy mixtures, 1000 (TIMIT) and 600 (IEEE) clean utterances of both genders are randomly selected and mixed with cafeteria and babble noise at −5dB, 0dB, and 5dB SNRs.
Table 10 shows a cross-corpus STOI analysis in which all SE systems perform better in terms of increasing speech intelligibility. The proposed SE systems outperform the benchmarks on both the TIMIT and IEEE speech corpora. With the IEEE speech corpus, the comparison with the benchmarks shows that the STOI scores are improved by 19.6%, 18.4%, and 17% with CNN-LSTM-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-RecCapsNet at −5dB, which is a challenging SNR condition. In the other SNR conditions, the proposed SE systems also increase the STOI scores, representing a substantial intelligibility gain across the different speech corpora. Similarly, Table 11 shows a cross-corpus PESQ analysis where the proposed SE systems perform substantially better in terms of increasing perceptual speech quality. With TIMIT, the proposed SE systems improved the average PESQ scores by 0.75, 0.87, and 0.92 over the unprocessed mixtures.
To further analyze the proposed SE model, this study uses the publicly available VoiceBank+DEMAND dataset with the exact recipe followed by the benchmark studies. The training set (composed of 11572 speech utterances) consists of 28 speakers with four SNRs (15dB, 10dB, 5dB, and 0dB). The test set (composed of 824 speech utterances) consists of 2 speakers with four SNRs (17.5dB, 12.5dB, 7.5dB, and 2.5dB). The results are presented in Table 12 to validate the performance of the proposed model against the benchmark models: SEGAN [72], MetricGAN+ [73], GAGNet [74], RDL-Net [75], DEMUCS [76], TSTNN [77], and SE-Conformer [78]. With the VoiceBank+DEMAND dataset, the proposed model achieves the best results as compared to the benchmark models except
This study further compares the proposed deep learning (DL) models with various non-machine-learning (NML) approaches to highlight their superiority. The NML methods for speech enhancement include low-rank sparse decomposition (LRSD) [79], nonnegative RPCA (NRPCA) [80], spectral subtraction (SpcSub) [2], and statistical models [3], [81]. Table 13 provides the experimental results achieved with LRSD, NRPCA, MMSE, and the three proposed models in terms of STOI and PESQ. The results clearly indicate that the ML/DL-based speech enhancement is superior in improving speech quality and intelligibility.
The proposed model is further applied to the speech separation task and compared to various related models, including a basic CNN [82], U-Net [27], and CapsNet [83]. Four evaluation measures examine the speech separation performance: Source-to-Interference Ratio (SIR), Signal-to-Distortion Ratio (SDR), Source-to-Artifact Ratio (SAR), and Normalized Source-to-Distortion Ratio (NSDR). In general, high SDR, SIR, and SAR values are desirable because they indicate that the processed signal closely resembles the original source signal while minimizing unwanted artifacts and distortions. As with SDR, a positive NSDR value indicates better performance and higher speech quality, whereas a negative NSDR indicates less effective separation or enhancement with noticeable introduced distortions. Table 14 provides the speech separation results, where the proposed model (CNN-LSTM-RecCapsNet) obtains better results than the related architectures.
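These separation metrics can be computed, for example, with the mir_eval toolkit, as sketched below with placeholder signals; NSDR is then the SDR of the estimate minus the SDR of the unprocessed mixture.

```python
import numpy as np
# pip install mir_eval   (standard BSS-Eval implementation; usage shown as a sketch)
from mir_eval.separation import bss_eval_sources

fs = 16000
speech = np.random.randn(2 * fs)         # placeholder target source
noise = np.random.randn(2 * fs)          # placeholder interfering source
mixture = speech + noise
estimate = speech + 0.1 * noise          # pretend output of the separation model

references = np.stack([speech, noise])               # (n_sources, n_samples)
estimates = np.stack([estimate, mixture - estimate])
sdr, sir, sar, _ = bss_eval_sources(references, estimates)
print(f"target: SDR {sdr[0]:.2f} dB, SIR {sir[0]:.2f} dB, SAR {sar[0]:.2f} dB")
# NSDR is obtained by subtracting the SDR of the unprocessed mixture from sdr[0].
```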
F. Subjective Analysis
This section subjectively evaluates the performance of three proposed speech enhancement models: CNN-RecCapsNet (proposed-1), CNN-GRU-RecCapsNet (Proposed-2), and CNN-LSTM-RecCapsNet (Proposed-3). The subjective test computes the mean opinion scores (MOS) for two SNRs: 0dB and 5dB. Five different age groups participated as volunteers in the subjective tests. Before testing, the volunteers underwent training in a preliminary session. The experiments are performed in a noise-free environment. During evaluations, there was no repetition of speech samples. Participants were instructed to provide their evaluations of the enhanced speech quality, as depicted in Figure 11. The MOS scores indicate the efficacy of the proposed speech enhancement models. All three models achieved better MOS scores, particularly under challenging SNR (MOS
Conclusion
This research devises a methodology for applying capsule networks (CapsNet) to the monaural speech enhancement task. To extract spatial information from the speech spectrogram, a recurrent capsule framework is leveraged to capture neural representations, and recurrent layers are integrated to capture temporal information. The spatiotemporal information encoded in the input speech spectrograms is thus fully utilized by CapsNet. With CapsNet, this study mitigates the loss of spatial (frequency) information that occurs in the typical CNN paradigm, where the loss of local information propagates to higher levels. The CapsNet-based SE systems are trained on the LibriSpeech, IEEE, and TIMIT speech corpora, with noise sources collected from the DEMAND dataset. In seen noises, the average STOI score of noisy mixtures increased from 66.01% to 81.92%, 83.12%, and 84.69% with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, confirming the role of the GRU/LSTM layers in the proposed SE systems for improving speech intelligibility. The PESQ scores with unseen noisy mixtures increased from 1.71 with noisy speech to 2.77, 2.82, and 2.88 with CNN-RecCapsNet, CNN-GRU-RecCapsNet, and CNN-LSTM-RecCapsNet, achieving 36.9%, 37.23%, and 40.62% improvements in perceptual speech quality. In addition, the proposed CapsNet-based SE systems consistently outperform the benchmarks in all scenarios except −5dB SNR (babble and cafeteria). Compared with the unprocessed mixtures, the three CapsNets yielded substantially better results than the benchmarks: the STOI scores improved from 75.6%, 73.5%, and 72.4% with TCNN, CFN, and CNN-GRU to 79.1% with the proposed CNN-LSTM-RecCapsNet, while CNN-GRU-RecCapsNet achieves 3.9%, 0.5%, and 5.7% improvements in STOI. The loss of spatial information from the speech spectrograms is reduced with CapsNet, which is important for improving the human perception (quality) of speech. The spectrogram plots verified that the CapsNet and recurrent layers generate the closest match to the clean spectrogram. Since computational resources are often limited in real-world applications, both strong enhancement performance and parameter efficiency are achieved. Cross-corpus STOI and PESQ evaluations demonstrated that the CapsNet systems perform well on three different speech corpora. The proposed model is implemented for speech enhancement; however, it can be adapted for automatic speech recognition (ASR), speech emotion recognition (SER), image processing, and bioinformatics after appropriate changes to the model.
ACKNOWLEDGMENT
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/383/44. The authors would like to thank Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R161), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.