Introduction
Recent developments in speech signal processing have shown numerous clinical applications for the non-invasive diagnosis of diseases, enabling effective remote health monitoring and remote healthcare facilities [1], [2], [3], [4]. In the Coronavirus Disease 2019 (COVID-19) pandemic scenario, such a speech-based remote health monitoring system can play a crucial role. According to World Health Organization data, more than 579 million people had been infected, including over six million reported deaths, as of August 8, 2022 [5]. The standard and reliable test for COVID-19 is the reverse transcription-polymerase chain reaction (RT-PCR) test, which is expensive (about US $125 per test package, and over US $15,000 to set up a processing lab) and time-consuming (4–6 hours of processing time, and a turn-around of 2–4 days, including shipping) [6]. To deal with this challenging situation, large-scale testing is required for isolating infected individuals and for contact tracing [7]. Under this scenario, speech-based COVID-19 detection (CD) is one of the simplest, safest, and most cost-effective methods [8].
Several temporal and spectral acoustic features have been used as inputs to a random forest model to classify speech into nine categories: shallow and deep breathing, shallow and heavy cough, sustained vowel phonation (/o/, /e/, /a/), and normal and fast counting [9]; a detection accuracy of 66.74% is reported in that study. In [10], respiratory sounds such as cough and breathing are employed to distinguish COVID-19 from asthma using 733-dimensional features comprising 477 handcrafted features and 256 VGGNet-based features; a logistic regression classifier yields an area under the receiver operating characteristic curve (ROC-AUC) above 80%. CD from openly available speech data has been carried out using phoneme-level analysis, Mel filter bank features, and an SVM classifier, achieving a reported accuracy of 88.6% on a limited set of 19 speakers [11]. An automated machine learning-based COVID-19 classification model has been developed using glottal, prosodic, and spectral features from short-duration speech segments, yielding a classification accuracy of 80% [12]. Modified cepstral features extracted from two speech databases and fed to support vector machine (SVM) classifiers achieve a maximum accuracy of 85% [13]. Transfer learning-based deep neural network classifiers have been applied to cough, breath, and speech for CD, with ROC-AUCs of 0.982, 0.942, and 0.923, respectively [14]. Several machine learning algorithms have been analyzed for mobile health CD solutions, with the SVM technique providing the highest accuracy of 97% on the Coswara database [15]. A mobile application combining a symptom checker with voice, breath, and cough signals has been developed using deep CNNs and gradient boosting for robust performance on openly sourced and noisy datasets [16].
Even though several speech-based CD methods have been proposed, there is still scope for improvement in detection accuracy, computational complexity, and testing on multiple datasets across different categories of speech. Since early CD is essential, higher and more reliable detection accuracy is very important, as it would drastically reduce the spread of the disease and the associated medical emergencies. Additionally, many researchers have focused on chest X-rays for CD using several image processing techniques [17], [18], [19], [20], [21]. Although these achieve superior accuracy, the acquisition of chest X-rays is a cumbersome task: a physical visit, a well-trained technician for successful data acquisition, and a medical practitioner are all required. In light of these considerations, the current research focuses on the development of an improved speech-based CD system. For efficient extraction of information from the speech samples, an effective combination of speech features is used in this paper along with the Light Gradient Boosting Machine, which was proposed by Microsoft in 2016 [22]. It provides improved training performance with minimal memory requirements, parallel processing ability, and the capacity to handle large-scale data compared to traditional machine learning algorithms. In recent years, it has been employed for genomics data analysis [23], speech processing [16], image processing [24], arrhythmia detection [25], and other tasks. Because of these advantages, the gradient boosting technique is chosen in the current implementation to achieve better classification performance. The main research contributions of the paper are listed below:
Application of intelligent preprocessing techniques to bring the speech quality of the different real-life recordings to comparable acoustic levels.
Extraction of spectral, cepstral, and periodicity features at the frame level, efficiently combined into high-dimensional relevant audio features at the sample level, to accurately detect several respiratory diseases including COVID-19 and Asthma.
Development of a Gradient Boosting Machine classifier and comparison of the detection performance metrics of the proposed method with those obtained from standard methods using five datasets in thirteen different categories.
Assessment of the generalization ability of the proposed model, which can be deployed as a clinical application wherein the model is trained on a large number of speech samples from the cough category of multiple datasets and can then predict a patient's condition from his/her cough sound.
The paper is organized into four sections. Section I deals with the introduction, literature review, motivations, and objectives of the investigation. The details of the materials and methods employed are dealt with in Section II. Section III contains an analysis of results and contributions in terms of research findings. The outcome of the research, limitations, and future research scope are presented in Section IV.
Materials and Methods
The block diagram of the proposed speech-based COVID-19 detection scheme is presented in Fig. 1, consisting of the following steps: dataset collection, preprocessing and feature extraction, scaling of features, classification model training and validation, and performance evaluation.
A. Datasets
Five datasets have been used to evaluate the performance of the suggested model in this study: Coswara (Dataset-1) [9], the crowdsourced respiratory dataset of the University of Cambridge (Dataset-2) [10], Virufy (Dataset-3) [26], recorded interviews from online platforms in telephone-quality speech (Dataset-4) [11], and Coughvid (Dataset-5) [7]. Of these, Dataset-2 is used for both binary (COVID-19 positive and healthy) and multi-class classification (COVID-19 positive, Asthma positive, and healthy), whereas Datasets 1, 3, 4, and 5 are used for the binary classification task. These datasets contain speech samples of subjects from more than 50 countries. The dataset preparation follows a standard technique, as shown in Fig. 2. Due to the highly contagious nature of COVID-19, the speech samples in most of these datasets are recorded online using mobile or web-based applications [7], [9], [10], [11], [26]. Along with the audio samples, the COVID-19 status, location, gender, age, and health conditions of the patients are also stored. Brief details of these five datasets are listed in Table I. A total of 4178 speech samples have been used in the simulation study. Complete details of these datasets are given in supplementary information S1.
B. Preprocessing
Speech preprocessing is critical to the overall success of developing a robust and efficient speech recognition system [27]. When speech is recorded by different users in different environments, the speech quality varies drastically within one category of a dataset as well as across datasets [28]. The background noise level significantly affects the overall performance of a speech recognition system [29], [30]. For highly non-stationary situations, the noise level is computed using the noise estimation algorithm of [31]. To evaluate the effect of preprocessing, the variation in noise level and the coefficient of variation are plotted in Figures 3 and 4 for the two cases, before and after preprocessing. The coefficient of variation measures the variation in noise level as the ratio between the standard deviation and the mean of the estimated noise levels for one class [32]. For noise level estimation, the cough category sound is used for Datasets 1, 2, 3, and 5, and complete-sentence sounds for Dataset-4. The steps involved in preprocessing are described below.
Fig. 4. Change in coefficient of variation (CV) of noise level between the positive and negative classes.
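The coefficient-of-variation measure plotted above is straightforward to compute; the following is a minimal sketch (the function name and the use of the sample standard deviation are our own choices, not specified in the paper):

```python
import numpy as np

def coefficient_of_variation(noise_levels):
    """CV of the estimated noise levels of one class: std / mean [32]."""
    levels = np.asarray(noise_levels, dtype=float)
    return levels.std(ddof=1) / levels.mean()
```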
1) Low Pass Filtering
The sampling frequency of the speech signals differs across datasets. However, significant information is found within an 8 kHz bandwidth [33], as is also evident from Fig. 5, where the time-frequency representation of one cough signal of Dataset-2 is plotted using the spectrogram. To remove unwanted signal components not associated with human speech, all audio signals are passed through a low-pass filter with a 10 kHz cutoff. To maintain a uniform sampling rate and extract the same number of features for each frame, all speech signals are resampled at the maximum sampling frequency available across the datasets (48 kHz).
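As an illustration, this step could be realized as in the sketch below. This is not the authors' code: the filter order (4) and the choice of a zero-phase Butterworth filter are our assumptions, since the paper specifies only the 10 kHz cutoff and the 48 kHz target rate.

```python
import numpy as np
from fractions import Fraction
from scipy.signal import butter, filtfilt, resample_poly

TARGET_FS = 48_000   # common sampling rate across all datasets
CUTOFF_HZ = 10_000   # low-pass cutoff retaining the speech band

def lowpass_and_resample(x: np.ndarray, fs: int) -> np.ndarray:
    """Low-pass filter at 10 kHz (when the rate permits) and resample to 48 kHz."""
    if fs > 2 * CUTOFF_HZ:  # cutoff must lie below the Nyquist frequency
        b, a = butter(4, CUTOFF_HZ, btype="low", fs=fs)  # order 4 is an assumption
        x = filtfilt(b, a, x)  # zero-phase filtering avoids group delay
    # Rational resampling to the common 48 kHz rate.
    frac = Fraction(TARGET_FS, int(fs)).limit_denominator(1000)
    return resample_poly(x, frac.numerator, frac.denominator)
```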
2) Speech Enhancement
The multi-band spectral subtraction approach has been employed to denoise the speech samples of all five datasets [34]. This is a simple and effective method for denoising signals affected by colored noise, in which spectral subtraction is performed separately in different frequency bands.
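To make the mechanism concrete, a simplified single-band version is sketched below; the multi-band method of [34] applies the same subtraction per frequency band with band-specific over-subtraction factors. The STFT size, the noise-only leading segment, and the spectral floor are all illustrative assumptions, not parameters from the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs, noise_dur=0.25, floor=0.02, nperseg=512):
    """Single-band spectral subtraction: subtract an average noise
    magnitude spectrum (estimated from the first `noise_dur` seconds,
    assumed noise-only) from every frame, then resynthesize."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)
    hop = nperseg // 2                       # stft default: 50% overlap
    n_noise = max(1, int(noise_dur * fs / hop))
    noise_mag = mag[:, :n_noise].mean(axis=1, keepdims=True)
    clean = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    _, y = istft(clean * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```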
3) Voice Activity Detection and Dynamic Level Control
To separate the voiced frames from the unvoiced frames, a simple short-term energy-based voice activity detection (VAD) algorithm is used. The voiced frames are then passed through a Dynamic Level Controller (DLC). It is made up of an expander and a compressor, with the expander boosting low signal levels and the compressor lowering peak levels [35].
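A minimal sketch of such an energy-based VAD is given below; the frame length and the relative threshold are illustrative choices, as the paper does not report its VAD parameters, and dynamic level control is omitted here.

```python
import numpy as np

def energy_vad(x, fs, frame_ms=25, threshold_ratio=0.05):
    """Mark frames as voiced when their short-term energy exceeds a
    fixed fraction of the maximum frame energy (threshold is assumed)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 2  # 50% overlap, as used for feature extraction
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    energy = np.array([
        np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    return energy > threshold_ratio * energy.max()
```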
C. Features Extraction
In this section, the details of the audio feature extraction techniques used in the investigation are presented. Numerous audio features are extracted at the frame and sample levels in the frequency, structural, statistical, and temporal domains. The complete recording of a single user in one category comprises one sample, while a frame is a subset of the audio data in a sample. Assuming there are n frames in each sample, the frame-level features are described below. The features are named f followed by a serial number, from f1 to f5701.
Spectral Features — The speech signal is non-stationary, but its properties remain approximately constant over short intervals of 10–30 ms. The short-time spectral features are obtained by converting the time-domain signal into the frequency domain using different transform techniques. These features capture spectral information, which plays an important role in speech recognition [36]. In this work, the Hamming window is chosen as it produces less spectral leakage, its side lobes being lower than those of other windows [37]. A window of 25 ms duration with 50% overlap between successive frames is used. The spectral features extracted are: Linear Spectrum (n×512), Mel Spectrum (n×32), Bark Spectrum (n×32), and Equivalent Rectangular Bandwidth (ERB) Spectrum (n×44). Therefore, the total dimension of the spectral features is (n×620).
Cepstral Features — The cepstral features help in extracting relevant speech information for speech emotion recognition tasks by using filter banks based on human speech perception [13]. The cepstral features are Mel-frequency cepstral coefficients (MFCC), MFCC Delta, MFCC Delta-Delta, Gammatone cepstral coefficients (GTCC), GTCC Delta, and GTCC Delta-Delta, each of dimension (n×13). Therefore, the total dimension of the cepstral features is (n×78).
Spectral Descriptors — These features summarize the high-dimensional spectral features statistically. They are widely used in speaker, music, and mood recognition and classification tasks [38]. The spectral descriptors used are: Centroid, Crest, Decrease, Entropy, Flatness, Flux, Kurtosis, Roll-off Point, Skewness, Slope, and Spread, each of dimension (n×1). The total dimension of the spectral descriptors is (n×11).
Periodicity Features — These features provide important time-domain information about speech, which helps in monaural speech analysis [39]. The features used are: Pitch (n×1) and Harmonic Ratio (n×1).
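Although the extraction in this work is done in MATLAB (as noted next), a small Python analogue via librosa conveys the setup; only a few of the 712 per-frame features are sketched, and the Mel band count, pitch range, and pitch estimator shown are assumptions chosen to mirror the dimensions quoted above.

```python
import numpy as np
import librosa

def frame_level_features(x, sr, frame_ms=25):
    """Sketch of frame-level extraction: 32-band Mel spectrum, 13 MFCCs,
    spectral centroid, and pitch, with a 25 ms Hamming window and 50% overlap."""
    n_fft = int(sr * frame_ms / 1000)
    hop = n_fft // 2  # 50% overlap between successive frames
    mel = librosa.feature.melspectrogram(
        y=x, sr=sr, n_fft=n_fft, hop_length=hop, window="hamming", n_mels=32)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(
        y=x, sr=sr, n_fft=n_fft, hop_length=hop, window="hamming")
    pitch = librosa.yin(x, fmin=60, fmax=400, sr=sr,  # pitch range is assumed
                        frame_length=n_fft, hop_length=hop)
    return np.vstack([mel, mfcc, centroid, pitch[None, :]]).T  # (n_frames, d)
```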
For this purpose, the MATLAB-based audioFeatureExtractor is used [40], [41]. The fusion of spectral features, cepstral features, spectral descriptors, and periodicity features yields an (n×712)-dimensional feature matrix for each speech sample. As the number of frames varies from sample to sample, training a machine learning model directly on frame-level features is difficult. Therefore, in this work, statistical measures are computed at the sample level, providing a fixed-length feature vector for each sample. To capture the statistical distribution at the sample level, several statistical features are computed from the frame-level features [10]. The sample-level features are: mean (f1:f712), median (f713:f1424), root-mean-square (RMS) (f1425:f2136), maximum (f2137:f2848), minimum (f2849:f3560), quartiles (1st and 3rd quartile, interquartile range) (f3561:f3563), standard deviation (SD) (f3564:f4275), skewness (f4276:f4987), and kurtosis (f4988:f5699) of all frame-level features. In addition, the zero crossing rate (ZCR) (f5700) and short-time energy (STE) (f5701) are calculated per sample. Each combined feature vector is the concatenation of these sample-level features, giving a 5701-dimensional feature vector. Outliers in the high-dimensional feature vector can impair the learning algorithm's performance; feature scaling is therefore an important preprocessing step. The robust scaler removes the median and scales the data according to the quantile range, suppressing the effect of outliers in the features [42].
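The sample-level pooling can be sketched as follows. The exact quartile handling of f3561:f3563 is not fully specified in the text, so only the per-dimension functionals are shown, with scikit-learn's RobustScaler standing in for the scaling step.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import RobustScaler

def pool_sample_features(F):
    """Collapse an (n_frames x 712) frame-level matrix into a fixed-length
    sample-level vector of statistical functionals."""
    stats = [
        F.mean(axis=0),
        np.median(F, axis=0),
        np.sqrt((F ** 2).mean(axis=0)),   # RMS
        F.max(axis=0),
        F.min(axis=0),
        F.std(axis=0),                    # SD
        skew(F, axis=0),
        kurtosis(F, axis=0),
    ]
    return np.concatenate(stats)

# Example use (all_samples is a hypothetical list of frame-level matrices):
# X = np.vstack([pool_sample_features(F) for F in all_samples])
# X_scaled = RobustScaler().fit_transform(X)   # median/IQR scaling [42]
```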
D. LightGBM (LGM)
The LGM is an effective gradient boosting decision tree with gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) to increase computational efficiency without affecting accuracy [22]. The steps involved in LGM modeling are: (i) defining the loss function, (ii) performing GOSS sampling and identifying the optimal segmentation point using a histogram-based algorithm, (iii) reducing the feature dimension by the EFB method, (iv) applying the leaf-wise algorithm to combine the samples to fit residuals, and (v) splitting the nodes based on the objective function and generating a decision tree.
Let us consider X as the input feature vector and Y as the class labels. The aim of LGM is to determine the approximation function that minimizes the expected loss
\begin{align*}
\widehat{F}(x)=\underset{F}{\arg\min}\;E_{x,y}\left[L(y,F(x))\right]. \tag{1}
\end{align*}
The model is built additively as an ensemble of M base learners,
\begin{align*}
F_{M}(X)=\sum\limits_{m=1}^{M}F_{m}(X). \tag{2}
\end{align*}
At iteration m, the loss is expanded around the current model $F_{m-1}$ using a second-order Taylor approximation, where $g_{i}$ and $h_{i}$ denote the first- and second-order gradients of the loss with respect to the prediction for sample i:
\begin{align*}
\tau_{m}&=\sum\limits_{i=1}^{n}L\left(y_{i},\;F_{m-1}(x_{i})+F_{m}(x_{i})\right)\\
&\cong\sum\limits_{i=1}^{n}\left(g_{i}F_{m}(x_{i})+\frac{1}{2}h_{i}F_{m}^{2}(x_{i})\right). \tag{3}
\end{align*}
Rewriting the sum over the J leaves of the tree, with $I_{j}$ the set of samples in leaf j, $w_{j}$ the leaf weight, and $\lambda$ a regularization parameter,
\begin{align*}
\tau_{m}=\sum\limits_{j=1}^{J}\left(\left(\sum\limits_{i\in I_{j}}g_{i}\right)w_{j}+\frac{1}{2}\left(\sum\limits_{i\in I_{j}}h_{i}+\lambda\right)w_{j}^{2}\right). \tag{4}
\end{align*}
Minimizing (4) with respect to $w_{j}$ gives the optimal weight of each leaf,
\begin{align*}
w_{j}^{\ast}=-\frac{\sum\nolimits_{i\in I_{j}}g_{i}}{\sum\nolimits_{i\in I_{j}}h_{i}+\lambda}. \tag{5}
\end{align*}
Finally, the gain of splitting a node with sample set I into left and right children $I_{L}$ and $I_{R}$ is
\begin{align*}
G=\frac{1}{2}\left(\frac{\left(\sum\nolimits_{i\in I_{L}}g_{i}\right)^{2}}{\sum\nolimits_{i\in I_{L}}h_{i}+\lambda}+\frac{\left(\sum\nolimits_{i\in I_{R}}g_{i}\right)^{2}}{\sum\nolimits_{i\in I_{R}}h_{i}+\lambda}-\frac{\left(\sum\nolimits_{i\in I}g_{i}\right)^{2}}{\sum\nolimits_{i\in I}h_{i}+\lambda}\right), \tag{6}
\end{align*}
and the split with the largest gain is selected.
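In code, the closed-form solutions (5) and (6) reduce to a few lines. The sketch below is our own illustration of these formulas, not LightGBM source code; it shows how a node's optimal weight and split gain follow directly from the summed gradient statistics.

```python
def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf weight w* from Eq. (5)."""
    return -g_sum / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam):
    """Split gain G from Eq. (6): loss reduction of splitting a node
    (with gradient sums g and Hessian sums h) into left/right children."""
    def score(g, h):
        return g ** 2 / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
```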
Results and Discussions
The performance of the proposed model is assessed for two tasks: (I) a binary classification task to predict whether a speech sample is COVID-19 positive or negative, and (II) a multiclass classification task to distinguish COVID-19 positive, Asthma positive, and healthy speech samples. To perform this, the speech samples are first passed through the preprocessing blocks, namely low-pass filtering, speech enhancement, voice activity detection, and dynamic level control; here, the preprocessing block is treated as part of feature extraction. A total of 5701 features are then extracted from each sample. These features are fed to the LGM classifier and to three baseline classifiers, Random Forest (RF) [9], SVM [10], [11], and K-Nearest Neighbor (KNN) [44], for the speech classification task. For the development of the classification models, a five-fold stratified cross-validation scheme is employed. Standard performance measures as reported in [45], namely Classification Accuracy (CA), F-2 Score (F-2), Precision (PR), Recall (RC), and area under the curve (AUC), are employed in this study. The details of the performance measures are described in supplementary information S2. Grid search is used to find the optimal parameters of the classifiers; these parameters are listed in supplementary information S3.
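A minimal sketch of this evaluation setup in Python is shown below. The hyperparameters are left at library defaults for brevity (the paper tunes them by grid search, listed in S3), and the input file names are placeholders.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder inputs: X is the (n_samples x 5701) feature matrix, y the labels.
X = np.load("features.npy")
y = np.load("labels.npy")

model = make_pipeline(
    RobustScaler(),                  # median/IQR scaling of the features
    LGBMClassifier(random_state=0),  # defaults; the paper grid-searches these
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over 5 folds: {scores.mean():.3f}")
```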
A. Performance Evaluation as a Binary Classification Task
The comparative study of the performance of the LGM, SVM, RF, and KNN classifiers for the binary classification task is presented in Tables II and III. The LGM classifier provides an average accuracy of 0.978, an F-2 Score of 0.979, and an AUC of 0.976 across all the categories in the five datasets. The average accuracy, F-2 Score, and AUC of the SVM classifier are 0.749, 0.717, and 0.712, respectively. Similarly, for the RF classifier, the average accuracy, F-2 Score, and AUC are found to be 0.967, 0.966, and 0.963, respectively. For the KNN classifier, the corresponding values are 0.753, 0.745, and 0.728. The results show that the LGM classifier performs better on the high-dimensional features than the SVM, RF, and KNN classifiers.
B. Performance Evaluation as a Three-Class Classification Task
To further evaluate the prediction ability of the classifiers, an assessment on multi-class data has been carried out for Dataset-2, which contains samples of COVID-19 positive, Asthma positive, and healthy subjects in the cough and breathing sound categories. The results are listed in Table IV. It is observed that the performance of the LGM classifier is superior in all performance measures compared to the SVM, RF, and KNN classifiers. ROC curves are two-dimensional plots that depict the relative trade-off between the true-positive and false-positive rates [45]. The ROC curves of Dataset-2 in the cough category are shown in Fig. 6 (binary) and Fig. 7 (multi-class). According to the ROC curves, the proposed approach has a high true-positive rate and a low false-positive rate. The AUC of the proposed model is 0.99, which is better than those of the RF, SVM, and KNN models. The proposed features with the additional preprocessing provide better results compared to standard features and classifiers.
Fig. 6. Comparison of ROC curves of different classifiers for binary classification in the cough category of Dataset-2.
Fig. 7. Comparison of ROC curves of different classifiers for multiclass classification in the cough category of Dataset-2.
C. Comparison With Baseline Models and Combined Datasets
A comparative analysis of the proposed model against the existing methods used on the five datasets is shown in Table V, with the improvement in detection performance noted in the last column. The proposed model shows consistent performance across all the datasets as well as on the combined dataset, with minimum improvements in CD performance of approximately 30%, 15%, 25%, 9%, and 20% for Datasets 1–5, respectively. To assess the generalization ability of the proposed model, a combined dataset is prepared from the cough-category speech signals of Datasets 1, 2, 3, and 5, containing a total of 1528 samples from the healthy category and 1344 samples from the COVID-19 positive category. The performance of all four methods is evaluated and the results are listed in Table VI; the proposed model shows the highest accuracy of 0.983 over the other three standard models.
Overall, the minimum CD performance of the proposed method is approximately 97% across all sound categories, databases, and cross-validation schemes.
D. Statistical Analysis of Classifier Models
The statistical comparison of the performance of the LGM model with the standard machine learning models SVM, RF, and KNN over the five datasets is listed in Table VII. For this purpose, the t-statistic between two classifiers is computed as in (7).
\begin{align*}
t=\frac{c_{1}-c_{2}}{\sqrt{v_{1}^{2}+v_{2}^{2}}} \tag{7}
\end{align*}
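The symbols in (7) are not defined explicitly in the text; reading c as the mean cross-validation score of a classifier and v as the corresponding standard error, the statistic can be computed as in the following sketch.

```python
import numpy as np

def t_statistic(scores_a, scores_b):
    """t-value of Eq. (7) for two classifiers' fold-wise scores, taking
    c as the mean score and v as the standard error (our interpretation)."""
    c1, c2 = np.mean(scores_a), np.mean(scores_b)
    v1 = np.std(scores_a, ddof=1) / np.sqrt(len(scores_a))
    v2 = np.std(scores_b, ddof=1) / np.sqrt(len(scores_b))
    return (c1 - c2) / np.sqrt(v1 ** 2 + v2 ** 2)
```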
Conclusion
In the current study, a non-invasive and effective respiratory disease detection scheme is developed and tested for COVID-19 and Asthma. The major contributions of the investigation are the use of improved preprocessing techniques and an effective combination of spectral, cepstral, and periodicity features, along with the implementation of a gradient boosting machine, for robust and consistent performance across multiple datasets. The proposed model can be used for early and fast automatic diagnosis of COVID-19 without the subject visiting a hospital and without the assistance of a medical professional. However, it is suggested that the detection produced by the proposed intelligent model be verified by a medical professional before a prescription is initiated. It may be noted that the proposed detection scheme involves considerable computation and training time; there is still room to reduce the method's computational complexity for faster implementations. The effective preprocessing techniques, as well as the combination of audio features, can be further implemented and tested for other speech-based recognition tasks, including emotion recognition, Parkinson's disease detection, and heart disease detection.
ACKNOWLEDGMENT
The authors express their gratitude to Professor Cecilia Mascolo, Department of Computer Science and Technology, and the Chancellor, Masters, and Scholars of the University of Cambridge for sharing the COVID-19 speech database [10].