
Monaural Speech Separation Based on Computational Auditory Scene Analysis and Objective Quality Assessment of Speech



Abstract:

Monaural speech separation is a very challenging problem in speech signal processing. It has been studied extensively, and many separation systems based on computational auditory scene analysis (CASA) have been proposed in the last two decades. Although research on CASA has tended to introduce high-level knowledge into separation processes built on primitive data-driven methods, knowledge about speech quality has not yet been incorporated. As a result, the performance evaluation of CASA has focused mainly on signal-to-noise ratio (SNR) improvement, even though the quality of separated speech is not directly related to its SNR. To address this problem, we propose a new method that combines CASA with objective quality assessment of speech (OQAS). In the grouping process of CASA, OQAS serves as a guide that instructs the CASA system. With this combination, speech separation performance can be improved not only in SNR but also in mean opinion score (MOS). Our system is systematically evaluated and compared with previous systems, and it yields substantially better performance, especially in the subjective perceptual quality of the separated speech.
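To illustrate why SNR improvement alone is a weak proxy for perceptual quality, the following sketch (a hypothetical helper, not part of the paper's system) computes the SNR of a "separated" signal against a clean reference. A slight phase offset that is nearly inaudible for a steady tone is scored as substantial noise by the SNR measure:

```python
import numpy as np

def snr_db(clean, separated):
    """SNR in dB of a separated signal against the clean reference
    (illustrative helper; not from the paper)."""
    noise = separated - clean
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(noise**2))

# A 200-Hz tone sampled at 8 kHz for one second.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 200 * t)
# A small phase offset is close to inaudible for a steady tone,
# yet the sample-wise difference still counts fully as "noise".
shifted = np.sin(2 * np.pi * 200 * t + 0.1)
print(f"{snr_db(clean, shifted):.1f} dB")  # roughly 20 dB
```

The point of the sketch is only that waveform-level SNR penalizes perceptually benign deviations, which is the gap the OQAS-guided grouping is meant to close.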
Published in: IEEE Transactions on Audio, Speech, and Language Processing ( Volume: 14, Issue: 6, November 2006)
Page(s): 2014 - 2023
Date of Publication: 30 November 2006


I. Introduction

In the natural world, a speech signal is frequently accompanied by other sound sources when it reaches the auditory system, yet listeners are capable of holding conversations in a wide range of conditions. This phenomenon is well known as the "cocktail party" effect [1]. It would be valuable to give a computer the human ability to segregate a target source from other interfering sources. An effective separation system can greatly facilitate many applications, including automatic speech recognition (ASR), speaker identification, audio retrieval, and digital content management. Research on speech separation has therefore gradually attracted researchers' attention and has become an increasingly popular topic in the field of signal processing.

References
[1] C. Cherry, "Some experiments in the recognition of speech with one and two ears," J. Acoust. Soc. Amer., vol. 25, pp. 975-981, 1953.
[2] A. K. Barros, T. Rutkowski, F. Itakura, and N. Ohnishi, "Estimation of speech embedded in a reverberant and noisy environment by independent component analysis and wavelets," IEEE Trans. Neural Netw., vol. 13, no. 4, pp. 888-893, Jul. 2002.
[3] H. Krim and M. Viberg, "Two decades of array signal processing research: The parametric approach," IEEE Signal Process. Mag., vol. 13, no. 4, pp. 67-94, Jul. 1996.
[4] A. S. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.
[5] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis," Comput. Speech Lang., vol. 8, pp. 297-336, 1994.
[6] M. P. Cooke, Modeling Auditory Processing and Organization. Cambridge, U.K.: Cambridge Univ. Press, 1993.
[7] D. P. W. Ellis, Prediction-Driven Computational Auditory Scene Analysis, 1996.
[8] D. F. Rosenthal and H. G. Okuno, Computational Auditory Scene Analysis. Mahwah, NJ: Lawrence Erlbaum, 1998.
[9] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 684-697, May 1999.
[10] M. Weintraub, A Theory and Computational Model of Auditory Monaural Sound Separation, 1985.
[11] G. N. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135-1150, Sep. 2004.
[12] N. Roman, D. L. Wang, and G. J. Brown, "Speech segregation based on sound localization," J. Acoust. Soc. Amer., vol. 114, pp. 2236-2252, 2003.
[13] D. Godsmark and G. J. Brown, "A blackboard architecture for computational auditory scene analysis," Speech Commun., vol. 27, pp. 351-366, 1999.
[14] Subjective Performance Assessment of Telephone-Band and Wideband Digital Codecs, 1996.
[15] P. Gray, M. P. Hollier, and R. E. Massara, "Nonintrusive speech-quality assessment using vocal-tract models," Proc. Inst. Elect. Eng., Vision Image Signal Process., vol. 147, no. 6, pp. 493-501, Dec. 2000.
[16] C. Jin and R. Kubichek, "Vector quantization techniques for output-based objective speech quality," in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 1, May 1996, pp. 491-494.
[17] D. S. Kim, "ANIQUE: An auditory model for single-ended speech quality estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 13, no. 5, pp. 821-831, Sep. 2005.
[18] Single-Ended Method for Objective Speech Quality Assessment in Narrow-Band Telephony Applications, 2004.
[19] NiQA product description, 2003. [Online].
[20] NiNA: SwissQual's Non-Intrusive Algorithm for Estimating the Subjective Quality of Live Speech, 2001. [Online].
[21] B. C. J. Moore, An Introduction to the Psychology of Hearing. San Diego, CA: Academic, 1997.
[22] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines. Norwell, MA: Kluwer, 2005, pp. 181-197.
[23] L. A. Drake, Sound Source Separation via Computational Auditory Scene Analysis (CASA)-Enhanced Beamforming, 2001.
[24] D. F. Rosenthal and H. G. Okuno, Computational Auditory Scene Analysis. Mahwah, NJ: Lawrence Erlbaum, 1998.
[25] Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, 2001.
[26] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, Feb. 1979.
[27] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504-512, Jul. 2001.
[28] P. Cariani, "Temporal coding of periodicity pitch in the auditory system: An overview," Neural Plasticity, vol. 6, pp. 147-172, 1999.
[29] R. Meddis and L. O'Mard, "A unitary model of pitch perception," J. Acoust. Soc. Amer., vol. 102, pp. 1811-1820, 1997.
[30] M. Slaney and R. F. Lyon, "On the importance of time: A temporal representation of sound," in Visual Representations of Speech Signals. New York: Wiley, 1993, pp. 95-116.
