
Speech Dereverberation Using Fully Convolutional Networks



Abstract:

Speech dereverberation using a single microphone is addressed in this paper. Motivated by the recent success of fully convolutional networks (FCNs) in many image processing applications, we investigate their applicability to enhancing speech signals represented as short-time Fourier transform (STFT) images. We present two variations: a “U-Net”, which is an encoder-decoder network with skip connections, and a generative adversarial network (GAN) with the U-Net as its generator, which yields a more intuitive cost function for training. To evaluate our method, we used data from the REVERB challenge and compared our results to those of other methods under the same conditions. We found that our method outperforms the competing methods in most cases.
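
The excerpt above does not spell out the network configuration, so the following PyTorch sketch is only a minimal illustration of the encoder-decoder-with-skip-connections idea behind the U-Net generator; the channel widths, depth, and kernel sizes are assumptions, and the GAN variant described in the abstract would additionally train a discriminator against such a generator.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level U-Net-style encoder-decoder for STFT 'images' (illustrative)."""
    def __init__(self, ch=1):
        super().__init__()
        self.enc1 = nn.Sequential(
            nn.Conv2d(ch, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(
            nn.Conv2d(16, 32, 4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.ReLU())
        # 32 input channels: decoder features concatenated with the
        # skip connection from enc1.
        self.dec1 = nn.ConvTranspose2d(32, ch, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                   # (B, 16, F/2, T/2)
        e2 = self.enc2(e1)                  # (B, 32, F/4, T/4)
        d2 = self.dec2(e2)                  # (B, 16, F/2, T/2)
        return self.dec1(torch.cat([d2, e1], dim=1))  # back to input shape

net = TinyUNet()
spec = torch.randn(1, 1, 256, 256)  # dummy log-magnitude STFT patch
enhanced = net(spec)                # same shape as the input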
Date of Conference: 03-07 September 2018
Date Added to IEEE Xplore: 02 December 2018
Conference Location: Rome, Italy

I. Introduction

Reverberation, resulting from multiple reflections off the room's facets and objects, degrades the speech quality and, in severe cases, the speech intelligibility, especially for hearing-impaired people. The success rate of automatic speech recognition (ASR) systems may also deteriorate significantly in reverberant conditions, especially when there is a mismatch between the training and test phases. Reverberation is the result of convolving an anechoic speech utterance with the long impulse response of the acoustic path. The output signal suffers from overlap- and self-masking effects that may deteriorate the speech quality [1]. These are often manifested as “blurring” effects on the short-time Fourier transform (STFT) images. A plethora of methods for speech dereverberation using both single- and multi-microphone setups exists [2].
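
As a toy illustration of this convolutive model (not code from the paper), the Python sketch below convolves a signal with a synthetic exponentially decaying noise RIR, in the spirit of Polack's statistical model [7], and computes the STFT of both signals; the temporal smearing of the reverberant spectrogram is the “blurring” referred to above. All names and parameter values are illustrative assumptions.

import numpy as np
from scipy.signal import fftconvolve, stft

fs = 16000
rng = np.random.default_rng(0)
# Placeholder for an anechoic utterance; in practice, load real speech here.
speech = rng.standard_normal(2 * fs)

# Synthetic RIR: white noise with an exponential decay set by the
# reverberation time T60 (amplitude reaches -60 dB at t = T60).
t60 = 0.6
t = np.arange(int(t60 * fs)) / fs
rir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / t60)

# Reverberant speech = anechoic speech convolved with the RIR.
reverberant = fftconvolve(speech, rir)[: speech.size]

# Compare spectrograms: the reverberant |STFT| is smeared along time.
_, _, S_clean = stft(speech, fs=fs, nperseg=512)
_, _, S_rev = stft(reverberant, fs=fs, nperseg=512)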

References
1. A. K. Nábělek, T. R. Letowski and F. M. Tucker, "Reverberant overlap- and self-masking in consonant identification", The Journal of the Acoustical Society of America, vol. 86, no. 4, pp. 1259-1265, 1989.
2. P. A. Naylor and N. D. Gaubitch, Speech Dereverberation, Springer Science & Business Media, 2010.
3. K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, et al., "A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research", EURASIP Journal on Advances in Signal Processing, vol. 7, pp. 1-19, Oct. 2016.
4. M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, et al., "Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge", Proc. REVERB Challenge Workshop, vol. 1, pp. 1-8, 2014.
5. B. Schwartz, S. Gannot and E. A. P. Habets, "Online speech dereverberation using Kalman filter and EM algorithm", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 394-406, Feb. 2015.
6. H. Kallasjoki, J. F. Gemmeke, K. J. Palomäki, A. V. Beeston and G. J. Brown, "Recognition of reverberant speech by missing data imputation and NMF feature enhancement", Proc. REVERB Challenge Workshop, 2014.
7. J.-D. Polack, "Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics", Applied Acoustics, vol. 38, no. 2-4, pp. 235-244, 1993.
8. E. Habets, S. Gannot and I. Cohen, "Late reverberant spectral variance estimation based on a statistical model", IEEE Signal Processing Letters, vol. 16, no. 9, pp. 770-773, Sep. 2009.
9. B. Cauchi, I. Kodrasi, R. Rehr, S. Gerlach, A. Jukic, T. Gerkmann, et al., "Joint dereverberation and noise reduction using beamforming and a single-channel speech enhancement scheme", Proc. REVERB Challenge Workshop, 2014.
10. D. R. González, S. C. Arias and J. R. Calvo, "Single channel speech enhancement based on zero phase transformation in reverberated environments", Proc. REVERB Challenge Workshop, 2014.
11. S. Wisdom, T. Powers, L. Atlas and J. Pitton, "Enhancement and recognition of reverberant and noisy speech by extending its coherence", Proc. REVERB Challenge Workshop, 2014.
12. X. Xiao, S. Zhao, D. Hoang, H. Nguyen, X. Zhong, D. L. Jones, et al., "The NTU-ADSC systems for reverberation challenge 2014", Proc. REVERB Challenge Workshop, 2014.
13. K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982-992, Jun. 2015.
14. D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492-1501, Jul. 2017.
15. F. Weninger, S. Watanabe, Y. Tachioka and B. Schuller, "Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4623-4627, May 2014.
16. D. S. Wang, Y. X. Zou and W. Shi, "A deep convolutional encoder-decoder model for robust speech dereverberation", 22nd International Conference on Digital Signal Processing (DSP), pp. 1-5, 2017.
17. E. Shelhamer, J. Long and T. Darrell, "Fully convolutional networks for semantic segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640-651, Apr. 2017.
18. Y. Wang, A. Narayanan and D. Wang, "On training targets for supervised speech separation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, Dec. 2014.
19. D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification", INTERSPEECH, 2017.
20. O. Ronneberger, P. Fischer and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation", Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241, 2015.
21. P. Isola, J.-Y. Zhu, T. Zhou and A. A. Efros, "Image-to-image translation with conditional adversarial networks", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967-5976, 2017.
22. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., "Generative adversarial nets", Advances in Neural Information Processing Systems (NIPS), pp. 2672-2680, 2014.
23. T. Robinson, J. Fransen, D. Pye, J. Foote and S. Renals, "WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition", International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81-84, May 1995.
24. M. Lincoln, I. McCowan, J. Vepa and H. K. Maganti, "The multichannel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments", IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 357-362, Nov. 2005.