
Automatic transcription of conversational telephone speech


Abstract:

This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation side based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10 × RT conversational speech transcription system.
Published in: IEEE Transactions on Speech and Audio Processing ( Volume: 13, Issue: 6, November 2005)
Page(s): 1173 - 1185
Date of Publication: 30 November 2005

I. Introduction

The transcription of conversational telephone speech is one of the most challenging tasks for speech recognition technology. State-of-the-art systems still yield high word error rates, typically in the range of 20%–30%. Work on this task has been aided by extensive data collection, namely the Switchboard-1 corpus [10]. Originally designed as a resource to train and evaluate speaker identification systems, the corpus now serves as the primary source of data for work on automatic transcription of conversational telephone speech in English.
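
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the scoring procedure used in the evaluation: the function name `word_error_rate` is hypothetical, words are assumed to be separated by whitespace, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: both transcripts are plain whitespace-separated word strings.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("so how are you", "so who are you"))  # 0.25
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.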
