
A Sequence-to-sequence Based Error Correction Model for Medical Automatic Speech Recognition



Abstract:

The use of Automatic Speech Recognition (ASR) systems in medical applications is receiving rapidly growing interest due to their ability to reduce distractions and the cognitive workload of physicians, particularly during critical medical procedures. However, state-of-the-art ASR systems still experience recognition errors, especially in noisy environments where speakers rely on medical-domain terminology. This paper proposes a customized language model and a neural-network-based sequence-to-sequence (seq2seq) error correction module for medical ASR systems to provide domain adaptation and more reliable transcription results. Specifically, the error correction module learns the error patterns that arise in noisy scenarios and is able to correct such errors during inference. Our experiments show that the proposed method can reduce the sentence error rate (SER) by up to 81% for formatted input and by up to 31% for unformatted input in noisy environments.
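The excerpt above does not include implementation details, so the following is only a minimal PyTorch sketch of the kind of attention-based seq2seq correction module the abstract describes: an encoder reads a (possibly erroneous) ASR hypothesis, a decoder with Luong-style dot-product attention [17] emits the corrected transcription, and training on (hypothesis, reference) pairs lets the model absorb the ASR system's recurring error patterns. The class name, vocabulary size, hidden sizes, and token ids are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a seq2seq ASR error-correction module (not the paper's code).
import torch
import torch.nn as nn

PAD = 0                          # assumed padding-token id
VOCAB, EMB, HID = 1000, 64, 128  # illustrative vocabulary/embedding/hidden sizes


class Seq2SeqCorrector(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB, padding_idx=PAD)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(2 * HID, VOCAB)  # [decoder state; attention context]

    def forward(self, src, tgt_in):
        # src: (B, S) noisy hypothesis ids; tgt_in: (B, T) shifted reference ids.
        enc_out, h = self.encoder(self.emb(src))     # enc_out: (B, S, HID)
        dec_out, _ = self.decoder(self.emb(tgt_in), h)  # dec_out: (B, T, HID)
        # Luong-style dot-product attention over the encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))      # (B, T, S)
        ctx = torch.bmm(torch.softmax(scores, dim=-1), enc_out)   # (B, T, HID)
        return self.out(torch.cat([dec_out, ctx], dim=-1))        # (B, T, VOCAB)


# One teacher-forced training step on a stand-in batch of
# (ASR hypothesis, reference transcript) token-id pairs.
model = Seq2SeqCorrector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam optimizer, cf. [20]
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

hyp = torch.randint(3, VOCAB, (8, 12))   # stand-in hypothesis ids
ref = torch.randint(3, VOCAB, (8, 12))   # stand-in reference ids
logits = model(hyp, ref[:, :-1])         # predict token t from tokens < t
loss = loss_fn(logits.reshape(-1, VOCAB), ref[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

At inference time one would replace teacher forcing with greedy or beam-search decoding from a start token; the paper's actual module may differ in depth, attention variant, and tokenization.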
Date of Conference: 09-12 December 2021
Date Added to IEEE Xplore: 14 January 2022
Conference Location: Houston, TX, USA

I. Introduction

Automatic Speech Recognition (ASR) systems often significantly increase the efficiency and convenience of human-computer interaction across a variety of applications and domains. In recent years, the medical and healthcare field has been one such domain, receiving increasing attention from the ASR community. For example, in hospitals, manual interactions with computing systems (e.g., recording physician notes, retrieving patient data, and searching for medical information) can be distracting, tedious, and time-consuming. A speech-based system could even be used during surgeries or emergency interventions, where it could alert physicians of recommended steps more quickly or help prevent deviations from standard treatment protocols and workflows.

However, several challenges need to be addressed before using an ASR system in the medical field becomes feasible. First, the system has to work reliably in environments with different levels and types of noise, such as multiple speakers or sounds generated by medical equipment. Second, it is difficult to identify a universal dataset for training and evaluating medical speech recognition tasks. Third, medical terminology can be much more complex than everyday expressions: medical terms are often longer than most other dictionary words, are combined in unusual ways, and are more difficult to pronounce, while different terms (such as names of procedures, diseases, and medications) often share very similar pronunciations. Therefore, a more advanced approach to speech recognition is needed.

Since this problem is so new, there have been only a few prior efforts to investigate the design of an ASR system for medical purposes [1]. One approach is to collect medical speech data and build a speech corpus that can be used to train a system from scratch. Edwards et al. [2] present a speech recognition system trained on 270 hours of medical speech data and 30 million tokens of text from clinical episodes, resulting in a word error rate (WER) below 16% on realistic clinical cases. Chiu et al. [3] trained two models, a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen, Attend and Spell (LAS) grapheme-based model, on 14,000 hours of medical conversations, yielding WERs of 20.1% and 18.3%, respectively. Another option is to adapt an existing ASR system to the medical domain. Liu et al. [4] evaluate two well-known ASR systems, Nuance Dragon and SRI Decipher, on spoken clinical questions, and adapt the SRI system to the medical domain using a language model, achieving a WER of 26.7%. Salloum et al. [5] propose a "crowdsourced transcription process" to continuously refine ASR language models. Mani et al. [6] perform medical domain adaptation of Google ASR and ASPIRE via machine translation, learning a mapping from out-of-domain errors to in-domain medical terms and yielding a WER of 7%.
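Since the abstract reports sentence error rate (SER) while the systems surveyed above are compared by word error rate (WER), a hedged illustration of the two metrics may help: WER pools word-level edit distances over a corpus and divides by the total number of reference words, whereas SER counts an utterance as erroneous if its transcript differs from the reference at all. The snippet below uses the standard definitions; the example sentences are invented for illustration.

```python
# Illustrative WER/SER computation (standard definitions; not code from the paper).

def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i                      # cost of deleting all reference words
    for j in range(len(hyp_words) + 1):
        d[0][j] = j                      # cost of inserting all hypothesis words
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]

def corpus_wer(refs, hyps):
    """Total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / words

def corpus_ser(refs, hyps):
    """Fraction of utterances with at least one error."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)

# Hypothetical medical utterances (invented for illustration).
refs = ["administer epinephrine one milligram", "check the blood pressure"]
hyps = ["administer epinephrine one milligram", "check the blood pleasure"]
print(f"WER: {corpus_wer(refs, hyps):.1%}")   # 1 substitution / 8 words = 12.5%
print(f"SER: {corpus_ser(refs, hyps):.1%}")   # 1 of 2 sentences wrong = 50.0%
```

A single substituted word leaves WER small but already counts the whole sentence against SER, which makes SER the stricter metric for the command-style medical interactions the paper targets.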

References

[1] S. Latif, J. Qadir, A. Qayyum, M. Usama and S. Younis, "Speech technology for healthcare: Opportunities, challenges, and state of the art", IEEE Reviews in Biomedical Engineering, vol. 14, pp. 342-356, 2021.
[2] E. Edwards, W. Salloum, G. P. Finley, J. Fone, G. Cardiff, M. Miller, et al., "Medical speech recognition: Reaching parity with humans", Speech and Computer, pp. 512-524, 2017.
[3] C.-C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, et al., "Speech recognition for medical conversations", Interspeech, pp. 2972-2976, 2018.
[4] F. Liu, G. Tur, D. Hakkani-Tur and H. Yu, "Towards spoken clinical-question answering: Evaluating and adapting automatic speech-recognition systems for spoken clinical questions", Journal of the American Medical Informatics Association (JAMIA), vol. 18, pp. 625-630, 2011.
[5] W. Salloum, E. Edwards, S. Ghaffarzadegan, D. Suendermann-Oeft and M. Miller, "Crowdsourced continuous improvement of medical speech recognition", AAAI Workshops, 2017.
[6] A. Mani, S. Palaskar, N. V. Meripo, S. Konam and F. Metze, "ASR error correction and domain adaptation using machine translation", Proc. IEEE ICASSP, pp. 6344-6348, 2020.
[7] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey and S. Khudanpur, "JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs", IEEE Workshop on ASRU, pp. 539-546, 2015.
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, et al., "The Kaldi speech recognition toolkit", IEEE Workshop on ASRU, 2011.
[9] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, et al., "Generating exact lattices in the WFST framework", Proc. IEEE ICASSP, pp. 4213-4216, 2012.
[10] Y. Wu, M. Schuster, Z. Chen, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation", arXiv preprint arXiv:1609.08144, 2016.
[11] J. Crego, J. Kim, G. Klein, et al., "SYSTRAN's pure neural machine translation systems", arXiv preprint arXiv:1610.05540, 2016.
[12] J. Zhou, Y. Cao, X. Wang, P. Li and W. Xu, "Deep recurrent models with fast-forward connections for neural machine translation", Transactions of the Association for Computational Linguistics, vol. 4, pp. 371-383, 2016.
[13] H. Wang, S. Dong, Y. Liu, J. Logan, A. K. Agrawal and Y. Liu, "ASR error correction with augmented transformer for entity retrieval", Interspeech, pp. 1550-1554, 2020.
[14] Y. Weng, S. S. Miryala, C. Khatri, R. Wang, H. Zheng, P. Molino, et al., "Joint contextual modeling for ASR correction and language understanding", Proc. IEEE ICASSP, pp. 6349-6353, 2020.
[15] L. Orosanu and D. Jouvet, "Adding new words into a language model using parameters of known words with similar behavior", Procedia Computer Science, vol. 128, pp. 18-24, 2018.
[16] D. Bahdanau, K. Cho and Y. Bengio, "Neural machine translation by jointly learning to align and translate", arXiv preprint arXiv:1409.0473, 2014.
[17] M.-T. Luong, H. Pham and C. D. Manning, "Effective approaches to attention-based neural machine translation", arXiv preprint arXiv:1508.04025, 2015.
[18] A. Pampari, P. Raghavan, J. Liang and J. Peng, "emrQA: A large corpus for question answering on electronic medical records", arXiv preprint arXiv:1809.00732, 2018.
[19] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework", Interspeech, pp. 1816-1820, 2019.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", Proc. ICLR, 2015.
