
A Sequence-to-sequence Based Error Correction Model for Medical Automatic Speech Recognition



Abstract:

The use of Automatic Speech Recognition (ASR) systems in medical applications is receiving rapidly growing interest due to their ability to reduce distractions and the cognitive workload of physicians, particularly during critical medical procedures. However, state-of-the-art ASR systems still experience recognition errors, especially in noisy environments where speakers rely on medical-domain terminology. This paper proposes a customized language model and a neural-network-based sequence-to-sequence (seq2seq) error correction module for medical ASR systems to provide domain adaptation and more reliable transcription results. Specifically, the error correction module learns the error patterns that arise in noisy scenarios and is able to correct such errors during inference. Our experiments show that the proposed method can reduce the sentence error rate (SER) by up to 81% for formatted input and by up to 31% for unformatted input in noisy environments.
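The excerpt above does not include implementation details, so the following is only a minimal PyTorch sketch of the kind of attention-based seq2seq correction module the abstract describes: an encoder reads a (possibly erroneous) ASR hypothesis, a decoder with Luong-style dot-product attention [17] emits the corrected transcription, and training on (hypothesis, reference) pairs lets the model absorb the ASR system's recurring error patterns. The class name, vocabulary size, hidden sizes, and token ids are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a seq2seq ASR error-correction module (not the paper's code).
import torch
import torch.nn as nn

PAD = 0                          # assumed padding-token id
VOCAB, EMB, HID = 1000, 64, 128  # illustrative vocabulary/embedding/hidden sizes


class Seq2SeqCorrector(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB, padding_idx=PAD)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(2 * HID, VOCAB)  # [decoder state; attention context]

    def forward(self, src, tgt_in):
        # src: (B, S) noisy hypothesis ids; tgt_in: (B, T) shifted reference ids.
        enc_out, h = self.encoder(self.emb(src))     # enc_out: (B, S, HID)
        dec_out, _ = self.decoder(self.emb(tgt_in), h)  # dec_out: (B, T, HID)
        # Luong-style dot-product attention over the encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))      # (B, T, S)
        ctx = torch.bmm(torch.softmax(scores, dim=-1), enc_out)   # (B, T, HID)
        return self.out(torch.cat([dec_out, ctx], dim=-1))        # (B, T, VOCAB)


# One teacher-forced training step on a stand-in batch of
# (ASR hypothesis, reference transcript) token-id pairs.
model = Seq2SeqCorrector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam optimizer, cf. [20]
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

hyp = torch.randint(3, VOCAB, (8, 12))   # stand-in hypothesis ids
ref = torch.randint(3, VOCAB, (8, 12))   # stand-in reference ids
logits = model(hyp, ref[:, :-1])         # predict token t from tokens < t
loss = loss_fn(logits.reshape(-1, VOCAB), ref[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

At inference time one would replace teacher forcing with greedy or beam-search decoding from a start token; the paper's actual module may differ in depth, attention variant, and tokenization.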
Date of Conference: 09-12 December 2021
Date Added to IEEE Xplore: 14 January 2022
Conference Location: Houston, TX, USA

I. Introduction

Automatic Speech Recognition (ASR) systems often significantly increase the efficiency and convenience of human-computer interaction across a variety of applications and domains. In recent years, the medical and healthcare field has been one such domain, receiving increasing attention from the ASR community. For example, in hospitals, manual interactions with computing systems (e.g., recording physician notes, retrieving patient data, and searching for medical information) can be distracting, tedious, and time-consuming. A speech-based system could even be used during surgeries or emergency interventions, where it could alert physicians of recommended steps more quickly or help prevent deviations from standard treatment protocols and workflows.

However, several challenges need to be addressed before using an ASR system in the medical field becomes feasible. First, the system has to work reliably in environments with different levels and types of noise, such as multiple speakers or sounds generated by medical equipment. Second, it is difficult to identify a universal dataset for training and evaluating medical speech recognition tasks. Third, medical terminology can be much more complex than everyday expressions: medical terms are often longer than most other dictionary words, are combined in unusual ways, and are more difficult to pronounce, while different terms (such as names of procedures, diseases, and medications) often share very similar pronunciations. Therefore, a more advanced approach to speech recognition is needed.

Since this problem is so new, there have been only a few prior efforts to investigate the design of an ASR system for medical purposes [1]. One approach is to collect medical speech data and build a speech corpus that can be used to train a system from scratch. Edwards et al. [2] present a speech recognition system trained on 270 hours of medical speech data and 30 million tokens of text from clinical episodes, resulting in a word error rate (WER) below 16% on realistic clinical cases. Chiu et al. [3] trained two models, a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen, Attend and Spell (LAS) grapheme-based model, on 14,000 hours of medical conversations, yielding WERs of 20.1% and 18.3%, respectively. Another option is to adapt an existing ASR system to the medical domain. Liu et al. [4] evaluate two well-known ASR systems, Nuance Dragon and SRI Decipher, on spoken clinical questions, and adapt the SRI system to the medical domain using a language model, achieving a WER of 26.7%. Salloum et al. [5] propose a "crowdsourced transcription process" to continuously refine ASR language models. Mani et al. [6] perform medical domain adaptation of Google ASR and ASPIRE via machine translation, learning a mapping from out-of-domain errors to in-domain medical terms and yielding a WER of 7%.
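Since the abstract reports sentence error rate (SER) while the systems surveyed above are compared by word error rate (WER), a hedged illustration of the two metrics may help: WER pools word-level edit distances over a corpus and divides by the total number of reference words, whereas SER counts an utterance as erroneous if its transcript differs from the reference at all. The snippet below uses the standard definitions; the example sentences are invented for illustration.

```python
# Illustrative WER/SER computation (standard definitions; not code from the paper).

def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i                      # cost of deleting all reference words
    for j in range(len(hyp_words) + 1):
        d[0][j] = j                      # cost of inserting all hypothesis words
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]

def corpus_wer(refs, hyps):
    """Total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / words

def corpus_ser(refs, hyps):
    """Fraction of utterances with at least one error."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)

# Hypothetical medical utterances (invented for illustration).
refs = ["administer epinephrine one milligram", "check the blood pressure"]
hyps = ["administer epinephrine one milligram", "check the blood pleasure"]
print(f"WER: {corpus_wer(refs, hyps):.1%}")   # 1 substitution / 8 words = 12.5%
print(f"SER: {corpus_ser(refs, hyps):.1%}")   # 1 of 2 sentences wrong = 50.0%
```

A single substituted word leaves WER small but already counts the whole sentence against SER, which makes SER the stricter metric for the command-style medical interactions the paper targets.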

References

[1] S. Latif, J. Qadir, A. Qayyum, M. Usama and S. Younis, "Speech technology for healthcare: Opportunities, challenges, and state of the art", IEEE Reviews in Biomedical Engineering, vol. 14, pp. 342-356, 2021.
[2] E. Edwards, W. Salloum, G. P. Finley, J. Fone, G. Cardiff, M. Miller, et al., "Medical speech recognition: Reaching parity with humans", Speech and Computer, pp. 512-524, 2017.
[3] C.-C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, et al., "Speech recognition for medical conversations", Interspeech, pp. 2972-2976, 2018.
[4] F. Liu, G. Tur, D. Hakkani-Tur and H. Yu, "Towards spoken clinical-question answering: Evaluating and adapting automatic speech-recognition systems for spoken clinical questions", Journal of the American Medical Informatics Association (JAMIA), vol. 18, pp. 625-630, 2011.
[5] W. Salloum, E. Edwards, S. Ghaffarzadegan, D. Suendermann-Oeft and M. Miller, "Crowdsourced continuous improvement of medical speech recognition", AAAI Workshops, 2017.
[6] A. Mani, S. Palaskar, N. V. Meripo, S. Konam and F. Metze, "ASR error correction and domain adaptation using machine translation", Proc. IEEE ICASSP, pp. 6344-6348, 2020.
[7] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey and S. Khudanpur, "JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs", IEEE Workshop on ASRU, pp. 539-546, 2015.
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, et al., "The Kaldi speech recognition toolkit", IEEE Workshop on ASRU, 2011.
[9] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, et al., "Generating exact lattices in the WFST framework", Proc. IEEE ICASSP, pp. 4213-4216, 2012.
[10] Y. Wu, M. Schuster, Z. Chen, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation", arXiv preprint arXiv:1609.08144, 2016.
[11] J. Crego, J. Kim, G. Klein, et al., "SYSTRAN's pure neural machine translation systems", arXiv preprint arXiv:1610.05540, 2016.
[12] J. Zhou, Y. Cao, X. Wang, P. Li and W. Xu, "Deep recurrent models with fast-forward connections for neural machine translation", Transactions of the Association for Computational Linguistics, vol. 4, pp. 371-383, 2016.
[13] H. Wang, S. Dong, Y. Liu, J. Logan, A. K. Agrawal and Y. Liu, "ASR error correction with augmented transformer for entity retrieval", Interspeech, pp. 1550-1554, 2020.
[14] Y. Weng, S. S. Miryala, C. Khatri, R. Wang, H. Zheng, P. Molino, et al., "Joint contextual modeling for ASR correction and language understanding", Proc. IEEE ICASSP, pp. 6349-6353, 2020.
[15] L. Orosanu and D. Jouvet, "Adding new words into a language model using parameters of known words with similar behavior", Procedia Computer Science, vol. 128, pp. 18-24, 2018.
[16] D. Bahdanau, K. Cho and Y. Bengio, "Neural machine translation by jointly learning to align and translate", arXiv preprint arXiv:1409.0473, 2014.
[17] M.-T. Luong, H. Pham and C. D. Manning, "Effective approaches to attention-based neural machine translation", arXiv preprint arXiv:1508.04025, 2015.
[18] A. Pampari, P. Raghavan, J. Liang and J. Peng, "emrQA: A large corpus for question answering on electronic medical records", arXiv preprint arXiv:1809.00732, 2018.
[19] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework", Interspeech, pp. 1816-1820, 2019.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", Proc. ICLR, 2015.
