Asynchronous stochastic optimization for sequence training of deep neural networks


Abstract:

This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.
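The asynchrony the abstract describes arises when many workers update a shared model without waiting for each other, so each gradient may be computed against a stale copy of the parameters. The following is a minimal sketch of that idea (not code from the paper): worker threads optimize a toy quadratic objective via stale-gradient asynchronous SGD; the objective, learning rate, and thread counts are illustrative assumptions.

```python
import threading

# Toy objective (illustrative, not from the paper):
# minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
params = {"w": 0.0}
lock = threading.Lock()
LR = 0.05

def gradient(w):
    return 2.0 * (w - 3.0)

def worker(steps):
    for _ in range(steps):
        # Read a (possibly stale) snapshot of the shared parameters.
        w_snapshot = params["w"]
        g = gradient(w_snapshot)
        # Apply the update without coordinating with other workers;
        # params["w"] may have changed since the snapshot was taken,
        # so this update is based on an outdated model.
        with lock:
            params["w"] -= LR * g

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(params["w"])  # converges near the optimum w = 3 despite staleness
```

For this convex toy problem the stale updates still converge; the paper's contribution is analyzing when and how such asynchrony affects the much harder setting of lattice-based sequence training at scale.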
Date of Conference: 04-09 May 2014
Date Added to IEEE Xplore: 14 July 2014
Electronic ISBN: 978-1-4799-2893-4

Conference Location: Florence, Italy

1. Introduction

Deep neural networks (DNNs) have become the dominant acoustic model for speech recognition [1], [2], [3], [4], [5]. Two frameworks for incorporating DNNs into an existing HMM-based decoder are the hybrid [6] and tandem [7] approaches. In this paper, we focus on the hybrid approach.

References
1. F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in INTERSPEECH, 2011, pp. 437-440.
2. G.E. Dahl, D. Yu, and L. Deng, "Context-dependent pretrained deep neural networks for large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
3. G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
4. N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, "Application of pretrained deep neural networks to large vocabulary speech recognition," in INTERSPEECH, 2012.
5. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82-97, 2012.
6. H. Bourlard and N. Morgan, Connectionist Speech Recognition, Kluwer Academic Publishers, 1994.
7. H. Hermansky, D.P.W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
8. Y. Normandin, Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem, Ph.D. thesis, McGill University, Montreal, Canada, 1991.
9. D. Povey, Discriminative Training for Large Vocabulary Speech Recognition, Ph.D. thesis, University of Cambridge, Cambridge, England, 2004.
10. B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 3761-3764.
11. B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in INTERSPEECH, 2012.
12. H. Su, G. Li, D. Yu, and F. Seide, "Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 6664-6668.
13. K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in INTERSPEECH, 2013.
14. L. Bottou, "Stochastic gradient learning in neural networks," in Neuro-Nimes, 1991.
15. Y. Kubo, T. Hori, and A. Nakamura, "Large vocabulary continuous speech recognition based on WFST structured classifiers and deep bottleneck features," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.
16. Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, "Building high-level features using large scale unsupervised learning," in International Conference on Machine Learning, 2012, pp. 81-88.
17. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
18. G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, Apr. 2013, vol. 1.
19. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
20. G. Heigold, A Log-Linear Discriminative Modeling Framework for Speech Recognition, Ph.D. thesis, RWTH Aachen University, Aachen, Germany, June 2010.
21. J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, "'Your word is my command': Google search by voice: A case study," in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, chapter 4, pp. 61-90, Springer, 2010.
