End-to-End Speech Recognition (2): CTC
Related Notes
History
ICML-2006. Graves et al. [1] introduced the connectionist temporal classification (CTC) objective function for phone recognition.
ICML-2014. Graves and Jaitly [2] demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing.
2014-2015. Baidu released Deep Speech [3] (2014) and Deep Speech 2 [4] (2015).
ASRU-2015. Yajie Miao et al. [5] presented the EESEN framework.
ASRU-2015. Google [6] extended the use of context-dependent (CD) LSTM acoustic models trained with the CTC and sMBR losses.
ICASSP-2016. Google [7] presented a compact large vocabulary speech recognition system that can run efficiently on mobile devices, accurately and with low latency.
NIPS-2016. Google [8] used whole words as acoustic units.
2017. IBM [9] employed direct acoustics-to-word models.
Reference
[1]. A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
[2]. Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.
[3]. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[4]. D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” CoRR arXiv:1512.02595, 2015.
[5]. Yajie Miao, Mohammad Gowayyed, Florian Metze. EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding. 2015 Automatic Speech Recognition and Understanding Workshop (ASRU 2015).
[6]. A. Senior, H. Sak, F. de Chaumont Quitry, T. N. Sainath, and K. Rao, “Acoustic Modelling with CD-CTC-SMBR LSTM RNNS,” in ASRU, 2015.
[7]. I. McGraw, R. Prabhavalkar, R. Alvarez, M. Gonzalez Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, and C. Parada, “Personalized speech recognition on mobile devices,” in Proc. of ICASSP, 2016.
[8]. H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016.
[9]. K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, “Direct Acoustics-to-Word Models for English Conversational Speech Recognition,” arXiv preprint arXiv:1703.07754, 2017.