Utterance-Wise Recurrent Dropout And Iterative Speaker Adaptation For Robust Monaural Speech Recognition
WRBN: wide residual BLSTM network (BLSTM: bidirectional long short-term memory).
[2] J. Heymann, L. Drude, and R. Haeb-Umbach, "Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition," submitted to the CHiME, vol. 4, 2016.
reverberation, n. [acoustics]: the reflection and echoing of sound.
CLDNN: convolutional, long short-term memory, fully connected deep neural networks.
[1] T.N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580
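speech separation: separating an utterance in which several speakers talk at the same time into one utterance per speaker.
Using dropout when training LSTMs effectively mitigates overfitting.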
[3] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
Dropout masks sampled once per utterance are applied to the output, forget, and input gates.
[7] G. Cheng, V. Peddinti, D. Povey, V. Manohar, S. Khudanpur, and Y. Yan, "An exploration of dropout with lstms," in Proceedings of Interspeech, 2017.
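The utterance-wise masking described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names `sample_gate_masks` and `lstm_forward`, the plain (non-projected, unidirectional) LSTM cell, the per-unit binary masks, and the absence of any rescaling by the keep probability are all assumptions made for the example. The one point it aims to show is that the masks are drawn once per utterance and then reused at every frame.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_gate_masks(hidden_size, drop_prob, rng):
    """Draw one binary mask per gate. Call once per utterance (assumed scheme)."""
    keep_prob = 1.0 - drop_prob
    return {
        gate: (rng.random(hidden_size) < keep_prob).astype(np.float64)
        for gate in ("input", "forget", "output")
    }


def lstm_forward(x_seq, params, masks):
    """Run a plain LSTM over one utterance, reusing the same gate masks at every frame.

    x_seq:  (T, D) feature frames of a single utterance.
    params: dict with W (4H, D), U (4H, H), b (4H,), gate order [i, f, o, g].
    masks:  dict from sample_gate_masks, fixed for the whole utterance.
    """
    W, U, b = params["W"], params["U"], params["b"]
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    outputs = []
    for x_t in x_seq:
        z = W @ x_t + U @ h + b
        i = sigmoid(z[0:H]) * masks["input"]          # masked input gate
        f = sigmoid(z[H:2 * H]) * masks["forget"]     # masked forget gate
        o = sigmoid(z[2 * H:3 * H]) * masks["output"]  # masked output gate
        g = np.tanh(z[3 * H:4 * H])                   # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)
```

During training, `sample_gate_masks` would be called once for each utterance in the batch; at test time the masks would be replaced by all-ones vectors (or by the keep probability, depending on the scaling convention chosen).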
The MLLR-based iterative adaptation method uses the decoding result of the previous iteration to update the Gaussian parameters.
[10] P.C. Woodland, D. Pye, and M.J.F. Gales, "Iterative unsupervised adaptation using maximum likelihood linear regression," in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on. IEEE, 1996, vol. 2, pp. 1133–1136.
A batch-normalization-based speaker adaptation method was proposed recently.
[14] P. Swietojanski, J. Li, and S. Renals, "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
This paper uses unsupervised LIN (linear input network) speaker adaptation.
[11]
The LIN layer is an 80×80 matrix shared by the three input feature streams (static, delta, delta-delta).
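As a rough illustration of what sharing one 80×80 LIN layer across the three streams means, the sketch below applies the same matrix to each 80-dimensional stream of a 240-dimensional frame. The function name `apply_shared_lin`, the `[static | delta | delta-delta]` memory layout, and the bias term are assumptions for the example; the notes only state the matrix size and that it is shared.

```python
import numpy as np

STREAM_DIM = 80   # size of each feature stream, matching the 80x80 LIN matrix
NUM_STREAMS = 3   # static, delta, delta-delta


def apply_shared_lin(features, lin_weight, lin_bias):
    """Apply one 80x80 linear input layer to every stream of every frame.

    features:   (T, 240), assumed layout [static | delta | delta-delta].
    lin_weight: (80, 80) speaker-dependent matrix shared by the three streams.
    lin_bias:   (80,) shared bias (hypothetical; use zeros for a pure matrix).
    """
    num_frames = features.shape[0]
    streams = features.reshape(num_frames, NUM_STREAMS, STREAM_DIM)
    adapted = streams @ lin_weight.T + lin_bias  # same transform for each stream
    return adapted.reshape(num_frames, NUM_STREAMS * STREAM_DIM)
```

A common choice, assumed here rather than taken from the notes, is to initialise `lin_weight` to the identity and `lin_bias` to zero, so that the adapted model starts out identical to the speaker-independent one. Sharing the matrix keeps the number of speaker-dependent parameters at 80×80 instead of 240×240.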
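This paper tries two approaches to iterative speaker adaptation:
- Iter: in each iteration, use the model from the previous iteration to generate new labels for training.
- Stack: in each iteration, stack one additional linear input layer (mathematically, several stacked linear layers are equivalent to a single linear layer).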
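The remark that stacked linear layers collapse into one can be checked numerically. The sketch below folds a list of affine layers into a single weight and bias, using the identity W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The helper name `collapse_linear_stack` and the presence of bias terms are assumptions for illustration.

```python
import numpy as np


def collapse_linear_stack(layers):
    """Fold affine layers [(W1, b1), (W2, b2), ...], applied first to last, into one (W, b)."""
    in_dim = layers[0][0].shape[1]
    w_total = np.eye(in_dim)
    b_total = np.zeros(in_dim)
    for w, b in layers:
        # Composing y = w @ (w_total @ x + b_total) + b
        b_total = w @ b_total + b
        w_total = w @ w_total
    return w_total, b_total


def apply_linear_stack(x, layers):
    """Apply the layers one after another to a single feature vector x."""
    for w, b in layers:
        x = w @ x + b
    return x
```

For any input `x`, `apply_linear_stack(x, layers)` and `w @ x + b` with `(w, b) = collapse_linear_stack(layers)` agree up to floating-point error. Stacking therefore adds no representational capacity; any difference between the two schemes comes from how each stacked layer is estimated, one iteration at a time.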
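Conventional DNN training is segment-wise.
The experiments show that with the RNN, the Iter (iterative) scheme performs better; with the tri-gram, the Stack (stacking) scheme performs better.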