
Kaldi Chinese speech recognition: thchs30 model training scripts and configuration parameters explained

Monophone

Training the monophone model

    
    # Flat start and monophone training, with delta-delta features.
    # This script applies cepstral mean normalization (per speaker).
    # monophone: train the monophone model
    steps/train_mono.sh --boost-silence 1.25 --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/mono || exit 1;
    # test the monophone model
    local/thchs-30_decode.sh --mono true --nj $n "steps/decode.sh" exp/mono data/mfcc &
 
Usage of train_mono.sh:

    
  1. echo "Usage: steps/train_mono.sh [options] <data-dir> <lang-dir> <exp-dir>"
  2. echo " e.g.: steps/train_mono.sh data/train.1k data/lang exp/mono"
  3. echo "main options (for others, see top of script file)"
The configuration options are listed below. The script trains a basic monophone HMM over 40 training iterations, realigning the data at the iterations listed in realign_iters.

   
    # Begin configuration section.
    nj=4
    cmd=run.pl
    scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
    num_iters=40    # Number of iterations of training
    max_iter_inc=30 # Last iter to increase #Gauss on.
    totgauss=1000   # Target #Gaussians.
    careful=false
    boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
    realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38";
    config=    # name of config file.
    stage=-4
    power=0.25 # exponent to determine number of gaussians from occurrence counts
    norm_vars=false # deprecated, prefer --cmvn-opts "--norm-vars=false"
    cmvn_opts= # can be used to add extra options to cmvn.
    # End configuration section.
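The training loop implied by these options can be summarized as follows. The sketch below is a simplified shell illustration, not the actual script: the real train_mono.sh reads the initial Gaussian count from 0.mdl with gmm-info and calls gmm-align-compiled / gmm-est; the initial value of 120 Gaussians here is only an assumption for illustration.

    # Simplified sketch of how train_mono.sh grows the model and realigns the data.
    totgauss=1000; max_iter_inc=30; num_iters=40
    realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38"
    numgauss=120                                          # assumed initial #Gauss (read from 0.mdl in the real script)
    incgauss=$(( (totgauss - numgauss) / max_iter_inc ))  # per-iteration increment of #Gauss
    for x in $(seq 1 $num_iters); do
      if echo "$realign_iters" | grep -qw "$x"; then
        echo "iter $x: realign training data"             # gmm-align-compiled in the real script
      fi
      echo "iter $x: accumulate stats, re-estimate with --mix-up=$numgauss"
      [ "$x" -le "$max_iter_inc" ] && numgauss=$(( numgauss + incgauss ))
    done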

thchs-30_decode.sh tests the monophone model. It uses mkgraph.sh to build the full decoding network and output a finite state transducer, then runs decode.sh with the language model and the test data as input to compute the WER.

    
    #decode word
    utils/mkgraph.sh $opt data/graph/lang $srcdir $srcdir/graph_word || exit 1;
    $decoder --cmd "$decode_cmd" --nj $nj $srcdir/graph_word $datadir/test $srcdir/decode_test_word || exit 1
    #decode phone
    utils/mkgraph.sh $opt data/graph_phone/lang $srcdir $srcdir/graph_phone || exit 1;
    $decoder --cmd "$decode_cmd" --nj $nj $srcdir/graph_phone $datadir/test_phone $srcdir/decode_test_phone || exit 1
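Once decoding finishes, the word and phone error rates can be read from the decode directories. A small sketch, assuming the directory layout produced by the commands above; grep and utils/best_wer.sh are the standard Kaldi tools for this:

    # Print the best WER over the tested LM weights for each decode directory.
    for dir in exp/mono/decode_test_word exp/mono/decode_test_phone; do
      [ -d "$dir" ] && grep WER "$dir"/wer_* | utils/best_wer.sh
    done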

align_si.sh aligns the given data with the given model. It is usually run before training a new model: it takes the previous model as input and writes the alignments to <align-dir>.

    
    #monophone_ali
    steps/align_si.sh --boost-silence 1.25 --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/mono exp/mono_ali || exit 1;
    # Computes training alignments using a model with delta or
    # LDA+MLLT features.
    # If you supply the "--use-graphs true" option, it will use the training
    # graphs from the source directory (where the model is). In this
    # case the number of jobs must match with the source directory.
    echo "usage: steps/align_si.sh <data-dir> <lang-dir> <src-dir> <align-dir>"
    echo "e.g.: steps/align_si.sh data/train data/lang exp/tri1 exp/tri1_ali"
    echo "main options (for others, see top of script file)"
    echo " --config <config-file> # config containing options"
    echo " --nj <nj> # number of parallel jobs"
    echo " --use-graphs true # use graphs in src-dir"
    echo " --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs."
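As the comments above note, --use-graphs true reuses the training graphs already compiled in the source directory. A hedged example (directory names follow the thchs30 layout above; with this option --nj must match the number of jobs used when exp/mono was trained):

    steps/align_si.sh --use-graphs true --nj $n --cmd "$train_cmd" \
      data/mfcc/train data/lang exp/mono exp/mono_ali || exit 1;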

Triphone

Train a context-dependent triphone model, using the monophone model as input.

      
    #triphone
    steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" 2000 10000 data/mfcc/train data/lang exp/mono_ali exp/tri1 || exit 1;
    #test tri1 model
    local/thchs-30_decode.sh --nj $n "steps/decode.sh" exp/tri1 data/mfcc &

The relevant configuration in train_deltas.sh is shown below; the usage message at the end describes its inputs:

     
    # Begin configuration.
    stage=-4 # This allows restarting partway through, when something went wrong.
    config=
    cmd=run.pl
    scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
    realign_iters="10 20 30";
    num_iters=35    # Number of iterations of training
    max_iter_inc=25 # Last iter to increase #Gauss on.
    beam=10
    careful=false
    retry_beam=40
    boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
    power=0.25 # Exponent for number of gaussians according to occurrence counts
    cluster_thresh=-1 # for build-tree: controls final bottom-up clustering of leaves
    norm_vars=false # deprecated. Prefer --cmvn-opts "--norm-vars=true"
               # or use the option --cmvn-opts "--norm-means=false"
    cmvn_opts=
    delta_opts=
    context_opts= # use "--context-width=5 --central-position=2" for quinphone
    # End configuration.
    echo "Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>"
    echo "e.g.: steps/train_deltas.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1"
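Because the variables in the configuration section are exposed as command-line options via utils/parse_options.sh, they can be overridden at call time. For example, following the context_opts comment above, a quinphone system could be trained like this (exp/tri1_quin is a hypothetical output directory, not part of the thchs30 recipe):

    steps/train_deltas.sh --cmd "$train_cmd" \
      --context-opts "--context-width=5 --central-position=2" \
      2000 10000 data/mfcc/train data/lang exp/mono_ali exp/tri1_quin || exit 1;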

LDA_MLLT

Apply LDA and MLLT transforms to the features and train a triphone model on top of the LDA+MLLT features.

LDA+MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then later estimate, over multiple iterations, a diagonalizing transform known as MLLT or STC (semi-tied covariance).

For details, see http://kaldi-asr.org/doc/transform.html



     
    #triphone_ali
    steps/align_si.sh --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/tri1 exp/tri1_ali || exit 1;
    #lda_mllt
    steps/train_lda_mllt.sh --cmd "$train_cmd" --splice-opts "--left-context=3 --right-context=3" 2500 15000 data/mfcc/train data/lang exp/tri1_ali exp/tri2b || exit 1;
    #test tri2b model
    local/thchs-30_decode.sh --nj $n "steps/decode.sh" exp/tri2b data/mfcc &
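To make the dimensions concrete: assuming 13-dimensional MFCCs (the usual default) and the --splice-opts "--left-context=3 --right-context=3" used above, the spliced vector that enters the LDA estimation stacks 7 frames:

    mfcc_dim=13; left=3; right=3
    echo $(( mfcc_dim * (left + 1 + right) ))   # 13 * 7 = 91, reduced by LDA to dim=40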
The relevant configuration in train_lda_mllt.sh is as follows:

     
    # Begin configuration.
    cmd=run.pl
    config=
    stage=-5
    scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
    realign_iters="10 20 30";
    mllt_iters="2 4 6 12";
    num_iters=35    # Number of iterations of training
    max_iter_inc=25 # Last iter to increase #Gauss on.
    dim=40
    beam=10
    retry_beam=40
    careful=false
    boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
    power=0.25 # Exponent for number of gaussians according to occurrence counts
    randprune=4.0 # This is approximately the ratio by which we will speed up the
                  # LDA and MLLT calculations via randomized pruning.
    splice_opts=
    cluster_thresh=-1 # for build-tree: controls final bottom-up clustering of leaves
    norm_vars=false # deprecated. Prefer --cmvn-opts "--norm-vars=false"
    cmvn_opts=
    context_opts= # use "--context-width=5 --central-position=2" for quinphone.
    # End configuration.
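The composed LDA+MLLT transform ends up in exp/tri2b/final.mat. As a quick sanity check (a hedged example using the standard copy-matrix binary), its text form can be printed to confirm that the output dimension matches dim=40:

    copy-matrix --binary=false exp/tri2b/final.mat - | head -n 2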

SAT

Speaker adaptive training using feature-space maximum likelihood linear regression (fMLLR). This does Speaker Adapted Training (SAT), i.e. it trains on fMLLR-adapted features. It can be done on top of either LDA+MLLT or delta and delta-delta features. If there are no transforms supplied in the alignment directory, it will estimate transforms itself before building the tree (and in any case, it estimates transforms a number of times during training).
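The corresponding thchs30 commands are not quoted above; as a hedged sketch of what this step typically looks like in the recipe's run script (the alignment options, the 2500/15000 leaf and Gaussian counts, and the use of decode_fmllr.sh should be checked against the local run.sh):

    #lda_mllt_ali
    steps/align_si.sh --nj $n --cmd "$train_cmd" --use-graphs true data/mfcc/train data/lang exp/tri2b exp/tri2b_ali || exit 1;
    #sat
    steps/train_sat.sh --cmd "$train_cmd" 2500 15000 data/mfcc/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
    #test tri3b model
    local/thchs-30_decode.sh --nj $n "steps/decode_fmllr.sh" exp/tri3b data/mfcc &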