Kaldi Chinese Speech Recognition, Based on thchs30 (3)

阿新 • Published: 2019-01-08

Picking up where we left off, let's keep reading run.sh.

#you can obtain the database by uncommenting the following lines
#[ -d $thchs ] || mkdir -p $thchs || exit 1
#echo "downloading THCHS30 at $thchs ..."
#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 data_thchs30 || exit 1
#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 resource || exit 1

#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 test-noise || exit 1
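There is not much to say here: these lines download the thchs30 speech data packages and unpack them into the target directory. The stock run.sh already ships with them commented out, meaning you can use this script to download the data if you need to. We have already downloaded everything, so we don't need it here.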

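Recall from last time that, because memory may be insufficient, we run the recipe step by step. In run.sh, look for the line
#data preparation
Everything after it is shell commands. I recommend running them one at a time; otherwise the run can break off midway with baffling errors. So how do you run them one at a time?
Comment out the rest with :<<! ... ! This pair works like /* */ in C, and the ... in the middle stands for whatever you want to skip.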
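The trick above can be sketched as follows. This is a minimal illustration, not taken from run.sh; the variable name ran is made up. One caveat I'm adding: with an unquoted delimiter (:<<!), the shell still performs expansions inside the skipped block, so backticks or $(...) in it would still execute. Quoting the delimiter, as in :<<'!', turns the block into inert text.

```shell
ran=""

# Everything between the two "!" lines is fed to the no-op command ":"
# as a here-document, so none of it is executed.
: <<'!'
ran="this assignment never happens"
echo "neither does this: $(date)"
!

# Execution resumes here.
ran="${ran}after"
echo "$ran"   # prints: after
```

To step through run.sh, move the opening line down past each command once it has finished successfully.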

Here it is.

#data preparation
#generate text, wav.scp, utt2spk, spk2utt
local/thchs-30_data_prep.sh $H $thchs/data_thchs30 || exit 1;

So we run local/thchs-30_data_prep.sh first. It does the data preparation, so let's look at what is inside.

#!/bin/bash
# Copyright 2016 Tsinghua University (Author: Dong Wang, Xuewei Zhang). Apache 2.0.
# 2016 LeSpeech (Author: Xingyu Na)

#This script prepares the data directory for thchs30 recipe.
#It reads the corpus and get wav.scp and transcriptions.

dir=$1
corpus_dir=$2

These two are simply the two arguments of the command above, local/thchs-30_data_prep.sh $H $thchs/data_thchs30, namely $H and $thchs/data_thchs30.

$1 is $H, which run.sh sets with H=`pwd`, so it is just the current directory.

$2 is $thchs/data_thchs30. Earlier in run.sh we declared thchs=/opt/kaldi/egs/thchs30/thchs30-openslr,
so $thchs/data_thchs30 means /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30, the directory holding the speech data.
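A small sketch of how the arguments arrive. The function show_args is a hypothetical stand-in for the script (a shell function receives $1 and $2 the same way a script does); the thchs path is the one used in this walkthrough.

```shell
# Hypothetical stand-in for local/thchs-30_data_prep.sh.
show_args() {
  dir=$1          # first argument
  corpus_dir=$2   # second argument
  echo "dir=$dir corpus_dir=$corpus_dir"
}

H=$(pwd)   # same meaning as H=`pwd` in run.sh
thchs=/opt/kaldi/egs/thchs30/thchs30-openslr

out=$(show_args "$H" "$thchs/data_thchs30")
echo "$out"
```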

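cd $dir   # enter that directory
echo "creating data/{train,dev,test}"   # print the message "creating data/{train,dev,test}"
mkdir -p data/{train,dev,test}   # create data/ and its three subdirectories; the data preparation files are generated under them shortly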
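The {train,dev,test} part is bash brace expansion: bash rewrites it into three separate paths before mkdir runs. The sketch below shows this in a scratch directory. It calls bash explicitly because brace expansion is a bash feature that a plain POSIX sh may not support; the variable names are made up.

```shell
tmp=$(mktemp -d)   # scratch directory so nothing real is touched

# What the brace pattern turns into before mkdir ever sees it.
expanded=$(bash -c 'echo data/{train,dev,test}')
echo "$expanded"   # prints: data/train data/dev data/test

# The same mkdir call as in the script, run inside the scratch directory.
bash -c 'cd "$1" && mkdir -p data/{train,dev,test}' _ "$tmp"

# List what was created (the glob expands in sorted order).
created=$(cd "$tmp/data" && echo *)
echo "$created"    # prints: dev test train

rm -rf "$tmp"
```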

#create wav.scp, utt2spk.scp, spk2utt.scp, text
# (as I understand it, this creates the files that describe the speech data)

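A note on what gets built from the audio file names and the transcripts: wav.scp, utt2spk.scp, spk2utt.scp and text, plus word.txt and phone.txt. (On disk the files are named utt2spk and spk2utt, without the .scp suffix, as the script below shows.)
In wav.scp, the first column is the recording ID <recording-id> and the second is the audio file path <extended-filename>.
Example: A11_000 /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30/train/A11_0.wav

In utt2spk, the first column is the utterance ID <utterance-id> and the second is the speaker ID <speaker-id>.
Example: A11_000 A11
In spk2utt, the first column is the speaker <speaker-id>, followed by every utterance that speaker produced: <utterance-id1> <utterance-id2> …

This is the format that data/train/utt2spk is converted into later, producing data/train/spk2utt.
In word.txt, the first column is the utterance ID <utterance-id> and the second is the spoken text. We will look at how it is generated further down.

Example: A11_000 綠是陽春煙景大塊文章的底色四月的林巒更是綠得鮮活秀媚詩意盎然
In phone.txt, the first column is the utterance ID <utterance-id> and the second is the phone-level transcription of the spoken text. Again, we will look at how it is generated further down.
Example: A11_000 l v4 sh ix4 ii iang2 ch un1 ii ian1 j ing3 d a4 k uai4 uu un2 zh ang1 d e5 d i3 s e4 s iy4 vv ve4 d e5 l in2 l uan2 g eng4 sh ix4 l v4 d e5 x ian1 h uo2 x iu4 m ei4 sh ix1 ii i4 aa ang4 r an2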
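To make the last two formats concrete, here is a minimal sketch that builds one word.txt line and one phone.txt line from a transcript file, using the same sed -n 1p and sed -n 3p calls as the script further down. The transcript file here is made up: its contents are shortened from the example above, and line 2 is a placeholder because the script never reads it.

```shell
tmp=$(mktemp -d)

# Made-up three-line transcript file standing in for data/A11_0.wav.trn.
printf '%s\n' \
  '綠是陽春' \
  '(line 2 is not read by the script)' \
  'l v4 sh ix4 ii iang2 ch un1' > "$tmp/A11_0.wav.trn"

uttid=A11_000

# Line 1 becomes the word.txt entry, line 3 the phone.txt entry.
word_line="$uttid $(sed -n 1p "$tmp/A11_0.wav.trn")"
phone_line="$uttid $(sed -n 3p "$tmp/A11_0.wav.trn")"

echo "$word_line"    # prints: A11_000 綠是陽春
echo "$phone_line"   # prints: A11_000 l v4 sh ix4 ii iang2 ch un1

rm -rf "$tmp"
```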

(
  # loop over the three subsets; each pass generates the files for one of them
  for x in train dev test; do
    echo "cleaning data/$x"   # progress message for this subset
    cd $dir/data/$x   # enter the subset's directory
    rm -rf wav.scp utt2spk spk2utt word.txt phone.txt text   # delete any files left over from an earlier run so they are regenerated
    echo "preparing scps and text in data/$x"   # progress message
    #updated new "for loop" figured out the compatibility issue with Mac created by Xi Chen, in 03/06/2018
    # (upstream note: the for loop was rewritten to fix a compatibility problem on Mac)
    #for nn in `find $corpus_dir/$x/*.wav | sort -u | xargs -i basename {} .wav`; do
    for nn in `find $corpus_dir/$x -name "*.wav" | sort -u | xargs -I {} basename {} .wav`; do   # find the *.wav files, sort and de-duplicate them, then strip the directory and the .wav suffix
      spkid=`echo $nn | awk -F"_" '{print "" $1}'`   # speaker ID: the part before the underscore, e.g. A11
      spk_char=`echo $spkid | sed 's/\([A-Z]\).*/\1/'`   # letter prefix of the speaker ID, e.g. A
      spk_num=`echo $spkid | sed 's/[A-Z]\([0-9]\)/\1/'`   # numeric part of the speaker ID, e.g. 11
      spkid=$(printf '%s%.2d' "$spk_char" "$spk_num")   # rebuild the speaker ID with the number zero-padded to two digits
      utt_num=`echo $nn | awk -F"_" '{print $2}'`   # utterance number: the part after the underscore, starting from 0
      uttid=$(printf '%s%.2d_%.3d' "$spk_char" "$spk_num" "$utt_num")   # utterance ID: speaker ID plus the utterance number zero-padded to three digits

      echo $uttid $corpus_dir/$x/$nn.wav >> wav.scp   # utterance ID followed by the full path of the audio file
      # e.g. A11_000 /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30/train/A11_0.wav
      echo $uttid $spkid >> utt2spk   # utterance ID followed by speaker ID, e.g. A11_000 A11
      echo $uttid `sed -n 1p $corpus_dir/data/$nn.wav.trn` >> word.txt   # utterance ID followed by line 1 of the transcript file (the Chinese text)
      # e.g. A11_000 綠是陽春煙景大塊文章的底色四月的林巒更是綠得鮮活秀媚詩意盎然
      echo $uttid `sed -n 3p $corpus_dir/data/$nn.wav.trn` >> phone.txt   # utterance ID followed by line 3 of the transcript file (the phone sequence)
      # e.g. A11_000 l v4 sh ix4 ii iang2 ch un1 ii ian1 j ing3 d a4 k uai4 uu un2 zh ang1 d e5 d i3 s e4 s iy4 vv ve4 d e5 l in2 l uan2 g eng4 sh ix4 l v4 d e5 x ian1 h uo2 x iu4 m ei4 sh ix1 ii i4 aa ang4 r an2
    done
    # word.txt becomes text, then every file is sorted in place
    cp word.txt text
    sort wav.scp -o wav.scp
    sort utt2spk -o utt2spk
    sort text -o text
    sort phone.txt -o phone.txt
  done
) || exit 1
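The ID handling inside the loop is easier to follow in isolation. The sketch below wraps the same awk, sed and printf steps in a function; the name make_ids is hypothetical, and A2_5 is just an illustrative file name chosen to show the zero padding.

```shell
# Print "<utterance-id> <speaker-id>" for a wav base name such as A11_0.
make_ids() {
  nn=$1
  spkid=$(echo "$nn" | awk -F"_" '{print $1}')            # A11
  spk_char=$(echo "$spkid" | sed 's/\([A-Z]\).*/\1/')     # A
  spk_num=$(echo "$spkid" | sed 's/[A-Z]\([0-9]\)/\1/')   # 11
  utt_num=$(echo "$nn" | awk -F"_" '{print $2}')          # 0
  # Pad the speaker number to 2 digits and the utterance number to 3.
  printf '%s%.2d_%.3d %s%.2d\n' \
    "$spk_char" "$spk_num" "$utt_num" "$spk_char" "$spk_num"
}

make_ids A11_0   # prints: A11_000 A11
make_ids A2_5    # prints: A02_005 A02
```

Fixed-width IDs mean that a plain text sort puts every utterance of a speaker together and in numeric order, which is presumably why the script pads them before sorting.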

utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
# call utils/utt2spk_to_spk2utt.pl to convert the utt2spk file into spk2utt; the next two lines do the same for dev and test
utils/utt2spk_to_spk2utt.pl data/dev/utt2spk > data/dev/spk2utt
utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt

echo "creating test_phone for phone decoding"   # presumably a copy of the test set whose transcripts are phone sequences
(
rm -rf data/test_phone && cp -R data/test data/test_phone || exit 1   # remove any old data/test_phone, then copy data/test to data/test_phone
cd data/test_phone && rm text && cp phone.txt text || exit 1   # inside the copy, delete the old text and use phone.txt as the new text

)
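A quick way to convince yourself what this last step does is to replay it in a scratch directory. The two one-line files below are made up for the purpose; only the rm and cp commands come from the script.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/data/test"
echo "A11_000 綠是陽春" > "$tmp/data/test/text"
echo "A11_000 l v4 sh ix4" > "$tmp/data/test/phone.txt"

(
  cd "$tmp" || exit 1
  # Same two commands as in the script.
  rm -rf data/test_phone && cp -R data/test data/test_phone || exit 1
  cd data/test_phone && rm text && cp phone.txt text || exit 1
)

phone_text=$(cat "$tmp/data/test_phone/text")
word_text=$(cat "$tmp/data/test/text")
echo "test_phone/text: $phone_text"   # now holds the phone sequence
echo "test/text:       $word_text"    # the original is left untouched

rm -rf "$tmp"
```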

Now let's look at the utils/utt2spk_to_spk2utt.pl script.

#!/usr/bin/env perl
# Copyright 2010-2011 Microsoft Corporation

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.

# converts an utt2spk file to a spk2utt file.
# Takes input from the stdin or from a file argument;
# output goes to the standard out.

if ( @ARGV > 1 ) {
  die "Usage: utt2spk_to_spk2utt.pl [ utt2spk ] > spk2utt";
}

while(<>){
  @A = split(" ", $_);
  @A == 2 || die "Invalid line in utt2spk file: $_";
  ($u,$s) = @A;
  if(!$seen_spk{$s}) {
    $seen_spk{$s} = 1;
    push @spklist, $s;
  }
  push (@{$spk_hash{$s}}, "$u");
}
foreach $s (@spklist) {
  $l = join(' ',@{$spk_hash{$s}});
  print "$s $l\n";
}
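For readers who don't read Perl, the same grouping can be sketched with awk. This is my own rough equivalent, not part of Kaldi: the function name is reused only for readability, and unlike the Perl script it does not reject malformed lines. Like the original, it keeps speakers in the order they first appear.

```shell
# Rough awk equivalent of utils/utt2spk_to_spk2utt.pl (no input validation).
utt2spk_to_spk2utt() {
  awk '
    {
      if ($2 in utts) {
        utts[$2] = utts[$2] " " $1   # append utterance to a known speaker
      } else {
        order[++n] = $2              # remember first-seen speaker order
        utts[$2] = $1
      }
    }
    END {
      for (i = 1; i <= n; i++) print order[i], utts[order[i]]
    }
  ' "$@"
}

out=$(printf '%s\n' 'A02_000 A02' 'A02_001 A02' 'B03_000 B03' | utt2spk_to_spk2utt)
echo "$out"
# prints:
# A02 A02_000 A02_001
# B03 B03_000
```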

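The script is essentially just a format conversion. Let's get these steps running first and pick up the rest afterwards. To be continued.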
