
A Brief Overview of HMM Principles and Usage

1. Overview of the Principles

To build a high-level picture of how GMM-HMM is applied to speech recognition, I spent some time reading part of the HTK source code (using HTK for simple isolated-word recognition) and finally got a rough grasp of the algorithm, which was all I was after. It has to be said that accessible tutorials on speech recognition are scarce online: most are walls of formulas that rarely go into concrete details, which admittedly take hands-on experience to explain. The points below are meant to leave a high-level impression, using isolated-word recognition as the running example.

1. The pronunciation of each word corresponds to one HMM. As everyone knows, an HMM has a state set S, so what is each state represented by: a number, a vector, a matrix? In fact the states in this set carry no specific mathematical content; they are merely labels. You can call them '1', '2', '3', ... or 'a', 'b', 'c', .... How many states each HMM should have is fixed by hand from prior knowledge.

2. Every HMM state has an associated observation, which may be a scalar or a vector; all states must emit observations of the same dimension. Suppose we have the audio file of one word. It is first sampled into digital form (A/D conversion) and then split into frames for MFCC feature extraction. If each frame yields an MFCC feature of length 39, the audio file becomes a sequence of N 39-dimensional MFCC vectors (N varies across files). When training the HMM parameters (for example with the Baum-Welch algorithm), each input to the HMM is exactly such an observation sequence. Here the observation attached to each state is a 39-dimensional vector whose entries are continuous, so a multivariate density is needed to model it, usually a multivariate Gaussian. In the GMM-HMM framework this density is a mixture of K multivariate Gaussians. Once all parameters of a state's K Gaussians are known, the probability that the GMM generates a given observation vector (the MFCC coefficients of one frame) in that state can be computed.
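As a concrete illustration, the following is a minimal MATLAB sketch of evaluating one state's GMM emission density for a single 39-dimensional frame; the weights, means, and diagonal covariances here are random placeholders rather than trained values.

D = 39; K = 3;       % feature dimension, number of mixture components
o = randn(D, 1);     % one MFCC frame (placeholder)
w = ones(K, 1)/K;    % mixture weights, summing to 1
mu = randn(D, K);    % mean vector of each component
sigma2 = ones(D, K); % diagonal covariance of each component
b = 0;               % b(o) = sum_k w(k) * N(o; mu(:,k), diag(sigma2(:,k)))
for k = 1:K
    d = o - mu(:,k);
    logN = -0.5*(D*log(2*pi) + sum(log(sigma2(:,k))) + sum(d.^2./sigma2(:,k)));
    b = b + w(k)*exp(logN);
end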

3. Building a word's HMM requires training samples for that word, labeled in advance: each sample is a segment of audio containing only that word's pronunciation. Given several such samples, the Baum-Welch algorithm (an instance of EM) trains all of the GMM-HMM parameters: the initial-state probability vector, the state transition matrix, and each state's observation model (here a GMM, i.e., the weights of the K Gaussians attached to each state, plus each Gaussian's mean vector and covariance matrix).
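With the HMM toolbox routines used later in this article, the per-word training loop might look like the hedged sketch below; numWords, wordData, and initWordModel are hypothetical names, and the initializer stands in for the kind of setup shown in section 2.

for w = 1:numWords
    % wordData{w}: cell array of O x T MFCC matrices for word w (hypothetical)
    [prior0, transmat0, mu0, Sigma0, mixmat0] = initWordModel(wordData{w}); % hypothetical initializer
    [LL, models(w).prior, models(w).transmat, models(w).mu, ...
        models(w).Sigma, models(w).mixmat] = ...
        mhmm_em(wordData{w}, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 10);
end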

4. In the recognition phase, given an input audio clip containing several words, we can first segment it by hand (the simplest approach), extract the MFCC feature sequence of each word, feed that sequence into every previously trained HMM, compute with the forward algorithm the probability that each HMM generates the sequence, and finally take the highest-probability model; the word it represents is the recognition result.
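Continuing the hypothetical sketch above, recognition is then just an argmax over the per-model log-likelihoods computed by the forward algorithm (mhmm_logprob in the same toolbox), where testMfcc is the O x T MFCC matrix of one segmented word:

loglik = zeros(1, numWords);
for w = 1:numWords
    loglik(w) = mhmm_logprob(testMfcc, models(w).prior, models(w).transmat, ...
        models(w).mu, models(w).Sigma, models(w).mixmat);
end
[~, recognizedWord] = max(loglik); % index of the best-scoring word model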

5. When building the acoustic model, Deep Learning methods can replace the GMM inside GMM-HMM. A GMM's capacity to approximate an arbitrary function is bounded by the number of mixture components, so it is a shallow model with inherent limitations, whereas a deep network can approximate arbitrary functions and is therefore more expressive. Note that the deep model substituted for the GMM must be generative, e.g., a DBN or DBM, because training the HMM-DL network requires the probability of a given HMM state generating a sample.

6. GMM-HMM is still fairly involved to implement in practice.

7. HMMs are generally used only when time series are involved, such as speech recognition in audio or action recognition in video. If we used GMM-HMM to classify static images, no temporal information would be involved, so the number of HMM states could be set to 1, and the GMM-HMM algorithm would degenerate into a plain GMM.

2. Usage Notes

1) Hidden Markov model with discrete outputs (DHMM)

Maximum likelihood parameter estimation via EM (the Baum-Welch algorithm)

The script dhmm_em_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.

O = 3;
Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
obsmat0 = mk_stochastic(rand(Q,O));

Now we sample nex=20 sequences of length T=10 each from this model, to use as training data.

T = 10;    % length of each sequence
nex = 20;  % number of sample sequences
data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);

Here data is 20x10. Now we make a random guess as to what the parameters are,

prior1 = normalise(rand(Q,1));        % initial guess for the state prior
transmat1 = mk_stochastic(rand(Q,Q)); % initial guess for the transition matrix
obsmat1 = mk_stochastic(rand(Q,O));   % initial guess for the observation (emission) matrix

and improve our guess using 5 iterations of EM...

[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
% prior2, transmat2, obsmat2 are the trained state prior, transition matrix, and emission matrix

LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.
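For example:

plot(LL, '.-'); xlabel('iteration'); ylabel('log-likelihood');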

Sequence classification

To evaluate the log-likelihood of a trained model given test data, proceed as follows:

loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)  % log-likelihood of the test data under the trained HMM

Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.
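A minimal sketch of that procedure, assuming hmms(i) is a hypothetical struct array holding the parameters of the i-th trained model and testSeq is the test sequence:

loglik = zeros(1, k);
for i = 1:k
    loglik(i) = dhmm_logprob(testSeq, hmms(i).prior, hmms(i).transmat, hmms(i).obsmat);
end
[~, predictedClass] = max(loglik); % declare this class for testSeq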

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = multinomial_prob(data, obsmat);

Then you can use

[path] = viterbi_path(prior, transmat, B) 

2) Hidden Markov model with mixture-of-Gaussians outputs (GHMM)

Maximum likelihood parameter estimation using EM (Baum-Welch)

Let us generate nex=50 vector-valued sequences of length T=50; each vector has size O=2.

O = 2;
T = 50;
nex = 50;
data = randn(O,T,nex);

Now let us fit a mixture of M=2 Gaussians for each of the Q=2 states, initialised using K-means.

M = 2;
Q = 2;
left_right = 0;
cov_type = 'full';  % covariance type passed to mixgauss_init
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));

Finally, let us improve these parameter estimates using EM.

[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);

Since EM only finds a local optimum, good initialisation is crucial. The initialisation procedure illustrated above is very crude and is probably not adequate for real applications; see the BNT documentation for a real-world example of EM with mixtures of Gaussians.

It is possible for p(x) > 1 if p(x) is a probability density function, such as a Gaussian. (A density only requires p(x) >= 0 for all x and that the integral of p(x) over x equal 1.) In practice this usually means your covariance is shrinking to a point/delta function, so you should increase the width of the prior (see below), or constrain the matrix to be spherical or diagonal, or clamp it to a large fixed constant (not learn it at all). It is also very helpful to ensure the components of the data vectors have small and comparable magnitudes (use e.g., KPMstats/standardize).

This is a well-known pathology of maximum likelihood estimation for Gaussian mixtures: the global optimum may place one mixture component on a single data point, and give it 0 covariance, and hence infinite likelihood. One usually relies on the fact that EM cannot find the global optimum to avoid such pathologies.

Since I implicitly add a prior to every covariance matrix (see below), what increases is loglik + log(prior), but what I print is just loglik, which may occasionally decrease. This suggests that one of your mixture components is not getting enough data. Try a better initialization or fewer clusters (states).

Estimates of the covariance matrix often become singular if you have too little data, or if too few points are assigned to a cluster center due to a bad initialization of the means. In this case, you should constrain the covariance to be spherical or diagonal, or adjust the prior (see below), or try a better initialization.
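If your copy of mhmm_em exposes a 'cov_type' option (an assumption worth checking against the function's option parsing), a diagonal constraint might be requested like this:

[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, ...
    'max_iter', 5, 'cov_type', 'diag'); % constrain each covariance to be diagonal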

Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior is initialized to 0.01*I. This is added to the maximum likelihood estimate after every M step. To change this, you will need to modify the mhmm_em function so it calls mixgauss_Mstep with a different value.

Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:

loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = mixgauss_prob(data(:,:,ex), mu, Sigma, mixmat);

where data(:,:,ex) is OxT where O is the size of the observation vector. Finally, use

[path] = viterbi_path(prior, transmat, B);

3) HMM with Gaussian outputs

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix.
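A hedged sketch of the M=1 setup, reusing the data generated in the previous section and assuming the mixing matrix simply degenerates to a column of ones:

Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
[mu0, Sigma0] = mixgauss_init(Q, reshape(data, [O T*nex]), 'full'); % one Gaussian per state
mixmat0 = ones(Q, 1); % M=1: nothing to mix
[LL, prior1, transmat1, mu1, Sigma1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 5);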

Online EM for discrete HMMs/ POMDPs

For some applications (e.g., reinforcement learning/ adaptive control), it is necessary to learn a model online. The script dhmm_em_online_demo gives an example of how to do this.


  MFCC:

For a MATLAB implementation tutorial of MFCC, see Roger Jang (張智星)'s web tutorial mfcc, which covers the most basic 12-dimensional features.

function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
% frame2mfcc: Frame to MFCC conversion.
%    Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
%
%    For example:
%        waveFile='what_movies_have_you_seen_recently.wav';
%        [y, fs, nbits]=wavReadInt(waveFile);
%        startIndex=12000;
%        frameSize=512;
%        frame=y(startIndex:startIndex+frameSize-1);
%        frame2mfcc(frame, fs, 20, 12, 1);

%    Roger Jang 20060417

if nargin<1, selfdemo; return; end
if nargin<2, fs=16000; end
if nargin<3, filterNum=20; end
if nargin<4, mfccNum=12; end
if nargin<5, plotOpt=0; end

frameSize=length(frame);
% ====== Preemphasis should be done at wave level
%a=0.95;
%frame2 = filter([1, -a], 1, frame);
frame2=frame;
% ====== Hamming windowing
frame3=frame2.*hamming(frameSize);
% ====== FFT
[fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
% ====== Triangular band-pass filter bank
triFilterBankPrm=getTriFilterBankPrm(fs, filterNum);    % Get parameters for triangular band-pass filter bank
% Triangular bandpass filter.
for i=1:filterNum
    tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i))); % energy through the i-th triangular filter (filterNum coefficients in total)
end
% ====== DCT
mfcc=zeros(mfccNum, 1); % keep mfccNum DCT coefficients
for i=1:mfccNum
    coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))'; % DCT basis for the i-th coefficient
    mfcc(i) = sum(coef.*tbfCoef'); % apply the DCT formula directly
end
% ====== Log energy
%logEnergy=10*log10(sum(frame.*frame));
%mfcc=[logEnergy; mfcc];

if plotOpt
    subplot(2,1,1);
    plot(frame, '.-');
    set(gca, 'xlim', [-inf inf]);
    title('Input frame');
    subplot(2,1,2);
    plot(mfcc, '.-');
    set(gca, 'xlim', [-inf inf]);
    title('MFCC vector');
end

% ====== trimf.m (from fuzzy toolbox)
function y = trimf(x, prm) % triangular membership: maps each frequency to a height in [0,1]
a = prm(1); b = prm(2); c = prm(3);
y = zeros(size(x));
% Left and right shoulders (y = 0)
index = find(x <= a | c <= x);
y(index) = zeros(size(index)); % only points inside the triangle contribute
% Left slope
if (a ~= b)
    index = find(a < x & x < b);
    y(index) = (x(index)-a)/(b-a);
end
% right slope
if (b ~= c)
    index = find(b < x & x < c);
    y(index) = (c-x(index))/(c-b);
end
% Center (y = 1)
index = find(x == b);
y(index) = ones(size(index));

% ====== Self demo
function selfdemo
waveFile='what_movies_have_you_seen_recently.wav';
[y, fs, nbits]=wavReadInt(waveFile);
startIndex=12000;
frameSize=512;
frame=y(startIndex:startIndex+frameSize-1);
feval(mfilename, frame, fs, 20, 12, 1);

ZCR:

Zero-crossing detection counts how many zero crossings occur in each frame; the simplest version can be found at: zeros cross rate.

waveFile='csNthu.wav';
frameSize=256;
overlap=0;
[y, fs, nbits]=wavread(waveFile);
frameMat=enframe(y, frameSize, overlap);
frameNum=size(frameMat, 2);
for i=1:frameNum
    frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i));    % zero-mean each frame
end
zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);
sampleTime=(1:length(y))/fs;
frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
subplot(2,1,2); plot(frameTime, zcr, '.-');
xlabel('Time (sec)'); ylabel('Count'); title('ZCR');

EPD:

Endpoint detection locates the start and end points of the speech; see EPD in Time Domain for the simplest time-domain method.

waveFile='sunday.wav';
[wave, fs, nbits] = wavread(waveFile);
frameSize = 256;
overlap = 128;

wave=wave-mean(wave);                % zero-mean subtraction
frameMat=buffer2(wave, frameSize, overlap);    % frame blocking; each column is one frame
frameNum=size(frameMat, 2);            % no. of frames
volume=frame2volume(frameMat);        % per-frame energy (sum of absolute values or of squares); volume is a row vector
volumeTh1=max(volume)*0.1;            % volume threshold 1
volumeTh2=median(volume)*0.1;            % volume threshold 2
volumeTh3=min(volume)*10;            % volume threshold 3
volumeTh4=volume(1)*5;                % volume threshold 4
index1 = find(volume>volumeTh1); % frame indices whose volume exceeds the threshold
index2 = find(volume>volumeTh2);
index3 = find(volume>volumeTh3);
index4 = find(volume>volumeTh4);
% frame2sampleIndex() maps frame indices to sample indices
% each endPointX has length 2: the sample indices of the start and end points
endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);

subplot(2,1,1);
time=(1:length(wave))/fs;
plot(time, wave);
ylabel('Amplitude'); title('Waveform');
axis([-inf inf -1 1]);
line(time(endPoint1(  1))*[1 1], [-1, 1], 'color', 'm'); % mark the start/end boundary lines
line(time(endPoint2(  1))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(  1))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(  1))*[1 1], [-1, 1], 'color', 'r');
line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');

subplot(2,1,2);
frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
plot(frameTime, volume, '.-');
ylabel('Sum of Abs.'); title('Volume');
axis tight;
line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');

 GMM: 

A GMM is used to fit a data distribution: in essence we assume the samples follow a GMM and then learn the GMM parameters from many samples. In speech, a GMM can model, for example, the pronunciation of a particular word or the timbre of a particular speaker. For the training procedure, see: speaker recognition.

function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
% gmmTrain: Parameter training for gaussian mixture model (GMM)
%    Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
%        data: dim x dataNum matrix where each column is a data point
%        gaussianNum: No. of Gaussians or initial centers
%        dispOpt: Option for displaying info during training
%        M: dim x meanNum matrix where each column is a mean vector
%        V: 1 x gaussianNum vector where each element is a variance for a Gaussian
%        W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian

% Roger Jang 20000610

if nargin==0, selfdemo; return; end
if nargin<3, dispOpt=0; end

maxLoopCount = 50;    % Max. iteration
minImprove = 1e-6;    % Min. improvement
minVariance = 1e-6;    % Min. variance
logProb = zeros(maxLoopCount, 1);   % Array for objective function
[dim, dataNum] = size(data);

% Set initial parameters
% Set initial M
%M = data(1+floor(rand(gaussianNum,1)*dataNum),:);    % Randomly select several data points as the centers
if length(gaussianNum)==1,
    % Using vqKmeans to find initial centers
    fprintf('Start KMEANS to find the initial mu...\n');
%    M = vqKmeansMex(data, gaussianNum, 0);
    M = vqKmeans(data, gaussianNum, 0); % cluster into gaussianNum groups via k-means to get the initial means
%    M = vqLBG(data, gaussianNum, 0);
    fprintf('Start GMM training...\n');
    if any(any(~isfinite(M))); keyboard; end
else
    % gaussianNum is in fact the initial centers
    M = gaussianNum;
    gaussianNum = size(M, 2);
end
% Set initial V as the distance to the nearest center
if gaussianNum==1
    V=1;
else
    distance=pairwiseSqrDist(M); % pairwiseSqrDist is a precompiled DLL
   %distance=pairwiseSqrDist2(M);
   
    distance(1:(gaussianNum+1):gaussianNum^2)=inf;    % Diagonal elements are inf
    [V, index]=min(distance);    % Initial variance for each Gaussian
end
% Set initial W
W = ones(1, gaussianNum)/gaussianNum;    % weight for each Gaussian, initialized uniformly

if dispOpt & dim==2, displayGmm(M, V, data); end
for i = 1:maxLoopCount  % iterate the EM algorithm to train the parameters
    % Expectation step:
    % P(i,j) is the probability of data(:,j) to the i-th Gaussian
    % prob is the likelihood of each sample under the current GMM
    [prob, P]=gmmEval(data, M, V, W);
    logProb(i)=sum(log(prob)); % total log-likelihood over all samples
    if dispOpt
        fprintf('i = %d, log prob. = %f\n',i-1, logProb(i));
    end
    PW = diag(W)*P;
    BETA=PW./(ones(gaussianNum,1)*sum(PW));    % BETA(i,j) is beta_i(x_j)
    sumBETA=sum(BETA,2);

    % Maximization step:  eqns (2.96) to (2.98) from Bishop p.67:
    M = (data*BETA')./(ones(dim,1)*sumBETA');

   DISTSQ = pairwiseSqrDist(M, data);                    % Distance of M to data
   %DISTSQ = pairwiseSqrDist2(M, data);                    % Distance of M to data
   
    V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance);    % (2.97)
    W = (1/dataNum)*sumBETA;                    % (2.98)

    if dispOpt & dim==2, displayGmm(M, V, data); end
    if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
end
[prob, P]=gmmEval(data, M, V, W);
logProb(i)=sum(log(prob));
fprintf('Iteration count = %d, log prob. = %f\n',i, logProb(i));
logProb(i+1:maxLoopCount) = [];

% ====== Self Demo ======
function selfdemo
%[data, gaussianNum] = dcdata(2);
data = rand(1000,2);
gaussianNum = 8;
data=data';
plotOpt=1;
[M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);

pointNum = 40;
x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
[xx, yy] = meshgrid(x, y);
data = [xx(:) yy(:)]';
z = gmmEval(data, M, V, W);
zz = reshape(z, pointNum, pointNum);
figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
figure; contour(xx, yy, zz, 30); axis image

% ====== Other subfunctions ======
function displayGmm(M, V, data)
% Display function for EM algorithm
figureH=findobj(0, 'tag', mfilename);
if isempty(figureH)
    figureH=figure;
    set(figureH, 'tag', mfilename);
    colordef black
    plot(data(1,:), data(2,:),'.r'); axis image
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
    end
    set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
else
    circleH=findobj(figureH, 'tag', 'circleH');
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
    end
    drawnow
end

 Speaker identification:

Given speech data from N people, GMMs can be trained to model each of the N voices; then, given an utterance, we decide which of the N speakers it most resembles by computing its probability under each of the N GMMs and picking the largest. See: speaker recognition.

function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
% speakerIdentify: speaker identification using GMM parameters
%    Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
%        speakerData: structure array generated by speakerDataRead.m
%        speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
%        useIntGmm: use fixed-point GMM

%    Roger Jang, 20070517, 20080726

if nargin<3, useIntGmm=0; end

% ====== Speaker identification using GMM parameters
speakerNum=length(speakerData);
for i=1:speakerNum
%    fprintf('%d/%d: Recognizing wave files by %s\n', i, speakerNum, speakerData(i).name);
    for j=1:length(speakerData(i).sentence)
%        fprintf('\tSentece %d...\n', j);
        frameNum=size(speakerData(i).sentence(j).fea, 2);
        logProb=zeros(speakerNum, frameNum); % logProb(k,m): log probability of frame m of this sentence under speaker k's GMM
        % score this sentence against every candidate speaker
        for k=1:speakerNum,
%            fprintf('\t\tSpeaker %d...\n', k);
        %    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            if ~useIntGmm
            %    logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            else
            %    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            end
        end
        cumLogProb=sum(logProb, 2);
        [maxProb, index]=max(cumLogProb);
        speakerData(i).sentence(j).predictedSpeaker=index; % record the predicted speaker identity
        speakerData(i).sentence(j).logProb=logProb;
    end
end

% ====== Compute confusion matrix and recognition rate
confusionMatrix=zeros(speakerNum);
for i=1:speakerNum,
    predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
    [index, count]=elementCount(predictedSpeaker);
    confusionMatrix(i, index)=count;
end
recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));

GMM-HMM:

Training phase: fix the HMM's k states; the observations generated in each state are modeled by some probability distribution, here a GMM. In fact the whole GMM-HMM can be viewed as a single generative model. Given initial values for the model's five parameter sets (obtained from a mix of randomness and the training samples: the state priors, the transition matrix, and the per-state GMM weights, means, and covariances), run EM. E-step: obtain the distribution of the training samples over states, i.e., compute the probability of the samples under each state. M-step: re-estimate those five parameter sets from the training samples.

Testing phase (again taking isolated-word recognition as the example): given the frame matrix of a word's pronunciation, take one GMM-HMM model and compute the probability of each frame under each state of that model; combined with the model's transition and initial probabilities this yields the corresponding clique tree, and graphical-model inference then gives the probability that this model generated the speech. Compare all the GMM-HMM models and take the word whose model yields the highest probability.
