
PRML Equation Derivations: 1.69–1.72

Original post: https://www.cnblogs.com/wacc/p/5495448.html

Bayesian Linear Regression

Problem background:

To stay consistent with Chapter 1 of PRML, we assume the data are generated from a Gaussian distribution:

\[p(t|x,\mathbf{w},\beta)=\mathcal{N}(t|y(x,\mathbf{w}),\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(t-y(x,\mathbf{w}))^2) \]

where \(\beta\) is the precision and \(y(x,\mathbf{w})=\sum\limits_{j=0}^Mw_jx^j\). The prior over \(\mathbf{w}\) is:

\[p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})=(\frac{\alpha}{2\pi})^{(M+1)/2}\exp(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}) \]

where \(\alpha\) is the precision of this Gaussian.
For notational convenience we define the feature map \(\phi(x)=(1,x,x^2,\dots,x^M)^T\), so that \(y(x,\mathbf{w})=\mathbf{w}^T\phi(x)\). To make inferences about \(\mathbf{w}\), we collect data and update the prior. Denote the observed inputs by \(\mathbf{x}_N=\{x_1,\dots,x_N\}\) and the targets by \(\mathbf{t}_N=\{t_1,\dots,t_N\}\), where \(t_i\) is the response corresponding to \(x_i\). We further introduce a matrix \(\Phi\), defined as:

\[\Phi=\begin{bmatrix}\phi(x_1)^T\\\phi(x_2)^T\\\vdots\\\phi(x_N)^T\end{bmatrix} \]

We can think of this matrix as being built by stacking the rows \(\phi(x_i)^T\).
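As a quick illustration (not part of the original post), the feature map and design matrix can be sketched in NumPy; the degree \(M\) and the inputs below are arbitrary choices:

```python
import numpy as np

def phi(x, M):
    """Polynomial feature map: phi(x) = (1, x, x^2, ..., x^M)^T."""
    return np.array([x ** j for j in range(M + 1)])

def design_matrix(xs, M):
    """Stack the rows phi(x_i)^T to form the N x (M+1) matrix Phi."""
    return np.stack([phi(x, M) for x in xs])

M = 3
xs = np.linspace(0.0, 1.0, 5)
Phi = design_matrix(xs, M)
print(Phi.shape)  # (5, 4): N rows, M+1 columns
```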

Detailed derivation

First we verify that \(p(\mathbf{w}|\mathbf{x},\mathbf{t})\) is a Gaussian distribution. By Bayes' theorem we have:

\[\begin{aligned}p(\mathbf{w}|\mathbf{x},\mathbf{t})&\propto p(\mathbf{w}|\alpha)p(\mathbf{t}|\mathbf{x},\mathbf{w})\\&=(\frac{\alpha}{2\pi})^{(M+1)/2} \exp(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w})\prod_{i=1}^N\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(t_i-\mathbf{w}^T\phi(x_i))^2)\\&\propto \exp(-\frac{1}{2}\Big\{\alpha \mathbf{w}^T\mathbf{w}+\beta\sum_{i=1}^N(t_i^2-2t_i\mathbf{w}^T\phi(x_i)+\mathbf{w}^T\phi(x_i)\phi(x_i)^T\mathbf{w})\Big\})\\&\propto \exp(-\frac{1}{2}\Big\{ \mathbf{w}^T(\alpha \mathbf{I}+\beta\sum_{i=1}^N\phi(x_i)\phi(x_i)^T)\mathbf{w}-2\beta\sum_{i=1}^N t_i\phi(x_i)^T\mathbf{w}\Big\})\\&\propto \exp(-\frac{1}{2}\Big\{\mathbf{w}^T(\alpha \mathbf{I}+\beta \Phi^T\Phi)\mathbf{w} -2\beta(\Phi^T\mathbf{t}_N)^T\mathbf{w}\Big\})\end{aligned} \]

Let \(S^{-1}=\alpha \mathbf{I}+\beta\Phi^T\Phi\) and \(\mu=\beta S\Phi^T\mathbf{t}_N\) (note: equation 1.72 in the original book is misprinted); then

\[p(\mathbf{w}|\mathbf{x},\mathbf{t})\propto \exp(-\frac{1}{2}(\mathbf{w}-\mu)^TS^{-1}(\mathbf{w}-\mu))\propto \mathcal{N}(\mathbf{w}|\mu,S) \]
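The posterior update above can be sketched numerically; everything here (data, \(\alpha\), \(\beta\), \(M\)) is synthetic and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, alpha, beta = 3, 20, 2.0, 25.0

xs = rng.uniform(0.0, 1.0, N)
Phi = np.stack([[x ** j for j in range(M + 1)] for x in xs])
w_true = rng.normal(size=M + 1)
t = Phi @ w_true + rng.normal(scale=beta ** -0.5, size=N)

# Posterior precision and mean: S^{-1} = alpha*I + beta*Phi^T Phi, mu = beta*S*Phi^T t
S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)
mu = beta * S @ Phi.T @ t
print(mu.shape)  # (4,)
```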

We have thus shown that the posterior is also a Gaussian. Next we compute the predictive distribution. Since \(p(t|x,\mathbf{x},\mathbf{t})\) is a convolution of two Gaussians, the result is again Gaussian; but for rigor, we verify this directly.

\[\begin{aligned}p(t|x,\mathbf{x},\mathbf{t})&=\int p(t|x,\mathbf{w})p(\mathbf{w}|\mathbf{x},\mathbf{t})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}} \int \exp(-\frac{1}{2}\Big\{\beta(t-\mathbf{w}^T\phi(x))^2+(\mathbf{w}-\mu)^TS^{-1}(\mathbf{w}-\mu)\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\int\exp(-\frac{1}{2}\Big\{\beta t^2-2\beta t\phi(x)^T\mathbf{w}+\beta\mathbf{w}^T\phi(x)\phi(x)^T\mathbf{w}\\&+\mathbf{w}^TS^{-1}\mathbf{w}-2\mu^T S^{-1}\mathbf{w}+\mu^T S^{-1}\mu\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot \\&\int\exp(-\frac{1}{2}\Big\{-2\beta t\phi(x)^T\mathbf{w}+\beta\mathbf{w}^T\phi(x)\phi(x)^T\mathbf{w}+\mathbf{w}^TS^{-1}\mathbf{w}-2\mu^T S^{-1}\mathbf{w}\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot\\&\int\exp(-\frac{1}{2}\Big\{\mathbf{w}^T\underbrace{(\beta\phi(x)\phi(x)^T+S^{-1})}_{\Lambda^{-1}}\mathbf{w}-2(\underbrace{\beta t\phi(x)^T+\mu^TS^{-1}}_{m^T\Lambda^{-1}})\mathbf{w}\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot\\&\int\exp(-\frac{1}{2}\Big\{\mathbf{w}^T\Lambda^{-1}\mathbf{w}-2m^T\Lambda^{-1}\mathbf{w}+m^T\Lambda^{-1}m-m^T\Lambda^{-1}m\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\cdot\\&\int\exp(-\frac{1}{2}\Big\{(\mathbf{w}-m)^T\Lambda^{-1}(\mathbf{w}-m)\Big\})d\mathbf{w}\\&=\frac{(2\pi)^{(M+1)/2}}{(2\pi)^{M/2+1}}\frac{|\Lambda|^{1/2}}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\\&=\frac{1}{(2\pi)^{1/2}}\frac{|(\beta\phi(x)\phi(x)^T+S^{-1})^{-1}|^{1/2}}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\\&=\frac{1}{(2\pi)^{1/2}}\frac{1}{(\beta^{-1}|S||\beta\phi(x)\phi(x)^T+S^{-1}|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\end{aligned} \]

At this point I did not know how to proceed, so I went looking online for a way forward and eventually found the paper "Modeling Inverse Covariance Matrices by Basis Expansion", which introduces the following lemma.
Lemma (rank-one perturbation of a symmetric matrix). Let \(\alpha\in\mathbb{R}\), \(\mathbf{a}\in\mathbb{R}^d\), and let \(P\in\mathbb{R}^{d\times d}\) be invertible. If \(\alpha\neq -(\mathbf{a}^TP^{-1}\mathbf{a})^{-1}\), then the rank-one perturbation \(P+\alpha \mathbf{a} \mathbf{a}^T\) is invertible, and

\[(P+\alpha \mathbf{a} \mathbf{a}^T)^{-1}=P^{-1}-\frac{\alpha P^{-1}\mathbf{a}\mathbf{a}^T P^{-1}}{1+\alpha \mathbf{a}^TP^{-1}\mathbf{a}} \]

\[\det(P+\alpha \mathbf{a} \mathbf{a}^T)=(1+\alpha \mathbf{a}^T P^{-1}\mathbf{a})\det(P) \]
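Both identities of the lemma are easy to sanity-check numerically on a random symmetric positive-definite \(P\) (a spot check, not a proof; the dimensions and \(\alpha\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 4, 0.7
A = rng.normal(size=(d, d))
P = A @ A.T + d * np.eye(d)        # symmetric positive definite, hence invertible
a = rng.normal(size=d)

P_inv = np.linalg.inv(P)

# Inverse identity (Sherman-Morrison form)
lhs_inv = np.linalg.inv(P + alpha * np.outer(a, a))
rhs_inv = P_inv - (alpha * np.outer(P_inv @ a, P_inv @ a)) / (1 + alpha * a @ P_inv @ a)
print(np.allclose(lhs_inv, rhs_inv))  # True

# Determinant identity (matrix determinant lemma)
lhs_det = np.linalg.det(P + alpha * np.outer(a, a))
rhs_det = (1 + alpha * a @ P_inv @ a) * np.linalg.det(P)
print(np.isclose(lhs_det, rhs_det))   # True
```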

This lemma says that when the covariance matrix is perturbed by a rank-one outer product, both the inverse and the determinant of the perturbed matrix admit closed-form expansions. Concretely, applying it to \(|\beta\phi(x)\phi(x)^T+S^{-1}|\), we find

\[|\beta\phi(x)\phi(x)^T+S^{-1}|=(1+\beta \phi(x)^T S\phi(x))\det(S^{-1})=(1+\beta \phi(x)^T S\phi(x))/|S| \]

Hence

\[\frac{1}{(\beta^{-1}|S||\beta\phi(x)\phi(x)^T+S^{-1}|)^{1/2}}=\frac{1}{(\beta^{-1}|S|\cdot \frac{(1+\beta \phi(x)^T S\phi(x))}{|S|})^{1/2}}=\frac{1}{(\beta^{-1}+\phi(x)^T S\phi(x))^{1/2}} \]

Next consider the exponential factor \(\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\). Noting that \(\mu=\beta S\Phi^T\mathbf{t}_N\), we have \(\mu^TS^{-1}\mu=\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)\). Meanwhile, applying the lemma above gives

\[\Lambda=(\beta\phi(x)\phi(x)^T+S^{-1})^{-1}=S-\frac{\beta S\phi(x)\phi(x)^TS}{1+\beta \phi(x)^TS\phi(x)}=S-\frac{ S\phi(x)\phi(x)^TS}{\beta^{-1}+\phi(x)^TS\phi(x)} \]

Using these two relations, we push the derivation further:

\[\begin{aligned}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))&=\exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)-(\beta t\phi(x)^T+\beta (\Phi^T\mathbf{t}_N)^T)\Lambda\Lambda^{-1}\Lambda(\beta t\phi(x)+\beta \Phi^T\mathbf{t}_N)\Big\})\\&=\exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)-\big[\beta^2 t^2\phi(x)^T\Lambda\phi(x)+2\beta^2 t\phi(x)^T\Lambda (\Phi^T\mathbf{t}_N)+\beta^2 (\Phi^T\mathbf{t}_N)^T\Lambda(\Phi^T\mathbf{t}_N)\big]\Big\})\\&= \exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)+\big[-\beta^2 t^2\phi(x)^TS\phi(x)+\beta^3 t^2 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\\ &-2\beta^2 t\phi(x)^TS(\Phi^T\mathbf{t}_N)+2\beta^3 t\frac{\phi(x)^TS\phi(x)\,\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\\&-\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)+\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\big]\Big\})\\&=\exp(-\frac{1}{2}\Big\{\big(\beta-\beta^2\phi(x)^TS\phi(x)+\beta^3 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\big)t^2\\ &+2\big(\beta^3 \frac{\phi(x)^TS\phi(x)\,\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)\big)t+\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\Big\})\end{aligned}\]

We now examine each coefficient, starting with the coefficient of \(t^2\):

\[\begin{aligned} \beta-\beta^2\phi(x)^TS\phi(x)+\beta^3 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}&=\beta+\frac{-\beta^2\phi(x)^TS\phi(x)[1+\beta\phi(x)^TS\phi(x)]+\beta^3 (\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\\&=\beta-\frac{\beta^2\phi(x)^TS\phi(x)}{1+\beta \phi(x)^TS\phi(x)}\\&=\frac{\beta(1+\beta \phi(x)^TS\phi(x))-\beta^2\phi(x)^TS\phi(x)}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{\beta}{1+\beta\phi(x)^TS\phi(x)}=\frac{1}{\beta^{-1}+\phi(x)^TS\phi(x)}\end{aligned} \]
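This scalar simplification can be spot-checked directly, writing \(s\) for \(\phi(x)^TS\phi(x)\) (the particular values of \(\beta\) and \(s\) are arbitrary positive numbers):

```python
# Check: beta - beta^2*s + beta^3*s^2/(1+beta*s) == 1/(beta^{-1} + s)
beta, s = 25.0, 0.37
lhs = beta - beta ** 2 * s + beta ** 3 * s ** 2 / (1 + beta * s)
rhs = 1.0 / (1.0 / beta + s)
print(abs(lhs - rhs) < 1e-9)  # True
```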

Next, the coefficient of \(t\):

\[\begin{aligned}\beta^3 \frac{\phi(x)^TS\phi(x)\,\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)&=\frac{\beta^3\phi(x)^TS\phi(x)\,\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)(1+\beta\phi(x)^TS\phi(x))}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{\beta^3\phi(x)^TS\phi(x)\,\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^3\phi(x)^TS(\Phi^T\mathbf{t}_N)\phi(x)^TS\phi(x)}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)}{\beta^{-1}+\phi(x)^TS\phi(x)}\end{aligned} \]

Finally, the constant term:

\[\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}=\frac{\beta^2(\phi(x)^TS(\Phi^T\mathbf{t}_N))^2}{\beta^{-1}+\phi(x)^TS\phi(x)} \]

Putting these together, we have

\[\begin{aligned}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))&=\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}\Big\{t^2-2\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)t+\beta^2(\phi(x)^TS(\Phi^T\mathbf{t}_N))^2\Big\})\\&=\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}(t-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N))^2)\end{aligned} \]

Combining all of the above, we obtain

\[p(t|x,\mathbf{x},\mathbf{t})=\frac{1}{\sqrt{2\pi\cdot(\beta^{-1}+\phi(x)^TS\phi(x))}}\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}(t-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N))^2) \]

\[ m(x)=\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)=\mu^T\phi(x)\\ s^2(x)=\beta^{-1}+\phi(x)^TS\phi(x)\]
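These two formulas translate directly into code; the sketch below uses synthetic data and arbitrary settings for \(\alpha\), \(\beta\), and \(M\), and evaluates \(m(x)\) and \(s^2(x)\) at a new input:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, alpha, beta = 3, 20, 2.0, 25.0
xs = rng.uniform(0.0, 1.0, N)
Phi = np.stack([[x ** j for j in range(M + 1)] for x in xs])
t = rng.normal(size=N)

S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predictive(x_new):
    """Predictive mean m(x) and variance s^2(x) at a new input x."""
    ph = np.array([x_new ** j for j in range(M + 1)])
    m = beta * ph @ S @ Phi.T @ t      # m(x) = beta * phi(x)^T S Phi^T t_N
    s2 = 1.0 / beta + ph @ S @ ph      # s^2(x) = beta^{-1} + phi(x)^T S phi(x)
    return m, s2

m, s2 = predictive(0.5)
print(s2 > 1.0 / beta)  # True: the variance never drops below the noise floor
```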

These two expressions correspond to equations 1.70–1.71 in PRML. In equation 1.71, the first term represents the noise in the data (the smaller this variance, the more concentrated the data and the smaller the uncertainty); the second term represents the uncertainty about the parameters \(\mathbf{w}\). When \(N\to \infty\), the second term tends to 0: as the amount of data grows without bound, the uncertainty about the parameters gradually disappears and the influence of the prior fades. To prove this, first consider \(S_{N+1}\):

\[S_{N+1}=(\alpha I+\beta \sum_{i=1}^N \phi(x_i)\phi(x_i)^T+\beta \phi(x_{N+1})\phi(x_{N+1})^T)^{-1}=(S_N^{-1}+\beta \phi(x_{N+1})\phi(x_{N+1})^T)^{-1}\\=S_N-\beta\frac{S_N\phi(x_{N+1})\phi(x_{N+1})^TS_N}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})}\\=S_N-\frac{\beta}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})} (S_N\phi(x_{N+1}))(S_N\phi(x_{N+1}))^T \]

Hence

\[\sigma_{N+1}^2(x)=\beta^{-1}+\phi(x)^TS_{N+1}\phi(x)=\sigma_N^2(x)-\frac{\beta}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})}[ \phi(x)^T(S_N\phi(x_{N+1}))]^2\leq \sigma_N^2(x) \]

Therefore the sequence \(\sigma_N^2(x)\) is monotonically decreasing, and since \(S_N\) is positive definite it is bounded below by \(\beta^{-1}\), so it converges. Moreover, as \(N\to\infty\) the parameter-uncertainty term \(\phi(x)^TS_N\phi(x)\) tends to 0, so \(\sigma_N^2(x)\to\beta^{-1}\): only the noise variance remains.
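The monotone decrease of \(\sigma_N^2(x)\) can also be observed numerically by applying the rank-one precision update one point at a time (synthetic inputs; \(\alpha\), \(\beta\), \(M\), and the query point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
M, alpha, beta = 3, 2.0, 25.0
x_query = 0.3
ph_q = np.array([x_query ** j for j in range(M + 1)])

S_inv = alpha * np.eye(M + 1)   # prior precision, before any data
variances = []
for x_new in rng.uniform(0.0, 1.0, 50):
    ph = np.array([x_new ** j for j in range(M + 1)])
    S_inv = S_inv + beta * np.outer(ph, ph)      # rank-one update of S_N^{-1}
    variances.append(1.0 / beta + ph_q @ np.linalg.inv(S_inv) @ ph_q)

print(np.all(np.diff(variances) <= 1e-12))  # True: sigma_N^2(x) is non-increasing
print(variances[-1] > 1.0 / beta)           # True: bounded below by beta^{-1}
```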

Thus we conclude

\[p(t|x,\mathbf{x},\mathbf{t})=\mathcal{N}(t|m(x),s^2(x)) \]

That is, the posterior predictive distribution is also a Gaussian, whose mean and variance depend on \(x\).
Note that if \(x\) satisfied \(\beta=-(\phi(x)^TS\phi(x))^{-1}\), the variance would be

\[s^2(x)=\beta^{-1}+\phi(x)^TS\phi(x)=-\phi(x)^TS\phi(x)+\phi(x)^TS\phi(x)=0 \]

so the distribution would be undefined at that point. In fact this case cannot arise here: since \(\beta>0\) and \(S\) is positive definite, \(1+\beta\phi(x)^TS\phi(x)>0\) always.