
Reading Notes on "The Matrix Calculus You Need For Deep Learning"


About the Tutorial

"The Matrix Calculus You Need For Deep Learning" is a free tutorial written by Professor Terence Parr of the University of San Francisco (creator of ANTLR) together with Jeremy Howard (co-founder of fast.ai). It is a quick introduction to the matrix calculus needed for deep learning. The tutorial is concise and accessible: a little calculus and some basic knowledge of neural networks is all you need to get started.

What the Tutorial Covers

The tutorial begins with a quick review of the scalar derivative rules, vector calculus, and partial derivatives; it then introduces matrix derivatives starting from the generalized Jacobian, and finally derives the gradient of a single neuron's output and the gradient of a neural-network loss function.

Summary of Contents

1. Introduction

Derivatives are an important part of machine learning, and especially of deep learning, which trains neural networks by optimizing a loss function. What is needed, however, is not the scalar calculus we learned before but so-called matrix calculus: the "marriage" of linear algebra and multivariable calculus.
We are already familiar with scalar differentiation; the most commonly used rules are the power rule, the product rule, and the chain rule. Note that we can already introduce the idea of an operator here: $\frac{d}{dx}$ can be regarded as a differential operator that maps a function to its derivative, which means that $\frac{d}{dx}f(x)$ and $\frac{df(x)}{dx}$ denote the same thing.
Now move to the multivariable case. Differentiating a multivariable function with respect to a single variable gives a partial derivative (written with $\frac{\partial}{\partial x}$). Collecting all the partial derivatives in a row vector gives the gradient of the function $f(x,y)$:
$$\nabla f(x,y)=\left[\frac{\partial f(x,y)}{\partial x},\ \frac{\partial f(x,y)}{\partial y}\right].$$
Going one step further, consider several functions of several variables. Alongside $f(x,y)$, add a second function $g(x,y)$. The gradients of the two functions can be stacked into a matrix, called the Jacobian matrix, whose rows are the gradients of the individual functions:
$$J=\begin{bmatrix}\nabla f(x,y)\\ \nabla g(x,y)\end{bmatrix}=\begin{bmatrix}\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}\end{bmatrix}.$$
And with that we have arrived at the core topic of this tutorial: matrix calculus!
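As a quick sanity check (my own sketch, not part of the tutorial), the Jacobian of a concrete pair of functions can be approximated with central differences and compared against the hand-computed partial derivatives; the functions $f(x,y)=3x^2y$ and $g(x,y)=2x+y^3$ below are hypothetical examples:

```python
import numpy as np

# Hypothetical example functions (my own, not from the tutorial):
# f(x, y) = 3 x^2 y  and  g(x, y) = 2 x + y^3.
def f(x, y): return 3 * x**2 * y
def g(x, y): return 2 * x + y**3

x, y, h = 2.0, 3.0, 1e-6

# Approximate each row of the Jacobian by central differences.
J_numeric = np.array([
    [(f(x + h, y) - f(x - h, y)) / (2 * h), (f(x, y + h) - f(x, y - h)) / (2 * h)],
    [(g(x + h, y) - g(x - h, y)) / (2 * h), (g(x, y + h) - g(x, y - h)) / (2 * h)],
])

# Hand-computed partials: row i is the gradient of the i-th function.
J_analytic = np.array([[6 * x * y, 3 * x**2],
                       [2.0,       3 * y**2]])

print(np.allclose(J_numeric, J_analytic, atol=1e-4))  # True
```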

2. Generalization of the Jacobian

Write the parameters as a vector: $\bold{x}=[x_1, x_2, \ldots, x_n]^T$.
Write the functions as a vector as well: $\bold{y}=\bold{f}(\bold{x})=[f_1(\bold{x}), f_2(\bold{x}), \ldots, f_m(\bold{x})]^T$, a vector of $m$ scalar functions.
In general, the Jacobian matrix is the collection of all $m \times n$ partial derivatives, that is, the stack of the $m$ gradients with respect to $\bold{x}$:
$$\frac{\partial \bold{y}}{\partial \bold{x}}=\begin{bmatrix} \nabla f_1(\bold{x}) \\ \nabla f_2(\bold{x}) \\ \vdots \\ \nabla f_m(\bold{x}) \end{bmatrix}=\begin{bmatrix} \frac{\partial}{\partial \bold{x}}f_1(\bold{x}) \\ \frac{\partial}{\partial \bold{x}}f_2(\bold{x}) \\ \vdots \\ \frac{\partial}{\partial \bold{x}}f_m(\bold{x})\end{bmatrix}=\begin{bmatrix} \frac{\partial f_1(\bold{x})}{\partial x_1} & \frac{\partial f_1(\bold{x})}{\partial x_2} & \cdots & \frac{\partial f_1(\bold{x})}{\partial x_n}\\ \frac{\partial f_2(\bold{x})}{\partial x_1} & \frac{\partial f_2(\bold{x})}{\partial x_2} & \cdots & \frac{\partial f_2(\bold{x})}{\partial x_n}\\ \vdots & \vdots & & \vdots \\ \frac{\partial f_m(\bold{x})}{\partial x_1} & \frac{\partial f_m(\bold{x})}{\partial x_2} & \cdots & \frac{\partial f_m(\bold{x})}{\partial x_n}\end{bmatrix}.$$
As a special case, suppose $\bold{f}(\bold{x})=\bold{x}$, i.e. $f_i(\bold{x})=x_i$. There are $n$ functions, each of $n$ parameters, so the Jacobian is a square matrix, and it is easy to see that it is the identity matrix $I$.
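This special case is easy to verify numerically. Below is a minimal NumPy sketch (my own, not from the tutorial) that approximates the Jacobian of a vector function by central differences; `jacobian_fd` is a hypothetical helper name:

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Finite-difference Jacobian of a vector-valued f at x: J[i, j] = d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x), dtype=float)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian_fd(lambda v: v, x))       # identity matrix, since f(x) = x
print(jacobian_fd(lambda v: v**2, x))    # diag(2 x), an element-wise operation
```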

3. Derivatives of Element-wise Binary Vector Operations

There are many element-wise binary operations on vectors, such as vector addition, subtraction, element-wise (pointwise) multiplication, and multiplying a vector by a scalar. In general, such an operation can be written as $\bold{y} = \bold{f}(\bold{w}) \circ \bold{g}(\bold{x})$, where $\circ$ stands for any element-wise operator.
For simple element-wise operations such as addition, subtraction, multiplication, and division, the partial derivatives of $\bold{y}$ with respect to $\bold{w}$ and $\bold{x}$ are as follows:
(Image: table of the partial derivatives of element-wise binary vector operations)
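For example, for element-wise multiplication $\bold{y}=\bold{w}\otimes\bold{x}$, each $y_i$ depends only on $w_i$ and $x_i$, so both Jacobians are diagonal: $\frac{\partial \bold{y}}{\partial \bold{w}}=\operatorname{diag}(\bold{x})$ and $\frac{\partial \bold{y}}{\partial \bold{x}}=\operatorname{diag}(\bold{w})$. A small NumPy check (my own sketch with hypothetical values):

```python
import numpy as np

# For y = w ⊗ x (element-wise product), y_i depends only on w_i and x_i,
# so both Jacobians are diagonal: dy/dw = diag(x) and dy/dx = diag(w).
w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

dy_dw = np.diag(x)   # d(w_i * x_i) / d w_i = x_i
dy_dx = np.diag(w)   # d(w_i * x_i) / d x_i = w_i

# Finite-difference check of dy/dw.
h = 1e-6
J = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J[:, j] = ((w + e) * x - (w - e) * x) / (2 * h)
print(np.allclose(J, dy_dw))  # True
```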

4. Vector Derivatives Involving Scalar Operations

Simply put, this covers adding a scalar to a vector or multiplying a vector by a scalar and then differentiating. The operations involved are still element-wise, so the scalar can be expanded into a vector whose entries all hold the same value. The results are straightforward:
$$\frac{\partial \bold{y}}{\partial \bold{x}}=\frac{\partial (\bold{x}+z)}{\partial \bold{x}}=I$$
$$\frac{\partial \bold{y}}{\partial z}=\frac{\partial (\bold{x}+z)}{\partial z}=\bold{1}$$
$$\frac{\partial \bold{y}}{\partial \bold{x}}=\frac{\partial (\bold{x}z)}{\partial \bold{x}}=Iz$$
$$\frac{\partial \bold{y}}{\partial z}=\frac{\partial (\bold{x}z)}{\partial z}=\bold{x}$$
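These results are easy to confirm numerically; the following snippet (my own sketch with hypothetical values) checks the two multiplication cases:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = 5.0
h = 1e-6

# y = x z: the Jacobian with respect to the vector x should equal z * I.
J_x = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J_x[:, j] = ((x + e) * z - (x - e) * z) / (2 * h)
print(np.allclose(J_x, z * np.eye(3)))   # True

# y = x z: the derivative with respect to the scalar z should equal the vector x.
dy_dz = (x * (z + h) - x * (z - h)) / (2 * h)
print(np.allclose(dy_dz, x))             # True
```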

5. Derivatives of Vector Sum Reduction

Summing the elements of a vector is an important operation in deep learning, appearing for example when computing a network's loss function. The same approach applies to differentiating dot products and other operations that collapse a vector into a scalar.
$$y=\operatorname{sum}(\bold{f}(\bold{x}))=\sum_{i=1}^{n}f_i(\bold{x})$$
$$\frac{\partial y}{\partial \bold{x}}=\begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}$$
Some concise results:
$$y=\operatorname{sum}(\bold{x}),\qquad \nabla y=\bold{1}$$
$$y=\operatorname{sum}(\bold{x}z),\qquad \frac{\partial y}{\partial \bold{x}}=\bold{1}z,\qquad \frac{\partial y}{\partial z}=\operatorname{sum}(\bold{x})$$
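A quick numerical confirmation of both sum results (my own sketch with hypothetical values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = 4.0
h = 1e-6
eye = np.eye(3)

# y = sum(x): every partial dy/dx_j is 1, so the gradient is a vector of ones.
grad = np.array([(np.sum(x + h * eye[j]) - np.sum(x - h * eye[j])) / (2 * h)
                 for j in range(3)])
print(np.allclose(grad, np.ones(3)))                      # True

# y = sum(x z): dy/dx_j = z for every j, and dy/dz = sum(x).
grad_x = np.array([(np.sum((x + h * eye[j]) * z) - np.sum((x - h * eye[j]) * z)) / (2 * h)
                   for j in range(3)])
dy_dz = (np.sum(x * (z + h)) - np.sum(x * (z - h))) / (2 * h)
print(np.allclose(grad_x, z * np.ones(3)), np.isclose(dy_dz, np.sum(x)))  # True True
```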

6. The Chain Rules

The basic matrix-differentiation rules alone cannot handle the partial derivatives of more complicated functions, such as nested functions; there we must combine the basic vector derivative rules with the vector chain rule. Unfortunately, several different rules go by the name "chain rule", so we have to be careful about which one we are using.

6.1 Single-variable chain rule

The single-variable chain rule is simply the chain rule we already know, applied when there is only one variable: for $y=f(u)$ with $u=g(x)$ it reads $\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}$.

6.2 Single-variable total-derivative chain rule

The single-variable chain rule has limited applicability: every intermediate variable must be a function of a single variable. The total derivative says that to compute $\frac{dy}{dx}$ we must add up every possible contribution of a change in $x$ to the change in $y$. The total derivative with respect to $x$ assumes that every variable is potentially a function of $x$ and may change as $x$ changes. For example, $f(x)=u_2(x,u_1)$ depends on $x$ both directly and indirectly through the intermediate variable $u_1(x)$, so the total derivative is:
$$\frac{dy}{dx}=\frac{\partial f(x)}{\partial x}=\frac{\partial u_2(x,u_1)}{\partial x}=\frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$
Its general form is:
$$\frac{\partial f(u_1,\ldots,u_{n+1})}{\partial x}=\sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
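A tiny numerical check of the total-derivative rule (my own example, not from the tutorial): with $u_1(x)=x^2$ and $f(x)=u_2(x,u_1)=x\,u_1$, the rule predicts $\frac{df}{dx}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=u_1+x\cdot 2x=3x^2$:

```python
import numpy as np

def u1(x):
    return x**2              # intermediate variable u1(x) = x^2

def f(x):
    return x * u1(x)         # u2(x, u1) = x * u1, so f(x) = x^3

x, h = 1.7, 1e-6

# Total derivative: the direct path du2/dx (holding u1 fixed) plus the
# indirect path du2/du1 * du1/dx.
total = u1(x) + x * (2 * x)          # = x^2 + 2 x^2 = 3 x^2

numeric = (f(x + h) - f(x - h)) / (2 * h)
print(np.isclose(total, numeric), total, 3 * x**2)   # True, both equal 3 x^2
```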

6.3 Vector chain rule

Let us start by differentiating a vector function $\bold{y}=\bold{f}(\bold{g}(x))$ with respect to a scalar and see whether a general formula emerges. Clearly,
$$\frac{\partial \bold{y}}{\partial x}=\begin{bmatrix} \frac{\partial f_1(\bold{g})}{\partial x} \\ \frac{\partial f_2(\bold{g})}{\partial x}\end{bmatrix}=\begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x}+ \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x}+ \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix}=\begin{bmatrix} \frac{\partial f_1}{\partial g_1}& \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1}& \frac{\partial f_2}{\partial g_2} \end{bmatrix}\begin{bmatrix} \frac{\partial g_1}{\partial x}\\ \frac{\partial g_2}{\partial x} \end{bmatrix}=\frac{\partial \bold{f}}{\partial \bold{g}}\frac{\partial \bold{g}}{\partial x}$$
Extending the scalar $x$ to a vector $\bold{x}$ and taking Jacobians again gives the full vector chain rule:
$$\frac{\partial}{\partial \bold{x}} \bold{f}(\bold{g}(\bold{x}))=\frac{\partial \bold{f}}{\partial \bold{g}} \frac{\partial \bold{g}}{\partial \bold{x}}$$

This equation can often be simplified further, because in many cases the Jacobians are square ($m=n$) and all off-diagonal elements are zero. This arises naturally in neural networks, which deal with functions of vectors rather than vectors of functions; for example, a neuron's affine function is $\operatorname{sum}(\bold{w} \otimes \bold{x})$ and the activation function is $\max(0,\bold{x})$.
As discussed earlier, for element-wise operations on vectors $\bold{w}$ and $\bold{x}$, the matrix of partial derivatives is a diagonal matrix with entries $\frac{\partial w_i}{\partial x_i}$, because $w_i$ is a function of $x_i$ but not of $x_j$ for $j\ne i$:
$$\frac{\partial \bold{f}}{\partial \bold{g}}=\operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\right)$$
$$\frac{\partial \bold{g}}{\partial \bold{x}}=\operatorname{diag}\!\left(\frac{\partial g_i}{\partial x_i}\right)$$
In that case the chain rule simplifies to:
$$\frac{\partial}{\partial \bold{x}} \bold{f}(\bold{g}(\bold{x}))=\operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\right)\operatorname{diag}\!\left(\frac{\partial g_i}{\partial x_i}\right)=\operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
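The simplified chain rule can be seen in action on an element-wise composition (my own example, not from the tutorial): with $g_i(x)=x_i^2$ and $f_i(\bold{g})=\sin(g_i)$, the composed Jacobian is $\operatorname{diag}(\cos(x_i^2)\cdot 2x_i)$:

```python
import numpy as np

# Element-wise composition: g_i(x) = x_i^2, f_i(g) = sin(g_i),
# so y_i = sin(x_i^2) and both Jacobians are diagonal.
x = np.array([0.5, 1.0, 1.5])

J_f = np.diag(np.cos(x**2))     # df_i/dg_i = cos(g_i) with g_i = x_i^2
J_g = np.diag(2 * x)            # dg_i/dx_i = 2 x_i
J_chain = J_f @ J_g             # diag(cos(x_i^2) * 2 x_i)

# Finite-difference check of the Jacobian of y = sin(x^2).
h = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J_fd[:, j] = (np.sin((x + e)**2) - np.sin((x - e)**2)) / (2 * h)
print(np.allclose(J_chain, J_fd))  # True
```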

6.4 Summary

In summary, the table below collects the chain rules and the corresponding products used to compute Jacobian matrices.
(Image: summary table of the single-variable, single-variable total-derivative, and vector chain rules)

7. Gradient of the Activation Function

The activation function of a single neuron:
$$\operatorname{activation}(\bold{x})=\max(0,\ \bold{w} \cdot \bold{x}+b)$$
Intermediate variables:
$$\bold{u} = \bold{w} \otimes \bold{x}$$
$$y=\operatorname{sum}(\bold{u})+b=\bold{w}\cdot\bold{x}+b$$
Partial derivatives:
$$\frac{\partial y}{\partial \bold{w}}=\frac{\partial}{\partial \bold{w}}(\bold{w}\cdot \bold{x})+\frac{\partial}{\partial \bold{w}}b=\bold{x}^T+\bold{0}^T=\bold{x}^T$$
$$\frac{\partial y}{\partial b}=\frac{\partial}{\partial b}(\bold{w}\cdot \bold{x})+\frac{\partial}{\partial b}b=0+1=1$$
Partial derivatives of the activation function:
$$\frac{\partial\,\operatorname{activation}}{\partial \bold{w}}=\begin{cases}\bold{0}^T, &\bold{w}\cdot \bold{x}+b \le 0 \\ \bold{x}^T, & \bold{w}\cdot \bold{x}+b>0 \end{cases}$$
$$\frac{\partial\,\operatorname{activation}}{\partial b}=\begin{cases}0, &\bold{w}\cdot \bold{x}+b \le 0 \\ 1, & \bold{w}\cdot \bold{x}+b>0 \end{cases}$$
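Putting the two cases together, the neuron's gradient fits in a few lines of NumPy and can be checked against finite differences (a sketch with hypothetical values, not the tutorial's code; `grad_activation` is my own helper name):

```python
import numpy as np

def activation(w, x, b):
    """Single ReLU neuron: max(0, w . x + b)."""
    return max(0.0, np.dot(w, x) + b)

def grad_activation(w, x, b):
    """Analytic gradients from the case analysis above: (d/dw, d/db)."""
    if np.dot(w, x) + b > 0:
        return x.copy(), 1.0            # active case: x^T and 1
    return np.zeros_like(x), 0.0        # inactive case: 0^T and 0

# Hypothetical values.
w = np.array([0.2, 0.5, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.4

dw, db = grad_activation(w, x, b)

# Finite-difference check.
h = 1e-6
eye = np.eye(3)
dw_fd = np.array([(activation(w + h * eye[j], x, b) -
                   activation(w - h * eye[j], x, b)) / (2 * h) for j in range(3)])
db_fd = (activation(w, x, b + h) - activation(w, x, b - h)) / (2 * h)
print(np.allclose(dw, dw_fd), np.isclose(db, db_fd))  # True True
```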

8. Gradient of the Neural-Network Loss Function

Training a neural network requires computing the derivative of the loss (or cost) function with respect to the model parameters $\bold{w}$ and $b$. We train with multiple input vectors (e.g. multiple images) and scalar target values (e.g. one class per image). Let
$$X=[\bold{x}_1,\bold{x}_2,\ldots,\bold{x}_N]^T$$
$$\bold{y}=[\operatorname{target}(\bold{x}_1),\operatorname{target}(\bold{x}_2),\ldots,\operatorname{target}(\bold{x}_N)]^T.$$
The cost function is then:
$$C(\bold{w},b,X,\bold{y})=\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i-\operatorname{activation}(\bold{x}_i)\bigr)^2=\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i-\max(0,\ \bold{w}\cdot \bold{x}_i+b)\bigr)^2$$
Now introduce intermediate variables:
$$u(\bold{w},b,\bold{x})=\max(0,\ \bold{w}\cdot \bold{x}+b)$$
$$v(y,u)=y-u$$
$$C(v)=\frac{1}{N}\sum_{i=1}^{N}v^2$$
The partial derivative with respect to the weights is therefore:
$$\begin{aligned}\frac{\partial C(v)}{\partial \bold{w}}&=\frac{\partial}{\partial \bold{w}} \frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial \bold{w}} \\ &=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold{0}^T, &\bold{w}\cdot \bold{x}_i+b \le 0 \\ -2v\bold{x}_i^T, & \bold{w}\cdot \bold{x}_i+b>0 \end{cases}\\ &=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold{0}^T, &\bold{w}\cdot \bold{x}_i+b \le 0 \\ -2(y_i-u)\bold{x}_i^T, & \bold{w}\cdot \bold{x}_i+b>0 \end{cases}\\ &=\begin{cases}\bold{0}^T, &\bold{w}\cdot \bold{x}_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold{w} \cdot\bold{x}_i+b-y_i)\bold{x}_i^T, & \bold{w}\cdot \bold{x}_i+b>0 \end{cases}\end{aligned}$$
Defining the error term $e_i=\bold{w}\cdot\bold{x}_i+b-y_i$, the partial derivative in the nonzero case becomes
$$\frac{\partial C}{\partial \bold{w}}=\frac{2}{N}\sum_{i=1}^{N}e_i\bold{x}_i^T.$$
Notice that this is a weighted average of all the $\bold{x}_i$ in $X$, with the error terms as the weights.
The partial derivative with respect to the bias is:
$$\begin{aligned}\frac{\partial C(v)}{\partial b}&=\frac{\partial}{\partial b} \frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N} \sum_{i=1}^{N}2v\frac{\partial v}{\partial b}\\ &=\frac{1}{N} \sum_{i=1}^{N}\begin{cases} 0, &\bold{w}\cdot \bold{x}_i+b \le 0 \\ -2v, & \bold{w}\cdot \bold{x}_i+b>0 \end{cases}\\ &=\begin{cases}0, &\bold{w}\cdot \bold{x}_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold{w} \cdot\bold{x}_i+b-y_i), & \bold{w}\cdot \bold{x}_i+b>0 \end{cases}\\ &=\frac{2}{N}\sum_{i=1}^{N}e_i\end{aligned}$$
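The two gradient formulas translate directly into vectorized NumPy over a batch (a sketch with hypothetical toy data; `loss_grads` is my own helper name):

```python
import numpy as np

def loss_grads(w, b, X, y):
    """Gradients of the mean-squared loss of a single ReLU neuron,
    vectorized over the batch, following the formulas above."""
    z = X @ w + b                       # pre-activations w . x_i + b
    e = z - y                           # error terms e_i = w . x_i + b - y_i
    active = (z > 0).astype(float)      # indicator of the nonzero case
    dC_dw = 2.0 / len(y) * (active * e) @ X
    dC_db = 2.0 / len(y) * np.sum(active * e)
    return dC_dw, dC_db

# Hypothetical toy data.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
y = np.array([1.0, 0.0, 2.0])
w = np.array([0.1, -0.2])
b = 0.05

print(loss_grads(w, b, X, y))
```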
In practice it is more convenient to merge $\bold{w}$ and $b$ into a single vector $\hat{\bold{w}}=[\bold{w}^T,b]^T$ and to extend each input vector to $\hat{\bold{x}}=[\bold{x}^T,1]^T$, so that $\bold{w}\cdot\bold{x}+b=\hat{\bold{w}}\cdot\hat{\bold{x}}$.
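A one-line check of the bias trick (my own sketch with hypothetical values):

```python
import numpy as np

# The "bias trick": append b to w and a constant 1 to x.
w = np.array([0.1, -0.2, 0.4])
b = 0.05
x = np.array([1.0, 2.0, 3.0])

w_hat = np.append(w, b)      # [w^T, b]^T
x_hat = np.append(x, 1.0)    # [x^T, 1]^T

print(np.isclose(np.dot(w, x) + b, np.dot(w_hat, x_hat)))  # True
```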