Reading Notes: "The Matrix Calculus You Need For Deep Learning"
About the Tutorial
"The Matrix Calculus You Need For Deep Learning" is a free tutorial by Terence Parr (professor at the University of San Francisco and creator of ANTLR) and Jeremy Howard (founder of fast.ai). It is a quick introduction to the matrix calculus used in deep learning. The tutorial is concise and accessible: a little calculus and some basic knowledge of neural networks are all you need to get started.
What the Tutorial Covers
The tutorial begins with a quick review of scalar derivative rules, vector calculus, and partial derivatives, then introduces matrix derivatives via the generalized Jacobian, and finally derives the gradient of a single neuron's output and the gradient of a neural network's loss function.
Summary of Contents
1. Introduction
Derivatives are a core part of machine learning, and of deep learning in particular, where neural networks are trained by optimizing a loss function. What this requires is not the scalar calculus learned earlier, but so-called matrix calculus: the "marriage" of linear algebra and multivariable calculus.
Scalar differentiation is already familiar; the commonly used rules are the power rule, the product rule, and the chain rule. Note that we can already introduce the notion of an operator here: $\frac{d}{dx}$ is the differentiation operator that maps a function to its derivative, which means $\frac{d}{dx}f(x)$ denotes applying that operator to $f(x)$.
Next, consider the multivariable case. Differentiating a multivariable function with respect to a single variable gives a partial derivative (written with $\frac{\partial}{\partial x}$). Collecting all the partial derivatives in a row vector gives the gradient of the function $f(x,y)$:

$$\nabla f(x,y)=\left[\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}\right]$$
Going one step further, consider several functions of several variables. Alongside $f(x,y)$, add a second function $g(x,y)$. The gradients of the two functions can be stacked into a matrix, called the Jacobian matrix, with one row per gradient:
$$J=\begin{bmatrix}\nabla f(x,y)\\ \nabla g(x,y)\end{bmatrix}=\begin{bmatrix}\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}\end{bmatrix}$$
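This 2×2 Jacobian can be checked numerically by central differences. A minimal sketch, using the hypothetical example functions $f(x,y)=3x^2y$ and $g(x,y)=2x+y$ (not from the notes):

```python
def f(x, y):
    return 3 * x**2 * y

def g(x, y):
    return 2 * x + y

def jacobian(funcs, x, y, h=1e-6):
    """Finite-difference Jacobian: one row per function, one column per variable."""
    rows = []
    for fn in funcs:
        d_dx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
        d_dy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
        rows.append([d_dx, d_dy])
    return rows

# Analytic values at (2, 3): df/dx = 6xy = 36, df/dy = 3x^2 = 12, dg/dx = 2, dg/dy = 1
J = jacobian([f, g], x=2.0, y=3.0)
```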
This is the core subject of the tutorial: matrix calculus!
2. Generalization of the Jacobian
Write the parameters as a vector, $\bold{x}=[x_1\; x_2\; ...\; x_n]^T$.
Likewise write the functions as a vector, $\bold{y}=\bold{f}(\bold{x})=[f_1(\bold x)\; f_2(\bold x)\; ...\; f_m(\bold x)]^T$, a vector of $m$ scalar functions.
In general, the Jacobian matrix is the collection of all $m \times n$ partial derivatives: the $m$ gradients with respect to $\bold x$, stacked on top of one another:
$$\frac{\partial \bold y}{\partial \bold x}=\begin{bmatrix} \nabla f_1(\bold x) \\ \nabla f_2(\bold x) \\ \vdots \\ \nabla f_m(\bold x) \end{bmatrix}=\begin{bmatrix} \frac{\partial}{\partial \bold x}f_1(\bold x) \\ \frac{\partial}{\partial \bold x}f_2(\bold x) \\ \vdots \\ \frac{\partial}{\partial \bold x}f_m(\bold x)\end{bmatrix}=\begin{bmatrix} \frac{\partial}{\partial x_1}f_1(\bold x) & \frac{\partial}{\partial x_2}f_1(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_1(\bold x)\\ \frac{\partial}{\partial x_1}f_2(\bold x) & \frac{\partial}{\partial x_2}f_2(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_2(\bold x)\\ & \vdots & \\ \frac{\partial}{\partial x_1}f_m(\bold x) & \frac{\partial}{\partial x_2}f_m(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_m(\bold x)\end{bmatrix}$$
As a special case, suppose $\bold f(\bold x)=\bold x$, i.e. $f_i(\bold x)=x_i$. There are $n$ functions, each of $n$ parameters, so the Jacobian is a square matrix, and it is easy to see that it is the identity matrix $I$.
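The special case can be confirmed numerically: a finite-difference Jacobian of the identity map comes out as the identity matrix. A minimal sketch (the helper `vec_jacobian` is my own, not from the tutorial):

```python
def vec_jacobian(f, x, h=1e-6):
    """m x n Jacobian of f: R^n -> R^m by central differences."""
    n = len(x)
    m = len(f(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

identity = lambda x: list(x)          # f(x) = x, so f_i(x) = x_i
J = vec_jacobian(identity, [1.0, 2.0, 3.0])   # expect the 3x3 identity
```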
3. Derivatives of Element-wise Binary Operations
There are many element-wise binary operations on vectors, such as vector addition, subtraction, element-wise multiplication, and multiplying a vector by a scalar. In general, such an operation can be written as $\bold y = \bold f(\bold w) \circ \bold g(\bold x)$, where $\circ$ stands for any element-wise operator.
For the simple element-wise operators (addition, subtraction, multiplication, division), the Jacobians of $\bold y$ with respect to $\bold w$ and $\bold x$ are diagonal. For example, for $\bold y = \bold w + \bold x$ we get $\frac{\partial \bold y}{\partial \bold w}=\frac{\partial \bold y}{\partial \bold x}=I$, and for the element-wise product $\bold y = \bold w \otimes \bold x$ we get $\frac{\partial \bold y}{\partial \bold w}=diag(\bold x)$ and $\frac{\partial \bold y}{\partial \bold x}=diag(\bold w)$.
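These diagonal Jacobians can be verified numerically. A minimal sketch using the element-wise product $\bold y = \bold w \otimes \bold x$, whose Jacobian with respect to $\bold w$ should be $diag(\bold x)$ (the helper and the sample values are my own):

```python
def numeric_jac(f, v, h=1e-6):
    """m x n Jacobian of f at v by central differences."""
    n = len(v)
    m = len(f(v))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        vp = list(v); vp[j] += h
        vm = list(v); vm[j] -= h
        fp, fm = f(vp), f(vm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [5.0, 7.0]
y_of_w = lambda w: [wi * xi for wi, xi in zip(w, x)]   # y_i = w_i * x_i
J = numeric_jac(y_of_w, [1.0, 2.0])   # expect diag(x) = [[5, 0], [0, 7]]
```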
4. Vector Derivatives Involving Scalar Operations
In short: add a scalar to a vector, or multiply a vector by a scalar, then differentiate. The operations involved are still element-wise, so the scalar can be expanded into a vector whose entries all hold the same value. The results are simple:
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold x+z)}{\partial \bold x}=I$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold x+z)}{\partial z}=\bold 1$$
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold x z)}{\partial \bold x}=Iz$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold x z)}{\partial z}=\bold x$$
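A quick numerical check of the two multiplication rules for $\bold y = \bold x z$: perturbing $z$ should recover $\bold x$, and perturbing a component of $\bold x$ should recover $z$ on the diagonal (sample values are hypothetical):

```python
z = 3.0
x = [2.0, 5.0]
h = 1e-6

def y(x_vec, z_scalar):
    # y = x * z, element-wise scaling of the vector by the scalar
    return [xi * z_scalar for xi in x_vec]

# dy/dz by central differences: should equal x itself
dy_dz = [(yp - ym) / (2 * h)
         for yp, ym in zip(y(x, z + h), y(x, z - h))]

# dy/dx: perturbing x_1 changes only y_1, by a factor of z -> Jacobian is z*I
xp = [x[0] + h, x[1]]
xm = [x[0] - h, x[1]]
dy1_dx1 = (y(xp, z)[0] - y(xm, z)[0]) / (2 * h)   # should be about z
```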
5. Derivatives of Vector Sums
Summing the elements of a vector is an important operation in deep learning, e.g. when computing a network's loss function. The same approach applies to dot products and other operations that reduce a vector to a scalar.
$$y=sum(\bold f(\bold x))=\sum_{i=1}^{n}f_i(\bold x)$$
$$\frac{\partial y}{\partial \bold x}=\begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}$$
Some concise results:
$$y=sum(\bold x),\quad \nabla y=\bold 1$$
$$y=sum(\bold x z),\quad \frac{\partial y}{\partial \bold x}=\bold 1 z,\quad \frac{\partial y}{\partial z}=sum(\bold x)$$
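The first rule, $\nabla\, sum(\bold x) = \bold 1$, can be checked by finite differences: nudging any component of $\bold x$ changes the sum at rate 1. A minimal sketch:

```python
def grad_sum(x, h=1e-6):
    """Numerical gradient of y = sum(x); should be the all-ones vector."""
    g = []
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        g.append((sum(xp) - sum(xm)) / (2 * h))
    return g

g = grad_sum([1.0, 4.0, 9.0])   # expect [1, 1, 1]
```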
6. The Chain Rule
The basic matrix differentiation rules cannot handle the derivatives of complicated functions such as nested functions; for those, the basic vector rules must be combined with the vector chain rule. Unfortunately, several different rules go by the name "chain rule", so we must be careful about which one we use.
6.1 The Single-variable Chain Rule
The single-variable chain rule is the familiar one learned earlier, where there is only one variable.
6.2 The Single-variable Total-derivative Chain Rule
The single-variable chain rule has limited applicability: every intermediate variable must be a function of a single variable. The total derivative accounts for this: to compute $\frac{dy}{dx}$, we must add up all possible contributions of changes in $x$ to changes in $y$. The total derivative with respect to $x$ assumes every variable is a function of $x$ and may vary as $x$ varies. For example, if $f(x)=u_2(x,u_1)$ depends on $x$ both directly and indirectly through the intermediate variable $u_1(x)$, then the total derivative is:
$$\frac{dy}{dx}=\frac{\partial f(x)}{\partial x}=\frac{\partial u_2(x,u_1)}{\partial x}=\frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$
Its general form is:
$$\frac{\partial f(u_1,...,u_{n+1})}{\partial x}=\sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
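A concrete arithmetic check of the total-derivative rule, with hypothetical intermediates $u_1(x)=x^2$ and $u_2(x,u_1)=x\,u_1$, so $f(x)=x^3$ and the rule should give $\frac{du_2}{dx}=u_1 + x\cdot 2x = 3x^2$:

```python
x = 2.0
u1 = x**2                    # intermediate variable u1(x) = x^2
du1_dx = 2 * x               # du1/dx
du2_dx_partial = u1          # partial of u2 = x*u1 with u1 held fixed
du2_du1 = x                  # partial of u2 with x held fixed

# Total derivative: direct contribution + indirect contribution via u1
total = du2_dx_partial + du2_du1 * du1_dx   # 4 + 2*4 = 12 = 3x^2
```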
6.3 The Vector Chain Rule
We begin by computing the derivative of a vector function $\bold y=\bold f(\bold g(x))$ with respect to a scalar, and see whether a general formula can be abstracted from it. Clearly:
$$\frac{\partial \bold y}{\partial x}=\begin{bmatrix} \frac{\partial f_1(\bold g)}{\partial x} \\ \frac{\partial f_2(\bold g)}{\partial x}\end{bmatrix}=\begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix}\\=\begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} \end{bmatrix}\begin{bmatrix} \frac{\partial g_1}{\partial x}\\ \frac{\partial g_2}{\partial x} \end{bmatrix}=\frac{\partial \bold f}{\partial \bold g}\frac{\partial \bold g}{\partial x}$$
When the variable $x$ is extended to a vector $\bold x$ and we take the Jacobian again, the full vector chain rule becomes:
$$\frac{\partial}{\partial \bold x} \bold f(\bold g(\bold x))=\frac{\partial \bold f}{\partial \bold g} \frac{\partial \bold g}{\partial \bold x}$$
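The product of Jacobians can be checked on small hypothetical maps, $\bold g(\bold x) = (x_1^2,\; x_2^3)$ and $\bold f(\bold g) = (g_1+g_2,\; g_1 g_2)$, where the analytic Jacobians are easy to write down and their product matches the Jacobian of the composite:

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x1, x2 = 2.0, 3.0
g = [x1**2, x2**3]                        # g = (4, 27)
df_dg = [[1.0, 1.0], [g[1], g[0]]]        # Jacobian of f w.r.t. g
dg_dx = [[2 * x1, 0.0], [0.0, 3 * x2**2]] # Jacobian of g w.r.t. x

J = matmul(df_dg, dg_dx)                  # chain-rule product df/dx
# Direct check: f1 = x1^2 + x2^3 has gradient (2x1, 3x2^2) = (4, 27);
# f2 = x1^2 * x2^3 has gradient (2x1*x2^3, 3x1^2*x2^2) = (108, 108).
```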
This equation can be simplified further, because in many cases the Jacobians are square ($m=n$) with zero off-diagonal entries. This is a natural property of neural networks, which involve functions of vectors rather than vectors of functions: for instance, a neuron's affine function is $sum(\bold w \otimes \bold x)$ and its activation function is $max(0,\bold x)$.
As discussed earlier, element-wise operations on vectors $\bold w$ and $\bold x$ have a diagonal Jacobian with entries $\frac{\partial w_i}{\partial x_i}$, because $w_i$ is a function of $x_i$ but not of $x_j$ ($j\ne i$).
$$\frac{\partial \bold f}{\partial \bold g}=diag\left(\frac{\partial f_i}{\partial g_i}\right)$$
$$\frac{\partial \bold g}{\partial \bold x}=diag\left(\frac{\partial g_i}{\partial x_i}\right)$$
The chain rule then simplifies to:
$$\frac{\partial}{\partial \bold x} \bold f(\bold g(\bold x))=diag\left(\frac{\partial f_i}{\partial g_i}\right)diag\left(\frac{\partial g_i}{\partial x_i}\right)=diag\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
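When both Jacobians are diagonal, the matrix product collapses to an element-wise product of the diagonals, which is why deep-learning frameworks never materialize these matrices. A minimal sketch with hypothetical diagonal values:

```python
# Diagonals of the two Jacobians (hypothetical values)
df_dg_diag = [2.0, 3.0, 4.0]   # entries of diag(df_i/dg_i)
dg_dx_diag = [5.0, 6.0, 7.0]   # entries of diag(dg_i/dx_i)

# diag(a) @ diag(b) == diag(a * b), element-wise
df_dx_diag = [a * b for a, b in zip(df_dg_diag, dg_dx_diag)]
```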
6.4 Summary
In summary, the tutorial provides a table of the chain-rule products used to compute each kind of Jacobian.
7. The Gradient of the Activation Function
The activation function:
$$activation(\bold x)=max(0,\bold w \cdot \bold x+b)$$
Intermediate variables:
$$\bold u = \bold w \cdot \bold x+b$$
$$y=sum(\bold u)$$
Partial derivatives:
$$\frac{\partial y}{\partial \bold w}=\frac{\partial}{\partial \bold w}\bold w\cdot \bold x+\frac{\partial}{\partial \bold w}b=\bold x^T+\bold 0^T=\bold x^T$$
$$\frac{\partial y}{\partial b}=\frac{\partial}{\partial b}\bold w\cdot \bold x+\frac{\partial}{\partial b}b=0+1=1$$
Partial derivatives of the activation function:
$$\frac{\partial activation}{\partial \bold w}=\begin{cases}\bold 0^T, &\bold w\cdot \bold x+b \le 0 \\ \bold x^T, & \bold w\cdot \bold x+b>0 \end{cases}$$
$$\frac{\partial activation}{\partial b}=\begin{cases}0, &\bold w\cdot \bold x+b \le 0 \\ 1, & \bold w\cdot \bold x+b>0 \end{cases}$$
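The two case splits above translate directly into code. A minimal sketch (the helper names and sample values are my own):

```python
def activation_grad_w(w, x, b):
    """d(activation)/dw: the zero vector if w.x + b <= 0, else x^T."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return [0.0] * len(x) if z <= 0 else list(x)

def activation_grad_b(w, x, b):
    """d(activation)/db: 0 if w.x + b <= 0, else 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 0.0 if z <= 0 else 1.0

gw = activation_grad_w([1.0, 2.0], [3.0, 4.0], b=-5.0)   # z = 6 > 0, so x
gb = activation_grad_b([1.0, -2.0], [3.0, 4.0], b=0.0)   # z = -5 <= 0, so 0
```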
8. The Gradient of the Loss Function
Training a neural network requires the derivatives of the loss (or cost) function with respect to the model parameters $\bold w$ and $b$. Since training uses multiple input vectors (e.g. multiple images) with scalar targets (e.g. one class per image), let:
$$X=[\bold x_1,\bold x_2,...,\bold x_N]^T$$
$$\bold y=[target(\bold x_1),target(\bold x_2),...,target(\bold x_N)]^T$$
The cost function is then:
$$C(\bold w,b,X,\bold y)=\frac{1}{N}\sum_{i=1}^{N}(y_i-activation(\bold x_i))^2=\frac{1}{N}\sum_{i=1}^{N}(y_i-max(0,\bold w\cdot \bold x_i+b))^2$$
Now introduce intermediate variables:
$$u(\bold w,b,\bold x)=max(0,\bold w\cdot \bold x+b)$$
$$v(y,u)=y-u$$
$$C(v)=\frac{1}{N}\sum_{i=1}^{N}v^2$$
The partial derivative with respect to the weights is therefore:
$$\frac{\partial C(v)}{\partial \bold w}=\frac{\partial}{\partial \bold w} \frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial \bold w} \\ =\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ -2v\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}\\ =\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ -2(y_i-u)\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}\\ =\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold w \cdot\bold x_i+b-y_i)\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}$$
Defining the error term $e_i=\bold w \cdot\bold x_i+b-y_i$, the nonzero case of the partial derivative becomes $\frac{\partial C}{\partial \bold w}=\frac{2}{N}\sum_{i=1}^{N}e_i\bold x_i^T$. The result is thus a weighted average of all the $\bold x_i$ in $X$, with the error terms as the weights.
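The weight gradient can be sketched as code: each sample with $\bold w\cdot\bold x_i+b>0$ contributes its error-weighted input, and samples below the threshold contribute nothing. A minimal sketch on a tiny hypothetical dataset (names and values are my own):

```python
def cost_grad_w(w, b, X, y):
    """(2/N) * sum of e_i * x_i over samples with w.x_i + b > 0."""
    N = len(X)
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        if z > 0:
            e = z - y_i                       # error term e_i = w.x_i + b - y_i
            for j in range(len(w)):
                grad[j] += (2.0 / N) * e * x_i[j]
    return grad

# Two one-dimensional samples with targets 0: errors are 1 and 2,
# so the gradient is (2/2) * (1*1 + 2*2) = 5.
g = cost_grad_w(w=[1.0], b=0.0, X=[[1.0], [2.0]], y=[0.0, 0.0])
```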
The partial derivative with respect to the bias is:
$$\frac{\partial C(v)}{\partial b}=\frac{\partial}{\partial b} \frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N} \sum_{i=1}^{N}2v\frac{\partial v}{\partial b}\\ =\frac{1}{N} \sum_{i=1}^{N}\begin{cases} 0, &\bold w\cdot \bold x_i+b \le 0 \\ -2v, & \bold w\cdot \bold x_i+b>0 \end{cases}\\ =\begin{cases}0, &\bold w\cdot \bold x_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold w \cdot\bold x_i+b-y_i), & \bold w\cdot \bold x_i+b>0 \end{cases}\\ =\frac{2}{N}\sum_{i=1}^{N}e_i$$
In practice it is more convenient to merge the vector $\bold w$ and the scalar $b$ into a single vector, $\hat{\bold w}=[\bold w^T,b]^T$. Extending the input vector $\bold x$ to $\hat{\bold x}=[\bold x^T,1]^T$ then gives $\bold w \cdot\bold x+b=\hat{\bold w}\cdot\hat{\bold x}$.
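The bias-absorbing trick is a one-liner in code: append $b$ to $\bold w$ and a constant 1 to $\bold x$, and the affine function becomes a single dot product. A quick check with hypothetical values:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w, b = [2.0, 3.0], 5.0
x = [1.0, 4.0]

w_hat = w + [b]          # [w^T, b]^T
x_hat = x + [1.0]        # [x^T, 1]^T

lhs = dot(w, x) + b      # 2*1 + 3*4 + 5 = 19
rhs = dot(w_hat, x_hat)  # same value as a single dot product
```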