Automatic and manual gradient descent in TensorFlow: GradientDescent, Momentum, Adagrad
TensorFlow provides an automatic training mechanism (see the earlier post on tensorflow optimizer minimize automatic training and the var_list restriction). This post walks through several flavors of automatic gradient descent and adds a manual implementation of each.
Learning rate, step, and the update formula: w_new = w_old - η*dL/dw|w=w_old (worked out in detail below).
In prediction, x is the variable that y is computed from; but in training, w is the variable of the loss L, and x cannot change. Now you know why the weights are called Variable (an admittedly hand-wavy explanation).
Below, gradient descent is implemented by hand in TensorFlow.
To keep the formulas short, the code below renames things by their initials: loss (l), prediction (p), gradient (g), weight (w), plus y and x. η denotes the learning rate, and w0, w1, w2 denote the value of w at successive iterations, not separate variables.
loss = (y - p)^2 = (y - w*x)^2 = y^2 - 2*y*w*x + w^2*x^2
dL/dw = 2*w*x^2 - 2*y*x
Substituting into the gradient descent update rule:
w1 = w0 - η*dL/dw|w=w0
w2 = w1 - η*dL/dw|w=w1
w3 = w2 - η*dL/dw|w=w2
Initial values: y=3, x=1, w=2, L=1, dL/dw=-2, η=1
Update: w=4
Update: w=2
Update: w=4
So in this example, with x=1 and y=3, dL/dw conveniently equals 2w - 2y, i.e. twice the gap between prediction and label. A learning rate of 1 makes w bounce back and forth around the correct value and never converge; it is chosen only to keep the hand calculation easy to follow. Shrink the learning rate and increase the number of iterations and it converges.
Manual implementation of Gradient Descent:
#demo4: manual gradient descent in tensorflow
import tensorflow as tf

#y label
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x
#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
learning_rate = tf.constant(1,dtype=tf.float32)
#learning_rate = tf.constant(0.11,dtype=tf.float32)
init = tf.global_variables_initializer()
#update
update = tf.assign(w, w - learning_rate * g[0])
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(5):
        w_,g_,l_ = sess.run([w,g,l],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_,' and the loss is ',l_)
        _ = sess.run(update,feed_dict={x:1})
Results with learning rate = 1:
[[-2.0], 2.0, 2.0]
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 4.0 g is [2.0] and the loss is 1.0
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 4.0 g is [2.0] and the loss is 1.0
variable is w: 2.0 g is [-2.0] and the loss is 1.0
With the learning rate reduced and more iterations (the tail of the run):
variable is w: 2.9964619 g is [-0.007575512] and the loss is 1.4347095e-05
variable is w: 2.996695 g is [-0.0070762634] and the loss is 1.2518376e-05
variable is w: 2.996913 g is [-0.0066099167] and the loss is 1.0922749e-05
variable is w: 2.9971166 g is [-0.0061740875] and the loss is 9.529839e-06
variable is w: 2.9973066 g is [-0.0057668686] and the loss is 8.314193e-06
variable is w: 2.9974842 g is [-0.0053868294] and the loss is 7.2544826e-06
variable is w: 2.9976501 g is [-0.0050315857] and the loss is 6.3292136e-06
variable is w: 2.997805 g is [-0.004699707] and the loss is 5.5218115e-06
variable is w: 2.9979498 g is [-0.004389763] and the loss is 4.8175043e-06
variable is w: 2.998085 g is [-0.0041003227] and the loss is 4.2031616e-06
variable is w: 2.9982114 g is [-0.003829956] and the loss is 3.6671408e-06
variable is w: 2.9983294 g is [-0.0035772324] and the loss is 3.1991478e-06
SGD:
Note that TensorFlow has no optimizer interface called SGD (Stochastic Gradient Descent). SGD is more a data-feeding strategy than a concrete training method. Strictly speaking, following Andrew Ng's course, SGD trains on a single sample at a time; in practice, mini-batches of several samples are far more common, and as long as the data is fed in random order it counts as (mini-batch) SGD. A minimal sketch follows.
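The sketch below illustrates the point on the same kind of toy linear model; the dataset, batch size, epoch count and learning rate are made-up illustration values. "SGD" here is nothing more than GradientDescentOptimizer fed with shuffled mini-batches:

#sketch: SGD = GradientDescentOptimizer + shuffled mini-batches
import numpy as np
import tensorflow as tf

xs = np.arange(100, dtype=np.float32)              # toy inputs
ys = 3.0 * xs                                      # labels generated with the "true" w = 3
x = tf.placeholder(dtype=tf.float32)
y = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
l = tf.reduce_mean(tf.square(w * x - y))           # mean loss over the mini-batch
update = tf.train.GradientDescentOptimizer(0.0001).minimize(l)

batch_size = 10
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(20):
        idx = np.random.permutation(len(xs))       # random order: the "stochastic" part
        for i in range(0, len(xs), batch_size):
            batch = idx[i:i + batch_size]
            sess.run(update, feed_dict={x: xs[batch], y: ys[batch]})
    print('w after SGD-style training:', sess.run(w))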
Momentum gradient descent:
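The update rule implemented by hand further below is:
v_new = Mu * v + LR * g
w_new = w - v_new
As far as the TF1 docs describe it, MomentumOptimizer uses the equivalent accumulation form (accumulation = momentum*accumulation + gradient; variable -= learning_rate*accumulation); with a constant learning rate this traces out the same w values, which is why the two outputs below agree.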
#demo5.2 tensorflow momentum
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x
#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01,dtype=tf.float32)
init = tf.group(tf.global_variables_initializer(),tf.local_variables_initializer())
#update w
update = tf.train.MomentumOptimizer(LR, Mu).minimize(l)
with tf.Session() as sess:
    sess.run(init)
    #init was built before minimize(), so it does not cover the optimizer's momentum
    #slot variable; the initializer ops below are created after minimize() and do
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(10):
        w_,g_,l_ = sess.run([w,g,l],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_, ' and the loss is ',l_)
        sess.run([update],feed_dict={x:1})
Here are the first few iterations; note them carefully and compare with the manual implementation below.
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 2.02 g is [-1.96] and the loss is 0.96040004
variable is w: 2.0556 g is [-1.8888001] and the loss is 0.8918915
variable is w: 2.102968 g is [-1.794064] and the loss is 0.80466646
variable is w: 2.158803 g is [-1.682394] and the loss is 0.7076124
variable is w: 2.220295 g is [-1.5594101] and the loss is 0.60793996
variable is w: 2.2850826 g is [-1.4298348] and the loss is 0.5111069
variable is w: 2.351211 g is [-1.2975779] and the loss is 0.42092708
variable is w: 2.4170897 g is [-1.1658206] and the loss is 0.3397844
variable is w: 2.4814508 g is [-1.0370984] and the loss is 0.26889327
#demo5.2: manual momentum in tensorflow
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x
#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01,dtype=tf.float32)
#v = tf.Variable(0,tf.float32)  # wrong: the second positional argument of tf.Variable is trainable, not dtype
v = tf.Variable(0,dtype = tf.float32)
init = tf.global_variables_initializer()
#update w
update1 = tf.assign(v, Mu * v + g[0] * LR)
update2 = tf.assign(w, w - v)
#update = tf.group(update1,update2)  # wrong: tf.group does not guarantee that update1 runs before update2
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(10):
        w_,g_,l_,v_ = sess.run([w,g,l,v],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_, ' v is ',v_,' and the loss is ',l_)
        _ = sess.run([update1],feed_dict={x:1})
        _ = sess.run([update2],feed_dict={x:1})
Note the first rows of this output: they match the numbers from the automatic TensorFlow implementation above.
variable is w: 2.0 g is [-2.0] v is 0.0 and the loss is 1.0
variable is w: 2.0 g is [-2.0] v is -0.02 and the loss is 1.0
variable is w: 2.02 g is [-1.96] v is -0.0356 and the loss is 0.96040004
variable is w: 2.0556 g is [-1.8888001] v is -0.047367997 and the loss is 0.8918915
variable is w: 2.102968 g is [-1.794064] v is -0.05583504 and the loss is 0.80466646
variable is w: 2.158803 g is [-1.682394] v is -0.06149197 and the loss is 0.7076124
variable is w: 2.220295 g is [-1.5594101] v is -0.06478768 and the loss is 0.60793996
variable is w: 2.2850826 g is [-1.4298348] v is -0.06612849 and the loss is 0.5111069
variable is w: 2.351211 g is [-1.2975779] v is -0.06587857 and the loss is 0.42092708
variable is w: 2.4170897 g is [-1.1658206] v is -0.06436106 and the loss is 0.3397844
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
Next, an Adagrad example:
Adagrad has a bit of the flavor of using the Hessian, except that it uses an approximate second derivative, because computing true second derivatives is still quite expensive in deep learning.
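For reference, the per-step update that the manual implementation further below (demo6.2) performs, writing its Regular = 1e-8 as ε, is:
cache = cache + g^2
w = w - LR * g / (sqrt(cache) + ε)
so the accumulated squared gradients play the role of that approximate "second derivative".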
#demo6: adagrad optimizer in tensorflow
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x
#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
LR = tf.constant(0.6,dtype=tf.float32)
optimizer = tf.train.AdagradOptimizer(LR)
update = optimizer.minimize(l)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    #print(sess.run([g,p,w], {x: 1}))
    for _ in range(20):
        w_,l_,g_ = sess.run([w,l,g],feed_dict={x:1})
        print('variable is w:',w_, 'g:',g_ ,' and the loss is ',l_)
        _ = sess.run(update,feed_dict={x:1})
#demo6.2: manual adagrad
#with tf.name_scope('initial'):
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2,dtype=tf.float32,expected_shape=[1])
#accumulates the squared gradients (the "approximate second derivative")
second_derivative = tf.Variable(0,dtype=tf.float32)
LR = tf.constant(0.6,dtype=tf.float32)
Regular = 1e-8
#prediction
p = w*x
#loss
l = tf.square(p - y)
#gradients
g = tf.gradients(l, w)
#print(g)
#print(tf.square(g))
#update
update1 = tf.assign_add(second_derivative,tf.square(g[0]))
g_final = LR * g[0] / (tf.sqrt(second_derivative) + Regular)
update2 = tf.assign(w, w - g_final)
#update = tf.assign(w, w - LR * g[0])
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(20):
        _ = sess.run(update1,feed_dict={x:1.0})
        w_,g_,l_,g_sec_ = sess.run([w,g,l,second_derivative],feed_dict={x:1.0})
        print('variable is w:',w_, ' g is ',g_,' g_sec_ is ',g_sec_,' and the loss is ',l_)
        #sess.run(g_final)
        _ = sess.run(update2,feed_dict={x:1.0})
The results are close, though unfortunately not exactly the same; I don't know what parameters the optimizer uses internally or whether it applies any regularization, it is rather opaque. Below, first the output of the manual implementation, then the output of AdagradOptimizer.
Manual Adagrad output:
[[-2.0], 2.0, 2.0]
variable is w: 2.0 g is [-2.0] g_sec_ is 0.0 and the loss is 1.0
variable is w: 2.6 g is [-0.8000002] g_sec_ is 4.0 and the loss is 0.16000007
variable is w: 2.8228343 g is [-0.3543315] g_sec_ is 4.6400003 and the loss is 0.0313877
variable is w: 2.920222 g is [-0.15955591] g_sec_ is 4.765551 and the loss is 0.006364522
variable is w: 2.9639592 g is [-0.072081566] g_sec_ is 4.791009 and the loss is 0.0012989381
variable is w: 2.9837074 g is [-0.032585144] g_sec_ is 4.7962046 and the loss is 0.0002654479
variable is w: 2.9926338 g is [-0.014732361] g_sec_ is 4.7972665 and the loss is 5.4260614e-05
variable is w: 2.9966695 g is [-0.0066609383] g_sec_ is 4.7974834 and the loss is 1.1092025e-05
variable is w: 2.9984941 g is [-0.0030117035] g_sec_ is 4.797528 and the loss is 2.2675895e-06
variable is w: 2.999319 g is [-0.0013618469] g_sec_ is 4.797537 and the loss is 4.6365676e-07
variable is w: 2.9996922 g is [-0.0006155968] g_sec_ is 4.7975388 and the loss is 9.4739846e-08
variable is w: 2.9998608 g is [-0.0002784729] g_sec_ is 4.797539 and the loss is 1.9386789e-08
variable is w: 2.999937 g is [-0.00012588501] g_sec_ is 4.797539 and the loss is 3.961759e-09
variable is w: 2.9999716 g is [-5.6743622e-05] g_sec_ is 4.797539 and the loss is 8.0495965e-10
variable is w: 2.9999871 g is [-2.5749207e-05] g_sec_ is 4.797539 and the loss is 1.6575541e-10
variable is w: 2.9999943 g is [-1.1444092e-05] g_sec_ is 4.797539 and the loss is 3.274181e-11
variable is w: 2.9999974 g is [-5.2452087e-06] g_sec_ is 4.797539 and the loss is 6.8780537e-12
variable is w: 2.9999988 g is [-2.3841858e-06] g_sec_ is 4.797539 and the loss is 1.4210855e-12
variable is w: 2.9999995 g is [-9.536743e-07] g_sec_ is 4.797539 and the loss is 2.2737368e-13
variable is w: 2.9999998 g is [-4.7683716e-07] g_sec_ is 4.797539 and the loss is 5.684342e-14
AdagradOptimizer output:
variable is w: 2.0 g: [-2.0] and the loss is 1.0
variable is w: 2.5926378 g: [-0.81472445] and the loss is 0.16594398
variable is w: 2.816606 g: [-0.3667879] and the loss is 0.033633344
variable is w: 2.9160419 g: [-0.1679163] and the loss is 0.0070489706
variable is w: 2.9614334 g: [-0.07713318] and the loss is 0.0014873818
variable is w: 2.9822717 g: [-0.035456657] and the loss is 0.00031429363
variable is w: 2.9918494 g: [-0.016301155] and the loss is 6.6431916e-05
variable is w: 2.9962525 g: [-0.0074949265] and the loss is 1.404348e-05
variable is w: 2.998277 g: [-0.0034461021] and the loss is 2.968905e-06
variable is w: 2.9992077 g: [-0.0015845299] and the loss is 6.2768373e-07
variable is w: 2.9996357 g: [-0.0007286072] and the loss is 1.327171e-07
variable is w: 2.9998324 g: [-0.00033521652] and the loss is 2.809253e-08
variable is w: 2.999923 g: [-0.0001540184] and the loss is 5.930417e-09
variable is w: 2.9999645 g: [-7.104874e-05] and the loss is 1.2619807e-09
variable is w: 2.9999835 g: [-3.2901764e-05] and the loss is 2.7063152e-10
variable is w: 2.9999924 g: [-1.5258789e-05] and the loss is 5.820766e-11
variable is w: 2.9999964 g: [-7.1525574e-06] and the loss is 1.2789769e-11
variable is w: 2.9999983 g: [-3.33786e-06] and the loss is 2.7853275e-12
variable is w: 2.9999993 g: [-1.4305115e-06] and the loss is 5.1159077e-13
variable is w: 2.9999998 g: [-4.7683716e-07] and the loss is 5.684342e-14
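A plausible reason the two runs differ slightly, assuming tf.train.AdagradOptimizer keeps its TF1 default initial_accumulator_value=0.1 (the accumulator starts at 0.1 instead of 0): the optimizer's first step would then be
w1 = 2 - 0.6*(-2)/sqrt(0.1 + (-2)^2) = 2 + 1.2/sqrt(4.1) ≈ 2.5926
which matches the 2.5926378 above, while the manual version starts its accumulator at 0 and gets 2 + 1.2/2 = 2.6. Initializing second_derivative to 0.1 and dropping Regular should reproduce the optimizer's numbers.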
This example is only a demonstration. What really shows Adagrad's advantage is the multi-parameter case; with a single parameter Adagrad cannot show much of an edge. One of Adagrad's big strengths is that it coordinates the learning rates of different parameters: each parameter is constrained by its own accumulated "second derivative", so in the end they are all treated fairly.
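A minimal sketch of that multi-parameter point, with made-up feature scales of 1 and 10 (so the two gradients differ by roughly 100x): plain gradient descent with one shared learning rate would either crawl on w[0] or blow up on w[1], while Adagrad divides each gradient by the square root of its own accumulated squared gradients, so both weights move toward 3 at a similar pace.

#sketch: Adagrad coordinating two parameters with very different gradient scales
import tensorflow as tf

x = tf.constant([1.0, 10.0])                  # two features with very different scales
y = tf.constant([3.0, 30.0])                  # both terms are minimized at w = [3, 3]
w = tf.Variable([2.0, 2.0], dtype=tf.float32)
l = tf.reduce_sum(tf.square(w * x - y))       # per-parameter gradients differ ~100x
g = tf.gradients(l, w)
update = tf.train.AdagradOptimizer(0.6).minimize(l)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20):
        w_, g_ = sess.run([w, g])
        if i % 5 == 0:
            print('w:', w_, ' g:', g_)
        sess.run(update)
    print('final w:', sess.run(w))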