
Implementing automatic and manual gradient descent in TensorFlow: GradientDescent, Momentum, Adagrad

TensorFlow provides an automatic training mechanism (see the earlier article on tensorflow optimizer minimize automatic training and the var_list training restriction). This article walks through several of the built-in gradient-descent optimizers and pairs each one with a manual implementation.

The learning rate, the update step, and the formula w_new = w_old - η*dL/dw are worked out below.

In prediction, y varies as a function of the input x; during training, however, the loss L is a function of w, and x does not change. That is why the weights are the tf.Variable here (admittedly a hand-wavy way to put it).

Gradient descent is implemented manually in TensorFlow below.

To keep the formulas short, the code below renames things using initials: l for loss, p for prediction, g for gradient, w for weight, plus y and x. η is the learning rate, and w0, w1, w2 denote the value of w at successive iterations, not separate variables.

loss=(y-p)^2=(y-w*x)^2=(y^2-2*y*w*x+w^2*x^2)

dl/dw = 2*w*x^2-2*y*x

Substituting into the gradient-descent update rule:

w1 = w0-η*dL/dw|w=w0

w2 = w1 - η*dL/dw|w=w1

w3 = w2 - η*dL/dw|w=w2

Initial values: y=3, x=1, w=2, l=1, dl/dw=-2, η=1

Update: w=4

Update: w=2

Update: w=4

So in this example (x=1, y=3), dl/dw happens to equal 2w-2y, i.e. twice the gap between the prediction and the label. With learning rate = 1, w just bounces back and forth around the correct value and never converges; that value was chosen only to make the hand computation easy to follow. Shrink the learning rate and add more iterations and it converges.
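The hand computation above can be sanity-checked with a few lines of plain Python (a minimal sketch; the variable names are only for illustration):

#sketch: verify the w = 2 -> 4 -> 2 -> 4 oscillation by hand, no TensorFlow needed
y, x, w, eta = 3.0, 1.0, 2.0, 1.0
for step in range(3):
    g = 2 * w * x * x - 2 * y * x   # dl/dw from the formula above
    w = w - eta * g                 # gradient-descent update
    print(step, w)                  # prints 4.0, then 2.0, then 4.0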

Manual implementation of Gradient Descent:

#demo4:manual gradient descent in tensorflow
import tensorflow as tf   # TF 1.x graph/session-style API is used throughout these demos

#y label
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x

#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
learning_rate = tf.constant(1,dtype=tf.float32)
#learning_rate = tf.constant(0.11,dtype=tf.float32)
init = tf.global_variables_initializer()

#update
update = tf.assign(w, w - learning_rate * g[0])

with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(5):
        w_,g_,l_ = sess.run([w,g,l],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_,'  and the loss is ',l_)

        _ = sess.run(update,feed_dict={x:1})

Results:

learning rate=1

[[-2.0], 2.0, 2.0]
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0
variable is w: 4.0  g is  [2.0]   and the loss is  1.0
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0
variable is w: 4.0  g is  [2.0]   and the loss is  1.0
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0

After shrinking the learning rate (the commented-out 0.11) and running more iterations, the tail of the run looks like this:

variable is w: 2.9964619  g is  [-0.007575512]   and the loss is  1.4347095e-05
variable is w: 2.996695  g is  [-0.0070762634]   and the loss is  1.2518376e-05
variable is w: 2.996913  g is  [-0.0066099167]   and the loss is  1.0922749e-05
variable is w: 2.9971166  g is  [-0.0061740875]   and the loss is  9.529839e-06
variable is w: 2.9973066  g is  [-0.0057668686]   and the loss is  8.314193e-06
variable is w: 2.9974842  g is  [-0.0053868294]   and the loss is  7.2544826e-06
variable is w: 2.9976501  g is  [-0.0050315857]   and the loss is  6.3292136e-06
variable is w: 2.997805  g is  [-0.004699707]   and the loss is  5.5218115e-06
variable is w: 2.9979498  g is  [-0.004389763]   and the loss is  4.8175043e-06
variable is w: 2.998085  g is  [-0.0041003227]   and the loss is  4.2031616e-06
variable is w: 2.9982114  g is  [-0.003829956]   and the loss is  3.6671408e-06
variable is w: 2.9983294  g is  [-0.0035772324]   and the loss is  3.1991478e-06

SGD:

Note that TensorFlow has no optimizer interface named SGD (Stochastic Gradient Descent). SGD is really a data-feeding strategy rather than a specific training method. Strictly speaking (following Andrew Ng's course), SGD trains on a single sample at a time; in practice, mini-batches of several samples are far more common. As long as the samples are fed in a randomized order, it counts as (mini-batch) SGD, as the sketch below illustrates.
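As an illustration of that feeding strategy (plain NumPy with made-up data; this is not a TensorFlow API):

#sketch: mini-batch SGD as a feeding strategy - reshuffle every epoch, update on random batches
import numpy as np

train_x = np.random.rand(16).astype(np.float32)   # made-up samples
train_y = 3.0 * train_x                           # the true w is 3
w, lr, batch_size = 2.0, 0.1, 4
for epoch in range(50):
    idx = np.random.permutation(len(train_x))     # random order each epoch
    for start in range(0, len(train_x), batch_size):
        b = idx[start:start + batch_size]         # one random mini-batch
        g = np.mean(2 * (w * train_x[b] - train_y[b]) * train_x[b])  # dL/dw on the batch
        w -= lr * g
print(w)   # ends up close to 3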

Momentum gradient descent:
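For reference, tf.train.MomentumOptimizer applies (per the TF 1.x documentation):

accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation

The manual version further down folds the learning rate into the accumulator instead (v = Mu*v + LR*g, then w = w - v), which produces the same updates as long as the learning rate is constant.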

#demo5.2 tensorflow momentum


y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x

#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01,dtype=tf.float32)

#update w
update = tf.train.MomentumOptimizer(LR, Mu).minimize(l)

#the initializer is built after minimize(): the optimizer's momentum slot
#variable only exists once minimize() has been called
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(10):
        w_,g_,l_ = sess.run([w,g,l],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_, '  and the loss is ',l_)

        sess.run([update],feed_dict={x:1})

Here are the first few iterations; keep them in mind for comparison with the manual implementation below.

variable is w: 2.0  g is  [-2.0]   and the loss is  1.0
variable is w: 2.02  g is  [-1.96]   and the loss is  0.96040004
variable is w: 2.0556  g is  [-1.8888001]   and the loss is  0.8918915
variable is w: 2.102968  g is  [-1.794064]   and the loss is  0.80466646
variable is w: 2.158803  g is  [-1.682394]   and the loss is  0.7076124
variable is w: 2.220295  g is  [-1.5594101]   and the loss is  0.60793996
variable is w: 2.2850826  g is  [-1.4298348]   and the loss is  0.5111069
variable is w: 2.351211  g is  [-1.2975779]   and the loss is  0.42092708
variable is w: 2.4170897  g is  [-1.1658206]   and the loss is  0.3397844
variable is w: 2.4814508  g is  [-1.0370984]   and the loss is  0.26889327
#demo5.2:manual momentum in tensorflow

y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x

#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01,dtype=tf.float32)
#v = tf.Variable(0,tf.float32)  # wrong: tf.Variable's second positional argument is trainable, not dtype
v = tf.Variable(0,dtype = tf.float32)
init = tf.global_variables_initializer()

#update w
update1 = tf.assign(v, Mu * v + g[0] * LR )
update2 = tf.assign(w, w - v)
#update = tf.group(update1,update2)  # wrong: tf.group() does not enforce an order between the two assigns, so w may be updated before v

with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(10):
        w_,g_,l_,v_ = sess.run([w,g,l,v],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_, ' v is ',v_,'  and the loss is ',l_)

        _ = sess.run([update1],feed_dict={x:1})
        _ = sess.run([update2],feed_dict={x:1})

Note that the first few numbers in this trace are the same as those produced by the built-in optimizer above.

variable is w: 2.0  g is  [-2.0]  v is  0.0   and the loss is  1.0
variable is w: 2.0  g is  [-2.0]  v is  -0.02   and the loss is  1.0
variable is w: 2.02  g is  [-1.96]  v is  -0.0356   and the loss is  0.96040004
variable is w: 2.0556  g is  [-1.8888001]  v is  -0.047367997   and the loss is  0.8918915
variable is w: 2.102968  g is  [-1.794064]  v is  -0.05583504   and the loss is  0.80466646
variable is w: 2.158803  g is  [-1.682394]  v is  -0.06149197   and the loss is  0.7076124
variable is w: 2.220295  g is  [-1.5594101]  v is  -0.06478768   and the loss is  0.60793996
variable is w: 2.2850826  g is  [-1.4298348]  v is  -0.06612849   and the loss is  0.5111069
variable is w: 2.351211  g is  [-1.2975779]  v is  -0.06587857   and the loss is  0.42092708
variable is w: 2.4170897  g is  [-1.1658206]  v is  -0.06436106   and the loss is  0.3397844
variable is w: 2.9999995  g is  [-9.536743e-07]  v is  -4.7683734e-08   and the loss is  2.2737368e-13
variable is w: 2.9999995  g is  [-9.536743e-07]  v is  -4.7683734e-08   and the loss is  2.2737368e-13
variable is w: 2.9999995  g is  [-9.536743e-07]  v is  -4.7683734e-08   and the loss is  2.2737368e-13
variable is w: 2.9999995  g is  [-9.536743e-07]  v is  -4.7683734e-08   and the loss is  2.2737368e-13
variable is w: 2.9999995  g is  [-9.536743e-07]  v is  -4.7683734e-08   and the loss is  2.2737368e-13
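A quick hand check of the first step, using the numbers from the printouts (g = -2, Mu = 0.8, LR = 0.01, v starting at 0):

manual form: v1 = 0.8*0 + 0.01*(-2) = -0.02, then w = 2 - (-0.02) = 2.02
TF form:     accumulation1 = 0.8*0 + (-2) = -2, then w = 2 - 0.01*(-2) = 2.02

Both land on w = 2.02, matching the traces above; with a constant learning rate the two formulations are equivalent.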

Next, the Adagrad examples:

Adagrad has a bit of the flavor of using the Hessian, except that it relies on an approximate "second derivative" (the running sum of squared gradients), because computing true second derivatives is still very expensive in deep learning.
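Concretely, the per-parameter update that the manual demo below implements is:

accum = accum + g^2
w = w - LR * g / (sqrt(accum) + ε)

so each weight's effective learning rate shrinks as its own squared gradients accumulate.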

#demo6:adagrad optimizer in tensorflow

y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x

#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
LR = tf.constant(0.6,dtype=tf.float32)
optimizer = tf.train.AdagradOptimizer(LR)
update = optimizer.minimize(l)
init = tf.global_variables_initializer()


with tf.Session() as sess:
    sess.run(init)
    #print(sess.run([g,p,w], {x: 1}))
    for _ in range(20):
        w_,l_,g_ = sess.run([w,l,g],feed_dict={x:1})
        print('variable is w:',w_, 'g:',g_ ,'  and the loss is ',l_)

        _ = sess.run(update,feed_dict={x:1})
#demo6.2:manual adagrad

#with tf.name_scope('initial'):

y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2,dtype=tf.float32,expected_shape=[1])
second_derivative = tf.Variable(0,dtype=tf.float32)
LR = tf.constant(0.6,dtype=tf.float32)
Regular = 1e-8

#prediction
p = w*x
#loss
l = tf.square(p - y)
#gradients
g = tf.gradients(l, w)
#print(g)
#print(tf.square(g))

#update
update1 = tf.assign_add(second_derivative,tf.square(g[0]))
g_final = LR * g[0] / (tf.sqrt(second_derivative) + Regular)
update2 = tf.assign(w, w - g_final)

#update = tf.assign(w, w - LR * g[0])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(20):
        _ = sess.run(update1,feed_dict={x:1.0})
        w_,g_,l_,g_sec_ = sess.run([w,g,l,second_derivative],feed_dict={x:1.0})
        print('variable is w:',w_, ' g is ',g_,' g_sec_ is ',g_sec_,'  and the loss is ',l_)
        #sess.run(g_final)

        _ = sess.run(update2,feed_dict={x:1.0})

The results are close but not exactly the same. The likely culprit is that tf.train.AdagradOptimizer starts its accumulator at initial_accumulator_value, which defaults to 0.1 rather than the 0 used in the manual version; those defaults are easy to miss at the call site, which is what makes the built-in optimizer feel opaque. A hand check of its first step follows the outputs below.

[[-2.0], 2.0, 2.0]
variable is w: 2.0  g is  [-2.0]  g_sec_ is  0.0   and the loss is  1.0
variable is w: 2.6  g is  [-0.8000002]  g_sec_ is  4.0   and the loss is  0.16000007
variable is w: 2.8228343  g is  [-0.3543315]  g_sec_ is  4.6400003   and the loss is  0.0313877
variable is w: 2.920222  g is  [-0.15955591]  g_sec_ is  4.765551   and the loss is  0.006364522
variable is w: 2.9639592  g is  [-0.072081566]  g_sec_ is  4.791009   and the loss is  0.0012989381
variable is w: 2.9837074  g is  [-0.032585144]  g_sec_ is  4.7962046   and the loss is  0.0002654479
variable is w: 2.9926338  g is  [-0.014732361]  g_sec_ is  4.7972665   and the loss is  5.4260614e-05
variable is w: 2.9966695  g is  [-0.0066609383]  g_sec_ is  4.7974834   and the loss is  1.1092025e-05
variable is w: 2.9984941  g is  [-0.0030117035]  g_sec_ is  4.797528   and the loss is  2.2675895e-06
variable is w: 2.999319  g is  [-0.0013618469]  g_sec_ is  4.797537   and the loss is  4.6365676e-07
variable is w: 2.9996922  g is  [-0.0006155968]  g_sec_ is  4.7975388   and the loss is  9.4739846e-08
variable is w: 2.9998608  g is  [-0.0002784729]  g_sec_ is  4.797539   and the loss is  1.9386789e-08
variable is w: 2.999937  g is  [-0.00012588501]  g_sec_ is  4.797539   and the loss is  3.961759e-09
variable is w: 2.9999716  g is  [-5.6743622e-05]  g_sec_ is  4.797539   and the loss is  8.0495965e-10
variable is w: 2.9999871  g is  [-2.5749207e-05]  g_sec_ is  4.797539   and the loss is  1.6575541e-10
variable is w: 2.9999943  g is  [-1.1444092e-05]  g_sec_ is  4.797539   and the loss is  3.274181e-11
variable is w: 2.9999974  g is  [-5.2452087e-06]  g_sec_ is  4.797539   and the loss is  6.8780537e-12
variable is w: 2.9999988  g is  [-2.3841858e-06]  g_sec_ is  4.797539   and the loss is  1.4210855e-12
variable is w: 2.9999995  g is  [-9.536743e-07]  g_sec_ is  4.797539   and the loss is  2.2737368e-13
variable is w: 2.9999998  g is  [-4.7683716e-07]  g_sec_ is  4.797539   and the loss is  5.684342e-14

variable is w: 2.0 g: [-2.0]   and the loss is  1.0
variable is w: 2.5926378 g: [-0.81472445]   and the loss is  0.16594398
variable is w: 2.816606 g: [-0.3667879]   and the loss is  0.033633344
variable is w: 2.9160419 g: [-0.1679163]   and the loss is  0.0070489706
variable is w: 2.9614334 g: [-0.07713318]   and the loss is  0.0014873818
variable is w: 2.9822717 g: [-0.035456657]   and the loss is  0.00031429363
variable is w: 2.9918494 g: [-0.016301155]   and the loss is  6.6431916e-05
variable is w: 2.9962525 g: [-0.0074949265]   and the loss is  1.404348e-05
variable is w: 2.998277 g: [-0.0034461021]   and the loss is  2.968905e-06
variable is w: 2.9992077 g: [-0.0015845299]   and the loss is  6.2768373e-07
variable is w: 2.9996357 g: [-0.0007286072]   and the loss is  1.327171e-07
variable is w: 2.9998324 g: [-0.00033521652]   and the loss is  2.809253e-08
variable is w: 2.999923 g: [-0.0001540184]   and the loss is  5.930417e-09
variable is w: 2.9999645 g: [-7.104874e-05]   and the loss is  1.2619807e-09
variable is w: 2.9999835 g: [-3.2901764e-05]   and the loss is  2.7063152e-10
variable is w: 2.9999924 g: [-1.5258789e-05]   and the loss is  5.820766e-11
variable is w: 2.9999964 g: [-7.1525574e-06]   and the loss is  1.2789769e-11
variable is w: 2.9999983 g: [-3.33786e-06]   and the loss is  2.7853275e-12
variable is w: 2.9999993 g: [-1.4305115e-06]   and the loss is  5.1159077e-13
variable is w: 2.9999998 g: [-4.7683716e-07]   and the loss is  5.684342e-14
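A hand check of the built-in optimizer's first step, assuming the documented default initial_accumulator_value = 0.1 and no extra epsilon:

accum = 0.1 + (-2)^2 = 4.1
w = 2 - 0.6 * (-2) / sqrt(4.1) ≈ 2 + 1.2 / 2.0248 ≈ 2.5926

which matches the 2.5926378 printed above, whereas the manual version (accumulator starting at 0, plus Regular = 1e-8) gives 2.6 on its first step.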

This example is only a demonstration; with a single parameter Adagrad cannot show much of an advantage. Its real strength appears with many parameters: it coordinates the learning rates across parameters, each one constrained by its own accumulated "second derivative", so in the end they are all treated fairly. The sketch below illustrates this with two parameters whose gradient scales differ by a factor of 100.
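A minimal NumPy sketch of that per-parameter adaptation (a made-up two-parameter quadratic loss, not taken from the demos above):

#sketch: Adagrad equalizes step sizes across parameters with very different gradient scales
import numpy as np

w = np.array([2.0, 2.0])            # two weights, both should reach 3
scale = np.array([10.0, 0.1])       # gradient scales differ by a factor of 100
accum = np.zeros(2)
LR, eps = 0.6, 1e-8
for _ in range(100):
    g = 2 * scale * (w - 3.0)       # per-parameter gradients
    accum += g ** 2                 # each weight accumulates its own squared gradients
    w -= LR * g / (np.sqrt(accum) + eps)   # steps are normalized per weight
print(w)                            # both weights end up close to 3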

Source code