The principle of BN (Batch Normalization) in CNNs
I first ran into batch normalization (BN) while reading the ladder network paper (https://arxiv.org/pdf/1507.02672v2.pdf). The paper says BN brings benefits such as faster convergence, but I did not really understand why, so I searched for material on BN.
After reading the Zhihu discussion on why Batch Normalization works so well in deep learning and a study note on Batch Normalization on CSDN, I finally have a basic grasp of it. This post only summarizes the concrete BN procedure; a deeper understanding of why BN is needed and whether it really helps is something I am still studying and experimenting with.
BN simply adds a standardization step to the input of every layer during the training of a neural network.
A traditional neural network only standardizes the input samples x before feeding them into the network; BN additionally standardizes the input of every hidden layer. Concretely, the standardized x is multiplied by the weight matrix to give the pre-activation s1 of the first hidden layer, and s1 is normalized using the statistics of the current mini-batch {s_1, ..., s_m}:

μ = (1/m) Σ_i s_i
σ² = (1/m) Σ_i (s_i − μ)²
ŝ_i = (s_i − μ) / √(σ² + ε)
y_i = γ · ŝ_i + β

where γ (scale) and β (shift) are learned parameters and ε is a small constant that prevents division by zero. The result y_i then goes through the activation function and becomes the input of the next layer, and the same is done for every hidden layer.
Note that the computation above is used during training. At test time, the μ (and σ²) that are used are typically moving averages accumulated alongside training.
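As a minimal sketch of the procedure above (plain NumPy, with hypothetical function names; the momentum value is only an illustrative choice), the training-time transform and the test-time use of the moving averages could look like this:

import numpy as np

def batch_norm_train(s, gamma, beta, running_mean, running_var, momentum=0.99, eps=1e-10):
    # statistics of the current mini-batch (one value per feature column)
    mu = s.mean(axis=0)
    var = s.var(axis=0)
    # update the moving averages that will be used at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    s_hat = (s - mu) / np.sqrt(var + eps)                     # standardize
    return gamma * s_hat + beta, running_mean, running_var    # scale and shift

def batch_norm_test(s, gamma, beta, running_mean, running_var, eps=1e-10):
    # at test time, use the accumulated moving averages instead of batch statistics
    s_hat = (s - running_mean) / np.sqrt(running_var + eps)
    return gamma * s_hat + beta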
Before looking at the full code, let's first look at how two averaging functions are used:
mean, variance = tf.nn.moments(x, axes, name=None, keep_dims=False)
The argument x is the sample tensor, with a shape such as [batchsize, height, width, kernels];
axes is a list specifying the dimensions along which to compute the statistics;
the function returns the mean and the variance.
import tensorflow as tf
import numpy as np

'''
batch = np.array(np.random.randint(1, 100, [10, 5]))
At first I did not specify a dtype here, so batch had dtype=int64 and sess.run([mm, vv])
kept raising InvalidArgumentError, because tf.nn.moments requires float inputs.
'''
batch = np.array(np.random.randint(1, 100, [10, 5]), dtype=np.float64)
mm, vv = tf.nn.moments(batch, axes=[0])      # mean and variance along dimension 0
# mm, vv = tf.nn.moments(batch, axes=[0, 1]) # mean and variance over all elements
sess = tf.Session()
print batch
print sess.run([mm, vv])  # be careful with the dtype of the input
sess.close()
Output:
[[ 53. 9. 67. 30. 69.]
[ 79. 25. 7. 80. 16.]
[ 77. 67. 60. 30. 85.]
[ 45. 14. 92. 12. 67.]
[ 32. 98. 70. 98. 48.]
[ 45. 89. 73. 73. 80.]
[ 35. 67. 21. 77. 63.]
[ 24. 33. 56. 85. 17.]
[ 88. 43. 58. 82. 59.]
[ 53. 23. 34. 4. 33.]]
[array([ 53.1, 46.8, 53.8, 57.1, 53.7]), array([ 421.09, 896.96, 598.36, 1056.69, 542.61])]
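For convolutional feature maps shaped [batchsize, height, width, kernels], BN is normally applied per channel, i.e. the statistics are computed over the batch and both spatial dimensions. A small sketch of the same idea (the shapes here are hypothetical):

import tensorflow as tf
import numpy as np

# a fake batch of feature maps: 10 images, 8x8 spatial size, 3 channels
feature_maps = np.random.rand(10, 8, 8, 3).astype(np.float32)
# reduce over batch, height and width -> one mean/variance per channel
ch_mean, ch_var = tf.nn.moments(tf.constant(feature_maps), axes=[0, 1, 2])
with tf.Session() as sess:
    m, v = sess.run([ch_mean, ch_var])
    print m.shape, v.shape   # (3,) (3,)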
ema = tf.train.ExponentialMovingAverage(decay) — computing a moving average requires a decay rate, which controls how fast the tracked values update. ExponentialMovingAverage maintains a shadow variable for every variable it is applied to; the shadow variable is initialized to the variable's initial value and is updated as
shadow_variable = decay × shadow_variable + (1 − decay) × variable
From this formula, decay controls the update speed: the larger it is, the more stable the average. In practice decay is usually set very close to 1 (e.g. 0.99 or 0.999). To let the average update faster at the start of training, ExponentialMovingAverage also accepts a num_updates argument that sets decay dynamically:
decay = min{decay, (1 + num_updates) / (10 + num_updates)}
Here is how I understand the moving average (I am not sure this is entirely right; corrections are welcome).
Suppose we have a time series {a1, a2, a3, ⋯, at, at+1, ⋯}. The moving average mv_t at step t is then mv_t = decay × mv_{t−1} + (1 − decay) × a_t, i.e. a weighted combination of the previous moving average and the newest value.
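A minimal pure-Python sketch of this recurrence (the sequence values, initial value and decay are chosen to match the TensorFlow example below):

decay = 0.9
series = [2.0, 3.0, 4.0]   # the a_t values
mv = 1.0                   # initial moving average (the variable's initial value)
for a in series:
    mv = decay * mv + (1 - decay) * a
    print mv               # 1.1, 1.29, 1.561 -- matches the output below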
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    w = tf.Variable(dtype=tf.float32, initial_value=1.0)
    ema = tf.train.ExponentialMovingAverage(0.9)
    update = tf.assign_add(w, 1.0)
    with tf.control_dependencies([update]):
        ema_op = ema.apply([w])  # returns an op that updates the moving average
                                 # (this line and the next must not be swapped)
    ema_val = ema.average(w)     # returns the current moving average; the argument cannot be a list
with tf.Session(graph=graph) as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(3):
        print i
        print 'w_old=', sess.run(w)
        print sess.run(ema_op)
        print 'w_new=', sess.run(w)
        print sess.run(ema_val)
        print '**************'
Output:
0
w_old= 1.0
None
w_new= 2.0  # running ema_op first runs the update of w (because of the control dependency)
1.1 #0.9*1.0+0.1*2.0=1.1
**************
1
w_old= 2.0
None
w_new= 3.0
1.29 #0.9*1.1+0.1*3.0=1.29
**************
2
w_old= 3.0
None
w_new= 4.0
1.561 #0.9*1.29+0.1*4.0=1.561
Complete code for an NN that classifies MNIST handwritten digits with batch normalization added:
import tensorflow as tf
#import input_data
from tqdm import tqdm
import numpy as np
import math
from six.moves import cPickle as pickle
# data preprocessing
pickle_file = '/home/sxl/tensor學習/My Udacity/notM/notMNISTs.pickle'
# to speed things up this is a pre-processed small sample of handwritten digits; it can be downloaded from http://download.csdn.net/detail/whitesilence/9908115
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
image_size = 28
num_labels = 10
def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
# build a 7-layer network
layer_sizes = [784, 1000, 500, 250, 250,250,10]
L = len(layer_sizes) - 1 # number of layers
num_examples = train_dataset.shape[0]
num_epochs = 100
starter_learning_rate = 0.02
decay_after = 15 # epoch after which to begin learning rate decay
batch_size = 120
num_iter = (num_examples/batch_size) * num_epochs # number of loop iterations
x = tf.placeholder(tf.float32, shape=(None, layer_sizes[0]))
outputs = tf.placeholder(tf.float32)
testing=tf.placeholder(tf.bool)
learning_rate = tf.Variable(starter_learning_rate, trainable=False)
def bi(inits, size, name):
    return tf.Variable(inits * tf.ones([size]), name=name)

def wi(shape, name):
    return tf.Variable(tf.random_normal(shape, name=name)) / math.sqrt(shape[0])
shapes = zip(layer_sizes[:-1], layer_sizes[1:]) # shapes of linear layers
weights = {'W': [wi(s, "W") for s in shapes],  # feedforward weights
           # batch normalization parameter to shift the normalized value
           'beta': [bi(0.0, layer_sizes[l+1], "beta") for l in range(L)],
           # batch normalization parameter to scale the normalized value
           'gamma': [bi(1.0, layer_sizes[l+1], "gamma") for l in range(L)]}
ewma = tf.train.ExponentialMovingAverage(decay=0.99) # to calculate the moving averages of mean and variance
bn_assigns = [] # this list stores the updates to be made to average mean and variance
def batch_normalization(batch, mean=None, var=None):
    if mean is None or var is None:
        mean, var = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(var + tf.constant(1e-10))
# average mean and variance of all layers
running_mean = [tf.Variable(tf.constant(0.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]
running_var = [tf.Variable(tf.constant(1.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]
def update_batch_normalization(batch, l):
    "batch normalize + update average mean and variance of layer l"
    mean, var = tf.nn.moments(batch, axes=[0])
    assign_mean = running_mean[l-1].assign(mean)
    assign_var = running_var[l-1].assign(var)
    bn_assigns.append(ewma.apply([running_mean[l-1], running_var[l-1]]))
    with tf.control_dependencies([assign_mean, assign_var]):
        return (batch - mean) / tf.sqrt(var + 1e-10)

def eval_batch_norm(batch, l):
    mean = ewma.average(running_mean[l-1])
    var = ewma.average(running_var[l-1])
    s = batch_normalization(batch, mean, var)
    return s
def net(x, weights, testing=False):
    d = {'m': {}, 'v': {}, 'h': {}}
    h = x
    for l in range(1, L+1):
        print "Layer ", l, ": ", layer_sizes[l-1], " -> ", layer_sizes[l]
        d['h'][l-1] = h
        s = tf.matmul(d['h'][l-1], weights['W'][l-1])
        m, v = tf.nn.moments(s, axes=[0])
        if testing:
            s = eval_batch_norm(s, l)
        else:
            s = update_batch_normalization(s, l)
        s = weights['gamma'][l-1] * s + weights["beta"][l-1]
        if l == L:
            # use softmax activation in output layer
            h = tf.nn.softmax(s)
        else:
            h = tf.nn.relu(s)
        d['m'][l] = m
        d['v'][l] = v
        d['h'][l] = h
    return h, d
y,_=net(x,weights)
cost = -tf.reduce_mean(tf.reduce_sum(outputs*tf.log(y), 1))
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(outputs, 1)) # no of correct predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) * tf.constant(100.0)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# add the updates of batch normalization statistics to train_step
bn_updates = tf.group(*bn_assigns)
with tf.control_dependencies([train_step]):
    train_step = tf.group(bn_updates)
print "=== Starting Session ==="
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
i_iter = 0
print "=== Training ==="
#print "Initial Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"
for i in tqdm(range(i_iter, num_iter)):
    # images, labels = mnist.train.next_batch(batch_size)
    start = (i * batch_size) % num_examples
    images = train_dataset[start:start+batch_size, :]
    labels = train_labels[start:start+batch_size, :]
    sess.run(train_step, feed_dict={x: images, outputs: labels})
    if (i > 1) and ((i+1) % (num_iter/num_epochs) == 0):  # i > 1 and one epoch finished, i.e. all training data seen once
        epoch_n = i/(num_examples/batch_size)  # index of the current epoch
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        train_dataset = train_dataset[perm]  # after each full pass, reshuffle the training data so the next epoch does not see the same batches
        train_labels = train_labels[perm]
        if (epoch_n+1) >= decay_after:
            # decay learning rate
            # learning_rate = starter_learning_rate * ((num_epochs - epoch_n) / (num_epochs - decay_after))
            ratio = 1.0 * (num_epochs - (epoch_n+1))  # epoch_n + 1 because learning rate is set for next epoch
            ratio = max(0, ratio / (num_epochs - decay_after))
            sess.run(learning_rate.assign(starter_learning_rate * ratio))
        print "Train Accuracy: ", sess.run(accuracy, feed_dict={x: images, outputs: labels})
print "Final Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"
sess.close()
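One thing worth noting: the script defines a testing placeholder but builds the graph with net(x, weights), i.e. the Python default testing=False, so the final accuracy is still computed with batch statistics (and the running statistics even get updated on the test batch). A possible modification (my own sketch, not part of the original article) is to build a second, weight-sharing copy of the network for evaluation, added before the session is created:

# build an evaluation copy of the network that reuses the same weights dict but
# takes the eval_batch_norm branch, i.e. uses the moving averages of mean/variance
y_eval, _ = net(x, weights, testing=True)
correct_eval = tf.equal(tf.argmax(y_eval, 1), tf.argmax(outputs, 1))
accuracy_eval = tf.reduce_mean(tf.cast(correct_eval, "float")) * tf.constant(100.0)
# train exactly as before, then evaluate with:
# print "Final Accuracy: ", sess.run(accuracy_eval, feed_dict={x: test_dataset, outputs: test_labels}), "%"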
Another reference on batch normalization: http://blog.csdn.net/intelligence1994/article/details/53888270
An introduction to commonly used TensorFlow functions: http://blog.csdn.net/wuqingshan2010/article/details/71056292