cs231n-assignment2的筆記
重度拖延症患者準備繼續完成作業2了......
Q1: Fully-connected Neural Network (25 points)
在這一問中需要完成layers.py檔案中的一些函式,首先是affine_forward,也就是前向計算,較為簡單:
N = x.shape[0] x_reshape = x.reshape([N, -1]) x_plus_w = x_reshape.dot(w) # [N, M] out = x_plus_w + b然後是反向梯度傳遞計算,這裡的一個小技巧就是先確定導數的表示式,然後具體計算(比如是否需要轉置、求和等)則可以通過導數表示式中各項的shape確定,這個技巧在lecture4的backprop notes中Gradients for vectorized operations一節中有講到。
N = x.shape[0] x_reshape = x.reshape([N, -1]) # [N, D] dx = dout.dot(w.T).reshape(*x.shape) dw = (x_reshape.T).dot(dout) db = np.sum(dout, axis=0)
例如對於dx,通過上面前向計算表示式可知其導數為dout * w,由於dout.shape == (N, M),w.shape == (D, M),而dx.shape==(N, d1, ..., d_k),很容易寫出表示式。
對於relu_forward,直接根據其表示式即可:
out = np.maximum(0, x)
而relu_backward中,則與點選開啟連結 中的max gate是一致的,程式碼如下:
dx = dout * (x >= 0)
接下來看看它在這次作業中預先實現好的svm loss和softmax loss。實現Two-layer network, 這裡較為簡單,根據初始化要求,對w和b進行初始化:
self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim) self.params['b1'] = np.zeros(hidden_dim) self.params['W2'] = weight_scale * np.random.randn(hidden_dim,num_classes) self.params['b2'] = np.zeros(num_classes)
然後在loss函式中,搭建2層神經網路:
layer1_out, layer1_cache = affine_relu_forward(X, self.params['W1'], self.params['b1']) layer2_out, layer2_cache = affine_forward(layer1_out, self.params['W2'], self.params['b2']) scores = layer2_out
計算梯度:
loss, dscores = softmax_loss(scores, y) loss = loss + 0.5 * self.reg * np.sum(self.params['W1'] * self.params['W1']) + \ 0.5 * self.reg * np.sum(self.params['W2'] * self.params['W2']) d1_out, dw2, db2 = affine_backward(dscores, layer2_cache) grads['W2'] = dw2 + self.reg * self.params['W2'] grads['b2'] = db2 dx1, dw1, db1 = affine_relu_backward(d1_out, layer1_cache) grads['W1'] = dw1 + self.reg * self.params['W1'] grads['b1'] = db1
接下來將所有相關的內容全部利用一個solver物件來進行組合,其中solver.py中已經說明了該物件如何使用,所以在這裡直接使用即可:
solver = Solver(model, data, update_rule='sgd', optim_config={ 'learning_rate': 1e-3,}, lr_decay=0.80, num_epochs=10, batch_size=100, print_every=100) solver.train() scores = solver.model.loss(data['X_test']) y_pred = np.argmax(scores, axis=1) acc = np.mean(y_pred == data['y_test']) print("test acc: ",acc)
最終在測試集上達到的準確率為52.3%,然後做出圖:
這裡注意訓練集和驗證集的準確率,如果差別過大,而且驗證集上準確率提升緩慢,則考慮是否過擬合。
然後是實現FullyConnectedNet,都是套路,網路層結構如下所示:
{affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax
在初始化中,對W、b、gamma、beta進行初始化,按照提示寫即可:
shape1 = input_dim for i, shape2 in enumerate(hidden_dims): self.params['W'+str(i+1)] = weight_scale * np.random.randn(shape1, shape2) self.params['b'+str(i+1)] = np.zeros(shape2) shape1 = shape2 if self.use_batchnorm: self.params['gamma'+str(i+1)] = np.ones(shape2) self.params['beta'+str(i+1)] = np.zeros(shape2) self.params['W' + str(self.num_layers)] = weight_scale * np.random.randn(shape1, num_classes) self.params['b' + str(self.num_layers)] = np.zeros(num_classes)
loss的計算中注意加上正則化項,由於要考慮是否使用dropout層以及use_batchnorm,為方便起見,在layer_utils.py中加入一些函式
def affine_bn_relu_forward(x , w , b, gamma, beta, bn_param): a, fc_cache = affine_forward(x, w, b) bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param) out, relu_cache = relu_forward(bn) cache = (fc_cache, bn_cache, relu_cache) return out, cache def affine_bn_relu_backward(dout, cache): fc_cache, bn_cache, relu_cache = cache dbn = relu_backward(dout, relu_cache) da, dgamma, dbeta = batchnorm_backward_alt(dbn, bn_cache) dx, dw, db = affine_backward(da, fc_cache) return dx, dw, db, dgamma, dbeta
是否使用dropout層會影響梯度計算,所以中間變數利用兩個dict儲存,前向傳遞過程:
ar_cache = {} dp_cache = {} layer_input = X for i in range(1, self.num_layers): if self.use_batchnorm: layer_input, ar_cache[i-1] = affine_bn_relu_forward( layer_input, self.params['W'+str(i)], self.params['b'+str(i)], self.params['gamma'+str(i)], self.params['beta'+str(i)], self.bn_params[i-1]) else: layer_input, ar_cache[i-1] = affine_relu_forward( layer_input, self.params['W'+str(i)], self.params['b'+str(i)]) if self.use_dropout: layer_input, dp_cache[i-1] = dropout_forward(layer_input, self.dropout_param) layer_out, ar_cache[self.num_layers] = affine_forward( layer_input,self.params['W'+str(self.num_layers)], self.params['b'+str(self.num_layers)]) scores = layer_out
後向傳遞過程,基本上是將forward pass反向計算 :
loss, dscores = softmax_loss(scores, y) loss += 0.5 * self.reg * np.sum(self.params['W' + str(self.num_layers)] * self.params['W' + str(self.num_layers)]) dout, dw, db = affine_backward(dscores, ar_cache[self.num_layers]) grads['W'+str(self.num_layers)] = dw + self.reg * self.params['W' + str(self.num_layers)] grads['b'+str(self.num_layers)] = db for i in range(self.num_layers-1): layer = self.num_layers - i - 1 loss += 0.5 * self.reg * np.sum(self.params['W' + str(layer)] * self.params['W' + str(layer)]) if self.use_dropout: dout = dropout_backward(dout, dp_cache[layer-1]) if self.use_batchnorm: dout, dw, db, dgamma, dbeta = affine_bn_relu_backward(dout, ar_cache[layer-1]) grads['gamma' + str(layer)] = dgamma grads['beta' + str(layer)] = dbeta else: dout, dw, db = affine_relu_backward(dout, ar_cache[layer-1]) grads['W' + str(layer)] = dw + self.reg * self.params['W' + str(layer)] grads['b' + str(layer)] = db
然後是在一個小樣本集上進行訓練,需要在20 epochs內達到100%的訓練集準確率,初始的learning_rate是達不到要求的,所以需要調節該引數,這裡的調參技巧是利用下圖,引自cs231n的neural-networks-3 :
初始引數(1e-4)的結果如下:
可推測其學習率過低,適當增大學習率(1e-3、1e-2,先大範圍搜尋再小範圍搜素)
在learning_rate = 9e-3時結果如下:
然後是5層網路,也是需要調節lr和weight_scale,這裡先調節lr,確定其大概範圍後再對初始化值進行調節,但weight_scale對loss的影響是巨大的,這比前面的三層網路要複雜的多,最優引數也更加難以找到,最終learning_rate = 2e-2 , weight_scale = 4e-2:
但其實早已過擬合(看val_acc)......
下面是實現引數更新的幾種方式,之前我們使用的方式是隨機梯度下降(SGD,也即是利用minibatch更新一次可訓練引數w、b等),首先第一種是SGD+Momentum(從這裡開始屬於lecture7的內容了):
這個可以直接根據notes上的程式碼寫出:
v = config['momentum'] * v - config['learning_rate'] * dw next_w = w + v
rmsprop和adam的類似.
rmsprop:
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2) next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
adam:
config['t'] += 1 config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2) mb = config['m'] / (1 - config['beta1']**config['t']) vb = config['v'] / (1 - config['beta2']**config['t']) next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
Q2: Batch Normalization (25 points)
前向過程及其後向傳播過程,這個可看論文中有詳細推導:
主要公式如下:
程式碼如下:
sample_mean = np.mean(x , axis=0) # sample_mean's shape [D,] !!! sample_var = np.var(x, axis=0) # shape is the same as sample_mean x_norm = (x - sample_mean) / np.sqrt(sample_var + eps) out = gamma * x_norm + beta cache = (x, sample_mean, sample_var, x_norm, gamma, beta, eps) running_mean = momentum * running_mean + (1 - momentum) * sample_mean running_var = momentum * running_var + (1 - momentum) * sample_var
batchnorm_backward過程,較簡單:
N, D = dout.shape x, sample_mean, sample_var, x_norm, gamma, beta, eps = cache dx_norm = dout * gamma dsample_var = np.sum(dx_norm * (-0.5 * x_norm / (sample_var + eps)), axis=0) dsample_mean = np.sum(-dx_norm / np.sqrt((sample_var + eps)), axis=0) + \ dsample_var * np.sum(-2.0/N * (x - sample_mean), axis=0) dx1 = dx_norm / np.sqrt(sample_var + eps) dx2 = dsample_var * (2.0/N) * (x - sample_mean) # from sample_var dx3 = dsample_mean * (1.0 / N * np.ones_like(dout)) # from sample_mean dx = dx1 + dx2 + dx3 dgamma = np.sum(dout * x_norm, axis=0) dbeta = np.sum(dout, axis=0)
然後是batchnorm_backward_alt,這個暫時沒有找到比上面更快的方法,包括減少重複計算等,但效果不明顯.......
後面的只需要跑相應的程式碼即可,fc_net.py中在Q1中已經實現了batchnorm......
Q3: Dropout (10 points)
dropout的程式碼在教程中都有實現:
mode == 'train'
mask = (np.random.rand(*x.shape) >= p ) / (1 - p)
out = x * mask
mode == 'test'
out = x
backward
dx = dout * mask然後實驗發現dropout層的引入使得訓練集準確率上升較慢,測試集準確率則更高,這說明其有效的防止了過擬合
Q4: Convolutional Networks (30 points)
首先是Convolution: Naive forward pass,與前面的是類似的,推薦畫出相應的圖來更易於寫程式碼:
N, C, H, W = x.shape F, _, HH, WW = w.shape stride = conv_param['stride'] pad = conv_param['pad'] x_pad = np.pad(x, ((0,), (0,), (pad,), (pad,)), 'constant') out_h = 1 + (H + 2 * pad - HH) // stride out_w = 1 + (W + 2 * pad - WW) // stride out = np.zeros([N, F, out_h, out_w]) for j in range(out_h): for k in range(out_w): h_coord = min(j * stride, H + 2 * pad - HH) w_coord = min(k * stride, W + 2 * pad - WW) for i in range(F): out[:, i, j, k] = np.sum(x_pad[:, :, h_coord:h_coord+HH, w_coord:w_coord+WW] * w[i, :, :, :], axis=(1, 2, 3)) out = out + b[None, :, None, None]
然後是Convolution: Naive backward pass,這個其實就是逆過程,注意一點——對原forward pass的sum過程的處理:
db = np.sum(dout, axis=(0, 2, 3)) x, w, b, conv_param = cache N, C, H, W = x.shape F, _, HH, WW = w.shape stride = conv_param['stride'] pad = conv_param['pad'] x_pad = np.pad(x, ((0,), (0,), (pad,), (pad,)), 'constant') out_h = 1 + (H + 2 * pad - HH) // stride out_w = 1 + (W + 2 * pad - WW) // stride dx = np.zeros_like(x) dw = np.zeros_like(w) dx_pad = np.zeros_like(x_pad) for j in range(out_h): for k in range(out_w): h_coord = min(j * stride, H + 2 * pad - HH) w_coord = min(k * stride, W + 2 * pad - WW) for i in range(N): dx_pad[i, :, h_coord:h_coord+HH, w_coord:w_coord+WW] += \ np.sum((dout[i, :, j, k])[:, None, None, None] * w, axis=0) for i in range(F): dw[i, :, :, :] += np.sum(x_pad[:, :, h_coord:h_coord+HH, w_coord:w_coord+WW] * (dout[:, i, j, k])[:, None, None, None], axis=0) dx = dx_pad[:, :, pad:-pad, pad:-pad]
然後是max_pool_forward_naive:
N, C, H, W = x.shape ph, pw, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride'] out_h = 1 + (H - ph) // stride out_w = 1 + (W - pw) // stride out = np.zeros([N, C, out_h, out_w]) for i in range(out_h): for j in range(out_w): h_coord = min(i * stride, H - ph) w_coord = min(j * stride, W - pw) out[:, :, i, j] = np.max(x[:, :, h_coord:h_coord+ph, w_coord:w_coord+pw], axis=(2, 3))
而其backward過程的核心則是如何求解max的梯度,通過之前的學習可知max相當於一個遮蔽梯度的過程,它會將除了max值之外的所有資料的梯度變為0,所以由此可得:
x, pool_param = cache N, C, H, W = x.shape ph, pw, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride'] out_h = 1 + (H - ph) // stride out_w = 1 + (W - pw) // stride dx = np.zeros_like(x) for i in range(out_h): for j in range(out_w): h_coord = min(i * stride, H - ph) w_coord = min(j * stride, W - pw) max_num = np.max(x[:, :, h_coord:h_coord+ph, w_coord:w_coord+pw], axis=(2,3)) mask = (x[:, :, h_coord:h_coord+ph, w_coord:w_coord+pw] == (max_num)[:,:,None,None]) dx[:, :, h_coord:h_coord+ph, w_coord:w_coord+pw] += (dout[:, :, i, j])[:,:,None,None] * mask
後面的Three layer ConvNet過程則更加簡單,直接堆疊函式即可:
__init__:這裡需要注意的是W2的第一個引數的計算
C, H, W = input_dim self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size) self.params['b1'] = np.zeros(num_filters) self.params['W2'] = weight_scale * np.random.randn((H//2)*(W//2)*num_filters, hidden_dim) self.params['b2'] = np.zeros(hidden_dim) self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes) self.params['b3'] = np.zeros(num_classes)
loss-forward pass:
layer1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param) layer2, cache2 = affine_relu_forward(layer1, W2, b2) scores, cache3 = affine_forward(layer2, W3, b3)
loss-backward pass:
loss, dout = softmax_loss(scores, y) loss += 0.5*self.reg*(np.sum(W1**2)+np.sum(W2**2)+np.sum(W3**2)) dx3, grads['W3'], grads['b3'] = affine_backward(dout, cache3) dx2, grads['W2'], grads['b2'] = affine_relu_backward(dx3, cache2) dx, grads['W1'], grads['b1'] = conv_relu_pool_backward(dx2, cache1) grads['W3'] = grads['W3'] + self.reg * self.params['W3'] grads['W2'] = grads['W2'] + self.reg * self.params['W2'] grads['W1'] = grads['W1'] + self.reg * self.params['W1']
接下來是spatial_batchnorm的相關過程:
forward pass根據提示來做,求均值和方差均是對一張圖的某一個畫素點求,例如對於10張32*32*3的圖片來說,分別對圖片的每一位置求均值和方差(也就是10張圖得到一張均值或方差圖,其大小為32*32*3)
N, C, H, W = x.shape x2 = x.transpose(0, 2, 3, 1).reshape((N*H*W, C)) out, cache = batchnorm_forward(x2, gamma, beta, bn_param) out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
backward pass:
N, C, H, W = dout.shape dout2 = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C) dx, dgamma, dbeta = batchnorm_backward(dout2, cache) dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)