Computer Vision Study Notes - Implementing a Neural Network from Scratch - An Introduction
0 - Learning Objectives
We will implement a simple 3-layer neural network. We will not carefully derive all of the required math, but we will give an intuitive explanation of what we are doing. Note that this code is not meant to achieve the best possible results; you can tune it further, or work through the exercises at the end, to improve it.
1 - Experimental Steps
1.1 - Import Packages
# Package imports
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)  # set the default matplotlib figure size
1.2 - Generating a dataset
Note that scikit-learn includes code for generating datasets, so we do not need to implement it ourselves; we can simply use its make_moons method. The plot below contains two classes of points: blue points represent male patients and red points represent female patients, while the x and y coordinates are medical measurements. Our goal is to train a model that separates male and female patients based on these measurements. Since the boundary between the two classes in the figure is not a simple straight line, a simple logistic regression model is unlikely to perform well.
# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:, 0], X[:, 1], s=40, c=y, cmap=plt.cm.Spectral)
1.3 - Logistic Regression
To demonstrate this point, let's train a logistic regression classifier. Its input is the x and y coordinates, and its output is the predicted class (0 or 1). We use the logistic regression implementation from scikit-learn directly.
# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)
Out[3]: LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
            multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
# Helper function to plot a decision boundary.
# If you don't fully understand this function don't worry, it just generates the contour plot below.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")
As you can see, logistic regression separates the two classes as well as it can with a straight line, but because the data is not linearly separable to begin with, the result is not good.
1.4 - Training a Neural Network
Let's now build a simple 3-layer neural network, with one input layer, one hidden layer, and one output layer, and use it to make predictions.
1.4.1 - How our network makes predictions
The network makes predictions using forward propagation according to the formulas below, where $z_i$ is the weighted input of layer $i$, $a_i$ is its output after applying the activation function, and $W_1, b_1, W_2, b_2$ are the parameters of the network that we need to learn:
$$
\begin{aligned}
z_1 & = xW_1 + b_1 \\
a_1 & = \tanh(z_1) \\
z_2 & = a_1W_2 + b_2 \\
a_2 & = \hat{y} = \mathrm{softmax}(z_2)
\end{aligned}
$$
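To make the shapes concrete, here is a minimal NumPy sketch of this forward pass for a single input point, assuming a hidden layer of size 3; the weights here are random placeholders, and the real implementation appears in section 1.4.3.

# Illustrative only: forward propagation for one 2-dimensional input, hidden layer of size 3
x = np.array([[1.0, 0.5]])                        # one example, shape (1, 2)
W1, b1 = np.random.randn(2, 3), np.zeros((1, 3))  # input -> hidden, W1 has shape (2, 3)
W2, b2 = np.random.randn(3, 2), np.zeros((1, 2))  # hidden -> output, W2 has shape (3, 2)
z1 = x.dot(W1) + b1                               # shape (1, 3)
a1 = np.tanh(z1)                                  # shape (1, 3)
z2 = a1.dot(W2) + b2                              # shape (1, 2)
a2 = np.exp(z2) / np.sum(np.exp(z2), axis=1, keepdims=True)  # softmax, class probabilities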
1.4.2 - Learning the Parameters
Learning the parameters means finding a set of parameters ($W_1, b_1, W_2, b_2$) that minimizes the loss on the training set. We now define the loss function; here we use the common cross-entropy loss:
$$
\begin{aligned}
L(y,\hat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\hat{y}_{n,i}
\end{aligned}
$$
We then use gradient descent to minimize the loss. We will implement the simplest version of gradient descent: batch gradient descent with a fixed learning rate. In practice, variants such as SGD (stochastic gradient descent) or minibatch gradient descent usually perform better, so these are natural places to improve the result later.
Gradient descent needs the gradients of the loss function with respect to the parameters we want to update: $\frac{\partial{L}}{\partial{W_1}}$, $\frac{\partial{L}}{\partial{b_1}}$, $\frac{\partial{L}}{\partial{W_2}}$, $\frac{\partial{L}}{\partial{b_2}}$. To compute these gradients we use the well-known backpropagation algorithm, which computes them efficiently starting from the output. We will not go into detail about how backpropagation works here, but simply give the formulas it needs:
$$
\begin{aligned}
& \delta_3 = \hat{y} - y \\
& \delta_2 = (1 - \tanh^2z_1) \circ \delta_3W_2^T \\
& \frac{\partial{L}}{\partial{W_2}} = a_1^T \delta_3 \\
& \frac{\partial{L}}{\partial{b_2}} = \delta_3\\
& \frac{\partial{L}}{\partial{W_1}} = x^T \delta_2\\
& \frac{\partial{L}}{\partial{b_1}} = \delta_2 \\
\end{aligned}
$$
1.4.3 - Implementation
Let's start implementing!
Variable and parameter definitions.
num_examples = len(X)  # training set size
nn_input_dim = 2       # input layer dimensionality
nn_output_dim = 2      # output layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = 0.01      # learning rate for gradient descent
reg_lambda = 0.01   # regularization strength
Loss function definition.
# Helper function to evaluate the total loss on the dataset
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculate the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add the regularization term to the loss
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss
We also implement a helper function to compute the network's output. It performs forward propagation and returns the class with the highest probability.
# Helper function to predict an output (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)
Finally, here is the function that trains our neural network using batch gradient descent.
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model
1.4.4 - A network with a hidden layer of size 3
# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
Loss after iteration 0: 0.432387
Loss after iteration 1000: 0.068947
Loss after iteration 2000: 0.068926
Loss after iteration 3000: 0.071218
Loss after iteration 4000: 0.071253
Loss after iteration 5000: 0.071278
Loss after iteration 6000: 0.071293
Loss after iteration 7000: 0.071303
Loss after iteration 8000: 0.071308
Loss after iteration 9000: 0.071312
Loss after iteration 10000: 0.071314
Loss after iteration 11000: 0.071315
Loss after iteration 12000: 0.071315
Loss after iteration 13000: 0.071316
Loss after iteration 14000: 0.071316
Loss after iteration 15000: 0.071316
Loss after iteration 16000: 0.071316
Loss after iteration 17000: 0.071316
Loss after iteration 18000: 0.071316
Loss after iteration 19000: 0.071316
This looks much better than the logistic regression result!
1.5 - Varying the hidden layer size
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()
2 - Exercises
Here are some exercises to try:
- Instead of batch gradient descent, use minibatch gradient descent (more info) to train the network. Minibatch gradient descent typically performs better in practice.
- We used a fixed learning rate $\epsilon$ for gradient descent. Implement an annealing schedule for the gradient descent learning rate (more info).
- We used a $\tanh$ activation function for our hidden layer. Experiment with other activation functions (some are mentioned above). Note that changing the activation function also means changing the backpropagation derivative.
- Extend the network from two to three classes. You will need to generate an appropriate dataset for this.
- Extend the network to four layers. Experiment with the layer size. Adding another hidden layer means you will need to adjust both the forward propagation as well as the backpropagation code.
3 - Exercises (1)
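A minimal sketch of this exercise, assuming the globals (X, y, num_examples, nn_input_dim, nn_output_dim, epsilon, reg_lambda) and calculate_loss defined above; build_model_minibatch, num_epochs, and batch_size are illustrative names, not part of the original code. Each epoch shuffles the data and updates the parameters once per small batch instead of once per full pass over the dataset.

# Sketch: minibatch gradient descent variant of build_model (illustrative, not tuned)
def build_model_minibatch(nn_hdim, num_epochs=2000, batch_size=32, print_loss=False):
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    model = {}
    for epoch in range(num_epochs):
        # Shuffle the data, then sweep over it in small batches
        permutation = np.random.permutation(num_examples)
        for start in range(0, num_examples, batch_size):
            batch = permutation[start:start + batch_size]
            X_batch, y_batch = X[batch], y[batch]
            # Forward propagation on the batch
            z1 = X_batch.dot(W1) + b1
            a1 = np.tanh(z1)
            z2 = a1.dot(W2) + b2
            exp_scores = np.exp(z2)
            probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
            # Backpropagation on the batch
            delta3 = probs
            delta3[range(len(batch)), y_batch] -= 1
            dW2 = (a1.T).dot(delta3) + reg_lambda * W2
            db2 = np.sum(delta3, axis=0, keepdims=True)
            delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
            dW1 = np.dot(X_batch.T, delta2) + reg_lambda * W1
            db1 = np.sum(delta2, axis=0)
            # Gradient descent parameter update on this batch
            W1 += -epsilon * dW1
            b1 += -epsilon * db1
            W2 += -epsilon * dW2
            b2 += -epsilon * db2
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        if print_loss and epoch % 100 == 0:
            print("Loss after epoch %i: %f" % (epoch, calculate_loss(model)))
    return model

It can then be used exactly like build_model, for example model = build_model_minibatch(3, print_loss=True).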
4 - Exercises (2)
Use an annealing schedule to decay the learning rate over time, with the formula $\epsilon = \frac{\epsilon_0}{1 + d \cdot t}$, where $\epsilon_0$ is the initial learning rate, $d$ is the decay rate, and $t$ is the iteration number.
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
# - d: the decay rate of the annealing schedule
def build_model(nn_hdim, num_passes=20000, print_loss=False, d=10e-3):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Anneal the learning rate
        epsilon_ = epsilon / (1 + d*i)

        # Gradient descent parameter update
        W1 += -epsilon_ * dW1
        b1 += -epsilon_ * db1
        W2 += -epsilon_ * dW2
        b2 += -epsilon_ * db2

        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model
Loss after iteration 0: 0.432387
Loss after iteration 1000: 0.081007
Loss after iteration 2000: 0.075384
Loss after iteration 3000: 0.073729
Loss after iteration 4000: 0.072895
Loss after iteration 5000: 0.072376
Loss after iteration 6000: 0.072013
Loss after iteration 7000: 0.071742
Loss after iteration 8000: 0.071530
Loss after iteration 9000: 0.071357
Loss after iteration 10000: 0.071214
Loss after iteration 11000: 0.071092
Loss after iteration 12000: 0.070986
Loss after iteration 13000: 0.070894
Loss after iteration 14000: 0.070812
Loss after iteration 15000: 0.070739
Loss after iteration 16000: 0.070673
Loss after iteration 17000: 0.070613
Loss after iteration 18000: 0.070559
Loss after iteration 19000: 0.070509
5 - Exercises (3)
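A minimal sketch for this exercise, assuming the same globals as above: a ReLU variant of build_model (build_model_relu is an illustrative name). Only the two marked lines change, ReLU in the forward pass and its derivative in backpropagation; note that predict and calculate_loss above still assume tanh, so they would need the same change before evaluating or plotting this model.

# Sketch: build_model with a ReLU hidden layer instead of tanh
def build_model_relu(nn_hdim, num_passes=20000):
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.maximum(0, z1)                  # changed: ReLU activation
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3) + reg_lambda * W2
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (z1 > 0)    # changed: ReLU derivative (1 where z1 > 0, else 0)
        dW1 = np.dot(X.T, delta2) + reg_lambda * W1
        db1 = np.sum(delta2, axis=0)
        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
    return {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}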
6 - Exercises (4)
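A minimal sketch for this exercise: the softmax output and cross-entropy loss already handle any number of classes, so it is enough to generate a three-class dataset and set nn_output_dim to 3 before rebuilding. The make_blobs arguments below are arbitrary choices for illustration, and running this overwrites the two-class globals used earlier.

# Sketch: a three-class dataset; build_model, predict and plot_decision_boundary are reused as-is
np.random.seed(0)
X, y = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                                   center_box=(-4.0, 4.0), random_state=0)
num_examples = len(X)   # update the globals used by build_model and calculate_loss
nn_input_dim = 2
nn_output_dim = 3       # three output classes instead of two

model = build_model(3, print_loss=True)
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for three classes")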
7 - Exercises (5)
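A minimal sketch for this exercise, assuming the same globals as above: a network with two tanh hidden layers (four layers in total), so backpropagation gains one extra delta term. A matching prediction helper is included because the three-layer predict above no longer applies; build_model_4layer and predict_4layer are illustrative names.

# Sketch: a 4-layer network (input -> hidden1 -> hidden2 -> output) with tanh hidden units
def build_model_4layer(nn_hdim1, nn_hdim2, num_passes=20000):
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim1) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim1))
    W2 = np.random.randn(nn_hdim1, nn_hdim2) / np.sqrt(nn_hdim1)
    b2 = np.zeros((1, nn_hdim2))
    W3 = np.random.randn(nn_hdim2, nn_output_dim) / np.sqrt(nn_hdim2)
    b3 = np.zeros((1, nn_output_dim))
    for i in range(0, num_passes):
        # Forward propagation through both hidden layers
        a1 = np.tanh(X.dot(W1) + b1)
        a2 = np.tanh(a1.dot(W2) + b2)
        z3 = a2.dot(W3) + b3
        exp_scores = np.exp(z3)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backpropagation: one delta per layer, propagated from the output backwards
        delta4 = probs
        delta4[range(num_examples), y] -= 1
        dW3 = (a2.T).dot(delta4) + reg_lambda * W3
        db3 = np.sum(delta4, axis=0, keepdims=True)
        delta3 = delta4.dot(W3.T) * (1 - np.power(a2, 2))
        dW2 = (a1.T).dot(delta3) + reg_lambda * W2
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2) + reg_lambda * W1
        db1 = np.sum(delta2, axis=0, keepdims=True)
        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        W3 += -epsilon * dW3
        b3 += -epsilon * db3
    return {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3}

# Matching forward pass for the 4-layer model
def predict_4layer(model, x):
    a1 = np.tanh(x.dot(model['W1']) + model['b1'])
    a2 = np.tanh(a1.dot(model['W2']) + model['b2'])
    z3 = a2.dot(model['W3']) + model['b3']
    probs = np.exp(z3) / np.sum(np.exp(z3), axis=1, keepdims=True)
    return np.argmax(probs, axis=1)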
8 - References
http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
https://github.com/dennybritz/nn-from-scratch