Andrew Ng Deep Learning Course 2 - Week 1 Assignment 2 - Regularization
I. deeplearning-assignment
The focus of this assignment is to understand how the various regularization methods work and what their strengths and weaknesses are, rather than the fine details of implementing the algorithms.
Problem statement: you are asked to train a model on a dataset that recommends the positions where the French goalkeeper should kick the ball so that the French team's players can hit it with their heads. The 2D dataset from France's past 10 games looks like this:
Each dot corresponds to a position on the football field where a player hit the ball with his/her head after the French goalkeeper kicked it from the left side of the field.
- If the dot is blue, it means the French player managed to hit the ball with his/her head.
- If the dot is red, it means a player from the other team hit the ball with his/her head.
Your goal: use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.
A look at the dataset: it is somewhat noisy, but it seems that a diagonal line separating the upper-left half (blue) from the lower-right half (red) would work fairly well.
In this assignment you will first try a non-regularized model. You will then learn how to regularize it and decide which technique to use to solve the French Football Corporation's problem.
II. Algorithm code
1. Non-regularized model
import numpy as np
import matplotlib.pyplot as plt
# The helper functions used below (sigmoid, relu, initialize_parameters, forward/backward
# propagation, compute_cost, update_parameters, ...) are provided by the assignment's
# utility module (e.g. reg_utils).

def model(X, Y, learning_rate=0.3, num_iterations=30000, print_cost=True, lambd=0, keep_prob=1):
    """
    :param X: input data, of shape (input size, number of examples)
    :param Y: true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    :param learning_rate: learning rate of the optimization
    :param num_iterations: number of iterations of the optimization loop
    :param print_cost: If True, print the cost every 10000 iterations
    :param lambd: regularization hyperparameter, scalar
    :param keep_prob: probability of keeping a neuron active during drop-out, scalar
    :return: parameters -- parameters learned by the model. They can then be used to predict.
    """
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples (211 in this dataset)
    layers_dims = [X.shape[0], 20, 3, 1]
    parameters = initialize_parameters(layers_dims)

    for i in range(0, num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost, with or without the L2 penalty
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # This assignment does not combine L2 regularization and dropout
        assert (lambd == 0 or keep_prob == 1)

        # Backward propagation
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Gradient descent update
        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
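For reference, here is a minimal usage sketch for the baseline (non-regularized) case. It assumes the dataset loader and prediction helper shipped with the assignment (load_2D_dataset and predict from the course's reg_utils-style utility file); the exact names may differ in your copy.

train_X, train_Y, test_X, test_Y = load_2D_dataset()   # assumed helper from the assignment's utilities

parameters = model(train_X, train_Y)                    # lambd=0, keep_prob=1: no regularization
print("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)   # assumed helper that prints the accuracy
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)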
From the results above you can see that the non-regularized model overfits the training set: it fits some of the noisy points. Let's now look at two techniques that can reduce overfitting.
2. L2 regularization
To avoid overfitting the dataset, the cost is computed with the regularized cost function below instead of the plain cross-entropy cost, which reduces the effect of high variance.
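For reference, the regularized cost adds an L2 penalty on the weight matrices to the cross-entropy cost. For the three-layer network used here this is the standard formula:

$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{[3](i)} + (1-y^{(i)})\log\left(1-a^{[3](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m}\sum_{l=1}^{3}\sum_{k}\sum_{j}\left(W_{k,j}^{[l]}\right)^{2}}_{\text{L2 regularization cost}}$$

This is exactly what compute_cost_with_regularization below computes, with lambd playing the role of λ.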
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    :param A3: post-activation, output of forward propagation, of shape (output size, number of examples)
    :param Y: "true" labels vector, of shape (output size, number of examples)
    :param parameters: python dictionary containing parameters of the model
    :param lambd: regularization hyperparameter, scalar
    :return: cost - value of the regularized loss function
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)   # the usual cross-entropy part of the cost
    L2_regularization_cost = lambd / (2 * m) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    cost = cross_entropy_cost + L2_regularization_cost
    return cost


def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    :param X: input dataset, of shape (input size, number of examples)
    :param Y: "true" labels vector, of shape (output size, number of examples)
    :param cache: cache output from forward_propagation()
    :param lambd: regularization hyperparameter, scalar
    :return: gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    # The L2 penalty lambd/(2m) * ||W||^2 contributes an extra lambd/m * W to each dW.
    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd / m * W3
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd / m * W2
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T) + lambd / m * W1
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
As the figure above shows, the L2-regularized model produces a smoother decision boundary. L2 regularization adds a penalty term to the cost function: the larger lambd is, the smaller the weights W become, which in turn reduces overfitting.
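To see concretely why a larger lambd drives W down, write out one gradient-descent step with the extra λ/m · W term produced by backward_propagation_with_regularization (α is the learning rate; this is the usual "weight decay" reading of L2 regularization, not additional assignment code):

$$W^{[l]} := W^{[l]} - \alpha\left(dW^{[l]}_{backprop} + \frac{\lambda}{m}W^{[l]}\right) = \left(1-\frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\, dW^{[l]}_{backprop}$$

Because the factor (1 - αλ/m) is a little less than 1, every update shrinks the weights slightly, and the shrinkage gets stronger as λ grows.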
3. Dropout regularization
At each training iteration, dropout randomly shuts down some of the nodes in every hidden layer, which makes the network simpler: intuitively, at each iteration you are training a reduced version of the original network in which some hidden-layer nodes have been removed.
def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    :param X: input dataset, of shape (2, number of examples)
    :param parameters: python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                       W1 -- weight matrix of shape (20, 2)
                       b1 -- bias vector of shape (20, 1)
                       W2 -- weight matrix of shape (3, 20)
                       b2 -- bias vector of shape (3, 1)
                       W3 -- weight matrix of shape (1, 3)
                       b3 -- bias vector of shape (1, 1)
    :param keep_prob: probability of keeping a neuron active during drop-out, scalar
    :return: A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
             cache -- tuple, information stored for computing the backward propagation
    """
    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: initialize matrix D1 with uniform random values in [0, 1)
    D1 = D1 < keep_prob                            # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = np.multiply(D1, A1)                       # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                            # Step 4: scale the value of neurons that haven't been shut down

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])  # Step 1: initialize matrix D2 with uniform random values in [0, 1)
    D2 = D2 < keep_prob                            # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = np.multiply(D2, A2)                       # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                            # Step 4: scale the value of neurons that haven't been shut down

    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    return A3, cache
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    :param X: input dataset, of shape (2, number of examples)
    :param Y: "true" labels vector, of shape (output size, number of examples)
    :param cache: cache output from forward_propagation_with_dropout()
    :param keep_prob: probability of keeping a neuron active during drop-out, scalar
    :return: gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dA2 = np.multiply(dA2, D2)   # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob        # Step 2: Scale the value of neurons that haven't been shut down
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dA1 = np.multiply(dA1, D1)   # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob        # Step 2: Scale the value of neurons that haven't been shut down
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
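For comparison, both regularized variants are trained through the same model function, just with different hyperparameters. The values below (lambd=0.7, keep_prob=0.86) are only illustrative choices; they would need to be tuned for your own data.

parameters_l2 = model(train_X, train_Y, lambd=0.7)            # L2 regularization only
parameters_dropout = model(train_X, train_Y, keep_prob=0.86)  # dropout only
# Note: the assert in model() forbids combining L2 and dropout in this assignment.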
One important point: dropout is only applied during training; you should not use dropout when evaluating the model at test time.
As the code above shows, during training each layer with dropout divides its activations by keep_prob in order to keep the same expected value. For example, if keep_prob is 0.5, on average half of the nodes are shut down, so the output is divided by 0.5, which is equivalent to multiplying it by 2. The output therefore keeps the same expected value as without dropout.
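This scaling argument is easy to check numerically. The following stand-alone sketch is not part of the assignment; the array shape and keep_prob value are arbitrary:

import numpy as np

np.random.seed(0)
keep_prob = 0.5
a = np.random.rand(3, 100000)              # pretend these are a layer's activations
d = np.random.rand(*a.shape) < keep_prob   # dropout mask: keep each unit with probability keep_prob
a_drop = (a * d) / keep_prob               # shut some units down, then rescale the survivors

print(a.mean())        # roughly 0.5
print(a_drop.mean())   # also roughly 0.5: the expected value of the activations is preserved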
III. Summary
The table below shows the results of the three models above:
We can see that regularization affects the accuracy on the training set as well as on the test set. This is because regularization limits how much the model can overfit the training set and improves its ability to generalize, which raises the test-set accuracy.
To sum up, from this week's assignment we have learned that:
- Regularization helps reduce overfitting.
- Regularization drives the weights W to smaller values.
- L2 regularization and dropout are two effective regularization techniques.