Coursera | Andrew Ng (02-week-1-1.10): Vanishing and Exploding Gradients

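This series adds personal study notes, or supplementary derivations, to selected points of the original course. If you find errors, corrections are welcome. After working through Andrew Ng's course, I wrote it up as text so that it is easier to look things up and review. Because I am still learning English, the series is primarily in English, and I suggest readers also treat the English as primary and the Chinese as an aid, as groundwork for reading academic papers in related fields later on. - ZJ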

When reposting, please credit the author and source: ZJ, WeChat official account 「SelfImprovementLab」

1.10 vanishing/exploding gradients (梯度消失與梯度爆炸)

(Subtitle source: 網易雲課堂, NetEase Cloud Classroom)

One of the problems of training neural networks, especially very deep neural networks, is that of vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video, you'll see what this problem of exploding or vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.

訓練神經網路尤其是深度神經網路所面臨的一個問題是，梯度消失或梯度爆炸，也就是說當你訓練深度網路時，導數或坡度有時會變得非常大，或非常小甚至以指數方式變小， 這加大了訓練的難度，這節課你將會瞭解梯度消失或爆炸問題的真正含義，以及如何更明智地選擇隨機初始化權重，從而避免這個問題。

Let's say you're training a very deep neural network like this. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters w[1], w[2], w[3] and so on, up to w[L]. For the sake of simplicity, let's say we're using an activation function g(z) = z, so a linear activation function. And let's ignore b; let's say b[l] = 0. In that case you can show that the output y will be w[L] times w[L−1] times w[L−2], dot, dot, dot, down to w[3], w[2], w[1] times x.

If you want to check my math: w[1] times x is going to be z[1], because b is equal to zero. So z[1] is equal to w[1] times x plus b, which is zero. Then a[1] is equal to g of z[1], but because we use a linear activation function, this is just equal to z[1]. So this first term w[1]x is equal to a[1]. By the same reasoning you can figure out that w[2] times w[1] times x is equal to a[2], because a[2] is g of z[2], which is g of w[2] times a[1], and you can plug in a[1] = w[1]x there. So this thing is going to be equal to a[2], and then this thing is going to be a[3], and so on, until the product of all these matrices gives you ŷ, not y.
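The collapse of a linear network into a single matrix product can be checked numerically. The following is a minimal sketch in plain Python, not code from the course: the helper names (`matvec`, `matmul`, `forward`, `collapsed`, `scaled_identity`), the depth of 50 layers, and the weight scales 0.5 and 1.5 are all illustrative assumptions.

```python
# Sketch: with g(z) = z and b = 0, a deep network reduces to one matrix product.
# Depth, layer width, and weight scales are illustrative assumptions.

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def forward(weights, x):
    """Layer-by-layer pass: a[l] = g(z[l]) = z[l] = w[l] a[l-1]."""
    a = x
    for W in weights:  # weights = [w[1], w[2], ..., w[L]]
        a = matvec(W, a)
    return a

def collapsed(weights, x):
    """Same output computed as (w[L] ... w[2] w[1]) x."""
    product = weights[0]
    for W in weights[1:]:
        product = matmul(W, product)  # left-multiply by the next layer
    return matvec(product, x)

def scaled_identity(scale, n=2):
    """Hypothetical weight matrix: scale times the n-by-n identity."""
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

x = [1.0, 1.0]
depth = 50  # assumed number of layers L
for scale in (0.5, 1.5):
    weights = [scaled_identity(scale) for _ in range(depth)]
    # Both columns should agree; the magnitude behaves like scale ** depth.
    print(scale, forward(weights, x)[0], collapsed(weights, x)[0])
```

Under these assumptions, the printed value shrinks like 0.5^L for the smaller scale and grows like 1.5^L for the larger one, so the magnitude changes exponentially with depth.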


假設你正在訓練這樣一個極深的神經網路，為了節約幻燈片上的空間，我畫的神經網路每層只有兩個隱藏單元，但它可能含有更多，但這個神經網路會有引數w[1]，w[2] w[3]等等直到w[L]，為了簡單起見，假設我們使用啟用函式 g(z)=z，也就是線性啟用函式，我們忽略 b 假設b[L]=0，如果那樣的話，輸出y=w[L]∗w[L−1]∗w[L−2].....w[3]∗w[2]∗w[1]∗x，如果你想考驗我的數學水平，w[1]∗x=z[1]，因為 b 等於 0，所以我想，z[1]=w[1]∗x 因為b=0， a[1]=g(
