gradient descent梯度下降演算法的優化

阿新 • • 發佈：2018-11-30

cost function優化

最原始更新由此

相應的難點程式碼：

self.weights = [w-(eta/len(mini_batch))*nw
                for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/len(mini_batch))*nb
               for b, nb in zip(self.biases, nabla_b)]

nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())

nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())

1. cost entropy

影象：

相應的程式碼：

return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

2. softmax

關鍵程式碼：

3. overfitting：在訓練集上表現號，在測試集上表現差。

4. Regulization L2

self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                for w, nw in zip(self.weights, nabla_w)]

5. Regulization L1

附加：

完整程式碼：

"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.

        """
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.  Note that np.nan_to_num is used to ensure numerical
        stability.  In particular, if both ``a`` and ``y`` have a 1.0
        in the same slot, then the expression (1-y)*np.log(1-a)
        returns nan.  The np.nan_to_num ensures that that is converted
        to the correct value (0.0).

        """
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer.  Note that the
        parameter ``z`` is not used by the method.  It is included in
        the method's parameters in order to make the interface
        consistent with the delta method for other cost classes.

        """
        return (a-y)


#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        """The list ``sizes`` contains the number of neurons in the respective
        layers of the network.  For example, if the list was [2, 3, 1]
        then it would be a three-layer network, with the first layer
        containing 2 neurons, the second layer 3 neurons, and the
        third layer 1 neuron.  The biases and weights for the network
        are initialized randomly, using
        ``self.default_weight_initializer`` (see docstring for that
        method).

        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost=cost

    def default_weight_initializer(self):
        """Initialize each weight using a Gaussian distribution with mean 0
        and standard deviation 1 over the square root of the number of
        weights connecting to the same neuron.  Initialize the biases
        using a Gaussian distribution with mean 0 and standard
        deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Initialize the weights using a Gaussian distribution with mean 0
        and standard deviation 1.  Initialize the biases using a
        Gaussian distribution with mean 0 and standard deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        This weight and bias initializer uses the same approach as in
        Chapter 1, and is included for purposes of comparison.  It
        will usually be better to use the default weight initializer
        instead.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda = 0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch. Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result. The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data. The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.
        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

gradient descent梯度下降演算法的優化

cost function優化最原始更新由此相應的難點程式碼： self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)] self.bi

梯度下降演算法（Gradient descent）

梯度下降演算法是一種求區域性最優解的方法，在wikipedia上對它做了詳細的說明，這裡我只是把自己感興趣的一些地方總結一下：對於F(x)，在a點的梯度是F(x)增長最快的方向，那麼它的相反方向則是該點下降最快的方向，我們有如下結論：其中，v是一個大於0的數，於是我們

機器學習之梯度下降演算法Gradient Descent

梯度下降演算法: 機器學習實現關鍵在於對引數的磨合，其中最關鍵的兩個數：代價函式J(θ)，代價函式對θ的求導∂J/∂θj。如果知道這兩個數，就能對引數進行磨合了：其中 α 為每步調整的幅度。其中代價函式公式J(θ)：代價函式對θ的求導∂J/∂θj：

[deeplearning-001] stotisticks gradient descent隨機梯度下降演算法的最簡單例子解釋

1.gradient descent梯度下降優化 1.1假設要優化一個函式f(x)=(x−1)2求它的最小值。這個函式在x=1 時有最小值，這是解析解。如果用梯度下降法，是這樣的： f′(x)=2(x−1) 每一步的迭代公式是： xi+1=xi−ηf′(

梯度下降各種優化演算法的比較

深度學習最全優化方法總結比較（SGD，Adagrad，Adadelta，Adam，Adamax，Nadam）直接上別人的部落格吧，哈哈，太懶了。。。。 https://zhuanlan.zhihu.com/p/32626442 https://zhuanlan.zhihu.com/p/22252270

優化梯度下降演算法 Momentum、RMSProp(Root mean square propagation)和Adam( Adaptive Moment Estimation)

https://blog.csdn.net/To_be_to_thought/article/details/81780397闡釋Batch Gradient Descent、Stochastic Gradient Descent、MiniBatch Gradient Descent具體原理。

人工智障學習筆記——梯度下降(2)優化演算法

四、優化 4-1 Momentum 如果我們把梯度下降法當作小球從山坡到山谷的一個過程，那麼在小球滾動時是帶有一定的初速度，在下落過程，小球積累的動能越來越大，小球的速度也會越滾越大，更快的奔向谷底，受此啟發就有了動量法 Momentum。動量的引入是為了加速SG

【機器學習】梯度下降演算法及梯度優化演算法

用於評估機器學習模型的就是損失函式，我們訓練的目的基本上都是最小化損失，這個最小化的方式就要用優化演算法了，機器學習中最常用的就是梯度下降演算法。導數、方向導數和梯度要了解梯度下降演算法是什麼首要知道梯度是什麼，導數和方向導數又是瞭解梯度的前提。

斯坦福CS229機器學習課程筆記一：線性迴歸與梯度下降演算法

機器學習三要素機器學習的三要素為：模型、策略、演算法。模型：就是所要學習的條件概率分佈或決策函式。線性迴歸模型策略：按照什麼樣的準則學習或選擇最優的模型。最小化均方誤差，即所謂的 least-squares(在spss裡線性迴歸對應的模組就叫OLS即Ordinary Least Squares)：

梯度下降演算法過程詳細解讀

　　看了很多博文，一談到梯度下降，大多都在畫圖，類比“下山”。對於一開始想要了解“梯度下降”是個什麼玩意兒時，這種類比法是非常有助於理解的。但是，當我大概知道了梯度下降是什麼東西之後，我就好奇了，梯度下降究竟是怎樣尋找到模型的最優引數的？不能一想到梯度下降，腦海中就只有“下山”的畫面，“下山”不是目的，目的在

吳恩達機器學習課程筆記02——處理房價預測問題（梯度下降演算法詳解）

建議記住的實用符號符號含義 m 樣本數目 x 輸入變數 y 輸出變數/目標變數

機器學習之--梯度下降演算法

貌似機器學習最繞不過去的演算法，是梯度下降演算法。這裡專門捋一下。 1. 什麼是梯度有知乎大神已經解釋的很不錯，這裡轉載並稍作修改，加上自己的看法。先給出連結，畢竟轉載要說明出處嘛。為什麼梯度反方向是函式值區域性下降最快的方向？因為高等數學都忘光了，先從導數/偏倒數/方向

梯度下降、線性迴歸演算法中的梯度下降、為什麼要用梯度下降演算法。

梯度梯度是一個向量。函式上某點的梯度的方向：導數最大的方向。梯度的大小（梯度的模）：該點的導數的大小。梯度下降對於一般二次函式而言：由於梯度的方向是導數最大的方向，順著梯度方向走，函式值就變大的最快，順著梯度的反方向，那麼函式值減小最快的方向，導數也慢慢減小。當導數減為

線性迴歸梯度下降演算法 overshot the minimun現象

在梯度下降演算法中，理論上有一個步長steep需要我們設定。steep的設定非常重要，如果設定不當，我們的梯度下降演算法可能就得不到我們想要的結果。一：步長不當導致的問題如果步長太短，很顯然我們訓練集訓練的時間就會加長。如果訓練集比較多，這也是致命的。如果步長太長，可能出現Oversho

監督學習-梯度下降演算法

公式不太好上傳，所以就截圖了，效果不太好，大家想看原件，請下載：https://download.csdn.net/download/qq_24369689/10811686 監督學習-梯度下降演算法如果你還沒有接觸過梯度下降演算法，你在看下面內容之前可以先看一下，吳恩達的梯度下降的視訊：

訓練過程--梯度下降演算法（SGD、adam等）

SGD系列 1）Batch gradient descent(批量梯度下降) 在整個資料集上每更新一次權重，要遍歷所有的樣本，由於樣本集過大，無法儲存在記憶體中，無法線上更新模型。對於損失函式的凸曲面，可以收斂到全域性最小值，對於非凸曲面，收斂到區域性最小值。隨機梯度

梯度下降演算法(1) - Python實現

演算法介紹：梯度下降演算法是一種利用一次導數資訊求取目標函式極值的方法，也是目前應用最為廣泛的區域性優化演算法之一。其具有實現簡單、容易遷移、收斂速度較快的特徵。在求解過程中，從預設的種子點開始，根據梯度資訊逐步迭代更新，使得種子點逐漸向目標函式的極小值點移動，最終到達目標函式的極小值點。注意，沿梯度正

一系列梯度下降演算法

梯度下降演算法是優化神經網路最常見的方式，這裡我們會接觸梯度下降法的不同變種以及多種梯度下降優化演算法。梯度下降變種 batch gradient descent 缺點：一次引數更新需要使用整個資料集，因此十分慢，並且記憶體不夠的話很難應付。優點：保證收斂到全域性最小值或者區

梯度下降隨機梯度下降演算法

一、一維梯度下降演算法思想：我們要找到一個函式的谷底，可以通過不斷求導，不斷逼近，找到一個函式求導後為0，我們就引入了一個概念學習率（也可以叫作步長），因為是不斷逼近某個x，所以學習率過大會導致超過最優解，而學習率過小，會導致收斂速度過慢。二、多維梯度下降

梯度下降演算法原理介紹

　　梯度下降法　　1、梯度：　　在微積分裡面，對多元函式引數求偏導數，把求的各引數的偏導數以向量的形式寫出來，就是梯度。　　梯度向量從幾何意義上講，就是函式變化增加最快的地方，沿著梯度向量的方向更容易找到函式的最大值，沿著向量相反的方向，梯度減小最快，更容易找

gradient descent梯度下降演算法的優化

相關推薦