
Deep Learning Overview: From Perceptrons to Deep Networks (English Version)

A Deep Learning Tutorial: From Perceptrons to Deep Networks


In recent years, there’s been a resurgence in the field of Artificial Intelligence. It’s spread beyond the academic world, with major players like Google, Microsoft, and Facebook creating their own research teams and making some impressive acquisitions.

Some of this can be attributed to the abundance of raw data generated by social network users, much of which needs to be analyzed, as well as to the cheap computational power available via GPGPUs.

But beyond these phenomena, this resurgence has been powered in no small part by a new trend in AI, specifically in machine learning, known as “Deep Learning”. In this tutorial, I’ll introduce you to the key concepts and algorithms behind deep learning, beginning with the simplest unit of composition and building up to the concepts of machine learning in Java.

(For full disclosure: I’m also the author of a Java deep learning library, available here, and the examples in this article are implemented using the above library. If you like it, you can support it by giving it a star on GitHub, for which I would be grateful. Usage instructions are available on the homepage.)

A Thirty-Second Tutorial on Machine Learning

The general procedure is as follows:

  1. We have some algorithm that’s given a handful of labeled examples, say 10 images of dogs with the label 1 (“Dog”) and 10 images of other things with the label 0 (“Not dog”)—note that we’re mainly sticking to supervised, binary classification for this post.
  2. The algorithm “learns” to identify images of dogs and, when fed a new image, hopes to produce the correct label (1 if it’s an image of a dog, and 0 otherwise).

This setting is incredibly general: your data could be symptoms and your labels illnesses; or your data could be images of handwritten characters and your labels the actual characters they represent.

Perceptrons: Early Deep Learning Algorithms

One of the earliest supervised training algorithms is that of the perceptron, a basic neural network building block.

Say we have n points in the plane, labeled ‘0’ and ‘1’. We’re given a new point and we want to guess its label (this is akin to the “Dog” and “Not dog” scenario above). How do we do it?

One approach might be to look at the closest neighbor and return that point’s label. But a slightly more intelligent way of going about it would be to pick a line that best separates the labeled data and use that as your classifier.

[Figure: labeled data points in the plane, separated by a linear classifier.]

In this case, each piece of input data would be represented as a vector x = (x_1, x_2) and our function would be something like “‘0’ if below the line, ‘1’ if above”.

To represent this mathematically, let our separator be defined by a vector of weights w and a vertical offset (or bias) b. Then, our function would combine the inputs and weights with a weighted sum transfer function:

f(x) = w_1 x_1 + w_2 x_2 + … + w_n x_n + b = w · x + b

The result of this transfer function would then be fed into an activation function to produce a labeling. In the example above, our activation function was a threshold cutoff (e.g., 1 if greater than some value):

output = 1 if f(x) > threshold, 0 otherwise
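To make this concrete, here is a minimal sketch in Java of the weighted sum followed by a threshold activation. The specific weights, bias, and the threshold of 0 are illustrative assumptions, not values from the article.

```java
// A minimal perceptron forward pass: weighted sum followed by a threshold
// activation. The weights, bias, and threshold here are illustrative values.
public class Perceptron {
    static double weightedSum(double[] x, double[] w, double b) {
        double sum = b;
        for (int i = 0; i < x.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    static int activate(double sum) {
        return sum > 0 ? 1 : 0; // threshold cutoff at 0
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.7};   // hypothetical weights
        double b = 0.1;             // hypothetical bias
        double[] x = {1.0, 2.0};    // a new point to classify
        System.out.println(activate(weightedSum(x, w, b))); // prints 0 or 1
    }
}
```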

Training the Perceptron

The training of the perceptron consists of feeding it multiple training samples and calculating the output for each of them. After each sample, the weights w are adjusted in such a way as to minimize the output error, defined as the difference between the desired (target) and the actual outputs. There are other error functions, like the mean squared error, but the basic principle of training remains the same.
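Below is a hedged sketch of that training loop using the classic perceptron learning rule, where each weight is nudged by the output error scaled by a learning rate. The learning rate and epoch count are assumed values, not taken from the article.

```java
// Sketch of the perceptron learning rule: after each sample, the weights are
// nudged by the output error (target - actual output), scaled by a learning rate.
class PerceptronTrainer {
    double[] w;   // weights
    double b;     // bias
    double alpha; // learning rate (assumed value)

    PerceptronTrainer(int inputs, double alpha) {
        this.w = new double[inputs];
        this.alpha = alpha;
    }

    int predict(double[] x) {
        double sum = b;
        for (int i = 0; i < x.length; i++) sum += w[i] * x[i];
        return sum > 0 ? 1 : 0;
    }

    void train(double[][] samples, int[] targets, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int s = 0; s < samples.length; s++) {
                int error = targets[s] - predict(samples[s]); // desired minus actual
                for (int i = 0; i < w.length; i++) {
                    w[i] += alpha * error * samples[s][i];
                }
                b += alpha * error; // bias treated like a weight on a constant input of 1
            }
        }
    }
}
```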

Single Perceptron Drawbacks

The single perceptron approach to deep learning has one major drawback: it can only learn linearly separable functions. How major is this drawback? Take XOR, a relatively simple function, and notice that it can’t be classified by a linear separator (notice the failed attempt, below):

[Figure: XOR-labeled points; no single straight line separates the two classes.]

To address this problem, we’ll need to use a multilayer perceptron, also known as a feedforward neural network: in effect, we’ll compose a bunch of these perceptrons together to create a more powerful mechanism for learning.

Feedforward Neural Networks for Deep Learning

A neural network is really just a composition of perceptrons, connected in different ways and operating on different activation functions.

[Figure: a feedforward neural network with a 3-unit input layer, a 4-unit hidden layer, and a 2-unit output layer.]

For starters, we’ll look at the feedforward neural network, which has the following properties:

  • An input, output, and one or more hidden layers. The figure above shows a network with a 3-unit input layer, 4-unit hidden layer and an output layer with 2 units (the terms units and neurons are interchangeable).
  • Each unit is a single perceptron like the one described above.
  • The units of the input layer serve as inputs for the units of the hidden layer, while the hidden layer units are inputs to the output layer.
  • Each connection between two neurons has a weight w (similar to the perceptron weights).
  • Each unit of layer t is typically connected to every unit of the previous layer t - 1 (although you could disconnect them by setting their weight to 0).
  • To process input data, you “clamp” the input vector to the input layer, setting the values of the vector as “outputs” for each of the input units. In this particular case, the network can process a 3-dimensional input vector (because of the 3 input units). For example, if your input vector is [7, 1, 2], then you’d set the output of the top input unit to 7, the middle unit to 1, and so on. These values are then propagated forward to the hidden units using the weighted sum transfer function for each hidden unit (hence the term forward propagation), which in turn compute their outputs by applying their activation function (a sketch of this forward pass appears after this list).
  • The output layer calculates its outputs in the same way as the hidden layer. The result of the output layer is the output of the network.
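Here is a minimal sketch of forward propagation through one fully connected layer, as described above. The sigmoid activation and the 3-4-2 shape in the usage comment are assumptions matching the figure, not code from the article's library.

```java
// Forward propagation through one fully connected layer: each unit computes a
// weighted sum of the previous layer's outputs plus its bias, then applies an
// activation function (sigmoid here, as an assumption).
static double[] forwardLayer(double[] prev, double[][] weights, double[] biases) {
    double[] out = new double[biases.length];
    for (int j = 0; j < out.length; j++) {
        double sum = biases[j];
        for (int i = 0; i < prev.length; i++) {
            sum += weights[i][j] * prev[i]; // weight from unit i (previous layer) to unit j
        }
        out[j] = 1.0 / (1.0 + Math.exp(-sum)); // sigmoid activation
    }
    return out;
}

// Usage for a 3-4-2 network like the one in the figure:
// double[] input  = {7, 1, 2};                          // clamped to the input layer
// double[] hidden = forwardLayer(input, wIn, bHidden);  // wIn is 3x4
// double[] output = forwardLayer(hidden, wHid, bOut);   // wHid is 4x2
```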

Beyond Linearity

What if each of our perceptrons is only allowed to use a linear activation function? Then, the final output of our network will still be some linear function of the inputs, just adjusted with a ton of different weights that it’s collected throughout the network. In other words, a linear composition of a bunch of linear functions is still just a linear function. If we’re restricted to linear activation functions, then the feedforward neural network is no more powerful than the perceptron, no matter how many layers it has.


Because of this, most neural networks use non-linear activation functions like the logistic, tanh, binary, or rectifier functions. Without them, the network can only learn functions which are linear combinations of its inputs.
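For reference, here is a minimal sketch of these common non-linearities; the definitions are standard and not tied to any particular library.

```java
// Common non-linear activation functions (a minimal sketch).
static double sigmoid(double x)   { return 1.0 / (1.0 + Math.exp(-x)); } // logistic: output in (0, 1)
static double tanh(double x)      { return Math.tanh(x); }               // tanh: output in (-1, 1)
static double rectifier(double x) { return Math.max(0.0, x); }           // rectifier (ReLU): max(0, x)
```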

Training Perceptrons

The most common deep learning algorithm for supervised training of the multilayer perceptrons is known as backpropagation. The basic procedure:

  1. A training sample is presented and propagated forward through the network.
  2. The output error is calculated, typically the mean squared error:

    E = 1/2 · (t − y)^2

    Where t is the target value and y is the actual network output. Other error calculations are also acceptable, but the MSE is a good choice.

  3. Network error is minimized using a method called stochastic gradient descent.

    [Figure: gradient descent; the training error plotted as a function of a single weight, showing a global minimum and a nearby local minimum.]

    Gradient descent is universal, but in the case of neural networks, this would be a graph of the training error as a function of the input parameters. The optimal value for each weight is that at which the error achieves a global minimum. During the training phase, the weights are updated in small steps (after each training sample or a mini-batch of several samples) in such a way that they are always trying to reach the global minimum—but this is no easy task, as you often end up in local minima, like the one on the right. For example, if the weight has a value of 0.6, it needs to be changed towards 0.4.

    This figure represents the simplest case, that in which error depends on a single parameter. However, network error depends on every network weight and the error function is much, much more complex.

    Thankfully, backpropagation provides a method for updating each weight between two neurons with respect to the output error. The derivation itself is quite complicated, but the weight update for a given node has the following (simple) form:

    Δw_i = −α · ∂E/∂w_i

    Where E is the output error, w_i is the weight of input i to the neuron, and α is the learning rate.

    Essentially, the goal is to move against the gradient with respect to weight i, i.e., downhill on the error surface. The key term is, of course, the derivative of the error, which isn’t always easy to calculate: how would you find this derivative for a random weight of a random hidden node in the middle of a large network?

    The answer: through backpropagation. The errors are first calculated at the output units where the formula is quite simple (based on the difference between the target and predicted values), and then propagated back through the network in a clever fashion, allowing us to efficiently update our weights during training and (hopefully) reach a minimum.
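Putting the pieces together, here is a hedged sketch of one backpropagation step for a single-hidden-layer network with sigmoid units and the mean squared error above. It follows the standard delta rule and is an illustrative re-derivation, not the article's library code; bias updates are omitted for brevity.

```java
// Backpropagation sketch for a single-hidden-layer network with sigmoid units
// and E = 1/2 * sum((t - y)^2). Arrays: x (inputs), h (hidden outputs),
// y (network outputs), t (targets); wInHid[i][j] connects input i to hidden
// unit j, wHidOut[j][k] connects hidden unit j to output unit k; alpha is the
// learning rate. Bias updates are omitted to keep the sketch short.
static void backpropStep(double[] x, double[] h, double[] y, double[] t,
                         double[][] wInHid, double[][] wHidOut, double alpha) {
    // Deltas at the output layer: dE/dy times the derivative of the sigmoid.
    double[] deltaOut = new double[y.length];
    for (int k = 0; k < y.length; k++) {
        deltaOut[k] = (y[k] - t[k]) * y[k] * (1 - y[k]);
    }
    // Deltas at the hidden layer: output errors propagated back through wHidOut.
    double[] deltaHid = new double[h.length];
    for (int j = 0; j < h.length; j++) {
        double sum = 0;
        for (int k = 0; k < y.length; k++) sum += deltaOut[k] * wHidOut[j][k];
        deltaHid[j] = sum * h[j] * (1 - h[j]);
    }
    // Gradient descent: move each weight against its error derivative.
    for (int j = 0; j < h.length; j++)
        for (int k = 0; k < y.length; k++)
            wHidOut[j][k] -= alpha * deltaOut[k] * h[j];
    for (int i = 0; i < x.length; i++)
        for (int j = 0; j < h.length; j++)
            wInHid[i][j] -= alpha * deltaHid[j] * x[i];
}
```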

Hidden Layer

The hidden layer is of particular interest. By the universal approximation theorem, a single hidden layer network with a finite number of neurons can be trained to approximate an arbitrary continuous function to any desired accuracy. In other words, a single hidden layer is powerful enough to learn any such function. That said, we often learn better in practice with multiple hidden layers (i.e., deeper nets).


The hidden layer is where the network stores its internal abstract representation of the training data, similar to the way that a human brain (greatly simplified analogy) has an internal representation of the real world. Going forward in the tutorial, we’ll look at different ways to play around with the hidden layer.

An Example Network

You can see a simple (4-2-3 layer) feedforward neural network that classifies the IRIS dataset implemented in Java here through the testMLPSigmoidBP method. The dataset contains three classes of iris plants with features like sepal length, petal length, etc. The network is provided 50 samples per class. The features are clamped to the input units, while each output unit corresponds to a single class of the dataset: “1/0/0” indicates that the plant is of class Setosa, “0/1/0” indicates Versicolour, and “0/0/1” indicates Virginica. The classification error is 2/150 (i.e., it misclassifies 2 samples out of 150).
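As a small illustration, the class encodings quoted above map to one-hot target arrays like these (the Java variable names are mine, not from the library):

```java
// One-hot targets for the three Iris classes, matching the encoding in the text.
double[] SETOSA      = {1, 0, 0};
double[] VERSICOLOUR = {0, 1, 0};
double[] VIRGINICA   = {0, 0, 1};
```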

The Problem with Large Networks

A neural network can have more than one hidden layer: in that case, the higher layers are “building” new abstractions on top of previous layers. And as we mentioned before, you can often learn better in practice with larger networks.

However, increasing the number of hidden layers leads to two known issues:

  1. Vanishing gradients: as we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.
  2. Overfitting: perhaps the central problem in machine learning. Briefly, overfitting describes the phenomenon of fitting the training data too closely, maybe with hypotheses that are too complex. In such a case, your learner ends up fitting the training data really well, but will perform much, much more poorly on real examples.

Let’s look at some deep learning algorithms to address these issues.

Autoencoders

Most introductory machine learning classes tend to stop with feedforward neural networks. But the space of possible nets is far richer—so let’s continue.

An autoencoder is typically a feedforward neural network which aims to learn a compressed, distributed representation (encoding) of a dataset.

[Figure: an autoencoder network, trained to reproduce its input at its output.]

Conceptually, the network is trained to “recreate” the input, i.e., the input and the target data are the same. In other words: you’re trying to output the same thing you were given as input, but compressed in some way. This may sound confusing, so let’s look at an example.

Compressing the Input: Grayscale Images

Say that the training data consists of 28x28 grayscale images and the value of each pixel is clamped to one input layer neuron (i.e., the input layer will have 784 neurons). Then, the output layer would have the same number of units (784) as the input layer and the target value for each output unit would be the grayscale value of one pixel of the image.
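A minimal sketch of this clamping follows; a dummy 28x28 array stands in for a real grayscale image.

```java
// Sketch of clamping a 28x28 grayscale image to a 784-unit input layer, one
// pixel per input neuron; for an autoencoder the same values are the targets.
// `image` here is a dummy array standing in for real pixel data in [0, 1].
double[][] image = new double[28][28];
double[] inputs = new double[28 * 28];            // 784 input units
for (int row = 0; row < 28; row++)
    for (int col = 0; col < 28; col++)
        inputs[row * 28 + col] = image[row][col]; // clamp each pixel to its input unit
double[] targets = inputs;                        // autoencoder: target equals input
```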

The intuition behind this architecture is that the network will not learn a “mapping” between the training data and its labels, but will instead learn the internal structure and features of the data itself. (Because of this, the hidden layer is also called a feature detector.) Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.


In effect, we want a few small nodes in the middle to really learn the data at a conceptual level, producing a compact representation that in some way captures the core features of our input.

Flu Illness

To further demonstrate autoencoders, let’s look at one more application.

In this case, we’ll use a simple dataset consisting of flu symptoms (credit to this blog post for the idea). If you’re interested, the code for this example can be found in the testAEBackpropagation method.

Here’s how the data set breaks down:

  • There are six binary input features.
  • The first three are symptoms of the illness. For example, 1 0 0 0 0 0 indicates that this patient has a high temperature, while 0 1 0 0 0 0 indicates coughing, 1 1 0 0 0 0 indicates coughing and high temperature, etc.
  • The final three features are “counter” symptoms; when a patient has one of these, it’s less likely that he or she is sick. For example, 0 0 0 1 0 0 indicates that this patient has a flu vaccine. It’s possible to have combinations of the two sets of features: 0 1 0 1 0 0 indicates a vaccinated patient with a cough, and so forth.

We’ll consider a patient to be sick when he or she has at least two of the first three features and healthy if he or she has at least two of the second three (with ties breaking in favor of the healthy patients), e.g.:

  • 111000, 101000, 110000, 011000, 011100 = sick
  • 000111, 001110, 000101, 000011, 000110 = healthy

We’ll train an autoencoder (using backpropagation) with six input and six output units, but only two hidden units.
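For reference, the ten samples listed above can be encoded directly as binary vectors; an autoencoder trainer would then use them as both inputs and targets. The array literal below simply transcribes the list.

```java
// The six-bit flu dataset from the list above, encoded as training samples.
// For an autoencoder, the targets are the samples themselves; the 6-2-6
// network shape mirrors the description in the text.
double[][] samples = {
    {1,1,1,0,0,0}, {1,0,1,0,0,0}, {1,1,0,0,0,0}, {0,1,1,0,0,0}, {0,1,1,1,0,0}, // sick
    {0,0,0,1,1,1}, {0,0,1,1,1,0}, {0,0,0,1,0,1}, {0,0,0,0,1,1}, {0,0,0,1,1,0}  // healthy
};
// An autoencoder with 6 input, 2 hidden, and 6 output units would then be
// trained with backpropagation using `samples` as both inputs and targets.
```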

After several hundred iterations, we observe that when each of the “sick” samples is presented to the machine learning network, one of the two hidden units (the same unit for each “sick” sample) always exhibits a higher activation value than the other. Conversely, when a “healthy” sample is presented, the other hidden unit has a higher activation.

Going Back to Machine Learning

Essentially, our two hidden units have learned a compact representation of the flu symptom data set. To see how this relates to learning, we return to the problem of overfitting. By training our net to learn a compact representation of the data, we’re favoring a simpler representation rather than a highly complex hypothesis that overfits the training data.

In a way, by favoring these simpler representations, we’re attempting to learn the data in a truer sense.


Restricted Boltzmann Machines

The next logical step is to look at the Restricted Boltzmann machine (RBM), a generative stochastic neural network that can learn a probability distribution over its set of inputs.

[Figure: a Restricted Boltzmann Machine with a layer of visible units and a layer of hidden units.]

RBMs are composed of a hidden layer, a visible layer, and a bias layer. Unlike in feedforward networks, the connections between the visible and hidden layers are undirected (the values can be propagated in both the visible-to-hidden and hidden-to-visible directions) and fully connected (each unit from a given layer is connected to each unit in the other; if we also allowed units to connect to any other unit, including those in their own layer, we’d have a Boltzmann machine rather than a restricted Boltzmann machine).

The standard RBM has binary hidden and visible units: that is, the unit activation is 0 or 1 under a Bernoulli distribution, but there are variants with other non-linearities.

While researchers have known about RBMs for some time now, the recent introduction of the contrastive divergence unsupervised training algorithm has renewed interest.

Contrastive Divergence

The single-step contrastive divergence algorithm (CD-1) works like this:

  1. Positive phase:
    • An input sample v is clamped to the input layer.
    • v is propagated to the hidden layer in a similar manner to the feedforward networks. The result of the hidden layer activations is h.
  2. Negative phase:
    • Propagate h back to the visible layer with result v’ (the connections between the visible and hidden layers are undirected and thus allow movement in both directions).
    • Propagate the new v’ back to the hidden layer, with resulting activations h’.
  3. Weight update:

    Δw_ij = a · (v_i · h_j − v’_i · h’_j)

    Where a is the learning rate; v, v’, h, and h’ are the vectors computed in the two phases above, and the update is applied to each weight w_ij.

The intuition behind the algorithm is that the positive phase (h given v) reflects the network’s internal representation of the real world data. Meanwhile, the negative phase represents an attempt to recreate the data based on this internal representation (v’ given h). The main goal is for the generated data to be as close as possible to the real world and this is reflected in the weight update formula.

In other words, the net has some perception of how the input data can be represented, so it tries to reproduce the data based on this perception. If its reproduction isn’t close enough to reality, it makes an adjustment and tries again.
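Here is a hedged sketch of CD-1 for a binary RBM, following the positive phase, negative phase, and weight update described above. Biases are omitted, and the Bernoulli sampling of both layers is a standard simplification, not code from the article's library.

```java
import java.util.Random;

// A sketch of single-step contrastive divergence (CD-1) for a binary RBM.
// w[i][j] connects visible unit i to hidden unit j; a is the learning rate.
class RbmCd1 {
    static final Random RNG = new Random();

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Sample a binary hidden vector h given a visible vector v; the connections
    // are undirected, so propagation works the same way in both directions.
    static double[] sampleHidden(double[] v, double[][] w) {
        double[] h = new double[w[0].length];
        for (int j = 0; j < h.length; j++) {
            double sum = 0;
            for (int i = 0; i < v.length; i++) sum += v[i] * w[i][j];
            h[j] = RNG.nextDouble() < sigmoid(sum) ? 1 : 0; // Bernoulli sample
        }
        return h;
    }

    static double[] sampleVisible(double[] h, double[][] w) {
        double[] v = new double[w.length];
        for (int i = 0; i < v.length; i++) {
            double sum = 0;
            for (int j = 0; j < h.length; j++) sum += h[j] * w[i][j];
            v[i] = RNG.nextDouble() < sigmoid(sum) ? 1 : 0;
        }
        return v;
    }

    static void cd1Step(double[] v, double[][] w, double a) {
        double[] h  = sampleHidden(v, w);   // positive phase: h given v
        double[] v2 = sampleVisible(h, w);  // negative phase: reconstruction v'
        double[] h2 = sampleHidden(v2, w);  // hidden activations h' for v'
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < h.length; j++)
                w[i][j] += a * (v[i] * h[j] - v2[i] * h2[j]); // weight update
    }
}
```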

Returning to the Flu

To demonstrate contrastive divergence, we’ll use the same symptoms data set as before. The test network is an RBM with six visible and two hidden units. We’ll train the network using contrastive divergence with the symptoms v clamped to the visible layer. During testing, the symptoms are again presented to the visible layer; then, the data is propagated to the hidden layer. The hidden units represent the sick/healthy state, a very similar architecture to the autoencoder (propagating data from the visible to the hidden layer).

After several hundred iterations, we can observe the same result as with autoencoders: one of the hidden units has a higher activation value when any of the “sick” samples is presented, while the other is always more active for the “healthy” samples.

Deep Networks

We’ve now demonstrated that the hidden layers of autoencoders and RBMs act as effective feature detectors; but it’s rare that we can use these features directly. In fact, the data set above is more an exception than a rule. Instead, we need to find some way to use these detected features indirectly.

Luckily, it was discovered that these structures can be stacked to form deep networks. These networks can be trained greedily, one layer at a time, to help to overcome the vanishing gradient and overfitting problems associated with classic backpropagation.

The resulting structures are often quite powerful, producing impressive results. Take, for example, Google’s famous “cat” paper, in which they use a special kind of deep autoencoder to “learn” human and cat face detection based on unlabeled data.

Let’s take a closer look.

Stacked Autoencoders

As the name suggests, this network consists of multiple stacked autoencoders.

[Figure: a stacked autoencoder with multiple hidden layers between its input and output layers.]
