
Naftali Tishby: Information Theory of Deep Learning, Talk Translation (Part 2)

To follow this part of the talk, some background is needed first:

Tishby has another video that covers this background in more detail.

1. PAC learning: Probably Approximately Correct. The PAC framework is mainly used to determine whether the data is learnable, how many training samples are needed, and the time and space complexity of learning.

2. Hypothesis set: machine learning essentially learns a mapping f such that f(X) → Y. There are many possible choices of f. As the video above says, if what you are learning is a "circle", then f ranges over the set of all circle equations; by analyzing the samples we are given, we finally pick the best f as the mapping. All of these candidate f's together form the hypothesis space, and the "cardinality of the hypothesis set" is just the number of candidate f's. Take the watermelon example from Zhou Zhihua's machine learning textbook: one hypothesis is "a melon with green color, curled stem and dull knock sound is a good melon", another is "a melon with dark color, curled stem and dull knock sound is a good melon"; there are 65 such choices in total, these 65 choices form the hypothesis set, and 65 is its cardinality. See my earlier blog post for reference.

https://blog.csdn.net/qq_20936739/article/details/77982056

3. ε is the gap between the empirical error and the generalization error (the error measured on our sample set versus the model's true error). The smaller ε is, the closer the empirical error is to the generalization error, and the better the error we measure reflects the true error on unseen samples when the model is actually used. ε cannot be computed exactly, but its upper bound can; making that upper bound small enough guarantees that the empirical error reflects the generalization error.

4. δ is the confidence parameter: the inequality holds with probability at least 1−δ. The closer δ gets to 1, the less certain we are that the inequality holds, but ε is then smaller. In other words, we cannot guarantee that the bound holds 100% of the time, but whenever it does, the empirical error is that much closer to the true error. The log(1/δ) term is usually negligible.

5. On H_ε: the full set of hypotheses is actually very large and hard to pin down, so it is usually approximated with "balls" whose size depends on ε and on the VC dimension. My own understanding is illustrated in the figure below.

6. m is the number of training samples; this needs no explanation: the more samples, the better. (The bound tying ε, δ, |H_ε| and m together is sketched right after this list.)
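Putting notes 3 through 6 together: the "old" bound discussed in the talk below has, up to constants, roughly the following shape. This is my own rendering of the standard ε-cover bound, not copied from the slides, so treat the constants as indicative only.

```latex
% With probability at least 1 - \delta over the draw of m training samples,
% the gap \epsilon between empirical and generalization error satisfies
\epsilon^{2} \;\lesssim\; \frac{\log \lvert H_{\epsilon} \rvert + \log \tfrac{1}{\delta}}{2m}
% where |H_epsilon| is the cardinality of an \epsilon-cover of the hypothesis class.
% For a class of dimension d (e.g. its VC dimension), |H_epsilon| ~ (1/\epsilon)^{d},
% so the dominant factor in the bound is d/m.
```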

All right, here is the divider where the talk itself begins ~~~~~~~~~~

So I hope that some of you know something about PAC learning and learning theory, and are familiar with what I call the old type of generalization bound. Essentially, the generalization error, the probability of error outside my training data, or rather its square, is bounded by the log of the hypothesis class cardinality. For a finite class this is simply what's called the cardinality bound, but for a general class we usually use what we call an epsilon cover of the class: we actually sample it on a grid such that all the hypotheses in a cell are epsilon-close to each other, and then we can settle for the log-cardinality of an epsilon cover of the hypothesis class divided by the number of samples, times some constant I don't care about, plus a small term that has to do with the confidence and is completely negligible for typical large problems. Okay, everybody who took a first course in learning theory knows this bound. And then usually we have something like the VC dimension, or other dimensionalities of the class, that tells us that the cardinality of an epsilon cover of the class scales like one over epsilon to some dimension d. The moment I plug this in here, I get d/m as the main factor, which is really telling me that as long as the number of examples is smaller than the dimension of the class, you are not generalizing; once it is above it, you start to generalize like 1/sqrt(m), or like one over some other power of m.

That's classical. The problem, as I'm sure you all know but maybe don't appreciate, or maybe don't know, is that deep neural networks don't use this bound. It is useless for deep learning; it doesn't work. Why is it useless? It actually gets even worse when you show me that the network can express very complicated functions: those very sophisticated expressivity bounds essentially push the dimension higher. So this dimension is now of the order of millions or tens of millions, while I have hundreds of thousands of samples, and the network actually generalizes very well. So obviously this doesn't explain anything, and it has moved many people in the wrong direction.

I hope you know something about PAC learning and learning theory; the "old" generalization bound I am about to describe should then be very familiar. The probability of error on samples outside the training set, in squared or square-root form, is bounded by log|H|, where H is the hypothesis space. For a finite hypothesis space this is called the "cardinality bound". For a general class we instead use an ε-cover of the hypothesis space: we sample the space so that the hypotheses within each cell are all close to each other, and the bound becomes the log-cardinality of this cover divided by the number of samples, plus a small constant related to the confidence, which is usually negligible. Anyone who has studied learning theory has seen this bound. Usually, through the VC dimension or some other notion of dimension, we know that the cardinality of an ε-cover scales like one over ε raised to that dimension. Plugging this in, the ratio of dimension to sample count, d/m, becomes the dominant factor: as long as the number of samples is smaller than the dimension of the class, the bound exceeds 1 and tells us nothing about generalization; above it, the generalization gap shrinks like one over the square root (or some other power) of m.

That is the classical picture. The problem, which you may or may not be aware of, is that deep learning cannot use this bound; it simply does not work. Why not? Deep networks can express extremely complicated functions, so very sophisticated expressivity bounds only push the dimension higher. The dimension can be in the tens of millions, yet with only hundreds of thousands of samples these networks still generalize very well. So the classical bound explains nothing here, and it has even led many people in the wrong direction.
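To make the "useless for deep learning" point concrete, here is a small Python sketch (my own illustration, not from the talk) that plugs deep-net-scale numbers into the ε-cover bound above. The cover size |H_ε| ~ (1/ε)^d and the constants are textbook-style assumptions, so only the orders of magnitude matter.

```python
import math

def classical_gap(d, m, cover_eps=0.1, delta=0.05):
    """Rough generalization gap from the classical epsilon-cover bound:
    eps^2 <= (log|H_eps| + log(1/delta)) / (2m), with |H_eps| ~ (1/cover_eps)^d.
    Purely illustrative; the constants differ between textbooks."""
    log_cover = d * math.log(1.0 / cover_eps)      # log|H_eps| ~ d * log(1/eps)
    return math.sqrt((log_cover + math.log(1.0 / delta)) / (2.0 * m))

# Classical regime: dimension well below the sample count -> a meaningful bound.
print(classical_gap(d=100, m=100_000))             # ~0.03

# Deep-net scale mentioned in the talk: ~1e7 parameters, ~1e5 samples.
print(classical_gap(d=10_000_000, m=100_000))      # ~10, i.e. the bound is vacuous
```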

So I suggest a different bound. I call it the input compression bound. It's quite new, and it's actually surprising that it is related to many things people have known for a long time, like nearest neighbors and many other partition methods. So instead of focusing on the hypothesis class, which I actually think is a completely irrelevant notion for deep learning, I think about what happens to the input. Now, I already told you that the layers induce a partition of the input, and I'm going to quantify this partition by how homogeneous its cells are with respect to the label. So if I somehow manage to compress my input, to cover my input space with cells that are more or less homogeneous with respect to the label, then I'm doing a much finer job in terms of (?). So imagine that I cover the space of inputs X. In the case of images, say, it is all the possible images I care about; it's a very big space, and I cover it with spheres, each of which is essentially a group of images, a cluster of images. If you want, it can be a soft or a hard partition. Essentially, what I'm saying is that I can then replace the cardinality of the hypothesis class. Take boolean functions just for simplicity: it moves from 2 to the cardinality of X, which counts all boolean functions on X, down to 2 to the cardinality of the partition. Because if I manage to epsilon-cover my input, then the number of labels I really need moves from 2^|X| to 2^|T_epsilon|, essentially one label per cell of the partition. So this is an exponential decrease, which is okay.

So I suggest a different bound, which I call the input compression bound. The idea is new, yet it is related to many long-familiar methods such as nearest neighbors and other clustering methods. We do not need to worry about the hypothesis space; that notion is essentially irrelevant for deep learning. What we need to think about is what happens to the input. I have already explained that the network's layers effectively partition the input, so I want to quantify this partition, and the criterion is how homogeneous its cells are with respect to the labels. Imagine that I have covered the whole input space. For the space of all possible images this input space is enormous, and I cover it with small spheres, which in effect groups, or clusters, the images. The number of elements in the hypothesis space then changes: taking boolean functions for simplicity, it drops from 2^N to 2^k, where N is the number of elements in the original input space X and k is the number of cells X is partitioned into. Since we use an ε-cover, the count becomes 2^|T_ε|: from the point of view of the labels, each cell of the partition needs just one label. That looks good, because the number of possible hypotheses decreases exponentially.
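Written out, and keeping the notation of the classical bound above, the substitution Tishby describes looks like the following. This is my own rendering; the identification of the number of cells with 2^{I(T;X)} comes from the typical-set counting argument in Tishby's related paper with Shwartz-Ziv, not from this part of the talk.

```latex
% Classical cardinality bound for boolean labels on a finite input space X:
\log \lvert H \rvert = \log 2^{\lvert X \rvert} = \lvert X \rvert \log 2
% Input compression bound: one label per cell of an \epsilon-partition T_\epsilon of X:
\log \lvert H_{\epsilon} \rvert = \log 2^{\lvert T_{\epsilon} \rvert} = \lvert T_{\epsilon} \rvert \log 2
% so the generalization gap becomes
\epsilon^{2} \;\lesssim\; \frac{\lvert T_{\epsilon} \rvert \log 2 + \log \tfrac{1}{\delta}}{2m}
% With |T_epsilon| ~ 2^{I(T;X)} (typical-set counting), every bit of input
% compression roughly halves the number of samples needed for the same gap.
```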

Now there are two questions. The first one is: how do I make sure that the partition I get is indeed homogeneous with respect to the labels? This requires some sort of distortion function. I mean, I am compressing, so I can use rate-distortion theory: I want my grid, my codebook, my partition, to have small distortion. The only thing I'm saying is that the distortion we need to use is... well... if I use the information bottleneck distortion, it actually bounds the L1 distortion, which means that if this is an epsilon-partition with respect to one of them, it is also an epsilon-partition with respect to the other; it is the same topology. Therefore, since the average information bottleneck distortion is precisely related to the mutual information in the representation, minimizing it is equivalent to maximizing the mutual information on the output. Okay, that's very nice, though not surprising to anyone.

But there are two questions here. The first is that I have to guarantee that my partition really is homogeneous with respect to the labels, and this requires some distortion function as the measure. What I mean is that the compression can be handled with rate-distortion theory: I want the grid, the codewords, that is, the partition itself, to have a small distortion. The distortion function we need is... well... if we use the information bottleneck distortion, it bounds the L1 distortion, so an ε-partition with respect to one is also an ε-partition with respect to the other; they define the same topology. And since the average information bottleneck distortion is directly tied to the mutual information carried by the representation, minimizing it is equivalent to maximizing the mutual information with the output.
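A compact way to write what the last two paragraphs say, using the standard information bottleneck definitions. This is my own summary; the Pinsker step is the usual way the "bounds the L1 distortion" claim is made precise.

```latex
% Information bottleneck distortion between an input x and its cell/representation t:
d_{IB}(x, t) = D_{KL}\!\left[\, p(y \mid x) \,\Vert\, p(y \mid t) \,\right]
% Pinsker's inequality: the KL distortion controls the squared L1 distortion,
\left\lVert p(y \mid x) - p(y \mid t) \right\rVert_{1}^{2} \le 2\, d_{IB}(x, t)
% so an \epsilon-partition under d_IB is also an \epsilon-partition under L1.
% Averaging over the joint distribution (with the Markov chain Y - X - T):
\mathbb{E}_{p(x,t)}\!\left[ d_{IB}(x, t) \right] = I(X; Y) - I(T; Y)
% hence minimizing the average IB distortion is the same as maximizing I(T; Y).
```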