Distilling the Knowledge in a Neural Network

阿新 • Published: 2020-10-26

Overview
Main content
Code

Hinton G., Vinyals O. & Dean J. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.

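Overview

Knowledge distillation with a temperature-scaled softmax, where \(z_i\) are the logits and \(T\) is the temperature:

\[q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}. \]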
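
To make the role of the temperature \(T\) concrete, here is a minimal pure-Python sketch of the softmax above. The function name `softmax_with_temperature` and the logits are made up for illustration; the only point is that a larger \(T\) spreads probability mass more evenly while keeping the ranking of the classes.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 0.5]  # made-up logits for three classes
for T in (1.0, 5.0, 20.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At \(T = 1\) this is the ordinary softmax; as \(T\) grows, the output approaches the uniform distribution.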

Main content

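The focus of this paper is arguably transfer learning. A key claim is that soft labels (i.e. probability vectors) carry more information than hard targets (one-hot vectors). For example, if a digit classifier assigns an image of the digit \(2\) probabilities 0.1 and 0.01 of being a \(3\) and a \(7\) respectively, this suggests that this particular \(2\) looks rather like a \(3\), which is information a one-hot label cannot convey.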

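So the setting is as follows:

We already have a trained model \(t\) that usually performs well but is computationally expensive;
We want a small model \(s\) to approximate this existing model;
The strategy is, for each sample \(x\), to first obtain the teacher logits \(v = t(x) \in \mathbb{R}^K\) and the student logits \(z = s(x) \in \mathbb{R}^K\), where \(K\) is the number of classes and neither has been passed through a softmax;
Finally, we train \(s\) with the following loss: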

\[\mathcal{L}(x, y) = T^2 \cdot \mathcal{L}_{soft}(x, y) + \lambda \cdot\mathcal{L}_{hard}(x, y) \]

where

\[\mathcal{L}_{soft}(x, y) = -\sum_{i=1}^K p_i(x) \log q_i (x) = -\sum_{i=1}^K \frac{\exp(v_i(x)/T)}{\sum_j \exp(v_j(x)/T)} \log \frac{\exp(z_i(x)/T)}{\sum_j \exp(z_j(x)/T)} \]

\[\mathcal{L}_{hard}(x, y) = -\log \frac{\exp(z_y(x))}{\sum_j \exp(z_j(x))} \]
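
The combined loss above can be sketched in PyTorch as follows. This is an illustrative sketch, not the paper's reference code: the function name `distillation_loss`, the defaults `T=4.0` and `lam=1.0`, and the random tensors standing in for \(s(x)\) and \(t(x)\) are all assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(z, v, y, T=4.0, lam=1.0):
    """T^2 * L_soft + lam * L_hard, averaged over the batch.

    z: student logits, shape (N, K)
    v: teacher logits, shape (N, K)
    y: integer class labels, shape (N,)
    """
    p = F.softmax(v / T, dim=1)            # teacher soft targets p_i
    log_q = F.log_softmax(z / T, dim=1)    # student log-probabilities log q_i
    soft = -(p * log_q).sum(dim=1).mean()  # cross-entropy against soft targets
    hard = F.cross_entropy(z, y)           # -log softmax(z)_y at T = 1
    return T ** 2 * soft + lam * hard

torch.manual_seed(0)
z = torch.randn(8, 10, requires_grad=True)  # stand-in for s(x)
v = torch.randn(8, 10)                      # stand-in for t(x)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(z, v, y)
loss.backward()
print(loss.item(), z.grad.shape)
```

In real training the teacher logits would normally be computed under `torch.no_grad()` so that no gradient flows into \(t\).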

As for where the \(T^2\) comes from: it is there to balance the magnitudes of the gradients.

\[\begin{array}{ll} \frac{\partial \mathcal{L}_{soft}}{\partial z_k} &= -\sum_{i=1}^K \frac{p_i}{q_i} \frac{\partial q_i}{\partial z_k} = -\frac{1}{T}p_k -\sum_{i=1}^K \frac{p_i}{q_i} \cdot (-\frac{1}{T}q_i q_k) \\ &= -\frac{1}{T} (p_k -\sum_{i=1}^K p_iq_k) = \frac{1}{T}(q_k-p_k) \\ &= \frac{1}{T} (\frac{e^{z_k/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_k/T}}{\sum_j e^{v_j/T}}) . \end{array} \]

When \(T\) is sufficiently large, using \(e^{x/T} \approx 1 + x/T\) and assuming the logits are zero-mean, \(\sum_j z_j = \sum_j v_j = 0\), we get

\[\frac{\partial \mathcal{L}_{soft}}{\partial z_k} \approx \frac{1}{KT^2} (z_k - v_k). \]

Hence a factor of \(T^2\) is needed to cancel this effect.
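
As a quick numerical sanity check of the high-temperature approximation, the sketch below compares the exact gradient \((q_k - p_k)/T\) with \((z_k - v_k)/(KT^2)\) for zero-mean logits. It is pure Python; the helper names and the logits are made up for illustration.

```python
import math

def softmax(logits, T):
    """softmax(logits / T), computed stably."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def center(logits):
    """Shift logits so that they sum to zero."""
    mean = sum(logits) / len(logits)
    return [x - mean for x in logits]

def relative_error(z, v, T):
    """Largest gap between (q_k - p_k) / T and (z_k - v_k) / (K T^2), relative to the latter."""
    K = len(z)
    q, p = softmax(z, T), softmax(v, T)
    exact = [(qk - pk) / T for qk, pk in zip(q, p)]
    approx = [(zk - vk) / (K * T ** 2) for zk, vk in zip(z, v)]
    scale = max(abs(a) for a in approx)
    return max(abs(e - a) for e, a in zip(exact, approx)) / scale

z = center([1.2, -0.3, 0.5, -2.0])  # made-up student logits
v = center([2.0, 0.1, -1.0, 0.4])   # made-up teacher logits
for T in (1.0, 10.0, 100.0):
    print(f"T = {T:5.1f}  relative error = {relative_error(z, v, T):.4f}")
```

The relative error shrinks as \(T\) grows, which is consistent with the soft-loss gradient scaling like \(1/T^2\) at high temperature.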

Code

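One thing I have long been curious about is how this part is implemented in PyTorch, since PyTorch's cross-entropy takes a class index \(y\) as the target and computes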

\[-\log p_y(x) \]

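Another annoying detail: everyone seems to use `KLDivLoss`, yet its implementation turns out to be the following, i.e. the input \(x\) is expected to already be in log space: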

\[\mathcal{L}(x, y) = y \cdot \log y - y \cdot x, \]

Note: here \(\cdot\) denotes the elementwise product.

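import torch
import torch.nn.functional as F


def kl_div(x, y):
    # Elementwise y * (log y - x): x is the input (log space), y is the target
    return y * (torch.log(y) - x)


x = torch.randn(2, 3)
y = torch.randn(2, 3).abs() + 1  # strictly positive so that log(y) is defined

loss1 = F.kl_div(x, y, reduction="none")
loss2 = kl_div(x, y)

print(loss1)
print(loss2)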

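The output then looks like this (the exact values depend on the random draw; the point is that the two tensors match):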

tensor([[-1.5965,  2.2040, -0.8753],
        [ 3.9795,  0.0910,  1.0761]])
tensor([[-1.5965,  2.2040, -0.8753],
        [ 3.9795,  0.0910,  1.0761]])

Or, with a reduction:

def kl_div(x, y):
    # Sum over classes, then average over the batch (matches reduction="batchmean")
    return (y * (torch.log(y) - x)).sum(dim=1).mean()


torch.manual_seed(10086)

x = torch.randn(2, 3)
y = torch.randn(2, 3).abs() + 1

loss1 = F.kl_div(x, y, reduction="batchmean")
loss2 = kl_div(x, y)

print(loss1)
print(loss2)

tensor(2.4394)
tensor(2.4394)

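So to do it properly, the soft loss should be computed as follows (the \(T^2\) factor from the total loss still has to be applied outside):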

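def soft_loss(z, v, T=10.):
    # z: student logits, shape (N, K)
    # v: teacher logits, shape (N, K)
    log_q = F.log_softmax(z / T, dim=1)  # F.kl_div expects log-probabilities as input
    p = F.softmax(v / T, dim=1)          # and probabilities as target
    return F.kl_div(log_q, p, reduction="batchmean")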
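
One might wonder whether using KL divergence instead of the cross-entropy \(\mathcal{L}_{soft}\) changes anything. The two differ only by the entropy of the teacher distribution, which does not depend on the student logits, so the gradients with respect to \(z\) should agree up to floating-point error. The sketch below checks this numerically; the function names `soft_loss_kl` and `soft_loss_ce` and the random inputs are made up for illustration.

```python
import torch
import torch.nn.functional as F

def soft_loss_kl(z, v, T=10.0):
    """KL(p || q) with p = softmax(v / T), q = softmax(z / T)."""
    log_q = F.log_softmax(z / T, dim=1)
    p = F.softmax(v / T, dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")

def soft_loss_ce(z, v, T=10.0):
    """Cross-entropy -sum_i p_i log q_i, averaged over the batch."""
    log_q = F.log_softmax(z / T, dim=1)
    p = F.softmax(v / T, dim=1)
    return -(p * log_q).sum(dim=1).mean()

torch.manual_seed(0)
z = torch.randn(4, 5, requires_grad=True)  # stand-in student logits
v = torch.randn(4, 5)                      # stand-in teacher logits

kl = soft_loss_kl(z, v)
ce = soft_loss_ce(z, v)
(grad_kl,) = torch.autograd.grad(kl, z)
(grad_ce,) = torch.autograd.grad(ce, z)

# Entropy of the teacher distribution, averaged over the batch
p = F.softmax(v / 10.0, dim=1)
teacher_entropy = -(p * p.log()).sum(dim=1).mean()

print(torch.allclose(ce - kl, teacher_entropy, atol=1e-5))
print(torch.allclose(grad_kl, grad_ce, atol=1e-5))
```

For training the student, the two formulations are therefore interchangeable; only the reported loss value is shifted by a constant.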
