1804.03235-Large scale distributed neural network training through online distillation.md

阿新 • Published: 2018-07-06


Existing approaches to distributed model training

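Distributed SGD
- Synchronous (parallel) SGD: in large-scale training, the duration of each step is determined by the slowest machine.
- Asynchronous SGD: updates computed from out-of-sync data can push the weights in an unknown direction.

Parallel multi-model training: several clusters train different models, which are then combined into the final model, but this increases the cost at inference time.

Distillation: the workflow is complicated
- Choosing the training dataset for the student
  - unlabeled data
  - the original data
  - held-out data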
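The straggler problem of synchronous SGD noted above can be illustrated with a few lines of arithmetic. This is only a sketch: the per-worker timings are made-up numbers, and `synchronous_step_time` and `total_sync_time` are hypothetical helper names, not anything from the paper.

```python
def synchronous_step_time(worker_times):
    """A synchronous step finishes only when the slowest worker has finished."""
    return max(worker_times)


def total_sync_time(per_step_worker_times):
    """Total wall-clock time when every step waits for all workers."""
    return sum(synchronous_step_time(times) for times in per_step_worker_times)


# Hypothetical compute times (seconds) for 4 workers over 3 steps.
steps = [
    [1.0, 1.1, 0.9, 1.0],
    [1.0, 1.0, 3.0, 1.0],  # one straggler dominates this step
    [0.9, 1.0, 1.0, 1.2],
]

sync_total = total_sync_time(steps)                      # 1.1 + 3.0 + 1.2
mean_total = sum(sum(t) / len(t) for t in steps)         # what the average worker needed
print(f"synchronous: {sync_total:.2f}s, average worker: {mean_total:.3f}s")
```

With more workers, the chance that at least one of them is slow in any given step goes up, which is why the gap between the two totals tends to widen at scale.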

Codistillation

using the same architecture for all the models;
using the same dataset to train all the models; and
using the distillation loss during training before any model has fully converged.

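Key properties
- Even when the teacher and the student use exactly the same model configuration, a useful gain is still possible as long as what they have learned is sufficiently different.
- Even models that have not yet converged provide a benefit.
- Dropping the distinction between teacher and student and letting the models train each other is also beneficial.
- The models do not need to be synchronized.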

[Image not preserved: the codistillation algorithm from the paper]

The algorithm is easy to follow, and the steps do not look very complicated.
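To make the idea concrete, here is a minimal per-example sketch of a codistillation loss: the usual cross-entropy against the label, plus a distillation term that pulls the model toward the averaged predictions of its peers. The function names, the `burn_in_steps` switch, and the `distill_weight` parameter are assumptions made for illustration; this is not the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def codistillation_loss(own_logits, label, peer_probs_list, step,
                        burn_in_steps, distill_weight=1.0):
    """Label loss, plus a peer-matching term once the (assumed) burn-in is over."""
    own_probs = softmax(own_logits)
    one_hot = [1.0 if k == label else 0.0 for k in range(len(own_logits))]
    loss = cross_entropy(one_hot, own_probs)
    if step >= burn_in_steps and peer_probs_list:
        n = len(peer_probs_list)
        # Average the peers' predicted distributions and use them as a soft target.
        avg_peer = [sum(p[k] for p in peer_probs_list) / n
                    for k in range(len(own_logits))]
        loss += distill_weight * cross_entropy(avg_peer, own_probs)
    return loss


own_logits = [2.0, 0.5, -1.0]
peers = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]  # hypothetical peer predictions
before = codistillation_loss(own_logits, 0, peers, step=0, burn_in_steps=100)
after = codistillation_loss(own_logits, 0, peers, step=100, burn_in_steps=100)
print(before, after)
```

Because the peers' predictions enter only as fixed targets, each model can be trained with its own optimizer and no gradient ever has to cross between models.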

Why it is acceptable to use out-of-date (stale) model weights:

every change in weights leads to a change in gradients, but as training progresses towards convergence, weight updates should substantially change only the predictions on a small subset of the training data;

weights (and gradients) are not statistically identifiable as different copies of the weights might have arbitrary scaling differences, permuted hidden units, or otherwise rotated or transformed hidden layer feature space so that averaging gradients does not make sense unless models are extremely similar;

sufficiently out-of-sync copies of the weights will have completely arbitrary differences that change the meaning of individual directions in feature space that are not distinguishable by measuring the loss on the training set;
in contrast, output units have a clear and consistent meaning enforced by the loss function and the training data.
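The point about permuted hidden units can be checked on a toy network. The sketch below uses a hand-picked two-unit ReLU network (all weights are made-up values): swapping the hidden units leaves the output unchanged, but averaging the two weight sets produces a different function, whereas averaging the outputs is harmless.

```python
def relu(v):
    return max(0.0, v)


def forward(hidden_weights, output_weights, x):
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    return sum(o * h for o, h in zip(output_weights, hidden))


# Copy A of the weights.
w1_a = [[1.0, -1.0], [0.5, 2.0]]
w2_a = [2.0, -3.0]

# Copy B: the same function with the two hidden units swapped.
w1_b = [w1_a[1], w1_a[0]]
w2_b = [w2_a[1], w2_a[0]]

# Naive element-wise average of the two weight copies.
w1_avg = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(w1_a, w1_b)]
w2_avg = [(a + b) / 2 for a, b in zip(w2_a, w2_b)]

x = [1.0, 0.5]
out_a = forward(w1_a, w2_a, x)          # -3.5
out_b = forward(w1_b, w2_b, x)          # -3.5, identical predictions
out_avg = forward(w1_avg, w2_avg, x)    # -1.0, averaging weights broke the model
out_mean = (out_a + out_b) / 2          # -3.5, averaging outputs is well defined
print(out_a, out_b, out_avg, out_mean)
```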

So this seems to be saying that randomness is beneficial?

A guiding design for a practical framework:

Each worker trains an independent version of the model on a locally available subset of the training data.
Occasionally, workers checkpoint their parameters.
Once this happens, other workers can load the freshest available checkpoints into memory and perform codistillation.
On top of that, distributed SGD can still be used within each of the smaller clusters.
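The checkpoint-exchange scheme described above can be sketched as a small simulation. Everything here is hypothetical: `CheckpointStore` is an in-memory stand-in for shared storage, the "training" is a placeholder update, and `teacher_lag` is just a helper for inspecting how old the loaded teachers are.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for a shared checkpoint directory."""

    def __init__(self):
        self._latest = {}  # worker_id -> (step, params)

    def publish(self, worker_id, step, params):
        self._latest[worker_id] = (step, list(params))

    def freshest_peers(self, worker_id):
        """Latest published checkpoint of every other worker."""
        return {wid: ckpt for wid, ckpt in self._latest.items() if wid != worker_id}


class Worker:
    def __init__(self, worker_id, store, checkpoint_every):
        self.worker_id = worker_id
        self.store = store
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.params = [0.0]
        self.teachers = {}  # peer id -> (step, frozen params)

    def train_step(self):
        # Placeholder for a real local SGD update on this worker's data shard.
        self.params = [p + 0.1 for p in self.params]
        self.step += 1
        if self.step % self.checkpoint_every == 0:
            self.store.publish(self.worker_id, self.step, self.params)
            # Peers are reloaded only occasionally, so teachers are stale in between.
            self.teachers = self.store.freshest_peers(self.worker_id)

    def teacher_lag(self):
        """How many steps each loaded teacher is behind this worker's own step."""
        return {wid: self.step - step for wid, (step, _) in self.teachers.items()}


store = CheckpointStore()
a = Worker("a", store, checkpoint_every=5)
b = Worker("b", store, checkpoint_every=5)
for _ in range(12):
    a.train_step()
    b.train_step()

print(a.teacher_lag(), b.teacher_lag())
```

In a real setup the frozen teacher parameters would be used to compute predictions for the distillation term; the only coupling between workers is the occasional read of a checkpoint.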

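The paper also points out that, compared with sending gradients and weights at every step, this approach only needs to load a checkpoint occasionally, and the model clusters are completely independent of one another computationally. That does indeed avoid some problems.
But what if one of the models breaks down and does not converge at all?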

Also, I do not see what is simple about this framework; managing models and checkpoints is not a trivial matter.

Experimental findings

20 TB of data. Deep pockets indeed.

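The paper notes that more machines do not necessarily give a better final model. Roughly 32-128 seems to be the appropriate range; beyond that, convergence speed and model quality do not improve and sometimes even get worse.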

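In the paper's result 2a, the two-model ensemble is still the best, followed by codistillation. Unigram smoothing at 0.9 is the worst, and label smoothing at 0.99 performs about the same as plain training, since it is after all just random noise.
In addition, comparing codistillation on the same data (2b) with codistillation on random data, the experiments show that random data actually gives the model better performance.
Experiment 3, on ImageNet, shows much the same result as 2a.
In 4, although the very latest model is not strictly required, codistillation with checkpoints that are too old still noticeably reduces training efficiency.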
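For reference, the label smoothing and unigram smoothing baselines mentioned above can be sketched as interpolating the one-hot label with a fixed prior. This is a rough illustration only: it assumes that 0.99 and 0.9 are the weights kept on the true label, and the class frequencies in `unigram` are invented.

```python
def smoothed_target(label, num_classes, true_weight, prior):
    """Interpolate between the one-hot label and a fixed prior distribution."""
    return [
        true_weight * (1.0 if k == label else 0.0) + (1.0 - true_weight) * prior[k]
        for k in range(num_classes)
    ]


num_classes = 4
uniform = [1.0 / num_classes] * num_classes
unigram = [0.5, 0.3, 0.15, 0.05]  # hypothetical class frequencies

# Assumption: the quoted numbers are the weight on the true label.
uniform_target = smoothed_target(2, num_classes, 0.99, uniform)
unigram_target = smoothed_target(2, num_classes, 0.9, unigram)
print(uniform_target)
print(unigram_target)
```

Both targets depend only on the label, never on the input example, whereas a peer model's predictions differ from example to example. That is one way to read why these baselines carry less information than codistillation.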

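Underfit models are useful, but overfit models may not be of much value in distillation.
Codistillation converges faster than two-phase distillation and is more efficient.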

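Section 3.5 describes a problem that comes up often in practice: because of differences in initialization, training-process parameters, and so on, two training runs can produce models whose outputs differ considerably. With a classifier, for example, the previous run may be accurate on certain classes while the current run is not. Model averaging or distillation can effectively avoid this problem.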
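One simple way to quantify this run-to-run difference is to compare the predictions of two retrained models on the same examples. The two metrics below are a sketch with invented predictions; the function names are hypothetical and the paper's exact measurement may differ.

```python
def mean_abs_prediction_difference(preds_a, preds_b):
    """Average absolute difference between two models' predicted probabilities."""
    total, count = 0.0, 0
    for pa, pb in zip(preds_a, preds_b):
        for a, b in zip(pa, pb):
            total += abs(a - b)
            count += 1
    return total / count


def label_disagreement_rate(preds_a, preds_b):
    """Fraction of examples on which the two models pick a different class."""
    def argmax(p):
        return max(range(len(p)), key=p.__getitem__)

    differing = sum(1 for pa, pb in zip(preds_a, preds_b) if argmax(pa) != argmax(pb))
    return differing / len(preds_a)


# Hypothetical predictions from two training runs on the same three examples.
run_1 = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
run_2 = [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]

mad = mean_abs_prediction_difference(run_1, run_2)
rate = label_disagreement_rate(run_1, run_2)
print(f"mean abs difference: {mad:.4f}, label disagreement: {rate:.3f}")
```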

Summary

Blah blah blah.
The experiments only tried two models; multiple models arranged in various topologies are also worth trying.

A paper well worth reading.
