【DMCP】2020-CVPR-DMCP Differentiable Markov Channel Pruning for Neural Networks - Paper Reading
DMCP
2020-CVPR-DMCP Differentiable Markov Channel Pruning for Neural Networks
- Shaopeng Guo (SenseTime)
- GitHub: 64 stars
- https://github.com/zx55/dmcp
Introduction
propose a novel differentiable channel pruning method named Differentiable Markov Channel Pruning (DMCP) to perform efficient optimal sub-structure searching.
This paper proposes DMCP (Differentiable Markov Channel Pruning) to efficiently search for the optimal sub-structure.
At the same FLOPs, our method outperforms all the other pruning methods both on MobileNetV2 and ResNet, as shown in Figure 1.
With our method, MobileNetV2 has 0.1% accuracy drop with 30% FLOPs reduction and the FLOPs of ResNet-50 is reduced by 44% with only 0.4% drop.
Motivation
Recent works imply that the channel pruning can be regarded as searching optimal sub-structure from unpruned networks.
Channel pruning can be viewed as searching for the optimal sub-structure within the unpruned network (the structure found by pruning matters more than the inherited weights).
However, existing works based on this observation require training and evaluating a large number of structures, which limits their application.
Previous works based on this observation need to train and evaluate many sub-structures, which is expensive.
Conventional channel pruning methods mainly rely on the human-designed paradigm.
Conventional channel pruning mainly relies on hand-designed criteria (channel-importance metrics).
the structure of the pruned model is the key of determining the performance of a pruned model, rather than the inherited “important” weights.
The structure of the pruned network matters more for its performance than the inherited "important" weights.
The optimization of these pruning processes needs to train and evaluate a large number of structures sampled from the unpruned network, so the scalability of these methods is limited.
These (sub-structure-searching) pruning methods have to train and evaluate a large number of networks, which limits their scalability (i.e. their ability to prune networks of different sizes).
A similar problem in neural architecture search (NAS) has been tackled by differentiable method DARTS
A similar problem in NAS has already been solved by the differentiable method DARTS.
P.S. Differences from DARTS:
First, the definition of search space is different. The search space of DARTS is a category of pre-defined operations (convolution, max-pooling, etc.), while in channel pruning, the search space is the number of channels in each layer.
First, the search spaces differ: DARTS searches over a set of pre-defined operations, whereas here the search space is the number of channels in each layer.
Second, the operations in DARTS are independent of each other. But in channel pruning, if a layer has k + 1 channels, it must have at least k channels first, which is a logical implication relationship.
Second, the operations in DARTS are independent of one another (e.g. the candidate operations on an edge between two nodes, such as convolution and pooling, do not affect each other), but in channel pruning, a layer with k+1 channels must first have k channels.
Contribution
Our method makes the channel pruning differentiable by modeling it as a Markov process.
We make channel pruning differentiable by modeling it as a Markov process.
Method
Our method is differentiable and can be directly optimized by gradient descent with respect to standard task loss and budget regularization (e.g. FLOPs constraint).
In DMCP, channel pruning is modeled as a Markov process: each state represents whether the corresponding channel is retained, and transitions between states represent the pruning process.
In the Markov process for each layer, the state \(S_k\) represents that the \(k^{th}\) channel is retained, and the transition from \(S_k\) to \(S_{k+1}\) represents the probability of retaining the \((k+1)^{th}\) channel given that the \(k^{th}\) channel is retained.
Each layer has its own Markov process: state \(S_k\) means the \(k^{th}\) channel is retained, and the transition from \(S_k\) to \(S_{k+1}\) is the probability of retaining the \((k+1)^{th}\) channel.
Note that the start state is always \(S_1\) in our method.
\(S_1\) is the start state, i.e. every layer keeps at least one channel.
Then the marginal probability for state \(S_k\), i.e. the probability of retaining the \(k^{th}\) channel, can be computed as the product of transition probabilities and can also be viewed as a scaling coefficient.
So the marginal probability of state \(S_k\) (retaining the \(k^{th}\) channel) equals the product of all preceding transition probabilities and can be viewed as a scaling coefficient for the \(k^{th}\) channel.
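As a worked equation (using the notation above; the paper's exact symbols may differ), write \(p_i = p(S_{i+1} \mid S_i)\) for the transition probabilities and \(p(w_k)\) for the marginal probability of retaining the \(k^{th}\) channel:
\[
p(w_1) = 1, \qquad p(w_k) = \prod_{i=1}^{k-1} p(S_{i+1} \mid S_i) = \prod_{i=1}^{k-1} p_i \quad (k \ge 2).
\]
Retaining channel \(k\) therefore requires every earlier transition, which encodes the constraint that keeping \(k+1\) channels implies keeping the first \(k\).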
Each scaling coefficient is multiplied to its corresponding channel’s feature map during the network forwarding.
During the forward pass, each channel's feature map is multiplied by that channel's marginal probability (scaling coefficient).
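A minimal PyTorch-style sketch of this forward-pass scaling. The module name `MarkovChannelScaling` and the sigmoid parameterization of the transition probabilities are illustrative assumptions, not the official repo's API:

```python
import torch
import torch.nn as nn

class MarkovChannelScaling(nn.Module):
    """Scale each output channel by its marginal retention probability p(w_k)."""

    def __init__(self, num_channels: int):
        super().__init__()
        # One learnable logit per transition p(S_{k+1} | S_k); sigmoid keeps the
        # probabilities in (0, 1).  (The paper derives them from learnable
        # parameters; the exact parameterization here is an assumption.)
        self.transition_logits = nn.Parameter(torch.zeros(num_channels - 1))

    def marginal_probs(self) -> torch.Tensor:
        p = torch.sigmoid(self.transition_logits)         # p_1 .. p_{N-1}
        one = torch.ones(1, device=p.device)
        # p(w_1) = 1, p(w_k) = prod_{i<k} p_i  (cumulative product)
        return torch.cat([one, torch.cumprod(p, dim=0)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); broadcast the coefficients over H and W.
        return x * self.marginal_probs().view(1, -1, 1, 1)
```

Because \(p(w_k)\) multiplies the \(k^{th}\) feature map, the task loss back-propagates into every earlier transition logit.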
So the transition probabilities parameterized by learnable parameters can be optimized in an end-to-end manner by gradient descent with respect to task loss together with budget regularization (e.g. FLOPs constraint).
So the transition probabilities of every layer and channel can be optimized end-to-end by gradient descent on the task loss together with the budget loss (e.g. a FLOPs constraint).
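A sketch of how such a budget term can stay differentiable: the expected number of channels a layer keeps is the sum of its marginal probabilities, so an expected-FLOPs estimate is a smooth function of the transition parameters. The function names, the per-layer FLOPs formula, and the hinge-style penalty below are illustrative choices, not necessarily the paper's exact regularizer:

```python
import torch

def expected_conv_flops(marginal_in: torch.Tensor, marginal_out: torch.Tensor,
                        kernel_size: int, out_h: int, out_w: int) -> torch.Tensor:
    """Differentiable FLOPs estimate for one convolution layer."""
    e_in = marginal_in.sum()     # expected number of retained input channels
    e_out = marginal_out.sum()   # expected number of retained output channels
    return e_in * e_out * kernel_size * kernel_size * out_h * out_w

def budget_loss(total_expected_flops: torch.Tensor, target_flops: float) -> torch.Tensor:
    # Penalize only when the expected FLOPs exceed the target budget.
    return torch.relu(total_expected_flops - target_flops) / target_flops
```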
After the optimization, the model within desired budgets can be sampled by the Markov process with learned transition probabilities and will be trained from scratch to achieve high performance.
After optimization (i.e. once the learned transition/marginal probabilities of each layer can yield networks that satisfy the FLOPs constraint), a sub-network is sampled and trained from scratch.
DMCP therefore models the pruning process as a Markov process. Figure 2 shows the pruning of a convolutional layer with 5 channels: \(S_1\) means the first channel is retained, \(S_2\) the second, and so on, while T is the terminal state indicating that pruning is finished. The probabilities p are the transition probabilities, computed from learnable parameters and described in detail below.
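After optimization, sampling one layer's width amounts to running this chain once; the terminal state T in Figure 2 corresponds to stopping the walk. A minimal sketch (function name illustrative):

```python
import torch

def sample_layer_width(transition_probs: torch.Tensor) -> int:
    """Sample how many channels one layer keeps from its learned Markov chain.

    transition_probs = [p(S_2|S_1), p(S_3|S_2), ...].  The walk starts at S_1,
    so at least one channel is always kept; failing a transition means moving
    to the terminal state T, i.e. all remaining channels are pruned.
    """
    kept = 1
    for p in transition_probs:
        if torch.rand(()) < p:   # take the transition: keep one more channel
            kept += 1
        else:                    # move to T: stop pruning here
            break
    return kept
```

Doing this for every layer gives a candidate sub-network; structures within the desired budget are kept and trained from scratch.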
(1) Optimizing the pruning space
Conventional pruning methods compute an "importance" score for each channel to decide whether to keep it. Once pruning is viewed as a structure-search problem, however, models differ only in the number of channels per layer; if each channel is still decided independently, different decisions can produce the same structure and make optimization harder. As shown in Figure 3: in case 1 the last two channels are pruned, and in case 2 the 2nd and 4th channels are pruned, yet both cases yield a 3-channel convolutional layer, so the pruning space is far larger than the number of distinct networks.
Therefore, DMCP always keeps the first k channels, which greatly shrinks the pruning space.
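As a quick check of the reduction: for the 5-channel layer of Figure 2, independent per-channel decisions allow \(2^5 = 32\) different prune masks, but with \(S_1\) always retained the keep-the-first-\(k\) scheme only has to distinguish 5 structures (widths 1 through 5), one per distinct channel count.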
(2) Modeling the pruning process
Here \(p_k\) is the transition probability of the Markov process. In this way, sampling from the optimized Markov process yields the corresponding pruned model.
(3) Learning the transition probabilities
(\(p_k\) is the transition probability, \(p_{w_1}\) is the marginal probability)
(4) Training pipeline
DMCP training has two stages, training the unpruned network and updating the Markov parameters, which alternate during optimization.
Stage 1: training the unpruned network.
In each iteration, two random structures are sampled from the Markov process, together with the largest and smallest structures, to ensure that all parameters of the unpruned network are sufficiently trained. All sampled structures share the unpruned network's weights, so the gradients of every sub-model's task loss on the dataset are accumulated into the unpruned network's parameters.
Stage 2: updating the Markov parameters.
After training the unpruned network, the Markov transition probabilities are coupled to the unpruned network as described above, so the Markov parameters can be updated by gradient descent; the loss is the task loss plus the budget (FLOPs) regularization.
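A rough sketch of one alternating iteration under these assumptions. `model`, `arch`, and the methods used on them (`max_structure`, `min_structure`, `sample`, `forward_subnet`, `forward_scaled`, `budget_loss`) are hypothetical placeholders, not the official repo's API:

```python
def dmcp_train_step(model, arch, images, labels,
                    weight_opt, arch_opt, criterion, lam, target_flops):
    """One DMCP iteration: stage 1 updates the shared weights, stage 2 the Markov parameters."""

    # Stage 1: the largest, the smallest, and two randomly sampled sub-structures,
    # all sharing the unpruned network's weights.
    weight_opt.zero_grad()
    for subnet in [arch.max_structure(), arch.min_structure(),
                   arch.sample(), arch.sample()]:
        loss = criterion(model.forward_subnet(subnet, images), labels)
        loss.backward()                      # gradients accumulate in the shared weights
    weight_opt.step()

    # Stage 2: forward pass with channels scaled by the marginal probabilities p(w_k),
    # then update the Markov parameters with task loss + FLOPs budget regularization.
    arch_opt.zero_grad()
    out = model.forward_scaled(arch, images)
    loss = criterion(out, labels) + lam * arch.budget_loss(target_flops)
    loss.backward()
    arch_opt.step()
```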
Experiments
Conclusion
The proposed method is made differentiable by modeling channel pruning as a Markov process, and thus can be optimized with respect to the task loss by gradient descent.
Summary
Reference
【CVPR 2020 Oral | DMCP: an explanation of the differentiable deep model pruning algorithm】https://zhuanlan.zhihu.com/p/146721840
【Soft Filter Pruning (SFP) algorithm notes】https://blog.csdn.net/u014380165/article/details/81107032