Several implementations of the RoI pooling layer in PyTorch
In the RoI head of Faster R-CNN, the feature-map regions corresponding to 128 RoIs are cropped out and passed through an RoI pooling layer that outputs a 7×7 feature map for each RoI. In PyTorch this can be done with:
- `torch.nn.functional.adaptive_max_pool2d(input, output_size, return_indices=False)`
- `torch.nn.AdaptiveMaxPool2d(output_size, return_indices=False)`
These functions are convenient to call, but this approach has one drawback: it is slow.
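As a quick illustration of the call above (the shapes here are arbitrary): adaptive pooling maps any input spatial size to a fixed 7×7 output per channel, which is exactly what RoI pooling needs for each cropped region.

```python
import torch
import torch.nn.functional as F

# One cropped RoI feature map: (N, C, H, W) with arbitrary H and W
roi_feat = torch.randn(1, 512, 23, 37)

# Adaptive max pooling always produces the requested output size,
# regardless of the input's spatial dimensions
pooled = F.adaptive_max_pool2d(roi_feat, output_size=(7, 7))
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```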
There are therefore many alternative implementations. Drawing on others' work found on GitHub, here is a richer comparison experiment covering four methods in total:
Method 1: build a C extension with cffi and call it from PyTorch. This needs separate C and CUDA source files and an ahead-of-time compilation step; the process is cumbersome and the code layout somewhat messy. For simple CUDA extensions (little code, no complex library dependencies), it is not very friendly.
Method 2: use CuPy for online (JIT) compilation, providing a CUDA extension for PyTorch directly (a pure-C extension is also possible). CuPy implements NumPy-compatible multi-dimensional arrays on CUDA with GPU-accelerated matrix operations, whereas NumPy does not use the GPU. CuPy has since been split out of Chainer into a standalone library.
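A minimal sketch of the CuPy online-compilation pattern (an assumption-laden illustration: it presumes CuPy with a CUDA device is available at runtime, and the kernel name and shape of the toy kernel are my own, not the benchmark's actual kernel). The CUDA C source lives in a Python string and is JIT-compiled on first use, so no separate build step or makefile is needed:

```python
# Toy CUDA kernel kept as a Python string; each thread reduces one
# row of a flattened (n, w) buffer to its maximum (stand-in for the
# per-bin max reduction that RoI pooling performs)
forward_src = r'''
extern "C" __global__
void roi_max_pool_forward(const float* feat, float* out, int w, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float m = feat[i * w];
    for (int j = 1; j < w; ++j)
        m = max(m, feat[i * w + j]);
    out[i] = m;
}
'''

try:
    import cupy as cp
    # RawKernel compiles the source lazily, at first launch
    kernel = cp.RawKernel(forward_src, 'roi_max_pool_forward')
except ImportError:
    kernel = None  # no CuPy/CUDA available; the pattern above is the point
```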
Method 3: use Chainer. Compared with other deep-learning frameworks, Chainer is not especially well known, but it is an excellent framework: pure Python, with a clean design and simple syntax. GPU acceleration in Chainer is also implemented through CuPy. Chainer also has add-on packages such as ChainerCV, which includes implementations of Faster R-CNN, SSD, and other networks.
(Figure: from the official Chainer slides)
Method 4: use PyTorch directly, i.e. the two functions given at the start of this article.
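The essence of method 4 can be sketched as a naive loop (a minimal illustration, assuming `rois` is an (N, 5) tensor of `[batch_index, x1, y1, x2, y2]` in feature-map coordinates; the function and variable names here are my own):

```python
import torch
import torch.nn.functional as F

def roi_pooling_loop(features, rois, size=(7, 7), spatial_scale=1.0):
    """Naive RoI max pooling: crop each RoI, then adaptively pool it to `size`.

    features: (B, C, H, W) feature map
    rois:     (N, 5) rows of [batch_index, x1, y1, x2, y2]
    """
    output = []
    # Python-level loop over the RoIs: simple, but it serializes the GPU work
    for roi in rois:
        idx = int(roi[0])
        x1, y1, x2, y2 = [int(v) for v in roi[1:] * spatial_scale]
        crop = features[idx, :, y1:y2 + 1, x1:x2 + 1]
        output.append(F.adaptive_max_pool2d(crop, size))
    return torch.stack(output)  # (N, C, 7, 7)

feats = torch.randn(1, 512, 50, 50)
rois = torch.tensor([[0., 0., 0., 20., 20.],
                     [0., 10., 15., 45., 40.]])
out = roi_pooling_loop(feats, rois)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```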
From method 1 to method 4, the implementations become progressively simpler, but also progressively slower.
Below are the results of a simple comparison. The variables are the batch size, the image size (strictly speaking, the feature-map size), the number of RoIs, and whether backpropagation is included. Note that the output size is 7×7, as in the original Faster R-CNN paper; all methods run on CUDA, with scale=1, i.e. the feature map is the same size as the original image.
Comparison 1: with backward pass
`use_cuda: True, has_backward: True`

| batch_size | size | num_rois | method1 (s) | method2 (s) | method3 (s) | method4 (s) |
|---:|---:|---:|---:|---:|---:|---:|
| 8 | 8 | 10 | 0.00135 | 0.04485 | 0.06168 | 0.00944 |
| 8 | 8 | 100 | 0.00038 | 0.00159 | 0.00210 | 0.06114 |
| 64 | 64 | 100 | 0.00175 | 0.00474 | 0.00613 | 0.06233 |
| 64 | 64 | 1000 | 0.00185 | 0.01089 | 0.02301 | 0.52922 |
| 256 | 256 | 100 | 0.09111 | 0.41026 | 0.39025 | 0.65442 |
| 256 | 256 | 1000 | 0.09257 | 0.64159 | 1.37561 | 4.07627 |
Comparison 2: forward pass only
`use_cuda: True, has_backward: False`

| batch_size | size | num_rois | method1 (s) | method2 (s) | method3 (s) | method4 (s) |
|---:|---:|---:|---:|---:|---:|---:|
| 8 | 8 | 10 | 0.00016 | 0.00902 | 0.00948 | 0.00288 |
| 8 | 8 | 100 | 0.00018 | 0.00040 | 0.00085 | 0.02639 |
| 64 | 64 | 100 | 0.00019 | 0.00039 | 0.00235 | 0.02484 |
| 64 | 64 | 1000 | 0.00139 | 0.00108 | 0.00257 | 0.25774 |
| 256 | 256 | 100 | 0.00038 | 0.00046 | 0.27299 | 0.02692 |
| 256 | 256 | 1000 | 0.00083 | 0.00217 | 0.27241 | 0.26872 |
Method 4 is almost always the slowest, because it iterates over the RoIs one at a time in a Python loop, which is extremely inefficient.
Comparison 3: fix the batch to 1 (a single image), with a feature-map size of 50×50 (so the original image is 800×800), 512 feature channels, and num_rois set to 300. This approximates the test-time setting of Faster R-CNN with batch size 1. The input feature map is (1, 512, 50, 50) and rois is (300, 5). The first column of rois is the batch index; since there is only one batch it is all zeros and carries no real information here.
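For illustration, inputs of this shape can be constructed like so (a sketch with random values; only the shapes matter):

```python
import torch

features = torch.randn(1, 512, 50, 50)  # (batch, channels, H, W)

# 300 RoIs of the form [batch_index, x1, y1, x2, y2];
# with batch_size = 1 the first column is all zeros
xy1 = torch.randint(0, 25, (300, 2)).float()           # top-left corners
wh = torch.randint(1, 25, (300, 2)).float()            # box widths/heights
rois = torch.cat([torch.zeros(300, 1), xy1, xy1 + wh], dim=1)
print(rois.shape)  # torch.Size([300, 5])
```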
`use_cuda: True, has_backward: True, batch_size: 1, size: 50, num_rois: 300`

| method | time (s) |
|---|---:|
| method1 | 0.03445 |
| method2 | 0.13221 |
| method3 | 0.13074 |
| method4 | 0.20167 |
Methods 2 and 3 are almost identical in speed here, so the more concise Chainer approach is a reasonable choice. However, when training Faster R-CNN with multiple batches, method 1 is preferable, as it is by far the fastest.
Code: