Several implementations of the RoI pooling layer in PyTorch
In the RoI head of Faster R-CNN, the feature-map regions corresponding to 128 RoIs are cropped out and passed through an RoI pooling layer that outputs a 7×7 feature map for each RoI. In PyTorch this can be done with:
- `torch.nn.functional.adaptive_max_pool2d(input, output_size, return_indices=False)`
- `torch.nn.AdaptiveMaxPool2d(output_size, return_indices=False)`
These functions are convenient to call, but this approach has one drawback: it is slow.
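As a quick illustration of the call above (the shapes here are arbitrary): adaptive pooling maps any input spatial size to a fixed 7×7 output per channel, which is exactly what RoI pooling needs for each cropped region.

```python
import torch
import torch.nn.functional as F

# One cropped RoI feature map: (N, C, H, W) with arbitrary H and W
roi_feat = torch.randn(1, 512, 23, 37)

# Adaptive max pooling always produces the requested output size,
# regardless of the input's spatial dimensions
pooled = F.adaptive_max_pool2d(roi_feat, output_size=(7, 7))
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```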
There are therefore many alternative implementations. Drawing on others' work found on GitHub, here is a richer comparison experiment covering four methods in total:
Method 1: build a C extension with cffi and call it from PyTorch. This needs separate C and CUDA source files and an ahead-of-time compilation step; the process is cumbersome and the code layout somewhat messy. For simple CUDA extensions (little code, no complex library dependencies), it is not very friendly.
Method 2: use CuPy for online (JIT) compilation, providing a CUDA extension for PyTorch directly (a pure-C extension is also possible). CuPy implements NumPy-compatible multi-dimensional arrays on CUDA with GPU-accelerated matrix operations, whereas NumPy does not use the GPU. CuPy has since been split out of Chainer into a standalone library.
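A minimal sketch of the CuPy online-compilation pattern (an assumption-laden illustration: it presumes CuPy with a CUDA device is available at runtime, and the kernel name and shape of the toy kernel are my own, not the benchmark's actual kernel). The CUDA C source lives in a Python string and is JIT-compiled on first use, so no separate build step or makefile is needed:

```python
# Toy CUDA kernel kept as a Python string; each thread reduces one
# row of a flattened (n, w) buffer to its maximum (stand-in for the
# per-bin max reduction that RoI pooling performs)
forward_src = r'''
extern "C" __global__
void roi_max_pool_forward(const float* feat, float* out, int w, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float m = feat[i * w];
    for (int j = 1; j < w; ++j)
        m = max(m, feat[i * w + j]);
    out[i] = m;
}
'''

try:
    import cupy as cp
    # RawKernel compiles the source lazily, at first launch
    kernel = cp.RawKernel(forward_src, 'roi_max_pool_forward')
except ImportError:
    kernel = None  # no CuPy/CUDA available; the pattern above is the point
```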
Method 3: use Chainer. Compared with other deep-learning frameworks, Chainer is not especially well known, but it is an excellent framework: pure Python, with a clean design and simple syntax. GPU acceleration in Chainer is also implemented through CuPy. Chainer also has add-on packages such as ChainerCV, which includes implementations of Faster R-CNN, SSD, and other networks.
(Figure: from the official Chainer slides)
Method 4: use PyTorch directly, i.e. the two functions given at the start of this article.
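The essence of method 4 can be sketched as a naive loop (a minimal illustration, assuming `rois` is an (N, 5) tensor of `[batch_index, x1, y1, x2, y2]` in feature-map coordinates; the function and variable names here are my own):

```python
import torch
import torch.nn.functional as F

def roi_pooling_loop(features, rois, size=(7, 7), spatial_scale=1.0):
    """Naive RoI max pooling: crop each RoI, then adaptively pool it to `size`.

    features: (B, C, H, W) feature map
    rois:     (N, 5) rows of [batch_index, x1, y1, x2, y2]
    """
    output = []
    # Python-level loop over the RoIs: simple, but it serializes the GPU work
    for roi in rois:
        idx = int(roi[0])
        x1, y1, x2, y2 = [int(v) for v in roi[1:] * spatial_scale]
        crop = features[idx, :, y1:y2 + 1, x1:x2 + 1]
        output.append(F.adaptive_max_pool2d(crop, size))
    return torch.stack(output)  # (N, C, 7, 7)

feats = torch.randn(1, 512, 50, 50)
rois = torch.tensor([[0., 0., 0., 20., 20.],
                     [0., 10., 15., 45., 40.]])
out = roi_pooling_loop(feats, rois)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```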
From method 1 to method 4, the implementations become progressively simpler, but also progressively slower.
Below are the results of a simple comparison. The variables are the batch size, the image size (strictly speaking, the feature-map size), the number of RoIs, and whether backpropagation is included. Note that the output size is 7×7, as in the original Faster R-CNN paper; all methods run on CUDA, with scale=1, i.e. the feature map is the same size as the original image.
Comparison 1: with backward pass
`use_cuda: True, has_backward: True`

| batch_size | size | num_rois | method1 (s) | method2 (s) | method3 (s) | method4 (s) |
|---:|---:|---:|---:|---:|---:|---:|
| 8 | 8 | 10 | 0.00135 | 0.04485 | 0.06168 | 0.00944 |
| 8 | 8 | 100 | 0.00038 | 0.00159 | 0.00210 | 0.06114 |
| 64 | 64 | 100 | 0.00175 | 0.00474 | 0.00613 | 0.06233 |
| 64 | 64 | 1000 | 0.00185 | 0.01089 | 0.02301 | 0.52922 |
| 256 | 256 | 100 | 0.09111 | 0.41026 | 0.39025 | 0.65442 |
| 256 | 256 | 1000 | 0.09257 | 0.64159 | 1.37561 | 4.07627 |
Comparison 2: forward pass only
`use_cuda: True, has_backward: False`

| batch_size | size | num_rois | method1 (s) | method2 (s) | method3 (s) | method4 (s) |
|---:|---:|---:|---:|---:|---:|---:|
| 8 | 8 | 10 | 0.00016 | 0.00902 | 0.00948 | 0.00288 |
| 8 | 8 | 100 | 0.00018 | 0.00040 | 0.00085 | 0.02639 |
| 64 | 64 | 100 | 0.00019 | 0.00039 | 0.00235 | 0.02484 |
| 64 | 64 | 1000 | 0.00139 | 0.00108 | 0.00257 | 0.25774 |
| 256 | 256 | 100 | 0.00038 | 0.00046 | 0.27299 | 0.02692 |
| 256 | 256 | 1000 | 0.00083 | 0.00217 | 0.27241 | 0.26872 |
Method 4 is almost always the slowest, because it iterates over the RoIs one at a time in a Python loop, which is extremely inefficient.
Comparison 3: fix the batch to 1 (a single image), with a feature-map size of 50×50 (so the original image is 800×800), 512 feature channels, and num_rois set to 300. This approximates the test-time setting of Faster R-CNN with batch size 1. The input feature map is (1, 512, 50, 50) and rois is (300, 5). The first column of rois is the batch index; since there is only one batch it is all zeros and carries no real information here.
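For illustration, inputs of this shape can be constructed like so (a sketch with random values; only the shapes matter):

```python
import torch

features = torch.randn(1, 512, 50, 50)  # (batch, channels, H, W)

# 300 RoIs of the form [batch_index, x1, y1, x2, y2];
# with batch_size = 1 the first column is all zeros
xy1 = torch.randint(0, 25, (300, 2)).float()           # top-left corners
wh = torch.randint(1, 25, (300, 2)).float()            # box widths/heights
rois = torch.cat([torch.zeros(300, 1), xy1, xy1 + wh], dim=1)
print(rois.shape)  # torch.Size([300, 5])
```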
`use_cuda: True, has_backward: True, batch_size: 1, size: 50, num_rois: 300`

| method | time (s) |
|---|---:|
| method1 | 0.03445 |
| method2 | 0.13221 |
| method3 | 0.13074 |
| method4 | 0.20167 |
Methods 2 and 3 are almost identical in speed here, so the more concise Chainer approach is a reasonable choice. However, when training Faster R-CNN with multiple batches, method 1 is preferable, as it is by far the fastest.
Code: