

# Dissecting the mmdetection source code (1): NMS

Anyone familiar with object detection knows what the **NMS** algorithm is, but do you know how to write it as a PyTorch operator directly in C++ and CUDA? I have recently been reading the **mmdetection** source code and found that writing C++ and CUDA extensions is actually not that hard, so let me walk you through it.

C++ extensions let users create custom operators outside the PyTorch framework, i.e. decoupled from the PyTorch backend. This approach is *different* from how native PyTorch operators are implemented. C++ extensions are designed to save you most of the boilerplate of integrating an operator with the PyTorch backend, while still giving PyTorch-based projects a high degree of flexibility. The official tutorial provides an [LLTM example](https://pytorch.org/tutorials/advanced/cpp_extension.html) that is also worth a look.

## The NMS algorithm

First, a quick refresher on NMS:

- (1) Sort all boxes by score and select the highest-scoring box.
- (2) Iterate over the remaining boxes; if a box's overlap (IoU) with the currently selected box exceeds a threshold, remove it.
- (3) From the boxes not yet processed, pick the one with the highest score and repeat.

Here is a pure-numpy implementation:

```python
import numpy as np


def nms(bounding_boxes, Nt):
    """Greedy NMS: keep a box unless its IoU with an already-kept,
    higher-scoring box is >= Nt."""
    if len(bounding_boxes) == 0:
        return []
    bboxes = np.array(bounding_boxes)

    # Boxes are (x1, y1, x2, y2, score).
    x1 = bboxes[:, 0]
    y1 = bboxes[:, 1]
    x2 = bboxes[:, 2]
    y2 = bboxes[:, 3]
    scores = bboxes[:, 4]

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(scores)  # ascending, so the best box is last

    picked_boxes = []
    while order.size > 0:
        # Pick the highest-scoring remaining box.
        index = order[-1]
        picked_boxes.append(bounding_boxes[index])

        # Intersection of the picked box with every other remaining box.
        x11 = np.maximum(x1[index], x1[order[:-1]])
        y11 = np.maximum(y1[index], y1[order[:-1]])
        x22 = np.minimum(x2[index], x2[order[:-1]])
        y22 = np.minimum(y2[index], y2[order[:-1]])
        w = np.maximum(0.0, x22 - x11 + 1)
        h = np.maximum(0.0, y22 - y11 + 1)
        intersection = w * h

        # Keep only the boxes whose IoU with the picked box is below Nt.
        ious = intersection / (areas[index] + areas[order[:-1]] - intersection)
        left = np.where(ious < Nt)
        order = order[left]

    return picked_boxes
```

## Steps for writing a PyTorch C++ extension

Five files need to be written (if you just want to experiment with the binding mechanism first, see the small JIT-compiled sketch right after the step descriptions):

- nms_kernel.cu
- nms_cuda.cpp
- nms_ext.cpp
- setup.py
- nms_wrapper.py

(1) nms_kernel.cu implements the `nms_cuda_forward` function, mainly using the ATen and THC libraries. It is written in C++/CUDA and involves calls such as `lazyInitCUDA`, `THCudaFree` and `THCCeilDiv`; the algorithm itself is much the same as the numpy version above.

(2) nms_cuda.cpp calls `nms_cuda_forward` from nms_kernel.cu and wraps it into an `nms_cuda` function.

(3) nms_ext.cpp further wraps `nms_cuda` into `nms`, and binds it via `PYBIND11_MODULE` so that it can be called from Python.

```cpp
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("nms", &nms, "non-maximum suppression");
}
```

The snippet above effectively tells Python that the function is exposed under the name `nms`.

(4) setup.py compiles nms_ext; once that is done you can call the C++ extension as a PyTorch operator via `nms_ext.nms`.

```python
make_cuda_ext(
    name='nms_ext',
    module='mmdet.ops.nms',
    sources=['src/nms_ext.cpp', 'src/cpu/nms_cpu.cpp'],
    sources_cuda=[
        'src/cuda/nms_cuda.cpp', 'src/cuda/nms_kernel.cu'
    ]),
```

(5) nms_wrapper.py wraps `nms_ext.nms` once more for convenience. Usage example:

```python
from . import nms_ext

inds = nms_ext.nms(dets_th, iou_thr)
```
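As a small aside that is *not* part of mmdetection: if you only want to play with the `PYBIND11_MODULE`-style binding without writing a setup.py, PyTorch can also JIT-compile a C++ snippet via `torch.utils.cpp_extension.load_inline`. The following is a minimal, hypothetical sketch (the module name `toy_ext` and function `score_filter` are made up for illustration; it needs a working C++ toolchain):

```python
# Hypothetical, minimal JIT-compiled extension to show the binding mechanism.
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>

// Keep only the boxes whose score (column 4) is above a threshold.
at::Tensor score_filter(const at::Tensor& dets, double thr) {
  auto scores = dets.select(1, 4);
  return dets.index_select(0, (scores > thr).nonzero().squeeze(1));
}
"""

# load_inline compiles the snippet and generates the pybind11 glue for us.
ext = load_inline(name='toy_ext', cpp_sources=cpp_source,
                  functions=['score_filter'])

dets = torch.tensor([[0., 0., 10., 10., 0.9],
                     [1., 1., 11., 11., 0.3]])
print(ext.score_filter(dets, 0.5))  # only the first row survives
```

The mmdetection files below follow the same idea, just split across source files and compiled ahead of time by setup.py.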
The more complete code is listed below, though I have trimmed it down to just the NMS-related parts; to see the full code, click the file names.

[**nms_kernel.cu**](https://github.com/open-mmlab/mmdetection/blob/e903b5c109dc9ee5bb06c35fbe3b9c1f88ba6367/mmdet/ops/nms/src/cuda/nms_kernel.cu) (part of this was presumably written by Facebook)

```cuda
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/DeviceGuard.h>

#include <THC/THC.h>
#include <THC/THCDeviceUtils.cuh>

#include <vector>
#include <iostream>

int const threadsPerBlock = sizeof(unsigned long long) * 8;

__device__ inline float devIoU(float const * const a, float const * const b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left, 0.f), height = max(bottom - top, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0]) * (a[3] - a[1]);
  float Sb = (b[2] - b[0]) * (b[3] - b[1]);
  return interS / (Sa + Sb - interS);
}

__global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh,
                           const float *dev_boxes,
                           unsigned long long *dev_mask) {
  const int row_start = blockIdx.y;
  const int col_start = blockIdx.x;

  // if (row_start > col_start) return;

  const int row_size =
      min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
  const int col_size =
      min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);

  __shared__ float block_boxes[threadsPerBlock * 5];
  if (threadIdx.x < col_size) {
    block_boxes[threadIdx.x * 5 + 0] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 0];
    block_boxes[threadIdx.x * 5 + 1] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 1];
    block_boxes[threadIdx.x * 5 + 2] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 2];
    block_boxes[threadIdx.x * 5 + 3] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 3];
    block_boxes[threadIdx.x * 5 + 4] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 4];
  }
  __syncthreads();

  if (threadIdx.x < row_size) {
    const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
    const float *cur_box = dev_boxes + cur_box_idx * 5;
    int i = 0;
    unsigned long long t = 0;
    int start = 0;
    if (row_start == col_start) {
      start = threadIdx.x + 1;
    }
    for (i = start; i < col_size; i++) {
      if (devIoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) {
        t |= 1ULL << i;
      }
    }
    const int col_blocks = THCCeilDiv(n_boxes, threadsPerBlock);
    dev_mask[cur_box_idx * col_blocks + col_start] = t;
  }
}

// boxes is a N x 5 tensor
at::Tensor nms_cuda_forward(const at::Tensor boxes, float nms_overlap_thresh) {
  // Ensure CUDA uses the input tensor device.
  at::DeviceGuard guard(boxes.device());

  using scalar_t = float;
  AT_ASSERTM(boxes.device().is_cuda(), "boxes must be a CUDA tensor");
  auto scores = boxes.select(1, 4);
  auto order_t = std::get<1>(scores.sort(0, /* descending=*/true));
  auto boxes_sorted = boxes.index_select(0, order_t);

  int boxes_num = boxes.size(0);

  const int col_blocks = THCCeilDiv(boxes_num, threadsPerBlock);

  scalar_t* boxes_dev = boxes_sorted.data_ptr<scalar_t>();

  THCState *state = at::globalContext().lazyInitCUDA(); // TODO replace with getTHCState

  unsigned long long* mask_dev = NULL;
  //THCudaCheck(THCudaMalloc(state, (void**) &mask_dev,
  //                      boxes_num * col_blocks * sizeof(unsigned long long)));

  mask_dev = (unsigned long long*) THCudaMalloc(
      state, boxes_num * col_blocks * sizeof(unsigned long long));

  dim3 blocks(THCCeilDiv(boxes_num, threadsPerBlock),
              THCCeilDiv(boxes_num, threadsPerBlock));
  dim3 threads(threadsPerBlock);
  nms_kernel<<<blocks, threads>>>(boxes_num,
                                  nms_overlap_thresh,
                                  boxes_dev,
                                  mask_dev);

  std::vector<unsigned long long> mask_host(boxes_num * col_blocks);
  THCudaCheck(cudaMemcpyAsync(
      &mask_host[0],
      mask_dev,
      sizeof(unsigned long long) * boxes_num * col_blocks,
      cudaMemcpyDeviceToHost,
      at::cuda::getCurrentCUDAStream()));

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  at::Tensor keep = at::empty(
      {boxes_num}, boxes.options().dtype(at::kLong).device(at::kCPU));
  int64_t* keep_out = keep.data_ptr<int64_t>();

  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      unsigned long long *p = &mask_host[0] + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }

  THCudaFree(state, mask_dev);
  // TODO improve this part
  return order_t.index({
      keep.narrow(/*dim=*/0, /*start=*/0, /*length=*/num_to_keep).to(
          order_t.device(), keep.scalar_type())});
}
```
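A note on how the kernel's output is used: for every box `i` (in descending score order) the kernel writes a row of 64-bit words into `dev_mask`, where a set bit marks a box that `i` overlaps by more than the threshold. The host-side loop at the end of `nms_cuda_forward` then walks the boxes in score order and skips any box whose bit was already set by a better box. The following NumPy snippet is *not* part of mmdetection; it is just an illustration of that decoding loop, assuming `mask` is the `(n_boxes, col_blocks)` uint64 array copied back from the GPU:

```python
import numpy as np


def decode_suppression_mask(mask, n_boxes, threads_per_block=64):
    """Mirror of the host-side loop in nms_cuda_forward.

    mask[i, b]: bit k is set when box i (sorted by score) suppresses
    box b * threads_per_block + k.
    """
    col_blocks = (n_boxes + threads_per_block - 1) // threads_per_block
    remv = np.zeros(col_blocks, dtype=np.uint64)  # bits of suppressed boxes
    keep = []
    for i in range(n_boxes):
        nblock, inblock = divmod(i, threads_per_block)
        if not (int(remv[nblock]) >> inblock) & 1:  # box i survives
            keep.append(i)
            # Everything that box i overlaps too much becomes suppressed.
            remv[nblock:] |= mask[i, nblock:]
    return keep
```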
[**nms_cuda.cpp**](https://github.com/open-mmlab/mmdetection/blob/e903b5c109dc9ee5bb06c35fbe3b9c1f88ba6367/mmdet/ops/nms/src/cuda/nms_cuda.cpp)

```cpp
#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x, " must be a CUDAtensor ")

at::Tensor nms_cuda_forward(const at::Tensor boxes, float nms_overlap_thresh);

at::Tensor nms_cuda(const at::Tensor& dets, const float threshold) {
  CHECK_CUDA(dets);
  if (dets.numel() == 0)
    return at::empty({0}, dets.options().dtype(at::kLong).device(at::kCPU));
  return nms_cuda_forward(dets, threshold);
}
```
[**nms_ext.cpp**](https://github.com/open-mmlab/mmdetection/blob/e903b5c109dc9ee5bb06c35fbe3b9c1f88ba6367/mmdet/ops/nms/src/nms_ext.cpp)

```cpp
#include <torch/extension.h>

#ifdef WITH_CUDA
at::Tensor nms_cuda(const at::Tensor& dets, const float threshold);
#endif

// Implemented in src/cpu/nms_cpu.cpp (omitted here).
at::Tensor nms_cpu(const at::Tensor& dets, const float threshold);

at::Tensor nms(const at::Tensor& dets, const float threshold) {
  if (dets.device().is_cuda()) {
#ifdef WITH_CUDA
    return nms_cuda(dets, threshold);
#else
    AT_ERROR("nms is not compiled with GPU support");
#endif
  }
  return nms_cpu(dets, threshold);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("nms", &nms, "non-maximum suppression");
}
```

[**setup.py**](https://github.com/open-mmlab/mmdetection/blob/32fcb5872af512c1822caa739ec9fa35d43c938b/setup.py)

```python
def make_cuda_ext(name, module, sources, sources_cuda=[]):

    define_macros = []
    extra_compile_args = {'cxx': []}

    if torch.cuda.is_available() or os.getenv('FORCE_CUDA', '0') == '1':
        define_macros += [('WITH_CUDA', None)]
        extension = CUDAExtension
        extra_compile_args['nvcc'] = [
            '-D__CUDA_NO_HALF_OPERATORS__',
            '-D__CUDA_NO_HALF_CONVERSIONS__',
            '-D__CUDA_NO_HALF2_OPERATORS__',
        ]
        sources += sources_cuda
    else:
        print(f'Compiling {name} without CUDA')
        extension = CppExtension
        # raise EnvironmentError('CUDA is required to compile MMDetection!')

    return extension(
        name=f'{module}.{name}',
        sources=[os.path.join(*module.split('.'), p) for p in sources],
        define_macros=define_macros,
        extra_compile_args=extra_compile_args)


if __name__ == '__main__':
    write_version_py()
    setup(
        name='mmdet',
        version=get_version(),
        description='Open MMLab Detection Toolbox and Benchmark',
        long_description=readme(),
        author='OpenMMLab',
        author_email='[email protected]',
        keywords='computer vision, object detection',
        url='https://github.com/open-mmlab/mmdetection',
        packages=find_packages(exclude=('configs', 'tools', 'demo')),
        package_data={'mmdet.ops': ['*/*.so']},
        classifiers=[
            'Development Status :: 4 - Beta',
            'License :: OSI Approved :: Apache Software License',
            'Operating System :: OS Independent',
            'Programming Language :: Python :: 3',
            'Programming Language :: Python :: 3.5',
            'Programming Language :: Python :: 3.6',
            'Programming Language :: Python :: 3.7',
        ],
        license='Apache License 2.0',
        setup_requires=parse_requirements('requirements/build.txt'),
        tests_require=parse_requirements('requirements/tests.txt'),
        install_requires=parse_requirements('requirements/runtime.txt'),
        extras_require={
            'all': parse_requirements('requirements.txt'),
            'tests': parse_requirements('requirements/tests.txt'),
            'build': parse_requirements('requirements/build.txt'),
            'optional': parse_requirements('requirements/optional.txt'),
        },
        ext_modules=[
            make_cuda_ext(
                name='compiling_info',
                module='mmdet.ops.utils',
                sources=['src/compiling_info.cpp']),
            make_cuda_ext(
                name='nms_ext',
                module='mmdet.ops.nms',
                sources=['src/nms_ext.cpp', 'src/cpu/nms_cpu.cpp'],
                sources_cuda=[
                    'src/cuda/nms_cuda.cpp', 'src/cuda/nms_kernel.cu'
                ]),
        ],
        cmdclass={'build_ext': BuildExtension},
        zip_safe=False)
```
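Once the package has been built (for mmdetection v1.x, typically `pip install -v -e .` or `python setup.py develop`), the compiled extension is importable like any other module. The snippet below is not from the repository; it is a hedged smoke test of calling the bound function directly, assuming a CUDA build:

```python
# Hypothetical smoke test (not part of mmdetection): call the bound C++ op
# directly. Assumes the package was built with CUDA support.
import torch
from mmdet.ops.nms import nms_ext

dets = torch.tensor([[0., 0., 10., 10., 0.9],
                     [1., 1., 11., 11., 0.8],
                     [20., 20., 30., 30., 0.7]], device='cuda')
keep = nms_ext.nms(dets, 0.5)
print(keep)  # indices of the surviving boxes, e.g. tensor([0, 2])
```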
[**nms_wrapper.py**](https://github.com/open-mmlab/mmdetection/blob/e903b5c109dc9ee5bb06c35fbe3b9c1f88ba6367/mmdet/ops/nms/nms_wrapper.py)

```python
import numpy as np
import torch

from . import nms_ext


def nms(dets, iou_thr, device_id=None):
    # convert dets (tensor or numpy array) to tensor
    if isinstance(dets, torch.Tensor):
        is_numpy = False
        dets_th = dets
    elif isinstance(dets, np.ndarray):
        is_numpy = True
        device = 'cpu' if device_id is None else f'cuda:{device_id}'
        dets_th = torch.from_numpy(dets).to(device)
    else:
        raise TypeError('dets must be either a Tensor or numpy array, '
                        f'but got {type(dets)}')

    # execute cpu or cuda nms
    if dets_th.shape[0] == 0:
        inds = dets_th.new_zeros(0, dtype=torch.long)
    else:
        if dets_th.is_cuda:
            inds = nms_ext.nms(dets_th, iou_thr)
        else:
            # same binding; the CPU/CUDA dispatch happens inside nms_ext.cpp
            inds = nms_ext.nms(dets_th, iou_thr)

    if is_numpy:
        inds = inds.cpu().numpy()
    return dets[inds, :], inds
```

## Summary

Writing a C++/CUDA extension for PyTorch really comes down to four steps:

- Write the forward code as a .cu file A, using ATen and THC.
- Wrap it into a cpp file B.
- Wrap B once more and bind the function name `function` with `PYBIND11_MODULE`.
- Run setup over the components via `make_cuda_ext` (a helper function defined by mmdetection).

Through the bound function name `function`, the operator can then be called from PyTorch.
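To close the loop, here is a hedged end-to-end example (not from the repository) of calling the Python-level wrapper; with no `device_id` a numpy input stays on the CPU path:

```python
# Hypothetical end-to-end usage of the high-level wrapper (assumes mmdetection
# v1.x is installed and built). Pass device_id=0 to run the CUDA kernel instead.
import numpy as np
from mmdet.ops import nms

dets = np.array([[0, 0, 10, 10, 0.9],
                 [1, 1, 11, 11, 0.8],    # heavily overlaps the first box
                 [20, 20, 30, 30, 0.7]], dtype=np.float32)

kept_dets, inds = nms(dets, iou_thr=0.5)
print(inds)       # e.g. array([0, 2]) -- the overlapping box was suppressed
print(kept_dets)  # the corresponding rows of dets
```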