Implementing C++ Extensions in PyTorch

阿新 • Published: 2020-04-03

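Today let's talk about writing C++ extensions for PyTorch.

Before getting started, we need to understand how custom modules are defined in PyTorch. The most common way is to subclass torch.nn.Module in Python and assemble your own module from PyTorch's existing operators. This is simple to implement, but it is not necessarily the most efficient, and if the functionality you need is complex enough, PyTorch's built-in functions may not cover it at all. In those cases, extending PyTorch with C, C++, or CUDA is usually the better choice.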

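Most mainstream deep learning frameworks (TensorFlow, PyTorch, and so on) have back ends built in C or C++, so they generally expose C/C++ extension interfaces. PyTorch was built on top of Torch, whose low-level code is written in C, so PyTorch is naturally compatible with C and extending it in C is not difficult. With the release of PyTorch 1.0, the official team began considering replacing PyTorch's low-level code with Caffe2, and as part of that effort they have been gradually refactoring ATen, the C++ library that PyTorch extensions currently use. Overall, C++ is the direction things are heading. As for CUDA, nearly every deep learning framework has relied on it from the start, so a CUDA extension interface is standard.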

This article uses a simple example to walk through the steps of writing a C++ extension. It does not go into implementation details.

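C, C++, and CUDA extensions for PyTorch

For C extensions, see the official tutorial or the blog post listed in the references. The procedure is not hard: it relies on the interfaces originally provided by Torch, such as <TH/TH.h> and <THC/THC.h>, together with PyTorch's torch.utils.ffi module. Note that as PyTorch evolves, this approach may stop working in newer versions.

This article mainly covers C++ extensions (CUDA may be added in the future).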

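C++ extensions

First, the basic workflow. Writing a C++/CUDA extension for PyTorch takes four main steps:

Install pybind11 (via pip, conda, etc.), which handles the binding between Python and C++;
Implement the custom layer in C++, including the forward pass (forward) and the backward pass (backward);
Write setup.py, using Python's setuptools to compile and load the C++ code;
Build and install, then call the C++ extension from Python.

Next, we use a simple example (z=2x+y) to demonstrate these steps.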

Step 1

Installing pybind11 is straightforward, so we skip it. We start by writing the C++ files:

Header file test.h

#include <torch/extension.h>
#include <vector>

// Forward pass
torch::Tensor Test_forward_cpu(const torch::Tensor& inputA, const torch::Tensor& inputB);
// Backward pass
std::vector<torch::Tensor> Test_backward_cpu(const torch::Tensor& gradOutput);

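Note that the <torch/extension.h> header included here is essential. It pulls in three main components:

pybind11, for interaction between C++ and Python;
ATen, which contains Tensor and other important functions and classes;
Some helper headers that handle the interaction between ATen and pybind11.

The source file test.cpp is as follows: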

#include "test.h"

// Forward pass: computes z = 2x + y. The focus here is the extension workflow, not the implementation details.
torch::Tensor Test_forward_cpu(const torch::Tensor& x, const torch::Tensor& y) {
  // TORCH_CHECK replaces the deprecated AT_ASSERTM macro
  TORCH_CHECK(x.sizes() == y.sizes(), "x must be the same size as y");
  torch::Tensor z = 2 * x + y;
  return z;
}

// Backward pass
// In this example, dz/dx is 2 and dz/dy is 1.
// Why the backward interface (arguments, return value) is designed this way is explained later.
std::vector<torch::Tensor> Test_backward_cpu(const torch::Tensor& gradOutput) {
  // ones_like keeps the dtype and device of gradOutput
  torch::Tensor gradOutputX = 2 * gradOutput * torch::ones_like(gradOutput);
  torch::Tensor gradOutputY = gradOutput * torch::ones_like(gradOutput);
  return {gradOutputX, gradOutputY};
}

// pybind11 bindings
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &Test_forward_cpu, "TEST forward");
  m.def("backward", &Test_backward_cpu, "TEST backward");
}

Step 2

Create a build configuration file, setup.py. The directory layout is as follows:

└── csrc
  ├── cpu
  │  ├── test.cpp
  │  └── test.h
  └── setup.py

The contents of setup.py are:

from setuptools import setup
import os
import glob
from torch.utils.cpp_extension import BuildExtension, CppExtension

# Header directory
include_dirs = os.path.dirname(os.path.abspath(__file__))
# Source files
source_cpu = glob.glob(os.path.join(include_dirs, 'cpu', '*.cpp'))

setup(
  name='test_cpp',  # module name, used when importing from Python
  version="0.1",
  ext_modules=[
    CppExtension('test_cpp', sources=source_cpu, include_dirs=[include_dirs]),
  ],
  cmdclass={
    'build_ext': BuildExtension
  }
)

Note that this C++ extension is named test_cpp, which means its C++ functions can be called from Python through the test_cpp module.
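To see which files the glob pattern in setup.py hands to CppExtension, here is a small standard-library sketch. It recreates the directory layout above inside a temporary directory and runs the same pattern. The temporary path and the names `root`, `csrc`, and `found` are assumptions made for illustration, not part of the original project.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Recreate the layout: csrc/setup.py, csrc/cpu/test.cpp, csrc/cpu/test.h
    csrc = os.path.join(root, "csrc")
    cpu = os.path.join(csrc, "cpu")
    os.makedirs(cpu)
    for name in ("test.cpp", "test.h"):
        open(os.path.join(cpu, name), "w").close()
    open(os.path.join(csrc, "setup.py"), "w").close()

    # Inside setup.py, os.path.dirname(os.path.abspath(__file__)) evaluates
    # to the csrc directory; here we use that directory directly.
    include_dirs = csrc
    source_cpu = glob.glob(os.path.join(include_dirs, "cpu", "*.cpp"))

    # Paths relative to csrc, for readable output
    found = [os.path.relpath(p, csrc) for p in source_cpu]

print(found)
```

Only test.cpp is collected. test.h is picked up at compile time because test.cpp includes it from its own directory. Since the pattern is anchored at the directory that contains setup.py, the cpu folder has to sit next to setup.py.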

Step 3

In the csrc directory (where setup.py lives), run the following command to build and install the C++ code:

python setup.py install

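You will see a lot of build output, after which the C++ module is installed into Python's site-packages.

With these steps done, the C++ code can be called from Python. By convention in PyTorch, the C++ forward and backward passes are first wrapped into a function op (the following code goes in test.py):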

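from torch.autograd import Function

import test_cpp  # the compiled C++ extension (imported after torch)

class TestFunction(Function):

  @staticmethod
  def forward(ctx, x, y):
    # Delegate the forward computation to the C++ implementation
    return test_cpp.forward(x, y)

  @staticmethod
  def backward(ctx, gradOutput):
    # One incoming gradient (for z), two returned gradients (for x and y)
    gradX, gradY = test_cpp.backward(gradOutput)
    return gradX, gradY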

This embeds the C++ extension functions into PyTorch's own framework.

I took a look at the source of the Function class and found it rather interesting:

class Function(with_metaclass(FunctionMeta, _C._FunctionBase, _ContextMethodMixin, _HookMixin)):

  ...

  @staticmethod
  def forward(ctx, *args, **kwargs):
    r"""Performs the operation.

    This function is to be overridden by all subclasses.

    It must accept a context ctx as the first argument, followed by any
    number of arguments (tensors or other types).

    The context can be used to store tensors that can be then retrieved
    during the backward pass.
    """
    raise NotImplementedError

  @staticmethod
  def backward(ctx, *grad_outputs):
    r"""Defines a formula for differentiating the operation.

    This function is to be overridden by all subclasses.

    It must accept a context :attr:`ctx` as the first argument, followed by
    as many outputs did :func:`forward` return, and it should return as many
    tensors, as there were inputs to :func:`forward`. Each argument is the
    gradient w.r.t the given output, and each returned value should be the
    gradient w.r.t. the corresponding input.

    The context can be used to retrieve tensors saved during the forward
    pass. It also has an attribute :attr:`ctx.needs_input_grad` as a tuple
    of booleans representing whether each input needs gradient. E.g.,
    :func:`backward` will have ``ctx.needs_input_grad[0] = True`` if the
    first input to :func:`forward` needs gradient computated w.r.t. the
    output.
    """
    raise NotImplementedError

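Pay attention to the rules for implementing backward. The interface takes two kinds of arguments: ctx is an auxiliary context object, and grad_outputs is the list of gradients flowing back from the layer that consumes this op's outputs (the "previous" layer in backpropagation order). The number of gradients in this list equals the number of values returned by forward. This is consistent with the chain rule, which combines the incoming gradients with the local derivatives of the current layer through multiplication and summation. In turn, backward must return a gradient for every input of forward: if forward takes n arguments, backward returns n gradients, one for each. In the example above, backward therefore receives one argument (forward produces a single output) and returns two gradients (forward takes two inputs).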
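For the z=2x+y example, the chain rule behind this interface can be written out explicitly. Here L stands for whatever scalar the network eventually computes; it is a symbol introduced only for this illustration. The gradOutput argument corresponds to the partial derivative of L with respect to z:

```latex
\begin{aligned}
z_i &= 2x_i + y_i, \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial z_i}\,\frac{\partial z_i}{\partial x_i} = 2\,\frac{\partial L}{\partial z_i}, \\
\frac{\partial L}{\partial y_i} &= \frac{\partial L}{\partial z_i}\,\frac{\partial z_i}{\partial y_i} = \frac{\partial L}{\partial z_i}.
\end{aligned}
```

So one incoming gradient produces two outgoing ones, 2 times gradOutput for x and gradOutput itself for y, which is what Test_backward_cpu returns.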

Once the Function is defined, this custom op can be used inside a Module:

import torch

class Test(torch.nn.Module):

  def __init__(self):
    super(Test, self).__init__()

  def forward(self, inputA, inputB):
    return TestFunction.apply(inputA, inputB)

The directory layout now becomes:

├── csrc
│  ├── cpu
│  │  ├── test.cpp
│  │  └── test.h
│  └── setup.py
└── test.py

After that, test.py can be used like any ordinary PyTorch module.

Testing

Now let's test the forward and backward passes:

import torch

from test import Test

# requires_grad=True on a tensor replaces the deprecated Variable wrapper
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
test = Test()
z = test(x, y)
z.sum().backward()
print('x: ', x)
print('y: ', y)
print('z: ', z)
print('x.grad: ', x.grad)
print('y.grad: ', y.grad)

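The output is:

x:  tensor([1., 2., 3.], requires_grad=True)
y:  tensor([4., 5., 6.], requires_grad=True)
z:  tensor([ 6.,  9., 12.], grad_fn=<TestFunctionBackward>)
x.grad:  tensor([2., 2., 2.])
y.grad:  tensor([1., 1., 1.])

As shown, the forward pass satisfies z=2x+y, and the backward pass gives the expected gradients: 2 for every element of x and 1 for every element of y.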
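These numbers can be cross-checked without PyTorch or the compiled extension. The sketch below is a plain-Python stand-in, not the author's code: `forward`, `loss`, and `numerical_grad` are hypothetical helpers written for this illustration. It recomputes z and then estimates the gradients of sum(z) with central finite differences.

```python
def forward(x, y):
    """Element-wise z = 2x + y on plain Python lists."""
    return [2.0 * a + b for a, b in zip(x, y)]


def loss(x, y):
    """Scalar used in the test above: the sum of all elements of z."""
    return sum(forward(x, y))


def numerical_grad(f, x, y, wrt, eps=1e-6):
    """Central finite differences of f w.r.t. x (wrt=0) or y (wrt=1)."""
    target = x if wrt == 0 else y
    grads = []
    for i in range(len(target)):
        plus = list(target)
        minus = list(target)
        plus[i] += eps
        minus[i] -= eps
        if wrt == 0:
            upper, lower = f(plus, y), f(minus, y)
        else:
            upper, lower = f(x, plus), f(x, minus)
        grads.append((upper - lower) / (2 * eps))
    return grads


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

z = forward(x, y)
grad_x = numerical_grad(loss, x, y, wrt=0)
grad_y = numerical_grad(loss, x, y, wrt=1)

print(z)  # [6.0, 9.0, 12.0]
print([round(g, 4) for g in grad_x])  # approximately 2 for each element
print([round(g, 4) for g in grad_y])  # approximately 1 for each element
```

Because z is linear in x and y, the finite-difference estimates agree with the analytic gradients up to floating-point error. The same kind of check is more useful for ops whose backward pass is less obvious.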

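CUDA extensions

Although code written in C++ with ATen can run directly on the GPU, its performance generally still falls short of code written directly in CUDA, since ATen has no way of knowing how to optimize your particular algorithm. However, since I still know next to nothing about CUDA, this part has to be skipped for now and will be filled in later.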

References

Custom C Extensions for PyTorch
Custom C++ and CUDA Extensions
Advanced PyTorch Extensions (1): Combining PyTorch with C and CUDA
Advanced PyTorch Extensions (2): Combining PyTorch with C++ and CUDA
