A Usage Comparison of OpenCL and CUDA

阿新 • Published: 2019-01-18

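Although OpenCL and CUDA do not sit at quite the same level, they can still be compared side by side.

The sections below compare where the two are alike and where they differ.

Pointer traversal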

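OpenCL does not support pointer traversal the way CUDA does; you can only emulate it indirectly with indices into a buffer. (OpenCL 2.0 shared virtual memory relaxes this restriction, but the index-based form works everywhere.) Example code:
// CUDA

struct Node { Node* next; };
n = n->next;

// OpenCL

struct Node { unsigned int next; };

n = bufBase + n->next;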
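
The index-based traversal can be tried out on the host before moving it into a kernel. The following is a minimal C++ sketch, not OpenCL device code; the `value` field, the `kEndOfList` sentinel, and the names `sumList` and `makeExampleList` are assumptions made for illustration and do not come from either API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node layout: `next` is an index into the node buffer
// rather than a pointer, so it stays valid wherever the buffer is mapped.
struct Node {
    float value;
    std::uint32_t next;
};

// Sentinel index marking the end of the list (an assumption of this sketch).
constexpr std::uint32_t kEndOfList = 0xFFFFFFFFu;

// Walks the list starting at index `head` and sums the node values.
// `bufBase` plays the role of the buffer base address in the snippet above.
float sumList(const Node* bufBase, std::uint32_t head) {
    assert(bufBase != nullptr);
    float sum = 0.0f;
    for (std::uint32_t i = head; i != kEndOfList; i = bufBase[i].next) {
        const Node* n = bufBase + i;  // address of node i inside the buffer
        sum += n->value;
    }
    return sum;
}

// Builds a three-node example list stored out of order: 0 -> 2 -> 1.
std::vector<Node> makeExampleList() {
    return {{1.0f, 2}, {4.0f, kEndOfList}, {2.0f, 1}};
}
```

Because the links are plain integers, the same buffer can be copied to a device and walked there without any pointer fix-up.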

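Kernel code: similarities and differences

CUDA code is ultimately compiled into a binary format for the graphics card, which is then loaded onto the GPU and executed, presumably by cudart.dll (a personal guess). The OpenCL runtime library, by contrast, includes a compiler: the kernel is shipped in a portable form (its source text) and is compiled just in time and loaded when the program runs. This resembles how Java and .NET programs work, and it is done for the same reason: cross-platform compatibility. The kernel syntax also differs slightly.

Taking a vectorAdd kernel as the example, most of the code is identical in the two dialects; only the details differ:

1) A CUDA kernel function is declared with "__global__", whereas an OpenCL kernel function is declared with "__kernel".

2) In the OpenCL version every pointer parameter carries the "__global" qualifier, which says that the address it points to lives in global memory.

3) CUDA uses threadIdx.{x|y|z} and blockIdx.{x|y|z} to obtain the index of the current thread, whereas OpenCL obtains the global index inside a kernel through a dedicated function, get_global_id(). To obtain the local index within the current work-group (the counterpart of a CUDA block), use get_local_id().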
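
The index mapping in point 3 can be illustrated on the host. The following is a C++ emulation, not device code; `globalIndex` and `vectorAddEmulated` are hypothetical helper names, and the sketch assumes a one-dimensional launch with no global offset, in which case the CUDA expression blockIdx.x * blockDim.x + threadIdx.x yields the same value that get_global_id(0) returns in OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CUDA-style global index along x: blockIdx.x * blockDim.x + threadIdx.x.
// With no global offset, OpenCL's get_global_id(0) returns the same value.
std::size_t globalIndex(std::size_t blockIdx, std::size_t blockDim, std::size_t threadIdx) {
    assert(threadIdx < blockDim);
    return blockIdx * blockDim + threadIdx;
}

// Host-side emulation of a vectorAdd launch: every (block, thread) pair
// handles exactly one element, c[i] = a[i] + b[i].
std::vector<float> vectorAddEmulated(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t blocks, std::size_t blockSize) {
    assert(a.size() == blocks * blockSize && b.size() == a.size());
    std::vector<float> c(a.size());
    for (std::size_t block = 0; block < blocks; ++block) {            // blockIdx.x / get_group_id(0)
        for (std::size_t thread = 0; thread < blockSize; ++thread) {  // threadIdx.x / get_local_id(0)
            const std::size_t i = globalIndex(block, blockSize, thread);
            c[i] = a[i] + b[i];
        }
    }
    return c;
}
```

On a real device the two loops disappear: each work-item (thread) computes its own index and processes a single element.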

Host code: similarities and differences

With the kernel above compiled into "vectorAdd.cubin", the CUDA host code calls it as follows:

const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;
// create CUDA device & context
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
// allocate host vectors
float * pA = new float[cnDimension];
float * pB = new float[cnDimension];
float * pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
// allocate memory on the device
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
// copy host vectors to device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
// setup parameter values
// (legacy launch API, deprecated since CUDA 4.0 in favor of cuLaunchKernel;
// the 4-byte offsets assume 32-bit device pointers)
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);
cuParamSeti(hFunction, 0, pDeviceMemA);
cuParamSeti(hFunction, 4, pDeviceMemB);
cuParamSeti(hFunction, 8, pDeviceMemC);
cuParamSetSize(hFunction, 12);
// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);
// copy the result from device back to host
cuMemcpyDtoH((void *) pC, pDeviceMemC, cnDimension * sizeof(float));
delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);

The OpenCL kernel code is stored as text in "sProgramSource". The host code calls it as follows:

const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id * aDevices = (cl_device_id *) malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, nContextDescriptorSize, aDevices, 0);
// create a command queue for first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vectorAdd", 0);
// allocate host vectors
float * pA = new float[cnDimension];
float * pB = new float[cnDimension];
float * pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
// allocate device memory
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pA, 0);
hDeviceMemB = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pB, 0);
hDeviceMemC = clCreateBuffer(hContext,
CL_MEM_WRITE_ONLY,
cnDimension * sizeof(cl_float), 0, 0);
// setup parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
// execute kernel
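size_t nGlobalWorkSize = cnDimension; // global_work_size must point to size_t
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0, &nGlobalWorkSize, 0, 0, 0, 0);
// copy results from device back to host (blocking read on the command queue)
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0, cnDimension * sizeof(cl_float),
pC, 0, 0, 0);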
delete[] pA;
delete[] pB;
delete[] pC;
clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);

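Initialization: similarities and differences

CUDA requires a call to cuInit(0) before any other API is used; after that, the program obtains an available device and creates a context on it.
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
OpenCL needs no global initialization; requesting a context for a device type is enough to get a handle.
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
Once the context has been created, the devices that belong to it can be queried as follows:
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id * aDevices = (cl_device_id *) malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, nContextDescriptorSize, aDevices, 0);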

OpenCL introduces an additional concept: Command Queues. Commands launching kernels and
reading or writing memory are always issued for a specific command queue. A command queue is
created on a specific device in a context. The following code creates a command queue for the
device and context created so far:
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
With this the program has progressed to the point where data can be uploaded to the device’s
memory and processed by launching compute kernels on the device.

Kernel Creation

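A CUDA kernel is stored in binary form in a CUBIN file, and it is used much like a DLL: load the binary module, look up the function by name, and finally launch that function on the GPU. Sample code:
CUmodule hModule;
CUfunction hFunction;
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
To support multiple platforms, OpenCL does not rely on precompiled code. In a way loosely similar to Java, it loads the kernel as text and then compiles it just in time before running it.
Note that OpenCL also provides APIs for working with program binaries (clGetProgramInfo with CL_PROGRAM_BINARIES, and clCreateProgramWithBinary), provided the kernel has already been compiled and cached somewhere.

// load the source and compile it just in time
// (sProgramSource holds the kernel source text itself, not a file name)
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// get a handle to the kernel function
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vectorAdd", 0);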
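
Since clCreateProgramWithSource takes the kernel source text rather than a file name, a kernel kept in a separate file has to be read into memory first. The following is a minimal C++ sketch of such a helper; the name `readKernelSource` is hypothetical and not part of the OpenCL API, and the file path is whatever your project uses.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads a whole text file into a string. The returned text (not the file
// name) is what clCreateProgramWithSource expects as its source string.
std::string readKernelSource(const std::string& path) {
    assert(!path.empty());
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open kernel source: " + path);
    }
    std::ostringstream contents;
    contents << file.rdbuf();  // copy the entire file into the stream
    return contents.str();
}
```

The result would typically be handed over as a one-element string array: take `const char * text = source.c_str();` and pass `&text` as the third argument with a count of 1.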

Device memory allocation

Memory allocation is not very different between the two. OpenCL provides a pair of flags, CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY, that control whether kernels may read or write a buffer. Another flag is quite handy: CL_MEM_COPY_HOST_PTR tells OpenCL to allocate the buffer and initialize it by copying from the host pointer passed to clCreateBuffer, so allocating device memory and copying the host data to the device happen in a single call instead of two.
// CUDA

CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
// OpenCL
hDeviceMemA = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pA, 0);
hDeviceMemB = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pB, 0);
hDeviceMemC = clCreateBuffer(hContext, CL_MEM_WRITE_ONLY, cnDimension * sizeof(cl_float), 0, 0);

Kernel Parameter Specification

The next step in preparing the kernels for launch is to establish a mapping between the kernels' parameters, essentially pointers to the three vectors A, B and C, and the three device memory regions that were allocated in the previous section.
Parameter setting in both APIs is a pretty low-level affair. It requires knowledge of the total number, order, and types of a given kernel's parameters. The order and types of the parameters are used to determine a specific parameter's offset inside the data block made up of all parameters. The offset in bytes for the n-th parameter is essentially the sum of the sizes of all (n-1) preceding parameters.
Using the CUDA Driver API:
In the legacy CUDA Driver API shown here, device pointers are represented as unsigned int, which is why each parameter occupies 4 bytes, and the API has a dedicated method for setting that type. Here's the code for setting the three parameters. Note how the offset is incrementally computed as the sum of the previous parameters' sizes.
cuParamSeti(hFunction, 0, pDeviceMemA);
cuParamSeti(hFunction, 4, pDeviceMemB);
cuParamSeti(hFunction, 8, pDeviceMemC);
cuParamSetSize(hFunction, 12);
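
The offsets 0, 4, 8 and the total size 12 in the calls above follow directly from the running-sum rule. A minimal C++ sketch of that arithmetic is given below; `ParamLayout` and `computeParamLayout` are hypothetical names, and the sketch assumes tightly packed parameters with no alignment padding, which holds when all parameters have the same size as they do here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ParamLayout {
    std::vector<std::size_t> offsets;  // byte offset of each parameter
    std::size_t totalSize;             // total size of the parameter block
};

// Computes each parameter's offset as the sum of the sizes before it.
// Assumption: parameters are packed back to back without padding.
ParamLayout computeParamLayout(const std::vector<std::size_t>& sizes) {
    ParamLayout layout{};
    layout.offsets.reserve(sizes.size());
    for (std::size_t size : sizes) {
        assert(size > 0);
        layout.offsets.push_back(layout.totalSize);  // offset = bytes used so far
        layout.totalSize += size;
    }
    return layout;
}
```

For three 4-byte parameters, `computeParamLayout({4, 4, 4})` gives offsets 0, 4 and 8 and a total size of 12, matching the values passed to cuParamSeti and cuParamSetSize.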
Using OpenCL:
In OpenCL parameter setting is done via a single function that takes a pointer to the location of the
parameter to be set.
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
