  • Pointer traversal

OpenCL does not support pointer traversal the way CUDA does; you can only emulate it indirectly by using indices. For example:

// CUDA
struct Node { Node* next; };
n = n->next;

// OpenCL
struct Node { unsigned int next; };
// bufBase points to the first Node in the buffer; next is an index into it
n = bufBase + n->next;
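
To make the index trick concrete, here is a minimal host-side sketch in plain C (not kernel code). The `NODE_END` sentinel, the `value` payload, and the `sumList` and `demoSum` helpers are assumptions added for illustration; they do not come from either API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel that marks the end of the list (plays the role of NULL). */
#define NODE_END 0xFFFFFFFFu

/* A node stores the index of its successor instead of a pointer. */
struct Node {
    unsigned int next;   /* index into the node buffer, or NODE_END */
    int value;           /* payload, added here for illustration */
};

/* Walk the list starting at index `head` and sum the payloads.
 * `bufBase` is the start of the node buffer, `count` is its length. */
int sumList(const struct Node *bufBase, size_t count, unsigned int head)
{
    int sum = 0;
    unsigned int i = head;
    while (i != NODE_END) {
        assert(i < count);                   /* an index can be bounds-checked */
        const struct Node *n = bufBase + i;  /* same as &bufBase[i] */
        sum += n->value;
        i = n->next;                         /* replaces n = n->next */
    }
    return sum;
}

/* Build a three-node list stored out of order: 2 -> 0 -> 1 -> end. */
int demoSum(void)
{
    struct Node buf[3] = {
        { 1u, 20 },        /* index 0, followed by index 1 */
        { NODE_END, 30 },  /* index 1, last node */
        { 0u, 10 },        /* index 2, head of the list */
    };
    return sumList(buf, 3, 2u);
}
```

Because the links are plain indices, the same buffer stays valid after it is copied to another address space, which is why this layout suits OpenCL buffers.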

  • Kernel code: similarities and differences


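OpenCL kernels are shipped as pseudocode and are compiled and loaded on the fly when the program runs. This is similar to Java or .NET programs, and for the same reason: cross-platform compatibility. The syntax of kernel code also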




1) A CUDA kernel function is declared with "__global__", whereas an OpenCL kernel function is declared with "__kernel".


3) As is well known, CUDA uses threadIdx.{x|y|z} and blockIdx.{x|y|z} to obtain the index of the current thread, whereas OpenCL



  • Host code: similarities and differences


The OpenCL code is stored as text in "sProgramSource". It is invoked as follows:
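
As a rough sketch of what such a source string might contain, assuming a vectorAdd kernel with three float vectors: the kernel body and the `sourceDeclaresKernel` helper below are hypothetical and written only for illustration. `__kernel`, `__global`, and `get_global_id` are standard OpenCL C.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical kernel source, written for illustration.
 * get_global_id(0) returns this work-item's index in dimension 0. */
const char *sProgramSource =
    "__kernel void vectorAdd(__global const float *a,\n"
    "                        __global const float *b,\n"
    "                        __global float *c)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* Cheap sanity check before handing the text to the OpenCL compiler:
 * returns 1 if the source contains the __kernel qualifier and the given
 * kernel name, 0 otherwise. This is a substring test, not a parser. */
int sourceDeclaresKernel(const char *source, const char *kernelName)
{
    assert(source != NULL && kernelName != NULL);
    return strstr(source, "__kernel") != NULL
        && strstr(source, kernelName) != NULL;
}
```

The string is ordinary host data: nothing in it is compiled until the program is built at run time.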

  • Initialization: similarities and differences

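With CUDA, cuInit(0) must be called before any other API function. The next step is to get an available device on the current system and create a Context for it.
CUdevice hDevice;
CUcontext hContext;
cuDeviceGet(&hDevice, 0);            // first device in the system
cuCtxCreate(&hContext, 0, hDevice);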

// OpenCL
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);

// Query the size in bytes of the context's device list, then fetch the list
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id * aDevices = malloc(nContextDescriptorSize);

clGetContextInfo(hContext, CL_CONTEXT_DEVICES, nContextDescriptorSize, aDevices, 0);

OpenCL introduces an additional concept: Command Queues. Commands launching kernels and
reading or writing memory are always issued for a specific command queue. A command queue is
created on a specific device in a context. The following code creates a command queue for the
device and context created so far:
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
With this the program has progressed to the point where data can be uploaded to the device’s
memory and processed by launching compute kernels on the device.

  • Kernel Creation

CUDA kernels are stored in binary form in a CUBIN file. They are used much like a DLL: first load the binary library, then look up by function name

CUmodule hModule;
CUfunction hFunction;
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

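To support multiple platforms, OpenCL does not rely on precompiled code. Instead, in a Java-like manner, it loads the code as text and then compiles and runs it on the fly.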

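// Load the source code and compile it on the fly
// (sProgramSource is assumed to be a const char * holding the kernel source text)
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, &sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// Get a handle to the kernel function
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vectorAdd", 0);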
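
Since clCreateProgramWithSource expects the source text itself rather than a file name, a kernel kept in a separate file has to be read into memory first. Below is a minimal sketch of such a loader in plain C; the helper name `loadProgramSource` is hypothetical and is not part of OpenCL.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole text file into a newly allocated, NUL-terminated buffer.
 * Returns NULL on failure; the caller frees the result with free(). */
char *loadProgramSource(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    char *text = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);                 /* file size in bytes */
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            text = malloc((size_t)size + 1);  /* +1 for the terminator */
            if (text != NULL) {
                size_t got = fread(text, 1, (size_t)size, f);
                text[got] = '\0';
            }
        }
    }
    fclose(f);
    return text;
}
```

The returned pointer can be stored in a `const char *` variable whose address is then passed as the third argument of clCreateProgramWithSource, with a count of 1.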

  • Device memory allocation

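Memory allocation does not differ much between the two. OpenCL provides two special flags, CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY, to control read and write access to the memory. Another flag is quite useful: with CL_MEM_COPY_HOST_PTR the data lives in memory allocated on the host, and the runtime automatically copies it into the new buffer when the buffer is created so that the GPU can use it.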


// CUDA
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
// OpenCL
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pA, 0);
hDeviceMemB = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float), pB, 0);
hDeviceMemC = clCreateBuffer(hContext, CL_MEM_WRITE_ONLY, cnDimension * sizeof(cl_float), 0, 0);

  • Kernel Parameter Specification

The next step in preparing the kernels for launch is to establish a mapping between the kernels’
parameters, essentially pointers to the three vectors A, B and C, to the three device memory regions,
which were allocated in the previous section.
Parameter setting in both APIs is a pretty low-level affair. It requires knowledge of the total number,
order, and types of a given kernel’s parameters. The order and types of the parameters are used to
determine a specific parameter’s offset inside the data block made up of all parameters. The offset in
bytes for the n-th parameter is essentially the sum of the sizes of all (n-1) preceding parameters.
Using the CUDA Driver API:
In CUDA device pointers are represented as unsigned int and the CUDA Driver API has a
dedicated method for setting that type. Here’s the code for setting the three parameters. Note how
the offset is incrementally computed as the sum of the previous parameters’ sizes.
cuParamSeti(hFunction, 0, pDeviceMemA);
cuParamSeti(hFunction, 4, pDeviceMemB);
cuParamSeti(hFunction, 8, pDeviceMemC);
cuParamSetSize(hFunction, 12);
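
The running-sum rule for the offsets can be sketched in a few lines of plain C. The helper `computeParamOffsets` is hypothetical and not part of the CUDA Driver API. It also assumes that no alignment padding is needed between parameters, which holds for three 4-byte values but may not hold in general.

```c
#include <assert.h>
#include <stddef.h>

/* Fill offsets[i] with the byte offset of parameter i, computed as the sum
 * of the sizes of all preceding parameters. Returns the total size of the
 * parameter block (the value that would go to cuParamSetSize).
 * Assumption: no alignment padding between parameters. */
size_t computeParamOffsets(const size_t *sizes, size_t count, size_t *offsets)
{
    assert(sizes != NULL && offsets != NULL);

    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = offset;   /* parameter i starts where the previous ones end */
        offset += sizes[i];
    }
    return offset;
}
```

For three 4-byte parameters this yields the offsets 0, 4, and 8 and a total size of 12, matching the constants in the calls above.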
Using OpenCL:
In OpenCL parameter setting is done via a single function that takes a pointer to the location of the
parameter to be set.
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);



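OpenCL and CUDA are two programming models for heterogeneous computing. Ever since it officially released CUDA in 2007, NVIDIA has promoted this programming model heavily, focusing mainly on scientific computing, because many applications in that field are data-parallel. As a result, using CUDA to accelerate originally single-threaded programs on NVIDIA's own GPUs generally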

