1. 程式人生 > >一個程式來比較cuda/c在GPU/CPU的執行效率

一個程式來比較cuda/c在GPU/CPU的執行效率

在網上看了一個比較GPU和CPU執行矩陣運算效率的帖子,親自跑了一下。

這是一個CUDA語言程式,請儲存為“檔名.cu”。我的主機配置如下:

CPU:G2030;記憶體8GB;顯示卡:GTX750ti。

程式碼如下所示:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024

void serial_add(double *a, double *b, double *c, int n, int m)
{
    for(int index=0;index<n;index++)
    {
        for(int j=0;j<m;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
    }
}

__global__ void vector_add(double *a, double *b, double *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
        for(int j=0;j<M;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
}

int main()
{
    clock_t start,end;

    double *a, *b, *c;
    int size = N * sizeof( double );

    a = (double *)malloc( size );
    b = (double *)malloc( size );
    c = (double *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        a[i] = b[i] = i;
        c[i] = 0;
    }

    start = clock();
    serial_add(a, b, c, N, M);

    printf( "c[%d] = %f\n",0,c[0] );
    printf( "c[%d] = %f\n",N-1, c[N-1] );

    end = clock();

    float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("CPU: %f seconds\n",time1);

    start = clock();
    double *d_a, *d_b, *d_c;


    cudaMalloc( (void **) &d_a, size );
    cudaMalloc( (void **) &d_b, size );
    cudaMalloc( (void **) &d_c, size );


    cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );

    vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );

    cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );


    printf( "c[%d] = %f\n",0,c[0] );
    printf( "c[%d] = %f\n",N-1, c[N-1] );


    free(a);
    free(b);
    free(c);
    cudaFree( d_a );
    cudaFree( d_b );
    cudaFree( d_c );

    end = clock();
    float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);

    return 0;
}
程式執行結果如下圖所示:


CPU的執行時間是GPU執行時間的10倍,這已經是一個數量級的差距了。而且,我相信隨著運算量的加大,差距會更加明顯。由此看來GPU做矩陣運算確實比CPU快太多。

原帖地址:https://my.oschina.net/zzw922cn/blog/631650