C++矩陣乘法計算 || GPU && CPU 實現

阿新 • • 發佈：2019-01-30

前言

矩陣乘法運算是機器學習的基礎。比如，卷積神經網路通過矩陣化輸入資料，然後通過矩陣乘法計算獲得結果。而效能對於演算法是至關重要的事情，所以本文主要介紹c++呼叫普通的矩陣乘法庫進行計算，以及通過cuda計算矩陣乘法。C++常用cblas庫加速cpu上的矩陣乘法運算。為了將速度提高更高，GPU版本矩陣乘法運算則通過cublas庫進行操作，在cublas庫中，使用cublasSgemv()和cublasSgemm()分別進行矩陣向量間的乘法運算與矩陣矩陣間的乘法運算。本文將具體的解釋上述兩個函式的引數以及具體的應用例子。參照官方解釋

矩陣乘法函式解釋

CPU版本矩陣矩陣乘法

cblas有兩個函式實現矩陣乘法，一個是cblas_sgemm(),另一個是cblas_dgemm()兩者的不同點在於傳入引數一個是float型，一個是double型。

void cblas_sgemm(
        OPENBLAS_CONST enum CBLAS_ORDER Order,         // 矩陣儲存形式，行優先或者列優先
        OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA,    // 進行矩陣乘運算前，A是否轉置
        OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB,    // 進行矩陣運算前，B是否轉置
        OPENBLAS_CONST blasint M,                      // A的行數
        OPENBLAS_CONST blasint N, 
                      // B的列數
        OPENBLAS_CONST blasint K,                      // A的列數 <==> B的行數 
		OPENBLAS_CONST double alpha,                   // 比例因子
		OPENBLAS_CONST double *A,                      // A的首地址
		OPENBLAS_CONST blasint lda,                    // A的列數，與是否轉置無關
		OPENBLAS_CONST double 
 *B,                      // B的首地址
		OPENBLAS_CONST blasint ldb,                    // B的列數，與轉置無關
		OPENBLAS_CONST double beta,                    // 比例因子
		double *C,                                     // C的首地址
		OPENBLAS_CONST blasint ldc);                   // C的列數

計算結果: C = alpha * TransA(A) x TransB(B) + beta * C
其中, A為M x K矩陣, B為 K x N矩陣，C 為 M x N 矩陣

CPU版本矩陣向量乘法

void cblas_sgemv(
        OPENBLAS_CONST enum CBLAS_ORDER order,       // A的資料儲存型式  
        OPENBLAS_CONST enum CBLAS_TRANSPOSE trans,   // A進行矩陣運算前，是否轉置
        OPENBLAS_CONST blasint M,                    // 轉置後, A的行維度
        OPENBLAS_CONST blasint N,                    // 轉置後, A的列維度
		OPENBLAS_CONST float alpha,                  // 比例因子
		OPENBLAS_CONST float  *A,                    // A的首地址
		OPENBLAS_CONST blasint lda,                  // A的列數，與轉置無關
		OPENBLAS_CONST float  *X,                    // X的首地址
		OPENBLAS_CONST blasint incx,                 // X的首地址大小
		OPENBLAS_CONST float beta,                   // 比例因子
		float  *Y,                                   // Y的首地址 
		OPENBLAS_CONST blasint incy);                // Y的列數，與轉置無關

運算式子：Y = alpha*trans(A) x X + beta * C
其中，A為M x N的矩陣，而X為N x 1向量，C為M x 1 的向量

可以看出，函式需要傳入的引數比較多，第一個引數是矩陣的儲存型式(可選行優先或者列優先)，

GPU版本矩陣乘法

僅僅使用單精度(float)進行解釋，若想嘗試雙精度(double),可以進入官網檢視相應函式.

cublasStatus_t cublasSgemv(cublasHandle_t handle,            //傳入的cublas控制代碼 
                           cublasOperation_t trans,          
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *x, int incx,
                           const float           *beta,
                           float           *y, int incy)

這個函式實現的是矩陣向量乘法運算:
y = alpha*op(A) x + beta *y
A是一個mxn維的矩陣，以列優先的形式儲存在記憶體中，x,y分別是向量，alpha,beta分別規模因子，也就是倍數關係，而對於矩陣A來說 op(A) = A 如果第二個引數 transa == CUBLAS_OP_N，如果第二個引數為transa == CUBLAS_OP_T，那麼op(A)結果是A的轉置，而如果transa==cublas_op_H，那麼op(A)的結果是A的共軛轉置.
而lda是A矩陣主軸的維度，而incx與incy分別是x,y各個元素之間的間隔。

cublasStatus_t cublasSgemm(cublasHandle_t handle,       //cublas控制代碼
                           cublasOperation_t transa, 
                           cublasOperation_t transb,
                           int m, int n, int k,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           const float           *beta,
                           float           *C, int ldc)

C = alpha * op(A)op(B) + beta*C，其中A,B,C都為矩陣，而A維度為mxk，B維度為kxn，而C的維度為mxk，lda,ldb,ldc分別是矩陣A,B,C主軸上的維度。

#例項

#include "./common/common.h"
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include "cublas_v2.h"
#include <iostream>
using namespace std;

/*
 * A simple example of performing matrix-vector multiplication using the cuBLAS
 * library and some randomly generated inputs.
 */

/*
 * M = # of rows
 * N = # of columns
 */
int M = 5;
int N = 2;

/*
 * Generate a vector of length N with random single-precision floating-point
 * values between 0 and 100.
 */
void generate_random_vector(int N, float **outX)
{
    int i;
    double rMax = (double)RAND_MAX;
    float *X = (float *)malloc(sizeof(float) * N);

    for (i = 0; i < N; i++)
    {
        int r = rand();
        double dr = (double)r;
        X[i] = (dr / rMax) * 100.0;
    }

    *outX = X;
}

/*
 * Generate a matrix with M rows and N columns in column-major order. The matrix
 * will be filled with random single-precision floating-point values between 0
 * and 100.
 */
void generate_random_dense_matrix(int M, int N, float **outA)
{
    int i, j;
    double rMax = (double)RAND_MAX;
    float *A = (float *)malloc(sizeof(float) * M * N);

    // For each column
    for (j = 0; j < N; j++)
    {
        // For each row
        for (i = 0; i < M; i++)
        {
            double dr = (double)rand();
            A[j * M + i] = (dr / rMax) * 100.0;
        }
    }

    *outA = A;
}

int main(int argc, char **argv)
{
    int i;
    float *A, *dA;
    float *X, *dX;
    float *Y, *dY;
    float beta;
    float alpha;
    cublasHandle_t handle = 0;

    alpha = 3.0f;
    beta = 4.0f;

    // Generate inputs
    srand(9384);
    generate_random_dense_matrix(M, N, &A);
    cout<<"A: "<<endl;
    for(int i = 0; i < M; i ++){
        for(int j = 0; j < N; j ++){
            cout << A[M * j + i] <<" ";
        }
        cout<< endl;
    }


    generate_random_vector(N, &X);

    cout<<"X: "<<endl;
    for(int i = 0; i < N; i ++){
        cout<<X[i]<<" ";
    }
    cout<< endl;


    generate_random_vector(M, &Y);

    cout<<"Y: "<<endl;
    for(int i = 0; i < M; i++){
        cout<<Y[i]<<" ";
    }
    cout<<endl;

    // Create the cuBLAS handle
    CHECK_CUBLAS(cublasCreate(&handle));

    // Allocate device memory
    CHECK(cudaMalloc((void **)&dA, sizeof(float) * M * N));
    CHECK(cudaMalloc((void **)&dX, sizeof(float) * N));
    CHECK(cudaMalloc((void **)&dY, sizeof(float) * M));


    // Transfer inputs to the device
    CHECK_CUBLAS(cublasSetVector(N, sizeof(float), X, 1, dX, 1));
    CHECK_CUBLAS(cublasSetVector(M, sizeof(float), Y, 1, dY, 1));
    CHECK_CUBLAS(cublasSetMatrix(M, N, sizeof(float), A, M, dA, M));

    // Execute the matrix-vector multiplication
    CHECK_CUBLAS(cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1,
                             &beta, dY, 1));
    cout<< "aplha: "<<alpha<< " beta: "<<beta<<endl;
    // Retrieve the output vector from the device
    CHECK_CUBLAS(cublasGetVector(M, sizeof(float), dY, 1, Y, 1));

    cout<<"result: "<<endl;
    for (i = 0; i < M; i++)
    {
        printf("%2.2f\n", Y[i]);
    }

    printf("...\n");

    free(A);
    free(X);
    free(Y);

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dY));
    CHECK_CUBLAS(cublasDestroy(handle));

    return 0;
}

最終的輸出結果為:

A: 
51.3409 68.9939 
27.9576 78.9213 
52.4111 97.6222 
36.8833 74.3454 
39.9056 4.77248 
B: 
82.3203 91.5174 84.3425 
30.7926 37.0584 81.2048 
C: 
34.009 36.5475 79.0337 
7.12583 76.2508 76.6652 
71.0715 39.7865 79.2842 
3.62919 88.8859 93.6551 
78.67 60.1165 20.3579 
X: 
30.6251 21.6128 
72.769 67.5084 61.5184 41.7629 46.4297 
Y: 
9481.49 7955.78 11391 8376.15 4161.5 
C: 
19188.7 21912.4 30114.7 
14223.5 16754.9 26607.1 
22245.9 25401.9 37360.8 
15991.1 18747.3 27818.7 
10610.7 11727.2 11341.3

C++矩陣乘法計算 || GPU && CPU 實現

前言

矩陣乘法函式解釋

CPU版本矩陣矩陣乘法

CPU版本矩陣向量乘法

GPU版本矩陣乘法

C++矩陣乘法計算 || GPU && CPU 實現

線性代數教程之一——矩陣乘法計算、理解及程式碼實現

C語言乘法計算

【華為機試070】矩陣乘法計算量估算

大矩陣乘法運算map reduce實現思路

MPI矩陣乘法的兩種實現方法

C#矩陣乘法

華為OJ——矩陣乘法計算量估算

C語言之兩矩陣乘法的實現

C語言矩陣乘法（指標實現）

遞推式計算與矩陣乘法 C++

一些演算法的MapReduce實現——矩陣分塊乘法計算(1)

模板C++ 02數論算法 4矩陣乘法

分治法實現矩陣乘法

(Matlab)GPU計算及CPU計算能力的比較

矩陣乘法來加速遞推式計算

C++ 方矩陣乘法 + Strassen矩陣

矩陣乘法模板C/C++

神經網路高效能運算卷積計算優化 openblas GEMM 矩陣乘法優化 ncnn mobileNet-ssd shueezeNet-ssd

java實現兩個矩陣乘法有個錯誤希望有大佬幫忙

C++矩陣乘法計算 || GPU && CPU 實現

前言

矩陣乘法函式解釋

CPU版本矩陣矩陣乘法

CPU版本矩陣向量乘法

GPU版本矩陣乘法

相關推薦