Evaluating Intel Vectorized Instructions for Matrix Multiplication

阿新 • Published: 2019-01-31

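With the rapid development of machine learning and other artificial intelligence techniques, matrix multiplication is used more and more widely. Intel CPUs have successively introduced several families of vector instructions, including MMX, SSE, and AVX, all of which support SIMD operations. Later, to better support matrix multiplication, the FMA (Fused Multiply-Add) instructions were added. An FMA instruction takes three vector operands va, vb, and vc, and its effect is equivalent to the expression (va∗vb)+vc (with a single rounding at the end rather than one after each operation). Both the multiplication and the addition are applied element by element, so the result is a vector of the same length. FMA makes matrix multiplication more convenient, but the same effect can also be achieved by combining the multiply and add instructions of the AVX family. This article uses an example to compare the performance and accuracy of the different vector instructions in matrix multiplication.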
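
To make the element-wise semantics concrete, here is a minimal scalar sketch of what an 8-lane single-precision multiply-add computes. The name `fma8_reference` is hypothetical and used only for illustration. Note that this plain C version may round twice (once after the multiply, once after the add), so it shows the shape of the result rather than the bit-exact behaviour of a hardware FMA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper: out[i] = va[i] * vb[i] + vc[i] for 8 lanes,
   mirroring a 256-bit register that holds 8 single-precision floats. */
void fma8_reference(const float va[8], const float vb[8],
                    const float vc[8], float out[8]) {
  assert(va && vb && vc && out);
  for (size_t i = 0; i < 8; i++) {
    out[i] = va[i] * vb[i] + vc[i];  /* one multiply-add per lane */
  }
}
```

Calling it with three 8-element arrays fills `out` with 8 results, one per lane, which is the "vector of the same length" described above.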
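The example computes the product of a matrix W and a vector x. The number of columns of W equals the length of x, and the result is again a vector whose length equals the number of rows of W. The code is as follows.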

#include <stdio.h>
#include <stdlib.h>  /* rand() */
#include <time.h>
#include <x86intrin.h>

int main() {
  const int col = 1024, row = 64, num_trails = 1000000;

  float w[row][col];
  float x[col];
  float y[row];
  float scratchpad[8];
  for (int i=0; i<row; i++) {
    for (int j=0; j<col; j++) {
      w[i][j]=(float)(rand()%1000)/800.0f;
    }   
  }
  for (int j=0; j<col; j++) {
    x[j]=(float)(rand()%1000)/800.0f;
  }

  clock_t t1, t2; 
// The original matrix multiplication version
  t1 = clock();
  for (int r = 0; r < num_trails; r++)
    for(int j = 0; j < row; j++)
    {   
      float sum = 0;
      float *wj = w[j];

      for(int i = 0; i < col; i++)
        sum += wj[i] * x[i];

      y[j] = sum;
    }   
  t2 = clock();
  float diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC;
  printf("\nTime taken: %.2f second.\n", diff);

  for (int i=0; i<row; i++) {
    printf("%.4f, ", y[i]);
  }
  printf("\n");
// The avx matrix multiplication version.
  const int col_reduced_8 = col - col % 8;

  __m256 op0, op1, tgt, tmp_vec;

  t1 = clock();
  for (int r = 0; r < num_trails; r++)
    for (int i=0; i<row; i++) {
      float res = 0;

      tgt = _mm256_setzero_ps();
      for (int j = 0; j < col_reduced_8; j += 8) {
        op0 = _mm256_loadu_ps(&x[j]);
        op1 = _mm256_loadu_ps(&w[i][j]);
        tmp_vec = _mm256_mul_ps(op0, op1);
        tgt = _mm256_add_ps(tmp_vec, tgt);
      }

      _mm256_storeu_ps(scratchpad, tgt);
      for (int k=0; k<8; k++)
        res += scratchpad[k];

      for (int l=col_reduced_8; l<col; l++) {
        res += w[i][l] * x[l];
      }
      y[i] = res;
    }
  t2 = clock();
  diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC;
  printf("\nTime taken: %.2f second.\n", diff);

  for (int i=0; i<row; i++) {
    printf("%.4f, ", y[i]);
  }
  printf("\n");
 // The fma matrix multiplication version.
  t1 = clock();
  for(int r = 0; r < num_trails; r++)
    for(int i = 0; i < row; i++)
    {
      float rlt = 0;

      tgt = _mm256_setzero_ps();
      for(int j = 0; j < col_reduced_8; j += 8)
      {
        op0 = _mm256_loadu_ps(&x[j]);
        op1 = _mm256_loadu_ps(&w[i][j]);
        tgt = _mm256_fmadd_ps(op0, op1, tgt);
      }
      __builtin_ia32_storeups256(scratchpad, tgt);
      for(int k = 0; k < 8; k++)
      {
        rlt += scratchpad[k];
      }
      for(int l = col_reduced_8; l < col; l++)
      {
        rlt += w[i][l] * x[l];
      }
      y[i] = rlt;
    }

  t2 = clock();
  diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC ;
  printf("\nTime taken: %.2f second.\n", diff);

  for(int i=0; i<row; i++)
  {
    printf("%.4f, ", y[i]);
  }
  printf("\n");
  return 0;
}
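
The two vectorized loops sum in a different order from the scalar loop: each of the 8 lanes accumulates every eighth product, the 8 partial sums are added at the end, and any leftover elements go through a scalar tail loop. A portable sketch of that structure, without intrinsics, might look like the following. The name `dot_lanes8` is hypothetical, and the code is a scalar emulation for illustration, not the vectorized implementation itself.

```c
#include <assert.h>

/* Hypothetical scalar emulation of the 8-lane accumulation pattern used by
   the AVX and FMA loops: lane k sums the products at indices k, k+8, ... */
float dot_lanes8(const float *w, const float *x, int n) {
  assert(n >= 0);
  const int n8 = n - n % 8;     /* largest multiple of 8 not exceeding n */
  float lane[8] = {0};

  for (int j = 0; j < n8; j += 8)
    for (int k = 0; k < 8; k++)
      lane[k] += w[j + k] * x[j + k];

  float res = 0;
  for (int k = 0; k < 8; k++)   /* horizontal sum of the 8 partial sums */
    res += lane[k];
  for (int l = n8; l < n; l++)  /* scalar tail for the last n % 8 elements */
    res += w[l] * x[l];
  return res;
}
```

With col = 1024, as in the program above, the tail loop never runs because 1024 is a multiple of 8.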

On Ubuntu, the program is compiled with:
gcc -O2 -mfma test.c -o test
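Note that the program can only run on a CPU that supports FMA. The following command shows whether the CPU reports the fma flag:
cat /proc/cpuinfo | grep fma
If it prints nothing, the CPU does not support FMA.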
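
The same check can also be made from inside a program, so that it falls back to a scalar loop instead of crashing on a CPU without FMA. The following is a minimal sketch, assuming GCC or Clang on x86: it relies on the `__builtin_cpu_supports` builtin and the `target` function attribute, both of which are compiler extensions. The function names `dot_scalar`, `dot_fma`, and `dot_dispatch` are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>

/* Portable fallback: plain scalar dot product. */
static float dot_scalar(const float *w, const float *x, int n) {
  float sum = 0;
  for (int i = 0; i < n; i++)
    sum += w[i] * x[i];
  return sum;
}

/* FMA path. The target attribute enables AVX and FMA for this function only,
   so the file itself can be built without -mfma. */
__attribute__((target("avx,fma")))
static float dot_fma(const float *w, const float *x, int n) {
  const int n8 = n - n % 8;
  float lanes[8];
  __m256 acc = _mm256_setzero_ps();
  for (int j = 0; j < n8; j += 8)
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(&w[j]), _mm256_loadu_ps(&x[j]), acc);
  _mm256_storeu_ps(lanes, acc);
  float res = 0;
  for (int k = 0; k < 8; k++)   /* add up the 8 partial sums */
    res += lanes[k];
  for (int l = n8; l < n; l++)  /* leftover elements */
    res += w[l] * x[l];
  return res;
}

/* Hypothetical dispatcher: use the FMA path only if the CPU reports it. */
float dot_dispatch(const float *w, const float *x, int n) {
  assert(n >= 0);
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
    return dot_fma(w, x, n);
  return dot_scalar(w, x, n);
}
```

Each row of the result would then be computed as `y[i] = dot_dispatch(w[i], x, col)`. The two paths can differ in the last digits, because they round and sum in a different order.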
The output is:
Time taken: 93.56 second.
409.8341, 413.4546, 398.7332, 399.8303, 404.1195, 402.3861, 394.6979, 412.6429, 409.0014, 390.9019, 400.3911, 392.7900, 400.5019, 418.6781, 399.3336, 404.0719, 414.9839, 411.6887, 396.0086, 406.6972, 384.5781, 399.3724, 400.0473, 391.6383, 401.3511, 400.8543, 418.4066, 406.6425, 405.5102, 408.4534, 403.0285, 406.3510, 410.2005, 414.9617, 417.3602, 406.4511, 397.1705, 406.1265, 393.3314, 407.1777, 389.9053, 397.3145, 401.7866, 413.3134, 415.7482, 414.2341, 403.3439, 405.4922, 395.4076, 399.6389, 409.6675, 419.8184, 412.3336, 399.8252, 403.3434, 387.4861, 402.2747, 399.8241, 414.1568, 405.4861, 406.6151, 410.4040, 408.9755, 398.9610,

Time taken: 10.94 second.
409.8341, 413.4549, 398.7335, 399.8304, 404.1191, 402.3860, 394.6979, 412.6424, 409.0016, 390.9022, 400.3909, 392.7900, 400.5020, 418.6781, 399.3336, 404.0718, 414.9842, 411.6884, 396.0087, 406.6971, 384.5780, 399.3723, 400.0472, 391.6382, 401.3510, 400.8541, 418.4067, 406.6424, 405.5103, 408.4536, 403.0287, 406.3513, 410.2007, 414.9618, 417.3603, 406.4513, 397.1708, 406.1266, 393.3315, 407.1776, 389.9049, 397.3150, 401.7864, 413.3134, 415.7483, 414.2341, 403.3439, 405.4922, 395.4075, 399.6392, 409.6674, 419.8183, 412.3336, 399.8253, 403.3433, 387.4865, 402.2746, 399.8239, 414.1567, 405.4861, 406.6153, 410.4034, 408.9752, 398.9612,

Time taken: 12.08 second.
409.8341, 413.4549, 398.7335, 399.8304, 404.1191, 402.3860, 394.6979, 412.6424, 409.0016, 390.9022, 400.3909, 392.7900, 400.5021, 418.6781, 399.3336, 404.0718, 414.9842, 411.6884, 396.0087, 406.6971, 384.5780, 399.3722, 400.0472, 391.6382, 401.3510, 400.8541, 418.4067, 406.6424, 405.5102, 408.4536, 403.0287, 406.3513, 410.2007, 414.9618, 417.3603, 406.4513, 397.1708, 406.1266, 393.3315, 407.1776, 389.9050, 397.3150, 401.7864, 413.3134, 415.7483, 414.2341, 403.3439, 405.4922, 395.4075, 399.6392, 409.6674, 419.8183, 412.3336, 399.8253, 403.3433, 387.4865, 402.2746, 399.8239, 414.1568, 405.4861, 406.6153, 410.4034, 408.9752, 398.9612,

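As the timings show, the AVX multiply-plus-add combination is even slightly faster than the FMA instruction in this test (10.94 s versus 12.08 s), and both are roughly 8 times faster than the scalar loop (93.56 s). In terms of accuracy, the two vectorized versions agree with each other to within one unit in the fourth decimal place, and both differ from the scalar result by up to a few units in the fourth decimal place. The original reading is that their accuracy is similar to each other and slightly lower than that of the scalar computation. Strictly speaking, the output only shows that the results differ: the vectorized loops add the products in a different order (eight partial sums), so a higher-precision reference would be needed to tell which result is closer to the exact value.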
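
One way to check the accuracy question is to compare each single-precision result against a reference accumulated in double precision. The following is a minimal sketch under that assumption. All function names are hypothetical, and `dot_8sums` is a scalar emulation of the 8-lane summation order rather than the AVX code itself.

```c
#include <assert.h>

/* Reference: accumulate in double to approximate the exact dot product. */
double dot_ref(const float *w, const float *x, int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += (double)w[i] * (double)x[i];
  return s;
}

/* Sequential float accumulation, as in the scalar loop. */
float dot_seq(const float *w, const float *x, int n) {
  float s = 0;
  for (int i = 0; i < n; i++)
    s += w[i] * x[i];
  return s;
}

/* Eight float partial sums added at the end, as in the vectorized loops.
   For brevity this sketch requires n to be a multiple of 8. */
float dot_8sums(const float *w, const float *x, int n) {
  assert(n >= 0 && n % 8 == 0);
  float lane[8] = {0};
  for (int j = 0; j < n; j += 8)
    for (int k = 0; k < 8; k++)
      lane[k] += w[j + k] * x[j + k];
  float s = 0;
  for (int k = 0; k < 8; k++)
    s += lane[k];
  return s;
}

/* Absolute error of a float result with respect to the double reference. */
double abs_err(float got, double ref) {
  double d = (double)got - ref;
  return d < 0 ? -d : d;
}
```

Printing `abs_err(dot_seq(w[i], x, col), dot_ref(w[i], x, col))` next to the same expression for `dot_8sums` for each row would show which summation order is actually closer on this data.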
