ARM Compute Library: Building and Installation
阿新 · Published: 2020-08-10
1. Download
https://github.com/ARM-software/ComputeLibrary
2. Since my cross compiler is already on the PATH (added to the environment variables), I modified the SConstruct file accordingly.
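For reference, putting the cross toolchain on the PATH might look like the following; the install location below is an assumption, so adjust it to your system:

```shell
# Assumed install location of the cross toolchain -- adjust to your system.
export PATH=/opt/gcc-arm-linux-gnueabihf/bin:$PATH

# Confirm the cross compiler resolves before running scons.
arm-linux-gnueabihf-g++ --version
```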
3. Build
To compile OpenCL support in and embed the kernels into the library, add embed_kernels=1:
scons Werror=0 debug=0 asserts=1 neon=1 opencl=1 embed_kernels=1 os=linux arch=armv7a
If only NEON acceleration is needed:
scons Werror=0 debug=0 asserts=0 neon=1 opencl=1 os=linux arch=armv7a
The debug and asserts options are for debugging; they increase run time.
4. Linking
After a successful build, a build folder is generated in the library's root directory; I only use the two .so files below.
#include "arm_compute/core/Types.h"
#include "tests/Utils.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "arm_compute/runtime/NEON/functions/NEGEMM.h"
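Linking an application against the library might look like this. The checkout path is a placeholder, and the library names (libarm_compute.so, libarm_compute_core.so) are what a standard ACL build produces; verify the names in your own build folder:

```shell
# ACL_ROOT is a placeholder for your ComputeLibrary checkout -- adjust it.
ACL_ROOT=/path/to/ComputeLibrary
arm-linux-gnueabihf-g++ -O2 gemm_test.cpp \
    -I"$ACL_ROOT" -I"$ACL_ROOT/include" \
    -L"$ACL_ROOT/build" -larm_compute -larm_compute_core \
    -o gemm_test
```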
5. Running: matrix multiplication
NEON acceleration:
// ACL's TensorShape takes (width, height), so the MxK matrix A is declared as (K, M).
TensorShape AShape(K, M);
TensorShape BShape(N, K);
TensorShape OShape(N, M);
Tensor ATensor, BTensor, OTensor, CTensor;
ATensor.allocator()->init(TensorInfo(AShape, Format::F32));
BTensor.allocator()->init(TensorInfo(BShape, Format::F32));
OTensor.allocator()->init(TensorInfo(OShape, Format::F32));

// Configure O = ALPHA * A * B (no bias tensor, beta = 0).
NEGEMM armGemm;
armGemm.configure(&ATensor, &BTensor, nullptr, &OTensor, ALPHA, 0.0f);

// Allocate backing memory only after configure().
ATensor.allocator()->allocate();
BTensor.allocator()->allocate();
OTensor.allocator()->allocate();

// Copy the row-major float array A into ATensor.
Window A_window;
A_window.use_tensor_dimensions(ATensor.info()->tensor_shape());
Iterator A_it(&ATensor, A_window);
execute_window_loop(A_window, [&](const Coordinates & id)
{
    *reinterpret_cast<float *>(A_it.ptr()) = A[id.z() * (M * K) + id.y() * K + id.x()];
},
A_it);

// Copy the row-major float array B into BTensor.
Window B_window;
B_window.use_tensor_dimensions(BTensor.info()->tensor_shape());
Iterator B_it(&BTensor, B_window);
execute_window_loop(B_window, [&](const Coordinates & id)
{
    *reinterpret_cast<float *>(B_it.ptr()) = B[id.z() * (K * N) + id.y() * N + id.x()];
},
B_it);

// Warmup run.
armGemm.run();

// Read the result back into the row-major output array C.
for (int h = 0; h < M; h++)
{
    for (int w = 0; w < N; w++)
    {
        C[h * N + w] = *reinterpret_cast<float *>(
            OTensor.buffer() + OTensor.info()->offset_element_in_bytes(Coordinates(w, h, 0)));
    }
}
OpenCL acceleration requires OpenCL 2.0 or later, or a driver that supports -cl-arm-non-uniform-work-group-size.