A First Look at CAPI and Usage Notes (3)

Axin • Published: 2017-09-08


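Author's note:

Given my limited ability and time, this article surely contains quite a few mistakes. Corrections are welcome by email at [email protected], and I look forward to the discussion. Most of the content is original, and anything copied is attributed to its source (please tell me if I missed one), so please credit http://www.cnblogs.com/e-shannon/ when reposting.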

http://www.cnblogs.com/e-shannon/p/7495618.html

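3 CAPI Detailed Structure and Flow

CAPI is designed to make the accelerator a full peer to the CPU: it communicates directly with the application and accesses memory the same way the CPU does (full access to the real address space, unmodified effective addresses (EA), the CPU's page tables, and so on). Compared with acceleration through a PCIe I/O port, this removes the I/O driver overhead and simplifies address translation, and the AFU/PSL interface shortens development because the designer no longer has to deal with the PCIe driver and PCIe design [dream1].

As the figure in the previous part shows, CAPI uses PCIe as its physical channel. It needs CAPP support on the CPU side and IBM's PSL IP on the FPGA side. CAIA defines the interface between the OS and a CAPI device that carries a PSL.

3.1 CAPI Hardware Structure

In the overall CAPI design, the user side (the FPGA side) consists of the PSL and the AFU, and the CPU side is the CAPP block.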

3.1.1 CAPP

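CAPP stands for Coherent Accelerator Processor Proxy. In a multi-core POWER CPU, the CAPP extends coherency to the accelerator, and a directory on the CAPP provides coherency responses on behalf of the accelerator. The whole coherency protocol runs over the PCIe physical link, with PCIe sitting between the PSL and the CAPP. The original English description: The Coherent Attached Processor Proxy (CAPP) in the multi-core POWER8 processor extends coherency to the attached accelerator. A directory on the CAPP provides coherency responses on behalf of the accelerator. Coherency protocol is tunneled over standard PCI Express links between the CAPP unit on the processor and the POWER service layer (PSL) on the accelerator card.

It has the following characteristics:

It acts as the proxy for the FPGA accelerator, is integrated into the processor, uses a table-driven protocol (?), and keeps a shadow cache directory for the accelerator with up to 1 MB of cache tags (cache-line based [dream2]).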

Proxy for FPGA Accelerator on PowerBus

Integrated into Processor

Programmable (Table Driven) Protocol for CAPI

Shadow Cache Directory for Accelerator

Up to 1MB Cache Tags (Line based)

Larger block based Cache

3.1.2 PSL

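PSL stands for POWER Service Layer. It is the AFU's bridge to the CPU and provides the AFU with POWER-architecture-compatible address translation and a cache of system memory (256 KB in the first implementation). Compared with I/O acceleration it has many advantages: shared memory, no need to pin data in memory for DMA, lower latency for cached data, and an easier, more natural programming model. It is implemented in the FPGA as an encrypted IP core, first on Altera and later on Xilinx devices [dream3].

The figure below outlines the PSL's functions. The PSL contains the blocks shown in the figure (cache, MMU, ISL) as well as a memory protection table.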

[Figure: PSL functional block diagram (image not available)]

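TLB: the Translation Lookaside Buffer (TLB) is a cache of translations from virtual (or effective) addresses to physical (or real) addresses; some designs implement it with a CAM. It speeds up address translation.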

https://en.wikipedia.org/wiki/Translation_lookaside_buffer

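Ways to speed up memory access: http://blog.csdn.net/ctthuangcheng/article/details/8550450

a. Use an MMU (Memory Management Unit).

b. Keep the most frequently translated addresses in the Translation Lookaside Buffer (TLB), a high-speed cache. These addresses can be resolved directly from the cache without walking the page table.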
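The TLB idea described above can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the 4 KB page size, the LRU replacement policy, the class name TinyTLB and the page table contents are all made up for illustration and are not taken from the PSL documentation.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class TinyTLB:
    """Toy TLB: a small LRU cache in front of a page table (a plain dict)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.capacity = capacity
        self.entries = OrderedDict()   # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, effective_addr):
        vpn, offset = divmod(effective_addr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as most recently used
        else:
            self.misses += 1
            frame = self.page_table[vpn]          # slow path: page table lookup
            self.entries[vpn] = frame             # (a KeyError here would be a page fault)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}   # hypothetical mappings
tlb = TinyTLB(page_table)
real = [tlb.translate(a) for a in (0x0010, 0x0020, 0x1004, 0x0030)]
print([hex(r) for r in real], "hits:", tlb.hits, "misses:", tlb.misses)
```

Only the first access to each page pays for the page table lookup; later accesses to the same page are served from the cached entry.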

A CAIA-compliant processor includes a POWER service layer (PSL). The PSL is the bridge to the system for the AFU, and provides address translation and system memory cache.

This interface provides the basis for all communication between the accelerator and the POWER8 system. The PSL provides address translation that is compatible with the Power Architecture® for the accelerator and provides a cache for the data being used by the accelerator. This provides many advantages over a standard I/O model, including shared memory, no pinning [dream4] of data in memory for DMA, lower latency for cached data, and an easier, more natural programming model. Effective addresses from an AFU are translated to a physical address in system memory by the PSL. The PSL also provides miscellaneous management for the [...]

Implemented in FPGA Technology

Provides Address Translation for Accelerator

Compatible with POWER Architecture

Provides Cache for Accelerator

First Implementation – 256KB

Facilities for downloading Accelerator Functions

3.1.3 AFU

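AFU stands for Accelerator Function Unit. The user's custom acceleration logic is implemented here, and it communicates with the application through the interface the PSL provides, the PSL Accelerator Interface.

The AFU contains an AFU descriptor, a set of registers that describe the AFU's capabilities and hold the information system software must know. It provides a mechanism for associating the problem state area with the system process that uses the AFU. (Yxr note: with PCIe as the medium, the AFU may also hold a set of PCIe configuration registers.)

When an application needs to use an AFU, a process element describing the application's process state is added to the process-element linked list; the process element also carries the work element descriptor (WED) supplied by the application.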

AFU Descriptor Overview: The AFU descriptor is a set of registers within the problem state area that contains information about the capabilities of the AFU that is required by system software. The AFU descriptor also contains a standard format for reporting errors to system software. All AFUs must implement an AFU descriptor.

When an application requests use of an AFU, a process element is added to the process-element linked list that describes the application’s process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed or a pointer to other main memory structures in the application’s memory space. Several programming models are described providing for an AFU to be used by any application or for an AFU to be dedicated to a single application. See Section 3 Programming Models on page 25 for details.

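3.2 PSL Accelerator Interface

The PSL accelerator interface sits between the PSL and the AFU. Through it the PSL offers the AFU a range of services, all based on the cache-line size. In CAPI 1.0 the PSL Accelerator Interface consists of five interfaces.

1) Accelerator Command Interface. The AFU issues its service requests here, which are essentially commands to read, write, and manage the cache, with 64-bit addresses. The AFU attaches an 8-bit tag to each request (similar to the tag on a PCIe access). When the PSL responds, it returns the same tag so the AFU knows which request the response belongs to.

2) Accelerator Buffer Interface. This is the data interface: in response to the AFU's requests the PSL moves data to and from the AFU. Data is 512 bits wide in both directions, with a 250 MHz clock.

3) PSL Response Interface. After the PSL completes a data transfer, it returns a status response to the AFU.

4) Accelerator MMIO Interface. MMIO means memory-mapped I/O; through this interface software MMIO accesses reach the AFU's internal registers. It has 24-bit addresses and 64-bit data.

5) Accelerator Control Interface. The PSL's job management uses this interface to control the AFU's state. It is the key interface for starting and stopping the AFU, and the AFU reports its state back to the PSL over it. The PSL also writes the WED to the AFU through this interface to tell it what work to do. When the AFU finishes, it answers the PSL with done.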
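The role of the 8-bit tag on the command and response interfaces can be illustrated with a small Python model. It is a sketch under my own assumptions, not the real PSL behavior: the class ToyPSL, the 128-byte line size, the "DONE" status string and the reverse-order servicing are invented for the example, and the real interfaces are hardware signals rather than method calls.

```python
class ToyPSL:
    """Toy PSL: accepts cache-line commands and answers each with the caller's tag."""

    LINE = 128  # assumed cache-line size in bytes

    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for system memory
        self.pending = []      # commands accepted but not yet answered

    def command(self, tag, op, addr, data=None):
        if not 0 <= tag <= 0xFF:
            raise ValueError("the tag is 8 bits wide")
        if addr % self.LINE:
            raise ValueError("commands operate on whole cache lines")
        self.pending.append((tag, op, addr, data))

    def responses(self):
        # This model deliberately answers in reverse order, to show that the
        # requester must rely on the tag and not on the order of responses.
        while self.pending:
            tag, op, addr, data = self.pending.pop()
            if op == "read":
                yield tag, "DONE", bytes(self.memory[addr:addr + self.LINE])
            else:  # "write"
                self.memory[addr:addr + self.LINE] = data
                yield tag, "DONE", None


memory = bytearray(512)
memory[128:256] = b"\x11" * 128
psl = ToyPSL(memory)

outstanding = {}   # tag -> the request the AFU issued under that tag
requests = [("read", 128, None), ("write", 256, b"\x22" * 128)]
for tag, (op, addr, data) in enumerate(requests):
    psl.command(tag, op, addr, data)
    outstanding[tag] = (op, addr)

completed, read_data = [], {}
for tag, status, data in psl.responses():
    op, addr = outstanding.pop(tag)   # the tag identifies the original request
    completed.append((tag, op, status))
    if data is not None:
        read_data[tag] = data
print(completed)
```

The AFU side keeps a table of outstanding tags and retires an entry when the matching response arrives, which is the same bookkeeping a PCIe requester does with its transaction tags.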

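In CAPI 2.0 the interface names replace "accelerator" with "AFU", which is more precise, and a sixth interface, the AFU DMA Interface, is added. It lets the AFU send native PCIe read and write commands and receive PCIe read completions; in other words the AFU can drive PCIe directly, which gives much more flexibility. The CAPI 2.0 interfaces are therefore: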

1)AFU Command Interface

2)AFU Buffer Interface

3)PSL Response Interface

4)AFU MMIO Interface

5)AFU Control Interface

6)AFU DMA Interface

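Yxr note: I have not checked carefully whether the interfaces that CAPI 2.0 shares with CAPI 1.0 differ in their details.

3.3 How CAPI Works

See OpenPOWER_CAPI_Education_Intro_Latest.ppt for details.

3.3.1 The CAPI Flow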

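On the software side, the application opens the AFU device, initializes the WED (work element descriptor), and has the WED written to the AFU, which tells the AFU what operation to perform. The WED can be a pointer to a linked list of operations. The AFU reads the WED and starts working. Of the PSL accelerator interfaces, these steps mainly use the control interface (CTL) and the MMIO interface.

The AFU decodes the WED and reads memory through the PSL for acceleration: it issues commands on the command interface, the PSL moves the data, and finally the PSL sends the AFU a response reporting the completion status of the transfer.

The AFU takes the data and performs the acceleration; meanwhile software can access the AFU's registers through MMIO. When it has finished, the AFU reports completion toward the application through the control interface.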
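The flow above (the WED handed over at start, the AFU walking a job list in shared memory, done reported at the end) can be sketched as follows. Everything here is hypothetical: the ToyAFU class, the job record layout with src/dst/next fields, the jobs_completed register and the upper-casing "acceleration" are my own stand-ins, and a dict plays the role of the memory that the real AFU reaches through the PSL.

```python
class ToyAFU:
    """Toy AFU: receives a WED at start and walks the job list it points to."""

    def __init__(self, memory):
        self.memory = memory                 # dict: address -> contents (shared memory stand-in)
        self.done = False
        self.mmio = {"jobs_completed": 0}    # hypothetical status register readable by software

    def start(self, wed):
        """Control interface: the WED arrives together with the start command."""
        addr = wed
        while addr is not None:
            job = self.memory[addr]                   # fetch the job record
            source = self.memory[job["src"]]          # fetch the input data
            self.memory[job["dst"]] = source.upper()  # stand-in for the real acceleration
            self.mmio["jobs_completed"] += 1
            addr = job["next"]                        # follow the linked list
        self.done = True                              # reported back as "done"


# Application side: lay out the data and the job list, then pass the list head as the WED.
memory = {
    0x100: "hello",
    0x200: "capi",
    0x1000: {"src": 0x100, "dst": 0x300, "next": 0x1040},
    0x1040: {"src": 0x200, "dst": 0x400, "next": None},
}
afu = ToyAFU(memory)
afu.start(wed=0x1000)
print(memory[0x300], memory[0x400], afu.mmio["jobs_completed"], afu.done)
```

The point of the sketch is that the application only passes one pointer; the AFU dereferences application addresses itself, which is what the shared, translated address space makes possible.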

The detailed flow is shown in the figure below.

[Figure: CAPI flow (image not available)]

Because of CAPI’s peer-to-peer coherent relationship with the POWER8 processors, data-intensive programs are easily offloaded to the FPGA, freeing the POWER8 processor cores to run standard software programs. Any algorithm that you can code into an FPGA is now possible on a POWER8 system using this low overhead mechanism. CAPI’s overall value proposition is that it significantly reduces development time for new algorithm implementations and improves application performance by connecting the processor to hardware accelerators and allowing them to communicate in the same language (eliminating intermediaries such as I/O drivers).

The main application is executed on the host processor with computation-heavy functions executing on the accelerator. The accelerator is a full peer to the host processor, with direct communication with the application. The accelerator uses an unmodified effective address with full access to the real address space. It uses the processor’s page tables directly with page faults handled by system software

3.3.2 CAPI Application Flow

See the PPT for details; it describes this better than I can. I copied a few figures from it.

[Figure: CAPI application flow (image not available)]

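Yxr note: in the two figures above the software function names may differ slightly, but the overall flow is the same. At this point the app has written the WED pointer to the AFU, and the WED points to the head and tail of a linked list.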

[Figure: image not available]

Yxr note: at this stage the app can access the AFU's registers through the MMIO interface and add more commands.

[Figure: CAPI solution flow (image not available)]

When the work is complete, the AFU asserts done and the PSL resets the accelerator. (Accelerator asserts “done” and PSL resets accelerator.)

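3.4 Setting Up the CAPI Simulation Platform

Since I mainly worked on setting up a CAPI 1.0 simulation platform, this part covers simulation on the Questa platform.

Yxr note: I do not know whether the simulation platform changed completely for CAPI 2.0; if it did, this part is of little use.

Reference: DeveloperWorksCAPIDevelopmentKitdemo_document_20141119.doc

3.4.1 Simulation Principle and Model

Simulation here means co-simulation of the AFU's HDL code together with the application and the driver. The method has kept improving. The original approach came in two flavors, HDK (hardware demonstration kit) and SDK (software demonstration kit). HDK puts the whole OS and the POWER CPU into the simulation environment; I lost a lot of time on HDK, and the code I have now is based on the SDK method. The SDK method has evolved as well; the latest version should be the one on GitHub, where the flow is simpler.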

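SDK simulation introduces the PSL simulation engine (PSLSE, Power Service Layer Simulation Engine). It is written in C and shipped as a static library (libcxl.a) that is linked with the user's application code at build time. The AFU is HDL code; it co-simulates with afu_driver.sl through the PLI interface, and afu_driver.sl talks to the PSLSE over a socket. The block diagram further below shows the structure.

As the block diagram shows, the simulation bypasses the PCIe physical link: the AFU code communicates with the application directly through the PSLSE.

The PSLSE can be thought of as simulating the PSL, the OS, and libcxl (for these parts, see the figures in the CAPI application flow section). The user does not need to build a separate Verilog testbench; it is enough to co-simulate the application's C code with the AFU code. For debugging on real hardware, replace the PSLSE with libcxl.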
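To make the socket link between the application side and the simulator-side shim more concrete, here is a minimal Python sketch. The message layout (opcode, tag, address), the opcode values and the use of socket.socketpair are purely my own illustration; the real PSLSE protocol is different and is not reproduced here.

```python
import socket
import struct

# Hypothetical wire format: 1-byte opcode, 1-byte tag, 8-byte address (big-endian)
MSG = struct.Struct(">BBQ")

# One connected pair of sockets stands in for the PSLSE <-> afu_driver link
app_side, shim_side = socket.socketpair()

# Application side (linked against the simulation library) sends a request
app_side.sendall(MSG.pack(0x01, 0x2A, 0x40))

# Shim side (loaded into the HDL simulator) decodes it; in a real setup it
# would now drive the AFU's signals through PLI calls
opcode, tag, addr = MSG.unpack(shim_side.recv(MSG.size))
shim_side.sendall(MSG.pack(0x81, tag, addr))   # reply carries the same tag

reply = MSG.unpack(app_side.recv(MSG.size))
print(reply)

app_side.close()
shim_side.close()
```

Because the two halves only share a socket, the application binary and the HDL simulator can run as separate processes, which is why no Verilog testbench is needed on the application side.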

Preparing the PSL Simulation Engine

Recall that applications initialize communications with their AFU on the FPGA using the API provided by libcxl. The PSL Simulation Engine (PSLSE) is essentially an embodiment of libcxl that redirects the communication set up to the modeled AFU. PSLSE uses a socket to connect to the afu_driver shim (this should be afu_driver.sl, the PLI file) that exists in the modeled AFU.

The demonstration kit provides a makefile in shim_client/sw to illustrate how to compile the code. The result is the PSLSE version of libcxl.

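The figure below is copied from the CAPI user guide; the PSL shim can be understood as afu_driver.sl (the PLI functions [dream5]).

[Figure: PSLSE simulation block diagram (image not available)]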

3.4.2 Simulation Steps

1) Build the libcxl.a library and the afu_driver.sl PLI functions, then build the application against libcxl.a.

Compile the application code and the *.sl file:

a) Generate libcxl.a:

run 'make' in {work directory}/pslse-master/pslse

b) Generate afu_driver.sl in {work directory}/pslse-master/pslse/afu_driver/src/afu_driver.sl:

run 'make' in {work directory}/pslse-master/pslse/afu_driver/src

c) Generate the application:

run 'make' in {work directory}/src/textswap

src/textswap/textswap -C build/test.dat build/test_out.dat

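2) On the HDL side, IBM provides the top-level file Afu_top.v. On one side it connects to the afu_driver.sl PLI functions; on the other it exposes the standard PSL accelerator interfaces (the five interfaces described earlier). The user only has to connect to these interfaces and write the HDL code.

3) Start Questa and compile the HDL code (vlib, vlog, vcom).

4) Then load the design together with the PLI functions and simulate: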

vsim -L work -t ns -novopt -c -pli ./pslse-master/pslse/afu_driver/src/afu_driver.sl +nowarnTSCALE work.top

5) Run the simulation and wait for the application to call in.

6) Run src/textswap/textswap -C build/test.dat build/test_out.dat.

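3.5 Advantages of CAPI

3.5.1 Advantages over PCIe I/O Acceleration

CAPI's first advantage over an ordinary PCIe accelerator card is in the letter C, which stands for coherent and should mean cache coherent. Here the CAPI accelerator card (a PCIe card) acts as a peer of the CPU (or as an auxiliary accelerator), and PCIe merely provides a high-speed physical link (since high-speed buses may all converge on PCIe eventually). In a multi-core system each CPU has its own cache, and a cache coherence protocol keeps the cache contents consistent. CAPI also has its own cache, probably about 256 KB, so its memory access speed has a clear advantage over a PCIe card without a cache.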

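To support this, a CAPP was added on the POWER8 CPU side. It snoops PowerBus commands and reports the state of the cache in the PSL. In short, it is a kind of CCA (Cache Coherency Agent) that controls and updates cache state to guarantee cache coherency.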
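The idea of a proxy answering bus snoops on behalf of the accelerator can be sketched with a toy directory. This is a heavily simplified model under my own assumptions: the two states ("shared", "modified"), the three answers ("miss", "shared", "recall") and the class ToyCAPP are invented for illustration, and the real coherence protocol has more states and transitions.

```python
class ToyCAPP:
    """Toy shadow directory: tracks which lines the accelerator's cache holds."""

    def __init__(self):
        self.directory = {}   # line address -> "shared" or "modified"

    def accelerator_fetch(self, line, exclusive=False):
        """Record that the accelerator's cache now holds this line."""
        self.directory[line] = "modified" if exclusive else "shared"

    def snoop(self, line, wants_write):
        """Answer a bus snoop on behalf of the accelerator."""
        state = self.directory.get(line)
        if state is None:
            return "miss"      # answered from the directory, the card is not consulted
        if state == "shared" and not wants_write:
            return "shared"    # both sides may keep a read-only copy
        del self.directory[line]
        return "recall"        # the accelerator must give up (and write back) its copy


capp = ToyCAPP()
capp.accelerator_fetch(0x1000)                   # read-only copy on the card
capp.accelerator_fetch(0x2000, exclusive=True)   # card intends to modify this line

results = [
    capp.snoop(0x3000, wants_write=False),  # line the card never touched
    capp.snoop(0x1000, wants_write=False),  # CPU read of a shared line
    capp.snoop(0x1000, wants_write=True),   # CPU write to a shared line
    capp.snoop(0x2000, wants_write=False),  # CPU read of a line the card modified
]
print(results)
```

In this model most snoops are resolved from the directory alone, so the link to the card is only involved when the accelerator really holds the line in a conflicting state.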

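Because it is a symmetric peer of the CPU, it has the same address space and cache as the CPU.

So, compared with an ordinary accelerator on a PCIe plug-in card, CAPI's first advantage is an architecture that is a peer of the CPU, with a matching cache structure, which makes it far more efficient than a cacheless architecture (much like the gap between a CPU with a cache and one without; whether being a peer also changes thread scheduling is something I have not looked into yet).

The other advantage of the PSL/AFU structure is that AFU developers do not have to understand PCIe operation and can concentrate on their own problem. (A primary function of the PSL is the physical separation of the AFUs so that they appear to the system as independent units.) This further underlines that PCIe is only a transport medium.

The documentation describes the advantages as follows (Coherent Advantages Over I/O Model) and also gives a numerical comparison: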

1)Virtual Addressing & Data Caching

Shared Memory

Lower latency for highly referenced data

2)Easier, More Natural Programming Model

Traditional thread level programming

3)Enables Applications Not Possible on I/O attached

Pointer chasing, etc…

[Figure: numerical comparison of the I/O model and the coherent model (image not available)]

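Yxr note: I honestly do not quite understand why the I/O overhead is so large, especially the copy or pin source data step. CAPI cannot avoid that either: the data has to be copied to the FPGA locally before it can be processed. The difference is only that the AFU programmer does not need to know about the copy, because it looks like operating directly on memory.
My understanding is that for an I/O device the path through the operating system goes through the I/O driver and the OS before reaching the application layer, whereas a CAPI device reaches the application directly through the OS. That removes the I/O driver overhead, in particular the overhead of I/O mapping and the chain of address mappings.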

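Further thoughts

At its core CAPI gives the CPU a programmable coprocessor for work that can be parallelized. Because the coprocessor is parallel hardware acceleration, it is faster than ordinary software acceleration (though I am not sure it is faster than GPU acceleration, since GPUs are parallel too).

Taking this further, PCIe is only a medium in CAPI. The CAPI architecture does not have to be interconnected over PCIe; QPI or HT (similar to FSB, but point-to-point serial) would probably be more efficient, except that they are not open to users. OpenCAPI, for its part, does provide the BlueLink bus. PCIe also has a weak spot: round-trip latency. Intel's acquisition of Altera makes this kind of acceleration architecture possible, with the FPGA embedded in the CPU (previously it was always the CPU embedded in the FPGA), so high-end users can build their own custom acceleration.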

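3.5.2 Advantages over CPU+GPU

The following is taken from http://www.openhw.org/module/forum/thread-597651-1-1.html. It lists four major advantages, but as I understand it the main point should be that an FPGA is hardware acceleration, which is faster and uses less power than the parallel software acceleration of a GPU.

So what are the advantages of an accelerator card that combines CAPI and FPGA technology? They can be summed up as four major advantages: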

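* Smoother, easier "conversation" between the FPGA and the CPU. CAPI provides cache coherency and peer memory access, meaning the CPU can share its memory space directly with the FPGA. CAPI builds a fast path between the CPU and the FPGA, so that "talking" to each other becomes easy. (Yxr note: POWER9's NVLink 2.0 can connect to NVIDIA GPUs, so which one is more convenient?)

* Very high performance. According to the "China OpenPower Industry Ecosystem Development White Paper" published by CCID Consulting, using a CAPI+FPGA accelerator card for erasure coding, the most performance-critical part of the Hadoop big-data algorithms, improved server performance by 20 to 100 times in testing.

* Flexible and programmable. The FPGA is programmable, so it can be used to upgrade a variety of algorithms in the field. The FPGA acceleration unit can be upgraded for different occasions and reconfigured with different algorithms; the approach is general and applies to acceleration in different scenarios. (GPUs can do this too, can't they?)

* More work for less power. One FPGA accelerator card draws about 20 W to 75 W and is installed inside the server chassis, so it takes no extra machine-room space. For comparison, one CPU draws about 145 W to 190 W and one GPU card about 235 W to 300 W. (Yxr note: is that at equal performance?)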

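3.5.3 Disadvantages

These are mainly in comparison with other acceleration interfaces. It requires the encrypted PSL IP. It requires an FPGA, and the IP core is not open. PCIe round-trip delay is large. FPGA programming is harder than GPU software programming, and the range of applications is narrower than for GPUs. With limited resources, it is also an open question whether very complex algorithms can be implemented in an FPGA at all.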

[dream1] IBM_CAPI_Users_Guide_1-2.pdf

The Coherent Accelerator Process Interface (CAPI) is a general term for the infrastructure of attaching a coherent accelerator to an IBM POWER® system. The main application is executed on the host processor with computation-heavy functions executing on the accelerator. The accelerator is a full peer to the host processor, with direct communication with the application. The accelerator uses an unmodified effective address with full access to the real address space. It uses the processor’s page tables directly with page faults handled by system software. Figure 1-1 shows an overview of CAPI.

[dream2] I basically did not understand this.

[dream3] This means CAPI can only be implemented in an FPGA and only works with POWER CPUs.

[dream4] Pinning: fixing data in place so that it cannot be evicted.

http://www.ehcache.org/generated/2.9.0/html/ehc-all/index.html#page/Ehcache_Documentation_Set/co-life_pinning_data.html

Without pinning, data in the cache or in memory may be evicted.

Pinning Data

Without pinning, expired cache entries can be flushed and eventually evicted, and even most non-expired elements can also be flushed and evicted as well, if resource limitations are reached. Pinning gives per-cache control over whether data can be evicted from Ehcache.

Data that should remain in memory can be pinned. You cannot pin individual entries, only an entire cache. As described in the following topics, there are two types of pinning, depending upon whether the pinning configuration should take precedence over resource constraints or the other way around.

[dream5] Is this right?
