A Brief Look at TensorFlow Runtime Technology

阿新 • Published: 2020-12-25

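# Questions about TF Runtime

## What is TFRT?

**TensorFlow Runtime** (TFRT) provides a unified, extensible infrastructure layer that makes the most of multi-threaded CPU performance and supports fully asynchronous programming (lock-free queues plus asynchronous semantics). TFRT can reduce the time needed to develop, validate, and deploy enterprise-grade models.

![TFRT Overview](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/TFRT_overview.svg)

## What is the input to TFRT?

The input is a TensorFlow GraphDef. TFRT invokes an **MLIR-based graph compiler** that runs graph optimizations and lowers the graph to BEF, the binary executable format used to run TFRT graphs.

![BEF Conversion](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/BEF_conversion.svg)

+ In native TF, the execution flow is: Python Layers → GraphDef (DAG) → execute OpNodes (in parallel on a ThreadPool)
+ With the runtime, the flow is: Python Layers → GraphDef (DAG) → Compile IR → Binary (BEF) → execute (`BEFExecutor`)

Basic concepts:

+ `Host Program in MLIR` is a low-level intermediate representation of the graph
+ `BEF` is the executable file format for `BEFExecutor`, which reads a `BEF` file and asynchronously executes the functions in it
+ `tfrt_translate` converts between the two, much like an assembler

## What is the IR here?

It can be understood as code that expresses **topological relationships**, or even as a graph. Walking a graph in topological order readily yields a piece of IR code, which is why BEF supports converting between IR and graph in both directions. For example:

``` cpp
%1 = hex.constant.i32 1
%2 = hex.constant.i32 2
%3 = hex.add.i32 %1, %2
hex.print.i32 %3
// This can equally be represented as a DAG.
```

## How does it differ from XLA?

XLA does not fundamentally leave the **graph execution** framework. It uses graph clustering to send some **subgraphs** through an HLO conversion and JIT execution, wraps each such subgraph in an `XlaRunOp`, and then runs it together with the other nodes of the graph. In effect it replaces several nodes with one larger, faster node (which looks somewhat like fusion).

The official documentation describes `BEF` as the concrete carrier of the kernel graph. It is still a graph: the entity that the BEF executor finally runs is a graph, just not a GraphDef in the native TF sense.

## What is the basic execution unit in TFRT, and what is the execution flow?

TFRT kernels come in two kinds:

+ Synchronous kernel
    + Runs entirely on the thread that calls it and involves no computation on other threads. The `AsyncValue`s it produces are all available.

    ```cpp
    int32_t TFRTAddI32(Argument<int32_t> arg0, Argument<int32_t> arg1) {
      // The thread that calls TFRTAddI32 performs this addition, and produces
      // an available AsyncValue.
      return *arg0 + *arg1;
    }
    ```

+ Asynchronous kernel
    + Consists of two parts: (1) synchronous computation on the calling thread and (2) asynchronous computation on other threads. The `AsyncValue`s it produces are unavailable (though not always).

    ```cpp
    void TFRTAddI32Async(Argument<int32_t> arg0, Argument<int32_t> arg1,
                         Result<int32_t> output, HostContext* host) {
      // Synchronously allocate an unavailable AsyncValue for 'output'.
      auto result = output.Allocate();

      // Asynchronously make 'output' available.
      host->EnqueueWork([arg0 = *arg0, arg1 = *arg1,
                         result_ref = FormRef(result)] {
        // A ConcurrentWorkQueue thread performs this addition.
        result_ref->emplace(arg0 + arg1);
      });

      // Synchronously returns unavailable 'output'.
    }
    ```

**Execution flow:**

+ Create an AsyncKernelFrame that holds the input arguments and the output results
+ Pass the frame to the kernel for execution
+ All `AsyncValue`s are tracked through registers

TFRT also provides an eager (op-by-op) API: CoreRuntime and CoreRuntimeOp.

+ CoreRuntime:
    + Executes OpHandlers; it is implemented through an internal `Impl` class
    + It can call `MakeOp(op_name, op_handler)` to create a `CoreRuntimeOp` that is executed directly
+ CoreRuntimeOp:
    + Holds a function object `fn_` of an `llvm::unique_function` type
    + Acts as a functor: calling it runs `fn_`

## How are hardware devices integrated?

Through the DeviceRuntime. BEF only has to support ops at the level of the lowest driver API, which largely avoids having every backend reimplement each TF op separately. In the figure below, the ops map directly onto CUDA API calls:

![img](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/v2-8783f0350fa0797b365421766a02992d_720w.jpg)

# Design of the Host Runtime

## Where does the Host Runtime sit?

![TFRT Architecture](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/tfrt-arch.svg)

A host is the machine that performs the computation; it may or may not have hardware acceleration resources. A host can be a server with multiple GPUs, or a mobile device with a DSP and an IPU.

In native TF, TF Core executes op-by-op following the data flow, and the design carries many traces of sequential, synchronous execution. **The Host Runtime instead reorganizes the computation logic and then drives the Device Runtime (GPU, TPU, and so on) to accelerate the computation**, so that a kernel can run asynchronously on its own thread and take full advantage of multi-threaded parallelism.

![Host runtime](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/host-runtime.svg)

## Why do this?

+ To execute ops eagerly and efficiently
    + TF already optimizes graph execution very well, since everything runs on the C++ side. In eager mode, however, there is still unnecessary overhead between Python and the runtime.
+ To unify the multi-threading mechanisms used at the graph level and at the op level
    + **Asynchrony is a first-class citizen in the runtime**
    + a non-strict kernel/function may execute before all its inputs are ready.
+ To make cross-kernel optimization easier
    + TF op kernels encapsulate logic such as Tensor memory allocation, which limits cross-kernel optimizations like buffer reuse. TFRT kernels decouple shape computation from tensor memory allocation.
+ To provide a modular, pluggable mechanism for supporting new hardware
    + The goal is to remove the old pain point of having to hack the whole code base to bring up new hardware, and to build a modular mechanism for which complete integration documentation can simply be handed to hardware teams.

## How is it designed to meet these goals?

First, some background: Core Runtime, Graph Lowering, and Eager Execution.

1. Core Runtime

   Used to eagerly execute a single op or an entire graph function, including GraphDef and HLO. An op graph is usually device independent.

2. Graph Lowering

   Compiler passes turn an op graph into a kernel graph, a lower-level representation of the dataflow computation that is designed for **faster execution**. It is therefore not suitable for compiler analysis, but it can be expressed in a low-level dialect (such as MLIR). A kernel graph targets a specific device (it is platform bound).

3. Eager Execution

   The Host Runtime supports eager execution, which does not necessarily involve building a Graph/BEF or using BEFExecutor. Two paths are designed:

    + Generic path: treats an op as a graph function. This handles composite ops well and reuses the whole graph-function code path.
    + Fast path: uses hand-written C++ or precompiled graph snippets to select and invoke the op kernel (custom optimization? isn't the cost high?)

## What does "kernel" mean in a kernel graph?

TFRT also has the notion of a kernel. Its inputs and outputs are all [`AsyncValue`](https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md#asyncvalue)s, which is how "**asynchrony is a first-class citizen**" is put into practice. An `AsyncValue` resembles a combination of [future](https://en.cppreference.com/w/cpp/thread/future) and [promise](https://en.cppreference.com/w/cpp/thread/promise) from the C++ standard library. All data in the graph is replaced by `AsyncValue`s.

Execution flow:

+ Create an AsyncKernelFrame that holds the input arguments and the output results
+ Pass the frame to the kernel for execution
+ All `AsyncValue`s are tracked through registers

```cpp
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel's arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel's 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel's 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);

  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();

  // Set the kernel's 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
```

> TODO: the [memory allocation mechanism](https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md#memory-allocation) in kernels

Kernels come in two kinds:

+ Synchronous kernel
    + Runs entirely on the thread that calls it and involves no computation on any other thread. The `AsyncValue`s it produces are all available.

    ```cpp
    int32_t TFRTAddI32(Argument<int32_t> arg0, Argument<int32_t> arg1) {
      // The thread that calls TFRTAddI32 performs this addition, and produces
      // an available AsyncValue.
      return *arg0 + *arg1;
    }
    ```

+ Asynchronous kernel
    + Consists of two parts: (1) synchronous work on the calling thread and (2) asynchronous work on other threads. The `AsyncValue`s it produces are unavailable (though not always).

    ```cpp
    void TFRTAddI32Async(Argument<int32_t> arg0, Argument<int32_t> arg1,
                         Result<int32_t> output, HostContext* host) {
      // Synchronously allocate an unavailable AsyncValue for 'output'.
      auto result = output.Allocate();

      // Asynchronously make 'output' available.
      host->EnqueueWork([arg0 = *arg0, arg1 = *arg1,
                         result_ref = FormRef(result)] {
        // A ConcurrentWorkQueue thread performs this addition.
        result_ref->emplace(arg0 + arg1);
      });

      // Synchronously returns unavailable 'output'.
    }
    ```

Kernels have two execution modes:

+ Strict mode:
    + When such a kernel is called, all of its `AsyncValue` inputs are already available.
+ [Non-strict mode](https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md#non-strict-control-flow):
    + The kernel runs as soon as at least one input argument is available. A ternary operation is an example: it only forwards one of its inputs.

    ```cpp
    result = ternary(condition, true_result, false_result)  // only `condition` needs to be available
    ```

    + Kernels of this kind are harder to implement.

## What is `AsyncValue` used for?

As mentioned above, kernel inputs and outputs are all [`AsyncValue`](https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md#asyncvalue)s, and all data in the graph is replaced by `AsyncValue`s as well.

```cpp
// A subset of interface functions in AsyncValue.
class AsyncValue {
 public:
  // Is the data available?
  bool IsAvailable() const;

  // Get the payload data as type T.
  // Assumes the data is already available, so get() never blocks.
  template <typename T>
  const T& get() const;

  // Store the payload data in-place.
  template <typename T, typename... Args>
  void emplace(Args&&... args);

  // Add a waiter callback that will run when the value becomes available.
  void AndThen(std::function<void()>&& waiter);
  // ...
};
```

**`AsyncValue` has three derived classes:**

+ `ConcreteAsyncValue`: represents and stores concrete data
+ `ErrorAsyncValue`: handles error propagation and cancellation. BEFExecutor monitors the value returned by every kernel; if a result is of this type, all downstream ops that depend on it are skipped.
+ `IndirectAsyncValue`: in some cases the data type of a result is not yet known. To keep execution non-blocking, an `IndirectAsyncValue` is created first so that non-strict kernels can still run. It holds no data itself, only a pointer to another `AsyncValue`.

**Lifetime** is managed by reference counting:

+ A kernel first creates `AsyncValue`s for its results (once the data type is known)
+ Ownership of an `AsyncValue` is transferred from the kernel to BEFExecutor
+ BEFExecutor passes the `AsyncValue` to every downstream op that uses it and increments the reference count
+ Each downstream op kernel decrements the reference count of the `AsyncValue` when it finishes computing

## What exactly does the `Register` that manages `AsyncValue`s do?

A `Register` is really a pointer to an `AsyncValue`. It only manipulates pointers, so no data is moved or copied.

**An example:**

```cpp
available_value = upstream()
downstream(available_value, unavailable_value)
```

`downstream` runs only after both arguments are ready. Once `unavailable_value` also becomes available, the executor loads the data from the `register` and passes it to `downstream` for execution.

**A `register` has three states:**

+ **Empty**: the initial state; it points to no `AsyncValue`
+ **Unavailable**: used only by asynchronous kernels. Synchronous kernels never produce this state.
+ **Available**: the final state, which is irreversible.

![Register states](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/reg-states.svg)

## How does the runtime achieve asynchronous speedup?

In TFRT, the thread that executes a kernel and the thread that schedules other ready kernels may be the same thread. TFRT puts background kernel scheduling tasks into a `ConcurrentWorkQueue` to run them asynchronously.

**But backward ops need gradients before they can run. How are backward ops and blocking I/O handled?**

TFRT uses two separate thread pools:

(1) A dedicated thread pool for long-running non-blocking tasks

+ A fixed number of threads, one per hardware thread, which avoids the overhead of contention for thread resources.

(2) A separate thread pool for blocking tasks (such as I/O)

+ More threads are allocated to handle I/O tasks
+ To avoid deadlock, blocking tasks may run only in the blocking thread pool
+ Kernel implementations must not contain blocking operations directly (for example?), and must not put part of a blocking operation into the non-blocking queue.

## Graph execution

During graph execution, the host program converts the graph into a kernel graph expressed in MLIR. Compiler passes are applied here to turn the device-independent graph into a kernel graph for a specific hardware platform.

```cpp
func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1       // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2       // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two // Make AsyncValue with value 3 (1+2)

  tfrt.print.i32 %three            // Print AsyncValue %three
  tfrt.return %three : i32         // Return AsyncValue %three
}
```

**The runtime does not execute the IR directly.** It first converts it to `BEF` with `mlir_to_bef` and then executes that. The state of every `AsyncValue` is tracked and recorded through registers.

### How are control dependencies handled?

Native TF uses `tf.control_dependencies` to add a dependency between two kernels that must run in order. TFRT uses a [`Chain`](https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md) instead. A `chain` is also an `AsyncValue`; it can be a kernel argument or a kernel result. A chain forces the consumer to run after the producer, which gives the required ordering.

```cpp
func @control_dep1() {
  %a = dht.create_uninit_tensor.i32.2 [2 : i32, 2 : i32]
  %chain1 = dht.fill_tensor.i32 %a, 41
  %chain2 = dht.print_tensor.i32 %a, %chain1
}
```

### How is control flow such as `if` handled?

TFRT allows a kernel to invoke `BEFExecutor` (which is somewhat similar to how Paddle currently handles control flow).

```cpp
void TFRTIf(AsyncKernelFrame* frame) {
  const auto* true_fn = &frame->GetConstantAt<Function>(0);
  const auto* false_fn = &frame->GetConstantAt<Function>(1);

  // First arg is the condition.
  ArrayRef<AsyncValue*> args = frame->GetArguments();
  AsyncValue* condition = args[0];

  // Execute true_fn or false_fn depending on 'condition'.
  auto* fn = condition->get<bool>() ? true_fn : false_fn;
  fn->Execute(args.drop_front(),
              frame->GetResults(),
              frame->GetHostContext());
}
```

### How does it relate to the underlying session?

There appears to be no direct relationship. (To be investigated further.)

## What information does a BEF file contain?

[BEF](https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md) is the bridge between the runtime and the compiler. It also decouples the compiler from the runtime, so compiler optimizations can be applied independently. It can be saved to disk and reloaded for execution (mmap bytes). It feels much like an ordinary binary file, since it also has the notion of multiple sections.

BEF contains some device-related information: which kind of device (CPU/GPU/TPU) each kernel runs on, and which special kernels will be called.

**MLIR and BEF can be converted into each other:**

![MLIR <-> BEF](https://cdn.jsdelivr.net/gh/Aurelius84/FigureBed/img/mlir-bef.svg)

## What does BEFExecutor do? Does it bring any particular performance benefit?

**It is an executor, not an interpreter**, because it has no notion of a **program counter**. The performance benefits come from:

+ It is lock-free
+ Non-blocking execution:
    + It keeps going whether or not a value is available. For an unavailable value, the executor defers the dependent work through `AsyncValue::AndThen`.
    + Because every `AsyncValue` is tracked by a `Register`, all related kernels are notified and woken up as soon as it becomes ready.

# Open questions

## The published TFRT documentation says little about training and backward ops. Are they supported?

The official [mnist_training.md](https://github.com/tensorflow/runtime/blob/master/documents/mnist_training.md) mentions TFRT support for training, but only as a prototype demonstration, not a final version.

+ All ops in the MNIST model were rewritten separately, such as matmul, relu, elem_add, argmax, and reduce_mean
+ Only the relu_grad kernel is rewritten here. Do the backward kernels of the other ops default to the ones from the TensorFlow framework?

# References

1. [Official documentation: TFRT Host Runtime Design](https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_des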
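
As a supplement to the `AsyncValue` and `AndThen` discussion earlier in this post, here is a minimal sketch of how deferring work until a value becomes available might look. Everything in it is an assumption made for illustration: `ToyAsyncValue`, `ToyAdd`, and `RunDemo` are hypothetical names, the payload is fixed to `int`, and a mutex stands in for TFRT's lock-free implementation. It is not the real `tfrt::AsyncValue`.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical toy type, NOT tfrt::AsyncValue: a single-assignment int slot
// with completion callbacks. A mutex is used here for simplicity.
class ToyAsyncValue {
 public:
  bool IsAvailable() const {
    std::lock_guard<std::mutex> lock(mu_);
    return available_;
  }

  // Precondition: the value is already available, so get() never waits.
  int get() const {
    std::lock_guard<std::mutex> lock(mu_);
    assert(available_);
    return value_;
  }

  // Stores the payload once, then runs every waiter registered so far.
  void emplace(int v) {
    std::vector<std::function<void()>> ready;
    {
      std::lock_guard<std::mutex> lock(mu_);
      assert(!available_);
      value_ = v;
      available_ = true;
      ready.swap(waiters_);
    }
    // Waiters run outside the lock so they may call back into this object.
    for (auto& waiter : ready) waiter();
  }

  // Runs `waiter` right away if the value is available, otherwise defers it.
  void AndThen(std::function<void()> waiter) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (!available_) {
        waiters_.push_back(std::move(waiter));
        return;
      }
    }
    waiter();
  }

 private:
  mutable std::mutex mu_;
  bool available_ = false;
  int value_ = 0;
  std::vector<std::function<void()>> waiters_;
};

// Hypothetical strict "add": returns immediately and fills `out` only after
// both inputs have become available.
void ToyAdd(ToyAsyncValue& a, ToyAsyncValue& b, ToyAsyncValue& out) {
  a.AndThen([&a, &b, &out] {
    b.AndThen([&a, &b, &out] { out.emplace(a.get() + b.get()); });
  });
}

// Drives the sketch: `a` is ready up front, `b` arrives later.
int RunDemo() {
  ToyAsyncValue a, b, sum;
  a.emplace(1);       // available immediately, like a synchronous kernel result
  ToyAdd(a, b, sum);  // does not wait; `sum` is still unavailable here
  const bool was_pending = !sum.IsAvailable();
  b.emplace(2);       // stands in for an asynchronous producer finishing later
  return (was_pending && sum.IsAvailable()) ? sum.get() : -1;
}
```

Calling `RunDemo()` from a `main` function exercises the whole path: the caller of `ToyAdd` is never blocked, and the addition runs inside `b.emplace(2)`, which is the moment the last input becomes available.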
