TensorFlow中的並行執行引擎——StreamExecutor框架

阿新 • • 發佈：2020-10-15

TensorFlow中的並行執行引擎——StreamExecutor框架

背景

[作者：DeepLearningStack，阿里巴巴演算法工程師，開源TensorFlow Contributor]

歡迎大家關注我的公眾號，“網際網路西門二少”，我將繼續輸出我的技術乾貨~

在前一篇文章中，我們梳理了TensorFlow中各種異構Device的新增和註冊機制，通過使用預先定義好的巨集，各種自定義好的Device能夠將自己註冊到全域性表中。TensorFlow期望通過這種模式，能夠讓Device的新增和註冊於系統本身更好的解耦，從而體現了較好的模組化特性。在這篇文章中，我們選擇直接去窺探TensorFlow底層架構較為複雜的一個部分——StreamExecutor框架。我們已經知道TensorFlow是一個異構的並行執行框架，對於異構Device的管理是一件非常複雜的事，不僅包括Device的新增、註冊、刪除、屬性的管理，還必須要對Device的並行執行過程做進一步抽象形成統一的框架，才能實現更好的解耦。通過閱讀這部分原始碼不但可以對執行引擎的管理有很深的理解，還可以體驗學習到各種設計模式。如果想要對TensorFlow底層甚至是XLA做一些效能上的深度優化，那麼這一部分則是必須要了解的內容。

Stream

Stream存在於計算機相關的各種技術中，比如在作業系統、流式計算、計算機網路傳輸或是CUDA程式設計中都有涉及。Stream從抽象角度來看其本質是定義了一個操作序列，處於同一個Stream的操作必須按順序執行，不同Stream之間的並無順序關係。在TensorFlow中存在一些高效能的並行程式設計裝置，所以需要有一套抽象框架對這些裝置的執行過程管理起來，這就是StreamExecutor的用武之地了。

StreamExecutor簡介

其實StreamExecutor本身就是一個在Google內部為並行程式設計模型開發的單獨的庫，感興趣的可以直接參考GitHub。在TensorFlow中的StreamExecutor是一個開源StreamExecutor的簡版，並且並不是以第三方庫的形式出現，而是在原始碼中單獨放了一個stream_executor的資料夾，裡面的程式碼非常的精簡

，目錄結構部分截圖如下圖所示。

StreamExecutor為TensorFlow的執行層面提供了較為統一的抽象，而在底層各種Device的執行管理細節卻完全不同。我們可以看到stream_executor下面有cuda和host兩個子目錄，他們分別是GPU執行引擎和CPU執行引擎所使用的子模組。下面我們先從統一的抽象層面來梳理該框架的結構。

StreamExecutor對外提供的控制代碼——Stream物件

為了隱藏StreamExecutor框架管理的複雜性，它對外暴露的handler必須足夠簡單。事實也確實如此，StreamExecutor通過暴露Stream物件作為操作底層的handler。一般而言，在TensorFlow的框架中都是使用Stream物件來呼叫底層計算庫，進行裝置間資料拷貝操作等過程

。比如呼叫Stream物件的ThenMemcpy即可完成非同步的資料傳輸拷貝過程，呼叫ThenConvolveXXX等函式即可完成DNN庫中的卷積呼叫。事實上，TensorFlow中很多Op的C++實現中，其Compute函式內就是通過使用Stream物件來完成某些實際計算或資料拷貝的過程，下圖展示了Stream物件、StreamExecutor框架以及其他模組的關係。

Stream物件是通過持有StreamInterface的具體實現物件來獲得實際平臺的Stream，進而通過Stream這個統一的handler完成與底層的互動，下面試這一子模組的類圖結構。

StreamExecutor框架內的層次結構

熟悉GPU程式設計的同學都知道，CUDA程式的編寫是相對複雜的，不但要針對某種任務設計特定的並行程式設計思路，還要管理Event，Stream等較為底層的物件。為了能夠減輕StreamExecutor使用者的使用負擔，也為了能夠給上層呼叫者即TensorFlow引擎提供更加統一的介面，一些抽象分層的工作是非常有必要的。總體上StreamExecutor框架由三個層次組成，從上到下依次為Platform層（平臺描述）、StreamExecutor Core層（執行引擎）和LibrarySupport層（基礎庫）。如果需要為TensorFlow新增新的計算裝置種類，不但要向TensorFlow中註冊Device的定義，還需要在StreamExecutor框架中提供負責管理該Device計算的程式碼。

Platform層

在StreamExecutor中Platform指的是計算所使用裝置平臺的抽象，每種Device對應一種Platform。比如GPU對應的是CudaPlatform，而CPU對應的是HostPlatform等。一旦獲得了某種Device的Platform，就可以獲取和該Platform對應的StreamExecutor Core以及相應的LibrarySupport。在TensorFlow的程式碼實現中，所有Platform類都是通過巨集定義和MultiPlatformManager管理類的靜態方法主動註冊到系統中的，下面是這一層次的類圖表示。

CudaPlatform和HostPlatform繼承自公共父類Platform，如果有新的Platform出現，依然可以沿用這樣的設計直接繼承並給出實現。所有的Platform都通過MultiPlaftormManager呼叫RegsiterPlatform函式主動註冊到系統中並做初始化，下面程式碼段是CudaPlaftorm的註冊過程，註冊使用了Initializer模組及相應的巨集定義，這些程式碼比較簡單，這裡就不再詳細展開了。

 1 static void InitializeCudaPlatform() {
 2   // Disabling leak checking, MultiPlatformManager does not destroy its
 3   // registered platforms.
 4 
 5   std::unique_ptr<cuda::CudaPlatform> platform(new cuda::CudaPlatform);
 6   SE_CHECK_OK(MultiPlatformManager::RegisterPlatform(std::move(platform)));
 7 }
 8 
 9 }  // namespace stream_executor
10 
11 REGISTER_MODULE_INITIALIZER(cuda_platform,
12                             stream_executor::InitializeCudaPlatform());
13 
14 // Note that module initialization sequencing is not supported in the
15 // open-source project, so this will be a no-op there.
16 REGISTER_MODULE_INITIALIZER_SEQUENCE(cuda_platform, multi_platform_manager);
17 REGISTER_MODULE_INITIALIZER_SEQUENCE(multi_platform_manager_listener,
18                                      cuda_platform);

MultiPlatformManager提供了兩種獲取具體Platform的方式，一種是通過name，另一種是通過Id，如下程式碼段所示。

 1   // Retrieves the platform registered with the given platform name (e.g.
 2   // "CUDA", "OpenCL", ...) or id (an opaque, comparable value provided by the
 3   // Platform's Id() method).
 4   //
 5   // If the platform has not already been initialized, it will be initialized
 6   // with a default set of parameters.
 7   //
 8   // If the requested platform is not registered, an error status is returned.
 9   // Ownership of the platform is NOT transferred to the caller --
10   // the MultiPlatformManager owns the platforms in a singleton-like fashion.
11   static port::StatusOr<Platform*> PlatformWithName(absl::string_view target);
12   static port::StatusOr<Platform*> PlatformWithId(const Platform::Id& id);

StreamExecutor Core層

從原始碼上看這一層非常複雜，因為它涉及到的類最多，但是當我們把Platform層和Library層分開看待後，這一層就變得非常簡單了。對於外部使用者來說，獲取Platform就是為了獲取對應的執行引擎。對於TensorFlow這種存在多種Platform和執行引擎的異構框架來說，必須為每一種執行引擎提供完整的實現，這具有一定的複雜度。為了讓程式碼結構更有層次感，也為了向Platform層隱藏底層的設計複雜度，該層選擇只向上層暴露StreamExecutor類，而涉及到具體實現的StreamExecutorInterface以及各種具體的實現將由StreamExecutor類統一控制，這種代理的方式讓這一層的架構更加乾淨，下面是涉及到這一層的類圖。

CudaExecutor和HostExecutor繼承自StreamExecutorInterface後，由StreamExecutor持有，並暴露給上一層Platform使用。同各種Platform類似，每個具體的StreamExecutor也需要註冊到系統中，但他們卻沒有依賴於任何控制類，直接通過巨集定義將自己註冊到全域性工廠中，註冊過程也是藉助Initializer模組實現的。下面的程式碼段展示了CudaExecutor的註冊過程。

 1 void initialize_cuda_gpu_executor() {
 2   *internal::MakeCUDAExecutorImplementation() = [](const PluginConfig &config) {
 3     return new cuda::CUDAExecutor{config};
 4   };
 5 }
 6 
 7 }  // namespace stream_executor
 8 
 9 REGISTER_MODULE_INITIALIZER(cuda_gpu_executor, {
10   stream_executor::initialize_cuda_gpu_executor();
11 });

initialize_cuda_gpu_executor函式中定義了一個建立CUDAExecutor的匿名函式，而MakeCUDAExecutorImplementation函式實際上建立了一個全域性的table，中間的等號賦值操作實際上就是把該匿名函式放到了全域性instance中，這實際上就是一種簡單的工廠模式，在StreamExecutor中存在多種類似的工廠，下面程式碼段展示了這些工廠的本質。

1 using StreamExecutorFactory =
2     std::function<StreamExecutorInterface *(const PluginConfig &)>;
3 using EventFactory = std::function<EventInterface *(StreamExecutor *)>;
4 using StreamFactory = std::function<StreamInterface *(StreamExecutor *)>;
5 using TimerFactory = std::function<TimerInterface *(StreamExecutor *)>;
6 using KernelFactory = std::function<KernelInterface*()>;
7 
8 StreamExecutorFactory* MakeCUDAExecutorImplementation();

StreamExecutor框架使用Cache機制避免為同一種StreamExecutor Core被重複建立，這個Cache就是ExecutorCache，下面程式碼展示了Platform從Cache獲取StreamExecutor Core的內容，當Cache中不存在所需要的StreamExecutor時，會建立新的物件並放入cache中，並以config作為key。

說白了就是個Map嘛。。。

1 port::StatusOr<StreamExecutor*> CudaPlatform::GetExecutor(
2     const StreamExecutorConfig& config) {
3   return executor_cache_.GetOrCreate(
4       config, [&]() { return GetUncachedExecutor(config); });
5 }

Library層

這一層提供的是各種底層加速庫的接入，當前該層主要負責接入Dnn，Blas，Rng和Fft模組，每個模組和對應的類說明如下表所示。

子模組名稱	功能說明
DNNSupport	DNN計算模組，主要包含DNN計算的基本操作。在GPU實現中，它將作為CuDNN的封裝
RngSupport	隨機數生成模組
BlasSupport	基礎線性代數庫模組，主要包含矩陣系列的計算，在CPU實現中它可以是Eigen，mkl等；在GPU實現中，它將作為CuBLAS的封裝
FFTSupport	FFT系列運算模組

因為這些基礎庫同StreamExecutor類似，都具有平臺屬性，例如在CUDAHostPlatform中使用的Blas庫應為CuBLAS，而HostPlatform中對應的可能是OpenBlas，MKL等。雖然StreamExecutorInterface創建出來的各種Library指標均由StreamExecutor持有，但是他們卻由StreamExecutorInterface的實現類負責建立，所以從邏輯上看他們處於StreamExecutor Core的下一層，下圖展示了Library層的類圖。

Library層將這些基礎庫統一作為外掛（Plugin）來管理，用以應對未來出現的各種各樣的基礎庫。他們通過PluginRegister模組註冊。和StreamExecutor Core中的管理方式相同，依然要先建立外掛的Factory，Factory的建立也通過巨集實現。以CudnnSupport為例，通過向通用初始化模組Intializer傳入initialize_cudnn函式並呼叫，將建立CudnnSupport的函式作為DnnFactory放到PluginRegister模組中，至此完成了DnnFactory的建立。使用時，只需要拿到PluginRegister的key（即要求拿到何種外掛）即可取出對應的LibrarySupport。下面展示了CudnnSupport的工廠註冊程式碼。

 1 void initialize_cudnn() {
 2   port::Status status =
 3       PluginRegistry::Instance()->RegisterFactory<PluginRegistry::DnnFactory>(
 4           cuda::kCudaPlatformId, cuda::kCuDnnPlugin, "cuDNN",
 5           [](internal::StreamExecutorInterface* parent) -> dnn::DnnSupport* {
 6             cuda::CUDAExecutor* cuda_executor =
 7                 dynamic_cast<cuda::CUDAExecutor*>(parent);
 8             if (cuda_executor == nullptr) {
 9               LOG(ERROR) << "Attempting to initialize an instance of the cuDNN "
10                          << "support library with a non-CUDA StreamExecutor";
11               return nullptr;
12             }
13 
14             cuda::CudnnSupport* dnn = new cuda::CudnnSupport(cuda_executor);
15             if (!dnn->Init().ok()) {
16               // Note: Init() will log a more specific error.
17               delete dnn;
18               return nullptr;
19             }
20             return dnn;
21           });
22 
23   if (!status.ok()) {
24     LOG(ERROR) << "Unable to register cuDNN factory: "
25                << status.error_message();
26   }
27 
28   PluginRegistry::Instance()->SetDefaultFactory(
29       cuda::kCudaPlatformId, PluginKind::kDnn, cuda::kCuDnnPlugin);
30 }
31 
32 }  // namespace stream_executor
33 
34 REGISTER_MODULE_INITIALIZER(register_cudnn,
35                             { stream_executor::initialize_cudnn(); });

再看總體類圖

在StreamExecutor框架中還存在其他模組，比如XLA的支援，比如Event的管理，在逐個梳理StreamExecutor框架的三個層次後再看其餘部分就非常清晰明瞭了，下面的兩張圖展示了整體類圖和一些繼承結構。

StreamExecutor的呼叫棧

在完整的理解了StreamExecutor框架的內部結構和外部控制代碼後，我們就可以非常清晰地trace其呼叫棧了。最後，我們以呼叫Cudnn中的FusedConvolveWIthAlgorithm為例，畫出完整的呼叫時序圖。FusedConvolveWIthAlgorithm是將Convolution計算，Bias計算以及Activation計算fuse在一起的優化版本CUDA kernel，它的效率相對於分開呼叫相比更高。

總結

StreamExecutor是一個相對獨立的專案，在TensorFlow中所使用的StreamExecutor是精簡之後的版本。正是因為異構框架管理每種Device的並行執行過程非常繁雜，所以需要StreamExecutor向上層呼叫者隱藏底層的複雜性。在架構設計上，StreamExecutor選擇向上層暴露簡單的Stream物件handler實現了這一封裝。事實上，TensorFlow中所有需要呼叫與Device相關的第三方高效能運算庫的Op都使用Stream這一handler輕鬆完成Op的編寫。從StreamExecutor框架內部看，可以分為Platform層、StreamExecutor Core層和LibrarySupport層，每層的核心元件都通過Initializer模組和巨集定義主動註冊到系統Factory中，從上層Op對Stream的呼叫棧中也可以清晰地感受到這層次分明的架構設計。掌握並理解StreamExecutor的呼叫棧是非常重要的，因為無論是為TensorFlow底層做XLA優化還是為某些Op提供Int8計算支援，都需要改寫這一部分。將來我們在梳理XLA整體框架時還會回過頭來窺探StreamExecutor框架中的其他部分。

TensorFlow中的並行執行引擎——StreamExecutor框架

TensorFlow中的並行執行引擎——StreamExecutor框架

背景

Stream

StreamExecutor簡介

StreamExecutor對外提供的控制代碼——Stream物件

StreamExecutor框架內的層次結構

Platform層

StreamExecutor Core層

Library層

再看總體類圖

StreamExecutor的呼叫棧

總結

TensorFlow中的並行執行引擎——StreamExecutor框架

詳解SQL優化必備：並行執行框架和執行計劃

總結：iOS中多執行緒的經典崩潰

關於SpringBoot中Redis執行緒池的有關探討

簡單瞭解C語言中主執行緒退出對子執行緒的影響

對tensorflow中的strides引數使用詳解

python中的執行緒threading.Thread()使用詳解

python爬蟲中多執行緒的使用詳解

python程式中的執行緒操作 concurrent模組使用詳解

淺談tensorflow中張量的提取值和賦值

tensorflow中tf.slice和tf.gather切片函式的使用

在tensorflow中實現去除不足一個batch的資料

淺談tensorflow中Dataset圖片的批量讀取及維度的操作詳解

在tensorflow中設定儲存checkpoint的最大數量例項

在Tensorflow中實現梯度下降法更新引數值

在Tensorflow中檢視權重的實現

PyQt5中多執行緒模組QThread使用方法的實現

關於Tensorflow分散式並行策略

基於tensorflow指定GPU執行及GPU資源分配的幾種方式小結

在tensorflow中實現遮蔽輸出的log資訊

TensorFlow中的並行執行引擎——StreamExecutor框架

TensorFlow中的並行執行引擎——StreamExecutor框架

背景

Stream

StreamExecutor簡介

StreamExecutor對外提供的控制代碼——Stream物件

StreamExecutor框架內的層次結構

Platform層

StreamExecutor Core層

Library層

再看總體類圖

StreamExecutor的呼叫棧

總結

相關推薦