Compiling TensorFlow for C++ and Calling a Graph Model

阿新 • Published: 2018-12-12

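Introduction

I have recently been working on connecting offline TensorFlow training in Python scripts to online serving: export a .pb graph file with the freeze_graph tool, then have C++ code in the production environment call the pre-trained model directly to run predictions, with no need for a hand-written inference function. Because TensorFlow currently offers relatively few C++ APIs, I consulted several existing posts, and I record here the many pitfalls I hit along the way. The demo is a simple ANN model that classifies the Iris dataset.

The workflow, once sorted out, is as follows: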

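1. In a Python script, define your model; after training, export the TensorFlow graph definition as a protobuf binary or text file (a file that holds only the tensor definitions, without the weight parameters).

2. During training, the Python script saves the model parameter files *.ckpt.
3. Run the freeze_graph.py tool that ships with TensorFlow. Its inputs are the protobuf file (*.pb or *.pbtxt) and the *.ckpt parameter file; its output is a new *.pb file that contains both the graph definition and the parameters. This step converts the parameters in the checkpoint .ckpt file into constant (const) operators and binds them to the tensor definitions exported earlier.
4. In C++, create a Session, load just the single bound model file .pb, and run predictions; Session->Run() returns the values of the output tensors.
5. Compile and run. There are two options here:

a) Create your project directory and code under a subdirectory of the TensorFlow source tree, then build with bazel into one large binary of more than 100 MB. The drawback is that the prediction module cannot be integrated into your own code base and build environment (cmake, bcloud, and so on), so portability and practicality are poor. Reference: (https://medium.com/jim-fleming/loading-a-tensorflow-graph-with-the-c-api-4caaff88463f) If the link does not open, there appear to be Chinese translations of the post on other blogs.
b) Build the TensorFlow source into a .so file yourself, then link against that file when compiling in your own C++ environment. The C API depends on libtensorflow.so; the C++ API depends on libtensorflow_cc.so.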

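Below, a concrete example walks through a simple ANN prediction demo; other models should be able to follow it or extend the base classes in the C++ code. Test environment: macOS. Required dependencies: tensorflow, bazel, protobuf, and eigen (a matrix computation library).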

Environment setup

Install Homebrew, Bazel, and Eigen on the system

# Install the Homebrew package manager on macOS
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

# Install Bazel, Google's build tool

brew install bazel

# Install protobuf from source
git clone https://github.com/google/protobuf.git
brew install automake libtool
cd protobuf
./autogen.sh
./configure
make check
make && make install

# Install Eigen, used for matrix operations

brew install eigen

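Download and build the TensorFlow source

# Download the TensorFlow source from GitHub
git clone --recursive https://github.com/tensorflow/tensorflow

# Enter the repository root, then build

# Build the .so file; building the C++ API library is recommended
bazel build //tensorflow:libtensorflow_cc.so

# Alternatively, build the C API library
bazel build //tensorflow:libtensorflow.so

After a wait of more than 30 minutes, if the build succeeds, folders such as bazel-bin and bazel-genfiles appear in the TensorFlow root directory. Run the following commands in order to copy the header directories into /usr/local/include/tf/ and libtensorflow_cc.so into /usr/local/lib/: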

mkdir /usr/local/include/tf
cp -r bazel-genfiles/ /usr/local/include/tf/
cp -r tensorflow /usr/local/include/tf/
cp -r third_party /usr/local/include/tf/
cp -r bazel-bin/tensorflow/libtensorflow_cc.so /usr/local/lib/

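Once this step is done, libtensorflow_cc.so and the headers are in place; later, when compiling in your own C++ build environment and source directory, you only need to link against these libraries.

1. Define and train the model offline in Python

We wrote a simple script that trains an ANN with one hidden layer to classify the Iris dataset; the layer sizes are [5, 64, 3]. The full script is in the accompanying project.

1.1 Name the input and output tensors in the graph

So that the C++ API can fetch the right result by tensor name, each tensor's tensor_name must be fixed in the Python training script. If a tensor lives inside a name scope, as in "namespace_A/tensor_A", use the full name. (Tip: if you are unsure what a tensor's name is, look up its definition in the exported .pbtxt file.) In this example we define the names of the following 3 tensors: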

class TensorNameConfig(object):
    input_tensor = "inputs"
    target_tensor = "target"
    output_tensor = "output_node"
    # To Do
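
The tip about looking names up in the .pbtxt file can be sketched in Python. This is a rough sketch and not part of the original project: list_node_names is a hypothetical helper, the sample text is made up, and the code assumes the standard protobuf text layout, where each top-level "node {" block carries its own name and op fields indented by exactly two spaces.

```python
import re

# Hypothetical helper (not from the original project). Assumes the standard
# protobuf text layout: top-level "node {" blocks whose own fields are
# indented by exactly two spaces, so nested fields never match.
_NAME_RE = re.compile(r'^  name: "([^"]*)"$')
_OP_RE = re.compile(r'^  op: "([^"]*)"$')


def list_node_names(pbtxt_text):
    """Return (name, op) pairs for every top-level node in a GraphDef .pbtxt."""
    nodes = []
    name = op = None
    in_node = False
    for line in pbtxt_text.splitlines():
        if line == "node {":
            in_node, name, op = True, None, None
        elif line == "}" and in_node:
            # An unindented closing brace ends the current node block.
            nodes.append((name, op))
            in_node = False
        elif in_node:
            m = _NAME_RE.match(line)
            if m:
                name = m.group(1)
            m = _OP_RE.match(line)
            if m:
                op = m.group(1)
    return nodes


# Made-up sample in the shape of a text-format GraphDef.
SAMPLE = '''node {
  name: "inputs"
  op: "Placeholder"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
}
node {
  name: "name_scope/output_node"
  op: "Softmax"
  input: "name_scope/add"
}
'''

for node_name, node_op in list_node_names(SAMPLE):
    print(node_name, node_op)
```

The names printed this way, including any scope prefix, are the strings to pass to freeze_graph.py and to the C++ Session later on.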

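1.2 Export the graph definition file *.pb and the parameter file *.ckpt

Two pieces of code are added to the training script nn_model.py. The first saves TensorFlow's graph_def as the file nn_model.pbtxt under ./models/; it contains the names and definitions of the tensors in the graph. The second saves the parameters during training, writing the trained ANN's weights and biases to *.ckpt, *.meta, and related files under ./ckpt. Finally, running python nn_model.py completes the training.

# Save the graph definition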
tf.train.write_graph(session.graph_def, FLAGS.model_dir, "nn_model.pbtxt", as_text=True)

# Save the checkpoint
checkpoint_path = os.path.join(FLAGS.train_dir, "nn_model.ckpt")
model.saver.save(session, checkpoint_path)

# Run the script to complete training

python nn_model.py

1.3 Freeze the model with the freeze_graph.py tool

Finally, use the freeze_graph.py tool that ships with TensorFlow to fix the parameters from the .ckpt file inside the graph and write out nn_model_frozen.pb.

# Run the freeze_graph.py tool
# freeze the graph and the weights
python freeze_graph.py --input_graph=../model/nn_model.pbtxt --input_checkpoint=../ckpt/nn_model.ckpt --output_graph=../model/nn_model_frozen.pb --output_node_names=output_node

Sign of success:

Converted 2 variables to const ops.

9 ops in the final graph.

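Parameters used in the command:

--input_graph: the model's graph definition file nn_model.pbtxt (without weights);
--input_checkpoint: the model's parameter file nn_model.ckpt;
--output_graph: the bound graph model file, parameters included, nn_model_frozen.pb;
--output_node_names: the names of the output tensors to compute (important);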
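
As a small illustration of how these parameters fit together, the sketch below only assembles the argument list and does not run the tool. build_freeze_command is a hypothetical helper, and choosing --input_binary from the file extension is an assumption made for this example (a .pbtxt file is taken to be text, anything else binary).

```python
import shlex


def build_freeze_command(input_graph, input_checkpoint, output_graph,
                         output_node_names, script="freeze_graph.py"):
    """Build the argument list for freeze_graph.py (does not run it)."""
    # Assumption: a .pbtxt extension means the graph was written with
    # as_text=True; anything else is treated as a binary GraphDef.
    is_binary = not input_graph.endswith(".pbtxt")
    return [
        "python", script,
        "--input_graph=" + input_graph,
        "--input_checkpoint=" + input_checkpoint,
        "--output_graph=" + output_graph,
        # Several output nodes are passed as one comma-separated value.
        "--output_node_names=" + ",".join(output_node_names),
        "--input_binary=" + ("true" if is_binary else "false"),
    ]


cmd = build_freeze_command(
    "../model/nn_model.pbtxt",
    "../ckpt/nn_model.ckpt",
    "../model/nn_model_frozen.pb",
    ["output_node"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without worrying about shell quoting.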

Running freeze_graph.py under different TensorFlow versions can hit quite a few bugs. Some of them:

# Bug1: google.protobuf.text_format.ParseError: 2:1 : Message type "tensorflow.GraphDef" has no field named "J".
# Cause: the model file was written as binary with tf.train.write_graph(,,as_text=False),
# so it must be read back with the matching setting: python freeze_graph.py [***] --input_binary=true.
# With as_text=True this can be ignored, because the default is --input_binary=false.
# Reference: https://github.com/tensorflow/tensorflow/issues/5780

# Bug2: Input checkpoint '...' doesn't exist!
# Cause: the command line may have used --input_checkpoint=data.ckpt.
# When running freeze_graph.py, the path argument apparently needs a leading "./" to be resolved correctly,
# e.g. change --input_checkpoint=data.ckpt to --input_checkpoint=./data.ckpt

# Bug3: google.protobuf.text_format.ParseError: 2:1 : Expected identifier or number.
# Cause: --input_checkpoint has to locate several files (.ckpt.data-000***, .ckpt.meta, and so on),
# so pass only the checkpoint prefix, e.g. nn_model.ckpt, not a full file name such as nn_model.ckpt.data-000***.
# A saved checkpoint consists of 4 files: .meta, .index, .data, and checkpoint.
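
The two path fixes for Bug2 and Bug3 can be combined in a small helper. This is a minimal sketch under stated assumptions: checkpoint_prefix is a hypothetical name, and the suffix pattern assumes files named like nn_model.ckpt.index, nn_model.ckpt.meta, and nn_model.ckpt.data-00000-of-00001.

```python
import re

# File-name endings that sit after the checkpoint prefix (assumed layout:
# <prefix>.index, <prefix>.meta, <prefix>.data-NNNNN-of-NNNNN).
_SUFFIX_RE = re.compile(r"\.(?:index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Turn a checkpoint file name into the prefix --input_checkpoint expects."""
    prefix = _SUFFIX_RE.sub("", path)
    # Bare relative names get a leading "./" (the workaround for Bug2).
    if "/" not in prefix:
        prefix = "./" + prefix
    return prefix


print(checkpoint_prefix("nn_model.ckpt.data-00000-of-00001"))
print(checkpoint_prefix("../ckpt/nn_model.ckpt.meta"))
print(checkpoint_prefix("./data.ckpt"))
```

Paths that already contain a directory part are left alone apart from the suffix, so the helper can be applied to any of the checkpoint files without changing where it points.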

# Bug4: tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./pos.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
# Cause: the Saver wrote the files in the V2 format; the fix is to update TensorFlow.

Additions are welcome.

Finally, output like the following means the binding succeeded: Converted variables to const ops. * ops in the final graph. The .pb file with the parameters bound in came to more than 10 MB.

2. Calling the model from the C++ API and compiling

2.1 Loading the model in C++

In the C++ prediction stage, the project includes two TensorFlow headers:

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"

In this example the C++ API calls are wrapped in base classes. FeatureAdapterBase handles the input features, and ModelLoaderBase provides a uniform model interface through its load() and predict() methods. You can subclass the base classes and implement these two methods for your own model, as ann_model_loader.cpp does in this demo. It is offered as a reference and not covered in detail here.

a) Create a Session, load the *.pb model file from model_path, and create the graph in the Session. The core code is:

// @brief: load the model from model_path and create the graph in the Session
// ReadBinaryProto() reads the protobuf file at model_path into a tensorflow::GraphDef object
// session->Create(graphdef) creates the corresponding graph in a Session

int ANNModelLoader::load(tensorflow::Session* session, const std::string model_path) {
    // Read the pb file into the graphdef member
    tensorflow::Status status_load = ReadBinaryProto(Env::Default(), model_path, &graphdef);
    if (!status_load.ok()) {
        std::cout << "ERROR: Loading model failed..." << model_path << std::endl;
        std::cout << status_load.ToString() << "\n";
        return -1;
    }

    // Add the graph to the session
    tensorflow::Status status_create = session->Create(graphdef);
    if (!status_create.ok()) {
        std::cout << "ERROR: Creating graph in session failed..." << status_create.ToString() << std::endl;
        return -1;
    }
    return 0;
}

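b) In the prediction stage, call session->Run(input_feature.input, {output_node}, {}, &outputs);

The parameter const FeatureAdapterBase& input_feature has a member input_feature.input of a map-like type, std::vector<std::pair<std::string, tensorflow::Tensor> >, similar to feed_dict={"x": x, "y": y} in Python. The input tensor names in the C++ code must match those in the Python training script, such as "inputs" and "target" set above. They are set by calling the method assign(std::string, std::string tname, std::vector* vec) of the base class FeatureAdapterBase.

The parameter const std::string output_node is the name of the output node defined in the Python script, for example "name_scope/output_node".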

int ANNModelLoader::predict(tensorflow::Session* session, const FeatureAdapterBase& input_feature,
        const std::string output_node, double* prediction) {
    // The session will initialize the outputs
    std::vector<tensorflow::Tensor> outputs;         // shape [batch_size]

    // @input: vector<pair<string, tensor> >, feed_dict
    // @output_node: std::string, name of the output node op, defined in the protobuf file
    tensorflow::Status status = session->Run(input_feature.input, {output_node}, {}, &outputs);
    if (!status.ok()) {
        std::cout << "ERROR: prediction failed..." << status.ToString() << std::endl;
        return -1;
    }

    // ...
}

2.2 Compiling the C++ code

Recall the libtensorflow_cc.so file built earlier; compiling successfully requires linking against that library. Run the following command:

# Using g++
g++ -std=c++11 -o tfcpp_demo \
-I/usr/local/include/tf \
-I/usr/local/include/eigen3 \
-g -Wall -D_DEBUG -Wshadow -Wno-sign-compare -w \
`pkg-config --cflags --libs protobuf` \
-L/usr/local/lib \
-ltensorflow_cc main.cpp ann_model_loader.cpp

Meaning of the arguments:

a) -I/usr/local/include/tf # directory of the include files
b) -L/usr/local/lib # directory containing the built libtensorflow_cc.so
c) -ltensorflow_cc # name of the .so library to link

For convenience, I wrote a Makefile; replace the paths in it with your own and just run make each time.

make

Compiling directly with g++ can also hit a few bugs, recorded here:

# Bug1: main.cpp:9:10: fatal error: 'tensorflow/core/public/session.h' file not found
# include "tensorflow/core/public/session.h"
# Cause: the compiler could not find the TensorFlow headers (or the libtensorflow_cc.so built earlier); check the -I and -L path arguments.

# Bug2: fatal error: 'google/protobuf/stubs/common.h' file not found
# Cause: protobuf was not installed successfully.

# Bug3: /usr/local/include/tf/third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:10: fatal error: 'unsupported/Eigen/CXX11/Tensor' file not found
# Cause: Eigen is not installed, or its path was not found.
# See the Eigen installation step earlier.
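
All three errors above come down to a header missing from the include path, so a quick check before invoking g++ can save a compile cycle. The following is a sketch only: find_missing_headers is a hypothetical helper, and the demo runs against a temporary directory so that it works on any machine.

```python
import os
import tempfile

# Headers named in the error messages above, relative to an include root.
REQUIRED_HEADERS = [
    "tensorflow/core/public/session.h",
    "google/protobuf/stubs/common.h",
    "unsupported/Eigen/CXX11/Tensor",
]


def find_missing_headers(include_dirs, headers=REQUIRED_HEADERS):
    """Return the headers that cannot be found under any of include_dirs."""
    missing = []
    for header in headers:
        if not any(os.path.isfile(os.path.join(d, header)) for d in include_dirs):
            missing.append(header)
    return missing


# Demo on a throwaway directory tree that contains only session.h.
with tempfile.TemporaryDirectory() as root:
    header_path = os.path.join(root, "tensorflow/core/public/session.h")
    os.makedirs(os.path.dirname(header_path))
    open(header_path, "w").close()
    demo_missing = find_missing_headers([root])

print(demo_missing)
```

On a real machine the call would take the directories passed with -I, for example find_missing_headers(["/usr/local/include/tf", "/usr/local/include/eigen3", "/usr/local/include"]); the last entry is an assumption about where protobuf's headers were installed.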

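3. Run

Finally, try running the executable tfcpp_demo compiled earlier.

# Run the executable; the model_path argument points to the frozen model file nn_model_frozen.pb, which includes the parameters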
folder_dir=`pwd`
model_path=${folder_dir}/model/nn_model_frozen.pb
./tfcpp_demo ${model_path}

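We try predicting one sample, [1,1,1,1,1], and output its class and probability. With this step we have completed the whole flow: defining and training the model in Python, then compiling and calling it from C++ production code.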
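
Turning the model's output vector into a class and a probability is a small post-processing step. The sketch below assumes the output node yields one raw score per class, which the post does not state; if the node already applies softmax, only top_class is needed. The function names and the example scores are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(probabilities):
    """Return (class_index, probability) for the most likely class."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, probabilities[index]


# Made-up scores for the 3 Iris classes, not real model output.
example_scores = [0.2, 2.0, 0.5]
class_index, probability = top_class(softmax(example_scores))
print("class:", class_index, "probability:", round(probability, 3))
```

The same two steps translate directly to C++ once the values have been read out of the output tensor.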