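Protobuf serialization/deserialization performance and open questions
To meet the needs of a TensorFlow project, I measured protobuf serialization and deserialization performance. The test procedure and results follow.
I. Test environment
Python 2.7 + proto3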
II. Test method
1. Define a proto message (adapted from the example that ships with protobuf) and save it as addressbook.proto:
    syntax = "proto3";

    message Person {
      string name = 1;
      int32 id = 2;  // Unique ID number for this person.
      string email = 3;

      enum PhoneType {
        MOBILE = 0;
        HOME = 1;
        WORK = 2;
      }

      message PhoneNumber {
        string number = 1;
        PhoneType type = 2;
      }

      repeated PhoneNumber phones = 4;
    }

    // Our address book file is just one of these.
    message AddressBook {
      repeated Person people = 1;
    }
2. Compile the proto file:
    protoc --python_out=. addressbook.proto
This produces addressbook_pb2.py.
3. In the test script, change the loop count to change the size of the serialized content:
    for i in range(1024 * 1024):
        PromptForAddress(address_book.people.add())
4. Serialize:
    begin = datetime.datetime.now()
    serialized = address_book.SerializeToString()
    end = datetime.datetime.now()
    print end - begin
    print len(serialized)
    f.write(serialized)
5. Deserialize:
    book = f.read()
    parsebegin = datetime.datetime.now()
    address_book.ParseFromString(book)
    parseend = datetime.datetime.now()
    print parseend - parsebegin
    print len(book)
The complete .py file is as follows:
    #! /usr/bin/env python
    # See README.txt for information and build instructions.

    import addressbook_pb2
    import sys
    import datetime


    # Fills in a Person message with fixed test data (no user input is read).
    def PromptForAddress(person):
        person.id = 160824
        person.name = "xxxxx xxxxx"
        person.email = "[email protected]"

        phone_number = person.phones.add()
        phone_number.number = "12345678"
        phone_number.type = addressbook_pb2.Person.MOBILE

        phone_number = person.phones.add()
        phone_number.number = "23456789"
        phone_number.type = addressbook_pb2.Person.HOME

        phone_number = person.phones.add()
        phone_number.number = "34567890"
        phone_number.type = addressbook_pb2.Person.WORK


    # Main procedure: parses (and times) the existing address book file if
    # there is one, appends the test entries, then serializes (and times)
    # the result and writes it back to the same file.
    if len(sys.argv) != 2:
        print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
        sys.exit(-1)

    address_book = addressbook_pb2.AddressBook()

    # Read the existing address book and time the parse.
    try:
        with open(sys.argv[1], "rb") as f:
            book = f.read()
            parsebegin = datetime.datetime.now()
            address_book.ParseFromString(book)
            parseend = datetime.datetime.now()
            print parseend - parsebegin
            print len(book)
    except IOError:
        print sys.argv[1] + ": File not found.  Creating a new file."

    # Add the test entries; the loop count controls the message size.
    for i in range(1024 * 1024):
        PromptForAddress(address_book.people.add())

    # Time serialization, then write the new address book back to disk.
    with open(sys.argv[1], "wb") as f:
        begin = datetime.datetime.now()
        serialized = address_book.SerializeToString()
        end = datetime.datetime.now()
        print end - begin
        print len(serialized)
        f.write(serialized)
6. Change the loop count and record the serialization and deserialization times for messages of different sizes.
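Steps 4 and 5 time a single call with datetime.datetime.now(). As a possible refinement, the sketch below repeats a call a few times with timeit.default_timer and keeps the fastest run, which tends to reduce noise from other processes. The helper name best_of and its repeats parameter are illustrative assumptions, not part of the original script, and the workload is a stand-in so the sketch runs without protobuf installed.

```python
import timeit


def best_of(func, repeats=3):
    """Call func() `repeats` times; return (fastest_seconds, last_result).

    Hypothetical helper, not part of the original test script.
    """
    best = None
    result = None
    for _ in range(repeats):
        start = timeit.default_timer()
        result = func()
        elapsed = timeit.default_timer() - start
        if best is None or elapsed < best:
            best = elapsed
    return best, result


# Stand-in workload: build a 1 MiB byte string instead of a protobuf message.
seconds, payload = best_of(lambda: b"x" * (1024 * 1024))
print("%.6f s for %d bytes" % (seconds, len(payload)))
```

With the script above, the equivalent call would be best_of(address_book.SerializeToString), assuming address_book has already been filled in.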
III. Test results
Size (MB) | Serialization (s) | Deserialization (s)
1.03 | 0.799453 | 0.950107
53.00 | 36.759911 | 43.303041
61.64 | 41.674104 | 52.206466
81.00 | 63.077295 | 79.234909
106.00 | 72.048027 | 88.280556
102.83 | 81.08806 | 102.28786
162.00 | 128.883403 | 164.042591
205.66 | 163.994605 | 199.729636
243.00 | 197.582673 | 246.699898
Note: the size column is the length of the serialized string, i.e. len(serialized) in the script.
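One way to check how close these timings are to linear is to convert each row into a throughput in MB/s: if time grows linearly with size, the throughput should stay roughly constant. The sketch below does that arithmetic on the figures from the table; the names RESULTS and throughput are illustrative only.

```python
# (size in MB, serialization seconds, deserialization seconds),
# copied from the results table.
RESULTS = [
    (1.03, 0.799453, 0.950107),
    (53.00, 36.759911, 43.303041),
    (61.64, 41.674104, 52.206466),
    (81.00, 63.077295, 79.234909),
    (106.00, 72.048027, 88.280556),
    (102.83, 81.08806, 102.28786),
    (162.00, 128.883403, 164.042591),
    (205.66, 163.994605, 199.729636),
    (243.00, 197.582673, 246.699898),
]


def throughput(rows):
    """Return a list of (size_mb, serialize_mb_per_s, parse_mb_per_s)."""
    return [(size, size / ser, size / par) for size, ser, par in rows]


for size, ser_rate, par_rate in throughput(RESULTS):
    print("%7.2f MB  serialize %.2f MB/s  parse %.2f MB/s"
          % (size, ser_rate, par_rate))
```

For these figures, serialization stays between about 1.2 and 1.5 MB/s and deserialization between about 1.0 and 1.2 MB/s across the whole size range.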
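IV. Analysis and open questions
The results show roughly linear growth: the larger the message, the longer it takes. At 243 MB, serialization takes about 198 s (a little over 3 minutes) and deserialization about 247 s (a little over 4 minutes). The results raise the following questions: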
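1. Is the test method sound? I believe it is, but the measured times are larger than I expected.
2. This test was done in Python. The same test in C++ gives much better results (for the C++ side, see "Performance comparison of FlatBuffers and protobuf").
I only compared a small message (1 KB), looping 100 times over both serialization and deserialization. Both tests used the same proto file; the C++ test called ParseFromArray/SerializeToArray and the Python test called ParseFromString/SerializeToString. In the results below, C++ is roughly 48 times faster at serialization and 61 times faster at deserialization: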
Language | Serialization (ms) | Deserialization (ms)
Python | 63.879 | 82.89
C++ | 1.336 | 1.352
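3. According to what I have read, the cost of serialization and deserialization also depends on the structure of the proto (for example, multiple levels of nesting). I therefore suggest running another test in the context of TensorFlow once we are more familiar with it: while training a model, time the serialization and deserialization steps separately.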
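The separate timing suggested in point 3 could be sketched with a small context manager that accumulates wall-clock time per label, so the serialization and deserialization shares can be read off after a run. The names timed, totals and counts are hypothetical, and the loop body is a stand-in rather than real TensorFlow or protobuf code.

```python
import contextlib
import timeit
from collections import defaultdict

# Accumulated seconds and call counts per label (hypothetical names).
totals = defaultdict(float)
counts = defaultdict(int)


@contextlib.contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to totals[label]."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        totals[label] += timeit.default_timer() - start
        counts[label] += 1


# Stand-in loop: in a real run the with-blocks would wrap the
# SerializeToString / ParseFromString calls inside the training loop.
for step in range(3):
    with timed("serialize"):
        data = repr(list(range(1000))).encode("ascii")
    with timed("parse"):
        text = data.decode("ascii")

for label in sorted(totals):
    print("%s: %d calls, %.6f s total" % (label, counts[label], totals[label]))
```

Keeping the timer outside the code being measured means the training code itself only needs the two with-blocks added.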
The questions above still need discussion, and corrections are welcome.
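On question 1, one factor that may be worth ruling out is which backend the Python protobuf package is using. The package can run on a pure-Python implementation or on a native one, and the test above does not record which was active, so whether this explains the numbers is only an assumption. The sketch below reports the active backend through api_implementation.Type(), which lives in an internal module of the package and may change between releases; the wrapper name protobuf_backend is hypothetical.

```python
def protobuf_backend():
    """Return the active protobuf backend name, or None if protobuf is not installed."""
    try:
        # Internal module of the protobuf package; not a stable public API.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


print(protobuf_backend())
```

If this reports "python", the timings above were taken on the pure-Python implementation, and repeating the test with a native backend would show how much of the gap to C++ comes from the backend alone.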