A Basic Guide to Protobuf in Python (Serializing Data)

阿新 • Published: 2020-12-13

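Protocol Buffers (Protobuf) is a language-neutral data serialization format developed by Google. Protobuf stands out for the following reasons:

Small data size: Protobuf uses a binary format that is more compact than text formats such as JSON.
Persistence: Protobuf serialization is backward compatible. This means that previously stored data can still be restored even if the interface has changed in the meantime.
Design by contract: Protobuf requires messages to be specified with explicit identifiers and types.
Requirement for gRPC: gRPC (gRPC Remote Procedure Call) is an efficient remote procedure call system that builds on the Protobuf format.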

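Personally, what I like most about Protobuf is that it forces developers to define the interfaces of an application explicitly. This is a game changer because it allows all stakeholders to understand the interface design and contribute to it.

In this post, I want to share my experience of using Protobuf in Python applications.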

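Installing Protobuf

On most systems, Protobuf can be installed from source. Below, I describe the installation on Unix systems:

1. Download a Protobuf release from GitHub (3.12.4 in this example):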

wget https://github.com/protocolbuffers/protobuf/releases/download/v3.12.4/protobuf-all-3.12.4.tar.gz

2. Unpack the archive:

tar -xzf protobuf-all-3.12.4.tar.gz

3. Install:

cd protobuf-3.12.4/ && ./configure && make && sudo make install

4. Verify the installation (protoc should now be available):

protoc

protoc --version
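The same check can be scripted, which is handy before a build step tries to call the compiler. The following is a minimal sketch that only relies on the `protoc --version` command shown above; the function name `protoc_version` is made up for this illustration.

```python
import shutil
import subprocess


def protoc_version():
    """Return the output of `protoc --version`, or None if protoc is not on PATH."""
    executable = shutil.which("protoc")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = protoc_version()
    print(version if version else "protoc was not found on PATH")
```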

Once the proto compiler is available, we can get started.

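1. Defining Protobuf messages

To use Protobuf, we first need to define the messages that we want to transmit. Messages are defined in .proto files. Please consult the official documentation for the details of the protocol buffer language. Here, I only provide a simple example that demonstrates the most important language features.

Suppose we are developing a Facebook-like social network that is all about people and their connections. This is why we want to model a message for a person.

A person has certain intrinsic characteristics (e.g. age, sex, height) as well as certain extrinsic characteristics that we need to model (e.g. friends, hobbies).

Let's save the following definition as src/interfaces/person.proto: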

syntax = "proto3";

import "generated/person_info.proto";

package persons;

message Person {
    PersonInfo info = 1; // characteristics of the person
    repeated Friend friends = 2; // friends of the person
}

message Friend {
    float friendship_duration = 1; // duration of friendship in days
    repeated string shared_hobbies = 2; // shared interests
    Person person = 3; // identity of the friend
}

Note that we reference another proto file, generated/person_info.proto, which we define as follows:

syntax = "proto3";

package persons;

enum Sex {
    M = 0; // male 
    F = 1; // female
    O = 2; // other
}

message PersonInfo {
    int32 age = 1; // age in years
    Sex sex = 2; 
    int32 height = 3; // height in cm
}

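Don't worry if these definitions do not make sense to you yet. I will now explain the most important keywords:

syntax: The syntax statement defines which version of the Protobuf language the specification uses. We are using proto3.
import: If a message is defined in terms of another message, the other definition has to be included with an import statement. You may wonder why person.proto imports generated/person_info.proto rather than interfaces/person_info.proto. We will look at this in more detail later; for now, just note that this is due to Python's import system.
package: The package defines messages that belong to the same namespace. This prevents name collisions.
enum: An enum defines an enumeration type.
message: A message is a piece of information that we want to model with Protobuf.
repeated: The repeated keyword marks a field that is interpreted as a vector. In our case, friends is a vector of Friend messages.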

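Also note that every message field is assigned a unique number. These numbers identify the fields in the binary encoding and are what makes the protocol backward compatible: once a number has been assigned to a field, it should not be changed at a later point in time.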
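To see why stable field numbers matter, it helps to look at how a reader walks through the binary encoding. The following is a simplified, hand-written sketch and not the Protobuf library itself: it only handles integer (varint) fields, the function names are made up for this illustration, and field 4 (weight) is a hypothetical addition to PersonInfo. Each value on the wire is preceded by a tag that contains the field number, so a reader can pick out the numbers it knows and pass over the rest.

```python
def decode_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_known_fields(data, known_fields):
    """Decode varint-only data, keeping only the field numbers the reader knows."""
    decoded = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(data, pos)
        if field_number in known_fields:
            decoded[known_fields[field_number]] = value
        # values with unknown field numbers are read and then ignored
    return decoded


# Bytes as a newer writer might produce them:
# field 1 (age) = 30, field 3 (height) = 184, hypothetical field 4 (weight) = 80
new_message = bytes.fromhex("08 1e 18 b8 01 20 50")

# An older reader that only knows fields 1 and 3 can still decode the message.
old_reader_fields = {1: "age", 3: "height"}
print(decode_known_fields(new_message, old_reader_fields))
# {'age': 30, 'height': 184}
```

If the number of an existing field were changed instead, the old reader would look for its data under the wrong tag, which is why assigned numbers should stay fixed.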

Now that we have the basic proto definitions for our application, we can start generating the corresponding Python code.

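2. Compiling the proto files

To compile the proto files into Python objects, we will use the Protobuf compiler, protoc.

We will call the proto compiler with the following options:

--python_out: the directory in which the compiled Python files will be stored
--proto_path: Since the proto files are not in the root folder of the project, we need to use a substitution. By specifying generated=./src/interfaces, the compiler knows that, when importing other proto messages, we want to use the path of the generated files (generated) rather than the location of the interfaces (src/interfaces).

With this understanding, we can compile the proto files as follows: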

mkdir src/generated
protoc src/interfaces/person_info.proto --python_out src/ --proto_path generated=./src/interfaces/
protoc src/interfaces/person.proto --python_out src/ --proto_path generated=./src/interfaces/

After running these commands, the files src/generated/person_pb2.py and src/generated/person_info_pb2.py should exist. For example, person_pb2.py looks as follows:

_PERSON = _descriptor.Descriptor(
  name='Person',
  full_name='persons.Person',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  create_key=_descriptor._internal_create_key,
  fields=[
...

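The generated Python code is not really human-readable. That is fine, though, because all we need to know is that person_pb2.py can be used to construct serializable Protobuf objects.

3. Serializing Protobuf objects

Before we can serialize our Protobuf object in a meaningful way, we need to fill it with some data. Let's create a person who has a single friend: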

# fill protobuf objects
import generated.person_pb2 as person_pb2
import generated.person_info_pb2 as person_info_pb2
############
# define friend for person of interest
#############
friend_info = person_info_pb2.PersonInfo()
friend_info.age = 40
friend_info.sex = person_info_pb2.Sex.M
friend_info.height = 165
friend_person = person_pb2.Person()
friend_person.info.CopyFrom(friend_info)
friend_person.friends.extend([])  # no friends :-(
#######
# define friendship characteristics
########
friendship = person_pb2.Friend()
friendship.friendship_duration = 365.1
friendship.shared_hobbies.extend(["books", "daydreaming", "unicorns"])
friendship.person.CopyFrom(friend_person)
#######
# assign the friend to the person of interest
#########
person_info = person_info_pb2.PersonInfo()
person_info.age = 30
person_info.sex = person_info_pb2.Sex.M
person_info.height = 184
person = person_pb2.Person()
person.info.CopyFrom(person_info)
person.friends.extend([friendship])  # person with a single friend

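Note that we filled all trivial data types (e.g. integers, floats, and strings) through direct assignment. Only the more complex data types required other functions. For example, we used extend to fill repeated Protobuf fields and CopyFrom to fill Protobuf sub-messages.

To serialize a Protobuf object, we can use the SerializeToString() function. In addition, we can use the str() function to output the Protobuf object as a human-readable string: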

# serialize proto object
import os
out_dir = "proto_dump"
os.makedirs(out_dir, exist_ok=True)  # create the output directory if it is missing
with open(os.path.join(out_dir, "person.pb"), "wb") as f:
    # binary output
    f.write(person.SerializeToString())
with open(os.path.join(out_dir, "person.protobuf"), "w") as f:
    # human-readable output for debugging
    f.write(str(person))
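Reading the binary file back is the mirror image of writing it. The following is a small sketch that relies on ParseFromString(), the counterpart of SerializeToString() on Protobuf messages; the helper name `load_message` is made up for this illustration, and the message class is passed in as a parameter so that the helper does not depend on a particular generated module.

```python
def load_message(path, message_cls):
    """Read a binary Protobuf file and parse it into a new message_cls instance."""
    message = message_cls()
    with open(path, "rb") as f:
        message.ParseFromString(f.read())
    return message


# Usage, assuming the generated module from the compilation step is importable:
#   import generated.person_pb2 as person_pb2
#   person = load_message("proto_dump/person.pb", person_pb2.Person)
#   print(person.info.age)
```

Note that the file has to be opened in binary mode ("rb"), just as it was written with "wb".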

After running the serialization snippet, the generated human-readable Protobuf message can be found in proto_dump/person.protobuf:

info {
  age: 30
  height: 184
}
friends {
  friendship_duration: 365.1000061035156
  shared_hobbies: "books"
  shared_hobbies: "daydreaming"
  shared_hobbies: "unicorns"
  person {
    info {
      age: 40
      height: 165
    }
  }
}

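Note that the output shows neither the sex of the person nor that of the friend. This is not a bug but a feature of Protobuf: in proto3, fields whose value is 0 (the default) are never printed. Since both persons are male, which is represented by 0, sex is not shown here.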
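The same rule applies to the binary encoding, which a hand-written encoder can illustrate. This is a simplified sketch and not the Protobuf library: it only covers the three non-negative integer fields of PersonInfo, and the function names are made up for this illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the high bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_person_info(age=0, sex=0, height=0):
    """Hand-encode a PersonInfo-like message, skipping zero-valued fields."""
    encoded = b""
    for field_number, value in ((1, age), (2, sex), (3, height)):
        if value == 0:
            continue  # default values are not written to the wire
        tag = (field_number << 3) | 0  # wire type 0 = varint
        encoded += encode_varint(tag) + encode_varint(value)
    return encoded


male = encode_person_info(age=30, sex=0, height=184)    # Sex.M == 0
female = encode_person_info(age=30, sex=1, height=184)  # Sex.F == 1
print(male.hex())    # 081e18b801
print(female.hex())  # 081e100118b801
```

The message with sex set to 0 is two bytes shorter because the field is left out entirely; a reader fills in the default when the field is absent.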

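4. Automating Protobuf compilation

During development, having to recompile the proto files after every change can become tedious. To compile the proto files automatically when installing the Python package in development mode, we can use the setup.py script.

Let's create a function that generates the Protobuf code for all .proto files in the src/interfaces directory and stores it under src/generated: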

import pathlib
import os
from subprocess import check_call

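def generate_proto_code():
    """Compile every .proto file in src/interfaces into src/generated."""
    proto_interface_dir = "./src/interfaces"
    generated_src_dir = "./src/generated/"
    out_folder = "src"
    # create the output directory if it does not exist yet
    os.makedirs(generated_src_dir, exist_ok=True)
    # collect only .proto files, including those in subdirectories
    proto_it = pathlib.Path().glob(proto_interface_dir + "/**/*.proto")
    # map the import prefix "generated" to the interface directory
    proto_path = "generated=" + proto_interface_dir
    protos = [str(proto) for proto in proto_it if proto.is_file()]
    check_call(["protoc"] + protos + ["--python_out", out_folder, "--proto_path", proto_path])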

Next, we need to override the develop command so that the function is called every time the package is installed:

from setuptools.command.develop import develop
from setuptools import setup, find_packages

class CustomDevelopCommand(develop):
    """Wrapper for custom commands to run before package installation."""
    uninstall = False

    def run(self):
        develop.run(self)

    def install_for_development(self):
        develop.install_for_development(self)
        generate_proto_code()

setup(
    name='testpkg',
    version='1.0.0',
    package_dir={'': 'src'},
    cmdclass={
        'develop': CustomDevelopCommand, # used for pip install -e ./
    },
    packages=find_packages(where='src')
)

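The next time we run pip install -e ./, the Protobuf files will be generated automatically in src/generated.

How much space do we save?

Earlier, I mentioned that one of the advantages of Protobuf is its binary format. Here, we will examine this advantage by comparing the size of the Protobuf message for the Person with the corresponding JSON: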

"person": {
    "info": {
      "age": 30,
      "height": 184
    },
    "friends": {
      "friendship_duration": 365.1000061035156,
      "shared_hobbies": ["books", "daydreaming", "unicorns"],
      "person": {
        "info": {
          "age": 40,
          "height": 165
        }
      }
    }
}

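Comparing the JSON with the Protobuf text representation, it turns out that the JSON is actually more compact because its list representation is more concise. This is misleading, however, because we are interested in the binary Protobuf format.

When we compare the number of bytes occupied by the binary Protobuf and the JSON of the Person object, we find the following: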

du -b person.pb
53      person.pb

du -b person.json 
304     person.json

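Here, Protobuf is more than 5 times smaller than JSON

The binary Protobuf (53 bytes) is more than 5 times smaller than the corresponding JSON (304 bytes). Note that this level of compactness is only reached if the binary Protobuf is transmitted as is, for example with the gRPC protocol.

If gRPC is not an option, a common pattern is to encode the binary Protobuf data with base64. Although this encoding always increases the payload size by about 33%, the result is still much smaller than the corresponding REST payload.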
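The base64 overhead is easy to check for the 53-byte message from above. The following sketch uses placeholder bytes instead of the real person.pb, since only the length matters for this calculation.

```python
import base64

binary_size = 53  # size of person.pb reported above
payload = bytes(binary_size)  # placeholder bytes; only the length matters here

encoded = base64.b64encode(payload)
overhead = len(encoded) / len(payload) - 1

print(len(encoded))       # 72
print(f"{overhead:.0%}")  # 36%
```

Base64 turns every 3 bytes into 4 characters, which gives the nominal 33%. Because the output is padded to a multiple of 4 characters, the overhead for this small message is slightly higher, but 72 bytes are still far below the 304 bytes of the JSON file.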

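Summary

Protobuf is an ideal format for data serialization. It is much smaller than JSON and allows interfaces to be defined explicitly. Because of these favorable properties, I recommend using Protobuf in all projects that work with sufficiently complex data. Although Protobuf requires an initial investment of time, I am sure that it pays off quickly.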
