Thrift's Serialization Mechanism

阿新 • Published: 2019-02-10

1. First, let's define a simple Thrift struct.

namespace java mmxf.thrift;
 
struct Pair {
    1: required string key
    2: required string value
}

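You can probably guess what the required modifier means. But have you ever wondered what the numeric identifiers "1" and "2" actually mean, and what role they play in the serialization mechanism?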

Compile it to generate the Java code:

thrift -gen java <your thrift file>

2. Write the test code

private String datafile = "1.dat";

// *) Write the object to a file
public void writeData() throws IOException, TException {
    Pair pair = new Pair();
    pair.setKey("rowkey").setValue("column-family");

    // try-with-resources closes the stream even if write() throws
    try (FileOutputStream fos = new FileOutputStream(new File(datafile))) {
        pair.write(new TBinaryProtocol(new TIOStreamTransport(fos)));
    }
}

Calling writeData() writes pair{key => rowkey, value => column-family} to the file 1.dat.

3. Now redefine the Pair struct, swapping the numeric ids

struct Pair {
    2: required string key
    1: required string value
}

Note: here 2 corresponds to key and 1 corresponds to value.

Recompile: thrift -gen java <your thrift file>

4. Then read the data back

private String datafile = "1.dat";

// *) Restore the object from the file
public void readData() throws TException, IOException {
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream fis = new FileInputStream(new File(datafile))) {
        Pair pair = new Pair();
        pair.read(new TBinaryProtocol(new TIOStreamTransport(fis)));

        System.out.println("key => " + pair.getKey());
        System.out.println("value => " + pair.getValue());
    }
}

Calling readData() restores the Pair object from the file 1.dat.

Result:

key => column-family

value => rowkey

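Is that the opposite of what you expected? Evidently the field names play no part here, while the numeric id plays a central role in Thrift's serialization and deserialization. With that question in mind, let's look at the serialization mechanism in more detail.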

Thrift data format description

Versioning in Thrift is implemented via field identifiers. The field header for every member of a struct in Thrift is encoded with a unique field identifier. The combination of this field identifier and its type specifier is used to uniquely identify the field. The Thrift definition language supports automatic assignment of field identifiers, but it is good programming practice to always explicitly specify field identifiers.

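In other words: Thrift's versioning (backward compatibility) relies on the field identifier (numeric id + field type). After serialization, each field is stored as id+type:field_value rather than field_name:field_value, which explains the behavior seen in the scenario above.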
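To make the id+type:field_value idea concrete, here is a minimal sketch that hand-builds the bytes for Pair without using the Thrift library. It assumes the TBinaryProtocol struct layout: each field is a 1-byte type code, a 2-byte big-endian field id, and then the value (for a string, a 4-byte length followed by UTF-8 bytes), and the struct ends with a single STOP byte. The class and method names (WireLayoutSketch, putStringField, encodePair, toHex) are made up for this illustration and are not part of Thrift.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class WireLayoutSketch {
    static final byte T_STOP = 0;    // marks the end of a struct's fields
    static final byte T_STRING = 11; // type code for string fields

    // Writes one string field: [type:1][id:2][length:4][utf8 bytes].
    // Note that the field name never appears in the output.
    static void putStringField(ByteBuffer out, int id, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.put(T_STRING);
        out.putShort((short) id);
        out.putInt(utf8.length);
        out.put(utf8);
    }

    // Encodes Pair{1: key, 2: value} followed by the stop byte.
    static byte[] encodePair(String key, String value) {
        ByteBuffer out = ByteBuffer.allocate(256); // large enough for this small example
        putStringField(out, 1, key);
        putStringField(out, 2, value);
        out.put(T_STOP);
        byte[] bytes = new byte[out.position()];
        out.flip();
        out.get(bytes);
        return bytes;
    }

    // Renders bytes as space-separated hex for inspection.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodePair("rowkey", "column-family")));
    }
}
```

The dump begins with 0b 00 01 00 00 00 06: type 11 (string), id 1, length 6, followed by the bytes of "rowkey". Only the id and type identify the field; the names key and value exist only in the schema.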

Here is a walkthrough of the read code generated for the Pair struct defined earlier:

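public void read(org.apache.thrift.protocol.TProtocol iprot, Pair struct) throws org.apache.thrift.TException {
    // *) Read the struct begin marker
    iprot.readStructBegin();
    while (true) {
        // *) Read the field begin marker
        schemeField = iprot.readFieldBegin();
        // *) A STOP type marks the end of the struct's fields
        if (schemeField.type == org.apache.thrift.protocol.TType.STOP) {
            break;
        }
        // *) The field header carries id + type; the switch uses them to assign the value
        switch (schemeField.id) {
            case <id>: // <field_name>
                if (schemeField.type == org.apache.thrift.protocol.TType.<type>) {
                    struct.<field_name> = iprot.read<type>();
                    struct.set<field_name>IsSet(true);
                } else {
                    // *) Type mismatch: skip the value
                    org.apache.thrift.protocol.TProtocolUtil.skip(iprot, schemeField.type);
                }
                break;
            default:
                // *) Unknown id: skip the value
                org.apache.thrift.protocol.TProtocolUtil.skip(iprot, schemeField.type);
        }
        // *) Read the field end marker
        iprot.readFieldEnd();
    }
    // *) Read the struct end marker
    iprot.readStructEnd();
}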

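This restore function also gives a first impression of how Thrift serializes an object: dissected piece by piece, it comes down to an ordered, structured sequence of calls to readStructBegin, readFieldBegin, and read<type> (readString, readI32, readI64).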
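The id-driven dispatch can be reproduced without the Thrift library. The sketch below is only an illustration, not the generated code: it assumes the TBinaryProtocol struct layout (1-byte type, 2-byte id, 4-byte length plus UTF-8 bytes for a string, and a single STOP byte at the end), and the names IdDispatchSketch, encode, and decode are made up for this example. The reader takes a map from id to field name in place of the generated switch statement.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class IdDispatchSketch {
    static final byte T_STOP = 0;    // end-of-struct marker
    static final byte T_STRING = 11; // type code for string fields

    // Writes two string fields with ids 1 and 2, then the stop byte.
    static byte[] encode(String field1, String field2) {
        ByteBuffer out = ByteBuffer.allocate(256); // large enough for this small example
        String[] values = {field1, field2};
        for (int i = 0; i < values.length; i++) {
            byte[] utf8 = values[i].getBytes(StandardCharsets.UTF_8);
            out.put(T_STRING).putShort((short) (i + 1)).putInt(utf8.length).put(utf8);
        }
        out.put(T_STOP);
        byte[] bytes = new byte[out.position()];
        out.flip();
        out.get(bytes);
        return bytes;
    }

    // Reads fields until the stop byte and names each value by looking up its id.
    static Map<String, String> decode(byte[] data, Map<Short, String> idToName) {
        ByteBuffer in = ByteBuffer.wrap(data);
        Map<String, String> result = new HashMap<>();
        while (true) {
            byte type = in.get();
            if (type == T_STOP) {
                break;
            }
            short id = in.getShort();
            if (type != T_STRING) {
                throw new IllegalStateException("this sketch handles string fields only");
            }
            byte[] utf8 = new byte[in.getInt()];
            in.get(utf8);
            String name = idToName.get(id);
            if (name != null) {
                result.put(name, new String(utf8, StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Written with the original schema: 1 = key, 2 = value.
        byte[] data = encode("rowkey", "column-family");

        // Read with the redefined schema: 2 = key, 1 = value.
        Map<Short, String> swapped = new HashMap<>();
        swapped.put((short) 2, "key");
        swapped.put((short) 1, "value");

        Map<String, String> pair = decode(data, swapped);
        System.out.println("key => " + pair.get("key"));
        System.out.println("value => " + pair.get("value"));
    }
}
```

Because the bytes carry only ids, reading them with the swapped mapping assigns "column-family" to key and "rowkey" to value, the same result as in the experiment with 1.dat.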

Categories of data interchange formats

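Current data interchange formats fall into the following categories:

1. Self-describing

    The serialized data contains the complete structure, including field names and values. Examples include XML, JSON, Java Serializable, and Baidu's mcpack/compack. As a result, reordering the fields does not affect serialization or deserialization.

2. Partially self-describing

    The serialized data drops some information, such as field names, and instead uses an index (usually id + type) to map values to fields. Google Protobuf is a representative example, and Thrift also belongs to this category.

3. Non-self-describing

    Baidu's infpack is said to be implemented this way. It drops much of the descriptive information and gets the best performance and compression ratio, but backward compatibility requires some extra work from developers. The details are not known to the author.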

Thrift compared with common data interchange formats

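Format	Type	Pros	Cons
XML	Text	Human-readable	Verbose; no native support for binary data
JSON	Text	Human-readable	Drops type information (for example, with "score":100 it is ambiguous whether score is an int or a double); no native support for binary data
Java Serializable	Binary	Simple to use	Verbose; limited to Java
Thrift	Binary	Efficient	Not human-readable; backward compatibility depends on following certain conventions
Google Protobuf	Binary	Efficient	Not human-readable; backward compatibility depends on following certain conventions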

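Backward compatibility in practice: the official Thrift documentation also recommends that newly added fields be given incrementing ids and be marked optional.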
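The following sketch illustrates why this convention keeps old readers working. It is a hand-written illustration and does not use the Thrift library: it assumes the TBinaryProtocol struct layout (1-byte type, 2-byte id, then the value, with a single STOP byte at the end), and the third field (3: optional i32 ttl) as well as the names SkipUnknownFieldSketch, encodeV2, and decodeV1 are hypothetical. The "old" reader knows only ids 1 and 2 and skips any other field according to its type.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SkipUnknownFieldSketch {
    static final byte T_STOP = 0;    // end-of-struct marker
    static final byte T_I32 = 8;     // type code for 32-bit integers
    static final byte T_STRING = 11; // type code for strings

    static void putString(ByteBuffer out, int id, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.put(T_STRING).putShort((short) id).putInt(utf8.length).put(utf8);
    }

    // "New" writer: the schema gained a hypothetical field `3: optional i32 ttl`.
    static byte[] encodeV2(String key, String value, int ttl) {
        ByteBuffer out = ByteBuffer.allocate(256); // large enough for this small example
        putString(out, 1, key);
        putString(out, 2, value);
        out.put(T_I32).putShort((short) 3).putInt(ttl);
        out.put(T_STOP);
        byte[] bytes = new byte[out.position()];
        out.flip();
        out.get(bytes);
        return bytes;
    }

    // "Old" reader: understands ids 1 and 2 only; returns {key, value}.
    static String[] decodeV1(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        String key = null;
        String value = null;
        while (true) {
            byte type = in.get();
            if (type == T_STOP) {
                break;
            }
            short id = in.getShort();
            if (type == T_STRING && (id == 1 || id == 2)) {
                byte[] utf8 = new byte[in.getInt()];
                in.get(utf8);
                String s = new String(utf8, StandardCharsets.UTF_8);
                if (id == 1) {
                    key = s;
                } else {
                    value = s;
                }
            } else if (type == T_STRING) {
                int len = in.getInt();          // unknown string field: skip its bytes
                in.position(in.position() + len);
            } else if (type == T_I32) {
                in.getInt();                    // unknown i32 field: skip 4 bytes
            } else {
                throw new IllegalStateException("type not handled in this sketch: " + type);
            }
        }
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        String[] pair = decodeV1(encodeV2("rowkey", "column-family", 60));
        System.out.println("key => " + pair[0]);
        System.out.println("value => " + pair[1]);
    }
}
```

Since the new field uses a fresh id, the old reader never confuses it with an existing field, and since it is optional, a new reader does not fail when old data lacks it.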
