Schema evolution in Avro, Protocol Buffers and Thrift
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
When you want to store data — objects, say — in a file or send it over the network, you run into the problem of serialization.
Every language ships its own serialization facility: Java serialization, Ruby's Marshal, Python's pickle, and so on.
That works fine until you need to cross platform and language boundaries, at which point JSON or XML becomes the usual choice.
If you cannot stand the verbosity of JSON or XML, or the cost of parsing them, you might be tempted to invent your own binary encoding of JSON.
There is no need to reinvent that wheel: Thrift, Protocol Buffers and Avro already provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.
So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:
Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.
In practice, data is always changing, so schemas keep evolving. Thrift, Protobuf and Avro all support schema evolution, so a schema change on the client or server side can be rolled out with minimal disruption to the running service.
In real life, data is always in flux. The moment you think you have finalised a schema, someone will come up with a use case that wasn’t anticipated, and wants to “just quickly add a field”. Fortunately Thrift, Protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you’re dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.
The focus of this post is to compare exactly how Thrift, Protobuf and Avro serialize data into binary, and how each supports schema evolution.
The example I will use is a little object describing a person. In JSON I would write it like this:
{
  "userName": "Martin",
  "favouriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.
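The 82-byte figure is easy to verify; a quick sketch in Python, keeping the same field order as above:

```python
import json

person = {
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") strips all whitespace from the output
compact = json.dumps(person, separators=(",", ":"))
print(len(compact.encode("utf-8")))  # 82
```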
Protocol Buffers
The Protocol Buffers schema for the person object might look something like this:
message Person {
  required string user_name = 1;
  optional int64 favourite_number = 2;
  repeated string interests = 3;
}

Protocol Buffers uses an IDL to describe the Person schema. Each field carries a unique tag that identifies it on the wire, so "= 1", "= 2", "= 3" are not assignments — they declare each field's tag.
Each field is also marked optional, required or repeated.
When we encode the data above using this schema, it uses 33 bytes, as follows:
The diagram above clearly shows how the 82-byte JSON turns into a 33-byte binary encoding.
During serialization only the tag is recorded, never the field name, so a field can be renamed freely — but its tag must never change.
The first byte of each field records its tag and type; the data itself follows, with a length prefix in the case of strings.
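That layout is simple enough to reproduce by hand. A minimal sketch of the Protobuf wire format in Python (no protobuf library involved), encoding the example record:

```python
def varint(n):
    # Protobuf base-128 varint: 7 bits per byte, high bit = continuation
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def key(field_number, wire_type):
    # The "tag" byte(s): field number shifted left 3 bits, OR'd with the wire type
    return varint((field_number << 3) | wire_type)

VARINT, LEN = 0, 2  # wire types: varint, length-delimited

msg = b"".join([
    key(1, LEN), varint(6), b"Martin",        # user_name = "Martin"
    key(2, VARINT), varint(1337),             # favourite_number = 1337
    key(3, LEN), varint(11), b"daydreaming",  # interests[0]
    key(3, LEN), varint(7), b"hacking",       # interests[1]
])
print(len(msg))  # 33 bytes, matching the article
```

Note how repeated fields are simply the same tag emitted once per element, and how nothing in the bytes says "optional" or "required".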
Notice that the encoding does not explicitly record whether a field is optional, required or repeated.
When decoding, required fields get a validation check; optional and repeated fields may simply be absent from the encoded data.
So an optional or repeated field can simply be deleted from the schema, for instance on the client side. Be careful, though: the tag of a deleted field must never be reused.
Changing a required field is riskier: if the client deletes a required field, the server's validation check will fail.
Adding a field is always safe, as long as it gets a fresh tag.
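Later versions of the Protocol Buffers language (after this post was written) added a reserved statement to enforce exactly that never-reuse rule in the schema itself; a sketch:

```protobuf
message Person {
  reserved 2;  // favourite_number was removed; tag 2 can never come back
  required string user_name = 1;
  repeated string interests = 3;
}
```

With this in place, the compiler rejects any attempt to declare a new field with tag 2.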
Thrift
Thrift is a much bigger project than Avro or Protocol Buffers, as it’s not just a data serialization library, but also an entire RPC framework.
It also has a somewhat different culture: whereas Avro and Protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls “protocols”).
Thrift is more ambitious: it is not just a data serialization library but an entire RPC framework, with a complete RPC protocol stack.
Its "protocol" abstraction also means it is not tied to a single binary encoding; different protocols can implement entirely different encodings.
Thrift's IDL looks a lot like Protobuf's. The differences are that field tags are written "1:" rather than "= 1", and that instead of repeated fields Thrift has explicit container types such as list<string>.
All the encodings share the same schema definition, in Thrift IDL:
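The IDL appears as an image in the original post; reconstructed from the fields described above, it would look roughly like this:

```thrift
struct Person {
  1: required string       userName,
  2: optional i64          favouriteNumber,
  3: optional list<string> interests
}
```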
The BinaryProtocol encoding is very straightforward, but also fairly wasteful (it takes 59 bytes to encode our example record):
The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes:
As mentioned, Thrift wraps its encodings in "protocols", and for binary encoding there are two choices.
The first, BinaryProtocol, does no space optimization at all; it is quite wasteful, taking 59 bytes.
The second, CompactProtocol, is similar in spirit to Protobuf's encoding. One difference is that Thrift is a bit more flexible: it directly supports container types, such as the list here, whereas Protobuf models simple collections through repeated fields (Thrift defines an explicit list type rather than Protobuf's repeated field approach).
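The "variable-length integers" that CompactProtocol uses combine two tricks: zigzag encoding, which maps signed integers to unsigned ones so that small magnitudes stay small, and the same base-128 varint scheme as Protobuf. A sketch in Python:

```python
def zigzag(n):
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... (64-bit variant)
    return (n << 1) ^ (n >> 63)

def varint(n):
    # base-128 varint, least-significant group first, high bit = continuation
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

encoded = varint(zigzag(1337))
print(len(encoded))  # 2 bytes instead of the 8 a fixed-width i64 would take
```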
Avro
Avro schemas can be written in two ways, either in a JSON format:
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",        "type": "string"},
    {"name": "favouriteNumber", "type": ["null", "long"]},
    {"name": "interests",       "type": {"type": "array", "items": "string"}}
  ]
}

…or in an IDL:
record Person {
  string userName;
  union { null, long } favouriteNumber;
  array<string> interests;
}

Notice that there are no tag numbers in the schema! So how does it work?
Here is the same example data encoded in just 32 bytes:
Avro is the newest of the three and still has relatively few users, mostly in the Hadoop world. Its design is also quite distinctive compared with Thrift and Protobuf.
First, the schema can be defined either in an IDL or in JSON — and note that the binary encoding stores neither field tags nor field types.
This means:
1. A reader parsing the data must have the exact schema the data was written with.
2. With no field tags, the field name is the only identifier. Avro does support renaming a field, but all readers must be updated first, as follows:
Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader’s schema). Then you can update the writer’s schema to use the new field name.
3. Data is read back in the order the fields are declared in the schema, so an optional field needs special treatment — the union { null, long } in the example:
if you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
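Because there are no tags and no type markers, the whole record is just values back to back. It can be reproduced by hand in Python (Avro encodes lengths, longs and union branch indices all as zigzag varints):

```python
def zigzag_varint(n):
    z = (n << 1) ^ (n >> 63)  # zigzag, then base-128 varint
    out = bytearray()
    while True:
        out.append((z & 0x7F) | (0x80 if z > 0x7F else 0))
        z >>= 7
        if not z:
            return bytes(out)

def avro_string(s):
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data  # length, then UTF-8 bytes

record = b"".join([
    avro_string("Martin"),      # userName: no tag, no type info at all
    zigzag_varint(1),           # union branch 1 = "long" (branch 0 is "null")
    zigzag_varint(1337),        # favouriteNumber
    zigzag_varint(2),           # array block: 2 items follow
    avro_string("daydreaming"),
    avro_string("hacking"),
    zigzag_varint(0),           # 0 = end of array blocks
])
print(len(record))  # 32 bytes, matching the article
```

The first byte is just the string length of "Martin" — without the writer's schema there is no way to know that, which is why the exact schema must accompany the data.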
4. The schema can stay as plain JSON, whereas with Thrift or Protobuf the IDL must be compiled into concrete code. This makes generic Avro clients and servers possible: when the schema changes, only the JSON needs updating, with no recompilation.
When a schema changes, Avro's handling is simpler: you just distribute the new schema to every reader.
With Thrift or Protobuf, a schema change means regenerating and recompiling client and server code, although both tolerate version mismatches between the two sides reasonably well.
5. The writer's schema and the reader's schema need not match exactly: the Avro parser applies resolution rules to translate data from one to the other.
So how does Avro support schema evolution?
Well, although you need to know the exact schema with which the data was written (the writer’s schema), that doesn’t have to be the same as the schema the consumer is expecting (the reader’s schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.
6. Adding and removing fields is straightforward:
You can add a field to a record, provided that you also give it a default value (e.g. null if the field’s type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.
Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.
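As a sketch, adding such a field to the Person schema might look like this (the twitterHandle field is invented for illustration):

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",        "type": "string"},
    {"name": "favouriteNumber", "type": ["null", "long"]},
    {"name": "interests",       "type": {"type": "array", "items": "string"}},
    {"name": "twitterHandle",   "type": ["null", "string"], "default": null}
  ]
}
```

Note that Avro requires the default value to match the first branch of a union, which is why the union is written ["null", "string"] with default null rather than ["string", "null"].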
One important question remains: Avro depends on its JSON schema, so when and how does the schema travel between writer and reader?
The answer: it depends on the context — in a file header, during a connection handshake, and so on.
This leaves us with the problem of knowing the exact schema with which a given record was written.
The best solution depends on the context in which your data is being used:
- In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
- In an RPC context, it’s probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
- If you’re storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number.
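A sketch of that last idea in Python (the real Avro specification defines a Parsing Canonical Form and a 64-bit Rabin fingerprint; sorted-key JSON plus SHA-256 here is just a stand-in for illustration):

```python
import hashlib
import json

def schema_fingerprint(schema_json: str) -> str:
    # Normalise the schema so equivalent schemas hash identically,
    # then take a short hex fingerprint to use as a registry key.
    canonical = json.dumps(json.loads(schema_json),
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

v1 = '{"type":"record","name":"Person","fields":[]}'
v1_spaced = '{ "name": "Person", "type": "record", "fields": [] }'
assert schema_fingerprint(v1) == schema_fingerprint(v1_spaced)
```

Each record is then annotated with this fingerprint, and the registry maps fingerprints back to full schema definitions.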
Compared with Thrift and Protobuf, Avro looks more complex and harder to use, but it has the following advantages:
At first glance it may seem that Avro’s approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas.
However, I am beginning to think that Avro’s approach also has some distinct advantages:
- Object container files are wonderfully self-describing: the writer schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it Just Works? without any configuration.
- As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
- A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can’t parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it’s not required for operation, it’ll end up being on a best-effort basis.
Reposted from: https://www.cnblogs.com/fxjwind/archive/2013/05/14/3078041.html