DEV Community

Leapcell

Why Protobuf Should Dominate the Data Format Ecosystem

Leapcell: The Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis

In-depth Understanding of Protobuf

What is Protobuf

Protobuf (Google Protocol Buffers) is, according to the official documentation, a language-independent, platform-independent, and extensible method for serializing structured data. It is widely used for data communication protocols and data storage. Google provides it as a library and toolchain built around an efficient data exchange format, offering flexible, efficient, and automated serialization of structured data.

Compared with XML, data encoded with Protobuf is smaller and is encoded and decoded faster. Compared with JSON, Protobuf is also more efficient: its time and space efficiency are commonly put at roughly 3 to 5 times that of JSON, although the actual gain depends on the data and the implementation.

As the official description states: “Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.”

Comparison of Data Formats

Suppose we have a person object, represented by JSON, XML, and Protobuf respectively, and let's see their differences.

XML Format

<person>
    <name>John</name>
    <age>24</age>
</person>

JSON Format

{
    "name":"John",
    "age":24
}

Protobuf Format

Protobuf represents data directly in binary, which is not as intuitive as XML or JSON. For example, assuming name is field 1 (a string) and age is field 2 (an int32), the person above is encoded as the following bytes, shown here in decimal:

[10 4 74 111 104 110 16 24]

Advantages of Protobuf

Good Performance/High Efficiency

  • Time Overhead: The overhead of XML formatting (serialization) is acceptable, but the overhead of XML parsing (deserialization) is relatively large. Protobuf has optimized this aspect and can significantly reduce the time overhead of serialization and deserialization.
  • Space Overhead: Protobuf also greatly reduces the space occupation.

Code Generation Mechanism

For example, write the following definition (proto2 syntax), which resembles a struct:

message testA  
{  
    required int32 m_testA = 1;  
}

From this definition, the Protobuf compiler (protoc) can automatically generate a C++ header (.pb.h) and source file (.pb.cc) that wrap the operations on testA in a class.

Support for Backward Compatibility and Forward Compatibility

When the client and the server share a protocol and one side adds a new field, the side still using the old definition keeps working: it ignores the fields it does not know, and fields missing from old data take their default values.

Support for Multiple Programming Languages

The source code officially released by Google supports multiple programming languages, such as:

  • C++
  • C#
  • Dart
  • Go
  • Java
  • Kotlin
  • Python

Disadvantages of Protobuf

Poor Readability Due to Binary Format

To improve performance, Protobuf encodes data in a binary format. This makes the data hard to read and slows down debugging during development and testing. Under normal circumstances, however, Protobuf is very reliable, and serious problems rarely occur.

Lack of Self-description

Generally, XML is self-descriptive, while the Protobuf format is not. It is a piece of binary format protocol content, and it is difficult to know its function without matching it with a pre-written structure.

Poor Universality

Although Protobuf supports serialization and deserialization in multiple languages, it is not a universal transmission standard across platforms and languages. When passing messages between multiple platforms, it integrates poorly with projects that do not already use it, and adaptation work is often required. Compared with JSON and XML, it is less universal.

Usage Guide

Defining Message Types

Proto message type files generally end with .proto. In a .proto file, one or more message types can be defined.

The following is an example of defining a message type for a search query. The syntax at the beginning of the file is used to describe the version information. Currently, there are two versions of proto, proto2 and proto3.

syntax="proto3";

This statement explicitly sets the syntax to proto3; if no syntax is set, protoc defaults to proto2. The syntax = "proto3" statement must be the first line of the .proto file that is neither blank nor a comment.

The following message contains 3 fields (query, page_number, result_per_page): query is the content to search for, page_number is the page of results requested, and result_per_page is the number of items per page. Each field has a type, a field name, and a field number. The field type can be string, int32, enum, or a composite type.

syntax = "proto3";

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

Field Numbers

Each field in a message type needs a unique number, which identifies the field in the binary data. The number is stored together with the wire type in a tag: for numbers in the range [1, 15] the tag takes one byte, and for numbers in the range [16, 2047] it takes two bytes. Reserving the numbers 1 to 15 for frequently occurring fields therefore saves space. The minimum field number is 1, and the maximum is 2^29 - 1 = 536870911. Numbers in the range [19000, 19999] cannot be used because they are reserved for the Protocol Buffers implementation. Field numbers that have been declared as reserved cannot be used either.
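The one-byte and two-byte boundaries follow from how the tag is built. As a rough sketch (tagSize is a helper name introduced here for illustration, not a library function), the tag is the field number shifted left by 3 bits, and each encoded byte carries 7 bits of it:

```go
package main

import "fmt"

// tagSize returns how many bytes the Varint-encoded tag occupies.
// The tag is (fieldNumber << 3) | wireType, and each Varint byte carries 7 bits.
func tagSize(fieldNumber uint32) int {
	tag := uint64(fieldNumber) << 3 // the wire type only fills the low 3 bits
	size := 1
	for tag >= 0x80 {
		tag >>= 7
		size++
	}
	return size
}

func main() {
	for _, n := range []uint32{1, 15, 16, 2047, 2048, 536870911} {
		fmt.Printf("field %d -> %d-byte tag\n", n, tagSize(n))
	}
}
```

Field 15 still fits in one byte because 15 << 3 is 120, which is below 128. Field 16 gives 128 and needs a second byte.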

Field Rules

Each field is either singular or repeated. In the proto3 syntax, singular is not a keyword: a field declared without a rule is singular by default.

  • singular: The field appears at most once, that is, 0 or 1 time.
  • repeated: The field can appear any number of times, including 0. In the proto3 syntax, repeated fields of scalar numeric types use the packed encoding by default.

Comments

You can add comments to the .proto file. The comment syntax is the same as the C/C++ style, using // or /* ... */.

/* SearchRequest represents a search query, with pagination options to
 * indicate which results to include in the response. */

message SearchRequest {
  string query = 1;
  int32 page_number = 2;  // Which page number do we want?
  int32 result_per_page = 3;  // Number of results to return per page.
}

Reserved Fields

When a field is deleted or commented out, a later developer may reuse its field number when updating the message definition. If old versions of the .proto file are still in use somewhere, this can cause serious problems such as data corruption. To avoid this, declare the deleted field numbers and field names as reserved. If someone later tries to use them, protoc reports an error at compile time.

Note: Field names and field numbers cannot be mixed in the same reserved statement.

message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}

Mapping between Field Types and Language Types

The defined .proto file can generate Go language code through a generator. For example, the Go file generated from the a.proto file is the a.pb.go file.

The mapping between basic proto types and language types is shown in the following table. Only the Go and C++ mappings are listed; for other languages, refer to https://developers.google.com/protocol-buffers/docs/proto3.
|.proto Type | Go Type | C++ Type |
| ---- | ---- | ---- |
| double | float64 | double |
| float | float32 | float |
| int32 | int32 | int32 |
| int64 | int64 | int64 |
| uint32 | uint32 | uint32 |
| uint64 | uint64 | uint64 |
| sint32 | int32 | int32 |
| sint64 | int64 | int64 |
| fixed32 | uint32 | uint32 |
| fixed64 | uint64 | uint64 |
| sfixed32 | int32 | int32 |
| sfixed64 | int64 | int64 |
| bool | bool | bool |
| string | string | string |
| bytes | []byte | string |

Default Values

| .proto Type | Default Value |
| ---- | ---- |
| string | "" (empty string) |
| bytes | empty bytes |
| bool | false |
| numeric types | 0 |
| enums | first defined enum value, which must be 0 |

Enum Types

When defining a message, if you want the value of a field to be only one of the expected values, you can use the enum type.

For example, now add the corpus field to SearchRequest, and its value can only be one of UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS, and VIDEO. This can be achieved by adding an enum to the message definition and adding a constant for each possible enum value.

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
    IMAGES = 2;
    LOCAL = 3;
    NEWS = 4;
    PRODUCTS = 5;
    VIDEO = 6;
  }
  Corpus corpus = 4;
}

Every enum definition must include a constant mapped to 0, and that constant must be the first one in the definition, as UNIVERSAL is in Corpus. There are two reasons: 0 is used as the default value of the enum, and in the proto2 syntax the first enum value is always the default, so keeping 0 first preserves compatibility with proto2.

Importing Other Protos

Other .proto files can be imported in a .proto file, so as to use the message types defined in the imported file.

import "myproject/other_protos.proto";

By default, only the message types defined in a directly imported .proto file can be used. Sometimes, however, a .proto file has to move to a new location. Instead of moving the file and updating every import site at once, you can leave a placeholder .proto file at the old location and use the import public syntax to forward all imports to the new location. Any file that imports a proto containing an import public statement can also use the definitions from those public dependencies.

For example, there are a.proto and b.proto files in the current folder, and b.proto is imported in the a.proto file, that is, the a.proto file has the following content:

import "b.proto";

Suppose we now want to move the messages in b.proto into the common/com.proto file so that they can be used elsewhere. We can modify b.proto to import com.proto. Note that this must be an import public: with a plain import, a.proto could use only the messages defined in b.proto itself, not the message types from the files that b.proto imports.

// b.proto file, move the message definitions inside to the common/com.proto file,
// add the following import statement inside
import public "common/com.proto";

When using protoc for compilation, the option -I or --proto_path needs to be used to notify protoc where to find the imported files. If the search path is not specified, protoc will look for it in the current directory (the path where protoc is called).

Message types in the proto2 version can be imported into a proto3 file for use, and message types in the proto3 version can also be imported into a proto2 file. But the enum types in proto2 cannot be directly applied to the proto3 syntax.

Nested Messages

Message types can be defined inside another message type, that is, nested definitions. For example, the Result type is defined inside SearchResponse, and it supports multiple levels of nesting.

message SearchResponse {
  message Result {
    string url = 1;
    string title = 2;
    repeated string snippets = 3;
  }
  repeated Result results = 1;
}

To use a nested message type outside the message that defines it, refer to it as Parent.Type. For example, SomeOtherMessage uses Result through SearchResponse.Result.

message SomeOtherMessage {
  SearchResponse.Result result = 1;
}

Unknown Fields

Unknown fields are fields present in the serialized data that the parser does not recognize. For example, when an old binary parses data sent by a new binary that has added fields, those new fields become unknown fields in the old binary. In the initial version of proto3, unknown fields were discarded when the message was parsed, but version 3.5 reintroduced their retention: unknown fields are kept during parsing and included in the serialized output.
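An old reader can step over fields it does not know because the wire type in each tag says how long the value is. The following is a minimal sketch of that idea, not the algorithm used by any real Protobuf runtime. The function unknownFieldNumbers and the sample bytes are assumptions made for illustration; only binary.Uvarint from the Go standard library is a real API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// unknownFieldNumbers walks a serialized message and returns the numbers of
// all fields that are not listed in known. Only wire types 0, 1, 2 and 5 are handled.
func unknownFieldNumbers(data []byte, known map[uint64]bool) ([]uint64, error) {
	var unknown []uint64
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("malformed tag")
		}
		data = data[n:]
		fieldNumber, wireType := tag>>3, tag&7

		// The wire type alone tells the reader how many bytes to skip.
		var skip uint64
		switch wireType {
		case 0: // varint
			_, m := binary.Uvarint(data)
			if m <= 0 {
				return nil, errors.New("malformed varint")
			}
			skip = uint64(m)
		case 1: // 64-bit
			skip = 8
		case 2: // length-delimited: length prefix plus payload
			length, m := binary.Uvarint(data)
			if m <= 0 {
				return nil, errors.New("malformed length")
			}
			skip = uint64(m) + length
		case 5: // 32-bit
			skip = 4
		default:
			return nil, fmt.Errorf("unsupported wire type %d", wireType)
		}
		if skip > uint64(len(data)) {
			return nil, errors.New("truncated field")
		}
		data = data[skip:]
		if !known[fieldNumber] {
			unknown = append(unknown, fieldNumber)
		}
	}
	return unknown, nil
}

func main() {
	// name = "John" (field 1), age = 24 (field 2), plus a field 3 string "x"
	// that an older reader has never heard of.
	msg := []byte{10, 4, 74, 111, 104, 110, 16, 24, 26, 1, 120}
	unknown, err := unknownFieldNumbers(msg, map[uint64]bool{1: true, 2: true})
	fmt.Println(unknown, err)
}
```

Running it prints `[3] <nil>`: the reader that knows only fields 1 and 2 parses the whole message and reports field 3 as unknown without failing.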

Encoding Principle

TLV Encoding Format

The key to the high efficiency of Protobuf lies in its TLV (tag-length-value) encoding format. Each field has a unique tag value as an identifier, length represents the length of the value data (for a value with a fixed length, there is no length), and value is the content of the data itself.

The tag value is composed of two parts: field_number and wire_type. field_number is the number assigned to each field in the message definition, and wire_type tells the parser how the value is laid out (for example, fixed length or variable length). The wire_type currently has 6 values from 0 to 5, which can be represented by 3 bits.

The values of wire_type are shown in the following table. Types 3 and 4 are deprecated, so only the remaining 4 matter. Data encoded with Varint does not need a stored length, so the TLV format degenerates into TV. The 64-bit and 32-bit types need no length either, because the wire type already indicates whether the value is 8 bytes or 4 bytes long.

| wire_type | Encoding Method | Encoding Length | Storage Method | Data Type |
| ---- | ---- | ---- | ---- | ---- |
| 0 | Varint | Variable length | T-V | int32, int64, uint32, uint64, bool, enum |
| 0 | Zigzag + Varint | Variable length | T-V | sint32, sint64 |
| 1 | 64-bit | Fixed 8 bytes | T-V | fixed64, sfixed64, double |
| 2 | Length-delimited | Variable length | T-L-V | string, bytes, packed repeated fields, embedded messages |
| 3 | Start group | Deprecated | Deprecated | |
| 4 | End group | Deprecated | Deprecated | |
| 5 | 32-bit | Fixed 4 bytes | T-V | fixed32, sfixed32, float |

Varint Encoding Principle

Varint is a variable-length encoding for integers. It represents smaller numbers with fewer bytes and so compresses the data. An int32 normally takes 4 bytes, but with Varint encoding an int32 value less than 128 takes 1 byte. Larger values may take 5 bytes, but very large numbers are rare in most messages, so Varint encoding usually saves space.

Varint uses the highest bit of each byte as a continuation flag. If the highest bit is 1, the next byte is also part of the number; if it is 0, this is the last byte. The remaining 7 bits of each byte carry the number itself, with the least significant 7-bit group first. Each byte therefore spends 1 bit on the flag (1/8 = 12.5% overhead), but a large amount of space is still saved when many numbers do not need a fixed 4 bytes.

For example, the int32 value 65 is 1000001 in binary and fits in 7 bits, so it is encoded as the single byte 01000001. The 65 that originally occupied 4 bytes occupies only 1 byte after encoding.

The int32 value 128 is 10000000 in binary and needs 8 bits, so it is split into two 7-bit groups: 0000000 (low) and 0000001 (high). The low group is written first with the continuation bit set, giving the 2 bytes 10000000 00000001.

Varint decoding is the reverse process of encoding, which is relatively simple, and no example is given here.
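For readers who do want to see both directions, here is a small sketch of Varint encoding and decoding for unsigned values. The function names encodeVarint and decodeVarint are chosen for this example; the Go standard library offers equivalent helpers in encoding/binary.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeVarint splits v into 7-bit groups, least significant group first,
// and sets the highest bit on every byte except the last one.
func encodeVarint(v uint64) []byte {
	var out []byte
	for v >= 0x80 {
		out = append(out, byte(v&0x7F)|0x80) // more bytes follow
		v >>= 7
	}
	return append(out, byte(v)) // highest bit 0 marks the final byte
}

// decodeVarint reverses encodeVarint and reports how many bytes it consumed.
func decodeVarint(buf []byte) (uint64, int, error) {
	var v uint64
	for i, b := range buf {
		if i == 10 {
			break // a 64-bit value never needs more than 10 bytes
		}
		v |= uint64(b&0x7F) << (7 * uint(i))
		if b&0x80 == 0 {
			return v, i + 1, nil
		}
	}
	return 0, 0, errors.New("varint is truncated or too long")
}

func main() {
	for _, n := range []uint64{65, 128, 300} {
		enc := encodeVarint(n)
		dec, size, _ := decodeVarint(enc)
		fmt.Printf("%d -> %v -> %d (%d bytes)\n", n, enc, dec, size)
	}
}
```

With these inputs, 65 encodes to [65], 128 to [128 1], and 300 to [172 2], and each decodes back to the original value.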

Zigzag Encoding

Zigzag encoding maps signed numbers to unsigned numbers, and then Varint encoding is applied to reduce the number of bytes after encoding. Without it, a negative int32 or int64 is sign-extended and always takes 10 bytes as a Varint.

Zigzag uses unsigned numbers to represent signed numbers, so that numbers with small absolute values take fewer bytes. Before looking at Zigzag encoding, recall three binary representations of signed integers:

  • Sign-magnitude (original code): The highest bit is the sign bit, and the remaining bits represent the absolute value.
  • One's complement: For positive numbers, it is the same as sign-magnitude; for negative numbers, every bit except the sign bit is inverted.
  • Two's complement: For positive numbers, it is the same as sign-magnitude; for negative numbers, every bit except the sign bit is inverted and then 1 is added.

Take the int32 value -2 as an example. Its two's complement is 11111111 11111111 11111111 11111110. Shifting left by 1 gives 11111111 11111111 11111111 11111100, and the arithmetic right shift by 31 gives 32 one bits. XORing the two yields 00000000 00000000 00000000 00000011, so -2 is encoded as 3.

In summary, Zigzag operates on the two's complement of the number. For a number n of the sint32 type, compute (n << 1) ^ (n >> 31); for the sint64 type, compute (n << 1) ^ (n >> 63). The right shift is arithmetic, so it fills the result with copies of the sign bit. This maps 0, -1, 1, -2, 2 to 0, 1, 2, 3, 4, turning every negative number into a small non-negative one. The result is then Varint encoded.

Since Varint and Zigzag encoding can self-parse the content length, the length item can be omitted, and the TLV storage is simplified to TV storage, without the need for the length item.
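The Zigzag formula for sint32 translates almost directly into Go, where >> on a signed integer is an arithmetic shift. This is an illustrative sketch; zigzagEncode32 and zigzagDecode32 are names made up for the example.

```go
package main

import "fmt"

// zigzagEncode32 maps a signed value to an unsigned one: 0, -1, 1, -2, 2 ...
// become 0, 1, 2, 3, 4 ... The right shift is arithmetic, so n >> 31 is all
// ones for negative n and all zeros otherwise.
func zigzagEncode32(n int32) uint32 {
	return uint32((n << 1) ^ (n >> 31))
}

// zigzagDecode32 is the inverse mapping: the lowest bit holds the sign and
// the remaining bits hold the magnitude information.
func zigzagDecode32(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}

func main() {
	for _, n := range []int32{0, -1, 1, -2, 2} {
		u := zigzagEncode32(n)
		fmt.Printf("%d -> %d -> %d\n", n, u, zigzagDecode32(u))
	}
}
```

After this step, -2 becomes 3 and fits in a single Varint byte instead of ten.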

Calculation Methods of tag and value Values

tag

The tag stores the identification and type information of the field, that is, it combines wire_type (the field's encoding type) with field_number (the identification number). The field number recovered from the tag maps back to a field defined in the message. The formula is tag = (field_number << 3) | wire_type, and the result is then Varint encoded.

value

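For wire type 0, the value is the field value after Varint encoding (preceded by Zigzag encoding for sint32 and sint64). Fixed-width types are stored as 4 or 8 raw bytes, and length-delimited types are stored as a length followed by the payload.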
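The tag formula and its inverse can be sketched as two one-line helpers. The names makeTag and splitTag are hypothetical and exist only for this example.

```go
package main

import "fmt"

// makeTag packs a field number and a wire type into one tag value.
// In Go, << binds tighter than |, so this is (fieldNumber << 3) | wireType.
func makeTag(fieldNumber, wireType uint32) uint32 {
	return fieldNumber<<3 | wireType
}

// splitTag recovers the field number and wire type from a tag value.
func splitTag(tag uint32) (fieldNumber, wireType uint32) {
	return tag >> 3, tag & 0x7
}

func main() {
	fmt.Println(makeTag(1, 2)) // string field 1, length-delimited
	fmt.Println(makeTag(2, 0)) // int32 field 2, varint
	fmt.Println(splitTag(26))  // tag read from the wire
}
```

The first two calls give 10 and 16, and splitting 26 gives field number 3 with wire type 2. Tags below 128 occupy a single byte once Varint encoded.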

string Encoding

When the field type is the string type, the field value is encoded in UTF-8. For example, there is the following message definition:

message stringEncodeTest {
  string test = 1;
}

In the Go language, the sample code for encoding this message is as follows:

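// Assumes the imports "fmt" and "google.golang.org/protobuf/proto", plus the
// protoc-gen-go output for the message above, imported as api (the import
// path depends on your module).
func stringEncodeTest() {
	vs := &api.StringEncodeTest{
		Test: "English",
	}
	// Marshal serializes the message into the Protobuf wire format.
	data, err := proto.Marshal(vs)
	if err != nil {
		fmt.Println(err)
		return
	}
	// %v prints each byte as a decimal value.
	fmt.Printf("%v\n", data)
}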

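The encoded bytes, printed in decimal, are as follows. The first byte 10 is the tag (field 1, wire type 2), 7 is the length in bytes, and the remaining bytes are the UTF-8 encoding of "English":

[10 7 69 110 103 108 105 115 104]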
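The same layout can be reproduced without generated code. The sketch below hand-encodes a string field as tag, length, and UTF-8 bytes; appendVarint and appendStringField are helper names assumed for this example and are not part of the Protobuf library.

```go
package main

import "fmt"

// appendVarint appends v in Varint form.
func appendVarint(buf []byte, v uint64) []byte {
	for v >= 0x80 {
		buf = append(buf, byte(v&0x7F)|0x80)
		v >>= 7
	}
	return append(buf, byte(v))
}

// appendStringField appends tag, length and UTF-8 bytes for a string field.
func appendStringField(buf []byte, fieldNumber uint64, s string) []byte {
	buf = appendVarint(buf, fieldNumber<<3|2) // wire type 2: length-delimited
	buf = appendVarint(buf, uint64(len(s)))   // length in bytes, not characters
	return append(buf, s...)
}

func main() {
	fmt.Println(appendStringField(nil, 1, "English"))
	fmt.Println(appendStringField(nil, 1, "é")) // one character, two UTF-8 bytes
}
```

The first line printed is [10 7 69 110 103 108 105 115 104]. The second is [10 2 195 169], which shows that the length counts UTF-8 bytes rather than characters.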

Encoding of Nested Types

A nested message is a field whose value is itself a message. The outer message is stored as TLV, and the value of the nested field is again a sequence of TLV entries. The encoding can be pictured as a tree, where the outer message is the root, each nested message is a child node, and every node follows the TLV rule:

  1. The outermost message has its corresponding tag, length (if any), and value.
  2. When the value is a nested message, this nested message has its own independent tag, length (if any), and value.
  3. By analogy, if there are nested messages within the nested message, continue to encode according to the TLV rule.
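The steps above can be sketched by hand for the SearchResponse and Result types defined earlier. This is an illustration under assumptions: appendVarint and appendBytesField are helpers written for this example, and the URL value "a.io" is made up.

```go
package main

import "fmt"

// appendVarint appends v in Varint form.
func appendVarint(buf []byte, v uint64) []byte {
	for v >= 0x80 {
		buf = append(buf, byte(v&0x7F)|0x80)
		v >>= 7
	}
	return append(buf, byte(v))
}

// appendBytesField appends a length-delimited field (wire type 2):
// tag, payload length, then the payload itself.
func appendBytesField(buf []byte, fieldNumber uint64, payload []byte) []byte {
	buf = appendVarint(buf, fieldNumber<<3|2)
	buf = appendVarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

func main() {
	// Result { url = "a.io" }: url is field 1 of the nested message.
	result := appendBytesField(nil, 1, []byte("a.io"))
	// SearchResponse { results = [result] }: the whole encoded Result becomes
	// the value of field 1 of the outer message.
	response := appendBytesField(nil, 1, result)
	fmt.Println(result)
	fmt.Println(response)
}
```

The inner message prints as [10 4 97 46 105 111], and the outer message wraps it as [10 6 10 4 97 46 105 111]: an outer tag and length, followed by the inner TLV unchanged.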

repeated Fields with packed

A repeated field can be encoded with or without packed. All values of the same repeated field share the same tag, that is, the same wire type and field number. Storing each value as its own TV entry therefore repeats the tag once per value.

With packed = true, the storage of the repeated field is optimized: the tag is stored only once, followed by the total length of all values, forming a T-L-V-V-... structure. This shortens the serialized data and saves transmission overhead. In proto2, packed must be requested explicitly; in proto3, repeated fields of scalar numeric types are packed by default and [packed = false] opts out. For example, in proto2:

message repeatedEncodeTest {
  // Method 1, without packed (the proto2 default)
  repeated int32 cat = 1;
  // Method 2, with packed
  repeated int32 dog = 2 [packed = true];
}

In the above example, the cat field does not use packed, so each cat value is stored with its own tag and value. The dog field uses packed, so the tag is stored only once, followed by the total length of all dog values and then the values in sequence. When the data volume is large, a packed repeated field can significantly reduce both the size of the data and the bandwidth consumed in transmission.
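To make the difference concrete, the sketch below encodes the same three values both ways. It is a hand-written illustration, not library code: encodeUnpacked and encodePacked are assumed names, and the values are assumed to be non-negative so that they can be treated as plain unsigned Varints.

```go
package main

import "fmt"

// appendVarint appends v in Varint form.
func appendVarint(buf []byte, v uint64) []byte {
	for v >= 0x80 {
		buf = append(buf, byte(v&0x7F)|0x80)
		v >>= 7
	}
	return append(buf, byte(v))
}

// encodeUnpacked writes one tag-value pair per element (wire type 0).
func encodeUnpacked(fieldNumber uint64, values []uint64) []byte {
	var out []byte
	for _, v := range values {
		out = appendVarint(out, fieldNumber<<3) // tag repeated for every element
		out = appendVarint(out, v)
	}
	return out
}

// encodePacked writes the tag once (wire type 2), then the total payload
// length, then all values back to back.
func encodePacked(fieldNumber uint64, values []uint64) []byte {
	var payload []byte
	for _, v := range values {
		payload = appendVarint(payload, v)
	}
	out := appendVarint(nil, fieldNumber<<3|2)
	out = appendVarint(out, uint64(len(payload)))
	return append(out, payload...)
}

func main() {
	values := []uint64{1, 2, 3}
	fmt.Println(encodeUnpacked(1, values)) // cat, field 1
	fmt.Println(encodePacked(2, values))   // dog, field 2
}
```

For three small values the unpacked form is [8 1 8 2 8 3] (6 bytes) and the packed form is [18 3 1 2 3] (5 bytes). The saving grows with the number of elements, since unpacked encoding pays for one tag per element.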

Conclusion

With its compact encoding and strongly typed schemas, Protobuf deserves wider adoption in the data transmission field.

Finally, I would like to introduce to you the most suitable platform for deploying services: Leapcell

1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • pay only for usage — no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.

Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ
