tarantool

Advanced MessagePack capabilities

Photo by Peretz Partensky / CC BY-SA 2.0

MessagePack is a binary format for data serialization. It is positioned by the authors as a more efficient alternative to JSON. Due to its speed and compactness, it's often used as a format for data exchange in high-performance systems. The other reason this format became popular is that it's very easy to implement. Your favorite programming language most likely already has several libraries designed to work with it.

In this article, I'm not going to tell you how MessagePack works or compare it to its counterparts: there are plenty of materials on this topic on the Internet. What's really missing is information about MessagePack's extended type system. I'll try to explain and show by example what it is and how to make serialization even more efficient using extension types.

The Extension type

The MessagePack specification defines 9 basic types:

  • Nil
  • Boolean
  • Integer
  • Float
  • String
  • Binary
  • Array
  • Map
  • Extension.

The last type, Extension, is a container designed for storing extension types. Let's look closely at how it works. It will help us with writing our own types. Here is how the container is structured:

[Figure: Extension container layout: Header, Type, Data]

Header is the container's header (1 to 5 bytes). It contains the payload size, i.e., the length of the Data field. To learn more about how the header is formed, take a look at the specification.

Type is the ID of the stored type, an 8-bit signed integer. Negative values are reserved for official types. User types' IDs can take any value in the range from 0 to 127.

Data is an arbitrary byte string up to 4 GiB long, which contains the actual data. The format of official types is described in the specification, while the format of user types may depend entirely on the developer's imagination.
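
To make the layout concrete, here is a minimal sketch in plain PHP that assembles the container by hand, following the ext family of formats from the specification. The function name encodeExt is hypothetical and not part of any library; in practice, your MessagePack library does this for you.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a library API): wraps $data into
 * a MessagePack Extension container with the given type ID.
 */
function encodeExt(int $type, string $data): string
{
    if ($type < -128 || $type > 127) {
        throw new OutOfRangeException("Type must fit into int8, $type given");
    }

    $length = strlen($data);
    $typeByte = pack('c', $type); // 8-bit signed integer

    // fixext 1/2/4/8/16: a single header byte implies the payload size
    $fixext = [1 => "\xd4", 2 => "\xd5", 4 => "\xd6", 8 => "\xd7", 16 => "\xd8"];
    if (isset($fixext[$length])) {
        return $fixext[$length].$typeByte.$data;
    }

    // ext 8/16/32: a marker byte plus a 1-, 2- or 4-byte big-endian length
    if ($length <= 0xff) {
        return "\xc7".pack('C', $length).$typeByte.$data;
    }
    if ($length <= 0xffff) {
        return "\xc8".pack('n', $length).$typeByte.$data;
    }
    if ($length <= 0xffffffff) {
        return "\xc9".pack('N', $length).$typeByte.$data;
    }

    throw new LengthException('Payload is too long for an Extension container');
}

// 3 bytes of payload need ext 8: marker, length byte, type byte, data
echo bin2hex(encodeExt(5, 'abc')), "\n"; // c70305616263
```

Note how the header shrinks to a single byte when the payload is exactly 1, 2, 4, 8, or 16 bytes long. This is worth keeping in mind when designing the binary format of your own types.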

The list of official types currently includes only Timestamp with the ID of -1. Occasionally there are proposals to add new types (such as UUIDs, multidimensional arrays, or geo-coordinates), but since the discussions are not very active, I would not expect anything new to be added in the near future.
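
For reference, the smallest of the Timestamp formats, timestamp 32, is a fixext 4 container with type -1 whose payload is the number of seconds since the Unix epoch, stored as a 32-bit unsigned big-endian integer. Below is a minimal sketch in plain PHP. The function names are hypothetical, 64-bit PHP is assumed, and the 64- and 96-bit variants that carry nanoseconds are left out.

```php
<?php

declare(strict_types=1);

/** Hypothetical helper: encodes whole seconds as a timestamp 32 value. */
function encodeTimestamp32(int $seconds): string
{
    if ($seconds < 0 || $seconds > 0xffffffff) {
        throw new OutOfRangeException('timestamp 32 holds only unsigned 32-bit seconds');
    }

    // 0xd6 = fixext 4, 0xff = type -1 (Timestamp), then uint32 big-endian
    return "\xd6\xff".pack('N', $seconds);
}

/** Hypothetical helper: the reverse operation. */
function decodeTimestamp32(string $packed): int
{
    if (strlen($packed) !== 6 || substr($packed, 0, 2) !== "\xd6\xff") {
        throw new InvalidArgumentException('Not a timestamp 32 value');
    }

    return unpack('N', substr($packed, 2))[1];
}

$packed = encodeTimestamp32(1_600_000_000);
echo bin2hex($packed), "\n";            // d6ff5f5e1000
echo decodeTimestamp32($packed), "\n";  // 1600000000
```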

Hello, World!

Photo by Brett Ohland / CC BY-NC-SA 2.0

That's enough theory, let's start coding! For these examples, we'll use the msgpack.php MessagePack library since it provides a convenient API to handle extension types. I hope you'll find these code examples easy to understand even if you use other libraries.

Since I mentioned UUID, let's implement support for this data type as an example. To do so, we'll need to write an extension — a class to serialize and deserialize UUID values. We will use the symfony/uid library to make handling such values easier.

This example can be adapted for any UUID library, be it the popular ramsey/uuid, the PECL uuid extension, or a custom implementation.

Let's name our class UuidExtension. The class must implement the Extension interface:

use MessagePack\BufferUnpacker;
use MessagePack\Extension;
use MessagePack\Packer;
use Symfony\Component\Uid\Uuid;

final class UuidExtension implements Extension
{
    public function getType(): int
    {
        // TODO
    }

    public function pack(Packer $packer, mixed $value): ?string
    {
        // TODO
    }

    public function unpackExt(BufferUnpacker $unpacker, int $extLength): Uuid
    {
        // TODO
    }
}

We determined earlier what the type (ID) of the extension is, so we can easily implement the getType() method. In the simplest case, this method could return a fixed constant, globally defined for the whole project. However, to make the class more versatile, we'll let it define the type when initializing the extension. Let's add a constructor with one integer argument, $type:

/** @readonly */
private int $type;

public function __construct(int $type)
{
    if ($type < 0 || $type > 127) {
        throw new \OutOfRangeException(
            "Extension type is expected to be between 0 and 127, $type given"
        );
    }

    $this->type = $type;
}

public function getType(): int
{
    return $this->type;
}

Now let's implement the pack() method. From the method's signature, we can see that it takes two parameters: a Packer class instance and a $value of any type. The method must return either a serialized value (wrapped into the Extension container) or null if the extension does not support the value type:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!$value instanceof Uuid) {
        return null;
    }

    return $packer->packExt($this->type, $value->toBinary());
}

The reverse operation isn't much harder to implement. The unpackExt() method takes a BufferUnpacker instance and the length of the serialized data (the size of the Data field from the schema above). Since we've saved the binary representation of a UUID object in this field, all we need to do is read this data and build a Uuid object:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): Uuid
{
    return Uuid::fromString($unpacker->read($extLength));
}
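
To show where the $extLength argument comes from, here is a hedged sketch of what has to happen before an extension's unpackExt() is called: the container header is read to obtain the type and the payload length. The decodeExt function is hypothetical, written in plain PHP 8, and handles only the fixext and ext 8 formats. It is not how msgpack.php is implemented.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a library API): splits an Extension
 * container into [type, data]. Only fixext and ext 8 are handled.
 */
function decodeExt(string $packed): array
{
    $fixext = ["\xd4" => 1, "\xd5" => 2, "\xd6" => 4, "\xd7" => 8, "\xd8" => 16];
    $marker = $packed[0] ?? '';

    if (isset($fixext[$marker])) {
        $length = $fixext[$marker];   // the marker itself implies the length
        $offset = 1;
    } elseif ($marker === "\xc7" && strlen($packed) >= 2) {
        $length = ord($packed[1]);    // ext 8: one explicit length byte
        $offset = 2;
    } else {
        throw new InvalidArgumentException('Unsupported or malformed container');
    }

    if (strlen($packed) !== $offset + 1 + $length) {
        throw new InvalidArgumentException('Unexpected container size');
    }

    $type = unpack('c', $packed[$offset])[1]; // 8-bit signed integer
    $data = substr($packed, $offset + 1);

    return [$type, $data];
}

[$type, $data] = decodeExt("\xd8\x00".hex2bin('7e3b84a40819473a96255d57ad1c9604'));
echo $type, ' ', strlen($data), "\n"; // prints "0 16"
```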

Our extension is ready! The last step is to register an instance of the class with a specific ID. Let the ID be 0:

$uuidExt = new UuidExtension(0);

// extendWith() returns a new instance with the extension registered
$packer = (new Packer())->extendWith($uuidExt);
$unpacker = (new BufferUnpacker())->extendWith($uuidExt);

Let's make sure everything works correctly:

$uuid = new Uuid('7e3b84a4-0819-473a-9625-5d57ad1c9604');

$packed = $packer->pack($uuid);
$unpacked = $unpacker->reset($packed)->unpack();

assert($uuid->equals($unpacked));
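
Besides the type safety, the extension also pays off in size. The sketch below builds both representations by hand in plain PHP, following the specification rather than calling any library: the canonical 36-character form stored as a str 8 value, and the 16-byte binary form stored in a fixext 16 container with type 0.

```php
<?php

declare(strict_types=1);

$uuid = '7e3b84a4-0819-473a-9625-5d57ad1c9604';

// As a plain string: str 8 marker (0xd9) + length byte + 36 characters
$asString = "\xd9".chr(strlen($uuid)).$uuid;

// As an extension: fixext 16 marker (0xd8) + type 0 + 16 raw bytes
$binary = hex2bin(str_replace('-', '', $uuid));
$asExt = "\xd8\x00".$binary;

printf("string: %d bytes, extension: %d bytes\n", strlen($asString), strlen($asExt));
// string: 38 bytes, extension: 18 bytes
```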

That was an example of a simple UUID extension. Similarly, you can add support for any other type used in your application: DateTime, Decimal, Money. Or you can write a versatile extension that allows serializing any object (as is done in KPHP).

However, this is not the only use for extensions. I'll now show you some interesting examples that demonstrate other advantages of using extension types.

"Lorem ipsum" or compressing the incompressible

Photo by dog97209 / CC BY-NC-ND 2.0

If you've ever looked into MessagePack before, you probably know the phrase from its official website, msgpack.org: "It's like JSON, but fast and small."

In fact, if you compare how much space the same data occupies in JSON and MessagePack, you'll see why the latter is a much more compact format. For example, the number 100 takes 3 bytes in JSON and only 1 in MessagePack. The difference becomes more significant as the number's order of magnitude grows. For the maximum value of int64 (9223372036854775807), the size of the stored data differs by as much as 10 bytes (19 against 9)!

The same is true for boolean values: 4 or 5 bytes in JSON against 1 byte in MessagePack. It is also true for arrays because many syntactic symbols, such as commas separating the elements, colons separating the keys from the values, and brackets indicating the array boundaries, don't exist in binary format. Obviously, the larger the array is, the more syntactic litter accumulates along with the payload.

With string values, however, things are not so straightforward. If your strings do not consist entirely of quotes, line feeds, and other special characters that require escaping, then you won't notice a big difference between their sizes in JSON and in MessagePack. For example, "foobar" has a length of 8 bytes in JSON and 7 in MessagePack. Note that the above only applies to UTF-8 strings. For binary strings, JSON's disadvantage against MessagePack is obvious.
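
These figures are easy to check yourself. The sketch below uses a toy encoder, packScalar, which is hypothetical and covers only what the comparison needs (booleans, non-negative integers, and strings up to 255 bytes), and compares its output with json_encode(). It assumes 64-bit PHP 8.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical toy encoder covering only what the comparison needs:
 * booleans, non-negative integers, and strings up to 255 bytes.
 */
function packScalar(bool|int|string $value): string
{
    if (is_bool($value)) {
        return $value ? "\xc3" : "\xc2";
    }

    if (is_string($value)) {
        $length = strlen($value);
        if ($length < 32) {
            return chr(0xa0 | $length).$value;  // fixstr
        }
        if ($length < 256) {
            return "\xd9".chr($length).$value;  // str 8
        }
        throw new LengthException('Longer strings are out of scope for this sketch');
    }

    if ($value < 0) {
        throw new OutOfRangeException('Negative integers are out of scope for this sketch');
    }
    if ($value <= 0x7f) {
        return chr($value);                     // positive fixint
    }
    if ($value <= 0xff) {
        return "\xcc".chr($value);              // uint 8
    }
    if ($value <= 0xffff) {
        return "\xcd".pack('n', $value);        // uint 16
    }
    if ($value <= 0xffffffff) {
        return "\xce".pack('N', $value);        // uint 32
    }

    return "\xcf".pack('J', $value);            // uint 64
}

foreach ([100, PHP_INT_MAX, true, false, 'foobar'] as $value) {
    printf(
        "%-20s JSON: %2d bytes, MessagePack: %d bytes\n",
        var_export($value, true),
        strlen(json_encode($value)),
        strlen(packScalar($value))
    );
}
```

The output reproduces the numbers from the text: 3 against 1 byte for 100, 19 against 9 for the maximum int64 value, 4 and 5 against 1 for the booleans, and 8 against 7 for "foobar".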

Knowing this peculiarity of MessagePack, you can have a good laugh reading articles that compare the two formats in terms of data compression efficiency while using mainly string data for the tests. Clearly, any conclusions based on the results of such tests make no practical sense. So treat those articles with skepticism and run comparative tests on your own data.

At some point, there were discussions about whether to add string compression (individual or in frames) to the specification to make string serialization more compact. However, the idea was rejected, and the implementation of this feature was left to users. So let's try it.

Let's create an extension that will compress long strings. We will use whatever compression tool is at hand, for example, zlib.

Choose the data compression algorithm based on the specifics of your data. For example, if you are working with lots of short strings, take a look at SMAZ.

Let's start with the constructor for our new class, TextExtension. The first argument is the extension ID, and as a second optional argument, we'll add a minimum string length. Strings shorter than this value will be serialized in a standard way, without compression. In this way, we will avoid cases where the compressed string ends up longer than the initial one:

final class TextExtension implements Extension
{
    /** @readonly */
    private int $type;

    /** @var positive-int */
    private int $minLength;

    public function __construct(int $type, int $minLength = 100)
    {
        ...

        $this->type = $type;
        $this->minLength = $minLength;
    }

    ...
}

To implement the pack() method, we might write something like this:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!is_string($value)) {
        return null;
    }

    if (strlen($value) < $this->minLength) {
        return $packer->packStr($value);
    }

    // compress and pack
    ...
}

However, this wouldn't work. String is one of the basic types, so the packer will serialize it before our extension is called. This is done in the msgpack.php library for performance reasons. Otherwise, before serializing each value, the packer would need to scan the available extensions, considerably slowing down the process.
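
The following toy model illustrates that dispatch order. It is not the library's code: the ToyPacker class, its methods, and the textual output format are all made up for illustration. The point is only that a value of a basic type returns early, so an extension that would like to handle strings is never asked.

```php
<?php

declare(strict_types=1);

/**
 * Toy model of the dispatch order described above. It is NOT the
 * library's code: names and output format are made up for illustration.
 */
final class ToyPacker
{
    /** @var list<callable(mixed): ?string> */
    private array $extensions = [];

    public function extendWith(callable $extension): void
    {
        $this->extensions[] = $extension;
    }

    public function pack(mixed $value): string
    {
        // 1. Basic types are recognized first, with cheap type checks
        if (is_string($value)) {
            return "str($value)";
        }
        if (is_int($value)) {
            return "int($value)";
        }

        // 2. Only values of unknown types reach the extensions
        foreach ($this->extensions as $extension) {
            $packed = $extension($value);
            if ($packed !== null) {
                return $packed;
            }
        }

        throw new InvalidArgumentException('Unsupported type: '.get_debug_type($value));
    }
}

$packer = new ToyPacker();
// This "extension" would like to handle strings, but it is never asked to
$packer->extendWith(fn (mixed $v): ?string => is_string($v) ? "ext($v)" : null);

echo $packer->pack('a very long string'), "\n"; // str(a very long string)
```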

Therefore, we need to tell the packer not to serialize certain strings as, you know, strings but to use an extension. As you might guess from the UUID example, it can be done via a value object. Let's call it Text, similar to the extension class:

/**
 * @psalm-immutable
 */
final class Text
{
    public function __construct(
        public string $str
    ) {}

    public function __toString(): string
    {
        return $this->str;
    }
}

So instead of

$packed = $packer->pack('a very long string');

we'll use a Text object to mark long strings:

$packed = $packer->pack(new Text('a very long string'));

Let's update the pack() method:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!$value instanceof Text) {
        return null;
    }

    $length = strlen($value->str);
    if ($length < $this->minLength) {
        return $packer->packStr($value->str);
    }

    // compress and pack
    ...
}

Now we just need to compress the string and put the result in an Extension. Note that the minimum length limit does not guarantee that the string will take less space after compression. For this reason, you might want to compare the lengths of the compressed string and the original and choose whichever is more compact:

$context = deflate_init(ZLIB_ENCODING_GZIP);
$compressed = deflate_add($context, $value->str, ZLIB_FINISH);

return isset($compressed[$length - 1])
    ? $packer->packStr($value->str)
    : $packer->packExt($this->type, $compressed);
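
The isset($compressed[$length - 1]) check above is true when the compressed string is at least as long as the original. The standalone sketch below spells out the same choice with strlen() and shows why it matters. The function names are hypothetical, and the zlib PHP extension is assumed to be loaded.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns [isCompressed, bytes], keeping the
 * original string whenever gzip does not make it shorter.
 * Assumes the zlib PHP extension is loaded.
 */
function compressIfSmaller(string $str): array
{
    $context = deflate_init(ZLIB_ENCODING_GZIP);
    $compressed = deflate_add($context, $str, ZLIB_FINISH);

    if ($compressed === false || strlen($compressed) >= strlen($str)) {
        return [false, $str];
    }

    return [true, $compressed];
}

/** Hypothetical helper: the reverse operation for compressed data. */
function decompress(string $compressed): string
{
    $context = inflate_init(ZLIB_ENCODING_GZIP);
    $str = inflate_add($context, $compressed, ZLIB_FINISH);
    if ($str === false) {
        throw new RuntimeException('Failed to decompress the data');
    }

    return $str;
}

// The gzip header and trailer alone are longer than a short string
[$isCompressed] = compressIfSmaller('foobar');
var_dump($isCompressed); // bool(false)

// A long repetitive string shrinks considerably
$long = str_repeat('Lorem ipsum dolor sit amet. ', 20);
[$isCompressed, $bytes] = compressIfSmaller($long);
var_dump($isCompressed, decompress($bytes) === $long); // bool(true), bool(true)
```

The gzip framing adds a fixed overhead of header and trailer bytes, so if every byte counts, a raw deflate stream (ZLIB_ENCODING_RAW) may be worth measuring on your data as well.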

Deserialization:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): string
{
    $compressed = $unpacker->read($extLength);
    $context = inflate_init(ZLIB_ENCODING_GZIP);

    return inflate_add($context, $compressed, ZLIB_FINISH);
}

Let's see the result:

$longString = <<<STR
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed 
do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco 
laboris nisi ut aliquip ex ea commodo consequat. Duis aute 
irure dolor in reprehenderit in voluptate velit esse cillum 
dolore eu fugiat nulla pariatur. Excepteur sint occaecat 
cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum.
STR;

$packedString = $packer->pack($longString);
// 448 bytes

$packedCompressedString = $packer->pack(new Text($longString));
// 291 bytes

In this example, we saved 157 bytes, or 35% of what would be the standard serialization result, on just one string!

From "schema-less" to "schema-mixed"

Photo by Adventures with E&L / CC BY-NC-ND 2.0

Compressing long strings is not the only way to save space. MessagePack is a schemaless, or schema-on-read, format, which has its advantages and disadvantages. One of the disadvantages in comparison with schema-full (schema-on-write) formats is the highly inefficient serialization of repeated data structures. An example of such data is a selection from a database, where all elements of the resulting array have the same structure:

$profiles = [
    [
        'id' => 1,
        'first_name' => 'First name 1',
        'last_name' => 'Last name 1',
    ],
    [
        'id' => 2,
        'first_name' => 'First name 2',
        'last_name' => 'Last name 2',
    ],
    ...
    [
        'id' => 100,
        'first_name' => 'First name 100',
        'last_name' => 'Last name 100',
    ],
];

If you serialize this array with MessagePack, the repeated keys of each element in the array will take a substantial part of the total data size. But what if we could save the keys of such structured arrays just once? It would significantly cut down the size and also speed up serialization since the packer would have fewer operations to perform.

Like before, we are going to use extension types for that. Our type will be a value object wrapped around an arbitrary structured array:

/**
 * @psalm-immutable
 */
final class StructList
{
    public function __construct(
        public array $list,
    ) {}
}

If your project includes a library for database handling, there is probably a special class in that library to store table selection results. You can use this class as a type instead of/along with StructList.

Here is how we are going to serialize such arrays. First, we'll check the array size. Of course, if the array is empty or has only one element, there is no reason to store keys separately from values. We'll serialize arrays like these in a standard way.

In other cases, we'll first save a list of keys and then a list of values. We won't be storing an associative array list, which is the standard MessagePack option. Instead, we'll write data in a more compact form:

[Figure: compact layout: the keys array, then an array header with the number of items, then the values of every item in key order]
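
Before any bytes are involved, the idea can be tried out on plain PHP arrays. In the sketch below, the flatten and restore functions are hypothetical helpers that mirror the layout: the keys once, the number of items, and then a flat run of values.

```php
<?php

declare(strict_types=1);

/** Hypothetical helper: splits same-shaped rows into keys, row count, and flat values. */
function flatten(array $list): array
{
    if ($list === []) {
        throw new InvalidArgumentException('The list must not be empty');
    }

    // The first item defines the structure of all the others
    $keys = array_keys($list[array_key_first($list)]);

    $values = [];
    foreach ($list as $item) {
        foreach ($keys as $key) {
            $values[] = $item[$key];
        }
    }

    return [$keys, count($list), $values];
}

/** Hypothetical helper: rebuilds the original list. */
function restore(array $keys, int $size, array $values): array
{
    $list = [];
    $offset = 0;
    for ($i = 0; $i < $size; ++$i) {
        foreach ($keys as $key) {
            $list[$i][$key] = $values[$offset++];
        }
    }

    return $list;
}

$rows = [
    ['id' => 1, 'first_name' => 'First name 1', 'last_name' => 'Last name 1'],
    ['id' => 2, 'first_name' => 'First name 2', 'last_name' => 'Last name 2'],
];

[$keys, $size, $values] = flatten($rows);
// $keys   = ['id', 'first_name', 'last_name']
// $size   = 2
// $values = [1, 'First name 1', 'Last name 1', 2, 'First name 2', 'Last name 2']

var_dump(restore($keys, $size, $values) === $rows); // bool(true)
```

This only works if every item has exactly the same keys as the first one, which is the case for database selections.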

Implementation:

final class StructListExtension implements Extension
{
    ...

    public function pack(Packer $packer, mixed $value): ?string
    {
        if (!$value instanceof StructList) {
            return null;
        }

        $size = count($value->list);
        if ($size < 2) {
            return $packer->packArray($value->list);
        }

        $keys = array_keys(reset($value->list));

        $values = '';
        foreach ($value->list as $item) {
            foreach ($keys as $key) {
                $values .= $packer->pack($item[$key]);
            }
        }

        return $packer->packExt($this->type,
            $packer->packArray($keys).
            $packer->packArrayHeader($size).
            $values
        );
    }

    ...
}

To deserialize, we need to unpack the keys array and then use it to restore the initial array:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): array
{
    $keys = $unpacker->unpackArray();
    $size = $unpacker->unpackArrayHeader();

    $list = [];
    for ($i = 0; $i < $size; ++$i) {
        foreach ($keys as $key) {
            $list[$i][$key] = $unpacker->unpack();
        }
    }

    return $list;
}

That's it! Now, if we serialize $profiles from the example above as a normal array and as a structured StructList, we'll see a great difference in size — the latter will be 47% more compact.

$packedList = $packer->pack($profiles);
// 5287 bytes

$packedStructList = $packer->pack(new StructList($profiles));
// 2816 bytes
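
Both sizes can be derived by hand from the format rules, which also shows where the savings come from. The sketch below does the arithmetic in plain PHP without a MessagePack library. It assumes that the elided rows follow the same pattern as the ones shown and relies on the fact that every key and value in this data set takes either 1 byte (integers below 128) or 1 byte plus its length (strings shorter than 32 bytes).

```php
<?php

declare(strict_types=1);

// Assumption: the elided rows follow the same pattern as the ones shown
$profiles = [];
for ($i = 1; $i <= 100; ++$i) {
    $profiles[] = ['id' => $i, 'first_name' => "First name $i", 'last_name' => "Last name $i"];
}

// Valid for this data set only: integers below 128 take 1 byte,
// strings shorter than 32 bytes take 1 marker byte plus their length
$sizeOf = static fn (int|string $v): int => is_int($v) ? 1 : 1 + strlen($v);

$keyBytes = array_sum(array_map($sizeOf, array_keys($profiles[0]))); // 3 + 11 + 10 = 24
$valueBytes = 0;
foreach ($profiles as $profile) {
    $valueBytes += array_sum(array_map($sizeOf, $profile));
}

// Plain list: array 16 header (3) + per row: fixmap header (1) + keys + values
$plain = 3 + count($profiles) * (1 + $keyBytes) + $valueBytes;

// StructList: ext 16 header with the type byte (4) + fixarray header (1)
// + keys once + array 16 header carrying the row count (3) + values
$struct = 4 + 1 + $keyBytes + 3 + $valueBytes;

printf("%d vs %d bytes, %.0f%% saved\n", $plain, $struct, 100 * ($plain - $struct) / $plain);
// 5287 vs 2816 bytes, 47% saved
```

The values take the same 2784 bytes in both cases. The entire difference comes from writing the 24 bytes of keys and the map header once instead of 100 times.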

We could go further and create a specialized Profiles type to store information about the array structure in the extension code. This way, we wouldn't need to save the keys array. However, in this case, we would lose versatility.

Conclusion

We've taken a look at just a few examples of using extension types in MessagePack. To see more examples, check the msgpack.php library. For the implementations of all extension types supported by the Tarantool protocol, see the tarantool/client library.

I hope this article gave you a sense of what extension types are and how they can be useful. If you're already using MessagePack but didn't know about this feature, this information might inspire you to reconsider your current methods of working with the format and start using custom types.

If you're just wondering which serialization format to choose for your next project, the article might help you make a reasonable choice, adding a point in favor of MessagePack :)

Links

Get Tarantool on our website
Get help in our Telegram channel
