A few days ago, the creators of DuckDB published the article Query Engines: Gatekeepers of the Parquet File Format, which explains how the engines that process Parquet files as SQL tables are holding back the evolution of the format. Because those engines do not fully support the latest specification, the rest of the ecosystem has no incentive to adopt it.
In my experience, this issue is not limited to query engines but extends to other tools in the ecosystem. Soon after releasing the first version of Carpet, I discovered that there was a version 2 of the format and that the core Java Parquet library does not enable it by default. Since the specification had been finalized for some time, I decided that the best approach was to make Carpet use version 2 by default.
A week later, I discovered at work, the hard way, that if you are not up to date with Pandas in Python, you cannot read files written with version 2. I had to roll back the change immediately.
Parquet Version 2
Upon researching the topic, you'll find that even though the format specification is finalized, it is not fully implemented across the ecosystem. Ideally, the standard would be whatever the specification defines, but in reality, there is no agreement on the minimum set of features an implementation must support to be considered compatible with version 2.
In this Pull Request from the project that describes the file format, there has been an ongoing discussion for four years about what constitutes the core, and there are no signs of a resolution anytime soon. Reading this other thread on the mailing list, I came to the conclusion that although both are part of the specification, it mixes two concepts that could evolve independently:
- Given a series of values in a column, how to encode them efficiently: being able to incorporate new encodings such as `RLE_DICTIONARY` or `DELTA_BYTE_ARRAY`, which further improve compression.
- Given an encoded column's data, where to write it within the file along with its metadata such as headers, nulls, or statistics, which helps to maximize the available metadata while minimizing its size and the number of file reads. This is what they call Data Page V2.
Many would likely prefer to prioritize improvements in encoding over page structure. Finding a file that uses an unknown encoding would make a column unreadable, but a change in how pages are structured would make the entire file unreadable.
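To see which encodings a given file actually uses, you can inspect its footer metadata with the plain parquet-java API. Here is a minimal sketch (the file path is a placeholder); `ColumnChunkMetaData#getEncodings()` lists the encodings applied to each column chunk:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectEncodings {

    public static void main(String[] args) throws Exception {
        // "data.parquet" is a placeholder path for the example
        Path path = new Path("data.parquet");
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(path, new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // Encodings used by this column chunk (e.g. RLE_DICTIONARY, DELTA_BYTE_ARRAY)
                    System.out.println(column.getPath() + " -> "
                            + column.getEncodings() + ", codec: " + column.getCodec());
                }
            }
        }
    }
}
```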
What I came to understand is that new logical types are not tied to a specific format version. On the one hand, there are the primitive types, which are fixed; on top of them, logical types are defined: a date is represented with an `int32`, a timestamp with an `int64`, and a BigDecimal or a String with a `BYTE_ARRAY`. Now the `VARIANT` type is being defined, and I have not seen it associated with either of the two versions.
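As an illustration of this layering, here is a minimal sketch using the `Types` builder from parquet-java (the message and field names are made up for the example); each logical type is just an annotation placed on top of a fixed primitive type:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class LogicalTypesExample {

    public static void main(String[] args) {
        MessageType schema = Types.buildMessage()
                // DATE is an annotation over the int32 primitive
                .required(PrimitiveTypeName.INT32)
                    .as(LogicalTypeAnnotation.dateType()).named("purchase_date")
                // TIMESTAMP is an annotation over the int64 primitive
                .required(PrimitiveTypeName.INT64)
                    .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.MILLIS)).named("created_at")
                // STRING and DECIMAL are annotations over the binary primitive
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.stringType()).named("product")
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.decimalType(2, 18)).named("amount")
                .named("Purchase");
        System.out.println(schema);
    }
}
```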
Meanwhile, in the Machine Learning world, Parquet and ORC have become limiting because those workloads require specialized features, such as handling files with thousands of columns. To address this, two new formats have emerged: Nimble from Facebook and LV2 from LanceDB.
If you want to delve deeper into the topic, I recommend this introductory article. I consider these two formats to be niche solutions, and I believe Parquet will continue to reign in the world of data engineering.
Performance of Version 2
The DuckDB article prompted me to investigate the performance implications of Parquet version 2, which I hadn't considered in [my previous post on compression algorithms](https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec).
Configuring file writing with version 2 is straightforward, requiring only a property setting in the writer’s builder:
```java
CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withWriterVersion(WriterVersion.PARQUET_2_0)
    .build();
```
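If you are not using Carpet, the same switch is available in plain parquet-java through any `ParquetWriter` builder. Here is a minimal sketch with the Avro binding (the schema and output path are placeholders for the example):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.hadoop.ParquetWriter;

public class V2WriterExample {

    public static void main(String[] args) throws Exception {
        // Placeholder schema and output path for the example
        Schema schema = SchemaBuilder.record("Row").fields()
                .requiredString("name")
                .requiredLong("value")
                .endRecord();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/rows.parquet"))
                .withSchema(schema)
                // Same knob as in Carpet's builder: enables version 2 pages and encodings
                .withWriterVersion(WriterVersion.PARQUET_2_0)
                .build()) {
            // write records with writer.write(record)...
        }
    }
}
```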
File Size
Italian government dataset:
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
CSV | 1761 MB | 1761 MB | - |
UNCOMPRESSED | 564 MB | 355 MB | 37 % |
SNAPPY | 220 MB | 198 MB | 10 % |
GZIP | 146 MB | 138 MB | 5 % |
ZSTD | 148 MB | 144 MB | 2 % |
LZ4_RAW | 209 MB | 192 MB | 8 % |
LZO | 215 MB | 195 MB | 9 % |
New York taxi dataset:
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
CSV | 2983 MB | 2983 MB | - |
UNCOMPRESSED | 760 MB | 511 MB | 33 % |
SNAPPY | 542 MB | 480 MB | 11 % |
GZIP | 448 MB | 444 MB | 1 % |
ZSTD | 430 MB | 444 MB | -3 % |
LZ4_RAW | 547 MB | 482 MB | 12 % |
LZO | 518 MB | 479 MB | 7 % |
The new encodings in version 2 compact the data more effectively before compression, which explains the larger relative improvement for the UNCOMPRESSED files. This leaves less room for the compression algorithms to further reduce the file size (and in the case of ZSTD on the taxi dataset, it even gets slightly worse).
Writing
Italian government dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 25.0 | 23.6 | 6 % |
SNAPPY | 25.2 | 23.5 | 7 % |
GZIP | 39.3 | 35.8 | 9 % |
ZSTD | 27.3 | 25.7 | 6 % |
LZ4_RAW | 24.9 | 23.8 | 4 % |
LZO | 26.0 | 24.6 | 5 % |
New York taxi dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 57.9 | 50.2 | 13 % |
SNAPPY | 56.4 | 50.7 | 10 % |
GZIP | 91.1 | 66.9 | 27 % |
ZSTD | 64.1 | 57.1 | 11 % |
LZ4_RAW | 56.5 | 50.5 | 11 % |
LZO | 56.1 | 51.1 | 9 % |
The improvement in writing times is remarkable, especially in the New York taxi dataset, where most of the values are numeric. The reduction in GZIP writing times is particularly noteworthy.
Reading
Italian government dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 11.4 | 11.3 | 1 % |
SNAPPY | 12.5 | 11.5 | 8 % |
GZIP | 13.6 | 12.8 | 6 % |
ZSTD | 13.1 | 12.2 | 7 % |
LZ4_RAW | 12.8 | 11.3 | 12 % |
LZO | 13.1 | 12.1 | 7 % |
New York taxi dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 37.4 | 33.0 | 12 % |
SNAPPY | 39.9 | 34.0 | 15 % |
GZIP | 40.9 | 34.4 | 16 % |
ZSTD | 41.5 | 34.1 | 18 % |
LZ4_RAW | 41.5 | 33.6 | 19 % |
LZO | 41.1 | 33.7 | 18 % |
In reading, we again see a notable improvement, and it is even larger in the taxi dataset, which contains many decimal types.
Conclusion
Although this post might seem like a critique of Parquet, that is not my intention. I am simply documenting what I have learned and explaining the challenges the maintainers of an open format face when evolving it. All the benefits and utility that a format like Parquet provides far outweigh these inconveniences.
The improvements that the latest version of Parquet brings help reduce file sizes and processing times, but the difference is not dramatic. Given the low adoption of version 2 in the ecosystem, these improvements do not yet justify the potential compatibility problems when you integrate with third parties. However, if you control all parts of the process, consider adopting the latest specification.
Most of what I have written is my interpretation, and I could be wrong. If you have better sources or a different opinion, feel free to share it in the comments.