A few days ago, the creators of DuckDB published the article Query Engines: Gatekeepers of the Parquet File Format, which explains how the engines that process Parquet files as SQL tables are holding back the evolution of the format. Because those engines do not fully support the latest specification, the rest of the ecosystem has no incentive to adopt it.
In my experience, this issue is not limited to query engines but extends to other tools in the ecosystem. Soon after releasing the first version of Carpet, I discovered that there was a version 2 of the format and that the core Java Parquet library does not enable it by default. Since the specification had been finalized for some time, I decided that the best approach was to make Carpet use version 2 by default.
A week later, I discovered at work, the hard way, that if you are not up to date with Pandas in Python, you cannot read files written with version 2. I had to roll back the change immediately.
Parquet Version 2
Upon researching the topic, you'll find that even though the format specification is finalized, it is not fully implemented across the ecosystem. Ideally, the standard would be whatever the specification defines, but in reality, there is no agreement on the minimum set of features an implementation must support to be considered compatible with version 2.
In this Pull Request from the project that describes the file format, there has been an ongoing discussion for four years about what constitutes the core, and there are no signs of a resolution anytime soon. Reading this other thread on the mailing list, I came to the conclusion that although both are part of the specification, it mixes two concepts that could evolve independently:
- Given a series of values in a column, how to encode them efficiently: being able to incorporate new encodings such as `RLE_DICTIONARY` or `DELTA_BYTE_ARRAY`, which further improve compression.
- Given an encoded column's data, where to write it within the file along with its metadata such as headers, nulls, or statistics, which helps to maximize the available metadata while minimizing its size and the number of file reads. This is what they call Data Page V2.
Many would likely prefer to prioritize improvements in encoding over page structure. Finding a file that uses an unknown encoding would make a column unreadable, but a change in how pages are structured would make the entire file unreadable.
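To see which encodings a given file actually uses, you can inspect its footer metadata with the plain parquet-java API. Here is a minimal sketch (the file path is a placeholder); `ColumnChunkMetaData#getEncodings()` lists the encodings applied to each column chunk:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectEncodings {

    public static void main(String[] args) throws Exception {
        // "data.parquet" is a placeholder path for the example
        Path path = new Path("data.parquet");
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(path, new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // Encodings used by this column chunk (e.g. RLE_DICTIONARY, DELTA_BYTE_ARRAY)
                    System.out.println(column.getPath() + " -> "
                            + column.getEncodings() + ", codec: " + column.getCodec());
                }
            }
        }
    }
}
```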
What I came to understand is that new logical types are not tied to a specific format version. On the one hand, there are the primitive types, which are fixed; on top of them, logical types are defined: a date is represented with an `int32`, a timestamp with an `int64`, and a BigDecimal or a String with a `BYTE_ARRAY`. Now the `VARIANT` type is being defined, and I have not seen it associated with either of the two versions.
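As an illustration of this layering, here is a minimal sketch using the `Types` builder from parquet-java (the message and field names are made up for the example); each logical type is just an annotation placed on top of a fixed primitive type:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class LogicalTypesExample {

    public static void main(String[] args) {
        MessageType schema = Types.buildMessage()
                // DATE is an annotation over the int32 primitive
                .required(PrimitiveTypeName.INT32)
                    .as(LogicalTypeAnnotation.dateType()).named("purchase_date")
                // TIMESTAMP is an annotation over the int64 primitive
                .required(PrimitiveTypeName.INT64)
                    .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.MILLIS)).named("created_at")
                // STRING and DECIMAL are annotations over the binary primitive
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.stringType()).named("product")
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.decimalType(2, 18)).named("amount")
                .named("Purchase");
        System.out.println(schema);
    }
}
```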
Meanwhile, in the Machine Learning world, Parquet and ORC have become limiting because those workloads require specialized features, such as handling files with thousands of columns. To address this, two new formats have emerged: Nimble from Facebook and LV2 from LanceDB.
If you want to delve deeper into the topic, I recommend this introductory article. I consider these two formats to be niche solutions, and I believe Parquet will continue to reign in the world of data engineering.
Performance of Version 2
The DuckDB article prompted me to investigate the performance implications of Parquet version 2, which I hadn't considered in [my previous post on compression algorithms](https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec).
Configuring file writing with version 2 is straightforward, requiring only a property setting in the writer’s builder:
```java
CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withWriterVersion(WriterVersion.PARQUET_2_0)
    .build();
```
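If you are not using Carpet, the same switch is available in plain parquet-java through any `ParquetWriter` builder. Here is a minimal sketch with the Avro binding (the schema and output path are placeholders for the example):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.hadoop.ParquetWriter;

public class V2WriterExample {

    public static void main(String[] args) throws Exception {
        // Placeholder schema and output path for the example
        Schema schema = SchemaBuilder.record("Row").fields()
                .requiredString("name")
                .requiredLong("value")
                .endRecord();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/rows.parquet"))
                .withSchema(schema)
                // Same knob as in Carpet's builder: enables version 2 pages and encodings
                .withWriterVersion(WriterVersion.PARQUET_2_0)
                .build()) {
            // write records with writer.write(record)...
        }
    }
}
```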
File Size
Italian government dataset:
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
CSV | 1761 MB | 1761 MB | - |
UNCOMPRESSED | 564 MB | 355 MB | 37 % |
SNAPPY | 220 MB | 198 MB | 10 % |
GZIP | 146 MB | 138 MB | 5 % |
ZSTD | 148 MB | 144 MB | 2 % |
LZ4_RAW | 209 MB | 192 MB | 8 % |
LZO | 215 MB | 195 MB | 9 % |
New York taxi dataset:
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
CSV | 2983 MB | 2983 MB | - |
UNCOMPRESSED | 760 MB | 511 MB | 33 % |
SNAPPY | 542 MB | 480 MB | 11 % |
GZIP | 448 MB | 444 MB | 1 % |
ZSTD | 430 MB | 444 MB | -3 % |
LZ4_RAW | 547 MB | 482 MB | 12 % |
LZO | 518 MB | 479 MB | 7 % |
The new encodings in version 2 compact the data more effectively before compression, which explains the larger relative improvement for the UNCOMPRESSED files. This leaves less room for the compression algorithms to further reduce the file size (and in the case of ZSTD on the taxi dataset, it even gets slightly worse).
Writing
Italian government dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 25.0 | 23.6 | 6 % |
SNAPPY | 25.2 | 23.5 | 7 % |
GZIP | 39.3 | 35.8 | 9 % |
ZSTD | 27.3 | 25.7 | 6 % |
LZ4_RAW | 24.9 | 23.8 | 4 % |
LZO | 26.0 | 24.6 | 5 % |
New York taxi dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 57.9 | 50.2 | 13 % |
SNAPPY | 56.4 | 50.7 | 10 % |
GZIP | 91.1 | 66.9 | 27 % |
ZSTD | 64.1 | 57.1 | 11 % |
LZ4_RAW | 56.5 | 50.5 | 11 % |
LZO | 56.1 | 51.1 | 9 % |
The improvement in writing times is remarkable, especially in the New York taxi dataset, where most of the values are numeric. The reduction in GZIP writing times is particularly noteworthy.
Reading
Italian government dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 11.4 | 11.3 | 1 % |
SNAPPY | 12.5 | 11.5 | 8 % |
GZIP | 13.6 | 12.8 | 6 % |
ZSTD | 13.1 | 12.2 | 7 % |
LZ4_RAW | 12.8 | 11.3 | 12 % |
LZO | 13.1 | 12.1 | 7 % |
New York taxi dataset (seconds):
Format | Version 1 | Version 2 | Improvement |
---|---|---|---|
UNCOMPRESSED | 37.4 | 33.0 | 12 % |
SNAPPY | 39.9 | 34.0 | 15 % |
GZIP | 40.9 | 34.4 | 16 % |
ZSTD | 41.5 | 34.1 | 18 % |
LZ4_RAW | 41.5 | 33.6 | 19 % |
LZO | 41.1 | 33.7 | 18 % |
In reading, we again see a notable improvement, and it is even larger in the taxi dataset, which contains many decimal types.
Conclusion
Although this post might seem like a critique of Parquet, that is not my intention. I am simply documenting what I have learned and explaining the challenges the maintainers of an open format face when evolving it. All the benefits and utility that a format like Parquet provides far outweigh these inconveniences.
The improvements that the latest version of Parquet brings help reduce file sizes and processing times, but the difference is not dramatic. Given the low adoption of version 2 in the ecosystem, these improvements do not yet justify the potential compatibility problems when you integrate with third parties. However, if you control all parts of the process, consider adopting the latest specification.
Most of what I have written is my interpretation, and I could be wrong. If you have better sources or a different opinion, feel free to share it in the comments.