
abel-cheng

Clickhouse Source Code Analysis: How is primary key generated and used?

Introduction

First, let's explore some details of the primary key through a demo.
Demo table:

CREATE TABLE helloworld.my_first_table
(
    `user_id` UInt32,
    `message` String,
    `timestamp` DateTime,
    `metric` Float32,
    INDEX message_idx message TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1
)
ENGINE = MergeTree
PRIMARY KEY (user_id, toStartOfTenMinutes(timestamp))
SETTINGS index_granularity = 2

We set index_granularity to 2 so that we get many mark ranges even with limited demo data.

Insert some data:

INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (103, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (103, 'Sort your data based on your commonly-used queries', today() - 1,     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );

Filtering on user_id in the WHERE clause, we can see that 3 marks are filtered out.

Filtering on timestamp < today() or toStartOfTenMinutes(timestamp) < today(), 2 marks are filtered out.

Therefore, the conclusion is that even if we do not use the first column of the sort key when filtering, data skipping can still take effect. How does this happen?
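A rough sketch (in Python, not ClickHouse's actual code) of why this can work: each entry of the sparse index stores the full key tuple, and within any mark range whose first key column is constant, the second column is sorted, so a condition on it alone can still exclude the range.

```python
# Hypothetical model of a sparse index: one (user_id, ts_bucket) key per mark.
index = [(101, 10), (101, 40), (101, 60), (102, 5), (102, 50)]

def may_contain(begin_key, end_key, ts_upper):
    """Can the mark range [begin_key, end_key] hold a row with ts < ts_upper?"""
    uid_l, ts_l = begin_key
    uid_r, ts_r = end_key
    if uid_l == uid_r:
        # user_id is constant across the range, so ts values are sorted:
        # the range covers [ts_l, ts_r] and matches only if ts_l < ts_upper.
        return ts_l < ts_upper
    # user_id changes inside the range: ts restarts, so we cannot exclude it.
    return True

kept = [i for i in range(len(index) - 1)
        if may_contain(index[i], index[i + 1], 25)]
print(kept)  # [0, 2, 3] -- mark range 1 is skipped without touching user_id
```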

Next, we will explore the principles of PK, including the working mechanism in writing and querying.


Writing of primary key

Write entrance

1 Generate the sort-key and other expressions from the data block; here we get toStartOfTenMinutes(timestamp).

2 Reserve space on disk and create the part object.

3 Write data to the output stream.

4 Finalize data on disk: flush all buffered data and write the minmax index for partition keys, uuid.txt, partition.dat, checksums.txt, and count.txt.

The writing logic for the primary key lives inside the write function, which has two implementations: one for compact parts and one for wide parts.

Compact part

Primary keys are written to files in the function writeDataBlockPrimaryIndexAndSkipIndices.
Writing a data block in compact mode is like the traditional Parquet format: columns of the same granule are placed together.
Each granule contributes one row to the primary index.
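As a simplified sketch (not the actual writer code), building the sparse index amounts to taking the key of the first row of every granule:

```python
# Keys are (user_id, ten_minute_bucket) tuples, already sorted as in the demo.
def build_primary_index(sorted_keys, index_granularity):
    # The first row of each granule contributes one row to the primary index.
    # (The real writer may also append a final mark for the last row.)
    return [sorted_keys[i] for i in range(0, len(sorted_keys), index_granularity)]

rows = [(101, 0), (101, 0), (102, 10), (102, 20)]  # one sorted insert batch
index = build_primary_index(rows, index_granularity=2)
print(index)  # [(101, 0), (102, 10)]
```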
Then write the primary index block, which contains the values of the primary index expressions (toStartOfTenMinutes(timestamp) in this demo).
Multiple streams are created on the same data file for different compression codecs.

Wide part

For wide parts, the primary key follows the same logic, since both writers extend WriterOnDisk.
Columns are written separately: all granules of column1 first, then column2, and so on.
Each column has its own data file and mark file.
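The difference in write order between the two part formats can be sketched like this (illustrative only, not the real writer):

```python
# Sketch of the write order difference between compact and wide parts.
columns = ["user_id", "message", "timestamp", "metric"]
granules = [0, 1]

# Compact: all columns of one granule together, granule by granule (one data file).
compact_order = [(g, c) for g in granules for c in columns]

# Wide: all granules of one column together, column by column (file per column).
wide_order = [(g, c) for c in columns for g in granules]

print(compact_order[:4])  # [(0, 'user_id'), (0, 'message'), (0, 'timestamp'), (0, 'metric')]
print(wide_order[:4])     # [(0, 'user_id'), (1, 'user_id'), (0, 'message'), (1, 'message')]
```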


Reading of primary key (to be continued)

Read entrance

We can see that the filter logic is in the function markRangesFromPKRange, which is invoked on each part, one by one.
Function description:

Calculates a set of mark ranges, that could possibly contain keys, required by condition.
In other words, it removes subranges from whole range, that definitely could not contain required keys.
If @exact_ranges is not null, fill it with ranges containing marks of fully matched records.

Here is the logic for judging whether a mark range may contain the requested data.

Within a single part, data is already ordered by the primary key, so we can use binary search to find the marks for the left and right bounds.
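A rough analogue of that binary search using Python's bisect, assuming a one-column key and an index with one entry per mark (the last entry being the final mark):

```python
import bisect

# Sparse index: index[i] is the key of the first row of granule i; the last
# entry is the final mark, so granule i spans keys index[i]..index[i+1].
index = [(101,), (101,), (102,), (103,), (103,)]
target = (102,)

# Left bound: the granule just before the first index entry >= target may
# still contain the target at its right edge.
left = max(bisect.bisect_left(index, target) - 1, 0)
# Right bound: granules starting after the last entry == target cannot match.
right = bisect.bisect_right(index, target)

print((left, right))  # (1, 3): only granules 1 and 2 need to be read
```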

We can see that the main filtering logic is in key_condition's checkInRange function. The key condition is built from the query's conditions on the index columns.

The key_condition is built like this from the primary key expressions.

The binary search above happens only when we filter on a contiguous prefix of the primary key columns (here we filter on user_id, and the primary key is (user_id, toStartOfTenMinutes(timestamp))).

If instead we filter on timestamp only, the logic is totally different.

An exclusion search is done instead, where we drop ranges that definitely cannot match.
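A minimal sketch of such an exclusion search (the may_match predicate here is hypothetical, standing in for the real KeyCondition check):

```python
# Sketch of an exclusion search: without a usable key prefix we cannot
# binary-search, so walk every mark range and drop those that cannot match.
def exclusion_search(n_marks, may_match):
    ranges, start = [], None
    for i in range(n_marks):
        if may_match(i):
            if start is None:
                start = i          # open a new kept range
        elif start is not None:
            ranges.append((start, i))  # close the current kept range
            start = None
    if start is not None:
        ranges.append((start, n_marks))
    return ranges

# Toy predicate: pretend only marks 0, 1 and 4 can contain matching rows.
print(exclusion_search(6, lambda i: i in {0, 1, 4}))  # [(0, 2), (4, 5)]
```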


Some personal thoughts

ClickHouse records the primary key of the first row of each granule and then uses these keys for filtering at query time.
This is not suitable for some extensions, such as Z-order.
Z-order is better served by minmax indexes, where we can filter a mark by the min and max values within a granule.
ClickHouse currently records minmax indexes only for columns used in partitioning; maybe the minmax approach can be extended to primary key columns in the future.
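A sketch of what such a per-granule minmax index could look like (hypothetical, just to illustrate the idea):

```python
# Hypothetical per-granule minmax index: record (min, max) for each granule,
# then keep only granules whose value interval overlaps the query interval.
def build_minmax(values, granularity):
    return [(min(values[i:i + granularity]), max(values[i:i + granularity]))
            for i in range(0, len(values), granularity)]

def filter_granules(minmax, lo, hi):
    # Keep a granule only if its [min, max] interval overlaps [lo, hi].
    return [i for i, (mn, mx) in enumerate(minmax) if mx >= lo and mn <= hi]

mm = build_minmax([3, 7, 1, 2, 9, 8], granularity=2)
print(mm)                         # [(3, 7), (1, 2), (8, 9)]
print(filter_granules(mm, 4, 6))  # [0]
```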
