Introduction
First, let's look at some details of the primary key through a demo.
Demo table:
CREATE TABLE helloworld.my_first_table
(
`user_id` UInt32,
`message` String,
`timestamp` DateTime,
`metric` Float32,
INDEX message_idx message TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1
)
ENGINE = MergeTree
PRIMARY KEY (user_id, toStartOfTenMinutes(timestamp))
SETTINGS index_granularity = 2
We set index_granularity to a small value (2) so that we get many mark ranges even with limited demo data.
Insert some data:
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 );
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(103, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(103, 'Sort your data based on your commonly-used queries', today() - 1, 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 );
Filtering on user_id in the WHERE clause, we can see that 3 marks are filtered out.
Filtering on timestamp < today() or toStartOfTenMinutes(timestamp) < today(), 2 marks are filtered out.
Therefore, the conclusion is that even if we do not filter on the first position of the sort key, data skipping can still take effect. How does this happen?
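To observe this skipping yourself, EXPLAIN with indexes = 1 reports how many granules the primary key selects for a query. This is just a sketch; the exact granule counts depend on when the rows were inserted and whether the parts have already been merged.
-- Filter on the first primary key column
EXPLAIN indexes = 1
SELECT count() FROM helloworld.my_first_table WHERE user_id = 101;
-- Filter only on the second primary key expression
EXPLAIN indexes = 1
SELECT count() FROM helloworld.my_first_table
WHERE toStartOfTenMinutes(timestamp) < today();
In both cases the PrimaryKey section of the plan shows a selected/total granule count, so skipping works even though the second query never touches user_id.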
Next, we will explore the principles of the primary key, including how it works during writing and querying.
Writing of primary key
1 Generate the sort key and other expressions from the data block; here we get toStartOfTenMinutes(timestamp).
2 Reserve space on disk and create the part object.
3 Write data to the output stream.
4 Finalize data on disk, including flushing all data and writing the minmax files for partition keys, uuid.txt, partition.dat, checksums.txt, and count.txt.
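After the two inserts above we should end up with two active parts. A quick way to confirm what was written (output will vary once merges kick in) is to query system.parts:
SELECT name, rows, marks, primary_key_bytes_in_memory
FROM system.parts
WHERE database = 'helloworld' AND table = 'my_first_table' AND active;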
The writing logic of the primary key lives inside the write function, which has two implementations: one for compact parts and one for wide parts.
Compact part
Primary keys are written to files in the function writeDataBlockPrimaryIndexAndSkipIndices.
Writing a data block in compact mode is similar to the traditional Parquet format: the columns of one granule are placed together.
Each granule contributes one row to the primary index.
The primary index block, which contains the primary key expressions (toStartOfTenMinutes(timestamp) in this demo), is written out.
Many streams are created on the same data file for the different compression codecs.
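Whether a part is written in the compact or the wide format is controlled by MergeTree settings such as min_bytes_for_wide_part and min_rows_for_wide_part; with our tiny demo inserts the parts should come out compact. A hedged way to check the format and to force the wide path for experimentation:
-- Check the format of existing parts
SELECT name, part_type
FROM system.parts
WHERE database = 'helloworld' AND table = 'my_first_table' AND active;
-- Force new parts to be written in the wide format (only if you want to reproduce the wide-part code path)
ALTER TABLE helloworld.my_first_table
    MODIFY SETTING min_bytes_for_wide_part = 0, min_rows_for_wide_part = 0;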
Wide part
For wide parts, the primary key follows the same logic, since both writers extend WriterOnDisk.
Columns are written separately: all granules of column1 are written first, then column2, and so on.
Each column has its own data file and mark file.
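The per-column layout of a wide part can also be seen from system.parts_columns, which reports marks and bytes for each column separately. This is a sketch; the exact column names of this system table may differ slightly between ClickHouse versions.
SELECT name, column, column_bytes_on_disk, column_marks_bytes
FROM system.parts_columns
WHERE database = 'helloworld' AND table = 'my_first_table' AND active
ORDER BY name, column;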
Read process (to be done)
We can see that the filter logic is in the function markRangesFromPKRange. This function is invoked on each part, one by one.
Function description:
Calculates a set of mark ranges, that could possibly contain keys, required by condition.
In other words, it removes subranges from whole range, that definitely could not contain required keys.
If @exact_ranges is not null, fill it with ranges containing marks of fully matched records.
Here is the logic for judging whether a mark may contain the requested data.
Within a single part, data is already ordered by the primary key, so we can use binary search to find the marks for the left and right bounds.
We can see that the main filter logic is in key_condition's checkInRange function. The key condition comes from the index analysis.
key_condition is built from the primary key expressions.
The binary search above happens only when we filter on a contiguous prefix of the primary key columns (here we filter on user_id, and the primary key is (user_id, toStartOfTenMinutes(timestamp))).
If we instead filter on timestamp only, the logic is totally different:
an exclusion search is performed, where ranges that definitely cannot match are dropped.
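One way to see which strategy the server picked is to raise the session log level and run both queries; the trace output reports how many marks were selected by the primary key, and for the non-prefix filter it mentions a generic exclusion search over the index. This is a sketch; the exact wording of the log lines varies between versions.
SET send_logs_level = 'trace';
-- Prefix of the primary key: binary search over the index
SELECT count() FROM helloworld.my_first_table WHERE user_id = 101;
-- Non-prefix expression: generic exclusion search
SELECT count() FROM helloworld.my_first_table
WHERE toStartOfTenMinutes(timestamp) < today();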
Some personal thoughts
ClickHouse records the primary key values of the first row of each granule and then uses them for filtering when querying.
This is not suitable for some extensions, such as Z-order.
Z-order is a better fit for minmax indexes, where we can filter a mark
by the min and max values within a granule.
ClickHouse currently only records minmax values for the columns used in partitioning; maybe the minmax approach can be extended to primary keys in the future.
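That said, ClickHouse already exposes per-granule min/max filtering through data-skipping indexes, so a minmax index on an arbitrary expression is one way to approximate this today. A sketch (the index name and granularity here are just illustrative):
ALTER TABLE helloworld.my_first_table
    ADD INDEX metric_minmax_idx metric TYPE minmax GRANULARITY 1;
-- Build the index for parts that already exist
ALTER TABLE helloworld.my_first_table MATERIALIZE INDEX metric_minmax_idx;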