Introduction to Clickhouse at scale

#dataenginnering #clickhouse #beginners

Introduction

In many cases we prefer to scale our services, we always prefer scale out against scale up because of the lower cost of scaling out.

Scaling out = adding more components in parallel to spread out a load. Scaling up = making a component bigger or faster so that it can handle more load. Source

In Clickhouse terminology, Scale out is equal to sharding. And we use replicas to ensure availability. In this article, we will learn about Cluster, Replicas, and Sharding in Clickhouse.

Replication used for data integrity and automatic failover. Sharding is used for horizontal scaling of the cluster. From Reference 1

Sharding

Suppose we have a table on a Clickhouse node (yellow table in host1). After a while data become bigger and the request rate increased. Now we decide to scale the node, here host2 comes to the scene. We apply sharding to the table and send a second shard on host2.

How to make a select query to the table?
1. We can directly query a table in each host. In this case, we should know what data exist in each shard.
2. We can create a distributed table. It can create on each node, when we query to distributed table it ingests data from the proper shard. It doesn't store data but has metadata about data in each shard.

How to insert data into the table?
1. We can directly insert our data into the shards if we have a predefined schema or something same.
2. Also we cant insert data into the distributed table and it inserts data regarding the defined sharding key. You can read more about sharding key in Clickhouse Doc

Replication

Suppose we run an application depending on Clickhouse or we need online analytics which uses Clickhouse. What happens if one node becomes down or a hardware issue rises? Certainly, we don't want that, so we need replication for each table.
We can only have replicated tables for *MergeTree* tables.
Clickhouse Keeper is who manages things about replication in Clickhouse. It is compatible with Zookeeper and it's a kind of alternative for Zookeeper which Clickhouse presents.

Cluster

A cluster is a way to manage sharding and replication between some nodes. We can have many cluster topologies for the same nodes. Every time we add a table to one cluster, It shards and replicates in a way defined in that cluster.

How to apply Sharding/Replication/Clustering in Clickhouse?

Set up Clickhouse Cluster

In the below picture, we have 4 nodes. We define a cluster named cluster1 (in the Clickhouse config file). Each table associated with this cluster will have 2 replicas and 2 shards (But they should be Replicated* tables)

Set up Clickhouse Keeper

See This part of Reference 1

Create tables

Every time we want to create a table with a cluster1 topology, we should use the ON CLUSTER statement. For example, if we want to create the table:

The first parameter of the ReplicatedMergeTree engine is a Clickhouse keeper path of the table and the second one is a replica name. Tables with the same path and different replica names are replicated (Clickhouse keeper does these things).
We should define {shard} and {replica} as macros in each node.

Once mydb.my_table is created in each of host1, host2, host3, or host4 it will create in all other nodes.
And finally, we will create a distributed table to better query on my_db.my_table which has 2 shards.