This is the first part of the ClickHouse Tutorial Series. In this series, I cover some practical and vital aspects of the ClickHouse database, a robust OLAP technology many enterprise companies utilize.
In this part, I'll talk about the main features, weaknesses, installation, and usage of ClickHouse. I'll also refer to some helpful links for those who want to dive into broader details.
What is ClickHouse
ClickHouse is an open-source column-oriented database developed by Yandex. It is designed to provide high performance for analytical queries. ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. ClickHouse is used by companies such as Yandex, Facebook, and Uber for data analysis, machine learning, and more.
Main Features
The main features of Clickhouse Database are:
Column-Oriented
Data in ClickHouse is stored in columns instead of rows, bringing at least two benefits:
- Every column can be sorted in a separate file; hence, stronger compression happens on each column and the whole table.
- In range queries common in analytical processing, the system can access and process data easier since data is sorted in some columns (i.e., columns defined as sort keys). Additionally, it can parallelize processes on multi-cores while loading massive columns.
Note: It should not get mistaken with Wide-Column databases like Cassandra as they store data in rows but enable you to denormalize intensive data in a table with many columns leading to a No-SQL structure.
Data Compression
Thanks to compression algorithms (zstd and LZ4), data occupies much less storage, even more than 20x smaller! You can study some of the benchmarks on ClickHouse and other databases storage here.
Scalability
ClickHouse scales well both vertically and horizontally. It can be scaled by adding extra replicas and extra shards to process queries in a distributed way. ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. All nodes are equal, which allows for avoiding having single points of failure.
Weaknesses
To mention some:
- Lack of full-fledged UPDATE/DELETE implementation: ClickHouse is unsuited for modification and mutations. So you'll come across poor performance regarding those kinds of queries.
- OLTP queries like pointy ones would not make you happy since ClickHouse is easily outperformed by traditional RDBMSs like MySQL with those queries.
Rivals and Alternatives
To name a few:
- Apache Druid
- ElasticSearch
- SingleStore
- Snowflake
- TimescaleDB
Surely, each one is suitable for different use cases and has its pros and cons, but I won't discuss their comparison here. However, you can study some valuable benchmarks here and here.
Quick Start
Installation
I only cover the Docker approach here. For other methods on different distros, please follow ClicHouse's official Installation.
The docker-compose file:
version: '2'
services:
clickhouse:
container_name: myclickhouse
image: clickhouse/clickhouse-server:latest
ports:
- "8123:8123"
- "9000:9000"
volumes:
- ./clickhouse-data:/var/lib/clickhouse/
restart: unless-stopped
And then run it by:
docker compose up -d
As you can see, two ports have been exposed:
- 8123: HTTP API Port for HTTP requests, used by JDBC, ODBC, and web interfaces.
- 9000: Native Protocol port (ClickHouse TCP protocol). Used by ClickHouse apps and processes like clickhouse-server, clickhouse-client, and native ClickHouse tools. Used for inter-server communication for distributed queries.
It's up to your client driver to choose one of them. For example, DBeaver uses 8123, and Python ClickhHouse-Driver uses 9000.
To continue the tutorial, we use ClickHouse-Client available on the installed server:
docker exec -it myclickhouse clickhouse-client
Database and Table Creation
Create database test:
CREATE DATABASE test;
create table orders
:
CREATE TABLE test.orders
(`OrderID` Int64,
`CustomerID` Int64,
`OrderDate` DateTime,
`Comments` String,
`Cancelled` Bool)
ENGINE = MergeTree
PRIMARY KEY (OrderID, OrderDate)
ORDER BY (OrderID, OrderDate, CustomerID)
SETTINGS index_granularity = 8192;
In the next parts, we'll talk about other configurations like Engine
, PRIMARY KEY
, ORDER BY
, etc.
Insert Data
To insert sample data:
INSERT INTO test.orders
VALUES (334, 123, '2021-09-15 14:30:00', 'some comment',
false);
Read Data
Just like any other SQL query:
SELECT OrderID, OrderDate FROM test.orders;
Conclusion
In the first part of the ClickHouse Tutorial Series, we discussed the traits, features, and weaknesses of ClickHouse. Then we saw how to set up an instance with minimum configuration, create a database and table, insert data into it, and read from it.
Top comments (2)
Great! I really enjoy the article.
👏👏👏