Author: Alexey Milovidov, 2017-12-06.
Alexey, ClickHouse developer.
ClickHouse is a distributed analytical column-oriented DBMS.
I will talk about several details of ClickHouse's operation
that I consider important.
ClickHouse is a modular system.
It supports various storage engines.
A storage engine is responsible for:
— storing, reading and writing data;
— synchronizing data access;
— data replication.
Examples of simple Storage Engines:
Memory, File, TinyLog.
If you want to add your own table engine to ClickHouse,
start by looking at the simplest ones.
Examples of specialized Storage Engines:
Distributed, Merge, View, Null.
For serious applications, use tables of the MergeTree family.
Similar to LSM-Tree (HBase, Cassandra, RocksDB…), but worse.
The table consists of a set of data "parts".
Each part is lexicographically sorted
by the primary key value.
Different parts are completely independent
and may overlap in key values.
/opt/clickhouse/data/default/hits_single_partition# du -h --max-depth=1 | sort -k2 -V
102G    .
 40G    ./all_1_1026_4
 47G    ./all_1027_2210_4
 15G    ./all_2211_2612_4
1.1G    ./all_2613_2640_2
2.2M    ./all_2641_2641_0
Within one part, each column is stored in a separate file.
Column data is serialized values laid out sequentially,
then compressed in blocks. Block size before compression is 64 KB .. 1 MB.
all_1_1026_4# ls -1 AdvEngineID.bin AdvEngineID.mrk Age.bin Age.mrk BrowserCountry.bin BrowserCountry.mrk BrowserLanguage.bin BrowserLanguage.mrk checksums.txt ... primary.idx
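The layout described above can be sketched in a few lines of Python. This is a toy model, not ClickHouse's real on-disk format (the helper names `write_column` and `read_column` are made up): values are serialized sequentially, then compressed block by block.

```python
# Toy sketch of columnar storage (not ClickHouse's real on-disk format):
# a column's values are laid out sequentially, then compressed in blocks.
import struct
import zlib

BLOCK_SIZE = 64 * 1024  # compress in ~64 KB chunks, the lower bound above

def write_column(values):
    """Serialize a column of UInt32 values, compressing block by block."""
    raw = b"".join(struct.pack("<I", v) for v in values)
    return [zlib.compress(raw[off:off + BLOCK_SIZE])
            for off in range(0, len(raw), BLOCK_SIZE)]

def read_column(blocks):
    raw = b"".join(zlib.decompress(b) for b in blocks)
    return [v for (v,) in struct.iter_unpack("<I", raw)]

col = list(range(100_000))
blocks = write_column(col)   # 400 KB of raw data -> 7 compressed blocks
```

Note that reading even one value forces decompressing a whole block — this is the cost of point lookups discussed below.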
The entire part is sorted by primary key.
Data from each column has a corresponding order.
(The N-th row corresponds to the N-th value in each column file.)
During SELECT, in the worst case, reading from all parts is performed.
Additionally, each part has a min-max index of the partitioning key.
During INSERT, the batch of inserted data is sorted and a new part is formed in the file system.
In the background, several parts are merged together into one larger part.
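The background merge can be sketched as a k-way merge of sorted sequences, as in LSM-trees. A toy model (not ClickHouse's implementation), with parts as lists of (key, row) tuples:

```python
# A background merge combines several sorted parts into one larger sorted
# part. Toy sketch: each part is a list of (key, row) tuples.
import heapq

def merge_parts(parts):
    """Merge several parts, each sorted by primary key, into one part."""
    return list(heapq.merge(*parts))  # k-way merge preserves the sort order

part1 = [(1, "a"), (5, "b"), (9, "c")]
part2 = [(2, "d"), (5, "e")]          # key ranges of parts may overlap
merged = merge_parts([part1, part2])  # one part, still sorted by key
```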
There is a single index, called the "primary key".
The primary key does not have to be unique
- it's just a key for ordering data.
(clustered index)
The index is sparse.
Does not work well for point lookup queries.
A sparse index addresses not every record in the table,
but a range of several thousand records.
Instead of reading one row, you have to read at least a batch.
… and you also have to decompress compressed blocks.
Requires data ordering.
For one table — only one sparse index.
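A lookup through a sparse index can be sketched as follows. This is toy code, not ClickHouse's implementation (names like `granule_for` are made up): the index stores the key of every N-th row, a binary search finds the granule, and the engine must read the whole granule.

```python
# Toy sketch of a sparse index (not ClickHouse's real code).
# The index keeps the primary-key value of every N-th row ("marks");
# a lookup finds the granule by binary search, then the engine reads
# the whole granule — a batch of rows, never a single row.
import bisect

GRANULARITY = 8192  # one mark per 8192 rows (ClickHouse's default)

def build_marks(sorted_keys):
    return sorted_keys[::GRANULARITY]  # very compact: fits in memory

def granule_for(marks, key):
    """Index of the granule whose key range may contain `key`
    (assumes distinct keys; with duplicates, neighbours may match too)."""
    return max(bisect.bisect_right(marks, key) - 1, 0)

keys = list(range(100_000))   # the part is sorted by primary key
marks = build_marks(keys)     # only 13 marks for 100 000 rows
g = granule_for(marks, 20_000)
batch = keys[g * GRANULARITY:(g + 1) * GRANULARITY]  # read at least this
```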
Very compact, fits in memory.
Support for complex logical expressions.
— binary search with ternary logic and interval arithmetic.
Example. There is a complex condition
WHERE a AND (b OR NOT c)...
We sequentially split the set of all data
into range granules, where the condition is:
— always false;
— always true;
— can be both false and true.
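The ternary evaluation above can be sketched like this (toy code, not ClickHouse's `KeyCondition`; the helper names are invented). Only ranges that do not evaluate to "always false" need to be read.

```python
# Toy sketch of index analysis with Kleene's three-valued logic.
# Over each key range the WHERE expression evaluates to one of three
# values; only ranges that are not "always false" are read.
ALWAYS_FALSE, UNKNOWN, ALWAYS_TRUE = -1, 0, 1

def t_not(x):
    return -x

def t_and(x, y):
    return min(x, y)  # Kleene's strong three-valued AND

def t_or(x, y):
    return max(x, y)

def eq_over_range(lo, hi, value):
    """Ternary result of `column = value` over the key range [lo, hi]."""
    if value < lo or value > hi:
        return ALWAYS_FALSE
    if lo == hi == value:
        return ALWAYS_TRUE
    return UNKNOWN

# WHERE a = 5 AND (b = 3 OR NOT c = 1), evaluated over one range:
a = eq_over_range(0, 10, 5)   # UNKNOWN: the range may contain 5
b = eq_over_range(3, 3, 3)    # ALWAYS_TRUE
c = eq_over_range(7, 9, 1)    # ALWAYS_FALSE, so NOT c is ALWAYS_TRUE
result = t_and(a, t_or(b, t_not(c)))  # UNKNOWN: read and filter the range
```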
The index works even with a condition on a non-prefix of the primary key.
Example. PK: (CounterID, Date).
WHERE Date BETWEEN '2017-11-20' AND '2017-11-29'
— will still be able to filter out ranges, provided that
for a fixed CounterID there are sufficiently long runs of rows.
The index always works no worse than full scan.
Even if 95% of the data is selected by the index,
it's better than reading 100%.
Support for conditions with partially monotonic functions on columns.
Example. PK: Date
WHERE toDayOfMonth(Date) = 29
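The idea can be sketched as follows (toy code, not ClickHouse's monotonicity analysis; `day_range_if_monotonic` is an invented helper). toDayOfMonth is only piecewise monotonic: it grows within one month and resets at month boundaries, so a key range can be mapped through it only when the range stays inside one piece.

```python
# Toy sketch: using a partially monotonic function in index analysis.
# If a key range [lo, hi] lies inside one monotonic piece of f, then the
# condition f(Date) = 29 maps to a range check on f; otherwise the range
# stays "maybe" and has to be read.
from datetime import date

def day_range_if_monotonic(lo, hi):
    """Range of toDayOfMonth over [lo, hi], or None if not monotonic."""
    if (lo.year, lo.month) == (hi.year, hi.month):
        return (lo.day, hi.day)  # one monotonic piece: days only grow
    return None                  # crosses a month boundary: give up

# A key range covering 2017-11-10 .. 2017-11-20 cannot contain day 29:
r = day_range_if_monotonic(date(2017, 11, 10), date(2017, 11, 20))
can_skip = r is not None and not (r[0] <= 29 <= r[1])
```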
Example: parallelogram index analysis.
(vectorized query execution)
Two ways:
1. Query compilation.
Examples: MemSQL, Impala.
2. Vectorized processing.
Examples: kdb, VectorWise (Actian Vector).
Better — both together.
ClickHouse — vectorized query execution
… with minor query compilation capabilities.
Data is processed in "blocks"
— small pieces of columns.
Each column in memory is a flat array (one or more).
(Structure of Arrays, see also Apache Arrow).
Block size is chosen small for cache locality.
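Block-at-a-time processing can be sketched like this (a toy model, not ClickHouse's engine; the helper names are invented): columns are flat arrays, and an operation runs over a whole block of a column at once instead of dispatching per row.

```python
# Toy sketch of vectorized (block-at-a-time) execution.
# Each column is a flat array; operations are applied to whole blocks.
BLOCK_SIZE = 8192  # small, so a block's columns stay in CPU cache

def read_blocks(column_a, column_b):
    """Yield pairs of column pieces — blocks, not rows."""
    for off in range(0, len(column_a), BLOCK_SIZE):
        yield (column_a[off:off + BLOCK_SIZE],
               column_b[off:off + BLOCK_SIZE])

def plus(a_block, b_block):
    # one tight loop per block; in C++ this is where SIMD applies
    return [x + y for x, y in zip(a_block, b_block)]

a = list(range(20_000))
b = [1] * 20_000
result = [v for ab, bb in read_blocks(a, b) for v in plus(ab, bb)]
```

The per-row interpretation overhead is paid once per block, not once per value — this is the main point of the list of advantages below.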
Advantages:
— no overhead for operation dispatching;
— possibilities for applying SIMD instructions;
— possibilities for applying operations to the entire array at once
(example — substring search in a string);
— good code isolation;
Disadvantages:
— code bloat due to supporting specializations for different data types;
— overhead for reading and writing temporary data to cache;
— possibly fewer instructions per clock,
because each loop performs only one operation;
— complexity with implementing short-circuit operations;
— complex interface for UDFs (user-defined functions).
Parallel aggregation.
Distributed aggregation.
External memory aggregation.
Incremental aggregation.
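Parallel aggregation, the first item above, can be sketched like this (toy code, not ClickHouse's implementation; `aggregate_part` and `parallel_count` are invented names): each thread builds a local hash table of partial aggregation states, and the partial tables are merged at the end.

```python
# Toy sketch of parallel aggregation: COUNT(*) GROUP BY key.
# Each worker aggregates its own slice into a partial hash table;
# the partial tables are then merged into the final result.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def aggregate_part(rows):
    """Partial aggregate over one slice of rows."""
    return Counter(rows)

def parallel_count(slices):
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(aggregate_part, slices))
    total = Counter()
    for partial in partials:   # merge step: combine partial states
        total.update(partial)
    return total

counts = parallel_count([["a", "b", "a"], ["b", "c"]])
```

The same split-then-merge shape underlies distributed aggregation (partial states travel over the network) and external-memory aggregation (partial states are spilled to disk).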
Website: https://clickhouse.com/
Google groups: https://groups.google.com/forum/#!forum/clickhouse
Email: [email protected]
Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (already 1100 participants)
GitHub: https://github.com/ClickHouse/ClickHouse/
Twitter: https://twitter.com/ClickHouseDB
+ meetups. Moscow, Saint Petersburg, Novosibirsk,
Yekaterinburg, Minsk, Berlin, Palo Alto, Nizhny Novgorod…
then Moscow again.