— it's very difficult to answer;
— there is no single "silver bullet";
It's easier to answer why every other system is slow.
I will share some principles, ideas, and facts...
... but mostly do some handwaving.
— but there is no such thing as the best algorithm!
How not to write a DBMS:
We will write code in C++ / Rust / Asm (it will be fast!)
and use simdjson (isn't it fast?) and Apache Arrow (it should be fast),
we will use mmap (I heard it's fast), or no... make it an in-memory database (RAM is fast!),
and it will run on the asynchronous Seastar framework (fast?)
... and it will be faster than ClickHouse!
Daniel Lemire, the author of the simdjson library
But it is not enough :(
Example: so you want to parse and process JSON data, a lot of it.
The fastest library to parse JSON is simdjson.
And ClickHouse is using it.
"The fastest command-line tools for querying large JSON datasets"
feat.: jq, dsq, trdsql, spark-sql, miller, spyql, octosql and ClickHouse.
https://github.com/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb
The best performance is shown by ClickHouse and OctoSQL.
And ClickHouse is using simdjson. Is it the reason for high performance?
Yes, ClickHouse uses simdjson, but in a different context: it is not used for parsing input data.
Q: Does ClickHouse beat simdjson in speed?
A: Yes and No.
Why are ClickHouse and OctoSQL faster at JSON processing?
Not because they use simdjson; they don't even need it.
It's because they parse the JSON, learn the structure,
and then reuse it for specialized, structured parsing!
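Here is a rough sketch of the difference (the file name and fields are made up, not from the benchmark). Generic parsing extracts fields from an opaque string on every row; structured parsing declares the schema once and reads straight into typed columns:

-- generic: every row is an opaque string, fields are extracted dynamically
SELECT JSONExtractString(line, 'user') AS user, count()
FROM file('events.json', 'LineAsString', 'line String')
GROUP BY user

-- structured: the schema is declared once, values go straight into typed columns
SELECT user, count()
FROM file('events.json', 'JSONEachRow', 'user String, duration UInt32')
GROUP BY user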
— every algorithm makes different sense in a different context.
Example: let's take the best hash function.
Where are you going to use it?
— in hash tables;
— in bloom filters;
— in hyperloglogs;
— as a function in SQL;
But if you provide it as a function in SQL:
SELECT myBestHash(x) FROM table
You can no longer use it for hash tables, bloom filters, and hyperloglogs*
Example:
CREATE TABLE ... ENGINE = Distributed(... myBestHash(x) % 64)
SELECT ... FROM table GROUP BY x
— interference of hash functions.
* — with the same seed; but seeding does not always help, e.g. CityHash is flawed.
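A quick way to see the interference (cityHash64 stands in for the hypothetical myBestHash; the data is synthetic): a single shard only receives keys from one residue class of the same hash, so any structure keyed by that hash on the shard is badly skewed.

-- what one shard of Distributed(... myBestHash(x) % 64) would observe:
SELECT uniqExact(cityHash64(number) % 64)
FROM numbers(1000000)
WHERE cityHash64(number) % 64 = 5
-- returns 1: every key on this shard lands in the same 1/64 of the hash space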
What is the best hash function? How did the author test it?
Typical mistake:
— the author tested the hash function on every input size;
— with inputs of 1 byte, 2 bytes, ... 111 bytes, etc.;
... but did not test in a loop with a random distribution of input sizes.
— using the same input size in a loop makes
the CPU branch predictor work too well;
... and also did not test in a loop with a not-so-random distribution of input sizes.
— what if a real dataset has 90% strings of size zero,
9% strings of size 1..50 and 1% of strings around 1KB?
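A hedged sketch of the same idea in ClickHouse SQL (string generation is part of the timing here, so treat the results as relative only):

-- fixed-size inputs: the length-dependent branches are perfectly predictable
SELECT max(cityHash64(randomString(50))) FROM numbers(10000000)

-- skewed mix of sizes, closer to a real dataset: 90% empty, 9% short, 1% ~1 KB
SELECT max(cityHash64(randomString(multiIf(r < 90, 0, r < 99, 50, 1000))))
FROM (SELECT rand() % 100 AS r FROM numbers(10000000))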
Malcolm in the Middle S03E06 - Health Scare, Fox Network, 2001
Example: you open a website and it loads in 5 seconds :(
You check a few more times and it loads instantly.
Wrong:
— it was a temporary glitch; ⌛
— it does not reproduce; ⏱
— the problem fixed itself;
— why even bother?
Right:
— my precious request was processed slowly!!!
— let's find it in the logs.
I don't care if millions of other requests are processed quickly,
I want to know why that request was slow.
Why are some queries slower?
WITH
    query_id = 'benchmark_16760_29_2' AS first,
    query_id = 'benchmark_16760_29_3' AS second
SELECT
    PE.Names AS metric,
    anyIf(PE.Values, first) AS v1,
    anyIf(PE.Values, second) AS v2
FROM clusterAllReplicas(default, system.query_log)
ARRAY JOIN ProfileEvents AS PE
WHERE (first OR second)
    AND (event_date = today()) AND (type = 2)
GROUP BY metric
HAVING v1 != v2
ORDER BY
    (v2 - v1) / (v1 + v2) DESC,
    v2 DESC,
    metric ASC
┌─metric─────────────────────────────────────┬─────────v1─┬─────────v2─┐
│ OSReadBytes                                │          0 │ 1985159168 │
│ OSIOWaitMicroseconds                       │          0 │  357210000 │
│ AsynchronousReadWaitMicroseconds           │      37405 │  351576828 │
│ DiskReadElapsedMicroseconds                │     821470 │  358286059 │
│ RemoteFSReadMicroseconds                   │     864991 │  358322240 │
│ RealTimeMicroseconds                       │   25954054 │  885426948 │
│ QueryProfilerRuns                          │         39 │        541 │
│ NetworkReceiveElapsedMicroseconds          │         52 │        574 │
│ NetworkSendBytes                           │     253707 │    2387184 │
│ NetworkSendElapsedMicroseconds             │       1363 │      11219 │
│ RemoteFSUnprefetchedReads                  │          1 │          4 │
│ SystemTimeMicroseconds                     │     398182 │    1488797 │
│ SoftPageFaults                             │      37496 │     108137 │
│ FileOpen                                   │         79 │        216 │
│ OSWriteChars                               │     101040 │     194954 │
│ IOBufferAllocBytes                         │  203671745 │  352569537 │
│ IOBufferAllocs                             │        199 │        341 │
│ OpenedFileCacheHits                        │         13 │         18 │
│ ReadBufferFromFileDescriptorRead           │       2080 │       2119 │
│ ContextLock                                │       4331 │       4409 │
│ RemoteFSPrefetchedReads                    │       2044 │       2080 │
│ RemoteFSPrefetches                         │       2044 │       2080 │
│ OSReadChars                                │ 2115113576 │ 2114992840 │
│ ReadBufferFromFileDescriptorReadBytes      │ 2113531138 │ 2113164617 │
│ RemoteFSCacheReadBytes                     │ 2113528933 │ 2113162412 │
│ ArenaAllocChunks                           │        294 │        275 │
│ ArenaAllocBytes                            │   52285440 │   42323968 │
│ OpenedFileCacheMisses                      │         22 │         17 │
│ OSCPUVirtualTimeMicroseconds               │   17664773 │   11345988 │
│ UserTimeMicroseconds                       │   17266392 │    9859270 │
│ OSCPUWaitMicroseconds                      │     604454 │       3598 │
│ AggregationHashTablesInitializedAsTwoLevel │         35 │          0 │
└────────────────────────────────────────────┴────────────┴────────────┘
Recipes: Always-on profiling. Telemetry and introspection.
What is every thread on the server doing right now?
SELECT arrayStringConcat(arrayMap(
x -> demangle(addressToSymbol(x)), trace), '\n')
FROM system.stack_trace
Where does the query spend most of its time?
SELECT arrayStringConcat(arrayMap(
x -> demangle(addressToSymbol(x)), trace), '\n')
FROM system.trace_log WHERE query_id = '...'
How many resources did the query consume?
SELECT ProfileEvents FROM system.query_log
WHERE query_id = '...'
If you feel bored on a Friday night, here's what to do:
1. Run your program under strace -f and explain the output.
2. Attach gdb to your program and run thread apply all backtrace.
3. Run sudo perf top on production. Look at the annotated asm.
4. If a query runs in 1 second, ask why not 500 ms.
5. If a query runs in 1 ms, ask why it took a whole millisecond.
You cannot design a fast DBMS sitting in your auditorium
at the University of Something
... if you do, your DBMS will be fast on paper
or on a few synthetic benchmarks (it's time to forget about TPC)
What to do:
— optimize for real production workloads;
— the more diverse workloads the better;
— do quick development iterations.
BusTub — an educational DBMS from CMU with no intention of production use. And I like the logo.
I like benchmarks... I collect them:
https://github.com/ClickHouse/ClickHouse/issues/22398
Not everyone likes benchmarks:
You may not use the Offerings:
7. in order to benchmark Offerings or to build similar
or competitive products or services.
— Snowflake Acceptable Use Policy
Do benchmarks. Don't prevent others from doing benchmarks.
Learn and improve. Don't do benchmarketing.
Let your users do the marketing for you!
If your system is heavily optimized,
most of the new ideas are not going to work.
But you still have to collect ideas and try all of them!
And select one idea out of hundreds.
Examples:
— optimize hash table lookup with __builtin_prefetch; did not work;
— optimize hash table lookup with SIMD instructions; did not work;
— do instruction-parallel computation of hash functions; did not work;
— use Robin Hood hash table algorithm for GROUP BY; did not work;
— use Swiss Table hash table algorithm for GROUP BY; did not work;
— allocating more memory to avoid wraparound; did not work;
— predicting hash table sizes for the query; worked! #33439
Don't be afraid of changes. Rewrite the code all the time.
Automated CI tests will do the job.
Must have: functional and integration tests, fuzzing, and sanitizers.
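For reference, a ClickHouse functional test is just a pair of files in tests/queries/0_stateless: a .sql script and a .reference file with its expected output (the test name below is made up):

-- 99999_group_by_smoke.sql (hypothetical)
SELECT number % 3 AS k, count() FROM numbers(10) GROUP BY k ORDER BY k;
-- the expected output lives next to it in 99999_group_by_smoke.reference;
-- the test runner diffs the actual output against it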
Automated performance regression testing:
https://clickhouse.com/blog/testing-the-performance-of-click-house/
Example: you speed up one place in the code by 50%,
but the average performance decreases by 1% — the change is not approved.
The most recent optimization from Maksim Kita:
SET compile_sort_description=0;
SELECT * FROM test_nullable_table
ORDER BY value
— 0 rows in set. Elapsed: 5.299 sec. Processed 100.00 million rows, 1.70 GB (18.87 million rows/s., 320.81 MB/s.)
The most recent optimization from Maksim Kita*:
SET compile_sort_description=1;
SELECT * FROM test_nullable_table
ORDER BY value
— 0 rows in set. Elapsed: 4.131 sec. Processed 100.00 million rows, 1.70 GB (24.21 million rows/s., 411.50 MB/s.)
— +30% performance improvement!
* — among 100 others by the same author.
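A hypothetical way to reproduce such a comparison yourself (the table definition, column type, and data are my assumptions, not from the slide):

CREATE TABLE test_nullable_table (value Nullable(Float64))
ENGINE = MergeTree ORDER BY tuple();

INSERT INTO test_nullable_table SELECT rand() FROM numbers(100000000);

SET compile_sort_description = 0;
SELECT * FROM test_nullable_table ORDER BY value FORMAT Null;

SET compile_sort_description = 1;
SELECT * FROM test_nullable_table ORDER BY value FORMAT Null;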
Source: https://imgur.com/gallery/YegLM, 2016.
ClickHouse works fine on databases of 100 trillion rows (10^14)
in production.
Tests should also run at scale.
Example: Wikipedia pageviews statistics, public domain:
https://dumps.wikimedia.org/other/pageviews/
370 billion records, 1 billion time series 2015..2022.
I've loaded them to ClickHouse.
I've loaded them again.
I've loaded them in 10 different ways while testing ClickHouse Cloud.
Now ClickHouse eats this dataset for breakfast
(with 40 million rows/second insert speed per node).
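A minimal sketch of what such a load looks like (the table name and simplified schema are mine; one hourly dump file as an example):

CREATE TABLE wikistat
(
    time DateTime,
    project LowCardinality(String),
    path String,
    hits UInt64
)
ENGINE = MergeTree ORDER BY (path, time);

INSERT INTO wikistat
SELECT
    toDateTime('2022-01-01 00:00:00') AS time,
    parts[1] AS project,
    parts[2] AS path,
    toUInt64OrZero(parts[3]) AS hits
FROM
(
    SELECT splitByChar(' ', line) AS parts
    FROM url('https://dumps.wikimedia.org/other/pageviews/2022/2022-01/pageviews-20220101-000000.gz',
             'LineAsString', 'line String')
)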
Source: Aleksandr Kryukov, 2013, https://www.youtube.com/watch?v=ahQIQxq8JY4
Always challenge known beliefs.
ClickHouse is probably not good at large BLOBs?
— but what if I load 600 GB of HTML pages into ClickHouse?
Yes, I did it. And it went alright. #18842
ClickHouse is probably not good at direct inserts from web pages?
— but what if I build a pastebin service on top of this
and persuade all my team to use it?
Yes, I did it. Welcome https://pastila.nl/
What will happen if I do 10 million HTTP requests with the url table function?
— it did not go alright, but we've fixed it! #29696
Always challenge known beliefs.
ClickHouse is probably not good
— as a key-value database?
— as a streaming data processing system?
— as a multi-tenant service with workload isolation?
— for vector search in embeddings?
— for full text search?
Hold my beer...
... or read the ClickHouse roadmap 2022.
Source: Eh Bee Family, 2014, https://vine.co/v/OwTOez0W6wg
Do whatever you want
if you can make observations and derive conclusions.
Example: I've loaded the first 1 trillion prime numbers into ClickHouse:
CREATE TABLE primes (n UInt64 CODEC(Delta, ZSTD))
ENGINE = ReplicatedMergeTree ORDER BY n
The idea has no direct application: it's possible to generate the numbers on the fly within subranges at a decent speed.
But this is how I've tested the reliability of
ClickHouse Keeper, MergeTree over S3, and Replicated databases.
For the curious readers: prove analytically that ZSTD compression is better than LZ4 on this dataset;
prove that increasing ZSTD level does not make sense; estimate the size of the dataset on disk.
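One way to start on the last question is to ask ClickHouse itself (assuming the table above is populated):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'primes'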
What is it?
clickhouse-local --query "
WITH number + 1 AS x
SELECT x % 15 = 0 ? 'FizzBuzz'
: (x % 3 = 0 ? 'Fizz'
: (x % 5 = 0 ? 'Buzz' : x::String))
FROM system.numbers" | pv >/dev/null
Is it fast?
clickhouse-local --query "
WITH number * 15 AS x
SELECT format('{}\n{}\nFizz\n{}\nBuzz\nFizz\n{}\n{}\nFizz\nBuzz\n{}\nFizz\n{}\n{}\nFizzBuzz\n',
(x + 1)::String, (x + 2)::String, (x + 4)::String, (x + 7)::String,
(x + 8)::String, (x + 11)::String, (x + 13)::String, (x + 14)::String)
FROM system.numbers FORMAT RawBLOB" | pv > /dev/null
How to write a fast DBMS?
— use good algorithms in a good context;
— dig into details;
— measure everything;
— be close to production;
— do benchmarks, more of them;
— never stop experimenting;
— test at scale;
— do weird stuff;
— have fun!
Get a life! Start contributing to ClickHouse!