— it's very difficult to answer;
— there is no single "silver bullet";
It's easier to answer why every other system is slow.
I will share some principles, ideas, and facts...
... but mostly do some handwaving.
— but there is no such thing as the best algorithm!
How not to write a DBMS:
We will write code in C++ / Rust / Asm (it will be fast!)
and use simdjson (isn't it fast?) and Apache Arrow (it should be fast),
we will use mmap (I heard it's fast), or no... make it an in-memory database (RAM is fast!),
and it will run on the asynchronous Seastar framework (fast?)
... and it will be faster than ClickHouse!
Daniel Lemire, the author of the simdjson library
But it is not enough :(
Example: so you want to parse and process JSON data, a lot of it.
The fastest library to parse JSON is simdjson.
And ClickHouse is using it.
"The fastest command-line tools for querying large JSON datasets"
feat.: jq, dsq, trdsql, spark-sql, miller, spyql, octosql and ClickHouse.
https://github.com/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb
The best performance is shown by ClickHouse and OctoSQL.
And ClickHouse is using simdjson. Is it the reason for high performance?
Yes, ClickHouse uses simdjson, but in a different context: it is not used for parsing input data.
Q: Does ClickHouse beat simdjson in speed?
A: Yes and No.
Why are ClickHouse and OctoSQL faster at JSON processing?
Not because they use simdjson; they don't even need it.
It's because they parse the JSON, learn the structure,
and then reuse it for specialized, structured parsing!
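Here is a rough sketch of the difference (the file name and fields are made up, not from the benchmark). Generic parsing extracts fields from an opaque string on every row; structured parsing declares the schema once and reads straight into typed columns:

-- generic: every row is an opaque string, fields are extracted dynamically
SELECT JSONExtractString(line, 'user') AS user, count()
FROM file('events.json', 'LineAsString', 'line String')
GROUP BY user

-- structured: the schema is declared once, values go straight into typed columns
SELECT user, count()
FROM file('events.json', 'JSONEachRow', 'user String, duration UInt32')
GROUP BY user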
— every algorithm makes different sense in a different context.
Example: let's take the best hash function.
Where are you going to use it?
— in hash tables;
— in bloom filters;
— in hyperloglogs;
— as a function in SQL;
But if you provide it as a function in SQL:
SELECT myBestHash(x) FROM table
You can no longer use it for hash tables, bloom filters, and hyperloglogs*
Example:
CREATE TABLE ... ENGINE = Distributed(... myBestHash(x) % 64)
SELECT ... FROM table GROUP BY x
— interference of hash functions.
* — with the same seed; but seeding does not always help, e.g. CityHash is flawed.
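A quick way to see the interference (cityHash64 stands in for the hypothetical myBestHash; the data is synthetic): a single shard only receives keys from one residue class of the same hash, so any structure keyed by that hash on the shard is badly skewed.

-- what one shard of Distributed(... myBestHash(x) % 64) would observe:
SELECT uniqExact(cityHash64(number) % 64)
FROM numbers(1000000)
WHERE cityHash64(number) % 64 = 5
-- returns 1: every key on this shard lands in the same 1/64 of the hash space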
What is the best hash function? How did the author test it?
Typical mistake:
— the author tested the hash function on every input size;
— with inputs of 1 byte, 2 bytes, ... 111 bytes, etc.;
... but did not test in a loop with a random distribution of input sizes.
— using the same input size in a loop makes
the CPU branch predictor work too well;
... and also did not test in a loop with a not-so-random distribution of input sizes.
— what if a real dataset has 90% strings of size zero,
9% strings of size 1..50 and 1% of strings around 1KB?
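A hedged sketch of the same idea in ClickHouse SQL (string generation is part of the timing here, so treat the results as relative only):

-- fixed-size inputs: the length-dependent branches are perfectly predictable
SELECT max(cityHash64(randomString(50))) FROM numbers(10000000)

-- skewed mix of sizes, closer to a real dataset: 90% empty, 9% short, 1% ~1 KB
SELECT max(cityHash64(randomString(multiIf(r < 90, 0, r < 99, 50, 1000))))
FROM (SELECT rand() % 100 AS r FROM numbers(10000000))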
Malcolm in the Middle S03E06 - Health Scare, Fox Network, 2001
Example: you open a website and it loads in 5 seconds :(
You check a few more times and it loads instantly.
Wrong:
— it was a temporary glitch; ⌛
— it does not reproduce; ⏱
— the problem fixed itself;
— why even bother?
Right:
— my precious request was processed slowly!!!
— let's find it in the logs.
I don't care if millions of other requests are processed quickly,
I want to know why that request was slow.
Why are some queries slower?
WITH
    query_id = 'benchmark_16760_29_2' AS first,
    query_id = 'benchmark_16760_29_3' AS second
SELECT
    PE.Names AS metric,
    anyIf(PE.Values, first) AS v1,
    anyIf(PE.Values, second) AS v2
FROM clusterAllReplicas(default, system.query_log)
ARRAY JOIN ProfileEvents AS PE
WHERE (first OR second)
    AND (event_date = today()) AND (type = 2)
GROUP BY metric
HAVING v1 != v2
ORDER BY
    (v2 - v1) / (v1 + v2) DESC,
    v2 DESC,
    metric ASC
┌─metric─────────────────────────────────────┬─────────v1─┬─────────v2─┐
│ OSReadBytes                                │          0 │ 1985159168 │
│ OSIOWaitMicroseconds                       │          0 │  357210000 │
│ AsynchronousReadWaitMicroseconds           │      37405 │  351576828 │
│ DiskReadElapsedMicroseconds                │     821470 │  358286059 │
│ RemoteFSReadMicroseconds                   │     864991 │  358322240 │
│ RealTimeMicroseconds                       │   25954054 │  885426948 │
│ QueryProfilerRuns                          │         39 │        541 │
│ NetworkReceiveElapsedMicroseconds          │         52 │        574 │
│ NetworkSendBytes                           │     253707 │    2387184 │
│ NetworkSendElapsedMicroseconds             │       1363 │      11219 │
│ RemoteFSUnprefetchedReads                  │          1 │          4 │
│ SystemTimeMicroseconds                     │     398182 │    1488797 │
│ SoftPageFaults                             │      37496 │     108137 │
│ FileOpen                                   │         79 │        216 │
│ OSWriteChars                               │     101040 │     194954 │
│ IOBufferAllocBytes                         │  203671745 │  352569537 │
│ IOBufferAllocs                             │        199 │        341 │
│ OpenedFileCacheHits                        │         13 │         18 │
│ ReadBufferFromFileDescriptorRead           │       2080 │       2119 │
│ ContextLock                                │       4331 │       4409 │
│ RemoteFSPrefetchedReads                    │       2044 │       2080 │
│ RemoteFSPrefetches                         │       2044 │       2080 │
│ OSReadChars                                │ 2115113576 │ 2114992840 │
│ ReadBufferFromFileDescriptorReadBytes      │ 2113531138 │ 2113164617 │
│ RemoteFSCacheReadBytes                     │ 2113528933 │ 2113162412 │
│ ArenaAllocChunks                           │        294 │        275 │
│ ArenaAllocBytes                            │   52285440 │   42323968 │
│ OpenedFileCacheMisses                      │         22 │         17 │
│ OSCPUVirtualTimeMicroseconds               │   17664773 │   11345988 │
│ UserTimeMicroseconds                       │   17266392 │    9859270 │
│ OSCPUWaitMicroseconds                      │     604454 │       3598 │
│ AggregationHashTablesInitializedAsTwoLevel │         35 │          0 │
└────────────────────────────────────────────┴────────────┴────────────┘
Recipes: Always-on profiling. Telemetry and introspection.
What is every thread on the server doing right now?
SELECT arrayStringConcat(arrayMap(
x -> demangle(addressToSymbol(x)), trace), '\n')
FROM system.stack_trace
Where does the query spend most of its time?
SELECT arrayStringConcat(arrayMap(
x -> demangle(addressToSymbol(x)), trace), '\n')
FROM system.trace_log WHERE query_id = '...'
How many resources did the query consume?
SELECT ProfileEvents FROM system.query_log
WHERE query_id = '...'
If you feel bored on a Friday night, here's what to do:
1. Run your program under strace -f and explain the output.
2. Attach gdb to your program and run thread apply all backtrace.
3. Run sudo perf top on production. Look at the annotated asm.
4. If a query runs in 1 second, ask why not 500 ms.
5. If a query runs in 1 ms, ask why it took a whole millisecond.
You cannot design a fast DBMS sitting in your auditorium
at the University of Something
... if you do, your DBMS will be fast on paper
or on a few synthetic benchmarks (it's time to forget about TPC)
What to do:
— optimize for real production workloads;
— the more diverse workloads the better;
— do quick development iterations.
BusTub — an educational DBMS from CMU with no intention of production use. And I like the logo.
I like benchmarks... I collect them:
https://github.com/ClickHouse/ClickHouse/issues/22398
Not everyone likes benchmarks:
You may not use the Offerings:
7. in order to benchmark Offerings or to build similar
or competitive products or services.
— Snowflake Acceptable Use Policy
Do benchmarks. Don't prevent others from doing benchmarks.
Learn and improve. Don't do benchmarketing.
Let your users do the marketing for you!
If your system is heavily optimized,
most of the new ideas are not going to work.
But you still have to collect ideas and try all of them!
And select one idea out of hundreds.
Examples:
— optimize hash table lookup with __builtin_prefetch; did not work;
— optimize hash table lookup with SIMD instructions; did not work;
— do instruction-parallel computation of hash functions; did not work;
— use Robin Hood hash table algorithm for GROUP BY; did not work;
— use Swiss Table hash table algorithm for GROUP BY; did not work;
— allocating more memory to avoid wraparound; did not work;
— predicting hash table sizes for the query; worked! #33439
Don't be afraid of changes. Rewrite the code all the time.
Automated CI tests will do the job.
Must have: functional and integration tests, fuzzing, and sanitizers.
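For reference, a ClickHouse functional test is just a pair of files in tests/queries/0_stateless: a .sql script and a .reference file with its expected output (the test name below is made up):

-- 99999_group_by_smoke.sql (hypothetical)
SELECT number % 3 AS k, count() FROM numbers(10) GROUP BY k ORDER BY k;
-- the expected output lives next to it in 99999_group_by_smoke.reference;
-- the test runner diffs the actual output against it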
Automated performance regression testing:
https://clickhouse.com/blog/testing-the-performance-of-click-house/
Example: you speed up one place in the code by 50%,
but the average performance decreases by 1% — the change is not approved.
The most recent optimization from Maksim Kita:
SET compile_sort_description=0;
SELECT * FROM test_nullable_table
ORDER BY value
— 0 rows in set. Elapsed: 5.299 sec. Processed 100.00 million rows, 1.70 GB (18.87 million rows/s., 320.81 MB/s.)
The most recent optimization from Maksim Kita*:
SET compile_sort_description=1;
SELECT * FROM test_nullable_table
ORDER BY value
— 0 rows in set. Elapsed: 4.131 sec. Processed 100.00 million rows, 1.70 GB (24.21 million rows/s., 411.50 MB/s.)
— +30% performance improvement!
* — among 100 others by the same author.
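A hypothetical way to reproduce such a comparison yourself (the table definition, column type, and data are my assumptions, not from the slide):

CREATE TABLE test_nullable_table (value Nullable(Float64))
ENGINE = MergeTree ORDER BY tuple();

INSERT INTO test_nullable_table SELECT rand() FROM numbers(100000000);

SET compile_sort_description = 0;
SELECT * FROM test_nullable_table ORDER BY value FORMAT Null;

SET compile_sort_description = 1;
SELECT * FROM test_nullable_table ORDER BY value FORMAT Null;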
Source: https://imgur.com/gallery/YegLM, 2016.
ClickHouse works fine on databases of 100 trillion rows (10^14)
in production.
Tests should also run at scale.
Example: Wikipedia pageviews statistics, public domain:
https://dumps.wikimedia.org/other/pageviews/
370 billion records, 1 billion time series 2015..2022.
I've loaded them to ClickHouse.
I've loaded them again.
I've loaded them in 10 different ways while testing ClickHouse Cloud.
Now ClickHouse eats this dataset for breakfast
(with 40 million rows/second insert speed per node).
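A minimal sketch of what such a load looks like (the table name and simplified schema are mine; one hourly dump file as an example):

CREATE TABLE wikistat
(
    time DateTime,
    project LowCardinality(String),
    path String,
    hits UInt64
)
ENGINE = MergeTree ORDER BY (path, time);

INSERT INTO wikistat
SELECT
    toDateTime('2022-01-01 00:00:00') AS time,
    parts[1] AS project,
    parts[2] AS path,
    toUInt64OrZero(parts[3]) AS hits
FROM
(
    SELECT splitByChar(' ', line) AS parts
    FROM url('https://dumps.wikimedia.org/other/pageviews/2022/2022-01/pageviews-20220101-000000.gz',
             'LineAsString', 'line String')
)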
Source: Aleksandr Kryukov, 2013, https://www.youtube.com/watch?v=ahQIQxq8JY4
Always challenge known beliefs.
ClickHouse is probably not good at large BLOBs?
— but what if I load 600 GB of HTML pages into ClickHouse?
Yes, I did it. And it went alright. #18842
ClickHouse is probably not good at direct inserts from web pages?
— but what if I build a pastebin service on top of this
and persuade all my team to use it?
Yes, I did it. Welcome https://pastila.nl/
What will happen if I do 10 million HTTP requests with the url table function?
— it did not go alright, but we've fixed it! #29696
Always challenge known beliefs.
ClickHouse is probably not good
— as a key-value database?
— as a streaming data processing system?
— as a multi-tenant service with workload isolation?
— for vector search in embeddings?
— for full text search?
Hold my beer...
... or read the ClickHouse roadmap 2022.
Source: Eh Bee Family, 2014, https://vine.co/v/OwTOez0W6wg
Do whatever you want
if you can make observations and derive conclusions.
Example: I've loaded the first 1 trillion prime numbers into ClickHouse:
CREATE TABLE primes (n UInt64 CODEC(Delta, ZSTD))
ENGINE = ReplicatedMergeTree ORDER BY n
The idea has no direct application: it's possible to generate the numbers on the fly within subranges at a decent speed.
But this is how I've tested the reliability of
ClickHouse Keeper, MergeTree over S3, and Replicated databases.
For the curious readers: prove analytically that ZSTD compression is better than LZ4 on this dataset;
prove that increasing ZSTD level does not make sense; estimate the size of the dataset on disk.
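One way to start on the last question is to ask ClickHouse itself (assuming the table above is populated):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'primes'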
What is it?
clickhouse-local --query "
WITH number + 1 AS x
SELECT x % 15 = 0 ? 'FizzBuzz'
: (x % 3 = 0 ? 'Fizz'
: (x % 5 = 0 ? 'Buzz' : x::String))
FROM system.numbers" | pv >/dev/null
Is it fast?
clickhouse-local --query "
WITH number * 15 AS x
SELECT format('{}\n{}\nFizz\n{}\nBuzz\nFizz\n{}\n{}\nFizz\nBuzz\n{}\nFizz\n{}\n{}\nFizzBuzz\n',
(x + 1)::String, (x + 2)::String, (x + 4)::String, (x + 7)::String,
(x + 8)::String, (x + 11)::String, (x + 13)::String, (x + 14)::String)
FROM system.numbers FORMAT RawBLOB" | pv > /dev/null
How to write a fast DBMS?
— use good algorithms in a good context;
— dig into details;
— measure everything;
— be close to production;
— do benchmarks, more of them;
— never stop experimenting;
— test at scale;
— do weird stuff;
— have fun!
Get a life! Start contributing to ClickHouse!