Maksim, developer of ClickHouse.
1. High Level System Architecture.
2. CI/CD Pipeline.
3. Introspection.
4. Abstractions and Algorithms.
5. Libraries.
6. JIT compilation. Dynamic dispatch.
Column-oriented storage — data is physically stored by columns.
Only necessary columns are read from disk during query.
Better compression because of data locality.
Vectorized query execution — data is processed in blocks. A block contains multiple columns with up to max_block_size rows (65505 by default).
Each column is stored as a vector of primitive data types or their combination:
1. Better utilization for CPU caches and pipeline.
2. Data is processed using SIMD instructions.
Numeric columns — PODArray. Almost the same as std::vector.
Nullable columns contain a data column and a UInt8 column that marks whether each element is NULL.
Array columns contain data column and UInt64 column with offsets.
A Const column contains a single constant value.
1. Functional, Integration tests.
2. Run all tests with sanitizers (ASAN, TSAN, MSAN).
3. Fuzzers (data types, compression codecs).
4. AST fuzzer.
5. Stress tests (Our special TSAN stress).
6. Performance tests.
Part of CI/CD pipeline.
Runs for each commit in pull request.
Runs for each commit in the master branch.
Collect different statistics during each performance test run; they can be useful for later debugging.
Processor metrics (CPU cycles, cache misses — the same counters as perf-stat).
ClickHouse-specific profile events (bytes read from disk, bytes transferred over the network, etc.).
https://clickhouse.com/blog/testing-the-performance-of-click-house/
Helps us find performance regressions.
A nice tool that helps find places where performance can be improved:
1. Try different allocators and different libraries.
2. Try different compiler options (loop unrolling, inline threshold).
3. Enable AVX/AVX2/AVX512 for the build.
Collect ProfileEvents for each query:
RealTimeMicroseconds, UserTimeMicroseconds, SystemTimeMicroseconds, SoftPageFaults, HardPageFaults — using the getrusage system call.
Collect taskstats metrics from procfs:
OSCPUVirtualTimeMicroseconds, OSCPUWaitMicroseconds (when /proc/thread-self/schedstat is available).
OSIOWaitMicroseconds (when /proc/thread-self/stat is available).
OSReadChars, OSWriteChars, OSReadBytes, OSWriteBytes (when /proc/thread-self/io is available).
Collect ProfileEvents for each query:
Hardware-specific counters (CPU cache misses, CPU branch mispredictions) using the perf_event_open system call.
https://man7.org/linux/man-pages/man2/perf_event_open.2.html
Collect ProfileEvents for each query:
Different ClickHouse-specific metrics: FileOpen, DiskReadElapsedMicroseconds, NetworkSendBytes.
SELECT PE.Names AS ProfileEventName, PE.Values AS ProfileEventValue
FROM system.query_log ARRAY JOIN ProfileEvents AS PE
WHERE query_id='344b07d9-9d7a-48f0-a17e-6f5f6f3d61f5'
AND ProfileEventName LIKE 'Perf%';
┌─ProfileEventName─────────────┬─ProfileEventValue─┐
│ PerfCpuCycles │ 40496995274 │
│ PerfInstructions │ 57259199973 │
│ PerfCacheReferences │ 2072274618 │
│ PerfCacheMisses │ 146570206 │
│ PerfBranchInstructions │ 8675194991 │
│ PerfBranchMisses │ 259531879 │
│ PerfStalledCyclesFrontend │ 813419527 │
│ PerfStalledCyclesBackend │ 15797162832 │
│ PerfCpuClock │ 10587371854 │
│ PerfTaskClock │ 10587382785 │
│ PerfContextSwitches │ 3009 │
│ PerfCpuMigrations │ 113 │
│ PerfMinEnabledTime │ 10584952104 │
│ PerfMinEnabledRunningTime │ 4348089512 │
│ PerfDataTLBReferences │ 465992961 │
│ PerfDataTLBMisses │ 5149603 │
│ PerfInstructionTLBReferences │ 1344998 │
│ PerfInstructionTLBMisses │ 181635 │
└──────────────────────────────┴───────────────────┘
Periodically collect stack traces from all currently running threads.
Currently we use a patched fork of LLVM libunwind.
Check the current stack traces of all threads via system.stack_trace:
WITH arrayMap(x -> demangle(addressToSymbol(x)), trace) AS all
SELECT thread_name, thread_id, query_id, arrayStringConcat(all, '\n') AS res
FROM system.stack_trace LIMIT 1 FORMAT Vertical;
Row 1:
──────
thread_name: clickhouse-serv
thread_id: 125441
query_id:
res: pthread_cond_wait
std::__1::condition_variable::wait(std::__1::unique_lock&)
BaseDaemon::waitForTerminationRequest()
DB::Server::main(/*arguments*/)
Poco::Util::Application::run()
DB::Server::run()
Poco::Util::ServerApplication::run(int, char**)
mainEntryClickHouseServer(int, char**)
main
__libc_start_main
_start
Generate flamegraph of query execution
./clickhouse-client --query="SELECT
arrayStringConcat(
arrayMap(x -> concat(
splitByChar('/', addressToLine(x))[-1],
'#',
demangle(addressToSymbol(x))),
trace),
';') AS stack,
count(*) AS samples
FROM system.trace_log
WHERE (trace_type = 'Real') AND (query_id = '344b07d9-9d7a-48f0-a17e-6f5f6f3d61f5')
GROUP BY trace" | flamegraph.pl
For high-performance systems, interfaces must be determined by algorithms.
A top-down approach does not work.
A high-performance system must be designed by concentrating on doing a single task efficiently.
It must be designed starting from hardware capabilities.
ClickHouse was designed to efficiently FILTER and GROUP BY data that fits in RAM.
There is no silver bullet, or best algorithm for any task.
Try to choose the fastest possible algorithm/algorithms for your specific task.
Performance must be evaluated on real data.
Most of the algorithms are affected by data distribution.
A complex task can be viewed as a number of small tasks.
Such small tasks can also be viewed as special cases that can be optimized.
For any task there are dozens of different algorithms that can be combined together (examples: sorting, aggregation).
Each algorithm can be tuned later using different low-level optimizations (Data layout, Specializations, SIMD instructions, JIT compilation).
High-level design decision — data must be processed not only by multiple threads, but by multiple servers.
Core component is the HashTable framework.
Different HashTables for different types of keys (a special StringHashTable for strings).
Additional specializations for Nullable and LowCardinality.
Tuned a lot of low-level details: allocations, structure layout in memory, batching multiple operations to avoid virtual function calls.
Added JIT compilation for special cases.
Added cache of hash-table sizes.
Optimizing performance is about trying different approaches.
Most of the time without any results.
If someone on the Internet claims "my algorithm is the fastest", we will try it in ClickHouse.
We always try to find interesting algorithms and solutions.
ClickHouse/contrib$ ls | grep -v "cmake" | wc -l
95
1. Different algorithms for parsing floats and JSON (multiple libraries).
2. A lot of integrations.
3. Embedded storages.
4. LLVM for JIT compilation.
5. libcxx (C++ standard library).
Our CI system finds bugs in almost any library; we report them to the maintainers.
We also maintain forks of several libraries with a lot of changes, for example POCO and LLVM libunwind.
We are not afraid of adding another contrib — our CI system will do the job.
JIT compilation can transform dynamic configuration into static configuration.
Not all functions and not all algorithms can be easily compiled.
Has its own costs (compilation time, memory, maintenance).
But can greatly improve performance in special cases.
Compile evaluation of multiple expressions. Example: SELECT a + b * c + 5 FROM test_table;
Compile special cases for GROUP BY. Example: SELECT sum(a), avg(b), count(c) FROM test_table;
Compile comparator in ORDER BY. Example: SELECT * FROM test_table ORDER BY a, b, c;
In all cases we transform dynamic configuration into static.
My presentation on JIT in ClickHouse from C++ Russia 2021:
ClickHouse is distributed as a portable binary.
We use the old SSE4.2 instruction set as the baseline.
To use AVX, AVX2, or AVX512 instructions, we need runtime specialization selected using CPUID.
In addition, a lot of companies contribute SIMD optimizations to us (ContentSquare, Intel); previously most such optimizations were disabled at compile time.
It is important that compilers can vectorize even complex loops. We can rely on this.
Main idea: apply compiler flags to some functions to compile them with AVX, AVX2, or AVX512.
Then at runtime check CPUID and execute the specialized function.
1. Improved performance of unary functions by 1.15-7 times.
2. Improved performance of the sum and avg aggregate functions by 1.2-1.8 times when there are no expressions in GROUP BY.
For AVX2 we use such optimizations a lot.
For AVX512 we currently do not apply many such optimizations: AVX512 frequency throttling could decrease the performance of other parts of the system.
Recent Intel processors such as Rocket Lake and Ice Lake fix this issue; we can detect them at runtime and then enable the optimizations.
https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
1. CI/CD infrastructure, especially performance tests, must be the core component of a high performance system.
2. Without deep introspection it is hard to investigate issues with performance.
3. For high performance systems interfaces must be determined by algorithms.
4. Add specializations for special cases.
5. Tune your performance using low-level techniques (Data layout, JIT compilation, Dynamic Dispatch).