Author: Alexey Milovidov, 2019-11-30.
Alexey — ClickHouse developer.
ClickHouse — analytical DBMS.
Main product feature
— high query processing speed.
and basic principles.
To improve something — you need to measure it.
Break it down into components and measure the contribution of each.
Find out what is the main bottleneck of the system.
— a profiler can be used for this.
Low-level:
— which functions in my code spend more time?
(of course in memcpy... or malloc)
High-level:
— which data transformation takes more time?
(JOIN? Sorting? Aggregation?)
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
Yes, ClickHouse has optimized memcpy!
Actually, even several...
— «Chinese» memcpy;
— memcpy for gathering short memory chunks:
memcpySmallAllowReadWriteOverflow15
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
But usually we just need to call it less:
std::vector<int> vec;
for (int i = 0; i < n; ++i)
vec.push_back(i); // memcpy
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
But usually we just need to call it less:
std::vector<int> vec;
vec.reserve(n);
for (int i = 0; i < n; ++i)
vec.push_back(i);
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
But usually we just need to call it less:
std::vector<int> vec;
vec.resize(n); // memset
for (int i = 0; i < n; ++i)
vec[i] = i;
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
But usually we just need to call it less:
std::unordered_map<int, std::vector<char>> map;
...
for (const auto elem : map) { ... }
The profiler shows that most time
is spent executing memcpy.
Need to optimize this function?
But usually we just need to call it less:
std::unordered_map<int, std::vector<char>> map;
...
for (const auto & elem : map) { ... }
Number of cycles or CPU time;
— rdtsc/rdtscp, OS information (getrusage)
Number of instructions processed;
— CPU counters (through kernel — perf_event_open
or directly — rdpmc)
— run in emulator (valgrind)
Number of bytes read from disk (or from file system?):
— application counters or OS information.
Real time between events (or its approximation?);
— timers in application;
— sampling;
...
Using CPU profiler.
We optimized memcpy function, but the program didn't speed up.
Because our program is bottlenecked by:
— disk reads;
— receiving data over network;
— waiting on mutex;
— sleep;
Instrumentation:
— add code to the program
that measures its work;
Sampling:
— at random moments in time, look
where code execution is;
Internal profiler:
— application contains profiler code inside itself;
(1) explicitly in source code;
(2) without changing source code:
profiler added during compilation (gprof (-pg), LLVM XRay)
or during linking (Google CPU profiler);
External profiler:
— application requires no changes;
— external program or OS kernel monitors it
(poor man's profiler, Linux Perf);
Running application under emulation.
Example: valgrind, slowdown of 50x and more.
Adding instructions to every function.
Example: gprof, slowdown of tens of times.
Adding nop instructions to large functions and on-the-fly patching.
Example: LLVM XRay, negligible impact if not used.
Sampling.
Example: Linux Perf, can be enabled in production.
Collecting ready metrics.
Example: rdpmc, almost free.
gprof valgrind linux perf
oprofile Coz VTune Google CPU profiler
Poor Man's profiler Pin LIKWID
DTrace SystemTap
Keep the profiler constantly enabled in production:
— need near-zero performance impact;
Be able to associate profiles with specific queries:
— isolate specific queries on a loaded server
and view their profiles;
Store profiling data in ClickHouse:
— aggregate profiles by query types,
by users, across cluster;
Maybe a sampling profiler will work for us?
At certain intervals, check where
code execution was in all threads. Example: Poor Man's Profiler:
#!/bin/bash
for x in $(seq 1 $nsamples)
do
gdb -ex "set pagination 0" \
-ex "thread apply all bt" -batch -p $pid
sleep $sleeptime
done | \
awk '
BEGIN { s = ""; }
/^Thread/ { print s; s = ""; }
/^\#/ { if (s != "" ) { s = s "," $4} else { s = $4 } }
END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1
At certain intervals, check where
code execution was in all threads. Example: Google CPU Profiler:
Set a timer in each thread,
which will send a signal* to that thread.
For example, 100 times per second.
In the signal handler, capture the stack trace.
Aggregate stack traces.
Looks trivial!
* alternative — use ptrace from auxiliary process.
Develop a sampling profiler in ClickHouse, where
collected information will be tied to queries.
Easy, because each thread in ClickHouse executes one query at a time, and we know which one.
Looks trivial!
I didn't do this task :)
All development — Nikita Lapkov;
— Yandex banner system developer;
— Facebook developer;
— ClickHouse contributor;
— strong professional;
— and generally a good person.
When we started the task,
it turned out to be
non-trivial.
Very non-trivial.
«Rick and Morty», ep. «Rest and Ricklaxation», Justin Roiland and Dan Harmon, 2017
How to do something in a signal handler?
In a signal handler, you can only call signal-safe functions.
Can't lock mutex*, wait on condvar, allocate memory...
Can write to file descriptor.
Can pass information to a separate thread via pipe,
and in that thread do everything else: log, insert into table.
* question: can you use spinlock in a signal handler?
1. Frame pointer chain walking.
Doesn't work if code is compiled with -fomit-frame-pointer.
This is default when building with optimizations.
Can build code with -fno-omit-frame-pointer. But that's worse;
glibc library is already built with -fomit-frame-pointer.
Question: ASan, MSan, TSan use exactly this method. Why does it work?
2. backtrace function from execinfo.h — part of libgcc.
Internally does dlopen, locks mutex,
then calls internal unwind implementation from libgcc.
Not suitable for calling from signal handler.
3. Using libunwind.
Should work, because see asynchronous-unwind-tables.
There are at least three variants of using libunwind:
— libunwind from libgcc*;
— libunwind "nongnu";
— LLVM libunwind;
And all three variants allocate memory and lock mutex.
Not suitable for calling from signal handler.
* a C++ program with exception support already uses part of libunwind internally. Another libunwind implementation can be connected either separately — without symbol name conflicts, or as a complete replacement for exception handling implementation.
4. Collect stack trace not from signal handler.
In signal handler pass ucontext to another thread via pipe.
But to convert ucontext to stack trace,
the thread must be frozen.
— in signal handler have to wait — very slow;
— indirect locking — unacceptable.
Chose the option — patch libunwind.
1. nongnu libunwind doesn't pass tests under address sanitizer.
Decided to just switch to LLVM libunwind.
Bonus: also use LLVM libunwind instead of libgcc for C++ exception handling.
Though such build configuration is very inconvenient
to describe in CMake.
-Wl,--start-group
.../libglibc-compatibility.a
.../libcxx.a .../libcxxabi.a
.../libunwind.a
-Wl,--end-group
-nodefaultlibs .../libclang_rt.builtins-x86_64.a
-lc -lm -lrt -lpthread -ldl
2. libunwind allocates memory using malloc.
Possible solution: always allocate memory using mmap
— very slow.
Solution: buffer on stack and stack allocator on top of it.
https://github.com/ClickHouse-Extras/libunwind/commit/500aa227
https://github.com/ClickHouse-Extras/libunwind/commit/ec86b1c6
3. Missing register.
https://github.com/ClickHouse-Extras/libunwind/commit/17a48fbf
4. Incorrect CFA.
unsigned char mincore_res = 0;
if (0 != syscall(SYS_mincore,
(void*)(cfa / 4096 * 4096), 1, &mincore_res))
{
return UNW_EBADFRAME;
}
https://github.com/ClickHouse-Extras/libunwind/commit/36df74d8
5. FDE Cache.
Uses rwlock.
FDE Cache modification — for JIT code registration.
Actually only read lock is used.
Even read lock — not signal safe and leads to deadlock.
Solution — remove FDE Cache.
https://github.com/ClickHouse-Extras/libunwind/commit/96f5f171
6. Using dl_iterate_phdr function.
Solution: lock-free PHDR Cache.
— doesn't work with thread sanitizer;
— need to update manually on dlopen;
https://github.com/libunwind/libunwind/issues/16
https://github.com/ClickHouse/ClickHouse/pull/6124
Solution designed by Fedor Korotkiy, YT developer.
7. Profiler incompatibility with sanitizers.
Reason: sanitizers also use stack unwinding.
For this they use frame pointer and modify the %rbp register.
At this moment libunwind doesn't work.
Solution: don't use profiler under sanitizers.
8. No syscall restart after signals.
man 7 signal
SA_RESTART doesn't work for all syscalls*:
— reading and writing to socket with timeout.
— sleep, usleep, nanosleep, clock_nanosleep.
* in FreeBSD and Mac OS X everything is different.
9. Incorrect syscall restart after signals:
— reading and writing to socket with timeout;
— restart the call, but don't update timeout;
Solution — always do poll before reading.
10. Pipe overflow.
Solution:
— increase pipe size to 1 MB;
— if can't write to pipe — ignore.
Alternative solution:
lock-free circular buffer instead of pipe.
11. Too frequent signals.
Solution: if si_overrun > 0 — skip processing.
1. Enable one or both settings:
SET query_profiler_cpu_time_period_ns = 1000000;
SET query_profiler_real_time_period_ns = 1000000;
2. Run queries.
Data is written to system.trace_log table.
event_date: 2019-09-05
event_time: 2019-09-05 05:47:44
revision: 54425
timer_type: CPU
thread_number: 149
query_id: b1d8e7f9-48d8-4cb3-a768-0a6683f6f061
trace: [140171472847748,61781958,110943821,117594728,117595220,115654933,
120321783,63251928,111161800,120329436,120331356,120308294,120313436,120319113,
120143313,115666412,120146905,111013972,118237176,111013972,117990912,111013972,
110986070,110986938,61896391,61897898,61887509,156206624,140171472807643]
trace — array of addresses (stack trace);
Convert address to function name:
— demangle(addressToSymbol(trace[1]))
Convert address to file name and line number:
— addressToLine(trace[1])
* don't forget to install clickhouse-common-static-dbg package
Example: top functions
SELECT count(),
demangle(addressToSymbol(trace[1] AS addr)) AS symbol
FROM system.trace_log
WHERE event_date = today()
GROUP BY symbol
ORDER BY count() DESC LIMIT 10
Example: top functions:
┌─count()─┬─symbol──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 517 │ void LZ4::(anonymous namespace)::decompressImpl<32ul, false>(char const*, char*, unsigned long) │
│ 480 │ void DB::deserializeBinarySSE2<4>(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::ReadBuffer&, unsigned long) │
│ 457 │ DB::VolnitskyBase<true, true, DB::StringSearcher<true, true> >::search(unsigned char const*, unsigned long) const │
│ 270 │ read │
│ 163 │ void LZ4::(anonymous namespace)::decompressImpl<16ul, true>(char const*, char*, unsigned long) │
│ 130 │ void LZ4::(anonymous namespace)::decompressImpl<16ul, false>(char const*, char*, unsigned long) │
│ 58 │ CityHash_v1_0_2::CityHash128WithSeed(char const*, unsigned long, std::pair<unsigned long, unsigned long>) │
│ 44 │ void DB::deserializeBinarySSE2<2>(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::ReadBuffer&, unsigned long) │
│ 37 │ void LZ4::(anonymous namespace)::decompressImpl<8ul, true>(char const*, char*, unsigned long) │
│ 32 │ memcpy │
└─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Example: top contexts for a query:
SELECT count(),
arrayStringConcat(arrayMap(x -> concat(
demangle(addressToSymbol(x)),
'\n ',
addressToLine(x)), trace), '\n') AS sym
FROM system.trace_log
WHERE query_id = '1a1272b5-695a-4b17-966d-a1701b61b3eb'
AND event_date = today()
GROUP BY trace
ORDER BY count() DESC
LIMIT 10
count(): 154
sym: DB::VolnitskyBase<true, true, DB::StringSearcher<true, true> >::search(unsigned char const*, unsigned long) const
/opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
DB::MatchImpl<true, false>::vector_constant(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul> const&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&)
/opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
DB::FunctionsStringSearch<DB::MatchImpl<true, false>, DB::NameLike>::executeImpl(DB::Block&, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, unsigned long)
/opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
DB::PreparedFunctionImpl::execute(DB::Block&, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, unsigned long, bool)
/home/milovidov/ClickHouse/build_gcc9/../dbms/src/Functions/IFunction.cpp:464
DB::ExpressionAction::execute(DB::Block&, bool) const
/usr/local/include/c++/9.1.0/bits/stl_vector.h:677
DB::ExpressionActions::execute(DB::Block&, bool) const
/home/milovidov/ClickHouse/build_gcc9/../dbms/src/Interpreters/ExpressionActions.cpp:759
DB::FilterBlockInputStream::readImpl()
/home/milovidov/ClickHouse/build_gcc9/../dbms/src/DataStreams/FilterBlockInputStream.cpp:84
DB::IBlockInputStream::read()
/usr/local/include/c++/9.1.0/bits/stl_vector.h:108
DB::ExpressionBlockInputStream::readImpl()
/home/milovidov/ClickHouse/build_gcc9/../dbms/src/DataStreams/ExpressionBlockInputStream.cpp:34
DB::IBlockInputStream::read()
/usr/local/include/c++/9.1.0/bits/stl_vector.h:108
DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long)
/usr/local/include/c++/9.1.0/bits/atomic_base.h:419
ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::*)(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long), DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>*, std::shared_ptr<DB::ThreadGroupStatus>, unsigned long&>(void (DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::*&&)(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long), DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>*&&, std::shared_ptr<DB::ThreadGroupStatus>&&, unsigned long&)::{lambda()#1}::operator()() const
/usr/local/include/c++/9.1.0/bits/shared_ptr_base.h:729
ThreadPoolImpl<std::thread>::worker(std::_List_iterator<std::thread>)
/usr/local/include/c++/9.1.0/bits/atomic_base.h:551
execute_native_thread_routine
/home/milovidov/ClickHouse/ci/workspace/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:81
start_thread
/lib/x86_64-linux-gnu/libpthread-2.27.so
clone
/build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97
https://github.com/Slach/clickhouse-flamegraph
ClickHouse Performance Optimization Secrets:
https://youtu.be/ltg8vstuHUU
Query Performance Analysis in ClickHouse:
https://youtu.be/ondHe_JUyW4
How Hash Tables Work in ClickHouse:
https://youtu.be/EoX82TEz2sQ
Parallel and Distributed GROUP BY:
https://youtu.be/SrucFOs8Y6c
How to Speed Up LZ4 Decompression:
https://youtu.be/V2CqQBICt7M
Database Obfuscation:
https://youtu.be/2iR7i4akL44
Web site: https://clickhouse.com/
Maillist: [email protected]
YouTube: https://www.youtube.com/c/ClickHouseDB
Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en
GitHub: https://github.com/ClickHouse/ClickHouse/
Twitter: https://twitter.com/ClickHouseDB
Google groups: https://groups.google.com/forum/#!forum/clickhouse