Query Profiler: The Hard Way

Author: Alexey Milovidov, 2019-11-30.


About Me

Alexey — ClickHouse developer.

ClickHouse — analytical DBMS.

Main product feature
— high query processing speed.

Trivial Facts and Basic Principles

Trivial Facts

To improve something — you need to measure it.

Break it down into components and measure the contribution of each.

Find out what the main bottleneck of the system is.

— a profiler can be used for this.

What Types of Profilers Exist?

Low-level:

— in which functions does my code spend the most time?

(of course in memcpy... or malloc)

High-level:

— which data transformation takes the most time?

(JOIN? Sorting? Aggregation?)

Trivial Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

Yes, ClickHouse has optimized memcpy!

Actually, even several...

— «Chinese» memcpy;

— memcpy for gathering short memory chunks:
memcpySmallAllowReadWriteOverflow15

Trivial Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

But usually we just need to call it less:

std::vector<int> vec;
for (int i = 0; i < n; ++i)
    vec.push_back(i); // memcpy

Obvious Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

But usually we just need to call it less:

std::vector<int> vec;
vec.reserve(n);
for (int i = 0; i < n; ++i)
    vec.push_back(i);

Obvious Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

But usually we just need to call it less:

std::vector<int> vec;
vec.resize(n); // memset
for (int i = 0; i < n; ++i)
    vec[i] = i;

Banal Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

But usually we just need to call it less:

std::unordered_map<int, std::vector<char>> map;
...
for (const auto elem : map) { ... }

Banal Example

The profiler shows that most time
is spent executing memcpy.

Need to optimize this function?

But usually we just need to call it less:

std::unordered_map<int, std::vector<char>> map;
...
for (const auto & elem : map) { ... }

What Metric to Measure?

Number of cycles or CPU time:
— rdtsc/rdtscp, OS information (getrusage).

Number of instructions executed:
— CPU counters: via the kernel (perf_event_open) or directly (rdpmc);
— running in an emulator (valgrind).

Number of bytes read from disk (or from the file system?):
— application counters or OS information.

Real time between events (or its approximation?):
— timers in the application;
— sampling;
...
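
For illustration, here is a minimal sketch of reading two of these metrics from a C++ program on x86-64 Linux: OS-reported CPU time via getrusage and the raw cycle counter via rdtsc. It is only an example of the metrics themselves, not ClickHouse code.

#include <cstdint>
#include <cstdio>
#include <sys/resource.h>
#include <x86intrin.h>

int main()
{
    // CPU time consumed by the process, as reported by the OS.
    struct rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    printf("user CPU time: %ld.%06ld s\n",
           (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);

    // Time stamp counter read directly from the CPU.
    uint64_t cycles = __rdtsc();
    printf("TSC value: %llu\n", (unsigned long long)cycles);
}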

Another Example

Using a CPU profiler,
we optimized the memcpy function, but the program didn't speed up,
because it is actually bottlenecked by:

— disk reads;

— receiving data over the network;

— waiting on a mutex;

— sleep.

Profiling Methods

Instrumentation:

— add code to the program
that measures its work;

Sampling:

— at random moments in time, look
where code execution is;

External or Internal?

Internal profiler:
— application contains profiler code inside itself;
(1) explicitly in source code;
(2) without changing source code:
profiler added during compilation (gprof (-pg), LLVM XRay)
or during linking (Google CPU profiler);

External profiler:
— application requires no changes;
— external program or OS kernel monitors it
(poor man's profiler, Linux Perf);

How Much Does It Impact Performance?

Running the application under emulation.
Example: valgrind, slowdown of 50x or more.

Adding instructions to every function.
Example: gprof, slowdown by tens of times.

Adding nop instructions to large functions and on-the-fly patching.
Example: LLVM XRay, negligible impact if not used.

Sampling.
Example: Linux Perf, can be enabled in production.

Collecting ready metrics.
Example: rdpmc, almost free.

Is Everything Already Written?

gprof, valgrind, Linux perf,
oprofile, Coz, VTune, Google CPU profiler,
Poor Man's Profiler, Pin, LIKWID,
DTrace, SystemTap...

What Do We Want?

Keep the profiler constantly enabled in production:
— need near-zero performance impact;

Be able to associate profiles with specific queries:
— isolate specific queries on a loaded server
and view their profiles;

Store profiling data in ClickHouse:
— aggregate profiles by query types,
by users, across cluster;

Maybe a sampling profiler will work for us?

Sampling Profilers

At certain intervals, check where
code execution was in all threads. Example: Poor Man's Profiler:

#!/bin/bash
for x in $(seq 1 $nsamples)
do
    gdb -ex "set pagination 0" \
        -ex "thread apply all bt" -batch -p $pid
    sleep $sleeptime
done | \
awk '
    BEGIN { s = ""; }
    /^Thread/ { print s; s = ""; }
    /^\#/ { if (s != "" ) { s = s "," $4 } else { s = $4 } }
    END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1

Sampling Profilers

At certain intervals, check where
code execution was in all threads. Example: Google CPU Profiler:

Set a timer in each thread,
which will send a signal* to that thread.
For example, 100 times per second.

In the signal handler, capture the stack trace.

Aggregate stack traces.

Looks trivial!

* alternative — use ptrace from auxiliary process.
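
A minimal sketch of this scheme on Linux. For brevity it uses the process-wide setitimer(ITIMER_PROF) instead of per-thread timers (timer_create with SIGEV_THREAD_ID), and the handler only writes a marker instead of capturing a stack trace; all names here are illustrative.

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// SIGPROF handler: only async-signal-safe calls are allowed here.
// A real profiler would capture a stack trace from the ucontext argument.
static void prof_handler(int, siginfo_t *, void * /*ucontext*/)
{
    const char msg[] = "tick\n";
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

int main()
{
    struct sigaction sa{};
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    // Deliver SIGPROF about 100 times per second of consumed CPU time.
    struct itimerval timer{};
    timer.it_interval.tv_usec = 10000;
    timer.it_value.tv_usec = 10000;
    setitimer(ITIMER_PROF, &timer, nullptr);

    volatile unsigned long sum = 0;                  // something to profile
    for (unsigned long i = 0; i < 2000000000UL; ++i)
        sum += i;
}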

The Task

Develop a sampling profiler in ClickHouse, where
collected information will be tied to queries.

Easy, because each thread in ClickHouse executes one query at a time, and we know which one.

Looks trivial!
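
Conceptually it boils down to each thread remembering the id of the query it is currently executing, so that every sample can carry that id. An illustrative sketch only, not the actual ClickHouse code:

#include <cstring>

// A fixed-size buffer is used so the signal handler can read the id
// without allocating memory.
thread_local char current_query_id[37] = {};   // UUID text plus terminating zero

// Called when a query starts executing on this thread.
void setCurrentQueryId(const char * id)
{
    std::strncpy(current_query_id, id, sizeof(current_query_id) - 1);
}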

The Task

I didn't do this task :)

All development — Nikita Lapkov;
— Yandex banner system developer;
— Facebook developer;
— ClickHouse contributor;
— strong professional;
— and generally a good person.

When we started the task,
it turned out to be
non-trivial.

Very non-trivial.

«Rick and Morty», ep. «Rest and Ricklaxation», Justin Roiland and Dan Harmon, 2017

Problems

How to do something in a signal handler?

In a signal handler, you can only call signal-safe functions.

Can't lock mutex*, wait on condvar, allocate memory...

Can write to file descriptor.

Can pass information to a separate thread via pipe,
and in that thread do everything else: log, insert into table.

* question: can you use spinlock in a signal handler?
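
A rough sketch of the pipe pattern (names and payload are illustrative; the real profiler sends the stack trace and query id through the pipe):

#include <cstdint>
#include <cstdio>
#include <signal.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>

// Hypothetical payload; the real one carries the stack trace and the query id.
struct TraceSample
{
    uint64_t thread_id;
};

static int pipe_fds[2];

// Signal handler: the only work done here is one async-signal-safe write().
static void handler(int)
{
    TraceSample sample{static_cast<uint64_t>(syscall(SYS_gettid))};
    (void)write(pipe_fds[1], &sample, sizeof(sample));
}

int main()
{
    pipe(pipe_fds);
    signal(SIGPROF, handler);

    // A separate thread does everything that is not signal-safe:
    // symbolization, logging, inserting into a table.
    std::thread collector([]
    {
        TraceSample sample;
        while (read(pipe_fds[0], &sample, sizeof(sample)) == (ssize_t)sizeof(sample))
            fprintf(stderr, "sample from thread %llu\n", (unsigned long long)sample.thread_id);
    });
    collector.detach();

    raise(SIGPROF);     // pretend the profiling timer fired
    usleep(100000);     // give the collector time to drain the pipe
}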

How to Collect Stack Traces?

1. Frame pointer chain walking.

Doesn't work if the code is compiled with -fomit-frame-pointer,
which is the default when building with optimizations.

You can build your own code with -fno-omit-frame-pointer, but that's worse,
and the glibc library is already built with -fomit-frame-pointer anyway.

Question: ASan, MSan, TSan use exactly this method. Why does it work?
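
To make the idea concrete, here is a rough sketch of frame pointer chain walking on x86-64, assuming the program is built with -O0 or with -fno-omit-frame-pointer (this is not what ClickHouse ended up using):

#include <cstdio>

__attribute__((noinline)) void collect()
{
    // With frame pointers, frame[0] holds the caller's saved rbp
    // and frame[1] holds the return address.
    void ** frame = static_cast<void **>(__builtin_frame_address(0));

    // Only walk the frames we know exist in this toy program: collect, bar, foo, main.
    for (int depth = 0; frame != nullptr && depth < 4; ++depth)
    {
        void * return_address = frame[1];
        if (return_address == nullptr)
            break;
        printf("#%d %p\n", depth, return_address);
        frame = static_cast<void **>(frame[0]);
    }
}

__attribute__((noinline)) void bar() { collect(); printf("bar done\n"); }
__attribute__((noinline)) void foo() { bar(); printf("foo done\n"); }

int main()
{
    foo();
}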

How to Collect Stack Traces?

2. The backtrace function from execinfo.h — part of glibc.

Internally it does dlopen, locks a mutex,
and then calls the internal unwind implementation from libgcc.

Not suitable for calling from signal handler.

How to Collect Stack Traces?

3. Using libunwind.

It should work: see -fasynchronous-unwind-tables.

There are at least three variants of using libunwind:
— libunwind from libgcc*;
— libunwind "nongnu";
— LLVM libunwind;

And all three variants allocate memory and lock mutexes,
so they are not suitable for calling from a signal handler.

* a C++ program with exception support already uses part of libunwind internally. Another libunwind implementation can be linked in either separately (without symbol name conflicts) or as a complete replacement for the exception handling implementation.

How to Collect Stack Traces?

4. Collect the stack trace outside the signal handler.

In the signal handler, pass the ucontext to another thread via a pipe.

But to convert the ucontext into a stack trace,
the target thread must stay frozen:
— waiting in the signal handler is very slow;
— it is an indirect lock, which is unacceptable.

How to Collect Stack Traces?

The option we chose: patch libunwind.

What's Wrong with libunwind?

1. nongnu libunwind doesn't pass tests under address sanitizer.

Decided to just switch to LLVM libunwind.

Bonus: also use LLVM libunwind instead of libgcc for C++ exception handling.

Though such a build configuration is very inconvenient
to describe in CMake:

-Wl,--start-group
    .../libglibc-compatibility.a
    .../libcxx.a
    .../libcxxabi.a
    .../libunwind.a
-Wl,--end-group
-nodefaultlibs
.../libclang_rt.builtins-x86_64.a
-lc -lm -lrt -lpthread -ldl

What's Wrong with libunwind?

2. libunwind allocates memory using malloc.

Possible solution: always allocate memory using mmap
— very slow.

Solution: buffer on stack and stack allocator on top of it.

https://github.com/ClickHouse-Extras/libunwind/commit/500aa227
https://github.com/ClickHouse-Extras/libunwind/commit/ec86b1c6
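
The idea, sketched very roughly (this is not the actual patch from the commits above): reserve a buffer on the stack and serve allocations from it with a trivial bump allocator, so that nothing inside the unwinder ever calls malloc.

#include <cstddef>
#include <cstdint>

// Trivial bump allocator over a caller-provided buffer.
class StackArena
{
public:
    StackArena(char * buffer, std::size_t size) : pos(buffer), end(buffer + size) {}

    void * allocate(std::size_t bytes, std::size_t alignment = alignof(std::max_align_t))
    {
        char * aligned = reinterpret_cast<char *>(
            (reinterpret_cast<std::uintptr_t>(pos) + alignment - 1) & ~std::uintptr_t(alignment - 1));
        if (aligned + bytes > end)
            return nullptr;              // reserved space exhausted: the caller must handle it
        pos = aligned + bytes;
        return aligned;
    }

    // No deallocate: everything is released when the buffer goes out of scope.

private:
    char * pos;
    char * end;
};

int main()
{
    char buffer[4096];                   // lives on the stack of the unwinding function
    StackArena arena(buffer, sizeof(buffer));
    void * chunk = arena.allocate(256);
    (void)chunk;
}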

What's Wrong with libunwind?

3. Missing register.

https://github.com/ClickHouse-Extras/libunwind/commit/17a48fbf

https://reviews.llvm.org/D66685

What's Wrong with libunwind?

4. Incorrect CFA.

// Check that the page containing the candidate CFA is actually mapped
// before the unwinder dereferences it.
unsigned char mincore_res = 0;
if (0 != syscall(SYS_mincore, (void *)(cfa / 4096 * 4096), 1, &mincore_res))
{
    return UNW_EBADFRAME;
}

https://github.com/ClickHouse-Extras/libunwind/commit/36df74d8

What's Wrong with libunwind?

5. FDE Cache.

Uses rwlock.

FDE Cache modification — for JIT code registration.
In practice, only the read lock is taken.

But even a read lock is not signal-safe and can lead to deadlock.

Solution — remove FDE Cache.

https://github.com/ClickHouse-Extras/libunwind/commit/96f5f171

What's Wrong with libunwind?

6. Using dl_iterate_phdr function.

Solution: lock-free PHDR Cache.

— doesn't work with thread sanitizer;
— need to update manually on dlopen;

https://github.com/libunwind/libunwind/issues/16
https://github.com/ClickHouse/ClickHouse/pull/6124

The solution was designed by Fedor Korotkiy, a YT developer.
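
The shape of the solution, sketched very roughly (illustrative names, not the actual implementation from the pull request): enumerate the program headers outside the signal handler and publish them through an atomic pointer, so the handler never calls dl_iterate_phdr itself.

#include <atomic>
#include <cstddef>
#include <link.h>
#include <vector>

static std::atomic<const std::vector<dl_phdr_info> *> phdr_cache{nullptr};

static int collectHeader(dl_phdr_info * info, size_t /*size*/, void * data)
{
    static_cast<std::vector<dl_phdr_info> *>(data)->push_back(*info);
    return 0;   // continue iteration
}

// Called at startup and after every dlopen; never from a signal handler.
void updatePHDRCache()
{
    auto * fresh = new std::vector<dl_phdr_info>;
    dl_iterate_phdr(collectHeader, fresh);
    // The previous cache is intentionally leaked so that a concurrent
    // signal handler can still be reading it.
    phdr_cache.store(fresh, std::memory_order_release);
}

// Safe to call from the signal handler: no locks, no allocations.
const std::vector<dl_phdr_info> * getPHDRCache()
{
    return phdr_cache.load(std::memory_order_acquire);
}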

What's Wrong with the Profiler?

7. Profiler incompatibility with sanitizers.

Reason: sanitizers also use stack unwinding.

For this they use the frame pointer and modify the %rbp register.

At that moment libunwind doesn't work.

Solution: don't use profiler under sanitizers.

https://github.com/ClickHouse/ClickHouse/pull/6769

What's Wrong with the Profiler?

8. No syscall restart after signals.

man 7 signal

SA_RESTART doesn't work for all syscalls*:

— reading and writing to socket with timeout.

— sleep, usleep, nanosleep, clock_nanosleep.

* in FreeBSD and Mac OS X everything is different.

What's Wrong with the Profiler?

9. Incorrect syscall restart after signals:

— when reading from or writing to a socket with a timeout,
the call is restarted, but the timeout is not updated.

Solution: always do poll before reading (a sketch follows below).

https://github.com/ClickHouse/ClickHouse/pull/6386
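
A simplified sketch of that idea (error and EINTR handling mostly omitted): the timeout is enforced by poll, so a restarted read no longer depends on the socket's receive timeout being updated.

#include <poll.h>
#include <stdexcept>
#include <unistd.h>

// Read with an explicit timeout: poll() decides whether data is ready,
// so the subsequent read() does not rely on SO_RCVTIMEO.
ssize_t readWithTimeout(int fd, void * buf, size_t size, int timeout_ms)
{
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int res = poll(&pfd, 1, timeout_ms);
    if (res == 0)
        throw std::runtime_error("Timeout exceeded while reading from socket");
    if (res < 0)
        throw std::runtime_error("Cannot poll socket");   // a real implementation retries on EINTR
    return read(fd, buf, size);
}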

What's Wrong with the Profiler?

10. Pipe overflow.

Solution:
— increase the pipe size to 1 MB;
— if the pipe is full and the write fails, drop the sample (see the sketch below).

Alternative solution:
lock-free circular buffer instead of pipe.
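
Roughly what the two mitigations of the main solution look like (Linux-specific F_SETPIPE_SZ; names are illustrative):

#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

static int pipe_fds[2];

void setupTracePipe()
{
    pipe2(pipe_fds, O_NONBLOCK);                    // never block in the signal handler
    fcntl(pipe_fds[1], F_SETPIPE_SZ, 1048576);      // grow the pipe buffer to 1 MB
}

// Called from the signal handler; a real handler also saves and restores errno.
void sendSample(const void * data, size_t size)
{
    if (write(pipe_fds[1], data, size) < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
    {
        // The pipe is full: drop the sample instead of blocking.
    }
}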

What's Wrong with the Profiler?

11. Too frequent signals.

Solution: if si_overrun > 0 — skip processing.
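
In the signal handler this check is just a few lines (a sketch; si_overrun is filled in for timer-generated signals when SA_SIGINFO is used):

#include <signal.h>

static void prof_handler(int /*sig*/, siginfo_t * info, void * /*ucontext*/)
{
    // Timer ticks have piled up faster than we can handle them:
    // skip this invocation instead of producing a burst of samples.
    if (info->si_overrun > 0)
        return;

    // ... capture the stack trace and write it to the pipe ...
}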

Result

Result

1. Enable one or both settings:

SET query_profiler_cpu_time_period_ns = 1000000;
SET query_profiler_real_time_period_ns = 1000000;

2. Run queries.
Data is written to system.trace_log table.

event_date:    2019-09-05
event_time:    2019-09-05 05:47:44
revision:      54425
timer_type:    CPU
thread_number: 149
query_id:      b1d8e7f9-48d8-4cb3-a768-0a6683f6f061
trace:         [140171472847748, 61781958, 110943821, 117594728, 117595220, 115654933, 120321783, 63251928, 111161800, 120329436, 120331356, 120308294, 120313436, 120319113, 120143313, 115666412, 120146905, 111013972, 118237176, 111013972, 117990912, 111013972, 110986070, 110986938, 61896391, 61897898, 61887509, 156206624, 140171472807643]

Result

trace — array of addresses (stack trace);

Convert address to function name:
— demangle(addressToSymbol(trace[1]))
Convert address to file name and line number:
— addressToLine(trace[1])

* don't forget to install clickhouse-common-static-dbg package

Example: top functions

SELECT
    count(),
    demangle(addressToSymbol(trace[1] AS addr)) AS symbol
FROM system.trace_log
WHERE event_date = today()
GROUP BY symbol
ORDER BY count() DESC
LIMIT 10

Result

Example: top functions:

 count()   symbol
     517   void LZ4::(anonymous namespace)::decompressImpl<32ul, false>(char const*, char*, unsigned long)
     480   void DB::deserializeBinarySSE2<4>(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::ReadBuffer&, unsigned long)
     457   DB::VolnitskyBase<true, true, DB::StringSearcher<true, true> >::search(unsigned char const*, unsigned long) const
     270   read
     163   void LZ4::(anonymous namespace)::decompressImpl<16ul, true>(char const*, char*, unsigned long)
     130   void LZ4::(anonymous namespace)::decompressImpl<16ul, false>(char const*, char*, unsigned long)
      58   CityHash_v1_0_2::CityHash128WithSeed(char const*, unsigned long, std::pair<unsigned long, unsigned long>)
      44   void DB::deserializeBinarySSE2<2>(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul>&, DB::ReadBuffer&, unsigned long)
      37   void LZ4::(anonymous namespace)::decompressImpl<8ul, true>(char const*, char*, unsigned long)
      32   memcpy

Result

Example: top contexts for a query:

SELECT
    count(),
    arrayStringConcat(arrayMap(x -> concat(
        demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym
FROM system.trace_log
WHERE query_id = '1a1272b5-695a-4b17-966d-a1701b61b3eb'
    AND event_date = today()
GROUP BY trace
ORDER BY count() DESC
LIMIT 10

count(): 154
sym:     DB::VolnitskyBase<true, true, DB::StringSearcher<true, true> >::search(unsigned char const*, unsigned long) const
         /opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
         DB::MatchImpl<true, false>::vector_constant(DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul> const&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::PODArray<unsigned char, 4096ul, Allocator<false, false>, 15ul, 16ul>&)
         /opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
         DB::FunctionsStringSearch<DB::MatchImpl<true, false>, DB::NameLike>::executeImpl(DB::Block&, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, unsigned long)
         /opt/milovidov/ClickHouse/build_gcc9/dbms/programs/clickhouse
         DB::PreparedFunctionImpl::execute(DB::Block&, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, unsigned long, bool)
         /home/milovidov/ClickHouse/build_gcc9/../dbms/src/Functions/IFunction.cpp:464
         DB::ExpressionAction::execute(DB::Block&, bool) const
         /usr/local/include/c++/9.1.0/bits/stl_vector.h:677
         DB::ExpressionActions::execute(DB::Block&, bool) const
         /home/milovidov/ClickHouse/build_gcc9/../dbms/src/Interpreters/ExpressionActions.cpp:759
         DB::FilterBlockInputStream::readImpl()
         /home/milovidov/ClickHouse/build_gcc9/../dbms/src/DataStreams/FilterBlockInputStream.cpp:84
         DB::IBlockInputStream::read()
         /usr/local/include/c++/9.1.0/bits/stl_vector.h:108
         DB::ExpressionBlockInputStream::readImpl()
         /home/milovidov/ClickHouse/build_gcc9/../dbms/src/DataStreams/ExpressionBlockInputStream.cpp:34
         DB::IBlockInputStream::read()
         /usr/local/include/c++/9.1.0/bits/stl_vector.h:108
         DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long)
         /usr/local/include/c++/9.1.0/bits/atomic_base.h:419
         ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::*)(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long), DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>*, std::shared_ptr<DB::ThreadGroupStatus>, unsigned long&>(void (DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::*&&)(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long), DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>*&&, std::shared_ptr<DB::ThreadGroupStatus>&&, unsigned long&)::{lambda()#1}::operator()() const
         /usr/local/include/c++/9.1.0/bits/shared_ptr_base.h:729
         ThreadPoolImpl<std::thread>::worker(std::_List_iterator<std::thread>)
         /usr/local/include/c++/9.1.0/bits/atomic_base.h:551
         execute_native_thread_routine
         /home/milovidov/ClickHouse/ci/workspace/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:81
         start_thread
         /lib/x86_64-linux-gnu/libpthread-2.27.so
         clone
         /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97

https://github.com/Slach/clickhouse-flamegraph

 

ClickHouse Performance Optimization Secrets:
https://youtu.be/ltg8vstuHUU

Query Performance Analysis in ClickHouse:
https://youtu.be/ondHe_JUyW4

How Hash Tables Work in ClickHouse:
https://youtu.be/EoX82TEz2sQ

Parallel and Distributed GROUP BY:
https://youtu.be/SrucFOs8Y6c

How to Speed Up LZ4 Decompression:
https://youtu.be/V2CqQBICt7M

Database Obfuscation:
https://youtu.be/2iR7i4akL44


Web site: https://clickhouse.com/

Mailing list: [email protected]

YouTube: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

GitHub: https://github.com/ClickHouse/ClickHouse/

Twitter: https://twitter.com/ClickHouseDB

Google groups: https://groups.google.com/forum/#!forum/clickhouse