Author: Alexey Milovidov, 2021-05-11.
Alexey, ClickHouse developer.
Profiling under different workloads.
Optimizing everything that pops up.
About performance testing — see Alexander Kuzmenkov's talk tomorrow at 10 AM.
https://www.techdesignforums.com/practice/technique/winning-at-whac-a-mole-redesigning-an-rf-transceiver/
ClickHouse has different «table engines».
MergeTree tables store data on disk.
Memory tables store data in RAM.
Memory is faster than disks*.
Does this mean Memory tables are faster than MergeTree?
* What does «faster» mean? Sequential read/write throughput? Random read/write latency? IOPS at a given parallelism and load distribution?
Of course, memory can be slower than a disk subsystem:
for example, single-channel memory vs. 10× PCIe 4.0 SSDs.
Memory tables store data in RAM.
MergeTree tables store data on disk,
more precisely in the file system.
But data from the file system goes into page cache.
And is then read from RAM.
Does this mean there's no difference between Memory and MergeTree tables
when data is in page cache?
Obvious cases when MergeTree is faster than Memory.
MergeTree tables have a primary key and secondary indexes,
and allow reading only the necessary data ranges.
Memory tables only allow full scan.
But this case is not interesting.
Can MergeTree be faster than Memory in full scan?
Non-obvious cases when MergeTree is faster than Memory.
MergeTree tables store data in sorted order
by primary key.
Some algorithms in ClickHouse exploit
data locality advantages when available (fast path).
For example, if during GROUP BY the same value occurs
twice in a row, we don't do a repeated lookup in the hash table.
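A minimal sketch of this fast path (illustrative names, and std::unordered_map instead of ClickHouse's own specialized hash tables):

#include <cstdint>
#include <unordered_map>
#include <vector>

// Count rows per key, skipping the hash table lookup whenever the key
// repeats the previous row's key.
void countByKey(const std::vector<uint64_t> & keys,
                std::unordered_map<uint64_t, uint64_t> & counts)
{
    uint64_t * cached_slot = nullptr;
    uint64_t prev_key = 0;

    for (uint64_t key : keys)
    {
        if (cached_slot && key == prev_key)
        {
            ++*cached_slot;          // fast path: no hash table lookup
            continue;
        }
        cached_slot = &counts[key];  // slow path: one lookup, slot is cached
        prev_key = key;
        ++*cached_slot;
    }
}

On data sorted by the GROUP BY key (as in MergeTree), almost every row takes the fast path; on unordered data (as in Memory), almost none do.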
About hash tables in ClickHouse, see Maxim Kita's talk tomorrow at 12:50.
And if data in tables is in the same order,
can MergeTree be faster than Memory?
Data in ClickHouse is stored by columns
and is also processed by columns.
Array of Structures:

struct Point3d
{
    float x;
    float y;
    float z;
};

std::vector<Point3d> points;

Structure of Arrays:

struct Points
{
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};
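The difference shows up as soon as a query touches only some of the fields. A sketch using the two structs above (sumX is an illustrative name):

#include <vector>

// Structure of Arrays: summing x scans one dense, contiguous float array;
// easy for the compiler to vectorize, and every cache line is fully used.
float sumX(const Points & points)
{
    float sum = 0;
    for (float x : points.x)
        sum += x;
    return sum;
}

// Array of Structures: each cache line loaded for x also carries
// the unused y and z fields.
float sumX(const std::vector<Point3d> & points)
{
    float sum = 0;
    for (const Point3d & p : points)
        sum += p.x;
    return sum;
}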
Data in ClickHouse is stored by columns
and is also processed by columns. By chunks of columns.
struct Chunk
{
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

std::vector<Chunk> chunks;
— Morsel-based processing.
In the case of MergeTree:
— read compressed files from the file system;
— calculate and verify checksums;
— decompress compressed blocks;
— deserialize column chunks;
— process them;
In the case of Memory:
— ready-made column chunks are
already in RAM,
just process them.
In the case of MergeTree:
1. Read compressed files from the file system:
— reading can be done with synchronous (read/pread, mmap)
or asynchronous (AIO, io_uring) I/O;
— with synchronous I/O, we can either go through the page cache
(regular read or mmap) or bypass it (O_DIRECT);
— when reading from the page cache without mmap,
there is an extra copy from the page cache to userspace
(see the sketch below);
— we read compressed data, so if the compression ratio is high,
reading takes only a small share of the query processing time;
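To illustrate the mmap path: a minimal sketch with plain POSIX calls (mapFile is a hypothetical helper; error handling is reduced to early returns). The returned pointer aliases the page cache directly, so no copy into a userspace buffer is needed:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a data file read-only. The mapping gives direct access to the
// page cache pages, avoiding the copy that read/pread would make.
const char * mapFile(const char * path, size_t & size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;

    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size == 0)
    {
        close(fd);
        return nullptr;
    }
    size = static_cast<size_t>(st.st_size);

    void * addr = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after closing the descriptor
    return addr == MAP_FAILED ? nullptr : static_cast<const char *>(addr);
}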
In the case of MergeTree:
2. Decompress compressed blocks:
— by default LZ4* is used;
— you can choose a stronger compression method (ZSTD)
or a weaker one, down to no compression at all (NONE);
— sometimes NONE unexpectedly works slower, why would that be?
— and what block size was used for compressing data?
and how does this affect speed?
* See the talk «How to Speed Up LZ4 Decompression» from HighLoad++ Siberia 2018.
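For reference, a minimal round trip through the stock LZ4 C API, the same library whose LZ4_decompress_safe shows up in the profile later (roundTrip is an illustrative name; assumes liblz4 is available):

#include <lz4.h>
#include <algorithm>
#include <string>
#include <vector>

// Compress one block with LZ4 and decompress it back.
// Returns true if the round trip restores the original bytes.
bool roundTrip(const std::string & src)
{
    std::vector<char> compressed(LZ4_compressBound(static_cast<int>(src.size())));
    int csize = LZ4_compress_default(
        src.data(), compressed.data(),
        static_cast<int>(src.size()), static_cast<int>(compressed.size()));
    if (csize <= 0)
        return false;

    std::vector<char> decompressed(src.size());
    int dsize = LZ4_decompress_safe(
        compressed.data(), decompressed.data(),
        csize, static_cast<int>(decompressed.size()));
    return dsize == static_cast<int>(src.size())
        && std::equal(decompressed.begin(), decompressed.end(), src.begin());
}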
In the case of MergeTree:
3. Deserialize column chunks:
— there is no deserialization as such;
— it's just moving data (memcpy);
— why is it necessary to move data at all?
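In other words, once a block is decompressed, «deserializing» a fixed-width column is a single memcpy into the typed array. A sketch with illustrative names:

#include <cstddef>
#include <cstring>
#include <vector>

// "Deserialize" a chunk of a float column: the decompressed block already
// holds the values in their in-memory layout, so one memcpy suffices.
std::vector<float> readFloatColumn(const char * decompressed, size_t rows)
{
    std::vector<float> column(rows);
    memcpy(column.data(), decompressed, rows * sizeof(float));
    return column;
}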
In the case of Memory:
— ready-made column chunks in RAM.
In the case of MergeTree:
— column chunks are formed dynamically during reading.
MergeTree does more work,
but can this sometimes be more optimal?
In the case of MergeTree:
— column chunks are formed dynamically during reading,
and their size in rows can be chosen adaptively
for cache locality!
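A sketch of the idea, not the actual ClickHouse heuristic (the constants and the cache budget are assumptions for illustration): choose the number of rows per chunk so that one chunk, across all columns being read, fits into a per-core slice of cache.

#include <algorithm>
#include <cstddef>

// Pick an adaptive chunk size in rows: small enough that a chunk of all
// the columns being read fits the cache budget, but not degenerate.
size_t adaptiveChunkRows(size_t bytes_per_row, size_t cache_budget_bytes)
{
    constexpr size_t min_rows = 1024;
    constexpr size_t max_rows = 65536;
    size_t rows = cache_budget_bytes / std::max<size_t>(1, bytes_per_row);
    return std::clamp(rows, min_rows, max_rows);
}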
What speed does RAM work at?
— which RAM, on which machine?
What speed does cache work at?
— which cache level, on which CPU?
— one or all together?
And the speed of what, exactly?
— throughput, latency?..
In ClickHouse data is stored compressed by default.
It's compressed on write, decompressed on read.
Profiling queries...
At the top of the CPU profile: the LZ4_decompress_safe function.
To speed everything up, should we just remove data compression?
[Comic: Megg, Mogg & Owl series by Simon Hanselmann]
But nothing good comes of it:
1. Removed data compression and now it doesn't fit on disk.
2. Removed data compression and now reading from disk is slow.
3. Removed data compression and now
less data fits in page cache.
...
But even if uncompressed data fits entirely
in RAM — does it make sense to not compress it?
In benchmarks, the memcpy function is used as the baseline:
the weakest possible «compression» and «decompression».
Naturally, it is the fastest entry in such comparisons.
Example:
— memcpy: 12 GB per second.
— LZ4 decompression: 2..4 GB of decompressed data per second.
Conclusion: is memcpy faster than LZ4 decompression?
Consider the scenario:
— data is stored in RAM;
— data is processed in blocks;
— each block is quite small and fits in CPU cache;
— processing of each block fits in CPU cache;
— data is processed in multiple threads;
Data is read from RAM, then only CPU cache is used.
Example: Ryzen 3950X (16 cores)
— memcpy: 16×12 GB = 192 GB per second.
— LZ4 decompression: 16×2..4 GB = 32..48 GB of decompressed data per second.
— memory read speed: 30 GB* per second.
In the memcpy case, reading is limited by memory bandwidth.
But with compression, less data is read from memory.
RAM here plays the role of the disk. So: is LZ4 decompression faster than memcpy?
* dual-channel memory, but not working at maximum frequency.
According to specs for this CPU up to 48 GB per second.
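The arithmetic behind this, as a sketch (the 3× compression ratio is an assumption for illustration): with LZ4, every byte read from RAM expands into several bytes of decompressed data, so the effective rate is memory bandwidth times the ratio, capped by the decompression speed.

#include <algorithm>
#include <cstdio>

int main()
{
    // Numbers from the Ryzen example above; the ratio is assumed.
    double mem_bandwidth = 30.0;     // GB/s read from RAM
    double decompress_speed = 48.0;  // GB/s of decompressed output, 16 cores
    double compression_ratio = 3.0;  // decompressed size / compressed size

    // memcpy path: every byte of usable data must cross the memory bus.
    double memcpy_rate = mem_bandwidth;

    // LZ4 path: only compressed bytes cross the bus; the decompressed
    // output goes into cache-resident blocks.
    double lz4_rate = std::min(decompress_speed,
                               mem_bandwidth * compression_ratio);

    printf("memcpy: %.0f GB/s, LZ4: %.0f GB/s of usable data\n",
           memcpy_rate, lz4_rate);  // prints: memcpy: 30, LZ4: 48
}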
Example: 2 × AMD EPYC 7742 (128 cores)
8-channel memory, max throughput 190 GiB/s.
For this server, working with LZ4-compressed data
will also be faster.
But with fewer cores, it is no longer so clear-cut.
If the data compresses well, decompression still loads the CPU,
which means it can be sped up!
For Memory tables:
— Reduced block size on write
for better cache locality of data processing #20169.
— Ability to compress Memory tables #20168.
For MergeTree tables:
— Removed unnecessary copying for NONE compression mode #22145.
— Ability to disable checksums when reading #19588,
but this ability should not be used.
— Ability to read using mmap #8520, to avoid the extra copy
from the page cache, plus a cache of memory mappings #22206.
To optimize performance, you just need to:
— know exactly what your code does;
— profile the system under realistic load scenarios;
— understand hardware capabilities;
...
— not forget that the system has many cores and the processor has a cache;
and don't confuse latency with throughput :)