https://www.techdesignforums.com/practice/technique/winning-at-whac-a-mole-redesigning-an-rf-transceiver/
Profile on various workloads.
Optimize everything that stands out.
There are multiple table engines in ClickHouse:
MergeTree tables store data on disk.
Memory tables store data in memory.
RAM is faster than disk*.
So are Memory tables faster than MergeTree?
* What is "faster"? Sequential read/write throughput. Random read/write latencies.
IOPS under specific parallelism and workload.
Obviously RAM can be slower than disks, for example single channel RAM vs. 10x PCIe 4.0 SSDs.
Memory tables store data in memory.
MergeTree tables store data on disk,
actually in filesystem.
But data from filesystem is cached in page cache in memory.
And then read from memory.
So is there no difference between Memory and MergeTree tables
if the data resides in the page cache?
There are obvious cases when MergeTree is faster than Memory.
MergeTree tables have a primary key and secondary indices,
and allow reading only subranges of data.
Memory tables only allow full scan.
But that's not interesting...
Is it possible that MergeTree is faster than Memory on full scan?
Less obvious cases when MergeTree can be faster than Memory.
MergeTree tables store data sorted in the primary key order.
Some algorithms in ClickHouse exploit
data locality when it is present (fast path).
Example: if the same value occurs in consecutive records during GROUP BY,
we skip the redundant hash table lookup (see the sketch below).
But if the tables store data in the same order,
is it possible that MergeTree is faster than Memory?
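A minimal sketch of that GROUP BY fast path (illustrative code, not the actual ClickHouse implementation; the names are made up):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

/// Count rows per key. When the input is sorted, consecutive rows often
/// share the same key, so we cache the last slot and skip the hash lookup.
void countGroups(const std::vector<uint64_t> & keys,
                 std::unordered_map<uint64_t, uint64_t> & counts)
{
    uint64_t * cached = nullptr;
    uint64_t prev_key = 0;

    for (uint64_t key : keys)
    {
        if (cached != nullptr && key == prev_key)
        {
            ++*cached;              /// fast path: no hash table lookup
            continue;
        }
        cached = &counts[key];      /// slow path: one lookup per key change
        ++*cached;
        prev_key = key;
    }
}
```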
Data in ClickHouse is stored by columns
and processed also by columns.
| Array of Structures | Structure of Arrays |
|---|---|
| `struct Point3d { float x; float y; float z; }; std::vector<Point3d> points;` | `struct Points { std::vector<float> x; std::vector<float> y; std::vector<float> z; };` |
Data in ClickHouse is stored by columns
and processed also by columns. By chunks of columns.
`struct Chunk { std::vector<float> x; std::vector<float> y; std::vector<float> z; }; std::vector<Chunk> chunks;`
— Morsel-based processing.
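For illustration, a tiny sketch (not ClickHouse code) of an operator consuming such chunks one at a time, using the Chunk layout shown above:

```cpp
#include <vector>

struct Chunk { std::vector<float> x; std::vector<float> y; std::vector<float> z; };

/// Morsel-based processing: each chunk is a small structure of arrays
/// that fits in CPU cache; we stream over one contiguous column at a time.
float sumX(const std::vector<Chunk> & chunks)
{
    float sum = 0.0f;
    for (const Chunk & chunk : chunks)
        for (float x : chunk.x)    /// tight loop over one column
            sum += x;
    return sum;
}
```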
In case of MergeTree:
— read compressed files from filesystem;
— calculate and validate checksums;
— decompress the compressed blocks;
— deserialize chunks of columns;
— process them;
In case of Memory:
— already prepared chunks of columns
are already in RAM,
just process them;
In case of MergeTree:
1. Read compressed files from the filesystem:
— we can read with synchronous (read/pread, mmap)
or asynchronous (AIO, io_uring) IO methods;
— in case of synchronous IO, we can
use the page cache (simply with read or mmap)
or bypass the page cache (O_DIRECT);
— if we read from the page cache without mmap,
then the data is copied from the page cache to userspace (sketched below);
— we read compressed data: if the compression ratio is high,
we spend little time reading;
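A hedged sketch of the two synchronous read paths (plain POSIX calls, error handling trimmed): with pread the kernel copies from the page cache into our buffer, while with mmap we read the cached pages directly.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

/// Copying path: data goes page cache -> userspace buffer.
std::vector<char> readWithPread(int fd, size_t size, off_t offset)
{
    std::vector<char> buf(size);
    ssize_t n = pread(fd, buf.data(), size, offset);
    buf.resize(n > 0 ? static_cast<size_t>(n) : 0);
    return buf;
}

/// Zero-copy path: userspace reads the page cache pages directly.
const char * mapFile(int fd, size_t size)
{
    void * p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? nullptr : static_cast<const char *>(p);
}

/// Bypassing the page cache: open(path, O_RDONLY | O_DIRECT),
/// then pread into a suitably aligned buffer.
```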
In case of MergeTree:
2. Decompress the compressed blocks:
— LZ4* is used by default;
— you can choose a heavier method (ZSTD)
or a lighter one, up to no compression at all (NONE);
— sometimes NONE is surprisingly slower than LZ4, but why?
— what is the compressed block size,
and how does it affect the speed?
* Read the article "How to speed up LZ4 decompression in ClickHouse?".
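A minimal example of decompressing one block with the LZ4 reference library (the same LZ4_decompress_safe that will show up at the top of the CPU profile later):

```cpp
#include <lz4.h>
#include <vector>

/// Decompress one block; both sizes are stored alongside the block on disk.
std::vector<char> decompressBlock(const char * src, int compressed_size, int decompressed_size)
{
    std::vector<char> dst(decompressed_size);
    int n = LZ4_decompress_safe(src, dst.data(), compressed_size, decompressed_size);
    if (n < 0)
        return {};      /// malformed or corrupted block
    dst.resize(static_cast<size_t>(n));
    return dst;
}
```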
In case of MergeTree:
3. Deserialize the chunks of columns:
— there is almost no "deserialization";
— usually it is just copying of data (memcpy);
— but why copy the data at all?
Memory:
— already prepared chunks of columns in memory.
MergeTree:
— chunks of columns are formed dynamically during read.
MergeTree is doing more work,
but is it possible for it to be more efficient?
In case of MergeTree:
— chunks of columns are formed dynamically during read,
and their size (in number of records) can be chosen adaptively
for better cache locality (see the sketch below)!
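A back-of-the-envelope sketch of that idea (the cache size and the safety factor are illustrative parameters, not ClickHouse settings):

```cpp
#include <algorithm>
#include <cstddef>

/// Pick the number of rows per chunk so that one chunk of all the columns
/// being read fits in a fraction of the CPU cache.
size_t rowsPerChunk(size_t bytes_per_row, size_t cache_bytes)
{
    size_t budget = cache_bytes / 2;    /// leave room for intermediate state
    return std::max<size_t>(1, budget / bytes_per_row);
}

/// Example: 3 float columns (12 bytes/row), 1 MiB of L2 cache
/// -> about 43690 rows per chunk.
```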
What is the speed of the main memory?
— what memory, on what server?
What is the speed of CPU cache?
— cache of what level, on what CPU?
— on one CPU core or all cores together?
The speed of what?
— throughput, latency?..
ClickHouse compresses data by default.
Data is compressed on write and decompressed on read.
Let's profile some queries...
The LZ4_decompress_safe function is at the top by CPU usage.
🤔 So, let's just turn off compression and speed everything up?
[Image: Megg, Mogg & Owl series by Simon Hanselmann]
But nothing good happens:
1. Compression is turned off, but the data doesn't fit on disk anymore.
2. Compression is turned off, but the data reads slowly from disk.
3. Compression is turned off, but less data fits in the page cache.
...
But if all data fits in memory even uncompressed
— does it make sense to not compress the data?
The memcpy function is often used as a baseline:
the weakest possible "compression" or decompression in benchmarks.
Memcpy is the fastest point of comparison, for sure.
Example:
— memcpy: 12 GB per second.
— LZ4 decompression: 2..4 GB of decompressed data per second.
Conclusion: memcpy is faster than LZ4 decompression?
Let's look at the following scenario:
— data is stored in memory;
— data is processed by chunks;
— every chunk is small enough to fit in CPU cache;
— processing of every chunk also fits in CPU cache;
— data is processed by multiple threads;
Data is read from main memory, then all processing is inside CPU cache.
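A sketch of this scenario (illustrative, not ClickHouse code): each thread streams its own cache-sized chunks from RAM, and all further work stays inside the CPU cache.

```cpp
#include <numeric>
#include <thread>
#include <vector>

/// Each chunk is small enough to fit in cache: it is read from main
/// memory once, and the per-chunk processing never leaves the cache.
double sumAll(const std::vector<std::vector<float>> & chunks, unsigned threads)
{
    std::vector<double> partial(threads, 0.0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&chunks, &partial, t, threads]
        {
            double sum = 0.0;
            for (size_t i = t; i < chunks.size(); i += threads)
                sum += std::accumulate(chunks[i].begin(), chunks[i].end(), 0.0);
            partial[t] = sum;       /// one write per thread at the end
        });
    for (std::thread & th : pool)
        th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```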
Example: Ryzen 3950 (16 cores)
— memcpy: 16×12 GB = 192 GB per second.
— LZ4 decompression: 16×2..4 GB = 32..48 GB
of decompressed data per second.
— RAM read speed: 30 GB* per second.
For memcpy, reading is limited by RAM speed.
But if the data is compressed, less data is read from memory:
e.g. at a 3× compression ratio, producing 48 GB/s of decompressed data
reads only 16 GB/s of compressed data from RAM, well within the limit.
"Memory is the new disk". Is LZ4 decompression faster than memcpy?
* Dual-channel memory, but it is not running at its maximum frequency.
According to the CPU specs, it should be 48 GB per second.
Example: 2 × AMD EPYC 7742 (128 cores)
8 channel memory, max throughput 190 GiB/s
LZ4 decompression is faster even on this server.
But if the number of cores is lower, then it is not so clear.
If the data is highly compressed, then decompression is CPU-bound
and therefore can be optimized!
For Memory tables:
— Lowered the block size on insertion
for better cache locality of data processing #20169.
— Added support for compression of Memory tables #20168.
For MergeTree tables:
— Eliminated unnecessary data copying for the NONE codec #22145.
— Allowed turning off checksum verification on read #19588,
but you should not use this switch.
— Allowed reading data with mmap #8520,
and added a cache of memory mappings #22206.
If you want to optimize performance, you just need:
— to know exactly what the code is doing;
— to profile the system on realistic workloads;
— to be aware of the hardware capabilities;
...
— and don't forget that the CPU has multiple cores and caches;
don't confuse latency with throughput :)