Unusual Cases of Performance Optimization Using ClickHouse as an Example

Author: Alexey Milovidov, 2021-05-11.

About Me

Alexey, ClickHouse developer.

ClickHouse

Performance Optimization

Profiling under different workloads.

Optimizing everything that pops up.

About performance testing — see Alexander Kuzmenkov's talk tomorrow at 10 AM.

https://www.techdesignforums.com/practice/technique/winning-at-whac-a-mole-redesigning-an-rf-transceiver/

Episode 1: MergeTree vs Memory

ClickHouse has different «table engines».

MergeTree tables store data on disk.

Memory tables store data in RAM.

Memory is faster than disks*.

Does this mean Memory tables are faster than MergeTree?

* What does «faster» mean? Sequential read and write speed. Random read and write latency. IOPS at given parallelism and load distribution.

Of course, memory can be slower than the disk subsystem:
for example, single-channel memory vs. 10× PCIe 4.0 SSDs.

MergeTree vs Memory

Memory tables store data in RAM.

MergeTree tables store data on disk,
more precisely in the file system.

But data read from the file system ends up in the page cache.

And is then read from RAM.

Does this mean there's no difference between Memory and MergeTree tables
when data is in page cache?

MergeTree vs Memory

Obvious cases when MergeTree is faster than Memory.

MergeTree tables have a primary key and secondary indexes,
and allow reading only the necessary data ranges.

Memory tables only allow full scan.

But this case is not interesting.

Can MergeTree be faster than Memory in full scan?

MergeTree vs Memory

Non-obvious cases when MergeTree is faster than Memory.

MergeTree tables store data in sorted order
by primary key.

Some algorithms in ClickHouse exploit
data locality advantages when available (fast path).

For example, if during GROUP BY the same value occurs
twice in a row, we don't do a repeated lookup in the hash table.
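
A minimal sketch of this fast path (illustrative, not ClickHouse's actual code):
remember the previous key and its hash table slot,
and reuse the slot when the key repeats.

#include <string>
#include <unordered_map>
#include <vector>

void aggregateCounts(const std::vector<std::string> & keys,
                     std::unordered_map<std::string, size_t> & counts)
{
    const std::string * prev_key = nullptr;
    size_t * prev_slot = nullptr;

    for (const auto & key : keys)
    {
        if (prev_key && key == *prev_key)
        {
            ++*prev_slot;           // fast path: same key as the previous row
            continue;
        }
        prev_slot = &counts[key];   // slow path: hash table lookup
        ++*prev_slot;
        prev_key = &key;
    }
}

On sorted input (as from a MergeTree table ordered by the key),
most rows take the fast path.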

About hash tables in ClickHouse, see Maxim Kita's talk tomorrow at 12:50.

And if data in tables is in the same order,
can MergeTree be faster than Memory?

How Data is Processed in ClickHouse

Data in ClickHouse is stored by columns
and is also processed by columns.

Array of Structures:

struct Point3d
{
    float x;
    float y;
    float z;
};

std::vector<Point3d> points;

Structure of Arrays:

struct Points
{
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

How Data is Processed in ClickHouse

Data in ClickHouse is stored by columns
and is also processed by columns. By chunks of columns.

struct Chunk
{
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

std::vector<Chunk> chunks;

Morsel-based processing.
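
A minimal sketch of such a loop, reusing the Chunk struct above:
each chunk is small, so once its columns are pulled from RAM,
the inner loop runs out of CPU cache.

float sumX(const std::vector<Chunk> & chunks)
{
    float sum = 0;
    for (const auto & chunk : chunks)
        for (float v : chunk.x)     // tight per-column loop over one chunk
            sum += v;
    return sum;
}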

How Exactly is Data Read from the Table?

In the case of MergeTree:

— read compressed files from the file system;

— calculate and verify checksums;

— decompress compressed blocks;

— deserialize column chunks;

— process them.

How Exactly is Data Read from the Table?

In the case of Memory:

— ready-made column chunks are
  already in RAM;

— process them.

What Exactly Happens During Reading?

In the case of MergeTree:

1. Read compressed files from the file system:

— reading can be done with synchronous (read/pread, mmap)
  or asynchronous (AIO, io_uring) I/O;

— with synchronous I/O, we can go through the page cache
  (regular read or mmap) or bypass it (O_DIRECT);

— when reading from the page cache without mmap,
  data is copied from the page cache to userspace;

— we read compressed data, so if the compression ratio is high,
  reading takes only a small share of query processing time;
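
To make the page-cache copy concrete, here is a minimal sketch, assuming Linux
(error handling omitted): pread copies data from the page cache into a userspace
buffer, while mmap maps the page cache pages directly.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

std::vector<char> readWithPread(const char * path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    std::vector<char> buf(st.st_size);
    pread(fd, buf.data(), buf.size(), 0);   // copy: page cache -> userspace
    close(fd);
    return buf;
}

const char * readWithMmap(const char * path, size_t & size)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    size = st.st_size;
    void * data = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                              // mapping stays valid after close
    return static_cast<const char *>(data); // no copy: backed by page cache
}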

What Exactly Happens During Reading?

In the case of MergeTree:

2. Decompress compressed blocks:

— by default LZ4* is used;

— you can choose a stronger compression method (ZSTD),
  or weaker, for example no compression at all (NONE);

— sometimes NONE unexpectedly works slower, why would that be?

— and what block size was used for compressing data?
  and how does this affect speed?

* See the talk «How to Speed Up LZ4 Decompression» from HighLoad++ Siberia 2018.

What Exactly Happens During Reading?

In the case of MergeTree:

3. Deserialize column chunks:

— there is no deserialization as such;

— it's just moving data (memcpy);

— why is it necessary to move data at all?

Difference Between MergeTree and Memory

In the case of Memory:
— ready-made column chunks in RAM.

In the case of MergeTree:
— column chunks are formed dynamically during reading.

MergeTree does more work,
but can this sometimes be more efficient?

MergeTree vs Memory

In the case of MergeTree:

— column chunks are formed dynamically during reading,
  and their size in rows can be chosen adaptively
  for cache locality!
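
A toy illustration of the idea; the function and the cache budget below
are made up, not ClickHouse's actual logic.

#include <algorithm>
#include <cstddef>

// Hypothetical: pick rows per chunk so that one chunk of all the columns
// being read fits into a per-core cache budget (here ~1 MiB, like an L2).
size_t chooseChunkRows(size_t bytes_per_row_read, size_t cache_budget = 1 << 20)
{
    size_t rows = cache_budget / std::max<size_t>(1, bytes_per_row_read);
    // Clamp: too few rows wastes per-chunk overhead,
    // too many rows spills out of cache.
    return std::clamp<size_t>(rows, 1024, 65536);
}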

Cache Locality

What speed does RAM work at?
— which RAM, on which machine?

What speed does cache work at?
— which cache level, on which CPU?
— one or all together?

What speed of what?
— throughput, latency?..

Episode 2: Does Data Compression Slow Things Down?

In ClickHouse data is stored compressed by default.

It's compressed on write, decompressed on read.

Profiling queries...

At the top of the CPU profile — the LZ4_decompress_safe function.

To speed everything up, should we just remove data compression?


Trying to Remove Data Compression

But nothing good comes of it:

1. Removed data compression, and now the data doesn't fit on disk.

2. Removed data compression, and now reading from disk is slow.

3. Removed data compression, and now less data fits in the page cache.

...

But even if uncompressed data fits entirely
in RAM — does it make sense to not compress it?

What's Faster: Decompression or memcpy?

In benchmarks, the memcpy function is used as a baseline:
the weakest possible «compression» or «decompression».

Naturally, it is the fastest in such comparisons.

Example:
— memcpy: 12 GB per second.
— LZ4 decompression: 2..4 GB of decompressed data per second.

Conclusion: is memcpy faster than LZ4 decompression?
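
A minimal sketch of such a comparison, assuming liblz4 is installed
(link with -llz4). Note: a single-threaded run on an idle machine will
usually show memcpy ahead; the twist on the following slides is what
happens when many cores compete for memory bandwidth.

#include <lz4.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> src(64 << 20, 'a');   // 64 MiB, highly compressible
    std::vector<char> compressed(LZ4_compressBound(64 << 20));
    int csize = LZ4_compress_default(src.data(), compressed.data(),
                                     (int)src.size(), (int)compressed.size());

    std::vector<char> dst(src.size());

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), src.size());
    auto t1 = std::chrono::steady_clock::now();
    LZ4_decompress_safe(compressed.data(), dst.data(), csize, (int)dst.size());
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto from, auto to)
    {
        return (long long)std::chrono::duration_cast<
            std::chrono::microseconds>(to - from).count();
    };
    std::printf("memcpy: %lld us, LZ4 decompression: %lld us, ratio %.1fx\n",
                us(t0, t1), us(t1, t2), (double)src.size() / csize);
}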

What's Faster: Decompression or memcpy?

Consider the scenario:

— data is stored in RAM;
— data is processed in blocks;
— each block is quite small and fits in CPU cache;
— processing of each block fits in CPU cache;
— data is processed in multiple threads.

Data is read from RAM, then only CPU cache is used.

What's Faster: Decompression or memcpy?

Example: Ryzen 9 3950X (16 cores)

— memcpy: 16×12 GB = 192 GB per second.
— LZ4 decompression: 16×2..4 GB = 32..48 GB of decompressed data per second.
— memory read speed: 30 GB* per second.

In the case of memcpy, reading is limited by memory speed.

But if compression is used, less data is read from memory.
Here memory plays the role of the disk. So: is LZ4 decompression faster than memcpy?

* dual-channel memory, but not working at maximum frequency.
According to specs for this CPU up to 48 GB per second.

What's Faster: Decompression or memcpy?

Example: 2 × AMD EPYC 7742 (128 cores)

8-channel memory, max throughput 190 GiB/s.

For this server, working with LZ4-compressed data
will also be faster.

But with fewer cores, it's not so clear-cut anymore.

If the data compresses well, decompression still takes CPU time,
which means it can be sped up!

Optimizations in ClickHouse

For Memory tables:

— Reduced block size on write
for better cache locality of data processing #20169.

— Ability to compress Memory tables #20168.

For MergeTree tables:

— Removed unnecessary copying for NONE compression mode #22145.

— Ability to disable checksum verification when reading #19588,
but this ability should not be used.

— Ability to read using mmap #8520 to remove
unnecessary copying from the page cache, plus a cache of memory mappings #22206.

Conclusions

To optimize performance, you just need to:

— know exactly what your code does;

— profile the system under realistic load scenarios;

— understand hardware capabilities;

...

— not forget that the system has many cores, and the processor has cache;
  don't confuse latency and throughput :)

Thank you!