ClickHouse Performance Optimization Secrets

Author: Alexey Milovidov, 2019-09-14.

Trivial Facts and Basic Principles

Development Principles

Top-down development
or bottom-up?

Development Principles

Top-down:

— define project architecture;

— define what classes will be in the code;

— draw diagrams;

Development Principles

Bottom-up:

— how will the inner loop work?

— what's the memory layout of data?

— which bytes get moved where?

How ClickHouse Appeared

ClickHouse emerged from a prototype,
implemented in 2008
to solve a single task:
— efficiently filter and aggregate data.

— that is, efficiently execute GROUP BY.

https://habr.com/en/company/yandex/blog/273305/

Hardware Capabilities

1. What characteristics (throughput, latency, volume...)
 does our hardware have (CPU, RAM, SSD, disks, network...)
 and for which operations?

2. What data structures do we use
 and how do they work with our hardware?

3. Use arithmetic...

Hardware Capabilities

Example:

— we'll do GROUP BY in memory;

— we'll put data in a hash table;

— if the hash table is large, it doesn't fit in L3 CPU cache;

— if keys are not locally distributed,
      then for each row — L3 cache miss;

— L3 cache miss has latency of 70..100 ns;

How many keys per second can a GROUP BY query process?

Hardware Capabilities

Example:

SELECT rand() % 10000000 AS k FROM system.numbers_mt GROUP BY k

175.52 million rows/s.

Hardware Capabilities

L3 cache miss has throughput of 40 million ops/sec on a single CPU core

and ~500 million ops/sec* on 32 hyper-threading CPU cores
(Xeon E5-2650v2).

Never confuse latency and throughput!

* but we only got 175 million rows per second. Is ClickHouse slow?

Abstractions Serve Algorithms

If maximum efficiency is needed,
interfaces in the code are defined by the algorithms, not the other way around!

Abstractions Serve Algorithms

Example: substring search:

— in C: strstr, memmem;

— in C++: std::search, std::string::find.

These functions are slow (in certain usage scenarios).

Abstractions Serve Algorithms

Substring search:

void * memmem(const void * haystack, size_t haystacklen, const void * needle, size_t needlelen);

— no separate initialization function;
— reentrancy requirement — it can't allocate memory internally.

What if we search for the same needle in 1,000,000 different haystacks?

Searcher searcher(needle);
for (const auto & haystack : haystacks)
    searcher.search(haystack);

Abstractions Serve Algorithms

Substring search:

void * memmem(const void * haystack, size_t haystacklen, const void * needle, size_t needlelen);

If we search for the same needle in 1,000,000 different haystacks,
then none of the functions strstr, memmem, std::search, std::string::find
will work well, because their interface doesn't allow it.

Without changing the abstraction, the code can't be made maximally efficient.

Can Always Do Better

If you know your task better.

— substring search;
(but smart people already wrote std::search)

— array sorting;
(but smart people already wrote std::sort)

— hash table;
(but smart people already wrote std::unordered_map)

Each Task is a Landscape

Substring search:

— exact or approximate search?
— of one or multiple strings?
— explicitly defined set or defined by a language?
— relatively short or long substrings?
— is the substring a sequence of
bytes / Unicode code points / characters with collation / words?
— search in unknown or fixed string?
— located entirely in memory or as a data stream?
— with strict time guarantees or without?

Algorithms from the string-matching literature: Brute Force, Deterministic Finite Automaton, Karp-Rabin, Shift Or, Morris-Pratt, Knuth-Morris-Pratt, Simon, Colussi, Galil-Giancarlo, Apostolico-Crochemore, Not So Naive, Boyer-Moore, Turbo BM, Apostolico-Giancarlo, Reverse Colussi, Horspool, Quick Search, Tuned Boyer-Moore, Zhu-Takaoka, Berry-Ravindran, Smith, Raita, Reverse Factor, Turbo Reverse Factor, Forward Dawg Matching, Backward Nondeterministic Dawg Matching, Backward Oracle Matching, Galil-Seiferas, Two Way, String Matching on Ordered Alphabets, Optimal Mismatch, Maximal Shift, Skip Search, KMP Skip Search, Alpha Skip Search.

https://www-igm.univ-mlv.fr/~lecroq/string/

None of these algorithms are used in ClickHouse.

Each Task is a Landscape

ClickHouse uses:

— Volnitsky's algorithm for a constant needle;

— a SIMD variant of brute force for a non-constant needle;

— Volnitsky's algorithm for a small set of constant needles;

— re2 and hyperscan for regular expressions.

https://habr.com/en/company/yandex/blog/466183/

Each Task is a Landscape

Sorting

— array of numbers / tuples / strings / structures?

— entirely available in RAM?

— by comparisons / 3-way comparisons /
     parallel comparisons / radix?

— direct / indirect?

— stable / non-stable?

— full / partial / n-th element?

— finishing sort / merge / partial sort?

Each Task is a Landscape

ClickHouse uses pdqsort and radix sort,
but generally it's all bad and needs to be redone.

Each Task is a Landscape

Hash table

— hash function selection;
— memory layout: open-addressing vs. chaining;
— large or small values;
— whether non-movable values must be supported;
— memory layout: keys and values in one array or separate;
— collision resolution method;
— value deletion method;
— fill factor; how and when to resize;
— how to move values during resize;
— fast probing with bit masks;
— inline placement of string keys;
— prefetch and batching;

Hash Table

ClickHouse uses the best* hash table.

* for its task.
* not one, but many different ones.
* and now we'll make it even better:

https://github.com/ClickHouse/ClickHouse/pull/5417

Can Always Do Better

If you know your task better.

— substring search;

— array sorting;

— hash table;

...

— memory allocator (malloc);

— byte copying (memcpy);

Can Always Do Better

ClickHouse uses:

— «Chinese» memcpy;

— memcpy for gathering short memory chunks:

memcpySmallAllowReadWriteOverflow15

Can Always Do Better

or take the best!

If someone on the internet says their algorithm is «the best»
— we pull it into ClickHouse.

... and then throw it out if it's not the best.

Example: simdjson — took it, didn't throw out.

Example: mimalloc — took it, threw out.

Task Specialization

Trivial example:

WHERE str LIKE '%hello%world!%'

— re2 regular expression
but before that — substring search for "world!";

WHERE str LIKE '%hello%'

— substring search;

WHERE str LIKE 'hello%'

— prefix comparison.

Similar optimization exists even in MySQL.

Task Specialization

Specialization by data types (GROUP BY example):

using AggregatedDataWithoutKey = AggregateDataPtr;
...UInt8Key          = FixedHashMap<UInt8, AggregateDataPtr>;
...UInt16Key         = FixedHashMap<UInt16, AggregateDataPtr>;
...UInt64Key         = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
...StringKey         = HashMapWithSavedHash<StringRef, AggregateDataPtr>;
...Keys128           = HashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
...Keys256           = HashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
...UInt64KeyTwoLevel = TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
...StringKeyTwoLevel = TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr>;
...Keys128TwoLevel   = TwoLevelHashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
...Keys256TwoLevel   = TwoLevelHashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
...UInt64KeyHash64   = HashMap<UInt64, AggregateDataPtr, DefaultHash<UInt64>>;
...StringKeyHash64   = HashMapWithSavedHash<StringRef, AggregateDataPtr, StringRefHash64>;
...Keys128Hash64     = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;
...Keys256Hash64     = HashMap<UInt256, AggregateDataPtr, UInt256Hash>;

dbms/src/Interpreters/Aggregator.h

Task Specialization

Specialization by data volume.

Example: quantileTiming function:
— up to 64 values — flat array in arena;
— up to 5670 values — flat array in heap;
— more — fixed-type histogram.
(QuantileTiming.h)

Example: uniqCombined function:
— flat array;
— hash table;
— HyperLogLog.
(CombinedCardinalityEstimator.h)

Data Structures Considered
in Task Context

How to choose a data structure?

— determine which of its capabilities you need and which you don't.

Trivial example: std::string:

— manages memory for the string itself.

— allows string modification,
  e.g. appending another character.

— knows the string size itself.

Data Structures Considered
in Task Context

Trivial example: how to execute GROUP BY?

Option 1:
— sort the array;
— then walk through it,
  computing aggregate functions for identical keys.

Option 2:
— put keys in a hash table;
— when encountering an existing key,
  update aggregate function states.

Answer: option 2 is better;
— but if the data is nearly sorted, it's better to finish the sort and use option 1;
— and if the data doesn't fit in memory, partition it, then use option 2.

Algorithms Consider Data Distribution

#ifdef __SSE2__
/** A slightly more optimized version.
  * Based on the assumption that often sequences of consecutive values
  * completely pass or do not pass the filter.
  * Therefore, we will optimistically check the sequences of SIMD_BYTES values.
  */
static constexpr size_t SIMD_BYTES = 16;
const __m128i zero16 = _mm_setzero_si128();
const UInt8 * filt_end_sse = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_sse)
{
    int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(filt_pos)), zero16));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFF == mask)
    {
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else

Algorithms Consider Data Distribution

static inline int digits10(uint128_t x)
{
    if (x < 10ULL) return 1;
    if (x < 100ULL) return 2;
    if (x < 1000ULL) return 3;

    if (x < 1000000000000ULL)
    {
        if (x < 100000000ULL)
        {
            if (x < 1000000ULL)
            {
                if (x < 10000ULL) return 4;
                else return 5 + (x >= 100000ULL);
            }
            return 7 + (x >= 10000000ULL);
        }
        if (x < 10000000000ULL) return 9 + (x >= 1000000000ULL);
        return 11 + (x >= 100000000000ULL);
    }
    return 12 + digits10(x / 1000000000000ULL);
}

Algorithms Learn from Data

Source: https://learnforeverlearn.com/bandits/

Multi-Armed Bandits

— choose different options randomly;

— collect statistics for each;

— treat time for each option as a random variable;

— estimate time distribution for each option;

Thompson Sampling

— «mentally draw» a random variable for each option;

— choose the one with the best drawn value.

Used to optimize LZ4 decompression in ClickHouse.

https://habr.com/en/company/yandex/blog/452778/

Testing on Real Data

Example:

Suppose we're testing the performance
not of an analytical database, but something simpler:
for example, hash tables.

Hash table performance depends on the balance
between quality and speed of the hash function.

Real Data is Important

— um, it's just a trivial test ...
  clickhouse hashmap takes 10GB
  this only takes 2
  and it's 60% faster

Alexey Milovidov:
— This test almost doesn't make sense...
  Let me share a dataset with real strings...
  [ File: data.tar ]

— clickhouse hashmap is faster
  from 2 times to 5 times

Data Obfuscation

Progaradar-children pregnant departures or Dachna Brideica and MO | Freezers. - Posters to enter into Accessories Progaradar-children shore — Yandex.Money: Payment journal five bicycles on Lore - dfghf — ya.ru - real estate in Moscow) by 473682 announcements - Sale Comprom Progaradar-children free! in large assortment»in Moscow - Embroidery — Omsk Free in Most sandal height Progaradar-children free! in large assortment»in Moscow, transport portal Progaradar-children berdyansk Fashions. Recipe with photo gallery and covered loud football [email protected] - Search Progaradar-children pregnant sale Watch best price, community 2010 | Designer MAIN. Delivery Progaradar-children free! in large assortment work Price: 820 0000 km., Taganda apartments in Saint-Pet Progaradar-children pregnant monthly - DoramaTv.ru - Dresses around world. Internet sale auto used and on with discount Progaradar-children pregnant room in 2013 year, policemen used in Stavropavlina and strollers -> Magnetaz 80 cell.RU Progaradar-children pregnant - Official forums Kalinin (Ukraine. Avtoria Baksler Kudryavtsev delivery, vacancies, hotel sale Progaradar-children pregnancy detailed bld. 5, 69, communication W*oychivom - Yandex.Maps, houses, what price tow truck forum games World of Tanks Progaradar-children cap, fatherland and in pink page 2 from cabinet search by [email protected] Progaradar-children pregnancy program in China heights Bar, weather Maniku. Records in Smolensk

Using the clickhouse-obfuscator utility.

More Principles...

Developer access to production.

Instrumentation, monitoring, diagnostics.

Fast release cycle.

Conclusions

To make your code run fast, you just need to:

— consider lower levels when designing the system;
— start from hardware capabilities;
— choose data structures and abstractions based on the task;
— apply specializations for specific cases;
— try new, «best» algorithms,
  that you read about yesterday;
— choose algorithm variants at runtime based on statistics;
— test on real data;
— control performance in CI;
— measure and observe everything;
— including in production;
— and constantly redo everything;

Query Performance Analysis in ClickHouse:
https://youtu.be/ondHe_JUyW4

How Hash Tables Work in ClickHouse:
https://youtu.be/EoX82TEz2sQ

Parallel and Distributed GROUP BY:
https://youtu.be/SrucFOs8Y6c

How to Speed Up LZ4 Decompression:
https://youtu.be/V2CqQBICt7M

Database Obfuscation:
https://youtu.be/2iR7i4akL44

Web site: https://clickhouse.com/

Mailing list: [email protected]

YouTube: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

GitHub: https://github.com/ClickHouse/ClickHouse/

Twitter: https://twitter.com/ClickHouseDB

Google groups: https://groups.google.com/forum/#!forum/clickhouse