Author: Alexey Milovidov, 2019-09-14.
and basic principles.
Top-down development
or bottom-up?
Top-down:
— define project architecture;
— define what classes will be in the code;
— draw diagrams;
Bottom-up:
— how will the inner loop work?
— what's the memory layout of data?
— which bytes get moved where?
ClickHouse emerged from a prototype,
implemented in 2008
to solve a single task:
— efficiently filter and aggregate data.
— that is, efficiently execute GROUP BY.
1. What characteristics (throughput, latency, volume...)
does our hardware have (CPU, RAM, SSD, disks, network...)
and for which operations?
2. What data structures do we use
and how do they work with our hardware?
3. Use arithmetic...
Example:
— we'll do GROUP BY in memory;
— we'll put data in a hash table;
— if the hash table is large, it doesn't fit in L3 CPU cache;
— if keys are not locally distributed,
then for each row — L3 cache miss;
— L3 cache miss has latency of 70..100 ns;
How many keys per second can a GROUP BY query process?
Example:
SELECT rand() % 10000000 AS k
FROM system.numbers_mt
GROUP BY k
175.52 million rows/s.
L3 cache miss has throughput of 40 million ops/sec on a single CPU core
and ~500 million ops/sec* on 32 hyper-threading CPU cores
(Xeon E5-2650v2).
Never confuse latency and throughput!
* but we only got 175 million rows per second. Is ClickHouse slow?
If maximum efficiency is needed
— interfaces in code are defined by algorithms, not the other way around!
Example: substring search:
— in C: strstr, memmem;
— in C++: std::search, std::string::find.
These functions are slow (in certain usage scenarios).
Substring search:
void * memmem(const void * haystack, size_t haystacklen,
const void * needle, size_t needlelen);
— no separate initialization function;
— reentrant requirements — can't allocate memory inside.
What if we search for the same needle in 1,000,000 different haystacks?
Searcher searcher(needle);
for (const auto & haystack : haystacks)
searcher.search(haystack);
Substring search:
void * memmem(const void * haystack, size_t haystacklen,
const void * needle, size_t needlelen);
If we search for the same needle in 1,000,000 different haystacks,
then none of the functions strstr, memmem, std::search, std::string::find
will work well, because their interface doesn't allow it.
Without changing the abstraction, the code can't be made maximally efficient.
If you know your task better.
— substring search;
(but smart people already wrote std::search)
— array sorting;
(but smart people already wrote std::sort)
— hash table;
(but smart people already wrote std::unordered_map)
Substring search:
— exact or approximate search?
— of one or multiple strings?
— explicitly defined set or defined by a language?
— relatively short or long substrings?
— is substring a subsequence of
bytes / unicode code points / characters with collation / words?
— search in unknown or fixed string?
— located entirely in memory or as a data stream?
— with strict time guarantees or without?
Brute Force algorithm Deterministic Finite Automaton algorithm Karp-Rabin algorithm Shift Or algorithm Morris-Pratt algorithm Knuth-Morris-Pratt algorithm Simon algorithm Colussi algorithm Galil-Giancarlo algorithm Apostolico-Crochemore algorithm Not So Naive algorithm Boyer-Moore algorithm Turbo BM algorithm Apostolico-Giancarlo algorithm Reverse Colussi algorithm Horspool algorithm Quick Search algorithm Tuned Boyer-Moore algorithm Zhu-Takaoka algorithm Berry-Ravindran algorithm Smith algorithm Raita algorithm Reverse Factor algorithm Turbo Reverse Factor algorithm Forward Dawg Matching algorithm Backward Nondeterministic Dawg Matching algorithm Backward Oracle Matching algorithm Galil-Seiferas algorithm Two Way algorithm String Matching on Ordered Alphabets algorithm Optimal Mismatch algorithm Maximal Shift algorithm Skip Search algorithm KMP Skip Search algorithm Alpha Skip Search algorithm
https://www-igm.univ-mlv.fr/~lecroq/string/
None of these algorithms are used in ClickHouse.
ClickHouse uses:
— Volnitsky algorithm for constant needle;
— SIMD variant of brute-force for non-constant needle;
— Volnitsky algorithm for small set of constant needles;
— re2 and hyperscan for regular expressions.
Sorting
— array of numbers / tuples / strings / structures?
— entirely available in RAM?
— by comparisons / 3-way comparisons /
parallel comparisons / radix?
— direct / indirect?
— stable / non-stable?
— full / partial / n-th element?
— finishing sort / merge / partial sort?
ClickHouse uses pdqsort and radix sort,
but generally it's all bad and needs to be redone.
Hash table
— hash function selection;
— memory layout: open-addressing vs. chaining;
— large or small values;
— support for non-movable values needed;
— memory layout: keys and values in one array or separate;
— collision resolution method;
— value deletion method;
— fill factor; how and when to resize;
— how to move values during resize;
— fast probing with bit masks;
— inline placement of string keys;
— prefetch and batching;
ClickHouse uses the best* hash table.
* for its task.
* not one, but many different ones.
* and now we'll make it even better:
https://github.com/ClickHouse/ClickHouse/pull/5417
If you know your task better.
— substring search;
— array sorting;
— hash table;
...
— memory allocator (malloc);
— byte copying (memcpy);
ClickHouse uses:
— «Chinese» memcpy;
— memcpy for gathering short memory chunks:
memcpySmallAllowReadWriteOverflow15
or take the best!
If someone on the internet says their algorithm is «the best»
— we pull it into ClickHouse.
... and then throw it out if it's not the best.
Example: simdjson — took it, didn't throw out.
Example: mimalloc — took it, threw out.
Trivial example:
WHERE str LIKE '%hello%world!%'
— re2 regular expression
but before that — substring search for "world!";
WHERE str LIKE '%hello%'
— substring search;
WHERE str LIKE 'hello%'
— prefix comparison.
Similar optimization exists even in MySQL.
Specialization by data types (GROUP BY example):
using AggregatedDataWithoutKey = AggregateDataPtr;
...UInt8Key = FixedHashMap<UInt8, AggregateDataPtr>;
...UInt16Key = FixedHashMap<UInt16, AggregateDataPtr>;
...UInt64Key = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
...StringKey = HashMapWithSavedHash<StringRef, AggregateDataPtr>;
...Keys128 = HashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
...Keys256 = HashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
...UInt64KeyTwoLevel = TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
...StringKeyTwoLevel = TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr>;
...Keys128TwoLevel = TwoLevelHashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
...Keys256TwoLevel = TwoLevelHashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
...UInt64KeyHash64 = HashMap<UInt64, AggregateDataPtr, DefaultHash<UInt64>>;
...StringKeyHash64 = HashMapWithSavedHash<StringRef, AggregateDataPtr, StringRefHash64>;
...Keys128Hash64 = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;
...Keys256Hash64 = HashMap<UInt256, AggregateDataPtr, UInt256Hash>;
Specialization by data volume.
Example: quantileTiming function:
— up to 64 values — flat array in arena;
— up to 5670 values — flat array in heap;
— more — fixed-type histogram.
(QuantileTiming.h)
Example: uniqCombined function:
— flat array;
— hash table;
— HyperLogLog.
(CombinedCardinalityEstimator.h)
How to choose a data structure?
— determine what it does that's needed and unneeded.
Trivial example: std::string:
— manages memory for the string itself.
— allows string modification,
e.g. appending another character.
— knows the string size itself.
Trivial example: how to execute GROUP BY?
Option 1:
— sort the array;
— then walk through it,
computing aggregate functions for identical keys.
Option 2:
— put keys in a hash table;
— when encountering an existing key,
update aggregate function states.
Answer: option 2 is better; but if data is nearly sorted, better to finish sorting and use option 1; and if data doesn't fit in memory, partition it, then option 2.
#ifdef __SSE2__
/** A slightly more optimized version.
* Based on the assumption that often sequences of consecutive values
* completely pass or do not pass the filter.
* Therefore, we will optimistically check the sequences of SIMD_BYTES values.
*/
static constexpr size_t SIMD_BYTES = 16;
const __m128i zero16 = _mm_setzero_si128();
const UInt8 * filt_end_sse = filt_pos + size / SIMD_BYTES * SIMD_BYTES;
while (filt_pos < filt_end_sse)
{
int mask = _mm_movemask_epi8(
_mm_cmpgt_epi8(
_mm_loadu_si128(reinterpret_cast(filt_pos)), zero16));
if (0 == mask)
{
/// Nothing is inserted.
}
else if (0xFFFF == mask)
{
res_data.insert(data_pos, data_pos + SIMD_BYTES);
}
else
static inline int digits10(uint128_t x)
{
if (x < 10ULL)
return 1;
if (x < 100ULL)
return 2;
if (x < 1000ULL)
return 3;
if (x < 1000000000000ULL)
{
if (x < 100000000ULL)
{
if (x < 1000000ULL)
{
if (x < 10000ULL)
return 4;
else
return 5 + (x >= 100000ULL);
}
return 7 + (x >= 10000000ULL);
}
if (x < 10000000000ULL)
return 9 + (x >= 1000000000ULL);
return 11 + (x >= 100000000000ULL);
}
return 12 + digits10(x / 1000000000000ULL);
}
— choose different options randomly;
— collect statistics for each;
— treat time for each option as a random variable;
— estimate time distribution for each option;
— «mentally draw» a random variable for each option;
— choose the one with the best drawn value.
Used to optimize LZ4 decompression in ClickHouse.
Example:
Suppose we're testing the performance
not of an analytical database, but something simpler:
for example, hash tables.
Hash table performance depends on the balance
between quality and speed of the hash function.
Progaradar-children pregnant departures or Dachna Brideica and MO | Freezers. - Posters to enter into Accessories
Progaradar-children shore — Yandex.Money: Payment journal five bicycles on Lore - dfghf — ya.ru - real estate in Moscow) by 473682 announcements - Sale Comprom
Progaradar-children free! in large assortment»in Moscow - Embroidery — Omsk Free in Most sandal height
Progaradar-children free! in large assortment»in Moscow, transport portal
Progaradar-children berdyansk Fashions. Recipe with photo gallery and covered loud football [email protected] - Search
Progaradar-children pregnant sale Watch best price, community 2010 | Designer MAIN. Delivery
Progaradar-children free! in large assortment work Price: 820 0000 km., Taganda apartments in Saint-Pet
Progaradar-children pregnant monthly - DoramaTv.ru - Dresses around world. Internet sale auto used and on with discount
Progaradar-children pregnant room in 2013 year, policemen used in Stavropavlina and strollers -> Magnetaz 80 cell.RU
Progaradar-children pregnant - Official forums Kalinin (Ukraine. Avtoria Baksler Kudryavtsev delivery, vacancies, hotel sale
Progaradar-children pregnancy detailed bld. 5, 69, communication W*oychivom - Yandex.Maps, houses, what price tow truck forum games World of Tanks
Progaradar-children cap, fatherland and in pink page 2 from cabinet search by [email protected]
Progaradar-children pregnancy program in China heights Bar, weather Maniku. Records in Smolensk
Using the clickhouse-obfuscator utility.
Developer access to production.
Instrumentation, monitoring, diagnostics.
Fast release cycle.
To make your code run fast, you just need to:
— consider lower levels when designing the system;
— start from hardware capabilities;
— choose data structures and abstractions based on the task;
— apply specializations for specific cases;
— try new, «best» algorithms,
that you read about yesterday;
— choose algorithm variants at runtime based on statistics;
— test on real data;
— control performance in CI;
— measure and observe everything;
— including in production;
— and constantly redo everything;
Query Performance Analysis in ClickHouse:
https://youtu.be/ondHe_JUyW4
How Hash Tables Work in ClickHouse:
https://youtu.be/EoX82TEz2sQ
Parallel and Distributed GROUP BY:
https://youtu.be/SrucFOs8Y6c
How to Speed Up LZ4 Decompression:
https://youtu.be/V2CqQBICt7M
Database Obfuscation:
https://youtu.be/2iR7i4akL44
Web site: https://clickhouse.com/
Maillist: [email protected]
YouTube: https://www.youtube.com/c/ClickHouseDB
Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en
GitHub: https://github.com/ClickHouse/ClickHouse/
Twitter: https://twitter.com/ClickHouseDB
Google groups: https://groups.google.com/forum/#!forum/clickhouse