or: How To Write Efficient Code
and ground truths.
"top-down"
or "bottom-up" design?
Top-Down:
— choose a high-level architecture;
— choose what classes will be in the codebase;
— draw some diagrams;
Bottom-Up:
— how will the inner loop in my code work?
— what is the data layout in memory?
— what bytes are read/written and where?
ClickHouse was developed from a prototype,
implemented in 2008,
that was intended to solve just a single task:
— to filter and aggregate data as fast as possible.
— in other words, just to do GROUP BY.
http://highscalability.com/blog/2017/9/18/evolution-of-data-structures-in-yandexmetrica.html
1. What are the basic numbers (throughput, latency, volume...)
of our hardware (CPU, RAM, SSD, HDD, network...)
on which operations?
2. What are the data structures we use,
and how do they work with our hardware?
3. and do some basic math...
Example:
— we will do GROUP BY in memory;
— we will put all the data in a hash table;
— if the hash table is large, it will not fit in the CPU's L3 cache;
— if the values of the GROUP BY keys are not distributed locally,
then we get an L3 cache miss for every row of the table;
— an L3 cache miss has 70..100 ns latency;
How many keys per second can we process?
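Doing the basic math (a back-of-the-envelope estimate of mine, not from the slides; assume one L3 miss per row and no overlap between misses):

$$ \frac{1\ \text{s}}{100\ \text{ns per miss}} = 10^{7} \approx 10\ \text{million keys/s per core} $$

so a purely latency-bound loop would be stuck at about 10 million keys per second per core.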
Example:
SELECT rand() % 10000000 AS k
FROM system.numbers_mt
GROUP BY k
175.52 million rows/s.
An L3 cache miss has a throughput of 40 million ops/sec on a single CPU core,
and ~500 million ops/sec* on 32 hyper-threaded CPU cores
(Xeon E5-2650v2).
Never confuse latency with throughput!
* but we have just 175 million rows per second. Is ClickHouse slow?
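A rough consistency check (my arithmetic, not from the slides): at ~100 ns per miss, 40 million misses/s on one core means

$$ 4 \cdot 10^{7}\ \text{s}^{-1} \times 10^{-7}\ \text{s} \approx 4 $$

misses in flight at any moment. The out-of-order core overlaps several cache misses (memory-level parallelism), which is exactly why throughput is much better than 1/latency.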
If you need maximum performance,
— then the interfaces in the code are determined by the algorithms!
Example: substring search:
— in C: strstr, memmem;
— in C++: std::search, std::string::find.
But these functions are slow! (in some usage scenarios).
Substring Search:
void * memmem(const void * haystack, size_t haystacklen,
const void * needle, size_t needlelen);
— there is no separate initialization routine;
— required to be reentrant, so it cannot allocate memory.
But what if we search for a single needle in 1 000 000 different haystacks?
Searcher searcher(needle);
for (const auto & haystack : haystacks)
    searcher.search(haystack);
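What could stand behind that interface? A minimal sketch, assuming a Boyer-Moore-Horspool preprocessing step (the class and its methods are my illustration, not ClickHouse's actual Searcher):

#include <array>
#include <cstddef>
#include <string_view>

/// The constructor does the one-time preprocessing of the needle
/// (a bad-character shift table), so a million searches pay the cost once.
/// Note: stores a view, so the needle buffer must outlive the Searcher.
class Searcher
{
    std::string_view needle;
    std::array<size_t, 256> shift;

public:
    explicit Searcher(std::string_view needle_) : needle(needle_)
    {
        shift.fill(needle.size());
        for (size_t i = 0; i + 1 < needle.size(); ++i)
            shift[static_cast<unsigned char>(needle[i])] = needle.size() - 1 - i;
    }

    /// Returns the offset of the first match, or npos if there is none.
    size_t search(std::string_view haystack) const
    {
        if (needle.empty())
            return 0;
        size_t pos = 0;
        while (pos + needle.size() <= haystack.size())
        {
            if (haystack.compare(pos, needle.size(), needle) == 0)
                return pos;
            /// Jump by the shift of the last byte of the current window.
            pos += shift[static_cast<unsigned char>(haystack[pos + needle.size() - 1])];
        }
        return std::string_view::npos;
    }
};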
Substring Search:
void * memmem(const void * haystack, size_t haystacklen,
const void * needle, size_t needlelen);
If we search for a single needle in 1 000 000 different haystacks,
then none of strstr, memmem, std::search, std::string::find
will work fast, because their interface is not suitable.
And without changing the interface, you cannot make them fast.
If you know your task better...
— substring search;
(but some smart guys have already implemented std::search)
— array sorting;
(but some smart guys have already implemented std::sort)
— hash table;
(but some smart guys have already implemented std::unordered_map)
Substring Search:
— exact or approximate search?
— one or multiple substrings?
— is the set of substrings explicit, or specified by a language?
— are the substrings rather short or long?
— is a substring a sequence of
bytes / unicode code points / characters with custom collation / words?
— searching in a predefined text, or the text is not known in advance?
— is the text completely in memory, or available as a stream of data?
— with strong guarantees on time or not?
Brute Force, Deterministic Finite Automaton, Karp-Rabin, Shift Or,
Morris-Pratt, Knuth-Morris-Pratt, Simon, Colussi, Galil-Giancarlo,
Apostolico-Crochemore, Not So Naive, Boyer-Moore, Turbo BM,
Apostolico-Giancarlo, Reverse Colussi, Horspool, Quick Search,
Tuned Boyer-Moore, Zhu-Takaoka, Berry-Ravindran, Smith, Raita,
Reverse Factor, Turbo Reverse Factor, Forward Dawg Matching,
Backward Nondeterministic Dawg Matching, Backward Oracle Matching,
Galil-Seiferas, Two Way, String Matching on Ordered Alphabets,
Optimal Mismatch, Maximal Shift, Skip Search, KMP Skip Search,
Alpha Skip Search algorithms.
https://www-igm.univ-mlv.fr/~lecroq/string/
But none of these algorithms are used in ClickHouse!
What ClickHouse is using:
— the Volnitsky algorithm when the needle is constant;
— SIMD-optimized brute force for a non-constant needle;
— a variation of the Volnitsky algorithm for a set of constant needles;
— re2 and hyperscan for regular expressions.
Sorting
— an array of numbers / tuples / strings / structures?
— available completely in RAM?
— with comparisons / 3-way comparisons /
parallel comparisons / by radix?
— direct / indirect (don't sort the data, obtain a permutation;
see the sketch after this list)?
— stable / non-stable?
— full / partial / n-th element?
— finish sorting / merging / incomplete sorting?
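To make one checklist item concrete, a minimal sketch (my illustration, not ClickHouse code) of an indirect sort: instead of moving the data, obtain a permutation of row indices, the way a DBMS can order whole rows by a single column.

#include <algorithm>
#include <numeric>
#include <vector>

/// Returns the permutation that would sort `column` ascending,
/// without touching the column itself.
std::vector<size_t> sortedPermutation(const std::vector<int> & column)
{
    std::vector<size_t> perm(column.size());
    std::iota(perm.begin(), perm.end(), 0);  /// 0, 1, 2, ...
    std::sort(perm.begin(), perm.end(),
              [&](size_t a, size_t b) { return column[a] < column[b]; });
    return perm;
}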
ClickHouse uses pdqsort and radix sort,
... but it's not perfect, we must rewrite it.
Hash Table (my favorite)
— the choice of hash function;
— memory layout: open addressing vs. chaining;
— small or big values;
— support for non-movable values;
— memory layout: one array for keys and values, or separate arrays;
— collision resolution algorithm;
— algorithm for value removal;
— fill factor; when and how to resize;
— how to move values around on resize;
— fast probing with bitmaps;
— inline placement of string keys;
— prefetch and batching;
(a toy sketch of some of these choices follows below)
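A toy sketch of a few of these choices (a hypothetical FlatMap of mine, nothing like ClickHouse's real tables): open addressing with linear probing, keys and values together in one array of cells, and a fixed fill factor of 0.5.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

class FlatMap
{
    struct Cell { uint64_t key = 0; uint64_t value = 0; bool used = false; };
    std::vector<Cell> cells;
    size_t count = 0;

    /// Simplified multiplicative hash; the table size is a power of two.
    size_t place(uint64_t key) const { return (key * 11400714819323198485ull) % cells.size(); }

public:
    FlatMap() : cells(16) {}

    void insert(uint64_t key, uint64_t value)
    {
        if ((count + 1) * 2 > cells.size())  /// fill factor 0.5: resize
            rehash();
        size_t i = place(key);
        while (cells[i].used && cells[i].key != key)
            i = (i + 1) % cells.size();      /// linear probing
        count += !cells[i].used;
        cells[i] = {key, value, true};
    }

    std::optional<uint64_t> find(uint64_t key) const
    {
        for (size_t i = place(key); cells[i].used; i = (i + 1) % cells.size())
            if (cells[i].key == key)
                return cells[i].value;
        return std::nullopt;
    }

private:
    void rehash()
    {
        /// "How to move values around on resize": here, just reinsert them.
        std::vector<Cell> old = std::move(cells);
        cells.assign(old.size() * 2, Cell{});
        count = 0;
        for (const Cell & c : old)
            if (c.used)
                insert(c.key, c.value);
    }
};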
We use the best* hash table in ClickHouse.
* the best for our needs.
* not a single one, but multiple different hash tables.
* and we are constantly trying to do better:
https://github.com/ClickHouse/ClickHouse/pull/5417
— Add StringHashMap to optimize string aggregation by Amos Bird.
If you know your task better...
— substring search;
— array sorting;
— hash table;
...
— allocating memory (malloc);
— copying bytes around (memcpy);
In ClickHouse we use:
— the Chinese memcpy:
https://github.com/ClickHouse/ClickHouse/tree/master/libs/libmemcpy
— a special memcpy to gather short pieces of memory:
memcpySmallAllowReadWriteOverflow15 (the idea is sketched below);
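The idea behind it, as a sketch (my illustration with an assumed padding contract, not the actual ClickHouse source): if the caller guarantees that reading and writing up to 15 bytes past the end of the buffers is safe, the loop can always move whole 16-byte SSE registers and needs no scalar tail.

#include <cstddef>
#include <emmintrin.h>  /// SSE2

static inline void memcpySmallSketch(char * dst, const char * src, size_t n)
{
    while (n > 0)
    {
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
                         _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
        dst += 16;
        src += 16;
        n = n > 16 ? n - 16 : 0;  /// may copy up to 15 extra bytes: by design
    }
}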
or grab the best!
If someone on the internet said that "my algorithm is the best"
— use that algorithm in ClickHouse...
... and probably throw it away if it isn't the best.
Example: simdjson — adopted, still in use.
Example: mimalloc — tried, thrown away.
Trivial example:
WHERE str LIKE '%hello%world!%'
— a regular expression (re2),
but with a substring search for "world!" first;
WHERE str LIKE '%hello%'
— substring search;
WHERE str LIKE 'hello%'
— prefix comparison.
But even MySQL has a similar optimization.
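To make the dispatch concrete, a hypothetical sketch (the names are mine; this is not ClickHouse's actual LIKE implementation, and the '_' wildcard is ignored for brevity):

#include <string_view>

enum class LikeStrategy { PrefixCompare, SubstringSearch, RegularExpression };

LikeStrategy chooseLikeStrategy(std::string_view p)
{
    /// 'hello%'  : the only '%' is at the end -> prefix comparison.
    if (!p.empty() && p.back() == '%' && p.find('%') == p.size() - 1)
        return LikeStrategy::PrefixCompare;

    /// '%hello%' : '%' only at both ends -> substring search.
    if (p.size() >= 2 && p.front() == '%' && p.back() == '%'
        && p.find('%', 1) == p.size() - 1)
        return LikeStrategy::SubstringSearch;

    /// '%hello%world!%' and the rest -> regular expression (re2),
    /// possibly with a substring pre-filter for a literal fragment.
    return LikeStrategy::RegularExpression;
}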
Specialization for the Data Types (example: GROUP BY):
using AggregatedDataWithoutKey = AggregateDataPtr;
using AggregatedDataWithUInt8Key = FixedHashMap<UInt8, AggregateDataPtr>;
using AggregatedDataWithUInt16Key = FixedHashMap<UInt16, AggregateDataPtr>;
using AggregatedDataWithUInt64Key = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
using AggregatedDataWithStringKey = HashMapWithSavedHash<StringRef, AggregateDataPtr>;
using AggregatedDataWithKeys128 = HashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
using AggregatedDataWithKeys256 = HashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
using AggregatedDataWithUInt64KeyTwoLevel = TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
using AggregatedDataWithStringKeyTwoLevel = TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr>;
using AggregatedDataWithKeys128TwoLevel = TwoLevelHashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
using AggregatedDataWithKeys256TwoLevel = TwoLevelHashMap<UInt256, AggregateDataPtr, UInt256HashCRC32>;
using AggregatedDataWithUInt64KeyHash64 = HashMap<UInt64, AggregateDataPtr, DefaultHash<UInt64>>;
using AggregatedDataWithStringKeyHash64 = HashMapWithSavedHash<StringRef, AggregateDataPtr, StringRefHash64>;
using AggregatedDataWithKeys128Hash64 = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;
using AggregatedDataWithKeys256Hash64 = HashMap<UInt256, AggregateDataPtr, UInt256Hash>;
Specialization for different amounts of data.
Example: quantileTiming function:
— less than 64 values — flat array in memory arena;
— less than 5670 values — flat array in heap memory;
— more — a histogram with custom buckets.
(QuantileTiming.h)
Example: uniqCombined function:
— flat array;
— hash table;
— HyperLogLog.
(CombinedCardinalityEstimator.h)
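The same pattern as a toy sketch (a hypothetical TinyCardinality of mine; see CombinedCardinalityEstimator.h for the real one): start with an exact flat array, switch representation only when it stops paying off; the HyperLogLog tier is omitted for brevity.

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

class TinyCardinality
{
    static constexpr size_t FLAT_LIMIT = 16;
    std::vector<uint64_t> flat;          /// tier 1: flat array, linear search
    std::unordered_set<uint64_t> set;    /// tier 2: hash table
    bool use_set = false;

public:
    void insert(uint64_t x)
    {
        if (!use_set)
        {
            for (uint64_t v : flat)
                if (v == x)
                    return;
            flat.push_back(x);
            if (flat.size() > FLAT_LIMIT)  /// switch representations
            {
                set.insert(flat.begin(), flat.end());
                use_set = true;
            }
            return;
        }
        set.insert(x);
    }

    size_t size() const { return use_set ? set.size() : flat.size(); }
};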
How to choose a data structure?
— find out what it implements, both needed and unneeded.
Trivial example: std::string:
— implements memory management by itself;
— allows modifying the string,
e.g. appending one more character;
— tracks the string size on its own.
Trivial example: how to implement GROUP BY?
Option 1:
— sort the array by keys;
— then iterate through it,
and calculate aggregate functions for consecutive identical keys.
Option 2:
— put all keys into a hash table;
— when the key is found again,
update the states of aggregate functions.
Answer: option 2 is better;
but if the data is almost sorted, it's better to finish sorting and apply option 1;
and if the data doesn't fit in memory, partition it into buckets, then apply option 2.
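A minimal sketch of option 2 (my illustration, using std::unordered_map for clarity; ClickHouse uses its own specialized hash tables, as shown earlier):

#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct SumState { uint64_t sum = 0; uint64_t count = 0; };

std::unordered_map<uint64_t, SumState> groupBySum(
    const std::vector<uint64_t> & keys, const std::vector<uint64_t> & values)
{
    assert(keys.size() == values.size());
    std::unordered_map<uint64_t, SumState> result;
    for (size_t i = 0; i < keys.size(); ++i)
    {
        SumState & state = result[keys[i]];  /// insert or find the key
        state.sum += values[i];              /// update the aggregate state
        ++state.count;
    }
    return result;
}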
#ifdef __SSE2__
/** A slightly more optimized version.
  * Based on the assumption that often sequences of consecutive values
  * completely pass or do not pass the filter.
  * Therefore, we will optimistically check sequences of SIMD_BYTES values.
  */
static constexpr size_t SIMD_BYTES = 16;
const __m128i zero16 = _mm_setzero_si128();
const UInt8 * filt_end_sse = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_sse)
{
    int mask = _mm_movemask_epi8(
        _mm_cmpgt_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(filt_pos)), zero16));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFF == mask)
    {
        /// The whole chunk passes the filter: insert it at once.
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else
    {
        /// Mixed case: check the filter byte by byte.
        for (size_t i = 0; i < SIMD_BYTES; ++i)
            if (filt_pos[i])
                res_data.push_back(data_pos[i]);
    }

    filt_pos += SIMD_BYTES;
    data_pos += SIMD_BYTES;
}
#endif
/// Specialization for special cases: a branchy digits10 that resolves
/// the most common (small) magnitudes with the fewest comparisons.
static inline int digits10(uint128_t x)
{
if (x < 10ULL)
return 1;
if (x < 100ULL)
return 2;
if (x < 1000ULL)
return 3;
if (x < 1000000000000ULL)
{
if (x < 100000000ULL)
{
if (x < 1000000ULL)
{
if (x < 10000ULL)
return 4;
else
return 5 + (x >= 100000ULL);
}
return 7 + (x >= 10000000ULL);
}
if (x < 10000000000ULL)
return 9 + (x >= 1000000000ULL);
return 11 + (x >= 100000000000ULL);
}
return 12 + digits10(x / 1000000000000ULL);
}
— select from the different options randomly;
— calculate statistics for each option;
— consider the time (execution speed) of each option as a random variable;
— estimate the distribution of time for each option;
— sample from that random variable for each option;
— choose the option for which the sampled value is best.
This method is used to optimize LZ4 decompression in ClickHouse.
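A toy sketch of that loop, assuming roughly normal timings (the names are hypothetical and this is not the actual ClickHouse code; the real optimization lives in its LZ4 decompression routine):

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

/// Running statistics of execution time for one option.
struct OptionStats
{
    double sum = 0, sum_sq = 0, count = 0;

    void add(double time) { sum += time; sum_sq += time * time; ++count; }

    /// Sample a plausible value of this option's mean time.
    double sampleMean(std::mt19937 & rng) const
    {
        if (count < 2)
            return std::uniform_real_distribution<>(0.0, 1e-9)(rng);  /// force exploration
        double mean = sum / count;
        double var = std::max((sum_sq - sum * sum / count) / (count - 1), 1e-12);
        /// Distribution of the sample mean: N(mean, var / count).
        return std::normal_distribution<>(mean, std::sqrt(var / count))(rng);
    }
};

/// Thompson-style sampling: pick the option whose sampled mean time is smallest.
size_t chooseOption(const std::vector<OptionStats> & options, std::mt19937 & rng)
{
    size_t best = 0;
    double best_time = options[0].sampleMean(rng);
    for (size_t i = 1; i < options.size(); ++i)
    {
        double t = options[i].sampleMean(rng);
        if (t < best_time) { best_time = t; best = i; }
    }
    return best;
}

After executing the chosen option, feed the measured time back with add(); rarely-tried options keep being sampled because the uncertainty of their estimated mean stays large.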
Example:
Suppose we need to do a benchmark...
not of a column-oriented DBMS, but of something simpler:
for example, hash tables.
But the speed of a hash table depends on the balance between
the quality and the speed of the hash function.
And to benchmark it realistically, we need realistic data that we cannot publish;
so we transform the real data in a way that preserves
its statistical properties but destroys its meaning:
Проградар-детей беременны отправления или Дачна Невестика и МО | Холодеи. - Плакаты пустить в Аксессуаро
Проградар-детей бережье — Яндекс.Деньги: Оплатного журнал пять велосипеды на Lore - dfghf — я.ру - недвижимость в Москва) по 473682 объявленов - Продаже Компром
Проградар-детей бесплатно! в большом ассоте»в Москве - Вышивку — Омский Оплатно в Самые босоножка рост
Проградар-детей бесплатно! в большом ассоте»в Москве, странспортал
Проградар-детей бердянский Модов. Рецепт c фотогалерея и прикрыло громной футбола Московье@Mail.Ru - Поиск
Проградар-детей бережнева продажа Смотрите лучшей цене, сообществе 2010 | Проектиролабор СНОВНЫЕ. Доста
Проградар-детей бесплатно! в большом ассотруди Цена: 820 0000 км., Таганде квартиры в Санкт-Пет
Проградар-детей бережный месяцам - DoramaTv.ru - Платья повсему мире. Интернет-продажа авто б.у. и на со скидкой
Проградар-детей беремховой комн. в 2013 год, болисменной подержанны в Ставропавлины и коляски -> Магнитаз 80 сотовим.РУ
Проградар-детей бережена - Официальный форумы Калинин (Украины. Авториа Бакслера Кудрявцева поставка, вакансионны, продаже отеля
Проградар-детей беременность подробная д. 5, 69, общения W*ойчивом - Яндекс.Карты, дома, какой цены эвакуатор форум игры World of Tanks
Проградар-детей берец, отечестве и в розовый стр. 2 из кабинет поиск по доровье@Mail.Ru
Проградар-детей беремени програда в Китая верты Баре, попогода Манику. Записи в Смоленское
With the clickhouse-obfuscator tool.
Developers should have access to production servers.
Instrumentation, monitoring, diagnostics.
Fast release life cycle and deployment.
To write fast code, you just need to:
— keep in mind low-level details when designing your system;
— design based on hardware capabilities;
— choose data structures and abstractions based on the needs of the task;
— provide specializations for special cases;
— try the new "best" algorithms that you read about yesterday;
— choose an algorithm at runtime based on statistics;
— benchmark on real datasets;
— test for performance regressions in CI;
— measure and observe everything;
— even in the production environment;
— and rewrite the code all the time!
Web site: https://clickhouse.com/
GitHub: https://github.com/ClickHouse/ClickHouse/
Mailing list: [email protected]
WeChat: 4 groups (ask a friend to invite you).
+ meetups.
Moscow, Saint Petersburg, Novosibirsk, Ekaterinburg, Minsk...
... Berlin, Paris, Amsterdam, Madrid, Munich, Istanbul, Ankara...
... San Francisco, Beijing, Shenzhen, Shanghai, Hong Kong, Singapore, Tokyo...