Author: Alexey Milovidov, 2017-08-22.
Alexey, ClickHouse developer.
Since 2008, I have been working on the data processing engine of Yandex.Metrica.
I write in C++.
GROUP BY
SELECT DISTINCT
IN, JOIN
And also:
— uniqExact, uniq;
— arrayEnumerateUniq;
— LIMIT BY.
1. Hashing.
2. Collision resolution.
3. Resizes.
Not all of these components are mandatory.
Example: direct mapped cache.
A magical technique that permeates
all software engineering & computer science.
Used everywhere.
Search algorithms, machine learning,
distributed systems, data compression,
sketching data structures...
1. Using a trivial hash function
hash(x) = x
This is how it's done in the standard library implementation (std::hash)
in libstdc++ and libc++ for integers.
Example: let's count how many times each user was on the internet.
Input — array of yandexuid.
yandexuid — UInt64
for (const auto & key : data)
++map[key];
Total 100,000,000 visits, containing 17,630,976 different yandexuid.
value_type — 16 bytes, RAM approximately 260 MB
— doesn't fit in the last-level (LL) cache.
What is yandexuid: concat(rand, timestamp).
Lower bits — unix timestamp with seconds precision.
SELECT toHour(EventTime) * 60 + toMinute(EventTime) AS k,
count() AS c FROM hits_all
WHERE EventDate >= today() - 365 GROUP BY k ORDER BY k
INTO OUTFILE 'minutes.tsv' FORMAT TabSeparated
Elapsed: 800.580 sec. Processed 4.33 trillion rows
std::unordered_map
— no difference in performance (10.319 vs 10.279 sec)
google::dense_hash_map
— performance difference: 3.156 vs 81.86 sec — 26 times
ClickHouse HashMap
— performance difference: 2.527 vs 10.264 sec — 4 times
hash(x, y) = hash(x) ^ hash(y) — terrible, don't do this
hash(x, y) = hash(hash(x) ^ y) — you can do this, but often you can do better
Example:
hash(int x) = CityHash64(reinterpret_cast<const char *>(&x), sizeof(x))
— very bad.
— hash functions for strings generally don't inline;
— they contain many branches — variations on Duff's device for fast-path loops and tail handling;
— all of this is unnecessary when hashing fixed-length types.
HashMap in ClickHouse:
| — good hash function: | 2.439 sec. |
| — mistakenly using CityHash64 for UInt64: | 3.347 sec. |
| — using murmur finalizer: | 2.722 sec. |
| — trivial hash function: | 9.721 sec. |
std::unordered_map:
| — good hash function: | 10.097 sec. |
| — mistakenly using CityHash64 for UInt64: | 11.040 sec. |
(performance is masked by dependent cache misses)
Use different hash functions for directly or indirectly related data structures.
Example: for data sharding and for hash tables when processing this data.
☠ Leads to slowdown by multiple times or complete lockup!
Sometimes choosing multiple hash functions from the same family is not a good enough solution.
Example: hash(x, seed) = hash(x ^ seed) — often not good enough,
if some of the hash's operations commute with xor.
Example: libstdc++ from gcc 7 uses FNV-1a for strings. This hash function loops over individual bytes, so it can only be fast for short strings — but for short strings its quality is too low.
Example: Making 5 million inserts. PageCharset — short repeating strings (windows-1251, utf-8)
| — FNV1a: | 76,409,407 inserts/sec. |
| — CityHash64: | 94,725,900 inserts/sec. |
| — hash function from ClickHouse: | 112,791,109 inserts/sec. |
| — FNV1a: | 10,108,171 inserts/sec. |
| — CityHash64: | 11,337,682 inserts/sec. |
| — hash function from ClickHouse: | 13,637,320 inserts/sec. |
Sometimes it's not a mistake — when you need to protect against an algorithmic complexity attack.
In this case use SipHash. Don't use MD5, SHA-1, SHA-2, or SHA-3 for hash tables.
But this can also be achieved by other means (choosing a random hash function from a family)
* though this usually doesn't help.
Example: URL:
| CityHash64: | 3755.84 MB/sec. |
| SipHash: | 1237.51 MB/sec. |
| MD5: | 361.86 MB/sec. |
| CityHash64: | 822.56 MB/sec. |
| SipHash: | 183.35 MB/sec. |
| MD5: | 30.87 MB/sec. |
Example: SMHasher. https://github.com/aappleby/smhasher
Avalanche, Bit Independence
http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/
For different algorithms, different criteria of hash function quality make sense.
A hash function can be good enough for a linear probing open addressing hash table, yet not good enough for HyperLogLog.
inline UInt64 intHash64(UInt64 x)
{
x ^= x >> 33;
x *= 0xff51afd7ed558ccdULL;
x ^= x >> 33;
x *= 0xc4ceb9fe1a85ec53ULL;
x ^= x >> 33;
return x;
}
2.5 "rounds" consisting of xor-shift and multiplication. (city, farm, metro hash use roughly the same thing)
Latency of multiplication — 3 cycles; xor and shift — 1 cycle each. There is no instruction-level parallelism within a single hash computation, so the total latency is approximately 12 cycles.
Advantages:
— good quality;
— independent calculation of multiple hashes vectorizes (gcc and clang even do this themselves);
— instruction-parallel independent calculation of multiple hashes;
— clear meaning.
Disadvantages:
— when the hash table fits in L1..L2 cache, the overhead is still large
(can be reduced to 1..1.5 rounds if the quality is acceptable).
#if __SSE4_2__
#include <nmmintrin.h>
inline UInt64 intHashCRC32(UInt64 x)
{
return _mm_crc32_u64(-1ULL, x);
}
#endif
— one instruction (actually two);
— latency: 3 cycles;
— instruction-level parallelism: 3 operations simultaneously;
— throughput: up to 1 cycle per operation.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Disadvantages:
— only 32 bits;
— zero quality by the Avalanche and Bit Independence criteria, because crc commutes with xor:
crc(x ^ y) = crc(x) ^ crc(y)
meaning that when the n-th bit of x changes, whether the m-th bit of crc(x) changes depends only on n and m, not on the value of x: each bit of crc(x) is the xor of a fixed subset of bits of x.
Advantages: — works well in practice in hash tables
By the way, both functions are reversible.
But nobody cares.
Throughput, Latency, Avalanche of hash functions for integers: https://github.com/ClickHouse/ClickHouse/blob/master/dbms/src/Common/tests/int_hashes_perf.cpp
./int_hashes_perf 1000000000
Chaining (open hashing; with external explicit collision resolution lists).
Examples: std::unordered_map, boost::intrusive::hash_table.
Advantages:
— suitable for fat objects;
— suitable for non-movable objects;
— suitable for intrusive containers;
— node-based container: pointers to nodes are not invalidated (note: iterators are invalidated);
— no mutual influence between collision resolution chains — no positive feedback on chain growth;
— more tolerant of a bad hash function and a large load factor.
Disadvantages:
— low cache locality;
— load on the allocator;
— large memory overhead for small values.
Open addressing (closed hashing; with internal collision resolution chains).
Examples: ClickHouse HashMap, sparse_hash_map, dense_hash_map, HashMap in Rust.
Disadvantages:
— very sensitive to the choice of hash function; for example, works absolutely badly with a trivial hash function;
— variety of memory layout and collision resolution options — you need to choose for the specific task;
— strong performance degradation when storing large objects inplace.
Advantages: — when used correctly, work fast :)
Not necessarily lists.
Other external explicit structures for collision resolution:
— contiguous arrays;
— balanced trees (example: HashMap in Java).
Chaining with the first cell inplace:
— combines the disadvantages of chaining and open addressing;
— a combined variant can be used for multiset/multimap.
How to indicate that a cell is occupied?
1. Explicitly — an occupancy bit in each cell.
2. Implicitly — an empty cell is denoted by the zero key, and the element with the zero key is stored separately.
— simple array of cells;
— unusual array of cells — sparse array (example: sparse_hash_map);
— two-level array;
— array with address translation (indirect addressing);
— two arrays of cells — one for keys, another for mapped values;
— two or more arrays of cells for a cuckoo hash table;
— array of cells + bit masks.
What else can be stored in a cell:
— the hash value — for shortcut comparison of heavy keys, and to avoid recomputation during resizes;
— a version number — for instant hash table clearing;
— information on how far the element is from its ideal position (hash modulo table size);
— pointers forming a collision resolution chain, as in a chaining hash table;
— atomics and even mutexes for locking.
1. When everything fits in L1..L2 cache (sometimes L3 cache):
— hash function calculation;
— collision resolution chain traversal (branches, arithmetic, sometimes just many instructions);
— element equality comparison.
2. When everything doesn't fit in the LL cache (sometimes fits only in L3 cache):
— random memory reads.
Linear probing.
Advantages:
— excellent cache locality;
— very simple;
— fastest given equal chain lengths.
Disadvantages:
— the most finicky method;
— doesn't tolerate a large fill factor.
Average complexity of a lookup: O(n^2), where n is the average collision resolution chain length.
Quadratic probing.
Advantages:
— decent cache locality;
— only slightly more complex than linear probing;
— slightly less finicky.
Example: used in dense_hash_map.
Double hashing.
Disadvantages:
— poor cache locality;
— more complex to implement and use;
— less efficient given equal chain lengths.
Advantages:
— not finicky.
Bonus — useful not only for hash tables. For example, for a Bloom filter, two hash functions are enough to generate a whole family:
hash(x, k) = hash1(x) + k * hash2(x)
Robin Hood hashing.
Disadvantages:
— slightly slower than linear probing given equal chain lengths;
— need to compute more, or store the hash value in the cells.
Advantages:
— same cache locality as linear probing;
— average chain lengths substantially shorter than with linear probing;
— the hash table becomes almost completely ordered by hash value modulo table size, which can be used in other algorithms.
Example: HashMap in Rust.
Cuckoo hashing.
Advantages:
— O(1) lookup.
Disadvantages:
— two (on average one and a half) random memory accesses instead of one;
— complex inserts.
Hopscotch hashing.
Advantages:
— O(1) lookup.
Disadvantages:
— complex inserts.
I took the first implementation from the internet yesterday, copied it into the project — and it's slow.
https://github.com/Tessil/hopscotch-map/
http://codecapsule.com/2013/08/11/hopscotch-hashing/
Disadvantages: — completely unclear why needed.
— what max fill to use;
— by how much to resize;
— what hash table size to take;
— how to allocate and initialize memory;
— how to move elements.
Max fill 0.5:
— perfectly suitable even for the most finicky variants;
— fine if memory overhead is not a big concern;
— with max fill = 0.5 and resize by two, the maximum memory overhead is 4 times.
A larger max fill is unacceptable for linear probing — only for Robin Hood (not always) and double hashing.
Resize by two (or close to it):
— almost the only option if the size is a power of two.
While the hash table is small, can resize by 4 times:
— memory overhead doesn't matter yet;
— but the benefit from this is small.
Other options:
— by 1.5 times, or by the golden ratio? — complex, meaningless, expensive;
— use arrays of chunks with indirect addressing and add chunks — complex, expensive.
Size 2^n:
— cheap arithmetic;
— possible monstrous slowdown when inserting elements from one hash table into a smaller one, if max_fill > 0.5:
https://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion
A prime number close to 2^n.
Disadvantages:
— division with remainder is very slow;
— if the divisor is a compile-time constant, the compiler replaces it with multiplications and shifts;
— to make it compile-time, you have to use a switch/case over all variants;
— although the branch is well predicted, it still ends up slow.
False advantage:
— masks a low-quality hash function.
Why false: improving the distribution of elements this way is equivalent to choosing a higher-quality hash function, but the remainder of division is too expensive a way to improve quality. It makes sense to encapsulate distribution quality inside the hash function. Possibly, using a prime number for the size is absurd — and std::unordered_map uses exactly this absurdity.
If an empty cell is represented by zero bytes, then mmap or calloc can be used, and during a resize — mremap or realloc for an inplace resize. But:
— mmap is monstrously slow;
— calloc in almost all allocators is completely useless (works through malloc + memset);
— realloc in almost all allocators is completely useless (works through malloc + memcpy + free);
— also note that std::allocator doesn't even have an interface for realloc or calloc.
~2000 calls of mmap, page fault, munmap per second, regardless of the number of cores.
Why mmap is slow:
— system call;
— changing data structures in the kernel;
— TLB cache flush;
— page faults.
Use mmap, munmap, mremap only for large chunks of memory: 64 MiB.
(Suppose we can work with memory at 50 GB/sec; suppose we want mmap overhead to be no more than half; suppose mmap can be done only 1000 times per second.)
Actually all allocators already use mmap if the chunk of memory is large, but:
— they suffer from using mmap for insufficiently large chunks;
— they don't use mremap for realloc even in this case.
So it makes sense to use mmap manually.
1. Allocate new array, insert all elements there. Advantages: trivial.
2. Expand the array by two; the right half will be empty. Given that the array size is always a power of two, on average:
— slightly less than half of the elements stay in place;
— half of the elements move to a new place in the right half;
— some elements move slightly to the left, because preceding elements in their collision resolution chain moved to the right half.
Advantages:
— really more efficient;
— less temporary memory allocated;
— better cache locality.
Disadvantages:
— complex algorithm, easy to make mistakes.
3. Amortized resize — do the resize gradually, spreading the work over subsequent inserts.
Advantages:
— latency control.
Disadvantages:
— worse throughput;
— more complex implementation;
— up to two random memory accesses instead of one.
Resize when a chain becomes long — chain length has a heavy-tailed distribution, and with linear probing the maximum length becomes large too early.
Get rid of the remainder operation — pointless, because with a 2^n table size, taking the remainder is a bitwise AND with a mask: a very cheap operation that doesn't reduce throughput, thanks to instruction-level parallelism.
Move resize condition outside loop
/// This is a loop:
for (const auto & key : data)
++map[key]; /// And here resize can be called.
if (there is enough space)
for (batch of data)
insert into map without
possibility of resize;
— in this case well-predicted branch costs nothing
mmap, mremap for large sizes
if (old_size < MMAP_THRESHOLD && new_size < MMAP_THRESHOLD
&& alignment <= MALLOC_MIN_ALIGNMENT)
{
...
}
else if (old_size >= MMAP_THRESHOLD && new_size >= MMAP_THRESHOLD)
{
...
buf = mremap(buf, old_size, new_size, MREMAP_MAYMOVE);
if (MAP_FAILED == buf)
DB::throwFromErrno("Allocator: Cannot mremap memory chunk...
/// No need for zero-fill, because mmap guarantees it.
}
else
{
...
}
Inplace resizes
/** Now some items may need to be moved to a new location.
* The element can stay in place, or move to a new location "on the right",
* or move to the left of the collision resolution chain, because
* the elements to the left of it have been moved to the new "right" location.
*/
size_t i = 0;
for (; i < old_size; ++i)
if (!buf[i].isZero(*this) && !buf[i].isDeleted())
reinsert(buf[i], buf[i].getHash(*this));
/** There is also a special case:
* if the element was to be at the end of the old buffer, [ x]
* but is at the beginning because of the collision resolution chain, [o x]
* then after resizing, it will first be out of place again, [ xo ]
* and in order to transfer it where necessary,
* after transferring all the elements from the old halves you need to [ o x ]
* process tail from the collision resolution chain immediately after it [ o x ]
*/
for (; !buf[i].isZero(*this) && !buf[i].isDeleted(); ++i)
reinsert(buf[i], buf[i].getHash(*this));
Shortcut for consecutive identical key values
typename Method::Key prev_key;
for (size_t i = 0; i < rows; ++i)
{
/// Get the key to insert into the hash table.
typename Method::Key key = state.getKey(...);
/// Optimization for consecutive identical keys.
if (!Method::no_consecutive_keys_optimization)
{
if (i != 0 && key == prev_key)
{
/// Add values to the aggregate functions.
continue;
}
else
prev_key = key;
}
method.data.emplace(key, it, inserted);
/// Add values to the aggregate functions.
}
Storing hash value for heavy keys
template <typename Key, typename TMapped, typename Hash, ...>
struct HashMapCellWithSavedHash
: public HashMapCell<Key, TMapped, Hash, TState>
{
using Base = HashMapCell<Key, TMapped, Hash, TState>;
size_t saved_hash;
using Base::Base;
bool keyEquals(const Key & key_) const
{ return this->value.first == key_; }
bool keyEquals(const Key & key_, size_t hash_) const
{ return saved_hash == hash_ && this->value.first == key_; }
void setHash(size_t hash_value) { saved_hash = hash_value; }
size_t getHash(const Hash & hash) const { return saved_hash; }
};
Instantly clearable hash tables
template <typename Key, typename BaseCell>
struct ClearableHashTableCell : public BaseCell
{
using State = ClearableHashSetState;
using value_type = typename BaseCell::value_type;
UInt32 version;
bool isZero(const State & state) const
{ return version != state.version; }
void setZero() { version = 0; }
static constexpr bool need_zero_value_storage = false;
ClearableHashTableCell() {}
ClearableHashTableCell(const Key & key_, const State & state)
: BaseCell(key_, state), version(state.version) {}
};
Prefetch
template <size_t N>
void ALWAYS_INLINE hasN(const Key * x, UInt8 * res) const
{
size_t hash_value[N];
size_t place[N];
for (size_t i = 0; i < N; ++i)
{
hash_value[i] = hash(x[i]);
place[i] = grower.place(hash_value[i]);
_mm_prefetch(&buf[place[i]], _MM_HINT_T0);
}
for (size_t i = 0; i < N; ++i)
res[i] = Cell::isZero(x[i], *this)
? this->hasZero()
: !buf[findCell(x[i], hash_value[i], place[i])]
.isZero(*this);
}
Mark hash table methods as ALWAYS_INLINE, and external methods containing loops that work with the hash table as NO_INLINE.
#define ALWAYS_INLINE __attribute__((__always_inline__))
#define NO_INLINE __attribute__((__noinline__))
— predicting the hash table size before creating it;
— storing aggregate function states inplace in the hash table if they're small;
— storing aggregate function states inplace, and if they're large, splitting into two arrays — keys and values separately;
— storing short string keys inplace, splitting into multiple hash tables by power-of-two size classes;
— unrolled speculative probing;
— SIMD probing.
For strings:
— stack strings consecutively in an Arena, store a StringRef in the hash table (std::string_view, std::pair<const char *, size_t>).
For tuples:
— if the tuple is small, pack it into 64, 128, or 256 bits and use it as the key;
— if the tuple is large, serialize it into an Arena and work with it as with a string key.