Zero-Cost Abstractions Using Hash Tables in ClickHouse as an Example

Author: Maksim Kita, 2021-05-11.

Hash Tables

1. Introduction to hash tables.

2. Main design questions.

3. Benchmarks.

4. C++ hash table design.

Hash Tables in ClickHouse

GROUP BY

JOIN

SELECT DISTINCT

Hash Table

Main methods

Hash Table Components

1. Hash function.

2. Collision resolution method.

3. Resize.

4. Cell memory layout.

Choosing Hash Function

1. Don't use the identity function for integer types.

2. Don't use string hash functions (such as CityHash) for integer types.

3. Don't use cryptographic hash functions unless you're under attack. For example, SipHash computes ~980 MB/s, while CityHash reaches ~9 GB/s.

4. Don't use outdated hash functions, such as FNV1a.

https://github.com/rurban/smhasher

Choosing Hash Function

By default, ClickHouse uses reasonably good hash functions:

1. CRC32-C for integer types: one processor instruction (actually two), with a latency of ~3 cycles.

2. A specialized hash function for strings. CityHash, xxHash, or wyhash can be used as standard alternatives.

Collision Resolution

1. Chaining.

2. Open Addressing.

3. Schemes that are good in theory (Cuckoo hashing, Hopscotch hashing, 2-choice hashing) are usually either hard to implement or slow due to additional memory fetches.

Chaining

Example: std::unordered_map

1. Pointer stability for key, value.

2. Ability to store large objects, non-movable objects.

3. Works well with bad hash function, high load factor.

4. Very slow. Stresses the allocator (even a plain function call is expensive on the hot path).

Open Addressing

Linear probing. Example: ClickHouse HashMap.

Quadratic probing. Example: Google DenseHashMap.

1. Good cache locality.

2. Need to carefully choose hash function.

3. Can't store large objects; serialize them into an arena and store pointers to them instead.

Resize

1. By powers of two: fast modulo via bit masking.

size_t place = hash & (size - 1);

2. To a prime number close to a power of two: division is slow even with a constant switch or libdivide; fastrange is another option.

Choosing Load Factor

0.5 is a good choice for linear probing with step 1.

ClickHouse HashMap and Google DenseHashMap use 0.5.

Abseil HashMap uses 0.875.

Memory Layout

1. Ask the client to choose special key values marking empty and deleted cells.

2. Handle the empty value separately and don't store it in the hash table.

3. Compressed storage of metadata and data.

Benchmarks

How NOT to Do Benchmarks

Test hash tables on random integer values.

Test hash tables without accounting for the maximum load factor and memory consumption.

Run benchmarks without publishing the benchmark code.

How to Do Benchmarks

On real scenarios and real data. In ClickHouse, the real scenario is data aggregation.

Yandex.Metrica dataset.

wget https://datasets.clickhouse.com/hits/partitions/hits_100m_obfuscated_v1.tar.xz

Benchmarks

WatchID: almost all values are unique. Hash table size: 20,714,865 elements, ~600 MB; doesn't fit in last-level caches.

ClickHouse HashMap: 7.366 sec.
Google DenseMap: 10.089 sec.
Abseil HashMap: 9.011 sec.
std::unordered_map: 44.758 sec.

Deinitialization of std::unordered_map alone took longer than the entire benchmark of the other tables.

Benchmarks

perf stat -e cache-misses:u ./integer_hash_tables_and_hashes
ClickHouse HashMap: 329,664,616
Google DenseMap: 383,350,820
Abseil HashMap: 415,869,669
std::unordered_map: 1,939,811,017

Benchmarks

Latency Comparison Numbers
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns 14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us

http://norvig.com/21-days.html#answers

Benchmarks

RegionID: frequently repeating values. Hash table size: 9,040 elements; fits in last-level caches.

ClickHouse HashMap: 0.201 sec.
Google DenseMap: 0.261 sec.
Abseil HashMap: 0.307 sec.
std::unordered_map: 0.466 sec.

C++ Hash Table Design

1. Hash

2. Allocator

3. Cell

4. Grower (resize interface)

5. HashTable

C++ Hash Table Design

Hash

Same as std::hash.

template <typename T>
struct Hash
{
    size_t operator()(T key) const
    {
        return DefaultHash<T>(key);
    }
};

C++ Hash Table Design

Allocator

Uses mmap, mremap for large chunks of memory.

class Allocator
{
    void * alloc(size_t size, size_t alignment);
    void free(void * buf, size_t size);
    void * realloc(void * buf, size_t old_size, size_t new_size);
};

There is an allocator that initially allocates memory on stack:

AllocatorWithStackMemory<HashTableAllocator, initial_bytes>

C++ Hash Table Design

Cell

template <typename Key, typename Mapped, typename HashTableState>
struct HashTableCell
{
    ...
    void setHash(size_t hash_value);
    size_t getHash(const Hash & hash) const;
    bool isZero(const State & state);
    void setZero();
    ...
};

C++ Hash Table Design

Grower

struct HashTableGrower
{
    size_t place(size_t x) const;
    size_t next(size_t pos) const;
    bool willNextElementOverflow() const;
    void increaseSize();
};

C++ Hash Table Design

HashTable

template <typename Key, typename Cell, typename Hash, typename Grower, typename Allocator>
class HashTable :
    protected Hash,
    protected Allocator,
    protected Cell::State,
    protected ZeroValueStorage<Cell::need_zero_value_storage, Cell>

C++ Hash Table Design

ZeroValueStorage

template <bool need_zero_value_storage, typename Cell>
struct ZeroValueStorage;

template <typename Cell>
struct ZeroValueStorage<true, Cell> { ... };

template <typename Cell>
struct ZeroValueStorage<false, Cell> { ... };

C++ Hash Table Design

Ability to pass custom Grower.

1. Passing a Grower with a fixed size, no resize, and no collision resolution chains gives a lookup table.

2. Passing a Grower with a collision resolution step other than 1.

C++ Hash Table Design

Ability to store state in cell.

Save hash. Used for string hash tables.

struct HashMapCellWithSavedHash : public HashMapCell
{
    size_t saved_hash;

    void setHash(size_t hash_value) { saved_hash = hash_value; }
    size_t getHash(const Hash &) const { return saved_hash; }
};

C++ Hash Table Design

Fast-clearable hash table.

struct FixedClearableHashMapCell
{
    struct ClearableHashSetState
    {
        UInt32 version = 1;
    };

    using State = ClearableHashSetState;

    UInt32 version = 1;

    bool isZero(const State & st) const { return version != st.version; }
    void setZero() { version = 0; }
};

C++ Hash Table Design

LRUCache.

struct LRUHashMapCell
{
    static bool constexpr need_to_notify_cell_during_move = true;

    static void move(LRUHashMapCell * old_loc, LRUHashMapCell * new_loc);

    LRUHashMapCell * next = nullptr;
    LRUHashMapCell * prev = nullptr;
};

C++ Hash Table Design

LRUCache. Use boost::intrusive::list.

using LRUList = boost::intrusive::list<
    Cell,
    boost::intrusive::value_traits<LRUHashMapCellIntrusiveValueTraits>,
    boost::intrusive::constant_time_size<false>>;

LRUList lru_list;

https://www.boost.org/doc/libs/1_76_0/doc/html/intrusive.html

Specialized Hash Tables

SmallTable

Consists of an array holding a small number of elements; fits in the L1 cache.

template <typename Key, typename Cell, size_t capacity>
class SmallTable : protected Cell::State
{
    size_t m_size = 0;
    Cell buf[capacity];
    ...
};

Specialized Hash Tables

StringHashTable

Consists of 4 hash tables:

1. For strings of size 0-8 bytes.

2. For strings of size 9-16 bytes.

3. For strings of size 17-24 bytes.

4. For strings larger than 24 bytes.

https://www.researchgate.net/publication/339879042_SAHA_A_String_Adaptive_Hash_Table_for_Analytical_Databases

Specialized Hash Tables

TwoLevelHashTable

Consists of 256 hash tables. On insert, we compute from the hash which sub-table the key goes into:

size_t getBucketFromHash(size_t hash_value)
{
    return (hash_value >> (32 - BITS_FOR_BUCKET)) & MAX_BUCKET;
}

Conclusion

We wrote a framework for hash tables for our data aggregation scenario.

https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/HashTable/HashTable.h

https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/examples/integer_hash_tables_benchmark.cpp

Thank you!