How to Speed Up LZ4 Decompression

Author: Alexey Milovidov, 2018-06-25.

SELECT * FROM hits_100m
WHERE URL LIKE '%metrika%'
ORDER BY EventTime LIMIT 10

0.772 sec.
11.0 GiB/sec.
125.98 million rows/sec.

What is LZ4

— data compression library;

— high performance;

— special case of LZ77 algorithm;

— not specialized for specific data types;

— exists since 2011;

— used in ClickHouse since 2011;

— author — Yann Collet;

— analogues: Snappy, QuickLZ, LZO, FastLZ, BriefLZ, LZF;

Why LZ4 Specifically

It's almost «Pareto-optimal».

We compare algorithms by three parameters
on all possible datasets:
— compression ratio;
— compression speed;
— decompression speed;

An algorithm is called «Pareto-optimal» if there is no other algorithm that is better in at least one parameter and not worse in all others.

There Are Slightly «Cooler» Options:

Lizard (formerly LZ5): https://github.com/inikep/lizard/

LZSSE: https://github.com/ConorStokes/LZSSE/

density: https://github.com/centaurean/density

LZTURBO: https://sites.google.com/site/powturbo/

Why Is Decompression a Bottleneck?

Typical decompression speed is several GB/sec on a single CPU core.

Decompression is done in multiple threads,
which gives a speed of 10 GB/sec and more.

This is more than the speed of HDD and SSD arrays,
so decompression shouldn't be a bottleneck?

But:

Compressed data is read from disk, which reduces IO.

Data is often in the page cache
— then the CPU will always be the bottleneck, not IO.

Should We Compress Data at All?

— yes.

If we don't — we'll hit IO or data volume limits.

Always compress data:
— when storing, even if you have an array of fast NVMe SSDs;
— when transmitting data over the network, even with 10 GBit without oversubscription;

How LZ4 Works*

A compressed file is a sequence of commands of two types;
executing these commands produces the decompressed data:

Literals: copy the next N bytes of compressed data as is.

Match: copy N bytes at offset M,
which were recently decompressed.

Hello world Hello
literals 12 "Hello world "
match 5 12

* — up to minor details, this applies to any algorithm of the LZ77 family.

How to Compress Data

— iterate through the data;

— hash the next 4 bytes;

— put the offset in a hash table;

— if an entry for these 4 bytes is already there — we have found a match;

— extend the match to the maximum length;

— write the command "repeat data of such length at such offset" to the compressed file (see the sketch after this list);
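A minimal, hypothetical sketch of this loop, not the actual LZ4 code: it only finds and counts matches, while a real compressor would also extend each match and emit the literal/match commands.

#include <cstdint>
#include <cstring>
#include <vector>

/// Hypothetical illustration of the match-finding loop described above.
size_t countMatches(const uint8_t * data, size_t size)
{
    static constexpr size_t HASH_TABLE_SIZE = 1 << 16;
    std::vector<uint32_t> hash_table(HASH_TABLE_SIZE, 0);    /// position of the previous occurrence

    size_t matches = 0;
    for (size_t pos = 1; pos + 4 <= size; ++pos)
    {
        uint32_t four_bytes;
        memcpy(&four_bytes, data + pos, 4);                   /// the next 4 bytes
        uint32_t hash = (four_bytes * 2654435761u) >> 16;     /// hash them into a table index

        uint32_t candidate = hash_table[hash];
        hash_table[hash] = static_cast<uint32_t>(pos);        /// put the offset into the hash table

        /// If the same 4 bytes were seen before, a match is found;
        /// a real compressor would extend it to the maximum length
        /// and emit a "repeat <length> bytes at <offset>" command.
        if (candidate != 0 && 0 == memcmp(data + candidate, data + pos, 4))
            ++matches;
    }
    return matches;
}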

How to Decompress Data

while (...)
{
    read(input_pos, literal_length, match_length);

    copy(output_pos, input_pos, literal_length);
    output_pos += literal_length;

    read(input_pos, match_offset);

    copy(output_pos, output_pos - match_offset, match_length);
    output_pos += match_length;
}

How to Copy a Memory Block?

memcpy

— memcpy is slow*;

— memcpy is not always
correct to use;

* — in this usage scenario;

Why Is memcpy Non-Optimal?

1. Because memcpy is usually in the libc library, and the libc library is usually dynamically linked, and the call to memcpy will be indirect, through PLT.

2. Because a call to memcpy whose size is not known at compile time is not inlined (see the sketch after this list).

3. Because the memcpy function makes a lot of effort to correctly handle «tails» of memory chunks that are not multiples of the machine word or register size.
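A minimal illustration of points 1 and 2, assuming GCC or Clang with optimizations and a dynamically linked libc (the function names are hypothetical):

#include <cstring>
#include <cstddef>

/// The size is a compile-time constant: the compiler typically inlines this
/// into a single 8-byte load and store, with no function call at all.
void copyFixed(char * dst, const char * src)
{
    memcpy(dst, src, 8);
}

/// The size is known only at run time: this typically compiles into
/// a call to memcpy@plt, i.e. an indirect call through the PLT.
void copyVariable(char * dst, const char * src, size_t n)
{
    memcpy(dst, src, n);
}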

We Can Write Extra Bytes

We can always copy in chunks that are multiples of 8 (or 16, 32...) bytes.

This optimization is already present in the original LZ4 implementation.

Hello world ...........
^^^^^ - src
            ^^^^^ - dst

Hello world Hello wo...
^^^^^^^^ - src
            ^^^^^^^^ - dst

(but we need to check for buffer overflow)

We Can Write Extra Bytes

void copy8(UInt8 * dst, const UInt8 * src)
{
    /// Here memcpy is not called (builtin).
    memcpy(dst, src, 8);
}

void wildCopy8(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
{
    do
    {
        copy8(dst, src);
        dst += 8;
        src += 8;
    } while (dst < dst_end);
}

Need to Check for Buffer Overflow

— The last 5 bytes are always literals.

— The last match must start at least 12 bytes before the end of the block.

— Consequently, a block of fewer than 13 bytes cannot be compressed.


https://github.com/lz4/lz4/wiki/lz4_Block_format.md
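These guarantees are what make wild copying safe. A minimal sketch of how they can be used (my illustration, not the actual LZ4 code): copy in 8-byte chunks while far enough from the end of the output buffer, and fall back to an exact copy near the end.

#include <cstring>
#include <cstdint>

using UInt8 = uint8_t;

void wildCopy8(UInt8 * dst, const UInt8 * src, UInt8 * dst_end);   /// defined above

/// Hypothetical helper: copy `length` literal bytes to the output.
void copyLiteralsSafe(UInt8 * op, const UInt8 * ip, size_t length, UInt8 * output_end)
{
    if (op + length <= output_end - 12)
        wildCopy8(op, ip, op + length);   /// may write up to 7 extra bytes, still inside the buffer
    else
        memcpy(op, ip, length);           /// exact copy for the last bytes of the block
}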

Why Copy Exactly 8 Bytes?

— why isn't the loop unrolled?

— why don't we use SSE or AVX registers?

void copy16(UInt8 * dst, const UInt8 * src)
{
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
}

void wildCopy16(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
{
    do
    {
        copy16(dst, src);
        dst += 16;
        src += 16;
    } while (dst < dst_end);
}

Why Copy Exactly 8 Bytes?

— it's not obvious which is better;

— in the original LZ4 implementation
this is related to conditions in the compressed data format,
preventing memory overrun;

Copying Match, Simple Case

Copy 5 bytes at offset 12:

Hello world ...........
^^^^^ - src
            ^^^^^ - dst

Hello world Hello wo...
^^^^^ - src
            ^^^^^ - dst

Copying Match, Complex Case

Copy 10 bytes at offset 3:

abc.............
^^^^^^^^^^ - src
   ^^^^^^^^^^ - dst

abcabcabcabca...
^^^^^^^^^^ - src
   ^^^^^^^^^^ - dst

Need to Copy
As If We Were Doing It Byte by Byte

op[0] = match[0];
op[1] = match[1];
op[2] = match[2];
op[3] = match[3];
...

abca............
^ - src
   ^ - dst

abcab...........
 ^ - src
    ^ - dst

abcabc..........
  ^ - src
     ^ - dst

abcabca.........
   ^ - src
      ^ - dst

abcabcab........
    ^ - src
       ^ - dst

Copy the first 4 bytes byte by byte:

abcabca.........
^^^^ - src
   ^^^^ - dst

Now we can copy 4 bytes at once:

abcabcabcab.....
 ^^^^ - src
       ^^^^ - dst

Now we can copy 8 bytes at once:

abcabcabcabcabcabca.....
  ^^^^^^^^ - src
           ^^^^^^^^ - dst

Surprisingly Confusing Code

const unsigned dec32table[] = {0, 1, 2, 1, 4, 4, 4, 4};
const int dec64table[] = {0, 0, 0, -1, 0, 1, 2, 3};

const int dec64 = dec64table[offset];

op[0] = match[0];
op[1] = match[1];
op[2] = match[2];
op[3] = match[3];
match += dec32table[offset];
memcpy(op + 4, match, 4);
match -= dec64;
The same logic can be written more explicitly, for 8-byte copies:

void copyOverlap8(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    /// 4 % n.
    /// Or if 4 % n is zero, we use n.
    /// It gives equivalent result, but is better CPU friendly.
    static constexpr int shift1[] = { 0, 1, 2, 1, 4, 4, 4, 4 };

    /// 8 % n - 4 % n
    static constexpr int shift2[] = { 0, 0, 0, 1, 0, -1, -2, -3 };

    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];

    match += shift1[offset];
    memcpy(op + 4, match, 4);
    match += shift2[offset];
}
And for 16-byte copies:

void copyOverlap16(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    /// 4 % n.
    static constexpr int shift1[]
        = { 0, 1, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 };

    /// 8 % n - 4 % n
    static constexpr int shift2[]
        = { 0, 0, 0, 1, 0, -1, -2, -3, -4, 4, 4, 4, 4, 4, 4, 4 };

    /// 16 % n - 8 % n
    static constexpr int shift3[]
        = { 0, 0, 0, -1, 0, -2, 2, 1, 8, -1, -2, -3, -4, -5, -6, -7 };

    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];

    match += shift1[offset];
    memcpy(op + 4, match, 4);
    match += shift2[offset];
    memcpy(op + 8, match, 8);
    match += shift3[offset];
}

Can We Implement «Clever» Copying More Optimally?

Maybe there's a magic SIMD instruction
that will do everything we need at once?

The pshufb Instruction

— from the word shuffle;

— part of SSSE3 (three S's).

Takes two 16-byte registers:

— data to shuffle;

— «selector» — which byte to take for the i-th result byte.

Example

xmm0: abc.............
xmm1: 0120120120120120

pshufb %xmm1, %xmm0

xmm0: abcabcabcabcabca

Each result byte is filled with a byte we selected from the source data - exactly what we need!
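A minimal, self-contained illustration of this idea (my sketch, not the library code): replicate a 3-byte pattern across a 16-byte register with a single _mm_shuffle_epi8. Compile with SSSE3 enabled (e.g. -mssse3).

#include <immintrin.h>
#include <cstdio>

int main()
{
    /// Source data: the pattern "abc" in the first 3 bytes of a 16-byte register.
    alignas(16) char data[16] = { 'a', 'b', 'c' };

    /// Selector: for the i-th result byte, take the (i % 3)-th source byte.
    alignas(16) char mask[16] = { 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0 };

    __m128i result = _mm_shuffle_epi8(
        _mm_load_si128(reinterpret_cast<const __m128i *>(data)),
        _mm_load_si128(reinterpret_cast<const __m128i *>(mask)));

    alignas(16) char out[17] = {};
    _mm_store_si128(reinterpret_cast<__m128i *>(out), result);
    printf("%.16s\n", out);   /// prints: abcabcabcabcabca
}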

And the full version, with a mask for every offset:

void copyOverlap16Shuffle(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    static constexpr UInt8 __attribute__((__aligned__(16))) masks[] =
    {
        0,  1,  2,  1,  4,  1,  4,  2,  8,  7,  6,  5,  4,  3,  2,  1, /* offset = 0, not used as mask, but for shift amount instead */
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, /* offset = 1 */
        0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,
        0,  1,  2,  0,  1,  2,  0,  1,  2,  0,  1,  2,  0,  1,  2,  0,
        0,  1,  2,  3,  0,  1,  2,  3,  0,  1,  2,  3,  0,  1,  2,  3,
        0,  1,  2,  3,  4,  0,  1,  2,  3,  4,  0,  1,  2,  3,  4,  0,
        0,  1,  2,  3,  4,  5,  0,  1,  2,  3,  4,  5,  0,  1,  2,  3,
        0,  1,  2,  3,  4,  5,  6,  0,  1,  2,  3,  4,  5,  6,  0,  1,
        0,  1,  2,  3,  4,  5,  6,  7,  0,  1,  2,  3,  4,  5,  6,  7,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  0,  1,  2,  3,  4,  5,  6,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,  0,  1,  2,  3,  4,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  0,  1,  2,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,  0,  1,
        0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  0,
    };

    _mm_storeu_si128(reinterpret_cast<__m128i *>(op),
        _mm_shuffle_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(match)),
            _mm_load_si128(reinterpret_cast<const __m128i *>(masks) + offset)));

    match += masks[offset];
}

The pshufb Instruction

Can also be used for 8 bytes:

— using MMX registers (bad);

— using 16-byte registers, just shuffling unnecessary bytes in the second half.

Analogous instructions:

— in AVX2: vpshufb on 32-byte registers, but it shuffles the two 16-byte halves separately (incomplete);

— in AVX-512 VBMI: vpermb — for 64 bytes at once;

— in ARM Neon: vtbl (incomplete).

How to Avoid Checking
Array Bounds?

— simply ask the user
to allocate more memory for buffers.

— but there may be performance degradation due to the allocator.
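For example, a hypothetical sketch (not the actual ClickHouse API): the caller over-allocates the destination buffer by a small constant, so that wild copies past the logical end of the data always stay inside allocated memory.

#include <vector>
#include <cstddef>

/// Hypothetical padding: enough slack for a 16-byte wild copy that starts
/// at the very last byte of the decompressed data.
constexpr size_t DECOMPRESS_PADDING = 15;

std::vector<char> allocateDestination(size_t uncompressed_size)
{
    /// The decompressor may scribble up to DECOMPRESS_PADDING bytes past
    /// uncompressed_size; only the first uncompressed_size bytes are meaningful.
    return std::vector<char>(uncompressed_size + DECOMPRESS_PADDING);
}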

Three Optimizations

1. Copying 16 bytes instead of 8 bytes.

2. Using shuffle instruction for offset < 16 case.

3. Removed one unnecessary if.

Does This Give Speedup?

Example 1:

Xeon E5-2650 v2, Yandex.Browser data, AppVersion column.

reference: 1.67 GB/sec.
16 bytes, shuffle: 2.94 GB/sec (76% faster!).

Example 2:

Xeon E5-2650 v2, Yandex.Direct data, ShowsSumPosition column.

reference: 2.30 GB/sec.
16 bytes, shuffle: 1.91 GB/sec (20% slower).

Let's Make 4 Code Variants

— operate with 8 or 16-byte chunks;

— use or don't use shuffle instruction.

template <size_t copy_amount, bool use_shuffle>
void NO_INLINE decompressImpl(
    const char * const source,
    char * const dest,
    size_t dest_size)
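Presumably the four variants are the four instantiations of this template (my reading of the parameters):

decompressImpl<8,  false>(source, dest, dest_size);   /// 8-byte copies, no shuffle
decompressImpl<8,  true >(source, dest, dest_size);   /// 8-byte copies, shuffle
decompressImpl<16, false>(source, dest, dest_size);   /// 16-byte copies, no shuffle
decompressImpl<16, true >(source, dest, dest_size);   /// 16-byte copies, shuffle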
Before measuring, fix the CPU frequency governor and stop background processes:

echo 'performance' | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

kill -STOP $(pidof firefox) $(pidof chromium)

sudo kill -STOP \
    $(pidof python) \
    $(pidof perl) \
    $(pgrep -u skynet) \
    $(pidof cqudp-client)

Otherwise the results will be unstable.

How to Choose the Best Variant?

— choose the one that's better on average,
on representative workload and current CPU models;

— manually write conditions
depending on compression ratio
and CPU model;

— test all variants on all available CPUs
or read reference manuals for instruction latency and throughput.

Multi-Armed Bandits

Source: https://learnforeverlearn.com/bandits/

— randomly choose different variants;

— collect statistics for each;

— consider the time of each variant as a random variable;

— estimate the time distribution for each variant;

Thompson Sampling

— «simulate in our mind» a random variable for each variant;

— choose the one for which the simulated value turned out better.

How to Estimate Distribution

— Bayesian method;

— parametrically, something like Gamma distribution;

— parametrically, normal distribution;
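A minimal sketch of this approach using the normal-distribution option (my illustration; the actual ClickHouse implementation differs in details): keep running statistics of the decompression time per byte for each variant, draw one value per variant from the estimated distribution of its mean, and pick the variant whose draw is smallest.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

/// Sketch of Thompson sampling over decompression variants.
struct VariantStats
{
    double sum = 0;      /// sum of observed times per byte
    double sum_sq = 0;   /// sum of squares
    double count = 0;    /// number of observations

    void observe(double time_per_byte)
    {
        sum += time_per_byte;
        sum_sq += time_per_byte * time_per_byte;
        count += 1;
    }

    /// "Simulate in our mind" one possible value of the mean time of this variant.
    double sample(std::mt19937_64 & rng) const
    {
        if (count < 2)
            return -1e9;   /// force unexplored variants to be tried first

        double mean = sum / count;
        double variance = (sum_sq - sum * sum / count) / (count - 1);
        double sigma_of_mean = std::sqrt(std::max(variance, 1e-12) / count);
        return std::normal_distribution<double>(mean, sigma_of_mean)(rng);
    }
};

/// Choose the variant whose simulated time turned out to be the smallest.
size_t chooseVariant(const std::vector<VariantStats> & variants, std::mt19937_64 & rng)
{
    size_t best = 0;
    double best_sample = variants[0].sample(rng);
    for (size_t i = 1; i < variants.size(); ++i)
    {
        double s = variants[i].sample(rng);
        if (s < best_sample)
        {
            best_sample = s;
            best = i;
        }
    }
    return best;
}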

Testing on Different CPUs

Intel(R) Xeon(R) CPU
— E5-2650 v2 @ 2.60GHz
— E5-2660 v4 @ 2.00GHz
— E5-2660 0 @ 2.20GHz
— E5-2683 v4 @ 2.10GHz
— E5-2667 v2 @ 3.30GHz
— E312xx (Sandy Bridge)
— E5645 @ 2.40GHz
— E5530 @ 2.40GHz
— E5440 @ 2.83GHz

AMD EPYC 7351 16-Core Processor
AMD Opteron(TM) Processor 6274
AMD Opteron(tm) Processor 6380

Cavium ThunderX2

Testing on Different CPUs and Datasets

— we have 13 servers;

— each runs tests on 256 files;

— in 6 variants (reference, 0, 1, 2, 3, adaptive);

— and each test is run 10 times, alternating variants.

This gives 13 × 256 × 6 × 10 = 199,680 results that can be compared.

Testing on Different CPUs and Datasets

— on modern Intel, speedup is 13..17% on average;

— on AMD EPYC, speedup is 19.7% on average;

— on old Intel E5440 — −1%;

— on Cavium — 3..4% (preliminary results);

— on all except E5440, multi-armed bandits on average
are always better than any pre-selected variant.


Pull request: https://github.com/ClickHouse/ClickHouse/pull/1890
(already merged)

Detailed results (decompression speed in GB/sec): ref is the reference LZ4, adapt is the adaptive (bandit) variant, max is the best of the fixed variants, best is the index of that variant, adapt_boost = adapt / ref, max_boost = max / ref, adapt_over_max = adapt / max.

┌─cpu───────────────────┬──ref─┬─adapt─┬──max─┬─best─┬─adapt_boost─┬─max_boost─┬─adapt_over_max─┐
│ E5-2667 v2 @ 3.30GHz  │ 2.81 │  3.19 │ 3.15 │    3 │        1.14 │      1.12 │           1.01 │
│ E5-2650 v2 @ 2.60GHz  │  2.5 │  2.84 │ 2.81 │    3 │        1.14 │      1.12 │           1.01 │
│ E5-2683 v4 @ 2.10GHz  │ 2.26 │  2.63 │ 2.59 │    3 │        1.16 │      1.15 │           1.02 │
│ E5-2660 v4 @ 2.00GHz  │ 2.15 │  2.49 │ 2.46 │    3 │        1.16 │      1.14 │           1.01 │
│ AMD EPYC 7351         │ 2.03 │  2.44 │ 2.35 │    3 │         1.2 │      1.16 │           1.04 │
│ E5-2660 0 @ 2.20GHz   │ 2.13 │  2.39 │ 2.37 │    3 │        1.12 │      1.11 │           1.01 │
│ E312xx (Sandy Bridge) │ 1.97 │   2.2 │ 2.18 │    3 │        1.12 │      1.11 │           1.01 │
│ E5530 @ 2.40GHz       │ 1.65 │  1.93 │ 1.94 │    3 │        1.17 │      1.18 │           0.99 │
│ E5645 @ 2.40GHz       │ 1.65 │  1.92 │ 1.94 │    3 │        1.16 │      1.18 │           0.99 │
│ AMD Opteron 6380      │ 1.47 │  1.58 │ 1.56 │    1 │        1.07 │      1.06 │           1.01 │
│ AMD Opteron 6274      │ 1.15 │  1.35 │ 1.35 │    1 │        1.17 │      1.17 │              1 │
│ E5440 @ 2.83GHz       │ 1.35 │  1.33 │ 1.42 │    1 │        0.99 │      1.05 │           0.94 │
│ Cavium ThunderX2      │ 0.84 │  0.87 │ 0.87 │    0 │        1.04 │      1.04 │              1 │
└───────────────────────┴──────┴───────┴──────┴──────┴─────────────┴───────────┴────────────────┘