Author: Alexey Milovidov, 2018-06-25.
SELECT * FROM hits_100m WHERE URL LIKE '%metrika%' ORDER BY EventTime LIMIT 10
0.772 sec.
11.0 GiB/sec.
125.98 million rows/sec.
— data compression library;
— high performance;
— special case of LZ77 algorithm;
— not specialized for specific data types;
— exists since 2011;
— used in ClickHouse since 2011;
— author — Yann Collet;
— analogues: Snappy, QuickLZ, LZO, FastLZ, BriefLZ, LZF;
It's almost «Pareto-optimal».
We compare algorithms by three parameters
on all possible datasets:
— compression ratio;
— compression speed;
— decompression speed;
An algorithm is called «Pareto-optimal» if there is no other algorithm that is better in at least one parameter and not worse in all others.
Lizard (formerly LZ5): https://github.com/inikep/lizard/
LZSSE: https://github.com/ConorStokes/LZSSE/
Typical decompression speed is several GB/sec on a single CPU core.
Decompression is done in multiple threads,
which gives a speed of 10 GB/sec and more.
This is more than the speed of HDD and SSD arrays,
so decompression shouldn't be a bottleneck?
But:
Compressed data is read from disk, which reduces IO.
Data is often in the page cache
— then the CPU will always be the bottleneck, not IO.
So should we compress data at all? — yes.
If we don't, we'll hit IO or data volume limits.
Always compress data:
— when storing, even if you have an array of fast NVMe SSDs;
— when transmitting data over the network, even with 10 GBit without oversubscription;
A compressed file is a sequence of commands of two kinds;
executing them reproduces the original data:
Literals: copy the next N bytes of the compressed data as is.
Match: copy N bytes that were recently decompressed,
starting at offset M back from the current position.
Hello world Hello
literals 12 "Hello world " match 5 12
* — up to small details, this describes any algorithm of the LZ77 family;
— iterate through the data;
— hash the next 4 bytes;
— put the offset into a hash table;
— if these 4 bytes are already in the table, we have found a match;
— find the maximum match length;
— write the command "repeat data of such length at such offset" to the compressed file (see the sketch below);
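A minimal sketch of this loop — not the real LZ4 compressor: the hash function, the table size and the emitLiterals/emitMatch helpers are made up here, and the format constraints described later are ignored.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

/// Stand-ins for writing commands into the compressed output.
static void emitLiterals(const uint8_t *, size_t n) { std::printf("literals %zu\n", n); }
static void emitMatch(size_t length, size_t offset) { std::printf("match %zu %zu\n", length, offset); }

static uint32_t hash4(const uint8_t * p)
{
    uint32_t v;
    std::memcpy(&v, p, 4);
    return (v * 2654435761u) >> 20;    /// multiplicative hash into 12 bits
}

void compressSketch(const uint8_t * data, size_t size)
{
    std::vector<uint32_t> table(1 << 12, 0);    /// hash -> position of an earlier occurrence
    size_t pos = 0;
    size_t literal_start = 0;

    while (pos + 4 <= size)
    {
        uint32_t h = hash4(data + pos);
        size_t candidate = table[h];
        table[h] = static_cast<uint32_t>(pos);

        /// The hash may collide, so check that the 4 bytes really repeat.
        if (candidate < pos && std::memcmp(data + candidate, data + pos, 4) == 0)
        {
            size_t len = 4;    /// extend the match as far as it goes
            while (pos + len < size && data[candidate + len] == data[pos + len])
                ++len;

            emitLiterals(data + literal_start, pos - literal_start);
            emitMatch(len, pos - candidate);

            pos += len;
            literal_start = pos;
        }
        else
            ++pos;
    }

    emitLiterals(data + literal_start, size - literal_start);    /// trailing literals
}

On the "Hello world Hello" example above this prints literals 12 and then match 5 12 (plus an empty trailing literal run).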
while (...)
{
    read(input_pos, literal_length, match_length);

    /// Copy the literals as is.
    copy(output_pos, input_pos, literal_length);
    output_pos += literal_length;

    read(input_pos, match_offset);

    /// Copy the match from the already decompressed data.
    copy(output_pos, output_pos - match_offset, match_length);
    output_pos += match_length;
}
* — in this usage scenario;
1. Because memcpy is usually in the libc library, and the libc library is usually dynamically linked, and the call to memcpy will be indirect, through PLT.
2. Because a call to memcpy whose size is not known at compile time is not inlined.
3. Because the memcpy function makes a lot of effort to correctly handle «tails» of memory chunks that are not multiples of the machine word or register size.
We can always copy in chunks that are multiples of 8 (or 16, 32...) bytes.
This optimization is already present in the original LZ4 implementation.
Hello world ...........
^^^^^ - src
^^^^^ - dst
Hello world Hello wo...
^^^^^^^^ - src
^^^^^^^^ - dst
(but we need to check for buffer overflow)
void copy8(UInt8 * dst, const UInt8 * src)
{
    /// memcpy is not actually called here (the compiler uses a builtin).
    memcpy(dst, src, 8);
}

void wildCopy8(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
{
    do
    {
        copy8(dst, src);
        dst += 8;
        src += 8;
    } while (dst < dst_end);
}
The last 5 bytes are always literals.
The last match must start at least 12 bytes before the end of the block.
Consequently, a block of fewer than 13 bytes cannot be compressed.
— why isn't the loop unrolled?
— why don't we use SSE or AVX registers?
void copy16(UInt8 * dst, const UInt8 * src)
{
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
}

void wildCopy16(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
{
    do
    {
        copy16(dst, src);
        dst += 16;
        src += 16;
    } while (dst < dst_end);
}
— it's not obvious which is better;
— in the original LZ4 implementation, the choice of 8 bytes
is tied to conditions in the compressed data format
that prevent memory overrun;
Copy 5 bytes at offset 12:
Hello world ...........
^^^^^ - src
^^^^^ - dst
Hello world Hello wo...
^^^^^ - src
^^^^^ - dst
Copy 10 bytes at offset 3:
abc.............
^^^^^^^^^^ - src
^^^^^^^^^^ - dst
abcabcabcabca...
^^^^^^^^^^ - src
^^^^^^^^^^ - dst
op[0] = match[0];
op[1] = match[1];
op[2] = match[2];
op[3] = match[3];
...
abca............
^ - src
^ - dst
abcab...........
^ - src
^ - dst
abcabc..........
^ - src
^ - dst
abcabca.........
^ - src
^ - dst
abcabcab........
^ - src
^ - dst
Copy the first 4 bytes byte by byte:
abcabca.........
^^^^ - src
^^^^ - dst
Now we can copy 4 bytes at once:
abcabcabcab.....
^^^^ - src
^^^^ - dst
Now we can copy 8 bytes at once:
abcabcabcabcabcabca.....
^^^^^^^^ - src
^^^^^^^^ - dst
const unsigned dec32table[] = {0, 1, 2, 1, 4, 4, 4, 4};
const int dec64table[] = {0, 0, 0, -1, 0, 1, 2, 3};
const int dec64 = dec64table[offset];
op[0] = match[0];
op[1] = match[1];
op[2] = match[2];
op[3] = match[3];
match += dec32table[offset];
memcpy(op+4, match, 4);
match -= dec64;
void copyOverlap8(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    /// 4 % n.
    /// Or if 4 % n is zero, we use n.
    /// It gives an equivalent result but is more CPU friendly.
    static constexpr int shift1[] = { 0, 1, 2, 1, 4, 4, 4, 4 };

    /// 8 % n - 4 % n
    static constexpr int shift2[] = { 0, 0, 0, 1, 0, -1, -2, -3 };

    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];

    match += shift1[offset];
    memcpy(op + 4, match, 4);
    match += shift2[offset];
}
void copyOverlap16(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    /// 4 % n.
    static constexpr int shift1[]
        = { 0, 1, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 };

    /// 8 % n - 4 % n
    static constexpr int shift2[]
        = { 0, 0, 0, 1, 0, -1, -2, -3, -4, 4, 4, 4, 4, 4, 4, 4 };

    /// 16 % n - 8 % n
    static constexpr int shift3[]
        = { 0, 0, 0, -1, 0, -2, 2, 1, 8, -1, -2, -3, -4, -5, -6, -7 };

    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];

    match += shift1[offset];
    memcpy(op + 4, match, 4);
    match += shift2[offset];
    memcpy(op + 8, match, 8);
    match += shift3[offset];
}
Maybe there's a magic SIMD instruction
that will do everything we need at once?
pshufb — packed shuffle bytes:
— the name comes from the word «shuffle»;
— part of SSSE3 (three S's).
Takes two 16-byte registers:
— data to shuffle;
— «selector» — which byte to take for the i-th result byte.
xmm0: abc.............
xmm1: 0120120120120120
pshufb %xmm1, %xmm0
xmm0: abcabcabcabcabca
Each result byte is filled with a byte we selected from the source data - exactly what we need!
void copyOverlap16Shuffle(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    static constexpr UInt8 __attribute__((__aligned__(16))) masks[] =
    {
        /* offset = 0: not used as a mask, but as the shift amount instead */
        0, 1, 2, 1, 4, 1, 4, 2, 8, 7, 6, 5, 4, 3, 2, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* offset = 1 */
        0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
        0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0,
        0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
        0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0,
        0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3,
        0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
        0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 1,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0,
    };

    _mm_storeu_si128(reinterpret_cast<__m128i *>(op),
        _mm_shuffle_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(match)),
            _mm_load_si128(reinterpret_cast<const __m128i *>(masks) + offset)));

    match += masks[offset];
}
Can also be used for 8 bytes:
— using MMX registers (bad);
— using 16-byte registers, just shuffling away the unnecessary bytes in the second half (see the sketch below).
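A sketch of the second option — an illustration, not the actual ClickHouse code. The shuffle is done in a 16-byte register and only the lower 8 bytes of the result are stored; the shift table holds 8 % n (or n when that is zero). Like the 16-byte version, it assumes the buffers have some slack past the end, since 16 bytes are loaded from match.

#include <cstddef>
#include <cstdint>
#include <tmmintrin.h>    /// SSSE3: _mm_shuffle_epi8

using UInt8 = uint8_t;

/// Copy 8 bytes from an overlapping match (offset < 8) with one pshufb.
void copyOverlap8Shuffle(UInt8 * op, const UInt8 *& match, const size_t offset)
{
    /// 8 % n, or n if 8 % n is zero: how far `match` advances after the copy.
    static constexpr UInt8 shift[] = { 0, 1, 2, 2, 4, 3, 2, 1 };

    /// Byte i of row n is i % n; rows 1..7 coincide with the 16-byte masks.
    static constexpr UInt8 __attribute__((__aligned__(16))) masks[] =
    {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* offset = 0 is never used */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* offset = 1 */
        0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
        0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0,
        0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
        0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0,
        0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3,
        0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
    };

    /// Shuffle 16 bytes, but store only the lower 8 bytes of the result.
    _mm_storel_epi64(reinterpret_cast<__m128i *>(op),
        _mm_shuffle_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(match)),
            _mm_load_si128(reinterpret_cast<const __m128i *>(masks) + offset)));

    match += shift[offset];
}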
Analogous instructions:
— in AVX2: vpshufb works on 32 bytes, but only shuffles within 16-byte lanes (incomplete);
— in AVX-512 VBMI: vpermb, for 64 bytes at once;
— in ARM NEON: vtbl (incomplete).
What about reading and writing a few bytes past the end?
— we simply ask the user to allocate more memory for the buffers;
— but there may be performance degradation due to the allocator.
1. Copying 16 bytes instead of 8 bytes.
2. Using the shuffle instruction for the offset < 16 case.
3. Removing one unnecessary if.
Example 1:
Xeon E5-2650 v2, Yandex.Browser data, AppVersion column.
reference: 1.67 GB/sec.
16 bytes, shuffle: 2.94 GB/sec (76% faster!).
Example 2:
Xeon E5-2650 v2, Yandex.Direct data, ShowsSumPosition column.
reference: 2.30 GB/sec.
16 bytes, shuffle: 1.91 GB/sec (20% slower).
— operate with 8 or 16-byte chunks;
— use or don't use shuffle instruction.
template <size_t copy_amount, bool use_shuffle>
void NO_INLINE decompressImpl(
    const char * const source,
    char * const dest,
    size_t dest_size);
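A hedged sketch of how the four variants could then be dispatched at run time, building on the declaration above; the mapping of variant numbers to template parameters is an assumption here, not necessarily the one used in ClickHouse.

void decompress(const char * source, char * dest, size_t dest_size, size_t variant)
{
    switch (variant)
    {
        case 0: decompressImpl< 8, false>(source, dest, dest_size); break;
        case 1: decompressImpl< 8, true >(source, dest, dest_size); break;
        case 2: decompressImpl<16, false>(source, dest, dest_size); break;
        case 3: decompressImpl<16, true >(source, dest, dest_size); break;
    }
}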
echo 'performance' |
    sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
kill -STOP $(pidof firefox) $(pidof chromium)
sudo kill -STOP \
    $(pidof python) \
    $(pidof perl) \
    $(pgrep -u skynet) \
    $(pidof cqudp-client)
Otherwise the results will be unstable.
— choose the one that's better on average,
on representative workload and current CPU models;
— manually write conditions
depending on compression ratio
and CPU model;
— test all variants on all available CPUs
or read reference manuals for instruction latency and throughput.
— randomly choose different variants;
— collect statistics for each;
— consider the time of each variant as a random variable;
— estimate the time distribution for each variant;
— «mentally draw» a random value from the estimated distribution of each variant;
— choose the variant whose drawn value turns out to be the best.
— Bayesian method;
— parametrically, something like Gamma distribution;
— parametrically, normal distribution;
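A minimal sketch of the last option: each variant keeps a running mean and variance of its time per byte, and before each block we draw one sample from every variant's normal approximation and pick the smallest. This only illustrates the idea (Thompson sampling); it is not the actual ClickHouse implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

/// Statistics of one decompression variant; timings are modelled by a normal
/// approximation built from their running mean and variance.
struct VariantStats
{
    double count = 0;
    double sum = 0;
    double sum_sq = 0;

    void add(double seconds_per_byte)
    {
        count += 1;
        sum += seconds_per_byte;
        sum_sq += seconds_per_byte * seconds_per_byte;
    }

    double mean() const { return sum / count; }
    double variance() const { return std::max(0.0, sum_sq / count - mean() * mean()); }

    /// «Mentally draw» a random value from the estimated distribution of the mean.
    double sample(std::mt19937_64 & rng) const
    {
        if (count < 2)
            return -1e9;    /// force variants without statistics to be tried first
        std::normal_distribution<double> dist(mean(), std::sqrt(variance() / count) + 1e-12);
        return dist(rng);
    }
};

/// Pick the variant whose sampled (simulated) time per byte is the smallest.
size_t chooseVariant(const std::vector<VariantStats> & variants, std::mt19937_64 & rng)
{
    size_t best = 0;
    double best_sample = variants[0].sample(rng);
    for (size_t i = 1; i < variants.size(); ++i)
    {
        double s = variants[i].sample(rng);
        if (s < best_sample)
        {
            best_sample = s;
            best = i;
        }
    }
    return best;
}

After a block is decompressed with the chosen variant, its measured time per byte is fed back with add(); the sampling then automatically balances trying under-explored variants against exploiting the fastest one.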
Intel(R) Xeon(R) CPU
— E5-2650 v2 @ 2.60GHz
— E5-2660 v4 @ 2.00GHz
— E5-2660 0 @ 2.20GHz
— E5-2683 v4 @ 2.10GHz
— E5-2667 v2 @ 3.30GHz
— E312xx (Sandy Bridge)
— E5645 @ 2.40GHz
— E5530 @ 2.40GHz
— E5440 @ 2.83GHz
AMD EPYC 7351 16-Core Processor
AMD Opteron(TM) Processor 6274
AMD Opteron(tm) Processor 6380
Cavium ThunderX2
— we have 13 servers;
— each runs tests on 256 files;
— in 6 variants (reference, 0, 1, 2, 3, adaptive);
— and each test is run 10 times, alternating variants.
This gives 199,680 results that can be compared.
— on modern Intel, speedup is 13..17% on average;
— on AMD EPYC, speedup is 19.7% on average;
— on old Intel E5440 — −1%;
— on Cavium — 3..4% (preliminary results);
— on all CPUs except the E5440, the multi-armed bandit is
on average better than any single pre-selected variant.
Pull request: https://github.com/ClickHouse/ClickHouse/pull/1890
(already merged)
(ref, adapt, max — average decompression speed in GB/sec for the reference LZ4, the adaptive choice, and the best fixed variant; best — number of the best fixed variant; adapt_boost = adapt / ref; max_boost = max / ref; adapt_over_max = adapt / max.)
┌─cpu───────────────────┬──ref─┬─adapt─┬──max─┬─best─┬─adapt_boost─┬─max_boost─┬─adapt_over_max─┐
│ E5-2667 v2 @ 3.30GHz │ 2.81 │ 3.19 │ 3.15 │ 3 │ 1.14 │ 1.12 │ 1.01 │
│ E5-2650 v2 @ 2.60GHz │ 2.5 │ 2.84 │ 2.81 │ 3 │ 1.14 │ 1.12 │ 1.01 │
│ E5-2683 v4 @ 2.10GHz │ 2.26 │ 2.63 │ 2.59 │ 3 │ 1.16 │ 1.15 │ 1.02 │
│ E5-2660 v4 @ 2.00GHz │ 2.15 │ 2.49 │ 2.46 │ 3 │ 1.16 │ 1.14 │ 1.01 │
│ AMD EPYC 7351 │ 2.03 │ 2.44 │ 2.35 │ 3 │ 1.2 │ 1.16 │ 1.04 │
│ E5-2660 0 @ 2.20GHz │ 2.13 │ 2.39 │ 2.37 │ 3 │ 1.12 │ 1.11 │ 1.01 │
│ E312xx (Sandy Bridge) │ 1.97 │ 2.2 │ 2.18 │ 3 │ 1.12 │ 1.11 │ 1.01 │
│ E5530 @ 2.40GHz │ 1.65 │ 1.93 │ 1.94 │ 3 │ 1.17 │ 1.18 │ 0.99 │
│ E5645 @ 2.40GHz │ 1.65 │ 1.92 │ 1.94 │ 3 │ 1.16 │ 1.18 │ 0.99 │
│ AMD Opteron 6380 │ 1.47 │ 1.58 │ 1.56 │ 1 │ 1.07 │ 1.06 │ 1.01 │
│ AMD Opteron 6274 │ 1.15 │ 1.35 │ 1.35 │ 1 │ 1.17 │ 1.17 │ 1 │
│ E5440 @ 2.83GHz │ 1.35 │ 1.33 │ 1.42 │ 1 │ 0.99 │ 1.05 │ 0.94 │
│ Cavium ThunderX2 │ 0.84 │ 0.87 │ 0.87 │ 0 │ 1.04 │ 1.04 │ 1 │
└───────────────────────┴──────┴───────┴──────┴──────┴─────────────┴───────────┴────────────────┘