Alexey Milovidov
ClickHouse Developer
High-level optimization:
— choosing the right algorithm;
— using appropriate data structures;
— choosing interfaces that match the task;
Low-level optimization:
— eliminate unnecessary copying and allocations;
— make sure the right instructions are generated;
— make sure data is well-arranged in memory;
Profile it:
 4.47%  libstdc++.so.6.0.28  [.] __dynamic_cast
 1.42%  libstdc++.so.6.0.28  [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast
 1.37%  libstdc++.so.6.0.28  [.] std::operator>><char, std::char_traits<char>, std::allocator<char> >
 1.18%  libstdc++.so.6.0.28  [.] __cxxabiv1::__si_class_type_info::__do_dyncast
 0.83%  libstdc++.so.6.0.28  [.] std::basic_ios<char, std::char_traits<char> >::_M_cache_locale
 0.75%  libstdc++.so.6.0.28  [.] std::locale::id::_M_id
 0.73%  libstdc++.so.6.0.28  [.] std::use_facet<std::ctype<char> >
 0.64%  libstdc++.so.6.0.28  [.] std::ios_base::_M_init
 0.64%  libc-2.31.so         [.] __strcmp_avx2
 0.63%  libstdc++.so.6.0.28  [.] std::ostream::_M_insert<unsigned long>
Removing std::stringstream from an inner loop
— is this high-level or low-level optimization?
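A minimal illustration of what this optimization looks like (not the actual ClickHouse code): the slow version constructs a std::stringstream on every iteration, the fast one formats with std::to_chars into a stack buffer.

#include <charconv>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

/// Slow: a stringstream per element drags in locale lookups, dynamic_cast
/// and allocations — the kind of symbols that show up in the profile above.
std::string format_slow(const std::vector<uint64_t> & values)
{
    std::string result;
    for (uint64_t value : values)
    {
        std::stringstream ss;
        ss << value << '\n';
        result += ss.str();
    }
    return result;
}

/// Fast: format directly into a small stack buffer.
std::string format_fast(const std::vector<uint64_t> & values)
{
    std::string result;
    for (uint64_t value : values)
    {
        char buf[32];
        char * end = std::to_chars(buf, buf + sizeof(buf), value).ptr;
        result.append(buf, end);
        result += '\n';
    }
    return result;
}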
Profile it:
12.96% clickhouse [.] memcpy - copying
10.37% [kernel] [k] copy_user_generic_string - copying
5.73% clickhouse [.] DB::deserializeBinarySSE2<1> - deserialization
4.50% clickhouse [.] DB::deserializeBinarySSE2<4> - deserialization
4.47% clickhouse [.] LZ4::(anonymous namespace)::decompressImpl<32ul, false>
4.34% clickhouse [.] LZ4::(anonymous namespace)::decompressImpl<8ul, true>
4.10% clickhouse [.] LZ4::(anonymous namespace)::decompressImpl<16ul, false>
3.96% clickhouse [.] LZ4::(anonymous namespace)::decompressImpl<16ul, true>
memcpy — data copying.
How to remove memcpy from the profile?
1. Copy less data.
Not always possible.
Often data needs to be moved to speed everything up.
2. Optimize memcpy implementation.
But hasn't everyone already optimized it?
«It's the world's most popular function — one all programmers love»
— Justine Alexandra Roberts Tunney.
1. The compiler can replace memcpy with a built-in implementation.
And it does if the size is small and known at compile time.
memcpy(dst, src, 8):
    movq (%rsi), %rax
    movq %rax, (%rdi)
This is controlled by the flags -fbuiltin-memcpy and -fno-builtin-memcpy.
The built-in implementation can be requested explicitly:
__builtin_memcpy(dst, src, size).
For sizes unknown at compile time, it is replaced by a call to memcpy.
1. The compiler can replace memcpy with a built-in implementation.
And it does if the size is small and known at compile time.
memcpy(dst, src, 16):
    movups (%rsi), %xmm0
    movups %xmm0, (%rdi)

memcpy(dst, src, 15):
    movq (%rsi), %rax
    movq 7(%rsi), %rcx
    movq %rcx, 7(%rdi)
    movq %rax, (%rdi)
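This is routinely exploited in practice: a fixed-size __builtin_memcpy is the portable way to do an unaligned load, and with a constant size it compiles to a single move instruction, not a call. A small sketch:

#include <cstdint>

/// Reads 8 bytes from an arbitrarily aligned address.
/// Because the size is a compile-time constant, the compiler emits
/// a single movq instead of a call to memcpy.
inline uint64_t unaligned_load64(const void * address)
{
    uint64_t result;
    __builtin_memcpy(&result, address, sizeof(result));
    return result;
}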
2. The compiler can replace your code with a call to memcpy.
And it does, if you write something that looks like one.
void f(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
}
f(char*, char const*, unsigned long):
        testq %rdx, %rdx
        je .LBB2_2
        pushq %rax
        callq memcpy@PLT
        addq $8, %rsp
.LBB2_2:
        retq
This is controlled by the flag -fbuiltin-memcpy (clang)
or -ftree-loop-distribute-patterns (gcc).
2. The compiler can replace your code with a call to memcpy.
And it does, if you write something that looks like one.
void memcpy(
    char * __restrict dst,
    const char * __restrict src,
    size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
}
Segmentation fault (stack overflow).
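A sketch of how an own memcpy is usually protected from this. The flags in the comments are the standard gcc/clang ones; the clang-only no_builtin attribute is an assumption for newer compilers.

#include <cstddef>

/// Build this translation unit with -fno-builtin-memcpy (clang) or
/// -fno-tree-loop-distribute-patterns / -ffreestanding (gcc), so the
/// loop is not recognized and turned back into a call to memcpy.
#if defined(__clang__)
__attribute__((no_builtin("memcpy")))
#endif
void * my_memcpy(void * __restrict dst_, const void * __restrict src_, size_t size)
{
    char * dst = static_cast<char *>(dst_);
    const char * src = static_cast<const char *>(src_);
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
    return dst_;
}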
3. GLibc contains many memcpy implementations,
from which one suitable for the specific instruction set is chosen.
__memcpy_erms
__memcpy_sse2_unaligned
__memcpy_ssse3
__memcpy_ssse3_back
__memcpy_avx_unaligned
__memcpy_avx_unaligned_erms
__memcpy_avx512_unaligned
__memcpy_avx512_unaligned_erms
__memcpy_avx512_no_vzeroupper
This happens via the GNU IFUNC (indirect functions) mechanism:
the dynamic loader resolves the symbol on the first use
of the function from the shared library.
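A rough sketch of the same mechanism in user code (assumes gcc or clang on Linux/ELF; memcpy_avx_stub here is only a placeholder, not a real AVX implementation):

#include <cstddef>

extern "C" void * memcpy_generic(void * dst, const void * src, size_t size)
{
    char * d = static_cast<char *>(dst);
    const char * s = static_cast<const char *>(src);
    for (size_t i = 0; i < size; ++i)
        d[i] = s[i];
    return dst;
}

extern "C" void * memcpy_avx_stub(void * dst, const void * src, size_t size)
{
    return memcpy_generic(dst, src, size);   /// placeholder for an AVX version
}

using memcpy_fn = void * (*)(void *, const void *, size_t);

/// The resolver runs once, when the dynamic loader binds the symbol,
/// and returns the implementation that will be used from then on.
extern "C" memcpy_fn resolve_memcpy()
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? memcpy_avx_stub : memcpy_generic;
}

extern "C" void * dispatched_memcpy(void * dst, const void * src, size_t size)
    __attribute__((ifunc("resolve_memcpy")));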
4. The memcpy function is built into the processor
and can be executed in a single instruction!
rep movsb %ds:(%rsi), %es:(%rdi)
This instruction has been present in all x86 CPUs from the very beginning.
If memcpy is implemented in hardware,
why not use that implementation?
Every self-respecting developer
has tried to optimize memcpy at least once in their life.
Do you have reproducible performance tests
under representative load?
Can you optimize code at a high level,
removing unnecessary copying?
Maybe you need a function with a slightly different interface?
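For example (an illustrative sketch, not an actual ClickHouse function): if the caller can guarantee a little padding after its buffers, the copy never needs a byte-granular tail.

#include <cstddef>
#include <cstdint>

/// Contract: it is safe to read and write up to 15 bytes past the end
/// of src and dst (the buffers are padded). Then any size can be copied
/// in whole 16-byte chunks, without a scalar tail loop.
inline void copy_allow_overflow_15(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t offset = 0; offset < size; offset += 16)
    {
        uint8_t chunk[16];
        __builtin_memcpy(chunk, src + offset, 16);
        __builtin_memcpy(dst + offset, chunk, 16);
    }
}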
In Assembly language:
memcpy.s
+ you precisely control instruction and register selection,
their order and even alignment.
− the function doesn't inline;
− link-time optimization doesn't work.
Ways to write your own memcpy:
— in Assembly language;
— in C;
— in C++;
— in C or C++ with inline assembly.
1. Copying "tails" — small unaligned
chunks of data at the beginning and end.
2. Unrolled loop using wide instructions.
For small sizes, unknown at compile time.
Example: copying std::string, size ~50 bytes.
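A sketch of the usual trick for such tails: two possibly overlapping fixed-size moves cover a whole range of sizes without a byte loop (here, 8 to 16 bytes).

#include <cstddef>
#include <cstdint>

/// Copies 8..16 bytes with two overlapping 8-byte moves.
/// The same idea extends to 16..32 bytes with xmm and 32..64 with ymm registers.
inline void copy_8_to_16(char * __restrict dst, const char * __restrict src, size_t size)
{
    uint64_t head;
    uint64_t tail;
    __builtin_memcpy(&head, src, 8);
    __builtin_memcpy(&tail, src + size - 8, 8);
    __builtin_memcpy(dst, &head, 8);
    __builtin_memcpy(dst + size - 8, &tail, 8);
}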
Cost of a function call:
— if the function is in a shared library,
the call is indirect, through the PLT;
— if the function is in a separate translation unit,
it can't be inlined and has to be actually called
(for asm, LTO doesn't help);
— registers have to be saved at the call site.
For large sizes:
— Which instructions are used for copying.
— How the loop is unrolled.
— In what order to traverse the data.
— Whether aligned instructions are used.
— Non-temporal stores.
— Flushing upper halves of registers.
Strongly depends on:
— data distribution and load;
— processor model;
Example: rep movsb.
One instruction. Implemented using microcode.
For most old CPUs — slow in any case.
For many CPUs — fast, but with a large startup overhead.
Newer Intel processors have the ERMS flag (Enhanced REP MOVSB),
and there it works fast in all cases... but not on AMD.
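A sketch of what a rep movsb based memcpy looks like with gcc/clang inline assembly (the instruction copies rcx bytes from [rsi] to [rdi]):

#include <cstddef>

inline void * memcpy_rep_movsb(void * dst, const void * src, size_t size)
{
    void * ret = dst;
    __asm__ __volatile__(
        "rep movsb"
        : "+D"(dst), "+S"(src), "+c"(size)   /// rdi, rsi, rcx are updated by the instruction
        :
        : "memory");
    return ret;
}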
Example: copying with AVX instructions.
For most CPUs works faster than with SSE instructions.
But if the rest of the program's code uses SSE,
then after AVX all SSE instructions slow down until the upper register halves are cleared.
To avoid that penalty, you need to execute the vzeroupper instruction.
But the vzeroupper instruction is itself expensive.
But on the newest CPUs switching is not expensive
and vzeroupper doesn't need to be called.
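A simplified sketch of such a copy loop with AVX intrinsics (it assumes size is a multiple of 32; a real implementation handles the unaligned head and tail separately; compile with -mavx):

#include <immintrin.h>
#include <cstddef>

void copy_avx(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst + i), chunk);
    }

    /// Clear the upper halves of the ymm registers, so that SSE code
    /// executed afterwards does not pay the transition penalty.
    _mm256_zeroupper();
}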
Example: copying with AVX-512 instructions.
On not-so-new processors it is slower than AVX
and causes the clock frequency to drop
... if used simultaneously on many cores.
But on new processors it no longer causes this.
But there's no AVX-512 on AMD.
Example: copying with non-temporal stores.
_mm_stream* functions, movnt*, vmovnt* instructions.
It writes data to memory, bypassing the cache.
It speeds up memcpy for large sizes by about 1.5×.
Completely useless in production.
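Even so, for reference, a sketch of what such a copy looks like (assumes dst is 32-byte aligned and size is a multiple of 32, since streaming stores require alignment; compile with -mavx):

#include <immintrin.h>
#include <cstddef>

void copy_streaming(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + i));
        /// The store bypasses the cache and does not evict useful data from it.
        _mm256_stream_si256(reinterpret_cast<__m256i *>(dst + i), chunk);
    }

    _mm_sfence();          /// make the streaming stores globally visible
    _mm256_zeroupper();
}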
Better not try to optimize memcpy!
On your machine and your benchmark everything will speed up.
But in production for users — it will slow down.
... but I'll try.
Use different buffer sizes.
Use different sizes for memcpy calls.
Use sizes from random distributions,
so the branch predictor doesn't have it too easy.
Use different numbers of threads.
Copy between buffers in different directions.
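A sketch of what one such measurement could look like (the distribution parameters here are purely illustrative; memcpy-bench uses its own set of distributions):

#include <algorithm>
#include <cstring>
#include <random>

/// Copies through the whole buffer in chunks whose sizes are drawn from
/// a random distribution, so the branch predictor sees no fixed pattern.
void run_iteration(char * dst, const char * src, size_t buf_size, std::mt19937 & rng)
{
    std::lognormal_distribution<double> size_distribution(5.0, 1.5);
    size_t offset = 0;
    while (offset < buf_size)
    {
        size_t size = std::min(buf_size - offset, static_cast<size_t>(size_distribution(rng)) + 1);
        memcpy(dst + offset, src + offset, size);
        offset += size;
    }
}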
— 9 implementations from glibc.
— Cosmopolitan libc.
— rep movsb.
— Simple loop implementation:
loop peeling, loop unrolling and vectorization — done by compiler.
— "Chinese memcpy", 2 variants.
— My implementation, which from a distance looks similar to the others.
Run on different machines:
— Intel Cascade Lake;
— Intel Ivy Bridge;
— AMD EPYC;
— AMD Ryzen.
https://github.com/ClickHouse/ClickHouse/tree/master/utils/memcpy-bench
for size in 4096 16384 50000 65536 100000 1000000 10000000 100000000; do
    for threads in 1 2 4 $(($(nproc) / 2)) $(nproc); do
        for distribution in 1 2 3 4 5; do
            for variant in {1..13} {21..29}; do
                for i in {1..10}; do
                    ./memcpy-bench --tsv --size $size --variant $variant \
                        --threads $threads --distribution $distribution;
                done;
            done;
        done;
    done;
done | tee result.tsv
42,000 measurements from one machine.
Result: there's no single best memcpy.
There are good ones:
— __memcpy_avx_unaligned from glibc,
but it can't be inlined into our code because of the license,
and we'd need to bring our own CPU dispatch.
— MemCpy from Cosmopolitan libc is fine,
but it loses on small sizes.
I still hope to make our own, best memcpy :)
#define VEC_SIZE 16
#define VZEROUPPER 0
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER

#define VEC_SIZE 32
#define VZEROUPPER 1
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER

#define VEC_SIZE 64
#define VZEROUPPER 1
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER
#define UNROLL_COUNT 1
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 2
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 3
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

...

#define UNROLL_COUNT 15
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 16
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT
#if VEC_SIZE == 16
#define NAME_PART sse
#define VEC_REGISTER "xmm"
#define VEC_MOV_UNALIGNED "movdqu"
#define VEC_MOV_ALIGNED "movdqa"
#define VEC_SIZE_MINUS_1 "0x0f"
#define VEC_SIZEx1 "0x10"
...
#define VEC_SIZEx16 "0x100"
#elif VEC_SIZE == 32
#define NAME_PART avx
#define VEC_REGISTER "ymm"
#define VEC_MOV_UNALIGNED "vmovdqu"
#define VEC_MOV_ALIGNED "vmovdqa"
#define VEC_SIZE_MINUS_1 "0x1f"
#define VEC_SIZEx1 "0x20"
...
#define VEC_SIZEx16 "0x200"
#elif VEC_SIZE == 64
...
void * NAME_FORWARD_UNROLLED(void * __restrict destination, const void * __restrict source, size_t size)
{
void * __restrict ret = destination;
__asm__ __volatile__ (
"mov %[dst], %[ret] \n"
VEC_MOV_UNALIGNED " (%[src]), %%" VEC_REGISTER "15 \n"
VEC_MOV_UNALIGNED " %%" VEC_REGISTER "15, (%[dst]) \n"
"lea -" VEC_SIZEx1 "(%[dst],%[size],1), %%rcx \n"
"mov %[dst], %%r8 \n"
"and $" VEC_SIZE_MINUS_1 ", %%r8 \n"
"sub $" VEC_SIZEx1 ", %%r8 \n"
"sub %%r8, %[src] \n"
"sub %%r8, %[dst] \n"
"add %%r8, %[size] \n"
"1: \n"
VEC_MOV_UNALIGNED " (%[src]), %%" VEC_REGISTER "0 \n"
#if UNROLL_COUNT >= 2
VEC_MOV_UNALIGNED " " VEC_SIZEx1 "(%[src]), %%" VEC_REGISTER "1 \n"
#endif
...
"add $" VEC_SIZExUNROLL ", %[src] \n"
"sub $" VEC_SIZExUNROLL ", %[size] \n"
VEC_MOV_ALIGNED " %%" VEC_REGISTER "0, (%[dst]) \n"
#if UNROLL_COUNT >= 2
VEC_MOV_ALIGNED " %%" VEC_REGISTER "1, " VEC_SIZEx1 "(%[dst]) \n"
#endif
...
"add $" VEC_SIZExUNROLL ", %[dst] \n"
"cmp $" VEC_SIZExUNROLL ", %[size] \n"
"ja 1b \n"
VEC_MOV_UNALIGNED " -" VEC_SIZEx1 "(%[src],%[size],1), %%" VEC_REGISTER "0 \n"
#if UNROLL_COUNT >= 2
VEC_MOV_UNALIGNED " -" VEC_SIZEx2 "(%[src],%[size],1), %%" VEC_REGISTER "1 \n"
#endif
...
VEC_MOV_UNALIGNED " %%" VEC_REGISTER "0, (%%rcx) \n"
#if UNROLL_COUNT >= 2
VEC_MOV_UNALIGNED " %%" VEC_REGISTER "1, -" VEC_SIZEx1 "(%%rcx) \n"
#endif
...
VZEROUPPER_INSTRUCTION
: [dst]"+r"(destination), [src]"+r"(source), [size]"+r"(size), [ret]"=rax"(ret)
:
: "rcx", "r8", "r11",
VEC_REGISTER "0", VEC_REGISTER "1", VEC_REGISTER "2", VEC_REGISTER "3",
VEC_REGISTER "4", VEC_REGISTER "5", VEC_REGISTER "6", VEC_REGISTER "7",
VEC_REGISTER "8", VEC_REGISTER "9", VEC_REGISTER "10", VEC_REGISTER "11",
VEC_REGISTER "12", VEC_REGISTER "13", VEC_REGISTER "14", VEC_REGISTER "15",
"memory");
return ret;
}
Idea: make it so that for large sizes,
memcpy itself chooses the best variant based on statistics!
Let's set a threshold of 30,000 bytes. If the size is smaller — the regular implementation.
If it's larger — run a random variant and collect statistics.
memcpy in L1 cache runs at 50..100 GB/s,
so copying 30 KB takes at least 300 ns.
We'd like to spend no more than a few percent of that on statistics collection
— at most about 10 nanoseconds.
Idea: make it so that for large sizes,
memcpy itself chooses the best variant based on statistics!
Statistics collection and variant selection
— maximum 10 nanoseconds.
https://github.com/ClickHouse/ClickHouse/blob/38d665/utils/memcpy-bench/memcpy-bench.cpp#L497
ALWAYS_INLINE void * call(void * __restrict dst, const void * __restrict src, size_t size)
{
    size_t current_count = count++;
    if (likely(current_count % probability_distribution_buckets < exploration_probability_threshold))
    {
        /// Exploitation mode.
        return selected_variant.load(std::memory_order_relaxed)(dst, src, size);
    }
    else
    {
        /// Exploration mode.
        return explore(dst, src, size);
    }
}
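A rough sketch of the state this method relies on; the names mirror the snippet above, but the constants are only illustrative and the actual layout in memcpy-bench.cpp is different.

#include <atomic>
#include <cstddef>

using memcpy_type = void * (*)(void *, const void *, size_t);

struct SelfTuningMemcpy
{
    /// With these illustrative values, 15 of every 16 calls take the
    /// exploitation branch; the rest go to exploration.
    static constexpr size_t probability_distribution_buckets = 256;
    static constexpr size_t exploration_probability_threshold = 240;

    std::atomic<size_t> count{0};
    std::atomic<size_t> exploration_count{0};
    std::atomic<memcpy_type> selected_variant{nullptr};

    /// Per-variant statistics (function pointer, call count, bytes, time)
    /// live in an array indexed by a hash in explore().
};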
void * explore(void * __restrict dst, const void * __restrict src, size_t size)
{
    size_t current_exploration_count = exploration_count++;

    size_t hash = current_exploration_count;
    hash *= 0xff51afd7ed558ccdULL;
    hash ^= hash >> 33;

    VariantWithStatistics & variant = variants[hash % num_cells];

    UInt32 time1 = rdtsc();
    void * ret = variant.func.load(std::memory_order_relaxed)(dst, src, size);
    UInt32 time2 = rdtsc();
    UInt32 time = time2 - time1;

    /// Ignore measurements slower than one byte per cycle (outliers).
    if (time < size)
    {
        ++variant.count;
        variant.bytes += size;
        variant.time += time;
    }

    /// (The code that updates selected_variant from the collected statistics is not shown.)
    return ret;
}
On-the-fly learning works successfully in the benchmark.
It chooses the best variant and runs at the speed of the best.
Then I put the self-tuning memcpy in ClickHouse
and started testing on real queries...
On-the-fly learning works successfully in the benchmark.
It chooses the best variant and runs at the speed of the best.
Then I put the self-tuning memcpy in ClickHouse
and started testing on real queries...
And nothing sped up!
Self-tuning memcpy can't consistently
choose the best variant on a real workload.
On a real workload with many threads, memcpy hits the memory bandwidth limit,
so there is no noticeable difference between the tested variants.
The self-tuning memcpy can't settle on the best variant and keeps running random ones,
and that hurts the instruction cache.
So was it all in vain? Not quite :)
Based on the experiments, I still replaced one memcpy with another.
It showed advantages in production
due to more efficient handling of small sizes.
Optimize your code.
Try crazy ideas.
Never give up.
Alexey Milovidov
ClickHouse Developer