How to Move Bytes

Alexey Milovidov
ClickHouse Developer

How to Write Fast Code?

High-level optimization:

— choosing the right algorithm;
— using appropriate data structures;
— choosing interfaces that match the task;

Low-level optimization:

— eliminate unnecessary copying and allocations;
— make sure the right instructions are generated;
— make sure data is well-arranged in memory;

How to Write Fast Code?

Profile it:

4.47%  libstdc++.so.6.0.28  [.] __dynamic_cast
1.42%  libstdc++.so.6.0.28  [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast
1.37%  libstdc++.so.6.0.28  [.] std::operator>><char, std::char_traits<char>, std::allocator<char> >
1.18%  libstdc++.so.6.0.28  [.] __cxxabiv1::__si_class_type_info::__do_dyncast
0.83%  libstdc++.so.6.0.28  [.] std::basic_ios<char, std::char_traits<char> >::_M_cache_locale
0.75%  libstdc++.so.6.0.28  [.] std::locale::id::_M_id
0.73%  libstdc++.so.6.0.28  [.] std::use_facet<std::ctype<char> >
0.64%  libstdc++.so.6.0.28  [.] std::ios_base::_M_init
0.64%  libc-2.31.so         [.] __strcmp_avx2
0.63%  libstdc++.so.6.0.28  [.] std::ostream::_M_insert<unsigned long>

Question

Removing std::stringstream from an inner loop
— is this a high-level or a low-level optimization?
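For illustration, a hypothetical sketch of such a change (not ClickHouse code): the stringstream version constructs a stream, touches locale machinery, and allocates on every iteration; the direct version formats into a small stack buffer and appends to one result string.

#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

std::string format_with_stringstream(const std::vector<unsigned long> & values)
{
    std::string result;
    for (auto v : values)
    {
        std::stringstream ss;  /// constructed and destroyed on every iteration
        ss << v << '\n';
        result += ss.str();
    }
    return result;
}

std::string format_directly(const std::vector<unsigned long> & values)
{
    std::string result;
    char buf[24];  /// enough for a 64-bit integer and '\n'
    for (auto v : values)
        result.append(buf, snprintf(buf, sizeof(buf), "%lu\n", v));
    return result;
}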

How to Write Fast Code?

Profile it:

12.96%  clickhouse  [.] memcpy                                    - copying
10.37%  [kernel]    [k] copy_user_generic_string                  - copying
 5.73%  clickhouse  [.] DB::deserializeBinarySSE2<1>           - deserialization
 4.50%  clickhouse  [.] DB::deserializeBinarySSE2<4>           - deserialization
 4.47%  clickhouse  [.] LZ4::(anonymous namespace)::decompressImpl<32ul, false>
 4.34%  clickhouse  [.] LZ4::(anonymous namespace)::decompressImpl<8ul, true>
 4.10%  clickhouse  [.] LZ4::(anonymous namespace)::decompressImpl<16ul, false>
 3.96%  clickhouse  [.] LZ4::(anonymous namespace)::decompressImpl<16ul, true>
        

How to Reduce memcpy Costs?

memcpy — data copying.

How to remove memcpy from the profile?

1. Copy less data.
Not always possible.
Often data needs to be moved to speed everything up.

2. Optimize memcpy implementation.
But hasn't everyone already optimized it?

Important Facts About memcpy

"It's the world's most popular function — one all programmers love"

— Justine Alexandra Roberts Tunney

Important Facts About memcpy

1. The compiler can replace memcpy with a built-in implementation.
And it does if the size is small and known at compile time.

memcpy(dst, src, 8);

movq    (%rsi), %rax
movq    %rax, (%rdi)

This is controlled by the flags -fbuiltin-memcpy and -fno-builtin-memcpy.

The built-in implementation can be requested explicitly:
__builtin_memcpy(dst, src, size).

For sizes unknown at compile time, it is replaced by a call to memcpy.

Important Facts About memcpy

1. The compiler can replace memcpy with a built-in implementation.
And it does if the size is small and known at compile time.

memcpy(dst, src, 16):

movups  (%rsi), %xmm0
movups  %xmm0, (%rdi)

memcpy(dst, src, 15):

movq    (%rsi), %rax
movq    7(%rsi), %rcx
movq    %rcx, 7(%rdi)
movq    %rax, (%rdi)

Important Facts About memcpy

2. The compiler can replace your code with a call to memcpy.
And it does if you write something similar.

void f(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
}

f(char*, char const*, unsigned long):
        testq   %rdx, %rdx
        je      .LBB2_2
        pushq   %rax
        callq   memcpy@PLT
        addq    $8, %rsp
.LBB2_2:
        retq

This is controlled by the flags -fbuiltin-memcpy (clang)
or -ftree-loop-distribute-patterns (gcc).

Important Facts About memcpy

2. The compiler can replace your code with a call to memcpy.
And it does if you write something similar.

void memcpy(
    char * __restrict dst,
    const char * __restrict src,
    size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
}

The compiler recognizes the loop and replaces it with a call to memcpy,
so the function endlessly calls itself.

Segmentation fault (stack overflow).

Important Facts About memcpy

3. glibc contains many memcpy implementations,
from which the one suitable for the specific instruction set is chosen.

__memcpy_erms
__memcpy_sse2_unaligned
__memcpy_ssse3
__memcpy_ssse3_back
__memcpy_avx_unaligned
__memcpy_avx_unaligned_erms
__memcpy_avx512_unaligned
__memcpy_avx512_unaligned_erms
__memcpy_avx512_no_vzeroupper

This happens via the GNU IFUNC (indirect functions) mechanism,
at the first use of the function from the shared library,
with the help of the dynamic loader.
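A minimal sketch of what IFUNC dispatch looks like at the source level (GCC/Clang on Linux/ELF; all names here are hypothetical, not the glibc internals):

#include <cstddef>

extern "C" void * memcpy_generic(void * dst, const void * src, size_t size)
{
    char * d = static_cast<char *>(dst);
    const char * s = static_cast<const char *>(src);
    for (size_t i = 0; i < size; ++i)
        d[i] = s[i];
    return dst;
}

extern "C" void * memcpy_avx2(void * dst, const void * src, size_t size)
{
    return __builtin_memcpy(dst, src, size);  /// stand-in body for the sketch
}

using memcpy_t = void * (*)(void *, const void *, size_t);

/// The resolver runs once, when the dynamic loader binds the symbol.
extern "C" memcpy_t resolve_memcpy()
{
    __builtin_cpu_init();  /// query CPUID
    return __builtin_cpu_supports("avx2") ? memcpy_avx2 : memcpy_generic;
}

/// Calls to my_memcpy go to whichever variant the resolver returned.
extern "C" void * my_memcpy(void * dst, const void * src, size_t size)
    __attribute__((ifunc("resolve_memcpy")));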

Important Facts About memcpy

4. The memcpy function is built into the processor
and can be executed in a single instruction!

rep movsb  %ds:(%rsi), %es:(%rdi)

This instruction has been present in all x86 CPUs from the very beginning.

If memcpy is implemented in hardware,
why not use that implementation?
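A minimal sketch of a memcpy built on that single instruction, via GCC/Clang extended inline asm (x86-64 only; a hypothetical helper, not the benchmark code):

#include <cstddef>

static void * memcpy_rep_movsb(void * dst, const void * src, size_t size)
{
    void * ret = dst;
    /// rep movsb copies RCX bytes from [RSI] to [RDI];
    /// the "+D", "+S", "+c" constraints pin the arguments to those registers.
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(size)
                         :
                         : "memory");
    return ret;
}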

How to Optimize memcpy?

Every self-respecting developer
has tried to optimize memcpy at least once in their life.

Before You Begin...

Do you have reproducible performance tests
under representative load?

Can you optimize code at a high level,
removing unnecessary copying?

Maybe you need a function with a slightly different interface?

What to Write memcpy In?

In Assembly language:

memcpy.s

+ you precisely control instruction and register selection,
  their order, and even alignment;

− the function doesn't inline;

− link-time optimization doesn't work.

What to Write memcpy In?

— in Assembly language;

— in C;

— in C++;

— in C or C++ with inline assembly;

What Does memcpy Consist Of?

1. Copying "tails" — small unaligned
 chunks of data at the beginning and end.

2. Unrolled loop using wide instructions.
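Roughly, in plain C++ (a schematic sketch, not the real implementation; unrolling and instruction selection are left to the compiler):

#include <cstddef>

void * memcpy_sketch(void * __restrict dst_, const void * __restrict src_, size_t size)
{
    char * __restrict dst = static_cast<char *>(dst_);
    const char * __restrict src = static_cast<const char *>(src_);
    void * ret = dst_;

    if (size <= 16)
    {
        /// 1. Small sizes: the "tail" logic alone is enough.
        if (size >= 8)
        {
            /// Two possibly overlapping 8-byte copies cover 8..16 bytes.
            __builtin_memcpy(dst, src, 8);
            __builtin_memcpy(dst + size - 8, src + size - 8, 8);
        }
        else
        {
            for (size_t i = 0; i < size; ++i)
                dst[i] = src[i];
        }
        return ret;
    }

    /// 2. Main loop with 16-byte chunks; the last chunk is copied
    /// from the very end and may overlap the loop's final iteration.
    size_t offset = 0;
    for (; offset + 16 <= size; offset += 16)
        __builtin_memcpy(dst + offset, src + offset, 16);
    __builtin_memcpy(dst + size - 16, src + size - 16, 16);

    return ret;
}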

memcpy Performance

For small sizes, unknown at compile time.
Example: copying std::string, size ~50 bytes.

The cost of the function call:

— if the function is in a shared library,
  the call is indirect, through the PLT;

— if the function is in a separate translation unit,
  it can't be inlined and must actually be called
  (and for asm, LTO doesn't help);

— registers have to be saved at the call site.

memcpy Performance

For large sizes:

— Which instructions are used for copying.

— How the loop is unrolled.

— In what order to traverse the data.

— Whether aligned instructions are used.

— Non-temporal stores.

— Flushing the upper halves of the registers (vzeroupper).

memcpy Performance

Strongly depends on:

— data distribution and load;

— processor model;

memcpy Performance

Example: rep movsb.

One instruction. Implemented in microcode.

On most old CPUs it is slow in any case.

On many CPUs it is fast, but has a large startup overhead.

New Intel processors have the erms flag (Enhanced REP MOVSB),
and there it is fast in all cases... but not on AMD.

memcpy Performance

Example: copying with AVX instructions.

On most CPUs it works faster than with SSE instructions.

But if the rest of the program's code uses SSE,
then after AVX code runs, all SSE instructions stay slow indefinitely.

To avoid the penalty, you need to execute the vzeroupper instruction.

But the vzeroupper instruction itself is expensive.

On the newest CPUs, though, the transition is cheap
and vzeroupper doesn't need to be called.
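For instance, a hypothetical AVX copy loop with the flush at the end (compile with -mavx; not the benchmark code):

#include <immintrin.h>
#include <cstddef>

void copy_avx(char * __restrict dst, const char * __restrict src, size_t size)
{
    size_t i = 0;
    for (; i + 32 <= size; i += 32)
    {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst + i), chunk);
    }
    _mm256_zeroupper();  /// emits vzeroupper: clears the upper halves of the ymm registers
    for (; i < size; ++i)  /// scalar tail
        dst[i] = src[i];
}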

memcpy Performance

Example: copying with AVX-512 instructions.

On not-so-new processors it is slower than AVX
and causes the CPU frequency to drop...

...if used simultaneously on many cores.

On new processors this no longer happens.

But there is no AVX-512 on AMD.

memcpy Performance

Example: copying with non-temporal stores.

The _mm_stream* functions; the movnt*, vmovnt* instructions.

They write data to memory, bypassing the cache.

This speeds up memcpy by about 1.5× for large sizes.

And it is completely useless in production:
there, the copied data is usually needed in cache right away.
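For reference, a hypothetical SSE2 variant with streaming stores (assumes dst is 16-byte aligned and size is a multiple of 16):

#include <immintrin.h>
#include <cstddef>

void copy_streaming(char * __restrict dst, const char * __restrict src, size_t size)
{
    for (size_t i = 0; i < size; i += 16)
    {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i), chunk);  /// movntdq: bypasses the cache
    }
    _mm_sfence();  /// order streaming stores before subsequent normal stores
}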

memcpy Performance

Better not try to optimize memcpy!

On your machine and your benchmark, everything will speed up.
But in production, for users, it will slow down.

... but I'll try.

Writing memcpy Benchmark

Use different buffer sizes.

Use different sizes for memcpy calls.

Use sizes from random distributions,
so the branch predictor doesn't have it too easy.

Use different numbers of threads.

Copy between buffers in different directions.
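Put together, one single-threaded cell of such a benchmark might look like this (a hypothetical sketch; the real harness is utils/memcpy-bench):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <random>
#include <vector>

int main()
{
    std::vector<char> src(1 << 20), dst(1 << 20);
    std::mt19937 rng(42);
    /// Skewed distribution: mostly small sizes, occasionally large ones,
    /// so the branch predictor inside memcpy doesn't have it too easy.
    std::lognormal_distribution<double> size_dist(4.0, 1.0);

    size_t bytes = 0;
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < 10000000; ++i)
    {
        size_t size = std::min(src.size() - 1, static_cast<size_t>(size_dist(rng)));
        size_t offset = rng() % (src.size() - size);
        memcpy(dst.data() + offset, src.data() + offset, size);
        bytes += size;
    }
    double elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    printf("%.2f GB/sec\n", bytes / elapsed / 1e9);
}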

Writing memcpy Benchmark

— 9 implementations from glibc.

— Cosmopolitan libc.

— rep movsb.

— Simple loop implementation:
  loop peeling, loop unrolling and vectorization — done by compiler.

— "Chinese memcpy", 2 variants.

— My implementation, which from afar looks similar to the others.

Writing memcpy Benchmark

Run on different machines:

— Intel Cascade Lake;

— Intel Ivy Bridge;

— AMD EPYC;

— AMD Ryzen.

Writing memcpy Benchmark

https://github.com/ClickHouse/ClickHouse/tree/master/utils/memcpy-bench

for size in 4096 16384 50000 65536 100000 1000000 10000000 100000000; do
  for threads in 1 2 4 $(($(nproc) / 2)) $(nproc); do
    for distribution in 1 2 3 4 5; do
      for variant in {1..13} {21..29}; do
        for i in {1..10}; do
          ./memcpy-bench --tsv --size $size --variant $variant \
            --threads $threads --distribution $distribution;
        done;
      done;
    done;
  done;
done | tee result.tsv

42,000 measurements from one machine.

Writing memcpy Benchmark

Result: there's no single best memcpy.

There are good ones:

— __memcpy_avx_unaligned from glibc,
  but it can't be inlined into our code because of the license,
  and we'd have to bring our own CPU dispatch;

— MemCpy from Cosmopolitan libc is OK,
  but it loses on small sizes.

I still hope to make our own, best memcpy :)

Generalized memcpy

#define VEC_SIZE 16
#define VZEROUPPER 0
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER

#define VEC_SIZE 32
#define VZEROUPPER 1
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER

#define VEC_SIZE 64
#define VZEROUPPER 1
#include "memcpy_medium_every_unroll.inl.h"
#undef VEC_SIZE
#undef VZEROUPPER

Generalized memcpy

#define UNROLL_COUNT 1
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 2
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 3
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

...

#define UNROLL_COUNT 15
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

#define UNROLL_COUNT 16
#include "memcpy_medium.inl.h"
#undef UNROLL_COUNT

Generalized memcpy

#if VEC_SIZE == 16
    #define NAME_PART sse
    #define VEC_REGISTER "xmm"
    #define VEC_MOV_UNALIGNED "movdqu"
    #define VEC_MOV_ALIGNED "movdqa"
    #define VEC_SIZE_MINUS_1 "0x0f"
    #define VEC_SIZEx1 "0x10"
    ...
    #define VEC_SIZEx16 "0x100"
#elif VEC_SIZE == 32
    #define NAME_PART avx
    #define VEC_REGISTER "ymm"
    #define VEC_MOV_UNALIGNED "vmovdqu"
    #define VEC_MOV_ALIGNED "vmovdqa"
    #define VEC_SIZE_MINUS_1 "0x1f"
    #define VEC_SIZEx1 "0x20"
    ...
    #define VEC_SIZEx16 "0x200"
#elif VEC_SIZE == 64
    ...

Generalized memcpy

void * NAME_FORWARD_UNROLLED(void * __restrict destination, const void * __restrict source, size_t size)
{
    void * __restrict ret = destination;

    __asm__ __volatile__ (
    "mov %[dst], %[ret] \n"

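    /// Copy the first VEC_SIZE bytes; this also covers the unaligned head.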
    VEC_MOV_UNALIGNED " (%[src]), %%" VEC_REGISTER "15 \n"
    VEC_MOV_UNALIGNED " %%" VEC_REGISTER "15, (%[dst]) \n"

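    /// rcx = dst + size - VEC_SIZE: where the tail will be stored.
    /// Then advance src/dst past the already-copied head so that dst
    /// becomes VEC_SIZE-aligned, and shrink size by the same amount.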
    "lea    -" VEC_SIZEx1 "(%[dst],%[size],1), %%rcx \n"
    "mov    %[dst], %%r8 \n"
    "and    $" VEC_SIZE_MINUS_1 ", %%r8 \n"
    "sub    $" VEC_SIZEx1 ", %%r8 \n"
    "sub    %%r8, %[src] \n"
    "sub    %%r8, %[dst] \n"
    "add    %%r8, %[size] \n"

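    /// Main loop: UNROLL_COUNT unaligned loads from src,
    /// then UNROLL_COUNT aligned stores to dst.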
"1: \n"
    VEC_MOV_UNALIGNED " (%[src]), %%" VEC_REGISTER "0 \n"
#if UNROLL_COUNT >= 2
    VEC_MOV_UNALIGNED " " VEC_SIZEx1 "(%[src]), %%" VEC_REGISTER "1 \n"
#endif
...

    "add    $" VEC_SIZExUNROLL ", %[src] \n"
    "sub    $" VEC_SIZExUNROLL ", %[size] \n"

    VEC_MOV_ALIGNED " %%" VEC_REGISTER "0, (%[dst]) \n"
#if UNROLL_COUNT >= 2
    VEC_MOV_ALIGNED " %%" VEC_REGISTER "1, " VEC_SIZEx1 "(%[dst]) \n"
#endif
...

    "add    $" VEC_SIZExUNROLL ", %[dst] \n"
    "cmp    $" VEC_SIZExUNROLL ", %[size] \n"
    "ja     1b \n"

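    /// Tail: load the last UNROLL_COUNT vectors from the end of src
    /// and store them (unaligned) backwards from rcx.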
    VEC_MOV_UNALIGNED " -" VEC_SIZEx1 "(%[src],%[size],1), %%" VEC_REGISTER "0 \n"
#if UNROLL_COUNT >= 2
    VEC_MOV_UNALIGNED " -" VEC_SIZEx2 "(%[src],%[size],1), %%" VEC_REGISTER "1 \n"
#endif
...

    VEC_MOV_UNALIGNED " %%" VEC_REGISTER "0, (%%rcx) \n"
#if UNROLL_COUNT >= 2
    VEC_MOV_UNALIGNED " %%" VEC_REGISTER "1, -" VEC_SIZEx1 "(%%rcx) \n"
#endif
...

    VZEROUPPER_INSTRUCTION

    : [dst]"+r"(destination), [src]"+r"(source), [size]"+r"(size), [ret]"=rax"(ret)
    :
    : "rcx", "r8", "r11",
      VEC_REGISTER "0", VEC_REGISTER "1", VEC_REGISTER "2", VEC_REGISTER "3",
      VEC_REGISTER "4", VEC_REGISTER "5", VEC_REGISTER "6", VEC_REGISTER "7",
      VEC_REGISTER "8", VEC_REGISTER "9", VEC_REGISTER "10", VEC_REGISTER "11",
      VEC_REGISTER "12", VEC_REGISTER "13", VEC_REGISTER "14", VEC_REGISTER "15",
      "memory");

    return ret;
}

Self-Tuning memcpy

Idea: for large sizes, let memcpy itself
choose the best variant based on statistics!

Set a threshold of 30,000 bytes. Below it, use the regular implementation.

Above it, run a random variant and collect statistics.

memcpy within L1 cache runs at 50..100 GB/sec,
so copying 30 KB takes at least 300 ns.

We'd like to spend at most a few percent of that on statistics collection:
no more than 10 nanoseconds.

Self-Tuning memcpy

Idea: for large sizes, let memcpy itself
choose the best variant based on statistics!

Statistics collection and variant selection:
at most 10 nanoseconds.

https://github.com/ClickHouse/ClickHouse/blob/38d665/utils/memcpy-bench/memcpy-bench.cpp#L497

ALWAYS_INLINE void * call(void * __restrict dst, const void * __restrict src, size_t size)
{
    size_t current_count = count++;

    if (likely(current_count % probability_distribution_buckets < exploration_probability_threshold))
    {
        /// Exploitation mode.
        return selected_variant.load(std::memory_order_relaxed)(dst, src, size);
    }
    else
    {
        /// Exploration mode.
        return explore(dst, src, size);
    }
}

void * explore(void * __restrict dst, const void * __restrict src, size_t size)
{
    size_t current_exploration_count = exploration_count++;

    size_t hash = current_exploration_count;
    hash *= 0xff51afd7ed558ccdULL;
    hash ^= hash >> 33;

    VariantWithStatistics & variant = variants[hash % num_cells];

    UInt32 time1 = rdtsc();
    void * ret = variant.func.load(std::memory_order_relaxed)(dst, src, size);
    UInt32 time2 = rdtsc();
    UInt32 time = time2 - time1;

    if (time < size)  /// count the sample only if it ran faster than one byte per cycle
    {
        ++variant.count;
        variant.bytes += size;
        variant.time += time;
    }

    /// ... (choosing the best variant from the accumulated statistics is elided here)

    return ret;
}

Self-Tuning memcpy

On-the-fly learning works successfully in the benchmark.

It chooses the best variant and runs at the speed of the best.

Then I put the self-tuning memcpy into ClickHouse
and started testing it on real queries...

Self-Tuning memcpy

On-the-fly learning works successfully in the benchmark.

It chooses the best variant and runs at the speed of the best.

Then I put the self-tuning memcpy into ClickHouse
and started testing it on real queries...

And nothing sped up!

The self-tuning memcpy can't consistently
choose the best variant on a real workload.

Self-Tuning memcpy

On a real workload, when many threads are in use, memcpy hits the memory
bandwidth limit, so there is no noticeable difference between the tested variants.

As a result, the self-tuning memcpy can't choose the best variant and keeps
running random ones, and this hurts the instruction cache.

Was All the Work Useless?

Not quite :)

Based on the experiments, I still replaced one memcpy with another.

It showed advantages in production
due to more efficient handling of small sizes.

https://github.com/ClickHouse/ClickHouse/pull/21520

Conclusions

Optimize your code.

Try crazy ideas.

Never give up.

Thank You!

Alexey Milovidov
ClickHouse Developer