Author: Alexey Milovidov, 2019-11-08.
Alexey, ClickHouse developer.
This example is taken from practice.
Not mine.
Very long ago.
In a completely different company.
There is a map-reduce cluster.
It consists of data nodes and a master node.
Data nodes store the data.
Data nodes know the master's address and connect to it.
The master node tracks which data should be located where
and issues commands to the data nodes.
Someone updated the data node configuration on one of the clusters.
By mistake, the data nodes received the master address of another cluster.
...
The master decided that the data nodes contained some unknown data
and ordered all of them to delete it.
By the time someone noticed, half the data was already gone.
Megg, Mogg & Owl Series by Simon Hanselmann
The most epic bugs are those that lead
to unintentional data deletion.
How to avoid:
Don't delete data, but set it aside.
Don't delete unexpected data if the reason is unknown.
Set a threshold on the quantity of unexpected data:
if there's too much, refuse to start (see the sketch below).
Isolation of testing and production at the network level.
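As an illustration, a minimal sketch of such a startup guard, with made-up names and threshold (ClickHouse has a guard in this spirit, e.g. the max_suspicious_broken_parts setting):

#include <cstddef>
#include <stdexcept>
#include <string>

/// Refuse to start if too large a fraction of local data is unknown
/// to the coordinator. All names and the threshold are illustrative.
constexpr double max_unexpected_ratio = 0.05;

void checkUnexpectedData(size_t unexpected_bytes, size_t total_bytes)
{
    if (total_bytes == 0)
        return;
    const double ratio = double(unexpected_bytes) / double(total_bytes);
    if (ratio > max_unexpected_ratio)
        throw std::runtime_error(
            "Refusing to start: " + std::to_string(100 * ratio)
            + "% of local data is unexpected. It was set aside, not deleted.");
}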
This example is taken from practice.
Now mine...
At one good company, a ClickHouse cluster was behaving strangely.
During operation, replicas didn't synchronize,
and on restart the server reported
that «all data is invalid» and wouldn't start.
When the force_restore_data flag was set,
part of the data was set aside.
ClickHouse cluster uses ZooKeeper for coordination.
ZooKeeper stores metadata about
which data should be on which replica.
ZooKeeper is itself a cluster, typically of three machines.
In the ClickHouse configuration, all ZooKeeper machines are specified,
and a connection is established with a random one.
Reason: instead of one cluster of three ZK nodes,
there were three independent nodes (three one-node clusters).
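For illustration, a sketch of the difference in ZooKeeper terms (hostnames are made up). A real ensemble lists all members in every node's zoo.cfg; without the server.* lines, each node runs standalone:

# zoo.cfg, identical on each of the three machines
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
# Without the three lines below, each node is an independent
# single-node "cluster": exactly the misconfiguration above.
server.1=zk1.example.net:2888:3888
server.2=zk2.example.net:2888:3888
server.3=zk3.example.net:2888:3888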
Solution:
— fix ZK configuration;
— execute ATTACH for the data parts from the detached/unexpected_* directories (a sketch follows below).
Result:
— all data restored;
— replicas synchronized;
— zero loss!
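A sketch of the ATTACH step from the solution above (the table and part names are made up; depending on the ClickHouse version, the unexpected_ prefix may first have to be stripped from the directory name under detached/):

-- for each part that was set aside:
ALTER TABLE merge.hits ATTACH PART '20191101_20191108_0_100_3';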
Example from 2015.
This bug manifested in production on Yandex.Metrica cluster...
Symptom:
Very rarely a user gets an exception
«Checksum doesn't match, corrupted data»
or «LRUCache became inconsistent. There must be a bug in it».
«Checksum doesn't match, corrupted data»
— checksums of compressed data blocks are verified before decompression;
— error usually indicates
that data on the file system is corrupted;
— but if you read the same file manually, there is no error;
— once it appears, the error reproduces consistently when repeating the query;
— after server restart, the error disappears for some time.
«Checksum doesn't match, corrupted data»
— maybe data is corrupted in RAM?
— but there are no machine check exceptions in dmesg (kern.log);
— and apart from this, the error has no other manifestations;
«LRUCache became inconsistent. There must be a bug in it.»
— here we are clearly told that there's a bug in the code;
— possibly memory corruption?
— but tests under ASan and TSan in CI show nothing;
— and running the server under ASan in production doesn't catch anything;
— flushing the mark cache temporarily fixes the error.
Looking for the bug by staring at the code
and reviewing the changes in the latest release.
Fixed three other bugs along the way, but the problem remains.
Trying to find patterns by servers,
by time, by load characteristics...
The problem manifests on only one of the clusters,
and never on the others.
Only this cluster uses a new feature:
cache dictionaries.
Cache dictionaries use a hand-written allocator,
ArenaWithFreeLists.
class ArenaWithFreeLists
{
    /// One free list per power-of-two size class.
    Block * free_lists[16] {};

    static auto sizeToPreviousPowerOfTwo(size_t size)
    {
        /// Bug: for size == 1, size - 1 == 0,
        /// and _bit_scan_reverse(0) is undefined.
        return _bit_scan_reverse(size - 1);
    }

    char * alloc(size_t size)
    {
        const auto list_idx = findFreeListIndex(size);
        free_lists[list_idx]->...
    }
};
int _bit_scan_reverse(int a)
Set dst to the index of the highest set bit in 32-bit integer a.
If no bits are set in a then dst is undefined.
The compiler could exploit this undefined behaviour
to optimize the code...
...but instead it simply generates bsr %edi, %eax;
and the bsr instruction has undefined behavior at the CPU level
if the operand is zero.
The bsr instruction has undefined behavior at the CPU level
if the operand is zero.
In practice, with a zero operand the processor leaves the destination register unchanged,
so the result depends on how and where the compiler inlines the function.
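A zero-safe variant is straightforward; a sketch, not the exact upstream fix:

static size_t sizeToPreviousPowerOfTwo(size_t size)
{
    /// Handle the zero operand explicitly: both bsr and
    /// __builtin_clz are undefined when the argument is zero.
    if (size <= 1)
        return 0;
    return 31 - __builtin_clz(static_cast<unsigned>(size - 1));
}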
bsrl %edi, %eax
retq
$ objdump -Cd /usr/bin/clickhouse-server |
grep -E 'bsr|^000' | grep -B1 bsr | grep -A2 -E 'Cache|Arena'
00000000010f8690 <DB::ArenaWithFreeLists::findFreeListIndex(unsigned long)>:
10f8694: 0f bd db bsr %ebx,%ebx
00000000010f99d0 <DB::CacheDictionary::setAttributeValue(DB::CacheDictionary::attribute_t&, unsigned long, DB::Field const&) const>:
10f9a48: 44 0f bd f2 bsr %edx,%r14d
00000000010f9be0 <DB::CacheDictionary::setDefaultAttributeValue(DB::CacheDictionary::attribute_t&, unsigned long) const>:
10f9c52: 44 0f bd ea bsr %edx,%r13d
0000000001a888a0 <DB::ComplexKeyCacheDictionary::setDefaultAttributeValue(DB::ComplexKeyCacheDictionary::attribute_t&, unsigned long) const>:
1a88917: 45 0f bd e4 bsr %r12d,%r12d
0000000001a88b40 <DB::ComplexKeyCacheDictionary::allocKey(unsigned long, std::vector<DB::IColumn const*, std::allocator<DB::IColumn const*> > const&, std::vector<StringRef, std::allocator<StringRef> >&) const>:
1a88be9: 45 0f bd ed bsr %r13d,%r13d
0000000001a88e10 <DB::ComplexKeyCacheDictionary::freeKey(StringRef) const>:
1a88e5a: 0f bd db bsr %ebx,%ebx
0000000001a88fd0 <DB::ComplexKeyCacheDictionary::copyKey(StringRef) const>:
1a89033: 0f bd ed bsr %ebp,%ebp
0000000001a89780 <DB::ComplexKeyCacheDictionary::setAttributeValue(DB::ComplexKeyCacheDictionary::attribute_t&, unsigned long, DB::Field const&) const>:
1a897f8: 44 0f bd ea bsr %edx,%r13d
1a89929: 45 0f bd e4 bsr %r12d,%r12d
— we take garbage instead of an array index;
— and get memory corruption somewhere far away;
— almost all memory is filled with caches, so a cache gets corrupted;
— specifically, the cache of «marks»: offsets in files;
— we read a file at the wrong offset
— and get garbage instead of a checksum;
«Checksum doesn't match, corrupted data»
https://github.com/ClickHouse/ClickHouse/commit/2368ac36
Fixed December 27, 2015
Hardware failures?
Or rather, inevitable effects.
— non-atomicity of writes on RAID;
— bit rot on HDD and SSD;
— bit flips in RAM (but what about ECC?);
— bit flips at the CPU level;
— bit flips in the network (but what about TCP and Ethernet checksums?).
https://github.com/ClickHouse/ClickHouse/issues/780
«A malformed znode prevents ClickHouse from starting»
format version: 3
create_time: 2017-05-04 13:00:44
source repliba: sde500
block_id:
merge
20170501_20170504_200_49083_12589
20170504_20170504_49084_49084_0
into
20170501_20170504_200_49084_12590
Notice «source repliba» instead of «source replica»:
'b' is 0x62 and 'c' is 0x63, a single bit flip.
http://dinaburg.org/bitsquatting.html
Rowhammer, ECCploit, RAMBleed...
Always checksum data!
— when writing to the file system;
— when transmitting over the network.
Production Metrica cluster. An exception comes in response to a query:
Checksum doesn't match: corrupted data.
Reference: a87ee784054ad3265e316ade23acb8d8.
Actual: 0417637c7f711925046ab4f9d1cf1b68.
Size of compressed block: 773:
while receiving packet from mtxxxlog01-54-3.metrika.yandex.net:9000, 2a02:6b8:...
Looks familiar! Memory corruption again?
The error started manifesting on February 6, 2019.
Reproduces several times a day
among 1000+ servers with ClickHouse.
We weren't rolling out releases at that time :(
We couldn't debug the problem,
and after a few days the error disappeared on its own.
We forgot about it, but on May 15, 2019, the error started manifesting again.
Reproduces several times a day
among 1000+ servers with ClickHouse.
Attempts to reproduce, reviewing logs and graphs — nothing helps.
If the problem can't be reproduced, the only option
is to collect all cases and look for patterns.
7 of the 9 servers with E5-2683 v4 CPUs were affected, but only about half of the affected servers have E5-2683 v4.
Doesn't depend on the Linux kernel version.
Errors usually don't repeat, except on the mtauxyz cluster, where the corrupted data is real (on disk): a different case.
kern.log — nothing interesting.
CPU, IO, Network graphs — nothing interesting.
Nothing interesting by network adapter type.
Server uptime is high; the servers don't crash. No segfaults or the like.
Errors are grouped by days (manifest within a couple of days), but not grouped more locally by time. Solar activity?
Across different error cases, the packets themselves match (received checksum, expected checksum, size): most errors involve only two packet variants.
Compressed block size is small (less than a kilobyte).
No patterns by servers from which we read data.
Binary representation of packet sizes and checksums is unremarkable.
Only one of the clusters.
Only third replicas, located in Vladimir data center.
These facts coincide with the previous case, which was back in February — definitely on a different ClickHouse version.
All errors when reading packets over network: while receiving packet from.
The packet on which the error occurs depends on the query structure. Queries with different structures fail on different checksums, but queries that fail on the same checksum differ only in their constants.
In all queries except one there is GLOBAL JOIN.
But one query is unusually simple:
SELECT max(ReceiveTimestamp) FROM tracking_events_all
WHERE APIKey = 1111 AND (OperatingSystem IN ('android', 'ios'))
And the compressed block size is only 75 bytes.
Affected servers are grouped by names:
mtxxxlog01-{39..44,57..58,64,68..71,73..74,76}-3
Groups of problematic servers match those
from February.
Problematic servers are in VLA-03, VLA-04,
and non-problematic ones — in VLA-02.
Let's find in query_log a query with such an error,
where the compressed block size is small (= 107) and the query is simple:
SELECT hostName() AS h, exception, query
FROM cluster('mtxxxlogs_all_replicas', system.query_log)
WHERE type = 4 AND exception LIKE '%corrupted%' AND event_date >= '2019-01-01'
SETTINGS skip_unavailable_shards = 1
Row 1:
──────
h: mtxxxlog04-01-3
exception: Code: 40, e.displayText() = DB::Exception: Checksum doesn't match: corrupted data. Reference: f633b841a7a7a80838dd6a89d391bfda. Actual: 8b40502d2ffe5b712b52e03c505ca49f. Size of compressed block: 107.: while receiving packet from mtxxxlog01-14-1.yandex.ru:9000, 2a02:6b8:b011:3000:48fb:2439:729e:4709, e.what() = DB::Exception
query: SELECT uniqIf(DeviceIDHash,SessionType = 0) AS `ym:ge:users` FROM mobile.generic_events_all WHERE StartDate = toDate('2019-01-01') and APIKey IN (2162208,2174014,2188009,2216512,2216749,2233579,2251144,2254336,2265019,2371901) and EventType = 1 WITH TOTALS ORDER BY `ym:ge:users` DESC limit 0,100 FORMAT JSONCompact
I executed the query using the clickhouse-local program, so as to get exactly the same blocks over the network as in that case:
strace -f -e trace=network -s 1000 -x \
clickhouse-local --query "
SELECT uniqIf(DeviceIDHash, SessionType = 0)
FROM remote('127.0.0.{2,3}', mobile.generic_events)
WHERE StartDate = '2019-02-07' AND APIKey IN (616988,711663,507671,835591,262098,159700,635121,509222)
AND EventType = 1 WITH TOTALS" --config config.xml
The query executes without errors.
Using strace, I got a dump of the blocks,
deciphered the packets, and found the expected checksum there.
107 bytes
Reference: f633b841a7a7a80838dd6a89d391bfda. Actual: 8b40502d2ffe5b712b52e03c505ca49f.
\x01\x00
\x08\xa8\xa7\xa7\x41\xb8\x33\xf6\xda\xbf\x91\xd3\x89\x6a\xdd\x38
\x82\x6b\x00\x00\x00\x62\x00\x00\x00\xf2\x3b\x01\x00\x02\xff\xff\xff\xff\x00\x01\x01\x2c\x75\x6e\x69\x71\x49\x66\x28\x44\x65\x76\x69\x63\x65\x49\x44\x48\x61\x73\x68\x2c\x20\x65\x71\x75\x61\x6c\x73\x28\x53\x65\x73\x73\x69\x6f\x6e\x54\x79\x70\x65\x2c\x20\x30\x29\x29\x28\x41\x67\x67\x72\x65\x67\x61\x74\x65\x46\x75\x6e\x63\x74\x69\x6f\x6e\x28\x3f\x00\xf0\x03\x2c\x20\x55\x49\x6e\x74\x36\x34\x2c\x20\x55\x49\x6e\x74\x38\x29\x00\x00
\x01
\x00
\x08\xa8\xa7\xa7\x41\xb8\x33\xf6\xda\xbf\x91\xd3\x89\x6a\xdd\x38
\x82
\x6b\x00\x00\x00
\x62\x00\x00\x00
\xf2\x3b\x01\x00\x02\xff\xff\xff\xff\x00
\x01\x01\x2c\x75\x6e\x69\x71\x49\x66\x28
\x44\x65\x76\x69\x63\x65\x49\x44\x48\x61
\x73\x68\x2c\x20\x65\x71\x75\x61\x6c\x73
\x28\x53\x65\x73\x73\x69\x6f\x6e\x54\x79
\x70\x65\x2c\x20\x30\x29\x29\x28\x41\x67
\x67\x72\x65\x67\x61\x74\x65\x46\x75\x6e
\x63\x74\x69\x6f\x6e\x28\x3f\x00\xf0\x03
\x2c\x20\x55\x49\x6e\x74\x36\x34\x2c\x20
\x55\x49\x6e\x74\x38\x29\x00\x00
^A^@^B<FF><FF><FF><FF>^@^A^A,uniqIf(DeviceIDHash, equals(SessionType, 0))(AggregateFunction(uniqIf, UInt64, UInt8)^@^@
01 - field_num
00 - is_overflows
02 - field_num
ffffffff - bucket_num
00 - end of block info
01 - columns
01 - rows
2c 756e69714966284465766963654944486173682c20657175616c732853657373696f6e547970652c20302929 uniqIf(DeviceIDHash, equals(SessionType, 0))
28 41676772656761746546756e6374696f6e28756e697149662c2055496e7436342c2055496e743829 AggregateFunction(uniqIf, UInt64, UInt8)
00 - skip degree
00 - size
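A sketch that mechanically decodes the header fields above, following exactly the layout shown on this slide (the Reader type and all names are mine, not ClickHouse's):

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

struct Reader
{
    const uint8_t * pos;
    const uint8_t * end;

    uint8_t readByte()
    {
        if (pos == end)
            throw std::runtime_error("unexpected end of block");
        return *pos++;
    }

    /// Variable-length integer: 7 bits per byte, high bit means "continue".
    uint64_t readVarUInt()
    {
        uint64_t x = 0;
        for (int shift = 0; shift < 64; shift += 7)
        {
            const uint8_t b = readByte();
            x |= uint64_t(b & 0x7f) << shift;
            if (!(b & 0x80))
                return x;
        }
        throw std::runtime_error("malformed varint");
    }

    std::string readString()
    {
        std::string s(readVarUInt(), '\0');
        for (char & c : s)
            c = char(readByte());
        return s;
    }
};

void decodeBlockHeader(Reader & in)
{
    /// Block info: (field_num, value) pairs, terminated by field_num = 0.
    while (const uint64_t field_num = in.readVarUInt())
    {
        if (field_num == 1)
            std::printf("is_overflows: %d\n", in.readByte());
        else if (field_num == 2)
        {
            uint32_t bucket = 0;        /// bucket_num is a little-endian Int32
            for (int i = 0; i < 4; ++i)
                bucket |= uint32_t(in.readByte()) << (8 * i);
            std::printf("bucket_num: %d\n", int32_t(bucket));
        }
    }
    const uint64_t columns = in.readVarUInt();
    const uint64_t rows = in.readVarUInt();
    std::printf("columns: %llu, rows: %llu\n",
                (unsigned long long) columns, (unsigned long long) rows);
    for (uint64_t i = 0; i < columns; ++i)
    {
        std::printf("name: %s\n", in.readString().c_str());
        std::printf("type: %s\n", in.readString().c_str());
        /// ...the column data itself follows (here, the uniqIf state).
    }
}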
I wrote a program that flips a bit at every possible position and computes the checksum.
It turned out that flipping a single bit yields exactly the corrupted checksum from the error message:
echo -ne "\x82\x6b\x00\x00\x00\x62\x00\x00\x00\xf2\x3b\x01\x00\x02\xff\xff\xff\xff\x00\x01
\x01\x2c\x75\x6e\x69\x71\x49\x66\x28\x44\x65\x76\x69\x63\x65\x49\x44\x48\x61\x73\x68\x2c\x20
\x65\x71\x75\x61\x6c\x73\x28\x53\x65\x73\x73\x69\x6f\x6e\x54\x79\x70\x65\x2c\x20\x30\x29\x29
\x28\x41\x67\x67\x72\x65\x67\x61\x74\x65\x46\x75\x6e\x63\x74\x69\x6f\x6e\x28\x3f\x00\xf0\x03
\x2c\x20\x55\x49\x6e\x74\x36\x34\x2c\x20\x55\x49\x6e\x74\x38\x29\x00\x00" \
| ./checksum | grep 8b40502d2ffe5b712b52e03c505ca49f
8b40502d2ffe5b712b52e03c505ca49f 32, 6
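A sketch of such a program (not the original). It assumes the block checksum is CityHash128 of the compressed frame, as ClickHouse computes it; note that ClickHouse uses CityHash v1.0.2, so a newer CityHash, or a different ordering of the two 64-bit halves in the hex output, would need adjusting:

#include <city.h>
#include <cstdio>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    /// Read the whole compressed block from stdin.
    std::string data(std::istreambuf_iterator<char>(std::cin), {});

    /// Flip every bit in turn, print the checksum of the modified
    /// block together with the byte and bit position, then restore.
    for (size_t byte = 0; byte < data.size(); ++byte)
    {
        for (int bit = 0; bit < 8; ++bit)
        {
            data[byte] ^= char(1 << bit);
            const uint128 sum = CityHash128(data.data(), data.size());
            std::printf("%016llx%016llx %zu, %d\n",
                        (unsigned long long) Uint128High64(sum),
                        (unsigned long long) Uint128Low64(sum),
                        byte, bit);
            data[byte] ^= char(1 << bit);
        }
    }
}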
With a software error (for example, memory corruption),
a single bit flip is unlikely.
But how to localize the source of the problem?
The problem appeared and disappeared on certain dates.
Affected servers are grouped by names:
mtxxxlog01-{39..44,57..58,64,68..71,73..74,76}-3
Groups of problematic servers match those
from February.
Problematic servers are in VLA-03, VLA-04,
and non-problematic third replicas — in VLA-02.
The NOC reported that they had changed switches exactly on those dates.
After replacing switches, the problem disappeared.
But:
— why doesn't ECC memory on switches help?
— why don't TCP checksums help?
— why don't Ethernet checksums help?
https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
A 128-bit checksum is calculated for data blocks.
We correctly report the error to the user.
Data corrupted during network transmission is not written anywhere.
Data stored in ClickHouse remains intact!
Actually, we calculate three checksums.
1. For compressed data blocks when writing to file, to network.
2. Overall checksum of compressed data for verification during replication.
3. Overall checksum of uncompressed data for verification during replication.
These checksums don't slow things down!
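As a sketch, the pattern for checksum no. 1 (assuming CityHash128 over the compressed frame; this is not the actual ClickHouse code):

#include <city.h>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct Checksum
{
    uint64_t low = 0;
    uint64_t high = 0;
};

/// Verify the per-block checksum before anything is decompressed,
/// whether the block came from disk or from the network.
void assertBlockValid(const char * compressed, size_t size, Checksum expected)
{
    const uint128 actual = CityHash128(compressed, size);
    if (Uint128Low64(actual) != expected.low
        || Uint128High64(actual) != expected.high)
        throw std::runtime_error("Checksum doesn't match: corrupted data");
    /// Only now is the block handed to the decompressor.
}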
Code: 40, e.displayText() = DB::Exception: Checksum doesn't match: corrupted data. Reference: c61530c3faa0827150b1634f0f87f274. Actual: d1e57e9605d7100d31df4f1ced3d53d5. Size of compressed block: 405633. The mismatch is caused by single bit flip in data block at byte 332325, bit 6. This is most likely due to hardware failure. If you receive broken data over network and the error does not repeat every time, this can be caused by bad RAM on network interface controller or bad controller itself or bad RAM on network switches or bad CPU on network switches (look at the logs on related network switches; note that TCP checksums don't help) or bad RAM on host (look at dmesg or kern.log for enormous amount of EDAC errors, ECC-related reports, Machine Check Exceptions, mcelog; note that ECC memory can fail if the number of errors is huge) or bad CPU on host. If you read data from disk, this can be caused by disk bit rot. This exception protects ClickHouse from data corruption due to hardware failures.
JSON is written to the database in one of the columns:
user parameters from the JS counter code.
{"jserrs":{"1392":{"Failed to execute 'postMessage' on 'Window': Invalid target origin
'*xmlOÀîåhttp[ÂeSans)`USBÆCrugEêXR÷!add[O²done_ÁÅÐnex-µ4noneOQàepreví<put|;4aí abbr¹êalt^<pa
rea¯\rasîÚaxis³kmb2ÈBbaseÔúábdi7*bdo \n#bigRä2JbodyïUbr\r¡»chaA8cit¡Ñ°codevÄcolw▒KcolsQ▒spPdata¿xû
ddS+&del¹ÝNdfn ó*Ndir÷divNdlæÅ|dt5)emZC▒end[¡½faceFfont@?for \rform ;U<h1▒h23\n¤êh4{h5«úh6Ò®hea
dìE³highomhrHvXhref▒¸°lhtml²üjöi\rÆ]id´ìimg\nWinsÅNîisrkbd)kkind~8lang:SW©lib»øjlink&ælistjkÂloopP
lowGÎmainØümap 5åûmarN¡4max@æÌmenuK¥Ömet6!min4äname óþènav³gnobrÌÎeolEÏÑopenSÕpØ partIping&õpre]aÀ
q¯yHrbVÄrelÝjºrevÆoõroleÓrows´Drp,rtObBrtc5Ãrubyz§«sú17sampÂsizmslotÃ|lªspan¿¯srcÂoQstepÒÓðsub*bsu
pÔMtd}*text8!¾;th/time)²ªýtr¤ïtto{=type6▒4Ëu\"¼ul\rvar5~wbrëþswrapH¼xmpø4bboxÒRùbiasn{)byåclip¡Lcx
cyõá\"dyò7defsYdescÐdurºxEdxéúsdy|tãfillø5fr6öfromgÓéfx¯îfyÍúgZ ±in▒F×in2 ¶±#kS4ek1Ýák2MQk3îk4O él
ine´maskã=ËmodÜØpath ×ÜrÖçrectaà¶refX4×refYONrxe#âryñ´åseedÇ@osetÞ÷stop Zzsvg¥Ìtoz°u1)P2u2®ãuseªÛv
iewÑDìxîæx1UÇÙx2\rLyîy1'2´zj³0showÚ|mathøûûmiâ 7mnñõmoØUPms)rNodeã7;blurÔúcopy ìgcut0ddrag¬Ìdrop¹
èexit@É=}loadÎêmuteN`£playHpushqßsyncbÑ\\zoomH Ô¾css¬¾iconRp--P!0DateÝÒETagjj½GETr8HEAD!ïLink?_ÌPO
STÑiPUT▒4WøVary 5©tel \"¡url ¨dateH3ØfilegYÛweekBCridèÙ%scanN all#mËtty átvÀîå\nhttpÀîå\nhttpÀîå\
nhttp\nfileyP\ndataÉä\nwssyP\ndataÀîå\nhttpyP\ndataÏo«B¼PèC B C S\nmidiÀQ\nvr È\nusbB¾ð\n/>% .not1
àBcue 4`left\n ¯ÅI.autoXfq«B«°b\nNnòxtB¸Ð*jpgBBÐJ3140B ÐîG¾.elemB¿°\"ýÿÀ1B²ÐJ100BæH¾.IèØÀMHTMLàNÀ1
®ù£1ttpB °§¶.5562NÀaOKTBµð C B`~BB°p*3371À.0ate▒,èýgzipB¦*jpgB\n¹TyesBðE£±N95teBÊjpgÞBð°òb0B°\"Óòd
3BªMNYa00BPØEdaypB§Ð\"¶òb6B£p0:5RÀ9qN\n-B¾°¡ëÇchatB¬\"²òb2B¾0\n4561B·0Êc3B@*jpgBB¦0 2¾îwideB¤·xroo
tB¢õ.pt22B» Npt17B® }}Rpt21BÐ}@ï\nenr1B`r¤¢pt20B¼ð*jpgÞB´`?OÔpt19BШ þ.4777B¢ ç [pt18B0*jpgBBÐW$AB
~¡=pt16Bª°\"Õòd5B§ `8·pt15B·ÌN6157B® ¾vSpt14Tahoma1Bº`1bËpt13BÊjpgÞB¶À°jpt12B§P ▒r.5768Bàþâ>pt11B·
?BИCBWB(DBBP^B? Bш?B@?BяяяяяяяяяяяяяяяяяяяяяяяяLяяяяяяяяяяяяяяяяяяяяяяяяЬаAР1яяяяяяяя0ПT?ПTПThПTРП
T8йОTр\nПT \nПTpкОTиОTЁЮОTа?ОTx?ОT@?ОT?ЙОTdGИ1¤GИ1яяяяяяяяяяяяяяяяяяяяяяяяL@sР1яяяяр?B`ОВ1ёB`ОВ1?B
8BАBР?BPYюEяяяя2тґ?\rhttp://avatar.botva.ru/fight_log.php?log_id=123456x\\х1!`ОВ1яяяяяяяяяяяяяяяяя
яяяяяяяяИCB?жB(DBBюяяя JBюяяяюяяяюяяяюяяяюяяяюяяя▒BGB> B0:>5 MDD5:B ?@O6:8?ИCBШєB(DBBИ¤mB8ЧmBИ№mB6n
B -nBЁДmBр?BB?tHB ?B3?ж\rhttps://i.botva.ru/css/colorbox/images/controls.pngяяяяяяяяpBР?BX)BИ§mBа:
nBp`mBяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяя`?BШ?B0ґBя}B`сm-?Жd\rhttps://i.botva.ru/i/global/icon
/promo200.pngose(µЙT??ЙT??ЙTP«ЙTяяяяяяяяяяяяяяяяяяяяЬяяяяяяяяюяяя?JBюяяяюяяяюяяяюяяяюяяяюяяя-\nhtt
ps://www.google-analytics.com/analytics.js4d\\<\rhttps://i.botva.ru/i/global ýclass+Plassij3cod
ebas(.rcodetyp×ø8colgroupQ»)colorw>colspan- command×¢compactZ4icontent³h8controlsmt27BºPé'MNap11BW
:mt26Bp\nwoffB«@u-úmt25B¨°*+6Ù1BÀ¾t/mt24 OÔ1À9qJ\nBB@\\Rmt23B¤¼òbcB¯ÑÔLmt22B¨\njpgBB Cc³mt21B¼°h¨N
399BB´\"çmt20B¥°105Bà^³mt19B«ÐBðCàBB¬àv°mt18q2.12Ó1B¶ ºmt17B© v½B0Â1pTHBB`!Émt16B¯0½òbdB@Ü/omt15B¹
°i÷ÊÎ3483B¾`æhÉmt14B¹PJ100BB0Ûÿmt13BPù.\n B»ML±mt12B¥!7îJbtnCB@%©»mt11Ú§.15
ÿÿB¦óAtmt10Bp.¤bc2vB¨ ÀîÝmt92BðC B¸õCB£LJmt83B¨Ð(L5780B¶)ùmt74B\"Áòc1B Fìmt65B¤PÀÀ1УÀ1 À1B \"Îmt5
6BÃ|IExB`4 çmt47B0mµ.2093Bñ3[mt38B²P\n...B«ÀøD)mt29B *jpglB9\\mt1aB¬¸òb8B¾
Memory corruption again?
Race condition?
And how are we going to debug this?
And why did we choose C++?
Path to the solution:
— it's garbage, but valid UTF-8;
— the same garbage was written to two clusters independently;
— there are lots of «я» letters in the garbage!
(In CP1251, «я» is the byte 0xFF, so the runs of «я» look like
0xFF-filled memory, transcoded to UTF-8 on the client side.)
Yandex.Metrica collects traffic from > 1 billion devices on the internet.
The database stores > 30,000,000,000,000 rows (page views).
User devices sometimes have bugs...
These bugs are best filtered out before they are written to the database.