Author: Alexey Milovidov, 2019-08-17.
Why write specialized data structures yourself?
— because we can?
— just felt like it?
— NIH (not invented here).
It's better to use instead
... reliable, modern, scalable, fault-tolerant, supported, time-tested, community-approved technologies, developed by the best engineers, who wrote perfect code that solves exactly the task you need.
For classifying website visits as human / robot.
Developed in 2012 for Yandex.Metrica needs.
I can neither confirm nor deny
the fact that it's still running in production.
A set of servers to which we send information about traffic.
Servers update and provide statistics on traffic properties.
A machine-learned formula works on the statistics.
— Event time
— IP address, IP network, geolocation;
— User cookie;
— URL and Referer, as well as their domains;
— User-Agent;
...
Counters — number of events by key:
— for example — number of hits for a class C IP network per minute.
Cardinalities — number of unique values for a key:
— example: number of unique cookies for an IP address;
— example: number of unique IP addresses for a cookie;
— example: number of unique sites visited by a user per hour.
Time statistics:
— example: variance of the difference between adjacent events (a sketch follows this list).
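For the time statistics, a minimal sketch of keeping a running variance of the deltas between adjacent event timestamps, using Welford's online algorithm; the struct and field names are illustrative, not the production layout:

#include <cstdint>

struct DeltaVariance
{
    uint64_t prev_time = 0;     /// timestamp of the previous event (0 = none seen yet)
    uint64_t n = 0;             /// number of deltas seen
    double mean = 0.0;          /// running mean of the deltas
    double m2 = 0.0;            /// running sum of squared deviations

    void update(uint64_t event_time)
    {
        if (prev_time != 0)
        {
            double delta = double(event_time - prev_time);
            ++n;
            double diff = delta - mean;
            mean += diff / n;
            m2 += diff * (delta - mean);
        }
        prev_time = event_time;
    }

    double variance() const { return n > 1 ? m2 / double(n - 1) : 0.0; }
};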
Incoming traffic 600,000 events per second,
— 30 billion events per day.
For each event, we update statistics for 15 keys
— 10 million key lookups/sec.
What hardware to use for 10 million lookups/sec.?
HDD — 100..300 lookups/sec.
Would need 100,000 HDDs without redundancy or ~10,000 servers.
SSD — ~100,000 lookups/sec.
Would need 100 SSDs without redundancy or 10 servers?
— but rule this out, since the read/write ratio is 1:1.
RAM — 10 million / 40 million / 500 million last-level (LL) cache misses/sec.
Want to store data for 1..2 days.
The fattest keys — UserID, URL, Referer, IP.
Cardinalities:
— URL — 1.5 billion per day;
— UserID — 450 million per day;
— IP — 100 million per day.
Total ~5 billion keys.
Statistics per key — around 1 KB.
— 5 TB per day.
In 2012, 128 GB of RAM was used per server.
— 40 servers without redundancy.
Even a single URL can be several kilobytes.
— never store strings, only 8-byte hashes.
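For illustration, reducing a string key to its 8-byte hash might look like the sketch below; std::hash is used only for brevity, while production code would use a proper 64-bit hash such as CityHash64 or SipHash (an assumption here):

#include <cstdint>
#include <functional>
#include <string>

/// Store and look up keys like URL or Referer only by this 8-byte value.
uint64_t keyHash(const std::string & url)
{
    return std::hash<std::string>{}(url);   /// 8 bytes instead of kilobytes of text
}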
We need to calculate deliberately poor (very approximate) statistics.
1. Counters.
UInt64 count = 0;
void update() { ++count; }
8 bytes — far too much.
Can we count from one to a billion using one byte?
Yes — you just need to use a random number generator.
— for counter < 8, add one with probability 1;
— for counter in [8..16), add one with probability 1/2;
— for counter in [16..24), add one with probability 1/4;
— ...
— for counter in [248..256), add one with probability 1/2³¹.
The true count is estimated from the counter value by the maximum likelihood method; a saturated one-byte counter corresponds to an expected count on the order of 10¹⁰ events, so a billion fits comfortably.
The method is non-deterministic, but it becomes deterministic if you use a hash function of the data instead of a random number generator (see the sketch below).
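A minimal sketch of such a one-byte probabilistic counter (a Morris-style counter), assuming the bucket scheme above; the struct name and the simple expected-value estimator (instead of the exact maximum-likelihood formula) are illustrative:

#include <cstdint>
#include <random>

struct ApproximateCounter
{
    uint8_t counter = 0;

    void update(std::mt19937_64 & rng)
    {
        if (counter == 255)
            return;                        /// saturated at one byte
        uint64_t k = counter / 8;          /// current bucket: [8k, 8k + 8)
        /// Add one with probability 1/2^k: the k low random bits must all be zero.
        if ((rng() & ((1ULL << k) - 1)) == 0)
            ++counter;
    }

    /// Rough estimate of the true count: each increment made inside
    /// bucket k stands for about 2^k real events.
    uint64_t estimate() const
    {
        uint64_t k = counter / 8;
        uint64_t r = counter % 8;
        uint64_t total = 0;
        for (uint64_t i = 0; i < k; ++i)
            total += 8ULL << i;            /// 8 increments at weight 2^i
        total += r << k;                   /// partial bucket at weight 2^k
        return total;
    }
};

Replacing the random number generator with a hash of the incoming data, as noted above, makes updates deterministic and reproducible.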
2. Cardinalities.
Obviously, we need to use HyperLogLog.
2.5 KB — ~1% error — too good, need worse.
24 bytes — ~50% error — now that's normal.
Can already fit 50 cardinalities in 1 KB.
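To make the mechanics concrete, here is a toy HyperLogLog; 16 one-byte registers give roughly 26% error, so the 24-byte ~50% variant above packs things even tighter. The register count, the layout and the omitted small-range correction are all simplifications (requires C++20 for std::countl_zero):

#include <algorithm>
#include <bit>
#include <cmath>
#include <cstddef>
#include <cstdint>

template <size_t Bits = 4>                       /// 2^4 = 16 registers, ~26% error
struct TinyHyperLogLog
{
    static constexpr size_t M = size_t(1) << Bits;
    uint8_t registers[M] = {};

    void add(uint64_t hash)                      /// caller supplies a 64-bit hash of the value
    {
        size_t bucket = hash >> (64 - Bits);     /// top bits choose a register
        uint64_t rest = hash << Bits;            /// remaining bits
        uint8_t rank = rest ? std::countl_zero(rest) + 1 : uint8_t(64 - Bits + 1);
        registers[bucket] = std::max(registers[bucket], rank);
    }

    double estimate() const
    {
        double sum = 0;
        for (uint8_t r : registers)
            sum += std::pow(2.0, -double(r));
        double alpha = 0.673;                    /// standard bias constant for m = 16
        return alpha * M * M / sum;              /// no small-range correction here
    }
};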
5 TB per day — 50 servers without redundancy.
— but need x2 replication for fault tolerance;
— but need to store data slightly more than a day.
We don't have 100 servers. Task is unsolvable.
Solution — don't store statistics for keys
that occur rarely.
If we see an IP address for the first time — don't store anything about it.
If we encounter it 16 times — start collecting statistics.
But to know that it was encountered for the sixteenth time
— we need to store this somewhere.
Solution — Counting Bloom Filter.
Out of 128 GB per server, allocate 10..20 GB for CBF.
The CBF works like a «sponge» that lets only the important keys through.
Remaining data volume decreases to hundreds of GB
and fits on one server.
Disadvantage — with 3 hash functions, we now incur 3 times as many cache misses.
10 million per second -> 30 million per second.
This can still work... even on one server.
Or invent a cache-local Counting Bloom Filter?
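A sketch of a Counting Bloom Filter used as that sponge, assuming saturating 8-bit counters, 3 probe positions derived from one 64-bit key hash, and a threshold of 16; all of these parameters are illustrative:

#include <algorithm>
#include <cstdint>
#include <vector>

class CountingBloomFilter
{
public:
    explicit CountingBloomFilter(size_t size_bytes) : counters(size_bytes) {}

    /// Bump the counters for a key and return true once the key has
    /// (probably) been seen at least `threshold` times.
    bool addAndCheck(uint64_t key_hash, uint8_t threshold = 16)
    {
        uint8_t min_value = 255;
        for (size_t i = 0; i < num_hashes; ++i)
        {
            size_t pos = probe(key_hash, i) % counters.size();
            if (counters[pos] != 255)               /// saturating 8-bit counter
                ++counters[pos];
            min_value = std::min(min_value, counters[pos]);
        }
        /// The minimum over all probed counters is an upper-bound estimate
        /// of how many times this key has occurred.
        return min_value >= threshold;
    }

private:
    static constexpr size_t num_hashes = 3;         /// 3 probes -> up to 3 cache misses

    /// Derive the i-th probe position from a single 64-bit hash;
    /// production code would use properly independent hash functions.
    static uint64_t probe(uint64_t h, size_t i) { return h + i * 0x9E3779B97F4A7C15ULL; }

    std::vector<uint8_t> counters;
};

A cache-local variant of the same idea would first pick one cache-line-sized block by the key hash and then derive all probe positions inside that block, so the three probes cost roughly one cache miss instead of three.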
Suppose the server should handle
~1 million requests per second.
What technologies to use to write such a server?
— coroutines/fibers?
— DPDK?
We have realtime anti-fraud, but we'll send data
in batches of 1,000 .. 10,000 events anyway.
One second delay is acceptable,
and there are 600,000 events per second.
We'll use a regular HTTP server with a thread pool.
How to ensure concurrent processing
of simultaneously incoming requests?
Maybe use lock-free data structures?
No, better to have one global mutex in the server, and process one request at a time
while others wait.
Trading latency for throughput.
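A minimal sketch of that design, assuming the HTTP framework hands each parsed batch to a handler and Event is a placeholder type:

#include <mutex>
#include <vector>

struct Event { /* parsed columns of one event */ };

std::mutex global_mutex;                              /// the single lock for the whole server

void handleBatch(const std::vector<Event> & batch)
{
    std::lock_guard<std::mutex> lock(global_mutex);   /// other requests wait here
    for (const auto & event : batch)
    {
        /// Update counters, HyperLogLogs, the Counting Bloom Filter, ...
        (void) event;
    }
}

Because each request carries 1,000..10,000 events, the cost of taking one mutex per request is negligible compared to the work done under it.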
How to parallelize request processing across CPU cores?
Split all data into buckets by the remainder of the key hash divided by N.
All data structures in the server (Counting Bloom Filter, hash tables) exist in N instances.
An incoming data batch is handed to a thread pool; each thread processes only its own keys (see the sketch below).
— internal sharding.
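A sketch of the internal sharding under the same assumptions (placeholder Event and Shard types, N chosen arbitrarily); the real server would reuse a long-lived thread pool rather than spawn threads per batch:

#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t N = 16;                      /// number of shards, illustrative

struct Event { uint64_t key_hash; /* other columns */ };

struct Shard
{
    /// Counting Bloom Filter, hash tables, counters, ... for keys
    /// whose key_hash % N equals this shard's index.
    void process(const Event & event) { (void) event; /* update statistics */ }
};

std::array<Shard, N> shards;

void processBatch(const std::vector<Event> & batch)
{
    /// Partition the batch so that each shard sees only its own keys.
    std::array<std::vector<Event>, N> partitions;
    for (const auto & event : batch)
        partitions[event.key_hash % N].push_back(event);

    /// One worker per shard; no locks are needed because shards never share data.
    std::vector<std::thread> workers;
    for (size_t i = 0; i < N; ++i)
        workers.emplace_back([i, &partitions]
        {
            for (const auto & event : partitions[i])
                shards[i].process(event);
        });
    for (auto & worker : workers)
        worker.join();
}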
How to parallelize request processing between servers?
— no way :(
Each server writes a request log to disk and allows the replica to read and process this log.
This same log is used for recovery after failure.
Replication is asynchronous and eventually inconsistent.
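A sketch of the request log, assuming a simple length-prefixed framing; the exact format and flushing policy are assumptions:

#include <cstdint>
#include <cstdio>
#include <vector>

/// Append one incoming batch to the on-disk request log. A replica tails
/// this file and applies the same batches; the same file is replayed for
/// recovery after a crash.
void appendToLog(std::FILE * log, const std::vector<char> & batch)
{
    uint32_t size = static_cast<uint32_t>(batch.size());
    std::fwrite(&size, sizeof(size), 1, log);          /// length prefix
    std::fwrite(batch.data(), 1, batch.size(), log);   /// raw batch bytes
    std::fflush(log);                                  /// hand the data to the OS
}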
How to get rid of old data? Three options:
— split all data into buckets by hour and delete old buckets;
— exponential smoothing: periodically divide all counter values in half (sketched after this list);
— periodically dump to disk, restart the server and read back only the still-current data from the dump.
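A sketch of the exponential smoothing option, assuming counters live in an ordinary hash map keyed by the 8-byte key hash; halving on a timer makes old activity decay geometrically and lets faded keys be dropped:

#include <cstdint>
#include <unordered_map>

void decayCounters(std::unordered_map<uint64_t, uint32_t> & counters)
{
    for (auto it = counters.begin(); it != counters.end();)
    {
        it->second /= 2;                  /// exponential decay
        if (it->second == 0)
            it = counters.erase(it);      /// the key has faded out, free its memory
        else
            ++it;
    }
}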
Input: one event — 50 columns of UInt64 — 400 bytes ~ 4 GBit/sec.
Output:
— transmit all calculated statistics — 500 columns of UInt32 — 2 KB per event ~ 20 GBit/sec;
— or transmit only the result of the machine-learned formula — one float — 4 bytes per event.
But we have a 1 GBit network :(
Just compress the data.
LZ4 — too weak.
QuickLZ — for 2012, there was no better alternative,
now not relevant.
2019 — use ZSTD or Brotli.
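A minimal sketch of compressing an outgoing batch with the ZSTD C library recommended above; the compression level and the error handling are simplified:

#include <vector>
#include <zstd.h>

std::vector<char> compressBatch(const char * data, size_t size, int level = 3)
{
    std::vector<char> compressed(ZSTD_compressBound(size));   /// worst-case output size
    size_t written = ZSTD_compress(compressed.data(), compressed.size(), data, size, level);
    if (ZSTD_isError(written))
        return {};                                            /// caller treats empty as failure
    compressed.resize(written);
    return compressed;
}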
How to make decisions about traffic being robotic?
Machine-learned MatrixNet formula.
Now more advanced technology is available
in open-source: CatBoost.
What could have been used instead:
— Redis;
— Aerospike;
— Couchbase;
— Cassandra.
Bonus:
— YT Dynamic Tables;
— RTMR;
— YDB.
Have a good understanding of hardware capabilities.
Have a good understanding of task properties and its numerical characteristics.
Have a good understanding of internal structure of available data stores.
Confidence and courage.
Use arithmetic to relate the properties of the task to the capabilities of the hardware.
Be ready to investigate when theory doesn't match practice.
Visitor history database.
Visitor session calculation.
— specialized data structures on SSD+RAM.