Author: Alexey Milovidov, 2019-08-17.
Why write specialized data structures yourself?
— because we can?
— just felt like it?
— NIH (not invented here).
It's better to use instead
... reliable, modern, scalable, fault-tolerant, supported, time-tested, community-approved technologies, developed by the best engineers, who wrote perfect code that solves exactly the task you need.
For classifying website visits as human / robot.
Developed in 2012 for Yandex.Metrica needs.
I can neither confirm nor deny
the fact that it's still running in production.
A set of servers to which we send information about traffic.
Servers update and provide statistics on traffic properties.
A machine-learned formula works on the statistics.
— Event time
— IP address, IP network, geolocation;
— User cookie;
— URL and Referer, as well as their domains;
— User-Agent;
...
Counters — number of events by key:
— for example — number of hits for a class C IP network per minute.
Cardinalities — number of unique values for a key:
— example: number of unique cookies for an IP address;
— example: number of unique IP addresses for a cookie;
— example: number of unique sites visited by a user per hour.
Time statistics:
— example: variance of the difference between adjacent events (a sketch follows this list).
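For the time statistics, a minimal sketch of keeping a running variance of the deltas between adjacent event timestamps, using Welford's online algorithm; the struct and field names are illustrative, not the production layout:

#include <cstdint>

struct DeltaVariance
{
    uint64_t prev_time = 0;     /// timestamp of the previous event (0 = none seen yet)
    uint64_t n = 0;             /// number of deltas seen
    double mean = 0.0;          /// running mean of the deltas
    double m2 = 0.0;            /// running sum of squared deviations

    void update(uint64_t event_time)
    {
        if (prev_time != 0)
        {
            double delta = double(event_time - prev_time);
            ++n;
            double diff = delta - mean;
            mean += diff / n;
            m2 += diff * (delta - mean);
        }
        prev_time = event_time;
    }

    double variance() const { return n > 1 ? m2 / double(n - 1) : 0.0; }
};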
Incoming traffic 600,000 events per second,
— 30 billion events per day.
For each event, we update statistics for 15 keys
— 10 million key lookups/sec.
What hardware to use for 10 million lookups/sec.?
HDD — 100..300 lookups/sec.
Would need 100,000 HDDs without redundancy or ~10,000 servers.
SSD — ~100,000 lookups/sec.
Would need 100 SSDs without redundancy or 10 servers?
— but rule this out, since the read/write ratio is 1:1.
RAM — 10 million / 40 million / 500 million last-level (LL) cache misses/sec.
Want to store data for 1..2 days.
The fattest keys — UserID, URL, Referer, IP.
Cardinalities:
— URL — 1.5 billion per day;
— UserID — 450 million per day;
— IP — 100 million per day.
Total ~5 billion keys.
Statistics per key — around 1 KB.
— 5 TB per day.
In 2012, 128 GB of RAM was used per server.
— 40 servers without redundancy.
Even a single URL can be several kilobytes.
— never store strings, only 8-byte hashes.
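For illustration, reducing a string key to its 8-byte hash might look like the sketch below; std::hash is used only for brevity, while production code would use a proper 64-bit hash such as CityHash64 or SipHash (an assumption here):

#include <cstdint>
#include <functional>
#include <string>

/// Store and look up keys like URL or Referer only by this 8-byte value.
uint64_t keyHash(const std::string & url)
{
    return std::hash<std::string>{}(url);   /// 8 bytes instead of kilobytes of text
}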
We need to calculate deliberately poor (very approximate) statistics.
1. Counters.
UInt64 count = 0;
void update() { ++count; }
8 bytes — far too much.
Can we count from one to a billion using one byte?
Yes — you just need to use a random number generator.
— for counter < 8, add one with probability 1;
— for counter in [8..16), add one with probability 1/2;
— for counter in [16..24), add one with probability 1/4;
— ...
— for counter in [248..256), add one with probability 1/2³¹.
The true count is estimated from the counter value by the maximum likelihood method; a saturated one-byte counter corresponds to an expected count on the order of 10¹⁰ events, so a billion fits comfortably.
The method is non-deterministic, but it becomes deterministic if you use a hash function of the data instead of a random number generator (see the sketch below).
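A minimal sketch of such a one-byte probabilistic counter (a Morris-style counter), assuming the bucket scheme above; the struct name and the simple expected-value estimator (instead of the exact maximum-likelihood formula) are illustrative:

#include <cstdint>
#include <random>

struct ApproximateCounter
{
    uint8_t counter = 0;

    void update(std::mt19937_64 & rng)
    {
        if (counter == 255)
            return;                        /// saturated at one byte
        uint64_t k = counter / 8;          /// current bucket: [8k, 8k + 8)
        /// Add one with probability 1/2^k: the k low random bits must all be zero.
        if ((rng() & ((1ULL << k) - 1)) == 0)
            ++counter;
    }

    /// Rough estimate of the true count: each increment made inside
    /// bucket k stands for about 2^k real events.
    uint64_t estimate() const
    {
        uint64_t k = counter / 8;
        uint64_t r = counter % 8;
        uint64_t total = 0;
        for (uint64_t i = 0; i < k; ++i)
            total += 8ULL << i;            /// 8 increments at weight 2^i
        total += r << k;                   /// partial bucket at weight 2^k
        return total;
    }
};

Replacing the random number generator with a hash of the incoming data, as noted above, makes updates deterministic and reproducible.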
2. Cardinalities.
Obviously, we need to use HyperLogLog.
2.5 KB — ~1% error — too good, need worse.
24 bytes — ~50% error — now that's normal.
Can already fit 50 cardinalities in 1 KB.
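To make the mechanics concrete, here is a toy HyperLogLog; 16 one-byte registers give roughly 26% error, so the 24-byte ~50% variant above packs things even tighter. The register count, the layout and the omitted small-range correction are all simplifications (requires C++20 for std::countl_zero):

#include <algorithm>
#include <bit>
#include <cmath>
#include <cstddef>
#include <cstdint>

template <size_t Bits = 4>                       /// 2^4 = 16 registers, ~26% error
struct TinyHyperLogLog
{
    static constexpr size_t M = size_t(1) << Bits;
    uint8_t registers[M] = {};

    void add(uint64_t hash)                      /// caller supplies a 64-bit hash of the value
    {
        size_t bucket = hash >> (64 - Bits);     /// top bits choose a register
        uint64_t rest = hash << Bits;            /// remaining bits
        uint8_t rank = rest ? std::countl_zero(rest) + 1 : uint8_t(64 - Bits + 1);
        registers[bucket] = std::max(registers[bucket], rank);
    }

    double estimate() const
    {
        double sum = 0;
        for (uint8_t r : registers)
            sum += std::pow(2.0, -double(r));
        double alpha = 0.673;                    /// standard bias constant for m = 16
        return alpha * M * M / sum;              /// no small-range correction here
    }
};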
5 TB per day — 50 servers without redundancy.
— but need x2 replication for fault tolerance;
— but need to store data slightly more than a day.
We don't have 100 servers. Task is unsolvable.
Solution — don't store statistics for keys
that occur rarely.
If we see an IP address for the first time — don't store anything about it.
If we encounter it 16 times — start collecting statistics.
But to know that it was encountered for the sixteenth time
— we need to store this somewhere.
Solution — Counting Bloom Filter.
Out of 128 GB per server, allocate 10..20 GB for CBF.
The CBF works like a «sponge» that lets only the important keys through.
Remaining data volume decreases to hundreds of GB
and fits on one server.
Disadvantage — with 3 hash functions, we now incur 3 times as many cache misses.
10 million per second -> 30 million per second.
This can still work... even on one server.
Or invent a cache-local Counting Bloom Filter?
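A sketch of a Counting Bloom Filter used as that sponge, assuming saturating 8-bit counters, 3 probe positions derived from one 64-bit key hash, and a threshold of 16; all of these parameters are illustrative:

#include <algorithm>
#include <cstdint>
#include <vector>

class CountingBloomFilter
{
public:
    explicit CountingBloomFilter(size_t size_bytes) : counters(size_bytes) {}

    /// Bump the counters for a key and return true once the key has
    /// (probably) been seen at least `threshold` times.
    bool addAndCheck(uint64_t key_hash, uint8_t threshold = 16)
    {
        uint8_t min_value = 255;
        for (size_t i = 0; i < num_hashes; ++i)
        {
            size_t pos = probe(key_hash, i) % counters.size();
            if (counters[pos] != 255)               /// saturating 8-bit counter
                ++counters[pos];
            min_value = std::min(min_value, counters[pos]);
        }
        /// The minimum over all probed counters is an upper-bound estimate
        /// of how many times this key has occurred.
        return min_value >= threshold;
    }

private:
    static constexpr size_t num_hashes = 3;         /// 3 probes -> up to 3 cache misses

    /// Derive the i-th probe position from a single 64-bit hash;
    /// production code would use properly independent hash functions.
    static uint64_t probe(uint64_t h, size_t i) { return h + i * 0x9E3779B97F4A7C15ULL; }

    std::vector<uint8_t> counters;
};

A cache-local variant of the same idea would first pick one cache-line-sized block by the key hash and then derive all probe positions inside that block, so the three probes cost roughly one cache miss instead of three.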
Suppose the server should handle
~1 million requests per second.
What technologies to use to write such a server?
— coroutines/fibers?
— DPDK?
We have realtime anti-fraud, but we'll send data
in batches of 1,000 .. 10,000 events anyway.
One second delay is acceptable,
and there are 600,000 events per second.
We'll use a regular HTTP server with a thread pool.
How to ensure concurrent processing
of simultaneously incoming requests?
Maybe use lock-free data structures?
No, better to have one global mutex in the server, and process one request at a time
while others wait.
Trading latency for throughput.
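A minimal sketch of that design, assuming the HTTP framework hands each parsed batch to a handler and Event is a placeholder type:

#include <mutex>
#include <vector>

struct Event { /* parsed columns of one event */ };

std::mutex global_mutex;                              /// the single lock for the whole server

void handleBatch(const std::vector<Event> & batch)
{
    std::lock_guard<std::mutex> lock(global_mutex);   /// other requests wait here
    for (const auto & event : batch)
    {
        /// Update counters, HyperLogLogs, the Counting Bloom Filter, ...
        (void) event;
    }
}

Because each request carries 1,000..10,000 events, the cost of taking one mutex per request is negligible compared to the work done under it.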
How to parallelize request processing across CPU cores?
Split all data into buckets by the remainder of the key hash divided by N.
All data structures in the server (Counting Bloom Filter, hash tables) exist in N instances.
An incoming data batch is handed to a thread pool; each thread processes only its own keys (see the sketch below).
— internal sharding.
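A sketch of the internal sharding under the same assumptions (placeholder Event and Shard types, N chosen arbitrarily); the real server would reuse a long-lived thread pool rather than spawn threads per batch:

#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t N = 16;                      /// number of shards, illustrative

struct Event { uint64_t key_hash; /* other columns */ };

struct Shard
{
    /// Counting Bloom Filter, hash tables, counters, ... for keys
    /// whose key_hash % N equals this shard's index.
    void process(const Event & event) { (void) event; /* update statistics */ }
};

std::array<Shard, N> shards;

void processBatch(const std::vector<Event> & batch)
{
    /// Partition the batch so that each shard sees only its own keys.
    std::array<std::vector<Event>, N> partitions;
    for (const auto & event : batch)
        partitions[event.key_hash % N].push_back(event);

    /// One worker per shard; no locks are needed because shards never share data.
    std::vector<std::thread> workers;
    for (size_t i = 0; i < N; ++i)
        workers.emplace_back([i, &partitions]
        {
            for (const auto & event : partitions[i])
                shards[i].process(event);
        });
    for (auto & worker : workers)
        worker.join();
}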
How to parallelize request processing between servers?
— no way :(
Each server writes a request log to disk and allows the replica to read and process this log.
This same log is used for recovery after failure.
Replication is asynchronous and eventually inconsistent.
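A sketch of the request log, assuming a simple length-prefixed framing; the exact format and flushing policy are assumptions:

#include <cstdint>
#include <cstdio>
#include <vector>

/// Append one incoming batch to the on-disk request log. A replica tails
/// this file and applies the same batches; the same file is replayed for
/// recovery after a crash.
void appendToLog(std::FILE * log, const std::vector<char> & batch)
{
    uint32_t size = static_cast<uint32_t>(batch.size());
    std::fwrite(&size, sizeof(size), 1, log);          /// length prefix
    std::fwrite(batch.data(), 1, batch.size(), log);   /// raw batch bytes
    std::fflush(log);                                  /// hand the data to the OS
}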
How to get rid of old data? Three options:
— split all data into buckets by hour and delete old buckets;
— exponential smoothing: periodically divide all counter values in half (sketched after this list);
— periodically dump to disk, restart the server and read back only the still-current data from the dump.
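A sketch of the exponential smoothing option, assuming counters live in an ordinary hash map keyed by the 8-byte key hash; halving on a timer makes old activity decay geometrically and lets faded keys be dropped:

#include <cstdint>
#include <unordered_map>

void decayCounters(std::unordered_map<uint64_t, uint32_t> & counters)
{
    for (auto it = counters.begin(); it != counters.end();)
    {
        it->second /= 2;                  /// exponential decay
        if (it->second == 0)
            it = counters.erase(it);      /// the key has faded out, free its memory
        else
            ++it;
    }
}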
Input: one event — 50 columns of UInt64 — 400 bytes ~ 4 GBit/sec.
Output:
— transmit all calculated statistics — 500 columns of UInt32 — 2 KB per event ~ 20 GBit/sec;
— or transmit only the result of the machine-learned formula — one float — 4 bytes per event.
But we have a 1 GBit network :(
Just compress the data.
LZ4 — too weak.
QuickLZ — for 2012, there was no better alternative,
now not relevant.
2019 — use ZSTD or Brotli.
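A minimal sketch of compressing an outgoing batch with the ZSTD C library recommended above; the compression level and the error handling are simplified:

#include <vector>
#include <zstd.h>

std::vector<char> compressBatch(const char * data, size_t size, int level = 3)
{
    std::vector<char> compressed(ZSTD_compressBound(size));   /// worst-case output size
    size_t written = ZSTD_compress(compressed.data(), compressed.size(), data, size, level);
    if (ZSTD_isError(written))
        return {};                                            /// caller treats empty as failure
    compressed.resize(written);
    return compressed;
}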
How to make decisions about traffic being robotic?
Machine-learned MatrixNet formula.
Now more advanced technology is available
in open-source: CatBoost.
What could have been used instead:
— Redis;
— Aerospike;
— Couchbase;
— Cassandra.
Bonus:
— YT Dynamic Tables;
— RTMR;
— YDB.
Have a good understanding of hardware capabilities.
Have a good understanding of task properties and its numerical characteristics.
Have a good understanding of internal structure of available data stores.
Confidence and courage.
Use arithmetic to relate the properties of the task to the capabilities of the hardware.
Be ready to investigate when theory doesn't match practice.
Visitor history database.
Visitor session calculation.
— specialized data structures on SSD+RAM.