Database Obfuscation

Author: Alexey Milovidov, 2019-04-08.

Database Obfuscation

Alexey Milovidov

Is ClickHouse Fast?

ClickHouse is fast.

This is a known fact.

Everyone knows this.

How Can We Be Sure?

We need performance tests.

We added the first automated tests in 2013.

Based on the Yandex.Metrica dataset of 1 billion page views.

Queries — those expected to be executed from the Yandex.Metrica interface.

Added comparisons with other systems: Vertica, MonetDB.

Some of the functional tests use the same dataset.

Tests on Closed Data

In 2013, ClickHouse was just beginning to be used inside Yandex.

In 2016, ClickHouse became open-source
and closed tests became a serious problem.

Tests on Closed Data

Performance tests are not reproducible externally
— they require closed data that cannot be published.

Some of the functional tests are likewise unavailable externally.

Performance tests don't evolve. There is a need to significantly expand the performance test suite to check for performance changes in individual system components in isolation.

Performance tests are not run per commit or for pull requests, so external developers have no way to check their code for performance regressions.

Solutions

Throw away old tests?

— known benchmarks:
TPC-H, TPC-DS, Star Schema Benchmark, AMPLab Big Data Benchmark;

— open datasets:
Transportation Statistics "Ontime" data; NYC Taxi rides;

But:

— need to check performance regressions on our queries;
— need to preserve existing functional tests;
— rewriting tests is too costly.


It's important to test performance
on real production data!

Real Data is Important

Example: if we generate test data uniformly and randomly,
then the data won't compress.

But data compression is the most important characteristic for analytical databases.

Data compression is always a trade-off between performance and compression ratio, where there is no single correct solution.

If we don't consider data compression when testing performance, the results are meaningless.

https://yandex.github.io/clickhouse-presentations/highload_siberia_2018/

Real Data is Important

Conclusion:

Test data must have
realistic compression ratios.

Real Data is Important

Example:

Suppose we're interested in the query performance:

SELECT RegionID, uniq(UserID) AS visitors
FROM test.hits
GROUP BY RegionID
ORDER BY visitors DESC
LIMIT 10

Such a query is typical for Yandex.Metrica.

What's Important for Query Performance?

— which data structure is used for the uniq aggregate function;

— how many different RegionID values exist;

— how much memory each uniq function state will require;

— it's important to know that the data volume for different regions is distributed unevenly (power law);

— and if so — then it's important that uniq aggregate function states with few values use little memory.

What's Important for Query Performance?

If we simply use HyperLogLog,
we get fixed memory consumption
for each aggregation key (RegionID)

— and guaranteed poor performance.

Real Data is Important

Conclusion:

Test data must represent distribution properties:

— cardinalities — number of distinct values in columns;

— joint cardinalities of multiple columns;

* but these requirements slightly contradict anonymization.

Real Data is Important

Example:

Suppose we're testing the performance
not of an analytical database, but something simpler:
for example, hash tables.

Hash table performance depends on the balance
between quality and speed of the hash function.

Real Data is Important

um, it's just a trivial test ...
clickhouse hashmap takes 10GB
this only takes 2
and it's 60% faster
Alexey Milovidov
This test almost doesn't make sense...
Let me share a dataset with real strings...
[ File : data.tar ]
clickhouse hashmap is faster
from 2 times to 5 times

The Task

Obtain data for performance testing.

Same structure as the closed data.

Anonymized, suitable for public release.

But preserving properties important for testing:

— compression ratios;
— value cardinalities;
— probability distribution properties.

Maximum Task

Create a tool available to external users,
with which anyone can anonymize their dataset for publication.

So we can debug and test performance
on other people's data, similar to production.

And also: anonymized data
should still be interesting to look at.

Solution Attempts

Explicit Probabilistic Models

Wikipedia, public domain (2008)

Explicit Probabilistic Models

«By eye», select a family of probability distributions
for each column.

Estimate the distribution parameters from this family using statistics.

Generate new data using the obtained distribution.

Explicit Probabilistic Models

Result — a C++ script:

EventTime.day(std::discrete_distribution<>({
    0, 0, 13, 30, 0, 14, 42, 5, 6, 31, 17, 0, 0, 0, 0, 23, 10, ...})(random));
EventTime.hour(std::discrete_distribution<>({
    13, 7, 4, 3, 2, 3, 4, 6, 10, 16, 20, 23, 24, 23, 18, 19, 19, ...})(random));
EventTime.minute(std::uniform_int_distribution<UInt8>(0, 59)(random));
EventTime.second(std::uniform_int_distribution<UInt8>(0, 59)(random));

UInt64 UserID = hash(4, powerLaw(5000, 1.1));
UserID = UserID / 10000000000ULL * 10000000000ULL
    + static_cast<time_t>(EventTime) + UserID % 1000000;

random_with_seed.seed(powerLaw(5000, 1.1));
auto get_random_with_seed = [&]{ return random_with_seed(); };

Explicit Probabilistic Models

Heuristics:

To preserve continuity of time series, we model not the value itself, but the difference from the previous one.

Dependencies between columns must be written explicitly: for example, to generate an IP address use a hash of the visitor ID and add some randomness.

Explicit Probabilistic Models

Advantages:

— conceptual simplicity;

Disadvantages:

— implementation effort;
— the implemented solution is tailored to one type of data.

The "hits" table has more than 100 columns as of 2013.
All of them need a model :(

Explicit Probabilistic Models

Models can be selected automatically
— best fit from a set of models + regularization.

Dependencies between data can also be found automatically
— calculate relative entropy between each pair of columns.

Still too labor-intensive.

Neural Networks

https://thiscatdoesnotexist.com/ (2019)

Neural Networks

«The Unreasonable Effectiveness of Recurrent Neural Networks»

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

The task was taken by a random person «from the street»
— an HSE student, for their thesis.

(An NDA was signed to obtain the test dataset.)

Neural Networks

Need to generate structured data, not just text.

Two approaches:

We can keep the data structure fixed
and generate only the values — «fillers».
But this requires adapting the code.

Or we can generate text dumps: there will be invalid lines, but they can simply be skipped during loading.

Neural Networks

Data quality at first glance is OK:

Sharif Rinatovich Anvardinov, 2018

Neural Networks (don't work)

Generation speed — approximately 100 rows per second
on one machine with CPU.

— No way to generate a 1-billion-row dataset
before the thesis defense.

Model size — about a gigabyte
(after training on a dataset of several gigabytes)

— Data anonymization becomes questionable.

Neural network compression and inference acceleration
— should be possible, but nobody did anything.

Neural Networks

Solution is completely impractical
...at least for now.

The person defended their thesis with an excellent grade :)

Code was thrown out and never used again.

Compressed Data Mutation

Compressed Data Mutation

Need to generate data

— whose compression ratios will be
exactly the same as the original data's;

— and which decompresses at exactly the same speed.

How to do this?

Compressed Data Mutation

Need to edit compressed data bytes directly!

Then the compressed data size won't change,
but the data itself will change.

And everything will work fast too.

Compressed Data Mutation

Let's say we're only interested in LZ4.

Data compressed with LZ4 consists of two types of commands:
— copy the next N bytes as is (called «literals»);
— repeat N bytes that already appeared at distance M back in the decompressed stream (called a «match»; the minimum repetition length is 4).

In the compressed file we'll leave matches as is,
but change byte values in literals.

As a result, after decompression we get a file in which all repeating sequences of length 4 or more still repeat, at the same distances, but consist of a different set of bytes.

Compressed Data Mutation

URL:

http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXeAyAx
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXe
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwdbknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu-702130

Compressed Data Mutation

Disadvantages:

— dependencies between columns are not preserved;

— data in different compressed blocks is transformed differently;

— data is not anonymized enough.

Random Permutations

and more

Random Permutations

http://fabiensanglard.net/fizzlefade/

Wolfenstein 3D, Id Software, 1992; Fabien Sanglard, 2017

Random Permutations

A one-to-one transformation of a set of values
that looks random.

Example: Linear-Feedback Shift Register

Wikipedia (KCAuXy4p - Own work), CC0, 2017

Random Permutations

Can be composed of simpler
one-to-one transformations.

Examples:

— multiplication by an odd number with overflow;
— xor-shift: x ^= x >> N;
— CRC-N, where N is the number of argument bits;

https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/

Feistel Network

http://antirez.com/news/113

Wikipedia (Amirki derivative work), CC BY-SA 3.0, 2011

Feistel Network (one round)

1. Argument bits are split into two halves:

arg:   xxxxyyyy
arg_l: xxxx
arg_r: yyyy

2. The right half moves to the left position, and the new right half is the XOR of the left half with F(right half):

res:   yyyyzzzz
res_l = yyyy = arg_r
res_r = zzzz = arg_l ^ F(arg_r)

This transformation is reversible:
we can compute F(res_l) ^ res_r and get arg_l.

Random Permutations

We'll use random permutations
to transform numeric columns.

— cardinalities are fully preserved;

— joint cardinalities are fully preserved;

— repeating values continue to repeat
at the same distance from each other;

We'll also preserve the order of magnitude of values:

— by permuting within one size class.

Markov Models

https://yandex.ru/referats/

Order-N Markov model
— defines the probability of the next character given N previous ones.

P(moma | mom) = 0.9
P(momb | mom) = 0.05
P(momc | mom) = 0.01
...

Because of their short memory, they generate funny, nonsensical texts.

But they work fast :)

Markov Models

Title:

Fashion Donuts - Informer.Ru - national accident in express fixik

prizes, traffic jams in Newstin. Trials online nails, but not feast, electroshock footage with discount or - [email protected] - fashion dating - online store STREETBALL on internet auto. Odessa) - AUTO.ria.ua Bazar autoradio films online free and video

Markov Models

Preserving repeating values:

We'll initialize the random number generator
with a hash of the original value.

Identical values transform to identical ones.

Markov Models

Preserving repeating fragments of values.

We want URLs starting with the same domain,
after transformation to also start with the same domain,
but a different one.

https://www.yandex.ru/images/cats/?id=12345
https://www.yandex.ru/images/cats/?id=67890
    ->
http://ftp.edu.co/cgi-bin/index.phtml#hello
http://ftp.edu.co/cgi-bin/index.phtml#world

Markov Models

Preserving repeating fragments of values.

Solution: as the random number generator for choosing the next letter, we'll use a hash of an 8-byte sliding window in the data.

https://www.yandex.ru/images/cats/?id=12345
                      ^^^^^^^^

distribution: [aaaa][b][cc][dddd][e][ff][ggggg][h]...
hash("images/c") % total_count:       ^

http://ftp.google.kz/cg...

Markov Models

Progaradar-children pregnant departures or Dachna Brideica and MO | Freezers. - Posters to enter into Accessories Progaradar-children shore — Yandex.Money: Payment journal five bicycles on Lore - dfghf — ya.ru - real estate in Moscow) by 473682 announcements - Sale Comprom Progaradar-children free! in large assortment»in Moscow - Embroidery — Omsk Free in Most sandal height Progaradar-children free! in large assortment»in Moscow, transport portal Progaradar-children berdyansk Fashions. Recipe with photo gallery and covered loud football [email protected] - Search Progaradar-children pregnant sale Watch best price, community 2010 | Designer MAIN. Delivery Progaradar-children free! in large assortment»труди Price: 820 0000 km., Taganda apartments in Saint-Pet Progaradar-children pregnant monthly - DoramaTv.ru - Dresses around world. Internet sale auto used and on with discount Progaradar-children pregnant room in 2013 year, policemen used in Stavropavlina and strollers -> Magnetaz 80 cell.RU Progaradar-children pregnant - Official forums Kalinin (Ukraine. Avtoria Baksler Kudryavtsev delivery, vacancies, hotel sale Progaradar-children pregnancy detailed bld. 5, 69, communication W*oychivom - Yandex.Maps, houses, what price tow truck forum games World of Tanks Progaradar-children cap, fatherland and in pink page 2 from cabinet search by [email protected] Progaradar-children pregnancy program in China heights Bar, weather Maniku. Records in Smolensk

Result

Result

— put everything together in the clickhouse-obfuscator utility;

— installed in the clickhouse-client package;

— inside — random permutations and Markov models;

— works deterministically, result is parameterized by a key;

— suitable for any tabular data;

— doesn't depend on ClickHouse — can be used for any DB's data;

— works fast: ~10,000 rows per second
on a single core for data with ~100 columns;

Result

clickhouse-obfuscator \
    --seed "$(head -c16 /dev/urandom | base64)" \
    --input-format TSV --output-format TSV \
    --structure 'CounterID UInt32, URLDomain String,
        URL String, SearchPhrase String, Title String' \
    < table.tsv > result.tsv

clickhouse-obfuscator --help

Result

Data for functional tests is published:

https://clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz
https://clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz

https://clickhouse.com/docs/ru/getting_started/example_datasets/metrica/

Not everything was clear-cut, but publication of this data was approved.

External developers use this data
for real performance tests.
