Doing Crazy Stuff With ClickHouse

Doing Crazy Stuff With ClickHouse

Doing Crazy Stuff With ClickHouse

Less known ClickHouse features.

Hidden gems in ClickHouse.

Using ClickHouse like you have never imagine.

Download ClickHouse

Install it on any system with a single binary:

curl https://clickhouse.com/ | sh

Single binary (480 MB), no dependencies!

Works on every platform:
— Linux, macOS, FreeBSD, Windows (WSL2)
— x86_64, ARM 64, PowerPC 64LE

Works with or without installation.

How it is implemented? Read curl https://clickhouse.com/

What's inside ClickHouse?

Successfully downloaded the ClickHouse binary, you can run it as: ./clickhouse You can also install it: sudo ./clickhouse install $ ./clickhouse Use one of the following commands: clickhouse local [args] clickhouse keeper [args] clickhouse client [args] clickhouse keeper-converter [args] clickhouse benchmark [args] clickhouse install [args] clickhouse server [args] clickhouse start [args] clickhouse extract-from-config [args] clickhouse stop [args] clickhouse compressor [args] clickhouse status [args] clickhouse format [args] clickhouse restart [args] clickhouse copier [args] clickhouse static-files-disk-uploader clickhouse obfuscator [args] clickhouse su [args] clickhouse git-import [args] clickhouse hash-binary [args] clickhouse disks [args]

ClickHouse Obfuscator

Anonymize datasets.

Generate and multiply datasets.

Keeping structure and values distribution.

Example 1: generating obfuscated dataset of the same size:

clickhouse-obfuscator \ --structure "c1 String, c2 UInt64, ..." \ --input-format CSV \ --output-format CSV \ --seed "$SEED" \ < source.csv > result.csv

ClickHouse Obfuscator

This dataset does not exist:

┌─SearchPhrase──────────────┬─count()─┐ │ nazarbeko cn32 │ 1035903 │ │ ozzy 4 │ 849431 │ │ mira party" │ 798410 │ │ limp bizkiy │ 436204 │ │ galaxy table │ 237734 │ │ sea party" │ 226540 │ │ kawasaki bios │ 190022 │ │ kawasaki - psy │ 185437 │ │ my love_r │ 179403 │ │ minecraftbuk-40 │ 165758 │ │ asrock antasy xv │ 158144 │ │ galaxy s4 minecraftbuk-40 │ 134530 │ │ kawasaki bioshop │ 127379 │ │ vintage club.su │ 127129 │ └───────────────────────────┴─────────┘

ClickHouse Obfuscator

For performance and load testing.

Keeps the distribution of string lengths, numbers magnitude,
cardinality, correlations, substrings distribution, compression rate...

Example 2: generating obfuscated dataset of smaller or larger size:

clickhouse-obfuscator \ --structure "c1 String, c2 UInt64, ..." \ --input-format CSV \ --output-format CSV \ --seed "$SEED" \ --limit 100 \ < source.csv > result.csv

ClickHouse Obfuscator

Example 3: saving models for parallel generation:

clickhouse-obfuscator \ --structure "c1 String, c2 UInt64, ..." \ --input-format CSV \ --output-format CSV \ --seed "$SEED" \ --limit 0 --save model.bin \ < source.csv clickhouse-obfuscator \ --structure "c1 String, c2 UInt64, ..." \ --input-format CSV \ --output-format CSV \ --seed "$SEED" \ --load model.bin \ < source.csv > result.csv

ClickHouse Obfuscator

For performance and load testing.

Having only 1 million records of data,
generate 1 trillion records to check how ClickHouse will scale.

TLDR: It will scale well :)

ClickHouse Format

Uninteresting part: pretty-print your query.

$ clickhouse-format --query "create table business_secrets(customer_name String, customer_phone_number String, company_id UInt64, description String) ENGINE = MergeTree ORDER BY company_id" CREATE TABLE business_secrets ( `customer_name` String, `customer_phone_number` String, `company_id` UInt64, `description` String ) ENGINE = MergeTree ORDER BY company_id

Uninteresting options: --hilite, --oneline.

ClickHouse Format

It can obfuscate your queries!

$ clickhouse-format --obfuscate --query "CREATE TABLE business_secrets ..." CREATE TABLE scimitar_joy ( isolation_name String, isolation_geography_lane String, compress_id UInt64, damage String ) ENGINE = MergeTree ORDER BY compress_id

Now the query looks much more interesting :)

ClickHouse Format

It can obfuscate your queries!

$ clickhouse-format --seed 1 --obfuscate --query "CREATE TABLE business_secrets ..." CREATE TABLE gun_science ( touch_name String, touch_grenade_inclusion String, chord_id UInt64, traditionalism String ) ENGINE = MergeTree ORDER BY chord_id

Now the query looks much more interesting :)

ClickHouse Format

It can obfuscate your queries!

$ clickhouse-format --obfuscate < queries.sql

Can process multiple queries at once
for consistent transformation.

String literals, numbers and data are transformed by a deterministic PRF.

Use case: sharing the queries for testing.

ClickHouse Git Import

clickhouse-git-import --help

Analyze your code with ClickHouse!

Usage:

clickhouse-git-import

^ run this command inside your Git repository.

ClickHouse Git Import

What is Git?

A distributed version control system?

A content-addressable blockchain graph database?

git — the stupid content tracker
— Linus Torvalds.

Whatever it is... let's load everything to ClickHouse!

ClickHouse Git Import

clickhouse-git-import

^ run this command inside your Git repository.

$ wc -lc line_changes.tsv file_changes.tsv commits.tsv 7528665 2920411878 line_changes.tsv 265724 55133300 file_changes.tsv 62665 8182510 commits.tsv

ClickHouse Git Import

Allows to answer questions like:
— list files with maximum number of authors;
— show me the oldest lines of code in the repository;
— show me the files with longest history;
— list favorite files for author;
— list largest files with lowest number of authors;
— at what weekday the code has highest chance to stay in repository;
— the distribution of code age across repository;
— files sorted by average code age;
— quickly show file with blame info (rough);
— commits and lines of code distribution by time;
— show history for every subdirectory, file, line of file, changes across time;
— how the number of contributors was changed across time;
— list files with most modifications;
— list files that were rewritten most number of time or by most of authors;
— what is percentage of code removal by other authors, across authors;
— the matrix of authors that shows what authors tends to rewrite another authors code;
— the average time before code will be rewritten and the median (half-life of code decay);
— comments/code percentage change in time / by author / by location;
— who tend to write more tests / cpp code / comments.

ClickHouse Git Import

The matrix of authors: who rewrites another author's code.

SELECT prev_author, author AS new_author, count() AS c FROM line_changes WHERE sign = -1 AND file_extension IN ('h', 'cpp') AND line_type NOT IN ('Punct', 'Empty') AND author != prev_author AND prev_author != '' GROUP BY prev_author, author ORDER BY c DESC LIMIT 1 BY prev_author LIMIT 100 ┌─prev_author──────────┬─new_author───────┬─────c─┐ │ Ivan Lezhankin │ Alexey Milovidov │ 18554 │ │ Alexey Arno │ Alexey Milovidov │ 18475 │ │ Michael Kolupaev │ Alexey Milovidov │ 14135 │ │ Alexey Milovidov │ Nikolai Kochetov │ 13435 │ │ Andrey Mironov │ Alexey Milovidov │ 10418 │ │ proller │ Alexey Milovidov │ 7280 │ │ Nikolai Kochetov │ Alexey Milovidov │ 6806 │ │ alexey-milovidov │ Alexey Milovidov │ 5027 │ │ Vitaliy Lyudvichenko │ Alexey Milovidov │ 4390 │ │ Amos Bird │ Ivan Lezhankin │ 3125 │ │ f1yegor │ Alexey Milovidov │ 3119 │ │ Pavel Kartavyy │ Alexey Milovidov │ 3087 │ │ Alexey Zatelepin │ Alexey Milovidov │ 2978 │ │ alesapin │ Alexey Milovidov │ 2949 │ │ Sergey Fedorov │ Alexey Milovidov │ 2727 │ │ Ivan Lezhankin │ Alexey Milovidov │ 2618 │ │ Vasily Nemkov │ Alexey Milovidov │ 2547 │ │ Alexander Tokmakov │ Alexey Milovidov │ 2493 │ │ Nikita Vasilev │ Maksim Kita │ 2420 │ │ Anton Popov │ Amos Bird │ 2127 │ └──────────────────────┴──────────────────┴───────┘

ClickHouse Git Import

The matrix of authors: who rewrites another author's code (LLVM).

┌─prev_author──────────┬─new_author────────────┬─────c─┐ │ Giorgis Georgakoudis │ Johannes Doerfert │ 31766 │ │ Reid Spencer │ Chris Lattner │ 30941 │ │ Ted Kremenek │ Argyrios Kyrtzidis │ 27294 │ │ David Salinas │ Nico Weber │ 23607 │ │ Deepak Panickal │ Zachary Turner │ 22455 │ │ Chris Lattner │ Reid Spencer │ 21297 │ │ Vikram S. Adve │ Chris Lattner │ 20491 │ │ Greg Clayton │ Pavel Labath │ 20181 │ │ Zachary Turner │ Jonas Devlieghere │ 19232 │ │ Douglas Gregor │ John McCall │ 17809 │ │ Howard Hinnant │ Eric Fiselier │ 16951 │ │ hyeongyu kim │ Florian Hahn │ 15966 │ │ Eric Fiselier │ Louis Dionne │ 15562 │ │ Joseph Huber │ Johannes Doerfert │ 15372 │ │ Tobias Grosser │ Michael Kruse │ 13678 │ │ Misha Brukman │ Chris Lattner │ 12823 │ │ Roman Lebedev │ hyeongyu kim │ 12816 │ │ Rui Ueyama │ Rafael Espindola │ 11934 │ │ Evan Cheng │ Chris Lattner │ 11609 │ │ Christopher Di Bella │ Louis Dionne │ 11184 │ │ John Criswell │ Chris Lattner │ 11064 │ │ Brian Gaeke │ Chris Lattner │ 10814 │ │ Sirish Pande │ Chandler Carruth │ 10536 │ │ Nick Kledzik │ Keith Smiley │ 9990 │ │ Jim Ingham │ Greg Clayton │ 9188 │ │ Nicolas Vasilache │ River Riddle │ 9142 │ │ Rafael Espindola │ Rui Ueyama │ 9089 │ │ Dan Gohman │ Chris Lattner │ 9058 │ │ Anton Korobeynikov │ Chris Lattner │ 9004 │ │ Owen Anderson │ Chris Lattner │ 8780 │ │ Mark de Wever │ Petr Hosek │ 8748 │ │ Todd Fiala │ Pavel Labath │ 8507 │ │ Michael J. Spencer │ Rafael Espindola │ 8305 │ │ Nate Begeman │ Chris Lattner │ 8299 │ │ Chandler Carruth │ Benjamin Kramer │ 8254 │ │ Jim Cownie │ Jonathan Peyton │ 8142 │ │ Michael Zuckerman │ Craig Topper │ 8035 │ │ Zhongxing Xu │ Ted Kremenek │ 7872 │ │ Erik Pilkington │ Richard Smith │ 7861 │ │ Shankar Easwaran │ Rui Ueyama │ 7749 │ │ Johnny Chen │ Jason Molenda │ 7746 │ │ Alex Zinenko │ River Riddle │ 7476 │ │ Marshall Clow │ Artem Dergachev │ 7376 │ │ Caroline Tice │ Greg Clayton │ 7181 │ │ Sean Callanan │ Greg Clayton │ 7098 │ │ Argyrios Kyrtzidis │ Douglas Gregor │ 7025 │ │ Daniel Dunbar │ Chris Lattner │ 6940 │ │ Steve Naroff │ Chris Lattner │ 6853 │ │ Tom Stellard │ Matt Arsenault │ 6679 │ │ Jim Grosbach │ Tanya Lattner │ 6562 │ │ Alexey Bataev │ Joseph Huber │ 6521 │ │ Edwin Vane │ Alexander Kornienko │ 6503 │ │ Enrico Granata │ Greg Clayton │ 6501 │ │ John McCall │ Douglas Gregor │ 6454 │ │ Ilya Biryukov │ Sam McCall │ 6395 │ │ Nandor Licker │ Roman Lebedev │ 6305 │ │ Ryan Brown │ Jonas Devlieghere │ 6033 │ │ George Rimar │ Rui Ueyama │ 6033 │ │ Mehdi Amini │ River Riddle │ 5820 │ │ Sanjiv Gupta │ Chris Lattner │ 5672 │ │ Yaxun Liu │ Sven van Haastregt │ 5539 │ │ Simon Atanasyan │ Rafael Espindola │ 5476 │ │ Eric Schweitz │ Valentin Clement │ 5396 │ │ Tanya Lattner │ Chris Lattner │ 5343 │ │ Pavel Labath │ Greg Clayton │ 5314 │ │ Bill Nell │ Maksim Panchenko │ 5311 │ │ Rafael Auler │ Maksim Panchenko │ 5278 │ │ Elena Demikhovsky │ Craig Topper │ 5270 │ │ Bill Wendling │ Chris Lattner │ 5266 │ │ Craig Topper │ Simon Pilgrim │ 5215 │ │ zoecarver │ Petr Hosek │ 5180 │ │ Sebastian Redl │ Douglas Gregor │ 5172 │ │ Nick Lewycky │ Chris Lattner │ 5161 │ │ Jason Molenda │ Pavel Labath │ 4842 │ │ Simon Pilgrim │ Craig Topper │ 4828 │ │ Fariborz Jahanian │ John McCall │ 4744 │ │ Kostya Serebryany │ Vitaly Buka │ 4713 │ │ Lang Hames │ Jakob Stoklund Olesen │ 4665 │ │ Anders Carlsson │ John McCall │ 4547 │ │ Matheus Izvekov │ Jonas Devlieghere │ 4540 │ │ Devang Patel │ Chris Lattner │ 4532 │ │ Maksim Panchenko │ Amir Ayupov │ 4521 │ │ Mike Stump │ Anders Carlsson │ 4443 │ │ Wesley Peck │ Rafael Espindola │ 4440 │ │ JF Bastien │ Louis Dionne │ 4414 │ │ Jim Laskey │ Chris Lattner │ 4402 │ │ peter klausler │ Peter Klausler │ 4380 │ │ Hal Finkel │ Chandler Carruth │ 4151 │ │ Jeff Cohen │ Chris Lattner │ 4128 │ │ Jonas Devlieghere │ Matheus Izvekov │ 4123 │ │ Sergey Dmitrouk │ Daniel Jasper │ 4095 │ │ Reid Kleckner │ Zachary Turner │ 4058 │ │ Colin LeMahieu │ Krzysztof Parzyszek │ 3948 │ │ Kelvin Li │ Diana Picus │ 3913 │ │ Asaf Badouh │ Craig Topper │ 3884 │ │ Tamas Berghammer │ Jonas Devlieghere │ 3803 │ │ Chaoren Lin │ Ying Chen │ 3762 │ │ Ruchira Sasanka │ Chris Lattner │ 3737 │ │ Alkis Evlogimenos │ Chris Lattner │ 3707 │ │ Ying Chen │ Chaoren Lin │ 3668 │ └──────────────────────┴───────────────────────┴───────┘

ClickHouse Local

ClickHouse engine without a server!

Uninteresting use-cases:

— queries on text, JSON and binary files;
— format conversion;
— data preparation;
— interactive analysis of local files;

clickhouse-local --query " SELECT reviewText FROM file('books.ndjson') ORDER BY overall DESC LIMIT 100"

ClickHouse Local

Unusual usages: replacement for every Unix tool.

grep, sed, pv, wc, yes, cat, nl...

Example: replacement for pv:

input | pv | output

input | clickhouse-local --progress \ --input-format LineAsString \ --output-format LineAsString \ --query "SELECT * FROM table" \ | output

But this is not the most weird usage of clickhouse-local.

ClickHouse Local

Example: replacement for grep:

$ time grep -c --text Warning server.log # 2.6 GB file 0.811s $ time rg -c --text Warning server.log 0.289s $ time clickhouse-local --progress --input-format LineAsString --query " SELECT count() FROM table WHERE line LIKE '%Warning%'" < server.log 1.234s :( $ clickhouse-client --query " CREATE TABLE grep (s String) ENGINE = MergeTree ORDER BY ();" $ clickhouse-client --time --query " INSERT INTO grep FORMAT LineAsString" < server.log $ clickhouse-client --time --query " SELECT count() FROM grep WHERE s LIKE '%Warning%'" 0.107s :)

ClickHouse Local

Analyze ClickHouse with itself.

Demo

ClickHouse Local

Can ClickHouse draw pictures?

— Yes:

(echo "P5"; echo "16384 16384 255"; clickhouse-local --progress --format RowBinary --query " SELECT * FROM ( SELECT * FROM file('/usr/bin/clickhouse', RowBinary, 'v UInt8') LIMIT 16384*16384) ORDER BY mortonDecode(2, rowNumberInAllBlocks())") | pnmtopng | pv > binary.png

ClickHouse Local

Can ClickHouse play music?

— Maybe:

clickhouse-local --query " SELECT (sin(number / 100 * log( cityHash64(number % 2 ? number DIV (20000 + number % 101) : number DIV (30000 + number % 107)) % 50 + 2) + 1) * 4096)::Int16 FROM system.numbers" --format RowBinary | aplay -f cd

Linux only. Not recommended to listen!

ClickHouse Local

Serverless build for Cloud Functions

A minimal build of clickhouse-local is less than 50 MB.

Let's wrap it into lambda function and publish at glitch.com!

... using their free plan :)

https://clickhouse.glitch.me/play

This demo has been made by Lorenzo Mangani.

"Please make sure you say this is an experiment in running clickhouse
on the SMALLEST free instance on the internet, 1 shared CPU and 512MB of slow RAM" — Lorenzo.

ClickHouse as a Monitoring Agent

Every ClickHouse instance collects the metrics about ClickHouse
and about the system as a whole.

Launch empty ClickHouse on your server,
and it will collect the metrics and store them inside
the system.asynchronous_metric_log table.

Bonus: http://localhost:8123/dashboard

ClickHouse For Web Crawl Data

https://tranco-list.eu/

Download 7.5 million web pages (~1 TB):

cut -d, -f2 ~/tranco_*.csv | dos2unix | xargs -P100 -I{} bash -c " [ ! -f '{}.out' ] && touch '{}.out' && curl -v -s -S --max-redirs 10 --max-time 10 \ -L '{}' -o '{}.out' 2> '{}.log'"

Insert into ClickHouse:

CREATE TABLE minicrawl ( rank UInt32, domain String, log String CODEC(ZSTD(6)), content String CODEC(ZSTD(6))) ENGINE = MergeTree ORDER BY domain;

ClickHouse For Web Crawl Data

Insert into ClickHouse:

clickhouse-local --input-format CSV --output-format Native \ --structure 'rank UInt32, domain String' --query " SELECT *, file('data/' || domain || '.log') AS log, file('data/' || domain || '.out') AS content FROM table " < tranco_*.csv | clickhouse-client --query "INSERT INTO minicrawl FORMAT Native"

Analyze the data:

SELECT rank, domain FROM minicrawl WHERE content LIKE '%ClickHouse%'; SELECT replaceRegexpAll(extractTextFromHTML(content) AS text, '(?i)(clickhouse)', '\x1b[1;31m\1\x1b[0m') FROM minicrawl WHERE text ILIKE '%clickhouse%';

What Else?

ClickHouse on Raspberry Pi and RISC-V.

ClickHouse on Graviton 3 and Threadripper Pro.

ClickHouse to speed-up MongoDB and make it web scale again.

ClickHouse as a replacement of AWS Athena.

...

Use ClickHouse! Abuse ClickHouse.

Find new horizons with ClickHouse.

ClickHouse 22.12 Release Webinar

Tomorrow: https://clickhouse.com/company/events/v22-12-release-webinar

Q&A