Lesser-known ClickHouse features.
Hidden gems in ClickHouse.
Using ClickHouse like you have never imagined.
Install it on any system with a single binary:
curl https://clickhouse.com/ | sh
Single binary (480 MB), no dependencies!
Works on every platform:
— Linux, macOS, FreeBSD, Windows (WSL2)
— x86_64, ARM 64, PowerPC 64LE
Works with or without installation.
How is it implemented? Read the script: curl https://clickhouse.com/
Once you have downloaded the ClickHouse binary, you can run it as:
./clickhouse
You can also install it:
sudo ./clickhouse install
$ ./clickhouse
Use one of the following commands:
clickhouse local [args] clickhouse keeper [args]
clickhouse client [args] clickhouse keeper-converter [args]
clickhouse benchmark [args] clickhouse install [args]
clickhouse server [args] clickhouse start [args]
clickhouse extract-from-config [args] clickhouse stop [args]
clickhouse compressor [args] clickhouse status [args]
clickhouse format [args] clickhouse restart [args]
clickhouse copier [args] clickhouse static-files-disk-uploader
clickhouse obfuscator [args] clickhouse su [args]
clickhouse git-import [args] clickhouse hash-binary [args]
clickhouse disks [args]
Anonymize datasets.
Generate and multiply datasets.
Keeping the structure and the distribution of values.
Example 1: generating an obfuscated dataset of the same size:
clickhouse-obfuscator \
--structure "c1 String, c2 UInt64, ..." \
--input-format CSV \
--output-format CSV \
--seed "$SEED" \
< source.csv > result.csv
This dataset does not exist:
┌─SearchPhrase──────────────┬─count()─┐
│ nazarbeko cn32 │ 1035903 │
│ ozzy 4 │ 849431 │
│ mira party" │ 798410 │
│ limp bizkiy │ 436204 │
│ galaxy table │ 237734 │
│ sea party" │ 226540 │
│ kawasaki bios │ 190022 │
│ kawasaki - psy │ 185437 │
│ my love_r │ 179403 │
│ minecraftbuk-40 │ 165758 │
│ asrock antasy xv │ 158144 │
│ galaxy s4 minecraftbuk-40 │ 134530 │
│ kawasaki bioshop │ 127379 │
│ vintage club.su │ 127129 │
└───────────────────────────┴─────────┘
For performance and load testing.
Keeps the distribution of string lengths, the magnitude of numbers,
cardinalities, correlations, the distribution of substrings, compression rates...
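As an illustration of what "keeping distributions" means, here is a toy Python sketch. This is NOT clickhouse-obfuscator's real algorithm (which also models n-gram frequencies and much more); it only shows the simplest property: a seeded, deterministic transform that preserves string length and character classes, so the output has the same size and shape as the input.

```python
# Toy sketch, assuming only length/character-class preservation.
# The real clickhouse-obfuscator is far more sophisticated.
import hashlib

def obfuscate(value: str, seed: str) -> str:
    out = []
    for i, ch in enumerate(value):
        # Deterministic per-position byte derived from (seed, value, position).
        h = hashlib.sha256(f"{seed}:{value}:{i}".encode()).digest()[0]
        if ch.islower():
            out.append(chr(ord("a") + h % 26))
        elif ch.isupper():
            out.append(chr(ord("A") + h % 26))
        elif ch.isdigit():
            out.append(chr(ord("0") + h % 10))
        else:
            out.append(ch)  # keep punctuation and spaces: preserves structure
    return "".join(out)

print(obfuscate("kawasaki bios", "42"))  # same length, same word shape
```

The same seed always produces the same output, which is why the `--seed` flag matters: it lets you regenerate an identical obfuscated dataset.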
Example 2: generating an obfuscated dataset of a smaller or larger size:
clickhouse-obfuscator \
--structure "c1 String, c2 UInt64, ..." \
--input-format CSV \
--output-format CSV \
--seed "$SEED" \
--limit 100 \
< source.csv > result.csv
Example 3: saving models for parallel generation:
clickhouse-obfuscator \
--structure "c1 String, c2 UInt64, ..." \
--input-format CSV \
--output-format CSV \
--seed "$SEED" \
--limit 0 --save model.bin \
< source.csv
clickhouse-obfuscator \
--structure "c1 String, c2 UInt64, ..." \
--input-format CSV \
--output-format CSV \
--seed "$SEED" \
--load model.bin \
< source.csv > result.csv
For performance and load testing.
With only 1 million records of source data,
generate 1 trillion records to check how ClickHouse will scale.
TL;DR: it will scale well :)
Uninteresting part: pretty-print your query.
$ clickhouse-format --query "create table business_secrets(customer_name String, customer_phone_number String, company_id UInt64, description String) ENGINE = MergeTree ORDER BY company_id"
CREATE TABLE business_secrets
(
`customer_name` String,
`customer_phone_number` String,
`company_id` UInt64,
`description` String
)
ENGINE = MergeTree
ORDER BY company_id
Uninteresting options: --hilite, --oneline.
It can obfuscate your queries!
$ clickhouse-format --obfuscate --query "CREATE TABLE business_secrets ..."
CREATE TABLE scimitar_joy
(
isolation_name String,
isolation_geography_lane String,
compress_id UInt64,
damage String
)
ENGINE = MergeTree
ORDER BY compress_id
Now the query looks much more interesting :)
It can obfuscate your queries!
$ clickhouse-format --seed 1 --obfuscate --query "CREATE TABLE business_secrets ..."
CREATE TABLE gun_science
(
touch_name String,
touch_grenade_inclusion String,
chord_id UInt64,
traditionalism String
)
ENGINE = MergeTree
ORDER BY chord_id
Now the query looks much more interesting :)
It can obfuscate your queries!
$ clickhouse-format --obfuscate < queries.sql
It can process multiple queries at once,
transforming them consistently.
String literals, numbers, and data are transformed by a deterministic PRF.
Use case: sharing queries for testing.
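A hypothetical sketch of how a deterministic PRF can rename identifiers (the actual clickhouse-format algorithm and its dictionary are different; the word list below is invented for illustration): a keyed hash maps each identifier to a dictionary word, so the same name is replaced the same way in every query, keeping the transformed queries mutually consistent.

```python
# Illustrative only: keyed SHA-256 as the PRF, a made-up word list.
import hashlib

WORDS = ["scimitar", "joy", "isolation", "compress", "damage",
         "gun", "science", "touch", "grenade", "chord"]

def rename(identifier: str, seed: str = "0") -> str:
    # Same (seed, identifier) pair always hashes to the same word.
    digest = hashlib.sha256(f"{seed}:{identifier}".encode()).digest()
    return WORDS[int.from_bytes(digest[:4], "big") % len(WORDS)]

# The same input always yields the same output for a given seed:
print(rename("business_secrets"), rename("business_secrets"))
```

Changing the seed (as with `--seed 1` above) re-keys the PRF and yields a completely different but equally consistent renaming.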
clickhouse-git-import --help
Analyze your code with ClickHouse!
Usage:
clickhouse-git-import
^ run this command inside your Git repository.
What is Git?
A distributed version control system?
A content-addressable blockchain graph database?
git — the stupid content tracker
— Linus Torvalds.
Whatever it is... let's load everything to ClickHouse!
clickhouse-git-import
^ run this command inside your Git repository.
$ wc -lc line_changes.tsv file_changes.tsv commits.tsv
7528665 2920411878 line_changes.tsv
265724 55133300 file_changes.tsv
62665 8182510 commits.tsv
Allows answering questions like:
— list files with the maximum number of authors;
— show the oldest lines of code in the repository;
— show the files with the longest history;
— list an author's favorite files;
— list the largest files with the lowest number of authors;
— on which weekday does code have the highest chance of staying in the repository;
— the distribution of code age across the repository;
— files sorted by average code age;
— quickly show a file with (rough) blame info;
— the distribution of commits and lines of code over time;
— show the history of every subdirectory, file, or line of a file, with changes over time;
— how the number of contributors changed over time;
— list files with the most modifications;
— list files rewritten the most times or by the most authors;
— what percentage of code is removed by other authors, per author;
— the matrix of authors, showing which authors tend to rewrite other authors' code;
— the average and median time before code is rewritten (the half-life of code decay);
— how the comments/code ratio changes over time, by author, and by location;
— who tends to write more tests / C++ code / comments.
The matrix of authors: who rewrites another author's code.
SELECT prev_author, author AS new_author, count() AS c
FROM line_changes WHERE sign = -1 AND file_extension IN ('h', 'cpp')
AND line_type NOT IN ('Punct', 'Empty')
AND author != prev_author AND prev_author != ''
GROUP BY prev_author, author ORDER BY c DESC
LIMIT 1 BY prev_author LIMIT 100
┌─prev_author──────────┬─new_author───────┬─────c─┐
│ Ivan Lezhankin │ Alexey Milovidov │ 18554 │
│ Alexey Arno │ Alexey Milovidov │ 18475 │
│ Michael Kolupaev │ Alexey Milovidov │ 14135 │
│ Alexey Milovidov │ Nikolai Kochetov │ 13435 │
│ Andrey Mironov │ Alexey Milovidov │ 10418 │
│ proller │ Alexey Milovidov │ 7280 │
│ Nikolai Kochetov │ Alexey Milovidov │ 6806 │
│ alexey-milovidov │ Alexey Milovidov │ 5027 │
│ Vitaliy Lyudvichenko │ Alexey Milovidov │ 4390 │
│ Amos Bird │ Ivan Lezhankin │ 3125 │
│ f1yegor │ Alexey Milovidov │ 3119 │
│ Pavel Kartavyy │ Alexey Milovidov │ 3087 │
│ Alexey Zatelepin │ Alexey Milovidov │ 2978 │
│ alesapin │ Alexey Milovidov │ 2949 │
│ Sergey Fedorov │ Alexey Milovidov │ 2727 │
│ Ivan Lezhankin │ Alexey Milovidov │ 2618 │
│ Vasily Nemkov │ Alexey Milovidov │ 2547 │
│ Alexander Tokmakov │ Alexey Milovidov │ 2493 │
│ Nikita Vasilev │ Maksim Kita │ 2420 │
│ Anton Popov │ Amos Bird │ 2127 │
└──────────────────────┴──────────────────┴───────┘
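The same matrix can be cross-checked in pure Python over a `line_changes.tsv` export. The column positions below are an assumption made for illustration (only the columns the SQL query uses, in an invented order); adapt them to the actual clickhouse-git-import schema.

```python
# Hedged sketch: replicate the filter and GROUP BY of the SQL query above
# over a TSV with assumed columns: sign, file_extension, line_type,
# author, prev_author.
import csv
import io
from collections import Counter

def rewrite_matrix(tsv_text: str) -> Counter:
    counts = Counter()
    for sign, ext, line_type, author, prev in csv.reader(
            io.StringIO(tsv_text), delimiter="\t"):
        # Same conditions as the WHERE clause: removed C++ lines,
        # non-trivial content, a different (and known) previous author.
        if (sign == "-1" and ext in ("h", "cpp")
                and line_type not in ("Punct", "Empty")
                and prev and author != prev):
            counts[(prev, author)] += 1
    return counts

sample = "-1\tcpp\tCode\tAlice\tBob\n-1\tcpp\tCode\tAlice\tBob\n"
print(rewrite_matrix(sample).most_common(1))  # [(('Bob', 'Alice'), 2)]
```

For the real multi-gigabyte TSV, running the SQL query in ClickHouse is of course the point of the exercise; the sketch just makes the aggregation logic explicit.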
The matrix of authors: who rewrites another author's code (LLVM).
┌─prev_author──────────┬─new_author────────────┬─────c─┐
│ Giorgis Georgakoudis │ Johannes Doerfert │ 31766 │
│ Reid Spencer │ Chris Lattner │ 30941 │
│ Ted Kremenek │ Argyrios Kyrtzidis │ 27294 │
│ David Salinas │ Nico Weber │ 23607 │
│ Deepak Panickal │ Zachary Turner │ 22455 │
│ Chris Lattner │ Reid Spencer │ 21297 │
│ Vikram S. Adve │ Chris Lattner │ 20491 │
│ Greg Clayton │ Pavel Labath │ 20181 │
│ Zachary Turner │ Jonas Devlieghere │ 19232 │
│ Douglas Gregor │ John McCall │ 17809 │
│ Howard Hinnant │ Eric Fiselier │ 16951 │
│ hyeongyu kim │ Florian Hahn │ 15966 │
│ Eric Fiselier │ Louis Dionne │ 15562 │
│ Joseph Huber │ Johannes Doerfert │ 15372 │
│ Tobias Grosser │ Michael Kruse │ 13678 │
│ Misha Brukman │ Chris Lattner │ 12823 │
│ Roman Lebedev │ hyeongyu kim │ 12816 │
│ Rui Ueyama │ Rafael Espindola │ 11934 │
│ Evan Cheng │ Chris Lattner │ 11609 │
│ Christopher Di Bella │ Louis Dionne │ 11184 │
│ John Criswell │ Chris Lattner │ 11064 │
│ Brian Gaeke │ Chris Lattner │ 10814 │
│ Sirish Pande │ Chandler Carruth │ 10536 │
│ Nick Kledzik │ Keith Smiley │ 9990 │
│ Jim Ingham │ Greg Clayton │ 9188 │
│ Nicolas Vasilache │ River Riddle │ 9142 │
│ Rafael Espindola │ Rui Ueyama │ 9089 │
│ Dan Gohman │ Chris Lattner │ 9058 │
│ Anton Korobeynikov │ Chris Lattner │ 9004 │
│ Owen Anderson │ Chris Lattner │ 8780 │
│ Mark de Wever │ Petr Hosek │ 8748 │
│ Todd Fiala │ Pavel Labath │ 8507 │
│ Michael J. Spencer │ Rafael Espindola │ 8305 │
│ Nate Begeman │ Chris Lattner │ 8299 │
│ Chandler Carruth │ Benjamin Kramer │ 8254 │
│ Jim Cownie │ Jonathan Peyton │ 8142 │
│ Michael Zuckerman │ Craig Topper │ 8035 │
│ Zhongxing Xu │ Ted Kremenek │ 7872 │
│ Erik Pilkington │ Richard Smith │ 7861 │
│ Shankar Easwaran │ Rui Ueyama │ 7749 │
│ Johnny Chen │ Jason Molenda │ 7746 │
│ Alex Zinenko │ River Riddle │ 7476 │
│ Marshall Clow │ Artem Dergachev │ 7376 │
│ Caroline Tice │ Greg Clayton │ 7181 │
│ Sean Callanan │ Greg Clayton │ 7098 │
│ Argyrios Kyrtzidis │ Douglas Gregor │ 7025 │
│ Daniel Dunbar │ Chris Lattner │ 6940 │
│ Steve Naroff │ Chris Lattner │ 6853 │
│ Tom Stellard │ Matt Arsenault │ 6679 │
│ Jim Grosbach │ Tanya Lattner │ 6562 │
│ Alexey Bataev │ Joseph Huber │ 6521 │
│ Edwin Vane │ Alexander Kornienko │ 6503 │
│ Enrico Granata │ Greg Clayton │ 6501 │
│ John McCall │ Douglas Gregor │ 6454 │
│ Ilya Biryukov │ Sam McCall │ 6395 │
│ Nandor Licker │ Roman Lebedev │ 6305 │
│ Ryan Brown │ Jonas Devlieghere │ 6033 │
│ George Rimar │ Rui Ueyama │ 6033 │
│ Mehdi Amini │ River Riddle │ 5820 │
│ Sanjiv Gupta │ Chris Lattner │ 5672 │
│ Yaxun Liu │ Sven van Haastregt │ 5539 │
│ Simon Atanasyan │ Rafael Espindola │ 5476 │
│ Eric Schweitz │ Valentin Clement │ 5396 │
│ Tanya Lattner │ Chris Lattner │ 5343 │
│ Pavel Labath │ Greg Clayton │ 5314 │
│ Bill Nell │ Maksim Panchenko │ 5311 │
│ Rafael Auler │ Maksim Panchenko │ 5278 │
│ Elena Demikhovsky │ Craig Topper │ 5270 │
│ Bill Wendling │ Chris Lattner │ 5266 │
│ Craig Topper │ Simon Pilgrim │ 5215 │
│ zoecarver │ Petr Hosek │ 5180 │
│ Sebastian Redl │ Douglas Gregor │ 5172 │
│ Nick Lewycky │ Chris Lattner │ 5161 │
│ Jason Molenda │ Pavel Labath │ 4842 │
│ Simon Pilgrim │ Craig Topper │ 4828 │
│ Fariborz Jahanian │ John McCall │ 4744 │
│ Kostya Serebryany │ Vitaly Buka │ 4713 │
│ Lang Hames │ Jakob Stoklund Olesen │ 4665 │
│ Anders Carlsson │ John McCall │ 4547 │
│ Matheus Izvekov │ Jonas Devlieghere │ 4540 │
│ Devang Patel │ Chris Lattner │ 4532 │
│ Maksim Panchenko │ Amir Ayupov │ 4521 │
│ Mike Stump │ Anders Carlsson │ 4443 │
│ Wesley Peck │ Rafael Espindola │ 4440 │
│ JF Bastien │ Louis Dionne │ 4414 │
│ Jim Laskey │ Chris Lattner │ 4402 │
│ peter klausler │ Peter Klausler │ 4380 │
│ Hal Finkel │ Chandler Carruth │ 4151 │
│ Jeff Cohen │ Chris Lattner │ 4128 │
│ Jonas Devlieghere │ Matheus Izvekov │ 4123 │
│ Sergey Dmitrouk │ Daniel Jasper │ 4095 │
│ Reid Kleckner │ Zachary Turner │ 4058 │
│ Colin LeMahieu │ Krzysztof Parzyszek │ 3948 │
│ Kelvin Li │ Diana Picus │ 3913 │
│ Asaf Badouh │ Craig Topper │ 3884 │
│ Tamas Berghammer │ Jonas Devlieghere │ 3803 │
│ Chaoren Lin │ Ying Chen │ 3762 │
│ Ruchira Sasanka │ Chris Lattner │ 3737 │
│ Alkis Evlogimenos │ Chris Lattner │ 3707 │
│ Ying Chen │ Chaoren Lin │ 3668 │
└──────────────────────┴───────────────────────┴───────┘
ClickHouse engine without a server!
Uninteresting use cases:
— queries on text, JSON, and binary files;
— format conversion;
— data preparation;
— interactive analysis of local files.
clickhouse-local --query "
SELECT reviewText FROM file('books.ndjson')
ORDER BY overall DESC LIMIT 100"
Unusual usages: replacement for every Unix tool.
grep, sed, pv, wc, yes, cat, nl...
Example: replacement for pv:
input | pv | output
input | clickhouse-local --progress \
--input-format LineAsString \
--output-format LineAsString \
--query "SELECT * FROM table" \
| output
But this is not the weirdest usage of clickhouse-local.
Example: replacement for grep:
$ time grep -c --text Warning server.log # 2.6 GB file
0.811s
$ time rg -c --text Warning server.log
0.289s
$ time clickhouse-local --progress --input-format LineAsString --query "
SELECT count() FROM table WHERE line LIKE '%Warning%'" < server.log
1.234s :(
$ clickhouse-client --query "
CREATE TABLE grep (s String) ENGINE = MergeTree ORDER BY tuple();"
$ clickhouse-client --time --query "
INSERT INTO grep FORMAT LineAsString" < server.log
$ clickhouse-client --time --query "
SELECT count() FROM grep WHERE s LIKE '%Warning%'"
0.107s :)
Analyze ClickHouse with itself.
Demo
Can ClickHouse draw pictures?
— Yes:
(echo "P5"; echo "16384 16384 255";
clickhouse-local --progress --format RowBinary --query "
SELECT * FROM (
SELECT * FROM file('/usr/bin/clickhouse', RowBinary, 'v UInt8')
LIMIT 16384*16384)
ORDER BY mortonDecode(2, rowNumberInAllBlocks())") |
pnmtopng | pv > binary.png
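What `mortonDecode(2, n)` does in the query above: de-interleave the bits of n into two coordinates, mapping the linear row number onto a Z-order (Morton) curve so that nearby bytes of the binary land near each other in the image. A sketch of the bit manipulation (which coordinate takes the even bits is a convention; ClickHouse's choice may be the other way around):

```python
# De-interleave a Morton code into 2D coordinates:
# even bits of n go to x, odd bits go to y.
def morton_decode2(n: int) -> tuple[int, int]:
    x = y = 0
    for bit in range(32):
        x |= ((n >> (2 * bit)) & 1) << bit       # even bits -> x
        y |= ((n >> (2 * bit + 1)) & 1) << bit   # odd bits  -> y
    return x, y

print(morton_decode2(0b1101))  # (3, 2)
```

Sorting by the decoded coordinates is what turns a 1D byte stream into a locality-preserving 2D picture.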
Can ClickHouse play music?
— Maybe:
clickhouse-local --query "
SELECT (sin(number / 100 * log(
cityHash64(number % 2
? number DIV (20000 + number % 101)
: number DIV (30000 + number % 107))
% 50 + 2) + 1) * 4096)::Int16
FROM system.numbers" --format RowBinary | aplay -f cd
Linux only. Not recommended to listen!
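The same idea in Python, for the curious: compute signed 16-bit PCM samples from a deterministic formula over a sample counter. This mirrors the SQL above in shape only; `cityHash64` is replaced with a stand-in hash (MD5), so the resulting "melody" will differ.

```python
# Generate raw PCM samples from a formula, like the SQL query above.
# MD5 stands in for cityHash64 here, which changes the tune.
import hashlib
import math
import struct

def sample(n: int) -> int:
    d = (20000 + n % 101) if n % 2 else (30000 + n % 107)
    h = int.from_bytes(hashlib.md5(str(n // d).encode()).digest()[:8], "big")
    # (sin(...) + 1) * 4096 stays in [0, 8192], well inside Int16 range.
    return int((math.sin(n / 100 * math.log(h % 50 + 2)) + 1) * 4096)

pcm = struct.pack("<100h", *(sample(n) for n in range(100)))
# Pipe such raw little-endian Int16 samples to a player, e.g. aplay -f cd.
```

As with the original: not recommended to listen!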
Serverless build for Cloud Functions
A minimal build of clickhouse-local is less than 50 MB.
Let's wrap it into a lambda function and publish it at glitch.com!
... using their free plan :)
https://clickhouse.glitch.me/play
This demo has been made by Lorenzo Mangani.
"Please make sure you say this is an experiment in running clickhouse
on the SMALLEST free instance on the internet, 1 shared CPU and 512MB of slow RAM" — Lorenzo.
Every ClickHouse instance collects metrics about ClickHouse
and about the system as a whole.
Launch an empty ClickHouse on your server,
and it will collect metrics and store them
in the system.asynchronous_metric_log table.
Download 7.5 million web pages (~1 TB):
cut -d, -f2 ~/tranco_*.csv | dos2unix | xargs -P100 -I{} bash -c "
[ ! -f '{}.out' ] && touch '{}.out' &&
curl -v -s -S --max-redirs 10 --max-time 10 \
-L '{}' -o '{}.out' 2> '{}.log'"
Insert into ClickHouse:
CREATE TABLE minicrawl (
rank UInt32, domain String,
log String CODEC(ZSTD(6)),
content String CODEC(ZSTD(6)))
ENGINE = MergeTree
ORDER BY domain;
Insert into ClickHouse:
clickhouse-local --input-format CSV --output-format Native \
--structure 'rank UInt32, domain String' --query "
SELECT *,
file('data/' || domain || '.log') AS log,
file('data/' || domain || '.out') AS content FROM table
" < tranco_*.csv |
clickhouse-client --query "INSERT INTO minicrawl FORMAT Native"
Analyze the data:
SELECT rank, domain FROM minicrawl WHERE content LIKE '%ClickHouse%';
SELECT replaceRegexpAll(extractTextFromHTML(content) AS text,
'(?i)(clickhouse)', '\x1b[1;31m\1\x1b[0m')
FROM minicrawl WHERE text ILIKE '%clickhouse%';
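The `replaceRegexpAll` call above wraps each match in ANSI escape codes so it shows up highlighted in the terminal. The same trick in Python:

```python
# Wrap case-insensitive matches in ANSI bold-red escape sequences,
# mirroring the replaceRegexpAll(..., '\x1b[1;31m\1\x1b[0m') call.
import re

def highlight(text: str, pattern: str = r"(?i)(clickhouse)") -> str:
    # \x1b[1;31m ... \x1b[0m renders as bold red on ANSI terminals.
    return re.sub(pattern, "\x1b[1;31m\\1\x1b[0m", text)

print(highlight("We moved our analytics to ClickHouse last year."))
```

The capture group keeps the original casing of each match, only the surrounding escape codes are added.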
ClickHouse on Raspberry Pi and RISC-V.
ClickHouse on Graviton 3 and Threadripper Pro.
ClickHouse to speed up MongoDB and make it web scale again.
ClickHouse as a replacement of AWS Athena.
...
Use ClickHouse! Abuse ClickHouse.
Find new horizons with ClickHouse.
Tomorrow: https://clickhouse.com/company/events/v22-12-release-webinar