2009 — first prototype of ClickHouse.
2012 — first production usages in Yandex for log processing.
2014 — first production usages in Yandex for public service.
2014..2015 — rising adoption in Yandex.
2016 — ClickHouse release in open-source.
2016..2017 — first usages outside of Yandex, for similar scenarios.
2018..2019 — adoption around the world: meetups in the US, Europe and China.
2020..2021:
Increased adoption of ClickHouse in large companies,
driven by traffic growth and hardware shortages.
ClickHouse appeared in many cloud services:
Alibaba, Tencent, Huawei, Bytedance, JD, Altinity, Yandex, Mail.ru, Sber...
Companies building services around ClickHouse:
Hydrolix, Tinybird, Gigapipe, LogTail...
ClickHouse, Inc. — an international, independent company.
Headquarters in Amsterdam.
Distributed team. We are hiring in Europe, Asia-Pacific and all over the world.
Aaron Katz, CEO - ex. Elastic, Salesforce.
Alexey Milovidov, CTO - ex. Yandex.
Yuri Izrailevsky, President of Engineering - ex. Google, Netflix.
Season 2021/2022
1. SQL User Defined Functions (since 21.10)
CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2;
CREATE FUNCTION getRegionName AS (x)
-> dictGet('region', 'name', x);
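Once created, SQL UDFs are called like built-in functions. A quick sketch (the `numbers` table function is built-in; `getRegionName` assumes the `region` dictionary from the definition above exists):

```sql
-- ch(x) is the hyperbolic cosine defined above
SELECT number AS x, ch(x) FROM numbers(3);

-- dictionary lookup via the wrapper function
SELECT getRegionName(toUInt64(213));
```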
2. Scriptable User Defined Functions (since 21.11)
<function>
<type>executable</type>
<name>myPlus</name>
<return_type>UInt64</return_type>
<argument><type>UInt64</type></argument>
<argument><type>UInt64</type></argument>
<format>TabSeparated</format>
<command>awk '{ print $1 + $2 }'</command>
<execute_direct>0</execute_direct>
</function>
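Once this XML is loaded, the scriptable function is callable from SQL like any other. A sketch of how it behaves: blocks of rows are serialized as TabSeparated, piped to the command's stdin, and the result is read back from stdout.

```sql
-- arguments are sent to awk as tab-separated values;
-- awk prints their sum, which ClickHouse reads back as UInt64
SELECT myPlus(7, 35);
```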
JSON data type:
CREATE TABLE games (data JSON) ENGINE = MergeTree;
You can insert arbitrary nested JSON objects.
Types are automatically inferred on INSERT and merge.
Data is stored in columnar format: columns and subcolumns.
Query nested data naturally.
Example: NBA games dataset
CREATE TABLE games (data String)
ENGINE = MergeTree ORDER BY tuple();
SELECT JSONExtractString(data, 'teams', 1, 'name')
FROM games;
— 0.520 sec.
CREATE TABLE games (data JSON)
ENGINE = MergeTree;
SELECT data.teams.name[1] FROM games;
— 0.015 sec.
DESCRIBE TABLE games
SETTINGS describe_extend_object_types = 1
name: data
type: Tuple( <-- inferred type
`_id.$oid` String,
`date.$date` String,
`teams.abbreviation` Array(String),
`teams.city` Array(String),
`teams.home` Array(UInt8),
`teams.name` Array(String),
`teams.players.ast` Array(Array(Int8)),
`teams.players.blk` Array(Array(Int8)),
`teams.players.drb` Array(Array(Int8)),
`teams.players.fg` Array(Array(Int8)),
`teams.players.fg3` Array(Array(Int8)),
`teams.players.fg3_pct` Array(Array(String)),
`teams.players.fg3a` Array(Array(Int8)),
`teams.players.fg_pct` Array(Array(String)),
`teams.players.fga` Array(Array(Int8)),
`teams.players.ft` Array(Array(Int8)),
`teams.players.ft_pct` Array(Array(String)),
`teams.players.fta` Array(Array(Int8)),
`teams.players.mp` Array(Array(String)),
`teams.players.orb` Array(Array(Int8)),
`teams.players.pf` Array(Array(Int8)),
`teams.players.player` Array(Array(String)),
`teams.players.plus_minus` Array(Array(String)),
`teams.players.pts` Array(Array(Int8)),
`teams.players.stl` Array(Array(Int8)),
`teams.players.tov` Array(Array(Int8)),
`teams.players.trb` Array(Array(Int8)),
`teams.results.ast` Array(Int8),
`teams.results.blk` Array(Int8),
`teams.results.drb` Array(Int8),
`teams.results.fg` Array(Int8),
`teams.results.fg3` Array(Int8),
`teams.results.fg3_pct` Array(String),
`teams.results.fg3a` Array(Int8),
`teams.results.fg_pct` Array(String),
`teams.results.fga` Array(Int16),
`teams.results.ft` Array(Int8),
`teams.results.ft_pct` Array(String),
`teams.results.fta` Array(Int8),
`teams.results.mp` Array(Int16),
`teams.results.orb` Array(Int8),
`teams.results.pf` Array(Int8),
`teams.results.plus_minus` Array(String),
`teams.results.pts` Array(Int16),
`teams.results.stl` Array(Int8),
`teams.results.tov` Array(Int8),
`teams.results.trb` Array(Int8),
`teams.score` Array(Int16),
`teams.won` Array(Int8))
Flexible schema.
You can mix strictly typed columns and flexible JSON columns in one table.
Queries work as fast as with predefined types!
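A sketch of such a mixed table (the table and column names here are hypothetical):

```sql
CREATE TABLE events
(
    timestamp DateTime,   -- strict, predefined type
    user_id UInt64,       -- strict, usable in the primary key
    payload JSON          -- flexible: subcolumns are inferred on INSERT
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);
```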
Planned for Q1 2022.
Before:
clickhouse-local --input-format Parquet \
--query 'SELECT * FROM table' \
--structure '
id UInt32, deleted UInt8,
type String, by String, time DateTime,
text String, dead UInt8, parent UInt32,
poll UInt32, kids Array(UInt32),
url String, score Int32, title String,
parts Array(UInt32), descendants Int32'
After:
clickhouse-local --input-format Parquet \
--query 'SELECT * FROM table'
1. No need to specify a schema if the data already contains one:
Native, Arrow, Parquet, ORC, Avro.
2. Schema can also be inferred from text formats:
TSV, CSV, JSONEachRow.
It works for:
— clickhouse-local;
— url;
— file;
— s3.
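For example, with schema inference the `file` and `s3` table functions can be called without the structure argument (the paths here are hypothetical):

```sql
-- columns and types are read from the Parquet metadata
SELECT * FROM file('hits.parquet', Parquet);

-- for text formats, types are inferred from the data itself
SELECT * FROM s3('https://bucket.s3.amazonaws.com/data.csv', CSV);
```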
Planned for Q1 2022.
1. Tokenization, stemming, lemmatization, synonyms (21.9..21.11).
— SELECT arrayMap(word -> lemmatize(word), tokens(text)).
2. Charset and language detection (Q4 2021).
3. Semi-duplicate text search (21.1).
— min-hash and sim-hash algorithms.
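A sketch of near-duplicate detection with the sim-hash functions: a small Hamming distance between the hashes suggests similar texts.

```sql
SELECT bitHammingDistance(
    ngramSimHash('ClickHouse is a fast OLAP DBMS'),
    ngramSimHash('ClickHouse is a very fast OLAP DBMS')) AS distance;
```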
1. clickhouse-keeper — protocol 100% compatible with ZooKeeper.
— compressed logs and snapshots;
— no issues with zxid overflow;
— no issues with large packets;
— better memory usage;
— no issues with GC and Java heap;
2. Can run embedded into clickhouse-server.
— no need for a separate process;
Preproduction stage, planned for Q4 2021.
Allows many small INSERTs
from many concurrent connections,
without Kafka and Buffer tables!
Many small INSERTs are combined together
and batched in memory.
Safe by default: every client gets a response once its data is inserted.
Available in 21.11-testing
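A sketch of how a client opts in (the `metrics` table is hypothetical; the settings are `async_insert` and `wait_for_async_insert`):

```sql
SET async_insert = 1;           -- buffer and batch small INSERTs on the server
SET wait_for_async_insert = 1;  -- safe by default: ack only after data is written

INSERT INTO metrics VALUES (now(), 'cpu', 0.42);
```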
ClickHouse is not a transactional DBMS, is it?
But we need transactions to:
— do atomic INSERTs to multiple partitions;
— do atomic INSERTs to multiple tables and materialized views;
— make multiple SELECT queries from one snapshot.
Planned for Q2 2022.
... And Incremental Aggregation In Memory
Run GROUP BY query continuously.
Aggregated data is accumulated in memory.
It can be queried as a table.
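The existing experimental LIVE VIEW already hints at the shape of this; a sketch (the `visits` table is hypothetical):

```sql
SET allow_experimental_live_view = 1;

CREATE LIVE VIEW visits_lv AS
    SELECT page, count() AS hits FROM visits GROUP BY page;

SELECT * FROM visits_lv;  -- query the accumulated aggregate like a table
```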
The server can serve key-value requests with Redis protocol.
Example applications:
— realtime antifraud;
— user profiles and personalization.
Planned for summer 2022.
ClickHouse over S3:
— highly concurrent asynchronous reads (21.11);
— parallel processing on multiple compute nodes (Q4);
— caching in RAM or local disk with node affinity (Q4);
— easy coordination of inserts and merges
without ReplicatedMergeTree (Q1 2022);
Lower cost and higher throughput than EBS volumes.
Bonus: "web" disk for static hosted datasets (21.10).
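A sketch of the server-side configuration for an S3-backed disk (the endpoint and credentials are placeholders):

```xml
<storage_configuration>
    <disks>
        <s3_disk>
            <type>s3</type>
            <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
            <access_key_id>KEY</access_key_id>
            <secret_access_key>SECRET</secret_access_key>
        </s3_disk>
    </disks>
    <policies>
        <s3_policy>
            <volumes>
                <main><disk>s3_disk</disk></main>
            </volumes>
        </s3_policy>
    </policies>
</storage_configuration>
```

A table then opts in with `SETTINGS storage_policy = 's3_policy'`.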
Streaming queries:
— GROUP BY and ORDER BY on time frames;
— distributed streams;
— persistent streams;
— subscription to tables and external data streams;
Advanced time series analysis.
Tamper-proof data storage.
Embedded ClickHouse.
ETL and batch jobs.
Please:
— Use ClickHouse.
— 💖 ClickHouse.
— Contribute to ClickHouse!