ClickHouse: New Horizons

ClickHouse:
New Horizons

The Origins

2009 — first prototype of ClickHouse.

2012 — first production usages in Yandex for logs processing.

2014 — first production usages in Yandex for public service.

2014..2015 — rising adoption in Yandex.

2016 — ClickHouse release in open-source.

The Origins

2016..2017 — first usages outside of Yandex
— for similar scenarios.

2018..2019 — distribution over the world
— meetups in the US, Europe and China.

The Origins

2020..2021:

Increased adoption of ClickHouse in large companies.
Due to increase of traffic and shortage of hardware.

ClickHouse appeared in many cloud services.
Alibaba, Tencent, Huawei, Bytedance, JD, Altinity, Yandex, Mail.ru, Sber...

Companies with services around ClickHouse.
Hydrolix, Tinybird, Gigapipe, LogTail...

ClickHouse, Inc.

What Is ClickHouse, Inc?

International, independent company.

Headquarters in Amsterdam.

Distributed team. We are hiring in Europe, Asia-Pacific and all over the world.

ClickHouse, Inc

Aaron Katz, CEO - ex. Elastic, Salesforce.

Alexey Milovidov, CTO - ex. Yandex.

Yuri Izrailevsky, President of Engineering - ex. Google, Netflix.




...

What's New In ClickHouse

Season 2021/2022

User Defined Functions and Types

1. SQL User Defined Functions (since 21.10)

CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2; CREATE FUNCTION getRegionName AS (x) -> dictGet('region', 'name', x);

2. Scriptable User Defined Functions (since 21.11)

<function> <type>executable</type> <name>myPlus</name> <return_type>UInt64</return_type> <argument><type>UInt64</type></argument> <argument><type>UInt64</type></argument> <format>TabSeparated</format> <command>awk '{ print $1 + $2 }'</command> <execute_direct>0</execute_direct> </function>

Support For Semistructured Data

JSON data type:

CREATE TABLE games (data JSON) ENGINE = MergeTree;

You can insert arbitrary nested JSONs.

Types are automatically inferred on INSERT and merge.

Data is stored in columnar format: columns and subcolumns.

Query nested data naturally.

Support For Semistructured Data

Example: NBA games dataset

CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple(); SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;

— 0.520 sec.

CREATE TABLE games (data JSON) ENGINE = MergeTree; SELECT data.teams.name[1] FROM games;

— 0.015 sec.

Support For Semistructured Data

DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1 name: data type: Tuple( <-- inferred type `_id.$oid` String, `date.$date` String, `teams.abbreviation` Array(String), `teams.city` Array(String), `teams.home` Array(UInt8), `teams.name` Array(String), `teams.players.ast` Array(Array(Int8)), `teams.players.blk` Array(Array(Int8)), `teams.players.drb` Array(Array(Int8)), `teams.players.fg` Array(Array(Int8)), `teams.players.fg3` Array(Array(Int8)), `teams.players.fg3_pct` Array(Array(String)), `teams.players.fg3a` Array(Array(Int8)), `teams.players.fg_pct` Array(Array(String)), `teams.players.fga` Array(Array(Int8)), `teams.players.ft` Array(Array(Int8)), `teams.players.ft_pct` Array(Array(String)), `teams.players.fta` Array(Array(Int8)), `teams.players.mp` Array(Array(String)), `teams.players.orb` Array(Array(Int8)), `teams.players.pf` Array(Array(Int8)), `teams.players.player` Array(Array(String)), `teams.players.plus_minus` Array(Array(String)), `teams.players.pts` Array(Array(Int8)), `teams.players.stl` Array(Array(Int8)), `teams.players.tov` Array(Array(Int8)), `teams.players.trb` Array(Array(Int8)), `teams.results.ast` Array(Int8), `teams.results.blk` Array(Int8), `teams.results.drb` Array(Int8), `teams.results.fg` Array(Int8), `teams.results.fg3` Array(Int8), `teams.results.fg3_pct` Array(String), `teams.results.fg3a` Array(Int8), `teams.results.fg_pct` Array(String), `teams.results.fga` Array(Int16), `teams.results.ft` Array(Int8), `teams.results.ft_pct` Array(String), `teams.results.fta` Array(Int8), `teams.results.mp` Array(Int16), `teams.results.orb` Array(Int8), `teams.results.pf` Array(Int8), `teams.results.plus_minus` Array(String), `teams.results.pts` Array(Int16), `teams.results.stl` Array(Int8), `teams.results.tov` Array(Int8), `teams.results.trb` Array(Int8), `teams.score` Array(Int16), `teams.won` Array(Int8))

Support For Semistructured Data

Flexible schema.

You can have columns with strict and flexible schema in one table.

Queries work as fast as with predefined types!

Planned for Q1 2022.

Schema Inference From Formats

Before:

clickhouse-local --input-format Parquet \ --query 'SELECT * FROM table' \ --structure ' id UInt32, deleted UInt8, type String, by String, time DateTime, text String, dead UInt8, parent UInt32, poll UInt32, kids Array(UInt32), url String, score Int32, title String, parts Array(UInt32), descendants Int32'

After:

clickhouse-local --input-format Parquet \ --query 'SELECT * FROM table'

Schema Inference From Formats

1. No need to specify schema if data already contains it:
Native, Arrow, Parquet, ORC, Avro.

2. Allow to infer schema from text formats:
TSV, CSV, JSONEachRow.

It works for:

— clickhouse-local;
— url;
— file;
— s3.

Planned for Q1 2022.

Text Classification and NLP

1. Tokenization, stemming, lemmatization, synonims (21.9..21.11).
— SELECT arrayMap(word -> lemmatize(word), tokens(text)).

2. Charset and language detection (Q4 2021).

3. Semi-duplicate text search (21.1).
— min-hash and sim-hash algorithms.

ZooKeeper Will Go Away

1. clickhouse-keeper — 100% compatible protocol with ZooKeeper.

— compressed logs and snapshots;
— no issues with zxid overflow;
— no issues with large packets;
— better memory usage;
— no issues with GC and Java heap;

2. Can run embedded into clickhouse-server.

— no need for a separate process;

Preproduction stage, planned for Q4 2021.

Asynchronous INSERTs

Allow to do many small INSERTS.

From many concurrent connections.

Without Kafka and Buffer tables!

Many small INSERTs are combined together
and batched in memory.

Safe by default: every client get response when data is inserted.

Available in 21.11-testing

Support For Transactions

ClickHouse is not a transactional DBMS, isn't it?

But we need transactions to:

— do atomic INSERTs to multiple partitions;

— do atomic INSERTs to multiple tables and materialzied views;

— make multiple SELECT queries from one snapshot.

Planned for Q2 2022.

Key-Value Data Marts

... And Incremental Aggregation In Memory

Run GROUP BY query continuously.

Aggregated data is accumulated in memory.

It can be queried as a table.

The server can serve key-value requests with Redis protocol.

Example applications:
— realtime antifraud;
— user profiles and personalization.

Planned for summer 2022.

Decoupled Storage

ClickHouse over S3:

— highly concurrent asynchronous reads (21.11);

— parallel processing on multiple compute nodes (Q4);

— caching in RAM or local disk with node affinity (Q4);

— easy coordination of inserts and merges
  without ReplicatedMergeTree (Q1 2022);

Lower cost and higher throughput than EBS volumes.

Bonus: "web" disk for static hosted datasets (21.10).

What Else?

Streaming queries:
— GROUP BY and ORDER BY on time frames;
— distributed streams;
— persistent streams;
— subscription to tables and external data streams;

Advanced time series analysis.

Tamper-proof data storage.

Embedded ClickHouse.

ETL and batch jobs.

Thank You!

Please:

— Use ClickHouse.

— 💖 ClickHouse.

— Contribute to ClickHouse!