Amazing Features of ClickHouse

Amazing Features of ClickHouse

Already Covered

— SQL UDF, executable UDF;

— executable table function & dictionary source;

— full disk encryption;

— column-level encryption;

Short-Circuit Evaluation

Before:

SELECT IF(number = 0, 0, 123 % number) FROM numbers(10)

— division by zero.

SELECT * FROM numbers(10) WHERE number > 0 AND 10 % number > 0

— division by zero.

— both sides of IF, AND, OR are executed.

SELECT * FROM
(
    SELECT * FROM numbers(10)
    WHERE number > 0
)
WHERE 10 % number > 0

— division by zero.

Short-Circuit Evaluation

Now:

— everything simply works as expected.

Available in 21.9.

Developer — Pavel Kruglov.

Asynchronous INSERT Queries

Ability to do many frequent INSERTs.

From multiple parallel connections.

Without Kafka and without Buffer tables!

Multiple small INSERTs are combined together
into one batch in memory.

Inserts are reliable by default:
client receives a response when data is written to the table.

Available in 21.11-testing.
Developer: Anton Popov, Ivan Lezhankin.

ZooKeeper is Going Away!

1. clickhouse-keeper — 100% compatible with ZooKeeper
in protocol and data model.

— compressed logs and snapshots;
— no issues with zxid overflow;
— no issues with large packets;
— better memory usage;
— no issues with GC and Java heap;

2. Can be run embedded in clickhouse-server.

— no need for a separate service;

Preproduction stage, planned for Q4 2021.
Developer: Alexander Sapin.

SQL/JSON Support

JSON_EXISTS, JSON_VALUE, JSON_QUERY

SELECT JSON_QUERY( '$.array[*][0 to 2, 4]', '{"array":[[0, 1, 2, 3, 4, 5], [0, -1, -2, -3, -4, -5]]}');

Available in 21.8.
Developers — Ksenia Sumarokova, Konstantin Rudenskii, Denis Semenov.

Support For Semistructured Data

JSON data type:

CREATE TABLE games (data JSON) ENGINE = MergeTree;

You can insert arbitrary nested JSONs.

Types are automatically inferred on INSERT and merge.

Data is stored in columnar format: columns and subcolumns.

Query nested data naturally.

Support For Semistructured Data

Example: NBA games dataset

CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple(); SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;

— 0.520 sec.

CREATE TABLE games (data JSON) ENGINE = MergeTree; SELECT data.teams.name[1] FROM games;

— 0.015 sec.

Support For Semistructured Data

DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1 name: data type: Tuple( <-- inferred type `_id.$oid` String, `date.$date` String, `teams.abbreviation` Array(String), `teams.city` Array(String), `teams.home` Array(UInt8), `teams.name` Array(String), `teams.players.ast` Array(Array(Int8)), `teams.players.blk` Array(Array(Int8)), `teams.players.drb` Array(Array(Int8)), `teams.players.fg` Array(Array(Int8)), `teams.players.fg3` Array(Array(Int8)), `teams.players.fg3_pct` Array(Array(String)), `teams.players.fg3a` Array(Array(Int8)), `teams.players.fg_pct` Array(Array(String)), `teams.players.fga` Array(Array(Int8)), `teams.players.ft` Array(Array(Int8)), `teams.players.ft_pct` Array(Array(String)), `teams.players.fta` Array(Array(Int8)), `teams.players.mp` Array(Array(String)), `teams.players.orb` Array(Array(Int8)), `teams.players.pf` Array(Array(Int8)), `teams.players.player` Array(Array(String)), `teams.players.plus_minus` Array(Array(String)), `teams.players.pts` Array(Array(Int8)), `teams.players.stl` Array(Array(Int8)), `teams.players.tov` Array(Array(Int8)), `teams.players.trb` Array(Array(Int8)), `teams.results.ast` Array(Int8), `teams.results.blk` Array(Int8), `teams.results.drb` Array(Int8), `teams.results.fg` Array(Int8), `teams.results.fg3` Array(Int8), `teams.results.fg3_pct` Array(String), `teams.results.fg3a` Array(Int8), `teams.results.fg_pct` Array(String), `teams.results.fga` Array(Int16), `teams.results.ft` Array(Int8), `teams.results.ft_pct` Array(String), `teams.results.fta` Array(Int8), `teams.results.mp` Array(Int16), `teams.results.orb` Array(Int8), `teams.results.pf` Array(Int8), `teams.results.plus_minus` Array(String), `teams.results.pts` Array(Int16), `teams.results.stl` Array(Int8), `teams.results.tov` Array(Int8), `teams.results.trb` Array(Int8), `teams.score` Array(Int16), `teams.won` Array(Int8))

Support For Semistructured Data

Flexible schema.

You can have columns with strict and flexible schema in one table.

Queries work as fast as with predefined types!

Planned for Q1 2022.
Developer: Anton Popov.

Schema Inference From Formats

Before:

clickhouse-local --input-format Parquet \ --query 'SELECT * FROM table' \ --structure ' id UInt32, deleted UInt8, type String, by String, time DateTime, text String, dead UInt8, parent UInt32, poll UInt32, kids Array(UInt32), url String, score Int32, title String, parts Array(UInt32), descendants Int32'

After:

clickhouse-local --input-format Parquet \ --query 'SELECT * FROM table'

Schema Inference From Formats

1. No need to specify schema if data already contains it:
Native, Arrow, Parquet, ORC, Avro.

2. Allow to infer schema from text formats:
TSV, CSV, JSONEachRow.

It works for:

— clickhouse-local;
— url;
— file;
— s3.

Planned for Q1 2022.
Developer: Pavel Kruglov.

Text Classification and NLP

1. Tokenization, stemming, lemmatization, synonims (21.9..21.11).
— SELECT arrayMap(word -> lemmatize(word), tokens(text)).

Developer: Nikolai Degterinskiy

2. Charset and language detection (Q4 2021).

3. Semi-duplicate text search (21.1).
— min-hash and sim-hash algorithms.

Decoupled Storage

ClickHouse over S3:

— highly concurrent asynchronous reads (21.11);

— parallel processing on multiple compute nodes (Q4);

— caching in RAM or local disk with node affinity (Q4);

— easy coordination of inserts and merges
  without ReplicatedMergeTree (Q1 2022);

Lower cost and higher throughput than EBS volumes.

Bonus: "web" disk for static hosted datasets (21.10).

YAML Configuration

I heard you don't like XML...

Then ClickHouse will support YAML as an alternative for you.

logger: level: trace log: /var/log/clickhouse-server/clickhouse-server.log errorlog: /var/log/clickhouse-server/clickhouse-server.err.log size: 1000M count: 10 http_port: 8123 # Don't worry, YAML configuration tcp_port: 9000 # will also contain comments. mysql_port: 9004 # postgresql_port: 9005 # It's not JSON after all. max_connections: 4096

Developer — Denis Bolonin.

Convenient Minor Features

Support for importing and exporting gz, xz, bz2, zst, lz4.

This works for url, file, s3, hdfs

SELECT INTO OUTFILE,
INSERT FROM INFILE.

Syntax

:: operator for type casting:

SELECT x::UInt64, ['Hello', 'world!']::Array(Nullable(String))

$heredoc$ for string literals:

SELECT $$ Hello, 'C:\WINDOWS\SYSTEM32' world! $$

Developers — Anton Popov, Maksim Kita.

Convenient Minor Features

Tons of dictionary improvements:

— array support in dictionaries;
— Nullable support in dictionaries;
— dictionaries can be created with keys only, without attributes;
— complex_key_range_hashed;
— executable_pool source;
— cache dictionary performance;
— dictGetChildren; dictGetDescendants;
— dictGetOrNull;

Developer — Maksim Kita.

— support for both IPv4/IPv6 in ip_trie;

Developer — Vladimir Cherkasov.

Convenient Minor Features

Tons of dictionary improvements:

All dictionary sources from external databases
allow specifying explicit queries:

CREATE DICTIONARY dict ... SOURCE(PostgreSQL( DB 'db' HOST 'host' PORT 5432 USER 'postgres' QUERY $$ SELECT ... FROM t1 INNER JOIN t2 USING (key) $$))

Developer — Maksim Kita.

Convenient Minor Features

Tons of JOIN improvements:

— Type casting in ON/USING clause.

— Table filters in ON clause.

— CROSS to INNER JOIN transformations and back.

— join_use_nulls support for Array, LowCardinality.

Developer — Vladimir Cherkasov.

Convenient Minor Features

Interactive mode for clickhouse-local.

INTERSECT, EXCEPT, ANY, ALL, EXISTS.

Virtual file system over static files: web disk.

ALTER ... MATERIALIZE COLUMN.

SYSTEM RESTORE REPLICA.

Convenient Minor Features

SQLite database engine.

Integration with S2 Geometry.

s3Cluster table function.

INFORMATION_SCHEMA.

Storing access rights in ZooKeeper.

Did You Notice That:

ClickHouse already supports window functions!

ClickHouse already supports geographical data types:
Point, Ring, Polygon, MultiPolygon and functions:
polygonsIntersection, polygonArea, polygonsWithin, ...

MaterializedMySQL — change data capture from PostgreSQL.

UI Improvements

OPTIONS HTTP request support,

— so Grafana and Observable work without proxy.

Play UI improvements:

— graphs for EXPLAIN;
— sparklines;
— links;

What's Next

Support For Transactions

ClickHouse is not a transactional DBMS, isn't it?

But we need transactions to:

— do atomic INSERTs to multiple partitions;

— do atomic INSERTs to multiple tables and materialzied views;

— make multiple SELECT queries from one snapshot.

Planned for Q2 2022
Developer — Alexander Tokmakov.

Key-Value Data Marts

... And Incremental Aggregation In Memory

Run GROUP BY query continuously.

Aggregated data is accumulated in memory.

It can be queried as a table.

The server can serve key-value requests with Redis protocol.

Example applications:
— realtime antifraud;
— user profiles and personalization.

Planned for summer 2022.

Resource Pools

And workload isolation.

Planned for summer 2022.

.