ClickHouse: Top Features
2021 vs 2022

What's New In ClickHouse

Season 2021/2022

1. My favorite features of 2021.

2. The most interesting features to come in 2022.

My favorite features of 2021

Window functions · UDF · Async inserts · PostgreSQL integration · clickhouse-local interactive · SQL/JSON · Full disk encryption · Short circuit evaluation · Big integers · clickhouse-keeper · Projections · VFS over web server · YAML configs · GRPC protocol · Replicated databases · simhash/minhash · Text tokenization · Transparent compression (gz/bz2/xz/zst/lz4) · s3Cluster · Hedged requests · Optimizations with constraints · Map data type · INTERSECT/EXCEPT · ANY/ALL/EXISTS · UNION DISTINCT · JIT compilation · Dictionaries completeness

Window Functions

They work.

SELECT
    RANK() OVER (PARTITION BY user_id ORDER BY ts),
    AVG(value) OVER (ORDER BY ts RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
...
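A minimal runnable sketch (column names here are illustrative; on early 21.x releases the experimental flag below may be required):

SET allow_experimental_window_functions = 1;  -- only needed on early 21.x
SELECT
    number,
    sum(number) OVER (ORDER BY number) AS running_total
FROM numbers(5);
-- returns running totals 0, 1, 3, 6, 10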

Developer — Alexander Kuzmenkov. Available since 21.1.

SQL/JSON

JSON_EXISTS, JSON_VALUE, JSON_QUERY

SELECT JSON_QUERY( '{"array":[[0, 1, 2, 3, 4, 5], [0, -1, -2, -3, -4, -5]]}', '$.array[*][0 to 2, 4]');
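And a hedged sketch of the other two functions (argument order as in the current documentation):

SELECT JSON_EXISTS('{"hello":1}', '$.hello');       -- 1
SELECT JSON_VALUE('{"hello":"world"}', '$.hello');  -- world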

Available in 21.8.
Developers — Ksenia Sumarokova, Konstantin Rudenskii, Denis Semenov.

YAML configuration

I heard you don't like XML...

Then ClickHouse will support YAML as an alternative for you.

logger:
    level: trace
    log: /var/log/clickhouse-server/clickhouse-server.log
    errorlog: /var/log/clickhouse-server/clickhouse-server.err.log
    size: 1000M
    count: 10
http_port: 8123            # Don't worry, YAML configuration
tcp_port: 9000             # will also contain comments.
mysql_port: 9004           #
postgresql_port: 9005      # It's not JSON after all.
max_connections: 4096

Developer — Denis Bolonin.

Data Encryption At Rest

— Full Disk Encryption:

<disk_local_encrypted>
    <type>encrypted</type>
    <disk>disk_local</disk>
    <path>encrypted/</path>
    <key from_env="DISK_ENCRYPTION_KEY" />
</disk_local_encrypted>

— Per-Column Encryption:

secret String CODEC(LZ4, AES_128_GCM_SIV),
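A minimal sketch of a full table definition (the table name is illustrative; the codec assumes an encryption key is configured in the server's <encryption_codecs> section):

CREATE TABLE sensitive
(
    id UInt64,
    secret String CODEC(LZ4, AES_128_GCM_SIV)  -- compressed, then encrypted
)
ENGINE = MergeTree
ORDER BY id;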

Developers — Vitaly Baranov, Arthur Filatenkov, depressed-pho.

Native Integration With PostgreSQL

You can query (read and write) PostgreSQL in ClickHouse queries*.

You can JOIN with PostgreSQL tables inside your queries to ClickHouse!

You can create a database that exposes all the tables of a PostgreSQL schema.

You can use PostgreSQL as a dictionary source.

And ClickHouse can also pretend to be PostgreSQL!

<!-- Compatibility with PostgreSQL protocol.
     ClickHouse will pretend to be PostgreSQL
     for applications connecting to this port. -->
<postgresql_port>9005</postgresql_port>

* — earlier it was also available via ODBC.

Native Integration With PostgreSQL

PostgreSQL storage engine;

postgresql table function; it supports shards and replicas!

SELECT * FROM postgresql('postgres{1|2|3}:5432', 'clickhouse', 'test_replicas', 'postgres', 'password');

postgresql dictionary source;

PostgreSQL database engine as a view to all tables in PG database;

Similar to MySQL integration... but for PostgreSQL.

Bonus: replication from PostgreSQL.

Developer — Ksenia Sumarokova. Available since 21.2.
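A minimal sketch of the database engine (host and credentials are placeholders mirroring the example above):

CREATE DATABASE pg_db
ENGINE = PostgreSQL('postgres1:5432', 'clickhouse', 'postgres', 'password');
SHOW TABLES FROM pg_db;  -- exposes the tables of the PostgreSQL schema
-- the replication bonus uses the experimental MaterializedPostgreSQL database engine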

What integrations do we have?

— MySQL, PostgreSQL, Hive

— MongoDB, Redis, Cassandra

— Kafka, RabbitMQ

— S3, HDFS, Azure

— ODBC, JDBC

User Defined Functions

1. SQL User Defined Functions (since 21.10)

CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2;
CREATE FUNCTION getRegionName AS (x) -> dictGet('region', 'name', x);
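Once created, they are called like built-in functions:

SELECT ch(1);  -- cosh(1) ≈ 1.5430806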

2. Scriptable User Defined Functions (since 21.11)

<function>
    <type>executable</type>
    <name>myPlus</name>
    <return_type>UInt64</return_type>
    <argument><type>UInt64</type></argument>
    <argument><type>UInt64</type></argument>
    <format>TabSeparated</format>
    <command>awk '{ print $1 + $2 }'</command>
    <execute_direct>0</execute_direct>
</function>
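Assuming this definition is loaded by the server (e.g. from a *_function.xml file picked up by the executable-functions config), the function is called like any other:

SELECT myPlus(3, 4);  -- 7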

Scriptable Tables

Data generation and import:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV);

— the script should write data to stdout, similar to the "file" engine.

Data transformation:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV, (SELECT ...));

— the script receives the source stream on stdin and writes data to stdout.

Custom JOIN logic:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV, (SELECT ...), (SELECT ...), ...);

— the script receives the source stream on stdin and additional streams on file descriptors 3, 4... (see the sketch below).
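Putting it together, a minimal sketch of the first variant (the script name is hypothetical; it must be placed in the server's user_scripts directory and be executable):

-- gen_numbers.sh:
--   #!/bin/bash
--   seq 1 10
CREATE TABLE numbers_from_script (x UInt64) ENGINE = Executable('gen_numbers.sh', TSV);
SELECT * FROM numbers_from_script;  -- rows 1..10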

Scriptable Tables

Two variants:

Executable: the script is run on every query;

ExecutablePool: a pool of persistent processes;

Table function:

SELECT * FROM executable('my_script', TSV, 'x UInt64...');
SELECT * FROM executable('my_script', TSV, 'x UInt64...', (SELECT ...) ...);

Scriptable Tables

Applications:

— loading data from a foreign data source;

— applying machine-learned models to a dataset;

— custom processing, e.g. DNS resolution, whois lookups...

Developer — Maksim Kita.

Asynchronous INSERTs

Allows many small INSERTs.

From many concurrent connections.

Without Kafka and Buffer tables!

Many small INSERTs are combined together
and batched in memory.

Safe by default: every client gets a response once its data is inserted.
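A minimal sketch of enabling it per-session (the table name is illustrative):

SET async_insert = 1;           -- batch small INSERTs on the server
SET wait_for_async_insert = 1;  -- acknowledge only after data is written (the safe default)
INSERT INTO events VALUES (now(), 'click');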

Developer — Anton Popov. Available since 21.11.

GRPC Protocol

Uncomment <grpc_port>9100</grpc_port> in config.

Feature parity with the native protocol:

— TLS, compression, query progress,
  query cancellation, sessions, external data...

Example: clickhouse-client with GRPC in Python.

utils/grpc-client/clickhouse-grpc-client.py

Developer — Vitaly Baranov. Available since 21.1.

Experimental Features of 2021

That will graduate in 2022:

— ClickHouse Keeper

— Projections

— Replicated database engine

— VFS over S3

Projections

Multiple data representations inside a single table.

— different data order;
— subset of columns;
— subset of rows;
— aggregation.

Difference to materialized views:

— projection data is always consistent;
— updated atomically with the table;
— replicated in the same way as the table;
— a projection can be used automatically for a SELECT query.
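A minimal sketch of adding one to a hypothetical `hits` table (on 21.x the optimizer also needs SET allow_experimental_projection_optimization = 1):

ALTER TABLE hits ADD PROJECTION by_user
(
    SELECT user_id, count() GROUP BY user_id
);
ALTER TABLE hits MATERIALIZE PROJECTION by_user;  -- rewrite existing parts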

Developer — Amos Bird. Available since 21.6.

Decoupled Storage

ClickHouse over S3:

— highly concurrent asynchronous reads (21.11);

— parallel processing on multiple compute nodes (22.1);

— caching in RAM or local disk with node affinity (Q1 2022);

— easy coordination of inserts and merges
  without ReplicatedMergeTree (Q2 2022);

Lower cost and higher throughput than EBS volumes.

Bonus: "web" disk for static hosted datasets (21.10).

Features 2022

Already released in 22.1:

— Schema Inference

— Replicas for MPP (massively parallel processing)

Table Structure Autodetection

Was:

SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native, $$ file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5) $$) LIMIT 10

Table Structure Autodetection

Now:

SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz') LIMIT 10

Read the structure:

DESCRIBE url('https://datasets.clickhouse.com/github_events_v2.native.xz')

Developer: Pavel Kruglov.

Schema Inference

— automatically extracts schema from Native, Parquet, ...

— automatically derives schema from JSON, CSV, ...

— works for file/url/s3/hdfs table functions and clickhouse-local

— automatically extracts schema on creation
  of Replicated, Distributed and Merge tables

Features to come in 2022

— GROUPING SETS

— Memory overcommit

— Transactions

— Semistructured data

— Backup and Restore

— Extended merge JOIN

— Correlated subqueries

Support For Semistructured Data

JSON data type:

CREATE TABLE games (data JSON) ENGINE = MergeTree ORDER BY tuple();

You can insert arbitrary nested JSONs.

Types are automatically inferred on INSERT and merge.

Data is stored in columnar format: columns and subcolumns.

Query nested data naturally.
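A hedged sketch of how this is expected to look (the feature is experimental; the setting name and INSERT form are assumptions and may differ by release):

SET allow_experimental_object_type = 1;  -- assumed gate for the experimental type
CREATE TABLE games (data JSON) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO games VALUES ('{"teams": [{"name": "Bulls"}, {"name": "Lakers"}]}');
SELECT data.teams.name FROM games;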

Support For Semistructured Data

Example: NBA games dataset

CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple();
SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;

— 0.520 sec.

CREATE TABLE games (data JSON) ENGINE = MergeTree ORDER BY tuple();
SELECT data.teams.name[1] FROM games;

— 0.015 sec.

Support For Semistructured Data

DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1

name: data
type: Tuple(              <-- inferred type
    `_id.$oid` String,
    `date.$date` String,
    `teams.abbreviation` Array(String),
    `teams.city` Array(String),
    `teams.home` Array(UInt8),
    `teams.name` Array(String),
    `teams.players.ast` Array(Array(Int8)),
    `teams.players.blk` Array(Array(Int8)),
    `teams.players.drb` Array(Array(Int8)),
    `teams.players.fg` Array(Array(Int8)),
    `teams.players.fg3` Array(Array(Int8)),
    `teams.players.fg3_pct` Array(Array(String)),
    `teams.players.fg3a` Array(Array(Int8)),
    `teams.players.fg_pct` Array(Array(String)),
    `teams.players.fga` Array(Array(Int8)),
    `teams.players.ft` Array(Array(Int8)),
    `teams.players.ft_pct` Array(Array(String)),
    `teams.players.fta` Array(Array(Int8)),
    `teams.players.mp` Array(Array(String)),
    `teams.players.orb` Array(Array(Int8)),
    `teams.players.pf` Array(Array(Int8)),
    `teams.players.player` Array(Array(String)),
    `teams.players.plus_minus` Array(Array(String)),
    `teams.players.pts` Array(Array(Int8)),
    `teams.players.stl` Array(Array(Int8)),
    `teams.players.tov` Array(Array(Int8)),
    `teams.players.trb` Array(Array(Int8)),
    `teams.results.ast` Array(Int8),
    `teams.results.blk` Array(Int8),
    `teams.results.drb` Array(Int8),
    `teams.results.fg` Array(Int8),
    `teams.results.fg3` Array(Int8),
    `teams.results.fg3_pct` Array(String),
    `teams.results.fg3a` Array(Int8),
    `teams.results.fg_pct` Array(String),
    `teams.results.fga` Array(Int16),
    `teams.results.ft` Array(Int8),
    `teams.results.ft_pct` Array(String),
    `teams.results.fta` Array(Int8),
    `teams.results.mp` Array(Int16),
    `teams.results.orb` Array(Int8),
    `teams.results.pf` Array(Int8),
    `teams.results.plus_minus` Array(String),
    `teams.results.pts` Array(Int16),
    `teams.results.stl` Array(Int8),
    `teams.results.tov` Array(Int8),
    `teams.results.trb` Array(Int8),
    `teams.score` Array(Int16),
    `teams.won` Array(Int8))

Support For Semistructured Data

Flexible schema.

You can have columns with strict and flexible schema in one table.

Queries work as fast as with predefined types!

Planned for Q1 2022.

Support For Transactions

ClickHouse is not a transactional DBMS, is it?

But we need transactions to:

— do atomic INSERTs to multiple partitions;

— do atomic INSERTs to multiple tables and materialized views;

— make multiple SELECT queries from one snapshot.
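A hedged sketch of the syntax under development (keywords may change before release):

BEGIN TRANSACTION;
INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);
COMMIT;  -- both inserts become visible atomically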

Planned for Q2 2022.

And more...

— Lightweight DELETE

— Unique key constraint

— User Defined Data Types

— Batch jobs and MV with periodic refresh

— Key-value data marts

— CPU, IO and network priorities

— Vector search indices (ANN search)

— Streaming queries

Key-Value Data Marts

... And Incremental Aggregation In Memory

Run GROUP BY query continuously.

Aggregated data is accumulated in memory.

It can be queried as a table.

The server can serve key-value requests with Redis protocol.

Example applications:
— realtime antifraud;
— user profiles and personalization.

Planned for summer 2022.

What Else?

Streaming queries:
— GROUP BY and ORDER BY on time frames;
— distributed streams;
— persistent streams;
— subscription to tables and external data streams;

Thank You!

Please:

— Use ClickHouse.

— 💖 ClickHouse.

— Contribute to ClickHouse!