Season 2021/2022
1. My favorite features of 2021.
2. The most interesting to come in 2022.
Window functions UDF Async inserts PostgreSQL integration clickhouse-local interactive SQL/JSON Full disk encryption Short circuit evaluation Big integers clickhouse-keeper Projections VFS over web server YAML configs GRPC protocol Replicated databases simhash/minhash Text tokenization Transparent compression gz/bz2/xz/zst/lz4 s3Cluster Hedged requests Optimizations with constraints Map data type INTERSECT/EXCEPTANY/ALL/EXISTSUNION DISTINCT JIT compilationDictionaries completeness
They work.
SELECT
RANK() OVER (PARTITION BY user_id ORDER BY ts),
AVG() OVER (ORDER BY ts RANGE
BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
...
Developer — Alexander Kuzmenkov. Available since 21.1.
JSON_EXISTS, JSON_VALUE, JSON_QUERY
SELECT JSON_QUERY(
'{"array":[[0, 1, 2, 3, 4, 5],
[0, -1, -2, -3, -4, -5]]}',
'$.array[*][0 to 2, 4]');
Available in 21.8.
Developers — Ksenia Sumarokova, Konstantin Rudenskii, Denis Semenov.
I heard you don't like XML...
Then ClickHouse will support YAML as an alternative for you.
logger:
level: trace
log: /var/log/clickhouse-server/clickhouse-server.log
errorlog: /var/log/clickhouse-server/clickhouse-server.err.log
size: 1000M
count: 10
http_port: 8123 # Don't worry, YAML configuration
tcp_port: 9000 # will also contain comments.
mysql_port: 9004 #
postgresql_port: 9005 # It's not JSON after all.
max_connections: 4096
Developer — Denis Bolonin.
— Full Disk Encryption:
<disk_local_encrypted>
<type>encrypted</type>
<disk>disk_local</disk>
<path>encrypted/</path>
<key from_env="DISK_ENCRYPTION_KEY" />
</disk_local_encrypted>
— Per-Column Encryption:
secret String Codec(LZ4, AES_128_GCM_SIV),
Developer: Vitaly Baranov, Arthur Filatenkov, depressed-pho.
You can query (read and write) PostgreSQL in ClickHouse queries*.
You can JOIN with PostgreSQL tables inside your queries to ClickHouse!
You can create a database that will have all the tables in PostgreSQL schema.
You can use PostgreSQL as a dictionary source.
And ClickHouse can also pretend to be PostgreSQL!
<!-- Compatibility with PostgreSQL protocol.
ClickHouse will pretend to be PostgreSQL
for applications connecting to this port.
-->
<postgresql_port>9005</postgresql_port>
* — earlier it was also available via odbc.
— PostgreSQL storage engine;
— postgresql table function; It supports shards and replicas!
SELECT * FROM postgresql(`postgres{1|2|3}:5432`,
'clickhouse', 'test_replicas', 'postgres', 'password');
— postgresql dictionary source;
— PostgreSQL database engine as a view to all tables in PG database;
Similar to MySQL integration... but for PostgreSQL.
Bonus: replication from PostgreSQL.
Developer — Ksenia Sumarokova. Available since 21.2.
— MySQL, PostgreSQL, Hive
— MongoDB, Redis, Cassandra
— Kafka, RabbitMQ
— S3, HDFS, Azure
— ODBC, JDBC
1. SQL User Defined Functions (since 21.10)
CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2;
CREATE FUNCTION getRegionName AS (x)
-> dictGet('region', 'name', x);
2. Scriptable User Defined Functions (since 21.11)
<function>
<type>executable</type>
<name>myPlus</name>
<return_type>UInt64</return_type>
<argument><type>UInt64</type></argument>
<argument><type>UInt64</type></argument>
<format>TabSeparated</format>
<command>awk '{ print $1 + $2 }'</command>
<execute_direct>0</execute_direct>
</function>
Data generation and import:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV);
— the script should return data in stdout, similar to "file" engine.
Data transformation:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV, (SELECT ...));
— the script will get source stream to stdin and return data in stdout.
Custom JOIN logic:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV, (SELECT ...), (SELECT ...), ...);
— the script will get source stream to stdin and additional streams in fd 3, 4...
Two variants:
— Executable: script is run on every query;
— ExecutablePool: a pool of persistent processes;
Table function:
SELECT * FROM
executable('my_script', TSV, 'x UInt64...');
SELECT * FROM
executable('my_script', TSV, 'x UInt64...', (SELECT ...) ...);
Applications:
— loading data from foreign data source;
— applying machine-learned models to dataset;
— custom processing, e.g. DNS resolution, whois...
Developer — Maksim Kita
Allow to do many small INSERTS.
From many concurrent connections.
Without Kafka and Buffer tables!
Many small INSERTs are combined together
and batched in memory.
Safe by default: every client get response when data is inserted.
Developer — Anton Popov. Available since 21.11
Uncomment <grpc_port>9100</grpc_port>
in config.
Feature parity with the native protocol:
— TLS, compression, query progress,
query cancellation, sessions, external data...
Example: clickhouse-client with GRPC in Python.
utils/grpc-client/clickhouse-grpc-client.py
Developer — Vitaly Baranov. Available since 21.1.
That will graduate in 2022:
— ClickHouse Keeper
— Projections
— Replicated database engine
— VFS over S3
Multiple data representations inside a single table.
— different data order;
— subset of columns;
— subset of rows;
— aggregation.
Difference to materialized views:
— projections data is always consistent;
— updated atomically with the table;
— replicated in the same way as the table;
— projection can be automatically used for SELECT query.
Developer — Amos Bird. Available since 21.6.
ClickHouse over S3:
— highly concurrent asynchronous reads (21.11);
— parallel processing on multiple compute nodes (22.1);
— caching in RAM or local disk with node affinity (Q1 2022);
— easy coordination of inserts and merges
without ReplicatedMergeTree (Q2 2022);
Lower cost and higher throughput than EBS volumes.
Bonus: "web" disk for static hosted datasets (21.10).
Already released in 22.1:
— Schema Inference
— Replicas for MPP
Was:
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native,
$$
file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
$$)
LIMIT 10
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native,
$$
file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
$$)
LIMIT 10
Now:
SELECT * FROM url(
'https://datasets.clickhouse.com/github_events_v2.native.xz') LIMIT 10
Read the structure:
DESCRIBE url('https://datasets.clickhouse.com/github_events_v2.native.xz')
Developer: Pavel Kruglov.
— automatically extracts schema from Native, Parquet, ...
— automatically derives schema from JSON, CSV, ...
— works for file/url/s3/hdfs table functions and clickhouse-local
— automatically extracts schema on creation
of Replicated, Distributed and Merge tables
— GROUPING sets
— Memory overcommit
— Transactions
— Semistructured data
— Backup and Restore
— Extended merge JOIN
— Correlated subqueries
JSON data type:
CREATE TABLE games (data JSON) ENGINE = MergeTree;
You can insert arbitrary nested JSONs.
Types are automatically inferred on INSERT and merge.
Data is stored in columnar format: columns and subcolumns.
Query nested data naturally.
Example: NBA games dataset
CREATE TABLE games (data String)
ENGINE = MergeTree ORDER BY tuple();
SELECT JSONExtractString(data, 'teams', 1, 'name')
FROM games;
— 0.520 sec.
CREATE TABLE games (data JSON)
ENGINE = MergeTree;
SELECT data.teams.name[1] FROM games;
— 0.015 sec.
DESCRIBE TABLE games
SETTINGS describe_extend_object_types = 1
name: data
type: Tuple( <-- inferred type
`_id.$oid` String,
`date.$date` String,
`teams.abbreviation` Array(String),
`teams.city` Array(String),
`teams.home` Array(UInt8),
`teams.name` Array(String),
`teams.players.ast` Array(Array(Int8)),
`teams.players.blk` Array(Array(Int8)),
`teams.players.drb` Array(Array(Int8)),
`teams.players.fg` Array(Array(Int8)),
`teams.players.fg3` Array(Array(Int8)),
`teams.players.fg3_pct` Array(Array(String)),
`teams.players.fg3a` Array(Array(Int8)),
`teams.players.fg_pct` Array(Array(String)),
`teams.players.fga` Array(Array(Int8)),
`teams.players.ft` Array(Array(Int8)),
`teams.players.ft_pct` Array(Array(String)),
`teams.players.fta` Array(Array(Int8)),
`teams.players.mp` Array(Array(String)),
`teams.players.orb` Array(Array(Int8)),
`teams.players.pf` Array(Array(Int8)),
`teams.players.player` Array(Array(String)),
`teams.players.plus_minus` Array(Array(String)),
`teams.players.pts` Array(Array(Int8)),
`teams.players.stl` Array(Array(Int8)),
`teams.players.tov` Array(Array(Int8)),
`teams.players.trb` Array(Array(Int8)),
`teams.results.ast` Array(Int8),
`teams.results.blk` Array(Int8),
`teams.results.drb` Array(Int8),
`teams.results.fg` Array(Int8),
`teams.results.fg3` Array(Int8),
`teams.results.fg3_pct` Array(String),
`teams.results.fg3a` Array(Int8),
`teams.results.fg_pct` Array(String),
`teams.results.fga` Array(Int16),
`teams.results.ft` Array(Int8),
`teams.results.ft_pct` Array(String),
`teams.results.fta` Array(Int8),
`teams.results.mp` Array(Int16),
`teams.results.orb` Array(Int8),
`teams.results.pf` Array(Int8),
`teams.results.plus_minus` Array(String),
`teams.results.pts` Array(Int16),
`teams.results.stl` Array(Int8),
`teams.results.tov` Array(Int8),
`teams.results.trb` Array(Int8),
`teams.score` Array(Int16),
`teams.won` Array(Int8))
Flexible schema.
You can have columns with strict and flexible schema in one table.
Queries work as fast as with predefined types!
Planned for Q1 2022.
ClickHouse is not a transactional DBMS, isn't it?
But we need transactions to:
— do atomic INSERTs to multiple partitions;
— do atomic INSERTs to multiple tables and materialzied views;
— make multiple SELECT queries from one snapshot.
Planned for Q2 2022.
— Lightweight DELETE
— Unique key constraint
— User Defined Data Types
— Batch jobs and MV with periodic refresh
— Key-value data marts
— CPU, IO and network priorities
— Vector search indices (ANN search)
— Streaming queries
... And Incremental Aggregation In Memory
Run GROUP BY query continuously.
Aggregated data is accumulated in memory.
It can be queried as a table.
The server can serve key-value requests with Redis protocol.
Example applications:
— realtime antifraud;
— user profiles and personalization.
Planned for summer 2022.
Streaming queries:
— GROUP BY and ORDER BY on time frames;
— distributed streams;
— persistent streams;
— subscription to tables and external data streams;
Please:
— Use ClickHouse.
— 💖 ClickHouse.
— Contribute to ClickHouse!