ClickHouse: Release 22.3 Webinar

ClickHouse
Release 22.3

Release 22.3 Webinar

1. (35 min) What's new in ClickHouse 22.3.

2. (25 min) Q&A.

Release 22.3

Long-Term Support (LTS) release.

Will be supported for a year until Mar 2023.

Current LTS versions: 21.8, 22.3.

Version 21.3 is no longer supported, it's time to upgrade.

Build Infrastructure

Package repository moved to JFrog's Artifactory:
packages.clickhouse.com

Continuous integration and testing moved to AWS datacenters in the US.

Continuous integration scripts are 100% open-source:
— .github/workflows + tests/ci

We have no infrasctructure, servers or people in Russia.

Bonus: reproducible builds and public build logs:
— to get bit-identical binaries from sources.

Developers: Mikhail Shiryaev, Alexander Sapin, Alexey Milovidov.

Local Cache For Remote Filesystem

What is faster: s3 or local SSD?
s3 or HDD? EBS volume?

What does it mean "faster"? Throughput or latency? Reads or writes? Sequential or random? Request concurrency?

Can we scan compressed data with 50GB/sec on single server?

Demo...

Developers: Ksenia Sumarokova.

Local cache for remote filesystem

Now s3 VFS is as fast as local data for hot queries!

Developers: Ksenia Sumarokova.

Note: ad-hoc metadata cache for s3/hdfs/azure and the cache for Hive tables
are obsolete and going to be removed.

s3 VFS Maturity

s3 disk — production ready in ClickHouse 22.3.
We enabled 100% of ClickHouse functional tests with s3 disk!

Only AWS EC2 + S3 is ready for production.
GCP — known issues exist. AWS Fargate + S3 — known issues exist.

s3 cache — preview stage, configuration is going to be changed.

Zero copy replication — experimental, known issues exist.

Performance Optimizations

by Maksim Kita

1. INSERT into MergeTree:

INSERT INTO test_hits_insert SELECT UserID, WatchID, CounterID FROM hits_100m_single

Was: 22.043 sec (4.54 million rows/s., 90.73 MB/s.)

Now: 11.268 sec (8.87 million rows/s., 177.50 MB/s.)

Two times faster!

(black magic with stable indirect pdqsort and tuple-wise radix sort)

Performance Optimizations

by Maksim Kita

2. Long lists in IN:

SELECT * FROM table WHERE key IN (111, 222, ... a megabyte of something)

Or: direct dictionary from EmbeddedRocksDB table.

Up to 3 times faster on performance tests.

Schema Inference On INSERT

INSERT INTO FUNCTION s3(name, url = 'https://mybucket.s3.amazonaws.com/data.parquet', 'UserID UInt64, CounterID UInt32, URL String, EventTime DateTime...') SELECT * FROM table

Now:

INSERT INTO FUNCTION s3(name, url = 'https://mybucket.s3.amazonaws.com/data.parquet') SELECT * FROM table

No need to specify the table structure.

Developers: Pavel Kruglov.

ClickHouse on ARM (AArch64)

Two months ago we promised to make it production ready.

The promise fulfilled.

100% functional tests are run in CI for AArch64.

Debian packages are published.

Docker containers are available.

Tested on:
APM X-Gene; Cavium ThunderX 1, 2; Raspberry Pi; Pinebook; Google Pixel; Apple M1, M1 Max; Huawei Taishan; AWS Graviton 1, 2, 3; Ampere Altra;

Developers: Mikhail Shiryaev, Alexey Milovidov, Vladimir Cherkasov.

More Fuzzing Methods

ClickHouse CI use tons of fuzzing methods:

— LLVM's libFuzzer.
— AST based query fuzzer.
— Ad-hoc SQL function fuzzer.
— Randomization of thread scheduling order.
— Randomization of timezones.
— SQLancer — logical fuzzer.
— Stress tests.
— Jepsen tests.

Since version 22.3:
— Randomization of query settings in functional tests.
— To test correctness with combinations of settings changes.

Developer: Pavel Kruglov.

More Testing Methods

1. Automated tests for version compatibility.

Run all existing tests to create and fill tables.

Then install previous ClickHouse version on top of the data.

— to be aware of potential incompatibilities.

2. Automated tests for tests.

When a bug is fixed, we check that the new test fails on previous version.

Developers: Pavel Kruglov, Vladimir Cherkasov. Expected in version 22.4.

ClickHouse Keeper

Passing Jepsen tests, ClickHouse functional and integration tests.
Deployed in production.

Since version 22.3:
faster than ZooKeeper on (both) reads and writes.

ClickHouse Keeper is production ready!

Optimizations by: Zhang Li Star and Alexander Sapin.

Nice to have: Prometheus endpoint;
replace ZooKeeper examples in the docs;
official tests by Kyle Kingsbury, a.k.a "Aphyr" if he will be interested.

Security Features

Authentication with X.509 client certificates.

Developer: Eungenue.

A switch to disable plaintext password or no password for users.

Developer: Heena Bansal.

Filtering outgoing connections in MySQL, PostgreSQL federated queries.

Developer: Heena Bansal.

Dynamic Subcolumns

For semistructured data.

The most interesting feature in 22.3.

Available in version 22.3 for preview.

Developer: Anton Popov.

Support For Semistructured Data

JSON data type:

CREATE TABLE games (data JSON) ENGINE = MergeTree;

You can insert arbitrary nested JSONs.

Types are automatically inferred on INSERT and merge.

Data is stored in columnar format: columns and subcolumns.

Query nested data naturally.

Support For Semistructured Data

Example: NBA games dataset

CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple(); SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;

— 0.520 sec.

CREATE TABLE games (data JSON) ENGINE = MergeTree; SELECT data.teams.name[1] FROM games;

— 0.015 sec.

Support For Semistructured Data

DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1 name: data type: Tuple( <-- inferred type `_id.$oid` String, `date.$date` String, `teams.abbreviation` Array(String), `teams.city` Array(String), `teams.home` Array(UInt8), `teams.name` Array(String), `teams.players.ast` Array(Array(Int8)), `teams.players.blk` Array(Array(Int8)), `teams.players.drb` Array(Array(Int8)), `teams.players.fg` Array(Array(Int8)), `teams.players.fg3` Array(Array(Int8)), `teams.players.fg3_pct` Array(Array(String)), `teams.players.fg3a` Array(Array(Int8)), `teams.players.fg_pct` Array(Array(String)), `teams.players.fga` Array(Array(Int8)), `teams.players.ft` Array(Array(Int8)), `teams.players.ft_pct` Array(Array(String)), `teams.players.fta` Array(Array(Int8)), `teams.players.mp` Array(Array(String)), `teams.players.orb` Array(Array(Int8)), `teams.players.pf` Array(Array(Int8)), `teams.players.player` Array(Array(String)), `teams.players.plus_minus` Array(Array(String)), `teams.players.pts` Array(Array(Int8)), `teams.players.stl` Array(Array(Int8)), `teams.players.tov` Array(Array(Int8)), `teams.players.trb` Array(Array(Int8)), `teams.results.ast` Array(Int8), `teams.results.blk` Array(Int8), `teams.results.drb` Array(Int8), `teams.results.fg` Array(Int8), `teams.results.fg3` Array(Int8), `teams.results.fg3_pct` Array(String), `teams.results.fg3a` Array(Int8), `teams.results.fg_pct` Array(String), `teams.results.fga` Array(Int16), `teams.results.ft` Array(Int8), `teams.results.ft_pct` Array(String), `teams.results.fta` Array(Int8), `teams.results.mp` Array(Int16), `teams.results.orb` Array(Int8), `teams.results.pf` Array(Int8), `teams.results.plus_minus` Array(String), `teams.results.pts` Array(Int16), `teams.results.stl` Array(Int8), `teams.results.tov` Array(Int8), `teams.results.trb` Array(Int8), `teams.score` Array(Int16), `teams.won` Array(Int8))

Support For Semistructured Data

Flexible schema.

You can have columns with strict and flexible schema in one table.

Queries work as fast as with predefined types!

Demo?

What's Next?

Support For Transactions

ClickHouse is not a transactional DBMS, isn't it?

But we need transactions to:

— do atomic INSERTs to multiple partitions;

— do atomic INSERTs to multiple tables and materialzied views;

— make multiple SELECT queries from one snapshot.

Developer: Alexander Tokmakov. Planned for Q2 2022.

Key-Value Data Marts

... And Incremental Aggregation In Memory

Run GROUP BY query continuously.

Aggregated data is accumulated in memory.

It can be queried as a table.

The server can serve key-value requests with Redis protocol.

Example applications:
— realtime antifraud;
— user profiles and personalization.

Developer: Anton Popov. Planned for summer 2022.

What's Next

Roadmap 2022 is published:

https://github.com/ClickHouse/ClickHouse/issues/32513

— open for discussions.

... all great stuff is discussed there, don't miss.

Q&A