1. (35 min) What's new in ClickHouse 22.3.
2. (25 min) Q&A.
Long-Term Support (LTS) release.
Will be supported for a year until Mar 2023.
Current LTS versions: 21.8, 22.3.
Version 21.3 is no longer supported, it's time to upgrade.
Package repository moved to JFrog's Artifactory:
— packages.clickhouse.com
Continuous integration and testing moved to AWS datacenters in the US.
Continuous integration scripts are 100% open-source:
— .github/workflows + tests/ci
We have no infrasctructure, servers or people in Russia.
Bonus: reproducible builds and public build logs:
— to get bit-identical binaries from sources.
Developers: Mikhail Shiryaev, Alexander Sapin, Alexey Milovidov.
What is faster: s3 or local SSD?
s3 or HDD? EBS volume?
What does it mean "faster"? Throughput or latency? Reads or writes? Sequential or random? Request concurrency?
Can we scan compressed data with 50GB/sec on single server?
Demo...
Developers: Ksenia Sumarokova.
Now s3 VFS is as fast as local data for hot queries!
Developers: Ksenia Sumarokova.
Note: ad-hoc metadata cache for s3/hdfs/azure and the cache for Hive tables
are obsolete and going to be removed.
s3 disk — production ready in ClickHouse 22.3.
We enabled 100% of ClickHouse functional tests with s3 disk!
Only AWS EC2 + S3 is ready for production.
GCP — known issues exist. AWS Fargate + S3 — known issues exist.
s3 cache — preview stage, configuration is going to be changed.
Zero copy replication — experimental, known issues exist.
by Maksim Kita
1. INSERT into MergeTree:
INSERT INTO test_hits_insert
SELECT UserID, WatchID, CounterID
FROM hits_100m_single
Was: 22.043 sec (4.54 million rows/s., 90.73 MB/s.)
Now: 11.268 sec (8.87 million rows/s., 177.50 MB/s.)
Two times faster!
(black magic with stable indirect pdqsort and tuple-wise radix sort)
by Maksim Kita
2. Long lists in IN:
SELECT * FROM table
WHERE key IN (111, 222, ... a megabyte of something)
Or: direct dictionary from EmbeddedRocksDB table.
Up to 3 times faster on performance tests.
INSERT INTO FUNCTION s3(name,
url = 'https://mybucket.s3.amazonaws.com/data.parquet',
'UserID UInt64, CounterID UInt32,
URL String, EventTime DateTime...')
SELECT * FROM table
Now:
INSERT INTO FUNCTION s3(name,
url = 'https://mybucket.s3.amazonaws.com/data.parquet')
SELECT * FROM table
No need to specify the table structure.
Developers: Pavel Kruglov.
Two months ago we promised to make it production ready.
The promise fulfilled.
100% functional tests are run in CI for AArch64.
Debian packages are published.
Docker containers are available.
Tested on:
APM X-Gene; Cavium ThunderX 1, 2; Raspberry Pi; Pinebook; Google Pixel; Apple M1, M1 Max; Huawei Taishan; AWS Graviton 1, 2, 3; Ampere Altra;
Developers: Mikhail Shiryaev, Alexey Milovidov, Vladimir Cherkasov.
ClickHouse CI use tons of fuzzing methods:
— LLVM's libFuzzer.
— AST based query fuzzer.
— Ad-hoc SQL function fuzzer.
— Randomization of thread scheduling order.
— Randomization of timezones.
— SQLancer — logical fuzzer.
— Stress tests.
— Jepsen tests.
Since version 22.3:
— Randomization of query settings in functional tests.
— To test correctness with combinations of settings changes.
Developer: Pavel Kruglov.
1. Automated tests for version compatibility.
Run all existing tests to create and fill tables.
Then install previous ClickHouse version on top of the data.
— to be aware of potential incompatibilities.
2. Automated tests for tests.
When a bug is fixed, we check that the new test fails on previous version.
Developers: Pavel Kruglov, Vladimir Cherkasov. Expected in version 22.4.
Passing Jepsen tests, ClickHouse functional and integration tests.
Deployed in production.
Since version 22.3:
faster than ZooKeeper on (both) reads and writes.
ClickHouse Keeper is production ready!
Optimizations by: Zhang Li Star and Alexander Sapin.
Nice to have: Prometheus endpoint;
replace ZooKeeper examples in the docs;
official tests by Kyle Kingsbury, a.k.a "Aphyr" if he will be interested.
Authentication with X.509 client certificates.
Developer: Eungenue.
A switch to disable plaintext password or no password for users.
Developer: Heena Bansal.
Filtering outgoing connections in MySQL, PostgreSQL federated queries.
Developer: Heena Bansal.
For semistructured data.
The most interesting feature in 22.3.
Available in version 22.3 for preview.
Developer: Anton Popov.
JSON data type:
CREATE TABLE games (data JSON) ENGINE = MergeTree;
You can insert arbitrary nested JSONs.
Types are automatically inferred on INSERT and merge.
Data is stored in columnar format: columns and subcolumns.
Query nested data naturally.
Example: NBA games dataset
CREATE TABLE games (data String)
ENGINE = MergeTree ORDER BY tuple();
SELECT JSONExtractString(data, 'teams', 1, 'name')
FROM games;
— 0.520 sec.
CREATE TABLE games (data JSON)
ENGINE = MergeTree;
SELECT data.teams.name[1] FROM games;
— 0.015 sec.
DESCRIBE TABLE games
SETTINGS describe_extend_object_types = 1
name: data
type: Tuple( <-- inferred type
`_id.$oid` String,
`date.$date` String,
`teams.abbreviation` Array(String),
`teams.city` Array(String),
`teams.home` Array(UInt8),
`teams.name` Array(String),
`teams.players.ast` Array(Array(Int8)),
`teams.players.blk` Array(Array(Int8)),
`teams.players.drb` Array(Array(Int8)),
`teams.players.fg` Array(Array(Int8)),
`teams.players.fg3` Array(Array(Int8)),
`teams.players.fg3_pct` Array(Array(String)),
`teams.players.fg3a` Array(Array(Int8)),
`teams.players.fg_pct` Array(Array(String)),
`teams.players.fga` Array(Array(Int8)),
`teams.players.ft` Array(Array(Int8)),
`teams.players.ft_pct` Array(Array(String)),
`teams.players.fta` Array(Array(Int8)),
`teams.players.mp` Array(Array(String)),
`teams.players.orb` Array(Array(Int8)),
`teams.players.pf` Array(Array(Int8)),
`teams.players.player` Array(Array(String)),
`teams.players.plus_minus` Array(Array(String)),
`teams.players.pts` Array(Array(Int8)),
`teams.players.stl` Array(Array(Int8)),
`teams.players.tov` Array(Array(Int8)),
`teams.players.trb` Array(Array(Int8)),
`teams.results.ast` Array(Int8),
`teams.results.blk` Array(Int8),
`teams.results.drb` Array(Int8),
`teams.results.fg` Array(Int8),
`teams.results.fg3` Array(Int8),
`teams.results.fg3_pct` Array(String),
`teams.results.fg3a` Array(Int8),
`teams.results.fg_pct` Array(String),
`teams.results.fga` Array(Int16),
`teams.results.ft` Array(Int8),
`teams.results.ft_pct` Array(String),
`teams.results.fta` Array(Int8),
`teams.results.mp` Array(Int16),
`teams.results.orb` Array(Int8),
`teams.results.pf` Array(Int8),
`teams.results.plus_minus` Array(String),
`teams.results.pts` Array(Int16),
`teams.results.stl` Array(Int8),
`teams.results.tov` Array(Int8),
`teams.results.trb` Array(Int8),
`teams.score` Array(Int16),
`teams.won` Array(Int8))
Flexible schema.
You can have columns with strict and flexible schema in one table.
Queries work as fast as with predefined types!
Demo?
ClickHouse is not a transactional DBMS, isn't it?
But we need transactions to:
— do atomic INSERTs to multiple partitions;
— do atomic INSERTs to multiple tables and materialzied views;
— make multiple SELECT queries from one snapshot.
Developer: Alexander Tokmakov. Planned for Q2 2022.
... And Incremental Aggregation In Memory
Run GROUP BY query continuously.
Aggregated data is accumulated in memory.
It can be queried as a table.
The server can serve key-value requests with Redis protocol.
Example applications:
— realtime antifraud;
— user profiles and personalization.
Developer: Anton Popov. Planned for summer 2022.
Roadmap 2022 is published:
https://github.com/ClickHouse/ClickHouse/issues/32513
— open for discussions.
... all great stuff is discussed there, don't miss.