1. (45 min) What's new in ClickHouse 25.4.
2. (10 min) Guest Talk.
3. (5 min) Q&A.
ClickHouse Release.
— 25 new features 🌸
— 23 performance optimizations 🦋
— 58 bug fixes 🐝
For MergeTree tables.
CREATE TABLE t (
time DateTime CODEC(ZSTD(3)), -- codec on a column level
user_id UInt64, -- this column will use the default codec
...
) ORDER BY time
SETTINGS default_compression_codec = 'ZSTD(1)' -- codec on a table level
By the way, the default codec can be also specified globally in the server's configuration:
$ cat config.d/compression.yaml
compression:
case:
min_part_size: 1000000000 # Optional condition
method: 'zstd'
Developer: Gvoelfin.
# Use the specified paths for databases
# instead of a temporary location:
$ ch --path .
# The database will remain available
# for subsequent invocations.
Demo.
Developer: Alexey Milovidov.
Two new query settings: min_os_cpu_wait_time_ratio_to_throw
and max_os_cpu_wait_time_ratio_to_throw.
When the ratio between the number of threads waiting for CPU and the number of threads running on CPU is more than the min threshold, the server will start randomly refusing some queries. When it approaches the max, the server will refuse every query.
Implements a gradual degradation to prevent full DoS.
Demo.
Developers: Alexey Katsman, Alexey Milovidov.
CREATE TABLE kafka_sasl (key UInt32, value String)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka_sasl:19092',
kafka_security_protocol = 'sasl_plaintext',
kafka_sasl_mechanism = 'SCRAM-SHA-256',
kafka_sasl_username = 'Vasya',
kafka_sasl_password = 'qwerty',
kafka_topic_list = 'topic',
kafka_group_name = 'group',
kafka_format = 'JSONEachRow';
No longer necessary to edit configuration files for every new Kafka endpoint.
Developer: Christoph Wurm.
$ cat config.d/cluster.yaml
remote_servers:
default:
shard:
replica:
host: '10.0.0.2'
port: 9000
bind_host: '10.0.0.1'
Useful when a server has multiple network interfaces and distributed connections should use one.
Developer: Todd Yocum.
Instead of reading data for a column, track the information about what data should be read. Then read the data only when needed.
The column values can be carried around, filtered, but not used in calculations before the latest stages of the query pipeline.
Demo.
Developer: Xiaozhe Yu.
I couldn't believe it didn't work before...
Demo.
Developer: Dmitry Novik.
If you JOIN two MergeTree tables by their primary keys (or their prefixes),
it will create independent lanes in the query pipeline dedicated to the ranges of the primary keys, so the work is paralellized better.
Demo.
Developer: Nikolai Kochetov.
— are subqueries that depend on the columns from the outer scope.
SELECT * FROM users AS u1 WHERE EXISTS (
SELECT * FROM users2 AS u2 WHERE u1.age = u2.age)
They can appear in many different contexts: EXISTS, IN, scalar...
Now we support correlated subqueries
inside the WHERE clause with the EXISTS operator!
Demo.
Developer: Dmitry Novik.
CREATE RESOURCE cpu (MASTER THREAD, WORKER THREAD);
CREATE WORKLOAD all;
CREATE WORKLOAD admin IN all SETTINGS max_concurrent_threads = 10;
CREATE WORKLOAD production IN all SETTINGS max_concurrent_threads = 100;
CREATE WORKLOAD analytics IN production
SETTINGS max_concurrent_threads = 60, weight = 9;
CREATE WORKLOAD ingestion IN production;
SET workload = 'analytics';
CREATE RESOURCE worker_cpu (WORKER THREAD);
CREATE RESOURCE master_cpu (MASTER THREAD);
CREATE WORKLOAD all
SETTINGS max_concurrent_threads = 100 FOR worker_cpu,
max_concurrent_threads = 1000 FOR master_cpu;
Developer: Sergei Trifonov.
Allows querying tables as of a certain time in the past.
Demo
Developer: Brett Hoerner, Dan Ivanik.
Avoid reading the same metadata and manifest files.
Demo
Developer: Han Fei.
To enable partition pruning, switch to the new implementation:
SELECT *
FROM deltaLake('s3://unitycatalogdemobucket/delta_hits_partitioned/', '...', '...', SETTINGS allow_experimental_delta_kernel_rs = 1)
WHERE CounterID = 62
AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15'
GROUP BY EventDate ORDER BY EventDate
Before: 18.486 sec.
After: 4.389 sec.
Developer: Ksenia Sumarokova.
Support for Delta Lake on Azure Blob Storage.
Developer: Smita Kulkarni.
MergeTree tables on read-only disks can refresh their state and load new data parts, if they appear in the background.
This lets you run unlimited number of readers on top of externally hosted, continuously updating datasets.
Good for data sharing and publishing.
Developer: Alexey Milovidov.
-- there should be a single writer (at most):
CREATE TABLE writer (...) ORDER BY ()
SETTINGS table_disk = true,
disk = disk(
type = object_storage,
object_storage_type = s3,
endpoint = 'https://mybucket.s3.us-east-1.amazonaws.com/data/',
metadata_type = plain_rewritable);
-- there could be unlimited number of readers in any locations:
CREATE TABLE reader (...) ORDER BY ()
SETTINGS table_disk = true, refresh_parts_interval = 1,
disk = disk(
readonly = true,
type = object_storage,
object_storage_type = s3,
endpoint = 'https://mybucket.s3.us-east-1.amazonaws.com/data/',
metadata_type = plain_rewritable);
Demo
Sai Srirampur and Amogh Bharadwaj, engineers at ClickHouse.
A ton of new features for ClickPipes:
— AWS Private Link support;
— managing with ClickHouse Terraform provider;
— Prometheus metrics in the cloud endpoint;
— notifications in Email, Slack, and the UI;
— mutual TLS authentication for Kafka;
Metabase: ClickHouse integration is now shipped by default.
Updates to ODBC, JDBC, and Java client.
Try this:
Thanks to the embedded SSH server, the interactive ClickHouse experience is now available from any platform, including Windows, Android, iPad...
— 🇺🇸 Denver, April 23
— 🇺🇸 Austin, May 13
— 🇬🇧 London, May 14
— 🇹🇷 Istanbul, May 15
— 🇨🇳 Shenzhen, May 17
I will be in person on some of these meetups :)
— A year of Rust in ClickHouse 🦀
— MySQL CDC 🐬
— Query Condition Cache 🤑
— Lazy Materialization 🏖️
— Make Before Break: how we manage ClickHouse
— Observability: Dash0, Poizon, Last9 💹
— Security: Harvey 👮