ClickHouse: Release 25.8 Call

ClickHouse Release 25.8 LTS

1. (55 min) What's new in ClickHouse 25.8.

2. (5 min) Q&A.

Release 25.8

ClickHouse LTS Release.

— 45 new features 🌻

— 47 performance optimizations

— 119 bug fixes

Small And Nice Features

Virtual Column _table Everywhere

CREATE TABLE a (s String) ORDER BY ();
CREATE TABLE b (s String) ORDER BY ();

INSERT INTO a VALUES ('Hello');
INSERT INTO b VALUES ('World');

-- This already worked:
SELECT *, _table FROM merge('^(a|b)$');

-- This is new in 25.8:
SELECT *, _table FROM a
UNION ALL
SELECT *, _table FROM b;

   ┌─s─────┬─_table─┐
1. │ Hello │ a      │
2. │ World │ b      │
   └───────┴────────┘

Developer: Xiaozhe Yu.

External Disks For Temporary Data

Queries with large GROUP BY, ORDER BY or JOIN operations may use
disk space for temporary data when it does not fit in memory.

Before version 25.8, ClickHouse could only use local disks
for temporary data.

Since 25.8, any disk type is suitable for temporary data, even S3,
which is especially useful for ephemeral, disk-less machines.

Bonus: temporary data is automatically compressed!
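
As a sketch, temporary data can be pointed at an S3 disk via a storage policy. The disk, policy, and bucket names below are made up for illustration; verify the exact configuration keys against the ClickHouse server configuration reference.

```yaml
# config.d/tmp_s3.yaml -- illustrative names, hedged sketch
storage_configuration:
    disks:
        s3_tmp_disk:
            type: s3
            endpoint: 'https://mybucket.s3.amazonaws.com/clickhouse-tmp/'
            access_key_id: '...'
            secret_access_key: '...'
    policies:
        s3_tmp_policy:
            volumes:
                main:
                    disk: s3_tmp_disk
# point temporary data at the policy above
tmp_policy: s3_tmp_policy
```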

Developer: Azat Khuzhin.

Integration With Arrow Flight 🧪

Arrow Flight — a protocol for data exchange in the Apache Arrow
format over gRPC.

Good for column-oriented databases, as it keeps the data in a columnar representation (similar to the ClickHouse native protocol).

Now ClickHouse can query other Arrow Flight data sources
with the arrowflight table function.

And ClickHouse can work as an Arrow Flight server by itself.

Developer: zakr600, Vitaly Baranov.

Integration With Arrow Flight 🧪

Arrow Flight server in ClickHouse.

$ cat config.d/flight.yaml
arrowflight_port: 6379
arrowflight:
    enable_ssl: true
    ssl_cert_file: '/etc/clickhouse-server/cert.pem'
    ssl_key_file: '/etc/clickhouse-server/key.pem'

Supports both put and get commands.

Developer: zakr600, Vitaly Baranov.

Integration With Arrow Flight 🧪

Arrow Flight client in ClickHouse.

SELECT * FROM arrowflight('localhost:6379', 'dataset');

CREATE TABLE table_name ENGINE = ArrowFlight('localhost:6379', 'dataset');

Developer: zakr600, Vitaly Baranov.

Hash Output Format

Want to quickly calculate a checksum of the query result?

:) SELECT number FROM numbers(10) FORMAT Hash

c1df5386c5a4dfee17eee0e351914c83

Returns a 128-bit, order-sensitive hash value.

Use-case: check that the query returns the same result.
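
The order sensitivity can be illustrated with a short Python sketch; this is only a model of the idea, not ClickHouse's actual Hash format algorithm:

```python
import hashlib

def result_checksum(rows):
    """Order-sensitive 128-bit checksum of a query result.
    Illustrative only: not ClickHouse's actual Hash format algorithm."""
    h = hashlib.md5()  # md5 is 128 bits, like the Hash format's output
    for row in rows:
        h.update(repr(row).encode())  # feed rows in result order
    return h.hexdigest()

# swapping two rows changes the checksum: the hash is order sensitive
assert result_checksum([(1,), (2,)]) != result_checksum([(2,), (1,)])
# identical results produce identical checksums
assert result_checksum([(1,), (2,)]) == result_checksum([(1,), (2,)])
```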

Developer: Robert Schulze.

Writes With Hive-Style Partitioning

Allows splitting data into directories representing the values of the partition key.

CREATE TABLE test (year UInt16, country String, counter UInt8)
ENGINE = AzureBlobStorage(
    account_name = 'devstoreaccount1', ...,
    format = 'Parquet',
    partition_strategy = 'hive')
PARTITION BY (year, country);

The new partition_strategy parameter allows switching between
the new 'hive' mode and the old 'wildcard' mode.

Developer: Arthur Passos.

Writes With Hive-Style Partitioning

Allows splitting data into directories representing the values of the partition key.

:) INSERT INTO test VALUES (2020, 'Zimbabwe', 1), (2021, 'Brazil', 2);

:) SELECT _path, * FROM test;

   ┌─_path───────────────────────────────────────────────────────┬─year─┬─country──┬─counter─┐
1. │ data/year=2020/country=Zimbabwe/7351305360873664512.parquet │ 2020 │ Zimbabwe │       1 │
2. │ data/year=2021/country=Brazil/7351305360894636032.parquet   │ 2021 │ Brazil   │       2 │
   └─────────────────────────────────────────────────────────────┴──────┴──────────┴─────────┘

Bonus:
— you can decide whether to write partition columns into data files;
— the hive partition_strategy also makes partition columns available in SELECTs.
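
The path convention can be sketched with a hypothetical helper (not part of ClickHouse) that mirrors the key=value directory layout shown in _path above:

```python
def hive_partition_path(prefix, partition_values, filename):
    """Build a Hive-style path: prefix/key=value/.../filename.
    Hypothetical helper for illustration; not a ClickHouse API."""
    parts = [f"{key}={value}" for key, value in partition_values]
    return "/".join([prefix, *parts, filename])

path = hive_partition_path(
    "data", [("year", 2020), ("country", "Zimbabwe")], "file.parquet")
# → data/year=2020/country=Zimbabwe/file.parquet
```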

Developer: Arthur Passos.

Protocol Compatibility

ClickHouse is a multi-protocol server. It implements:
— the native TCP protocol;
— HTTP(S) REST API; ODBC and JDBC drivers;
— the gRPC protocol;
— the Arrow Flight protocol;
— the ZooKeeper protocol;
— the MySQL wire protocol;
— the PostgreSQL wire protocol.

Now, the PostgreSQL protocol supports the COPY command,
and the MySQL protocol implements the information schema tables
that are necessary for the C# MySQL client to work with ClickHouse.

Developer: Konstantin Vedernikov.

SYSTEM RESTORE DATABASE REPLICA

A command, similar to SYSTEM RESTORE REPLICA,
but for Replicated databases.

Makes it possible to re-create the metadata in Keeper from the local state,
for the case when Keeper's data was damaged or lost.

Developer: Konstantin Morozov.

Boring Security Features

Grants For Specific URLs in S3

In 25.6 and before:

GRANT S3 ON *.* TO user

In 25.7:

GRANT READ, WRITE ON S3 TO user

In 25.8:

GRANT READ ON S3('s3://foo/.*') TO user

Allows limited usage of S3 buckets and prevents data exfiltration.

Developer: Artem Brustovetskii.

Using Custom IAM Roles With S3

SELECT * FROM s3('s3://mybucket/path.csv', CSVWithNames,
    extra_credentials(
        role_arn = 'arn:aws:iam::111111111111:role/ClickHouseAccessRole-001'))

Until recently the feature was only available in ClickHouse Cloud.
Since 25.8 it is available for all ClickHouse users.

Bonus: GCP authentication with OAuth is also available in 25.8!

Performance Improvements

Parquet Reader v3 🧪

A new, faster implementation of the Parquet reader, written from scratch!

Developer: Michael Kolupaev.

Demo

Speculative Retries For Azure

Developer: Alexander Sapin.

Speculative Retries For Azure

The problem: rare latency spikes of up to 5 sec, 10 sec, 15 sec, ...

The reason: AWS S3 and Azure are complex distributed systems with their own shenanigans.

This problem was solved for AWS and GCP a few years ago by:
— using multiple connections to multiple endpoints;
— rotating endpoints for better load distribution;
— running a second request as soon as there is a soft timeout;
— retrying many times, aggressively.
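
The soft-timeout idea can be sketched as follows; this is an illustrative Python model of hedged requests, not the actual ClickHouse implementation:

```python
import concurrent.futures as cf
import time

def hedged_get(request_fn, endpoints, soft_timeout=0.1):
    """Send a request to the first endpoint; if no response arrives within
    soft_timeout, speculatively send the same request to the next endpoint,
    and return whichever response comes back first. Illustrative sketch."""
    with cf.ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = [pool.submit(request_fn, endpoints[0])]
        for endpoint in endpoints[1:]:
            done, _ = cf.wait(futures, timeout=soft_timeout,
                              return_when=cf.FIRST_COMPLETED)
            if done:
                break  # a response arrived before the soft timeout
            futures.append(pool.submit(request_fn, endpoint))  # hedge
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

def fetch(endpoint):
    # simulated backend: one endpoint stalls, the other answers quickly
    if endpoint == "slow-endpoint":
        time.sleep(0.5)
    return f"response from {endpoint}"

print(hedged_get(fetch, ["slow-endpoint", "fast-endpoint"], soft_timeout=0.05))
# → response from fast-endpoint
```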

Speculative Retries For Azure

The problem: rare latency spikes of up to 5 sec, 10 sec, 15 sec, ...

The solution:

  1. Replacing the HTTP client in the Azure SDK with our own implementation.
  2. Reusing the same aggressive-retry logic as for AWS S3.

Results:

no more latency spikes!

Developer: Alexander Sapin.

Faster Filtering By Secondary Indices

Example:

CREATE TABLE test
(
    time DateTime,
    user UInt16,
    text String,
    INDEX ix1(user) TYPE set(1000) GRANULARITY 5,
    INDEX ix2(text) TYPE bloom_filter GRANULARITY 10
) ORDER BY time;

How to decide in which order to read and apply the indices?

Developer: Maruth Goyal.

Faster Filtering By Secondary Indices

How to decide in which order to read and apply the indices?

Since 25.8, indices are read and applied in the order of their sizes,
which means lightweight indices go first.

The sizes are measured by the compressed data on disk,
and the decision is made for each part individually.

So we have a chance to filter the data at a lower cost!
Adding a heavy index to a table does not affect queries
that filter mostly by another, lighter index.
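
The policy can be sketched as pseudologic in Python (illustrative only; the index names and sizes are hypothetical): sort the candidate indices by compressed size, apply the cheapest first, and stop early once nothing remains:

```python
def apply_indices(total_granules, indices):
    """Apply secondary indices cheapest-first and stop early when no
    granules survive. `indices` is a list of (name, compressed_bytes,
    filter_fn) tuples; filter_fn maps candidate granules to survivors.
    Illustrative pseudologic, not ClickHouse internals."""
    remaining = set(range(total_granules))
    applied = []
    for name, _, filt in sorted(indices, key=lambda ix: ix[1]):
        applied.append(name)
        remaining = filt(remaining)
        if not remaining:
            break  # nothing left to scan; heavier indices are skipped
    return applied, remaining

# a cheap set index runs before the heavy bloom filter, shrinking its input
light = ("ix1_set", 120_000, lambda g: {x for x in g if x < 2})
heavy = ("ix2_bloom", 4_000_000, lambda g: {x for x in g if x % 2 == 0})
applied, remaining = apply_indices(10, [heavy, light])
# → applied == ['ix1_set', 'ix2_bloom'], remaining == {0}
```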

Developer: Maruth Goyal.

Something Interesting

Lightweight Updates Are Beta

Full-featured UPDATE statement:

UPDATE my_table SET col1 = val1, col2 = val2, ... WHERE condition

UPDATE hits SET Title = 'Updated Title' WHERE EventDate = today();

UPDATE wikistat SET hits = hits + 1, time = now() WHERE path = 'ClickHouse';

Experimental in 25.7. Beta in 25.8.

Developer: Anton Popov.

Correlated Subqueries Are Beta

— are subqueries that depend on the columns from the outer scope.

SELECT * FROM users AS u1
WHERE EXISTS (
    SELECT * FROM users2 AS u2 WHERE u1.age = u2.age)

They can appear in many different contexts: EXISTS, IN, scalar...

In 25.4 we supported correlated subqueries inside the WHERE clause
with the EXISTS operator.

In 25.5 we added support for scalar correlated subqueries inside the WHERE clause
and for correlated subqueries inside SELECT!

Now it is enough to cover the TPC-H test suite without modifying the queries.

Developer: Dmitry Novik.

Data Lakes

Data Lakes Improvements

Support for positional and equality deletes in Iceberg.

This works with the ALTER DELETE statement.

Writes into Iceberg are supported with REST and Glue catalogs.

CREATE TABLE can create a new Iceberg table, rather than just attaching an existing one.

Support for DROP TABLE for Iceberg in REST and Glue catalogs.

Developer: Konstantin Vedernikov.

Improvements For Delta Lake

Support for writes into Delta Lake tables.

Time travel support for Delta Lake tables.

Developer: Konstantin Vedernikov.

Data Lake Catalogs Are Beta

ClickHouse supports database engines for:

— Unity catalog (since 25.3);

— REST catalog (since 24.12);

— Glue catalog (since 25.3);

— Hive Metastore catalog (since 25.5);

Now all data lake catalogs are promoted from experimental to beta.

Bonus

Integrations

MongoDB CDC is in Private Preview

Vertical scaling for streaming ClickPipes
to adjust the instance type to the workload

Azure Blob Storage connector in ClickPipes is GA

Postgres CDC: compatibility with the JSON data type

ClickHouse has the official AWS Glue connector

Improvements for the Python client: performance of Arrow DataFrames

ClickStack in ClickHouse Cloud

Meetups

— 🇨🇦 Toronto, Sept 3
— 🇮🇳 Gurgaon/Delhi, Sept 4
— 🇺🇸 Raleigh, Sept 4
— 🇺🇸 New York AI, Sept 9
— 🇮🇱 Tel Aviv, Sept 9
— 🇹🇭 Bangkok, Sept 16
— 🇦🇪 Dubai, Sept 16
— 🇺🇸 Boston, Sept 16
— 🇮🇳 Pune, Sept 20
— 🇪🇸 Madrid, Sept 30

Reading Corner 📖

https://clickhouse.com/blog/


— Comparison of LLM chat UIs
— Agentic BI in Slack with ClickHouse MCP
— Observability in Character.AI with ClickStack
— Text search in ClickHouse
— Can LLMs replace on-call SREs today?

Q&A