ClickHouse: Release 25.9 Call

ClickHouse Release 25.9

ClickHouse release 25.9

1. (35 min) What's new in ClickHouse 25.9.

2. (20 min) Guest Talks.

3. (5 min) Q&A.

Release 25.9

ClickHouse Autumn Release.

β€” 25 new features 🍎

β€” 22 performance optimizations 🍁

β€” 83 bug fixes 🌿

Small And Nice Features

Boolean Settings

-- 25.8: SET use_query_cache = true; -- 25.9: SET use_query_cache;

Developer: Thraeka.

isValidASCII

:) SELECT isValidASCII('Hello') β”Œβ”€isValidASCII('Hello')─┐ 1. β”‚ 1 β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ :) SELECT isValidASCII('ΠŸΡ€ΠΈΠ²Π΅Ρ‚') β”Œβ”€isValidASCII('ΠŸΡ€ΠΈΠ²Π΅Ρ‚')─┐ 1. β”‚ 0 β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ :) SELECT match('hello', '^[[:ascii:]]*$'); :) SELECT match('hello', '^[\x00-\x7F]*$'); :) SELECT arrayAll(c -> ascii(c) < 128, extractAll('hello', '.'));

Developer: Rajat Mohan.

arrayExcept

-- 25.8: :) SELECT arrayFilter(x -> x NOT IN [1, 3, 5], [1, 2, 3, 4]) AS res -- 25.9 :) SELECT arrayExcept([1, 2, 3, 4], [1, 3, 5]) AS res β”Œβ”€res───┐ 1. β”‚ [2,4] β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”˜

Developer: Joanna Hulboj.

Storage Class Specification For S3

CREATE TABLE test (s String) ENGINE = S3('s3://mybucket/test.parquet', storage_class_name = 'INTELLIGENT_TIERING'); INSERT INTO test VALUES ('Hello'); -- works with a table function as well: INSERT INTO FUNCTION s3('s3://mybucket/test.parquet', storage_class_name = 'INTELLIGENT_TIERING');

Developer: Alexander Sapin.

system.database_replicas

A new system table to introspect Replicated databases,
similar to system.replicas for Replicated tables.

:) DESCRIBE system.database_replicas β”Œβ”€name────────────────┬─type───┐ 1. β”‚ database β”‚ String β”‚ 2. β”‚ is_readonly β”‚ UInt8 β”‚ 3. β”‚ max_log_ptr β”‚ Int32 β”‚ 4. β”‚ replica_name β”‚ String β”‚ 5. β”‚ replica_path β”‚ String β”‚ 6. β”‚ zookeeper_path β”‚ String β”‚ 7. β”‚ shard_name β”‚ String β”‚ 8. β”‚ log_ptr β”‚ Int32 β”‚ 9. β”‚ total_replicas β”‚ UInt32 β”‚ 10. β”‚ zookeeper_exception β”‚ String β”‚ 11. β”‚ is_session_expired β”‚ UInt8 β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Developer: Konstantin Morozov.

Integration With Arrow Flight πŸ§ͺ

Since 25.8, ClickHouse can query other Arrow Flight data sources
with the arrowFlight table function.

And ClickHouse can work as an Arrow Flight server by itself.

New in 25.9: TLS and authentication.

-- without authentication: SELECT * FROM arrowFlight('host:port', 'dataset_name'); -- with authentication: SELECT * FROM arrowFlight( 'host:port', 'dataset_name', 'username', 'password');

Developer: Vitaly Baranov.

A New Memory Profiler

ClickHouse already has sampling and watermark memory profilers.

Now it uses its memory allocator (jemalloc) for consistent sampling of allocations and deallocations.

It is built in ClickHouse by default and can be used in runtime:

:) SET jemalloc_enable_profiler; :) SET jemalloc_collect_profile_samples_in_trace_log; -- after collecting some data :) SELECT arrayStringConcat(symbols, '\n') AS s, count() AS c, sum(size) AS size, round(size / c) AS avg_size FROM system.trace_log WHERE trace_type = 'JemallocSample' GROUP BY s ORDER BY abs(size) DESC LIMIT 20 FORMAT Vertical

Developer: Antonio Andelic.

A New Memory Profiler

:) SET jemalloc_enable_profiler; :) SET jemalloc_collect_profile_samples_in_trace_log; -- analyze the samples: :) SELECT arrayStringConcat(symbols, '\n') AS s, count() AS c, sum(size) AS size, round(size / c) AS avg_size FROM system.trace_log WHERE trace_type = 'JemallocSample' GROUP BY s ORDER BY abs(size) DESC LIMIT 20 FORMAT Vertical; -- create a heap dump: :) SYSTEM JEMALLOC FLUSH PROFILE;

Demo

Developer: Antonio Andelic.

Performance Improvements

Streaming For Secondary Indices

In 25.8: reading and analyzing the index happens before the query starts reading the data, which introduces a noticable delay when using large indices.

In 25.9: reading the indices happens together when scanning the data. Queries start faster and can finish earlier, e.g., when LIMIT is reached.

:) SET use_skip_indexes_on_data_read;

Demo

Developer: Amos Bird.

JOIN Reordering

Reordering of the JOIN graph, based on the amount of data to read
and on the column-level statistics.

:) SET query_plan_optimize_join_order_limit = 10; :) SET allow_statistics_optimize = 1;

Demo

Developer: Vladimir Cherkasov.

JOIN Reordering

Reordering of the JOIN graph, based on the amount of data to read
and on the column-level statistics.

:) SET query_plan_optimize_join_order_limit = 10; :) SET allow_statistics_optimize = 1;

Demo (TPC-H with statistics)

Developer: Vladimir Cherkasov.

JOIN Reordering

Developer: Vladimir Cherkasov.

Something Interesting

Text Index πŸ§ͺ

A full-text search index in ClickHouse!

— In development since 2022
— first prototype in 2023 (by Harry Lee and Larry Luo)
now rewritten to make it to production
— remains experimental in 25.9

Developers: Anton Popov, Elmi Ahmadov, Jimmy Aguilar Mena.

Text Index πŸ§ͺ

CREATE TABLE text_log ( message String, ... INDEX inv_idx(message) TYPE text(tokenizer = 'default') GRANULARITY 128 ) ENGINE = SharedMergeTree ORDER BY id; SELECT ... WHERE hasToken(message, 'DDLWorker'); SELECT ... WHERE searchAll(message, ['peak', 'memory']); SELECT ... WHERE searchAny(message, tokens('01442_merge_detach_attach'));

Developers: Anton Popov, Elmi Ahmadov, Jimmy Aguilar Mena.

Text Index πŸ§ͺ

Demo

Developers: Anton Popov, Elmi Ahmadov, Jimmy Aguilar Mena.

Vector Index

In development since 2021. Production-ready since 25.8.

ALTER TABLE dbpedia ADD INDEX vector_index(vector) TYPE vector_similarity( 'hnsw', 'cosineDistance', 1536, 'bf16', 64, 512); ALTER TABLE dbpedia MATERIALIZE INDEX vector_index; WITH ... AS reference_vector SELECT ... FROM table WHERE ... ORDER BY cosineDistance(vector, reference_vector) LIMIT 10;

Developer: Shankar Iyer.

Vector Search

22.9 — experimental version

Introduced by Arthur Filatenkov, Vladimir Makarov, Danila Mishin, Nikita Vasilenko, Alexander Piachonkin, Nikita Evsiukov, Hakob Sagatelyan.

23.8 — integration with the USearch library

Introduced by Davit Vardanyan.

25.1 — faster vector indices

By Robert Schulze, Michael Kolupaev.

25.5 — beta (prefiltering, postfiltering, rescoring)

Developer: Shankar Iyer, Robert Schulze.

25.8 — GA (index-only reading, fetch multiplier, binary quantization)

Developer: Shankar Iyer.

Vector Index

HNSW algorithm with quantization options (bf16, i8, b1).

Supports different filtering modes:
post-filtering (run ANN search, then apply other filters)
or pre-filtering (apply regular filters, then run ANN across filtered results).

Supports filtering multiplier
(e.g., find 100 nearest candidates to filter and return 10).

Avoids reading the data column if not necessarily.

Developer: Shankar Iyer.

Data Lakes

Data Lakes Improvements

ALTER UPDATE for Iceberg

An option for DROP TABLE for Iceberg to remove the data

system.iceberg_metadata_log

ORC and Avro for Iceberg data files

Unity catalog on Azure

Distributed INSERT SELECT for data lakes

Developers: Konstantin Vedernikov, Smita Kulkarni.

Bonus

Guest Talk

One More...

NATS JetStream

ClickHouse already supports subscribing to NATS!

JetStream — persistent queues in NATS (similar to Kafka, but simpler).

CREATE TABLE test.nats (key UInt64, value UInt64) ENGINE = NATS SETTINGS nats_url = 'nats1:4444', nats_stream = 'test_stream', nats_consumer_name = 'test_consumer', nats_subjects = 'test_subject', nats_format = 'TSV', nats_row_delimiter = '\n';

Developer: Dmitry Novikov.

Integrations

ClickPipes

Advanced Settings: full API, UI, and Terraform support
JSON arrays & multiple JSON Objects
S3 Unordered Mode: performance for high-volume S3 ingestions
Postgres CDC: added failover replication slot support and PlanetScale Metal compatibility.

Ecosystem

— Official Flink Connector
Python 0.9.0 release with SQLAlchemy 2.x support
Polars integration, and Pandas 1.x/2.x compatibility
Java client 0.9.2 targeting BI tools like Tableau and DBeaver
C++ client 2.6.0: better handling of numeric data types, external data.

Meetups

— πŸ‡ΈπŸ‡¬ Singapore, Sept 25
— πŸ‡¦πŸ‡Ί Melbourne, Sept 30
— πŸ‡―πŸ‡΅ Tokyo, Sept 30
— πŸ‡ͺπŸ‡Έ Madrid, Sept 30
— πŸ‡ͺπŸ‡Έ Barcelona, Oct 1
— πŸ‡ΊπŸ‡Έ San Francisco (Iceberg), Oct 1
— πŸ‡ΊπŸ‡Έ Seattle AI Night, Oct 2
— πŸ‡¨πŸ‡­ Zurich, Oct 9
— πŸ‡ΊπŸ‡Έ Chicago, Oct 9
— πŸ‡¬πŸ‡§ London, Oct 15

Reading Corner πŸ“–

https://clickhouse.com/blog/


— Optimizing GROUP BY with parallel replicas
— Optimizing ClickHouse for high-core Intel CPUs
— Cost-efficient Observability
— ClickStack for Next.JS applications
— Migrating data from TimescaleDB using ClickPipes
— Lyft, Exabeam, Ford, SewerAI
— Can LLMs replace on call SREs today?

Q&A