ClickHouse: Release 25.5 Call

ClickHouse Release 25.5

ClickHouse release 25.5

1. (50 min) What's new in ClickHouse 25.5.

3. (10 min) Q&A.

Release 25.5

ClickHouse Spring/Summer Release.

โ€” 15 new features ๐ŸŒธ

โ€” 23 performance optimizations ๐Ÿฆ‹

โ€” 64 bug fixes ๐Ÿ

A Few Small Features

String Entropy

Two new functions:
stringBytesUniq — the number of distinct bytes in a string;
stringBytesEntropy — Shannon's entropy of the distribution of bytes;

Determine how random the string is.

SELECT round(stringBytesEntropy(randomString(10000)), 2) AS res โ”Œโ”€โ”€resโ”€โ” 1. โ”‚ 7.98 โ”‚ -- close to 8 bit per character โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Demo.

Developer: Sachin Kumar Singh.

Functions For Base32

base32Encode, base32Decode — use the RFC-4648 alphabet

:) WITH 'ClickHouse is a good database' AS s SELECT base32Encode(s), base58Encode(s), base64Encode(s) Row 1: โ”€โ”€โ”€โ”€โ”€โ”€ base32Encode(s): INWGSY3LJBXXK43FEBUXGIDBEBTW633EEBSGC5DBMJQXGZI= base58Encode(s): 44Xce2JCNL96kJGtfCgHPpV4axarmE3843VSYYAQ base64Encode(s): Q2xpY2tIb3VzZSBpcyBhIGdvb2QgZGF0YWJhc2U=

Developer: Joanna Hulboj.

SparseGrams

An extension to ngrams that deterministically adds a small amount of longer infrequent substrings to the set of ngrams.

New functions: sparseGrams, sparseGramsUTF8,
sparseGramsHashes, sparseGramsHashesUTF8.

Demo.

It is an elegant algorithm to accelerate text and code search.

Developers: Konstantin Vedernikov.

Geo Types in Parquet

Parquet files can contain a WKB-encoded geometry data.

In previous ClickHouse versions, this data was recognized as a binary string.

Since version 25.5, it is inferred as Point, LineString, MultiLineString, Polygon or MultiPolygon types in ClickHouse.

Demo.

Developer: Konstantin Vedernikov.

Implicit Table In clickhouse-local

When processing files from stdin, the FROM clause can be omitted:

ch -q "name, price + tax ORDER BY 2 DESC" < data.csv ch -q "arrayJoin(members.login), org_name" < data.json ch -q "arrayJoin(json.members.login), json.org_name " --input-format JSONAsObject < data.json

SELECT can be omitted as well.

This makes clickhouse-local a (faster and richer) replacement of jq.

Demo.

Developer: Alexey Milovidov.

TRUNCATE ALL TABLES LIKE

Now the query TRUNCATE ALL TABLES FROM db
can be clarified: TRUNCATE ALL TABLES FROM db LIKE '%pattern%'

Developer: Yarik Briukhovetskyi.

Improvements for Data Lakes

Support For Partitions By Buckets

For partition pruning of Iceberg tables.

Example: an Iceberg table in AWS Athena:

CREATE TABLE iceberg_benchmark.bucket_partitioned ( watchid bigint, javaenable int, title string, ... ) PARTITIONED BY (bucket(50, watchid)) TBLPROPERTIES ('table_type'='iceberg');

is split into 50 buckets by the hash of watchid.

Developer: Dan Ivanik.

Support For Partitions By Buckets

SELECT watchid FROM "iceberg_benchmark.bucket_partitioned" WHERE watchid = 8005433341155183383 LIMIT 10

ClickHouse 25.4: 51.2 sec.

ClickHouse 25.5: 1.3 sec.

Plus two new functions, icebergHash and icebergBucketTransform.

Developer: Dan Ivanik.

Hive Metastore Catalog

Now ClickHouse has support for every data catalog:

— 24.12: REST

— 25.3: Unity

— 25.3: Glue

— 25.5: Hive Metastore

Developer: Konstantin Vedernikov.

Support For "Version Hint"

Iceberg tables may contain a version-hint.text pointing to the actual snapshot. So it can be used instead of finding the latest metadata file.

A table-level setting, iceberg_use_version_hint to complete the existing iceberg_metadata_file_path, iceberg_metadata_table_uuid, iceberg_recent_metadata_file_by_last_updated_ms_field.

Helps using tables with concurrent updates, without the help of a data catalog*

* yes, everything around Iceberg ecosystem is complicated and brittle. Sorry about that.

Developer: Arnaud Briche.

Performance Improvements

Secondary Indices With FINAL

ClickHouse allows tables to perform data transformations during background merges.

E.g., ReplacingMergeTree does deduplication,
AggregatingMergeTree does aggregation.

If a SELECT query has the FINAL modifier after a table,
it performs merge on SELECT.

The problem: secondary indices couldn't skip ranges of data on SELECT,
because this data can affect the query result due to the merge.

Secondary Indeces With FINAL

The problem: secondary indices couldn't skip ranges of data on SELECT,
because this data can affect the query result due to the merge.

Filtering with secondary indices can be enabled with the setting:
use_skip_indexes_if_final, which allows to filter the data,
but also may skip some of the updates.

A new option in ClickHouse 25.5, use_skip_indexes_if_final_exact_mode
— takes care about all the updates by reading extra ranges after filtering.

Demo.

Developer: Shankar Iyer.

Something Interesting

Correlated Subqueries: More Types

— are subqueries that depend on the columns from the outer scope.

SELECT * FROM users AS u1 WHERE EXISTS ( SELECT * FROM users2 AS u2 WHERE u1.age = u2.age)

They can appear in many different contexts: EXISTS, IN, scalar...

In 25.4 we supported correlated subqueries inside the WHERE clause with the EXISTS operator.

In 25.5 we support scalar correlated subqueries inside the WHERE clause
and correlated subqueries inside SELECT!

Now it is enough to cover the TPC-H test suite without modifying queries.

Developer: Dmitry Novik.

Vector Search Is Beta

22.9 — experimental version

Introduced by Arthur Filatenkov, Vladimir Makarov, Danila Mishin, Nikita Vasilenko, Alexander Piachonkin, Nikita Evsiukov, Hakob Sagatelyan.

23.8 — integration with the USearch library

Introduced by Davit Vardanyan.

25.1 — faster vector indices

By Robert Schulze, Michael Kolupaev.

25.5 — beta

Developer: Shankar Iyer, Robert Schulze.

Vector Search Is Beta

Support for pre-filtering and post-filtering modes.

SET vector_search_filter_strategy = auto|postfilter|prefilter

Determines how to filter data for hybrid search.

Developer: Shankar Iyer.

Integrations

ClickPipes supports Azure Blob Storage in private preview.

dbt v1.9.0 added a microbatch strategy.
and support for column-level TTL.

Java client speeds up insertion with the new RowBinaryWriter.

Grafana integration v4.9.0 added suggestions in the SQL editor,
as well as SQL formatter.
Better support for traces: added kind, status, instrumentation library, links, events and state data.

Bonus

AgentHouse

https://llm.clickhouse.com/

A chat agent connected to ClickHouse demo datasets
using the ClickHouse MCP server.

The ClickHouse MCP server also works with your databases!

Visualizations On The Map

Example with the Foursquare dataset:

https://adsb.exposed/?dataset=Places

Meetups

— ๐Ÿ‡ฏ๐Ÿ‡ต Tokyo, June 5
— ๐Ÿ‡จ๐Ÿ‡พ Cyprus, June 10
— ๐Ÿ‡บ๐Ÿ‡ธ Washington DC, June 12
— ๐Ÿ‡ฎ๐Ÿ‡ฑ Tel Aviv, June 17
— ๐Ÿ‡บ๐Ÿ‡ธ Atlanta, July 8
— ๐Ÿ‡บ๐Ÿ‡ธ New York, July 15

I will be in person on some of these meetups :)

Reading Corner ๐Ÿ“–

https://clickhouse.com/blog/

— Dashboard In ClickHouse Cloud
— ClickHouse and Parquet
— How we made MySQL CDC
— Chartmetric — music artists analytics
— BoundaryML — how to make AI structured
— Wallarm, Flock Safety, Real Estate Analytics

Q&A