ClickHouse: Release 24.7 Webinar

ClickHouse Release 24.7

Release 24.7 Webinar

1. (45 min) What's new in ClickHouse 24.7.

2. (10 min) Guest Talks.

3. (5 min) Q&A.

Release 24.7

ClickHouse Summer Release.

— 18 new features 🏖️

— 12 performance optimizations 🍉

— 76 bug fixes 🍦

A Few New Functions

change* Functions for Date and Time

changeYear, changeMonth, changeDay,
changeHour, changeMinute, changeSecond

:) SELECT changeMonth('2024-06-30'::Date, 7)
2024-07-30

:) SELECT changeHour('2024-06-30 01:02:03'::DateTime, 7)
2024-06-30 07:02:03

:) SELECT changeMonth('2024-07-31'::Date, 6)
2024-07-01

Developer: Maksim Galkin.

groupConcat

:) SELECT groupConcat(number) FROM numbers(10)
0123456789

:) SELECT groupConcat(', ')(number) FROM numbers(10)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9

:) SELECT arrayStringConcat(groupArray(number), ', ') FROM numbers(10)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Developer: Yarik Briukhovetskyi.

A Few New System Tables

system.error_log

These tables already existed in previous versions:

Total number of error messages logged since server startup:

SELECT value FROM system.events WHERE name = 'LogError'

Total number of errors by their type since server startup:

SELECT * FROM system.errors

The number of error messages over time, stored persistently:

SELECT event_time, ProfileEvent_LogError FROM system.metric_log ORDER BY event_time

system.error_log

All error messages with full details, persistent:

SELECT * FROM system.text_log
WHERE level = 'Error' AND event_date = today()
ORDER BY event_time_microseconds

The number of error messages over time, for each type of error, persistent:

SELECT * FROM system.error_log ORDER BY event_time

Developer: Pablo Marcos.

system.detached_tables

If you ran DETACH TABLE, you will find the table's information there,
and you can ATTACH the table back.
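
For example, a minimal sketch of querying it (the exact column set may differ; is_permanently indicates DETACH ... PERMANENTLY):

SELECT database, table, uuid, metadata_path, is_permanently
FROM system.detached_tables;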

Developer: Konstantin Morozov.

More Nice Things

AzureQueue

A table engine for streaming consumption from Azure.

It is the same as S3Queue but for Azure:

CREATE TABLE queue ENGINE = AzureQueue(
    'AccountName=devstoreaccount1;AccountKey=...;BlobEndpoint=https://account.blob.core.windows.net/endpoint;',
    'test_container')
SETTINGS mode = 'unordered';

CREATE TABLE data ENGINE = MergeTree ORDER BY () EMPTY
AS SELECT * FROM queue;

CREATE MATERIALIZED VIEW consumer TO data
AS SELECT * FROM queue;

— Consumes data from a bunch of files on Azure.
— Consumes new files as soon as they appear.
— Supports parallel and distributed consumption.

Developer: Kseniia Sumarokova.

Automatic Named Tuples

If you specify aliases (AS) for elements of a tuple, it will be a named tuple.

SELECT ('Hello' AS a, 123 AS b) AS x, toTypeName(x)

24.6: Tuple(String, UInt8)
24.7: Tuple(a String, b UInt8)

SELECT ('Hello' AS a, 123 AS b) AS x FORMAT JSONEachRow

24.6: {"x":["Hello",123]}
24.7: {"x":{"a":"Hello","b":123}}

Controlled by the setting enable_named_columns_in_function_tuple.

Developer: Amos Bird.

Partitioned DeltaLake Tables

Since version 24.7,
ClickHouse supports reading of partitioned Delta Lake tables,
where partition columns are not stored inside the data files.
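
For example, a sketch of reading such a table with the deltaLake table function (the URL and the partition column name are placeholders):

SELECT count()
FROM deltaLake('https://bucket.s3.amazonaws.com/warehouse/events/')
WHERE event_date = '2024-07-01';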

Developer: Kseniia Sumarokova.

Performance Improvements

Faster Int To String Conversion

Happens when:
— you output data in a text format (TSV, JSON, etc.);
— you do a type conversion such as toString.

10..20% faster!

Demo
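
For example, a simple query that spends most of its time in integer-to-string conversion (a sketch, not the exact benchmark from the webinar):

SELECT max(length(toString(number))) FROM numbers(100000000);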

Developer: Raul Marin.

Faster Parallel Hash Join

By using runtime statistics of hash table sizes.

SET join_algorithm = 'parallel_hash';

Up to 80% faster!

Demo
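
For example, a minimal sketch of a query shape that benefits (table and column names are hypothetical):

SET join_algorithm = 'parallel_hash';

SELECT count()
FROM hits AS h
INNER JOIN users AS u ON h.UserID = u.UserID;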

Developer: Nikita Taranov.

Merge Join Algorithm For ASOF JOIN

ASOF JOIN — a JOIN method to find the closest matching record,
e.g. the closest by time, without strict equality*

* ClickHouse was the first SQL DBMS to introduce ASOF JOIN, in April 2019.

SELECT count() FROM hits h ASOF JOIN sessions s
ON h.UserID = s.UserID AND h.EventTime > s.StartTime;

SET join_algorithm = 'full_sorting_merge';

Less memory usage and often faster. Demo.

Developer: Vladimir Cherkasov.

Optimizations For Reading In Order

If a query contains ORDER BY corresponding to the table's ORDER BY key,
the "read in order" algorithm is activated
(controlled by the optimize_read_in_order setting).

CREATE TABLE hits (...) ORDER BY (CounterID, UserID, EventTime);

SELECT UserID FROM hits WHERE SearchEngineID != 0 ORDER BY CounterID;

In ClickHouse 24.7 this algorithm is up to 10x faster,
if the query uses a high-selectivity filter.

Demo.

Developer: Anton Popov.

AZ-Aware Load Balancing In ClickHouse Keeper

Keeper can tell about its availability zone:

$ cat /etc/clickhouse-keeper/config.d/az.yaml
placement:
    use_imds: 0             # Or 1 for auto-detection
    availability_zone: az2  # Arbitrary string

The server will switch to the closest Keeper node after some time:

$ cat /etc/clickhouse-server/config.d/keeper.yaml
prefer_local_availability_zone: 1
fallback_session_lifetime:
    min: 0
    max: 1
placement:                  # Similar configuration for the server
    use_imds: 1

Developer: Alexander Tokmakov.

Something Interesting

Startup Scripts

Run SQL commands at server startup.
The server accepts connections only after processing the script.

Good for infrastructure automation.

$ cat /etc/clickhouse-server/config.d/startup.yaml
startup_scripts:
    scripts:
        # Run a query:
        query: "CREATE ROLE OR REPLACE test_role"
    scripts:
        # Run a query if another query returned 1:
        query: "CREATE TABLE test (id UInt64) ENGINE = TinyLog"
        condition: "SELECT 1"

Developer: Artem Brustovetskii.

On-Disk Storage For Keeper

ClickHouse Keeper is a drop-in replacement for ZooKeeper
that is faster, uses less memory, and is more stable and easier to use.

Traditionally, distributed consensus systems,
such as ZooKeeper, ClickHouse Keeper, and etcd, keep the entire
dataset in RAM and require enough RAM to accommodate it.

What if you have to keep billions of values in ClickHouse Keeper?

On-Disk Storage For Keeper

$ cat /etc/clickhouse-keeper/config.d/on-disk.yaml
keeper_server:
    experimental_use_rocksdb: 1

At the cost of ~2..3x lower performance,
Keeper no longer requires RAM proportional to the data size.

Developer: Han Fei.

Deduplication In Materialized Views

ClickHouse supports idempotent inserts for exactly-once semantics.

For each INSERT it calculates the deduplication token as a hash of the data,
or a user can provide a custom deduplication token (insert_deduplication_token).

How does deduplication work when there is a chain of materialized views attached to a table?

24.6 and older: it didn't guarantee exactly-once insertion into the views.

24.7: it calculates a deterministic Merkle tree of the source token and its transformations... and everything works as expected!
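
For example, a sketch with a user-provided token (assuming deduplication is enabled for the table, e.g. ReplicatedMergeTree or non_replicated_deduplication_window; the table name is a placeholder):

INSERT INTO events SETTINGS insert_deduplication_token = 'batch-001' VALUES (1);

-- Retrying the same batch with the same token is a no-op,
-- and in 24.7 the attached materialized views are not double-inserted either:
INSERT INTO events SETTINGS insert_deduplication_token = 'batch-001' VALUES (1);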

Developer: Sema Checherinda.

Projections And Lightweight Deletes

A table in ClickHouse can contain "projections" — they represent the same data in a different physical order or an aggregation, to automatically optimize queries by using this data.

CREATE TABLE hits
(
    CounterID UInt32,
    URL String,
    ...
    PROJECTION totals
    (
        SELECT CounterID, count(), uniq(URL) GROUP BY CounterID
    )
)
ENGINE = MergeTree ORDER BY (CounterID, EventTime);

What happens to projections when you apply a mutation, such as DELETE?

24.6 and older: projections remain in an inconsistent state,
and deleted rows are still visible when a query uses projections.

24.7: SETTINGS lightweight_mutation_projection_mode = 'throw'/'drop';
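
For example, a sketch of choosing the 'drop' behavior before a lightweight DELETE (with 'throw', the same DELETE would fail on a table that has projections):

SET lightweight_mutation_projection_mode = 'drop';

DELETE FROM hits WHERE URL = '';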

Developer: ShiChao Jin.

Something Even More Interesting

Guest Talk

Integrations

Continuous insertion from S3 in ClickPipes.
Support for _path, _size, _timestamp virtual columns.

The Kafka connector now supports schema IDs inside messages.

Updates for Fivetran and dbt integrations,
as well as the Java, Python, and Node.js drivers,
Grafana, and OTEL.

+ Thanks to our contributors for many fixes:
the4thamigo-uk, rjoelnorgren, steffen030, dmitryax, pimeys, pan3793, vrischmann, jmaicher.

Reading Corner

https://clickhouse.com/blog/

— How to replace ZooKeeper with ClickHouse Keeper;
— ClickHouse improves AI at Braintrust;
— ClickHouse Cloud Monitoring with Prometheus;
— User-Facing Dashboard With ClickHouse and Luzmo;
— Rill replaces DuckDB with ClickHouse;

Q&A