ClickHouse: Release 23.12 Webinar

ClickHouse
Release 23.12

Release 23.12 Webinar

1. (50 min) What's new in ClickHouse 23.12.

2. (10 min) Q&A.

Release 23.12

ClickHouse New Year Release.

β€” 21 new feature 🎁

β€” 18 performance optimizations πŸ›·

β€” 37 bug fixes β›„

Small Treats

ORDER BY ALL

Sort by all selected columns:

SELECT ... FROM table ORDER BY ALL

SELECT ... FROM table ORDER BY ALL DESC

SELECT ... FROM table ORDER BY ALL DESC COLLATE 'tr'

Developer: ZhongYuanKai 🎁.

PASTE JOIN

SELECT * FROM ( SELECT toYear(date), medianExact(price) AS m, bar(m, 0, max(m) OVER (), 20) AS b FROM uk_price_paid GROUP BY ALL ORDER BY ALL ASC ) AS t PASTE JOIN ( SELECT * FROM numbers(100) ) AS u

Demo

Developer: Yarik Briukhovetskyi πŸ₯³.

SQID

Short Unique Identifiers From Numbers

:) SELECT sqid(123); UKk :) SELECT sqid(123456789); rDOOFW :) SELECT sqid(123, 456, 789); eVH6til6J

Developers: AwakeLJW, Robert Schulze πŸŽ„.

seriesPeriodDetectFFT

Find the main frequency of a signal.

Demo.

Developer: Bhavna Jindal πŸŽ‡.

SHA-512/256

SELECT hex(SHA512('ClickHouse')) AS a, hex(substring(SHA512('ClickHouse'), 1, 32)) AS b, hex(SHA256('ClickHouse')) AS c, hex(SHA512_256('ClickHouse')) AS d FORMAT Vertical Row 1: ────── a: 668F21E61C95DE56E6D1A3424FD127B68F460A882DE39622084FB3D2E062DC59FE092BA261227BF1675C0F50824A665445FC2C5685FB63148F95F0D0A03781F1 b: 668F21E61C95DE56E6D1A3424FD127B68F460A882DE39622084FB3D2E062DC59 c: DEE91DDB90597936302CFD4341E409C87779ED9E0E8C8FDE580E0AF6CDACE179 d: F02AF5534B170E405BDA96CBB667103E7E022A92C7BF2EF07AA962C719E09350

Developer: Bharat Nallan πŸͺ….

Indices On ALIAS Columns

This was already supported: indices on materialized columns:

CREATE TABLE gis ( lat Float64, lon Float64, mercator_x Float64 MATERIALIZED (lon + 180) / 360, mercator_y Float64 MATERIALIZED 1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi(), INDEX idx_x (mercator_x) TYPE minmax, INDEX idx_y (mercator_y) TYPE minmax, PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y) ) ENGINE = MergeTree

Indices On ALIAS Columns

This was already supported: indices on functional expressions:

CREATE TABLE gis ( lat Float64, lon Float64, INDEX idx_x ((lon + 180) / 360) TYPE minmax, INDEX idx_y (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()) TYPE minmax, PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y) ) ENGINE = MergeTree

Indices On ALIAS Columns

This is supported since 23.12: indices on ALIAS columns:

CREATE TABLE gis ( lat Float64, lon Float64, mercator_x Float64 ALIAS (lon + 180) / 360, mercator_y Float64 ALIAS 1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi(), INDEX idx_x (mercator_x) TYPE minmax, INDEX idx_y (mercator_y) TYPE minmax, PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y) ) ENGINE = MergeTree

Developer: Vladimir Cherkasov πŸŽ….

Indices On ALIAS Columns

If you need indexing of location data,
you can also use H3, S2, or Geohash:

:) SELECT geoToH3(52.3676, 4.9041, 6) 605641106796838911 :) SELECT geoToS2(52.3676, 4.9041) 4402065658753719925 :) SELECT geohashEncode(52.3676, 4.9041) t0v5zsubmw2x

What else should we implement for GIS data?

APPLY DELETED MASK

Batch DELETE:

ALTER TABLE t DELETE WHERE user = 'Petya'

— a full rewrite of data parts by copying all the records
  except the deleted ones.

Lightweight DELETE:

DELETE FROM t WHERE user = 'Petya'

— a lazy deletion by creating a mask of records
  to filter out on SELECT and during merges.

ALTER TABLE t APPLY DELETED MASK

— cleans the deleted records after a lightweight delete operation,
  without the need for OPTIMIZE query.

Developer: Anton Popov β›„.

APPLY DELETED MASK

ALTER TABLE t APPLY DELETED MASK

— cleans the deleted records after a lightweight delete operation,
  without the need for OPTIMIZE query.

Essentially transforms lightweight DELETEs to a batch operation.

Bonus:

ALTER TABLE t APPLY DELETED MASK IN PARTITION p

Developer: Anton Popov β›„.

_part_offset Column

Pick up to 100 first records from every data part:

:) SELECT count(), uniq(_part) FROM github_events WHERE _part_offset < 100 β”Œβ”€count()─┬─uniq(_part)─┐ β”‚ 1700 β”‚ 17 β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ 1 row in set. Elapsed: 0.007 sec. Processed 139.09 thousand rows

Developer: Amos Bird πŸ‰.

Performance Improvements

count() For Merge Tables

Demo

Developer: Skyoct 🎊.

Memory Usage In Hash JOIN

23.11:

:) SELECT count() FROM hits AS t1 INNER JOIN hits AS t2 ON t1.URL = t2.Referer WHERE t1.URL != '' AND t2.Referer != '' Received exception from server (version 23.11.3): Memory limit (for query) exceeded: would use 98.02 GiB: While executing JoiningTransform.

Memory Usage In Hash JOIN

23.12:

:) SELECT count() FROM hits AS t1 INNER JOIN hits AS t2 ON t1.URL = t2.Referer WHERE t1.URL != '' AND t2.Referer != '' β”Œβ”€β”€β”€β”€β”€count()─┐ β”‚ 51624536761 β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ 1 row in set. Elapsed: 131.179 sec. Processed 199.99 million rows. Peak memory usage: 8.36 GiB.

51,624,536,761 joined records, and everything is fine πŸ₯‚

Developer: Vladimir Cherkasov, Liu Neng.

Faster Aggregation

Optimization for the case of identical keys spanning a block:

SELECT number DIV 100000 AS k, avg(number) AS c FROM numbers_mt(1000000000) GROUP BY k ORDER BY k LIMIT 10

Was: 101 ms, 9602551188 rows/sec.

Now: 81 ms, 11695102172 rows/sec.

Developer: Anton Popov β›„.

Faster MIN/MAX

This is a low-level optimization.
It adds batching and CPU dispatching for various instruction sets.

SELECT max(number) FROM system.numbers

Was: 1.48 billion rows/s.
Now: 3.40 billion rows/s.

SELECT max(number) FROM system.numbers_mt

Was: 61.81 billion rows/s.
Now: 69.95 billion rows/s.

Developer: Raúl Marín 🍾.

Optimizations For FINAL

FINAL - apply the merge logic on the fly on SELECT query.

Applicable for Replacing-, Collapsing-, Aggregating-, Summing-,
and other MergeTree tables.

Allows to get the behavior of unique primary key, as well as UPSERT, REPLACE, INSERT IGNORE as in OLTP databases.

Can be specified explicitly in a query:
SELECT ... FROM table FINAL
or with a setting: SET final = 1

A downside: SELECT becomes slower.

Optimizations For FINAL

Do not merge the data across different partitions
if the table's partition key consists of columns from the primary key.

Developer: Maksim Kita 🎈.

Optimizations For FINAL

Find data parts with non-intersecting ranges of the primary key
to avoid merging data from that parts.

Developer: Maksim Kita 🎈.

A few more optimizations are on the way!

Something Interesting

Refreshable Materialized Views

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name REFRESH EVERY|AFTER interval [OFFSET interval] RANDOMIZE FOR interval DEPENDS ON [db.]name [, [db.]name [, ...]] [TO [db.]name] [(columns)] [ENGINE = engine] AS SELECT ...

??? πŸ€”

Developer: Michael Kolupaev, Michael Guzov.

Refreshable Materialized Views

Run the SELECT query in background
and atomically replace the table with its result.

Run the refresh process using a flexible configured schedule.

Support a dependency graph with multiple materialized views.

Motivation:
— periodic export from external databases
  similar to dictionaries;
— periodic non-incremental calcualtions
  in contrast to the (incremental) materialized views.

Developer: Michael Kolupaev, Michael Guzov 🎁.

Refreshable Materialized Views

Demo.

Developer: Michael Kolupaev, Michael Guzov 🎁.

Bonus

Binary Map

Why ClickHouse binary is large?

Open http://localhost:8123/binary to check it.

Developer: Alexey Milovidov πŸ§‘β€πŸŽ„.

Map Of the Internet

If you liked it, check another project: map of the Internet

https://reversedns.space/

Integrations

Grafana plugin v4-beta:
— improved UX and lowered the learning curve;

Please share your feedback at
https://github.com/grafana/clickhouse-datasource/releases/tag/v4.0.0-beta

Integrations

MySQL compatibility:
substring supports the Enum data type;
date_trunc supports case-insensitive unit names;
format supports all data types;

JavaScript client:
— expose query statistics;
— support CloudFlare workers;
— recognize retriable error codes

Updates for PowerBI, Metabase, dbt, Apache Beam, Kafka Connector.

Reading Corner

https://clickhouse.com/blog/

— ClickHouse + RedPanda, by Goldsky
— A Year of ClickHouse Cloud;
— Druid Deprecation and ClickHouse Adoption at Lyft;
The Geek Narrator: Clickhouse Internals with Tom and Tyler.

Video Recordings: https://www.youtube.com/c/ClickHouseDB

Q&A