1. (50 min) What's new in ClickHouse 23.12.
2. (10 min) Q&A.
ClickHouse New Year Release.
β 21 new feature π
β 18 performance optimizations π·
β 37 bug fixes β
Sort by all selected columns:
SELECT ... FROM table ORDER BY ALL
SELECT ... FROM table ORDER BY ALL DESC
SELECT ... FROM table ORDER BY ALL DESC COLLATE 'tr'
Developer: ZhongYuanKai π.
SELECT * FROM
(
SELECT
toYear(date),
medianExact(price) AS m,
bar(m, 0, max(m) OVER (), 20) AS b
FROM uk_price_paid
GROUP BY ALL
ORDER BY ALL ASC
) AS t
PASTE JOIN
(
SELECT *
FROM numbers(100)
) AS u
Demo
Developer: Yarik Briukhovetskyi π₯³.
Short Unique Identifiers From Numbers
:) SELECT sqid(123);
UKk
:) SELECT sqid(123456789);
rDOOFW
:) SELECT sqid(123, 456, 789);
eVH6til6J
Developers: AwakeLJW, Robert Schulze π.
Find the main frequency of a signal.
Demo.
Developer: Bhavna Jindal π.
SELECT
hex(SHA512('ClickHouse')) AS a,
hex(substring(SHA512('ClickHouse'), 1, 32)) AS b,
hex(SHA256('ClickHouse')) AS c,
hex(SHA512_256('ClickHouse')) AS d
FORMAT Vertical
Row 1:
ββββββ
a: 668F21E61C95DE56E6D1A3424FD127B68F460A882DE39622084FB3D2E062DC59FE092BA261227BF1675C0F50824A665445FC2C5685FB63148F95F0D0A03781F1
b: 668F21E61C95DE56E6D1A3424FD127B68F460A882DE39622084FB3D2E062DC59
c: DEE91DDB90597936302CFD4341E409C87779ED9E0E8C8FDE580E0AF6CDACE179
d: F02AF5534B170E405BDA96CBB667103E7E022A92C7BF2EF07AA962C719E09350
Developer: Bharat Nallan πͺ .
This was already supported: indices on materialized columns:
CREATE TABLE gis
(
lat Float64,
lon Float64,
mercator_x Float64 MATERIALIZED (lon + 180) / 360,
mercator_y Float64 MATERIALIZED 1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi(),
INDEX idx_x (mercator_x) TYPE minmax,
INDEX idx_y (mercator_y) TYPE minmax,
PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y)
) ENGINE = MergeTree
This was already supported: indices on functional expressions:
CREATE TABLE gis
(
lat Float64,
lon Float64,
INDEX idx_x ((lon + 180) / 360) TYPE minmax,
INDEX idx_y (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()) TYPE minmax,
PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y)
) ENGINE = MergeTree
This is supported since 23.12: indices on ALIAS columns:
CREATE TABLE gis
(
lat Float64,
lon Float64,
mercator_x Float64 ALIAS (lon + 180) / 360,
mercator_y Float64 ALIAS 1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi(),
INDEX idx_x (mercator_x) TYPE minmax,
INDEX idx_y (mercator_y) TYPE minmax,
PRIMARY KEY mortonEncode(0xFFFFFFFF * mercator_x, 0xFFFFFFFF * mercator_y)
) ENGINE = MergeTree
Developer: Vladimir Cherkasov π .
If you need indexing of location data,
you can also use H3, S2, or Geohash:
:) SELECT geoToH3(52.3676, 4.9041, 6)
605641106796838911
:) SELECT geoToS2(52.3676, 4.9041)
4402065658753719925
:) SELECT geohashEncode(52.3676, 4.9041)
t0v5zsubmw2x
What else should we implement for GIS data?
Batch DELETE:
ALTER TABLE t DELETE WHERE user = 'Petya'
— a full rewrite of data parts by copying all the records
except the deleted ones.
Lightweight DELETE:
DELETE FROM t WHERE user = 'Petya'
— a lazy deletion by creating a mask of records
to filter out on SELECT and during merges.
ALTER TABLE t APPLY DELETED MASK
— cleans the deleted records after a lightweight delete operation,
without the need for OPTIMIZE query.
Developer: Anton Popov β.
ALTER TABLE t APPLY DELETED MASK
— cleans the deleted records after a lightweight delete operation,
without the need for OPTIMIZE query.
Essentially transforms lightweight DELETEs to a batch operation.
Bonus:
ALTER TABLE t APPLY DELETED MASK IN PARTITION p
Developer: Anton Popov β.
Pick up to 100 first records from every data part:
:) SELECT count(), uniq(_part)
FROM github_events WHERE _part_offset < 100
ββcount()ββ¬βuniq(_part)ββ
β 1700 β 17 β
βββββββββββ΄ββββββββββββββ
1 row in set. Elapsed: 0.007 sec. Processed 139.09 thousand rows
Developer: Amos Bird π.
Demo
Developer: Skyoct π.
23.11:
:) SELECT count() FROM hits AS t1 INNER JOIN hits AS t2
ON t1.URL = t2.Referer
WHERE t1.URL != '' AND t2.Referer != ''
Received exception from server (version 23.11.3):
Memory limit (for query) exceeded: would use 98.02 GiB:
While executing JoiningTransform.
23.12:
:) SELECT count() FROM hits AS t1 INNER JOIN hits AS t2
ON t1.URL = t2.Referer
WHERE t1.URL != '' AND t2.Referer != ''
ββββββcount()ββ
β 51624536761 β
βββββββββββββββ
1 row in set. Elapsed: 131.179 sec. Processed 199.99 million rows.
Peak memory usage: 8.36 GiB.
51,624,536,761 joined records, and everything is fine π₯
Developer: Vladimir Cherkasov, Liu Neng.
Optimization for the case of identical keys spanning a block:
SELECT number DIV 100000 AS k, avg(number) AS c
FROM numbers_mt(1000000000)
GROUP BY k ORDER BY k LIMIT 10
Was: 101 ms, 9602551188 rows/sec.
Now: 81 ms, 11695102172 rows/sec.
Developer: Anton Popov β.
This is a low-level optimization.
It adds batching and CPU dispatching for various instruction sets.
SELECT max(number) FROM system.numbers
Was: 1.48 billion rows/s.
Now: 3.40 billion rows/s.
SELECT max(number) FROM system.numbers_mt
Was: 61.81 billion rows/s.
Now: 69.95 billion rows/s.
Developer: RaΓΊl MarΓn πΎ.
FINAL - apply the merge logic on the fly on SELECT query.
Applicable for Replacing-, Collapsing-, Aggregating-, Summing-,
and other MergeTree tables.
Allows to get the behavior of unique primary key, as well as UPSERT, REPLACE, INSERT IGNORE as in OLTP databases.
Can be specified explicitly in a query:
SELECT ... FROM table FINAL
or with a setting: SET final = 1
A downside: SELECT becomes slower.
Do not merge the data across different partitions
if the table's partition key consists of columns from the primary key.
Developer: Maksim Kita π.
Find data parts with non-intersecting ranges of the primary key
to avoid merging data from that parts.
Developer: Maksim Kita π.
A few more optimizations are on the way!
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name
REFRESH EVERY|AFTER interval [OFFSET interval]
RANDOMIZE FOR interval
DEPENDS ON [db.]name [, [db.]name [, ...]]
[TO [db.]name] [(columns)] [ENGINE = engine]
AS SELECT ...
??? π€
Developer: Michael Kolupaev, Michael Guzov.
Run the SELECT query in background
and atomically replace the table with its result.
Run the refresh process using a flexible configured schedule.
Support a dependency graph with multiple materialized views.
Motivation:
— periodic export from external databases
similar to dictionaries;
— periodic non-incremental calcualtions
in contrast to the (incremental) materialized views.
Developer: Michael Kolupaev, Michael Guzov π.
Demo.
Developer: Michael Kolupaev, Michael Guzov π.
Why ClickHouse binary is large?
Open http://localhost:8123/binary to check it.
Developer: Alexey Milovidov π§βπ.
If you liked it, check another project: map of the Internet
Grafana plugin v4-beta:
— improved UX and lowered the learning curve;
Please share your feedback at
https://github.com/grafana/clickhouse-datasource/releases/tag/v4.0.0-beta
MySQL compatibility:
— substring supports the Enum data type;
— date_trunc supports case-insensitive unit names;
— format supports all data types;
JavaScript client:
— expose query statistics;
— support CloudFlare workers;
— recognize retriable error codes
Updates for PowerBI, Metabase, dbt, Apache Beam, Kafka Connector.
— ClickHouse + RedPanda, by Goldsky
— A Year of ClickHouse Cloud;
— Druid Deprecation and ClickHouse Adoption at Lyft;
— The Geek Narrator: Clickhouse Internals with Tom and Tyler.
Video Recordings: https://www.youtube.com/c/ClickHouseDB