1. (50 min) What's new in ClickHouse 24.8 LTS.
2. (10 min) Q&A.
ClickHouse LTS Release.
— 19 new features
— 8 performance optimizations
— 65 bug fixes
In clickhouse-client and clickhouse-local:
(screenshots: running a multi-statement script in 24.7 vs. 24.8)
Works for complex SQL scripts with INSERT and SELECT queries.
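A minimal sketch of such a script (assuming the 24.8 behavior above; the table and values are illustrative):
$ clickhouse-local --query "
    CREATE TABLE t (x UInt64) ENGINE = Memory;
    INSERT INTO t VALUES (1), (2), (3);
    SELECT sum(x) FROM t;
"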
Developer: FFish.
:) SELECT _path, _etag
FROM s3('s3://ookla-open-data/parquet/performance/**')
LIMIT 10
ETag is an HTTP header identifying the resource for caching purposes.
In S3 it is typically a weak hash (MD5) of the object content
(but it can also depend on how the object was uploaded).
Implemented for S3, Azure, and HDFS.
Developer: Skyoct.
A partitioning of data into different directories,
where the object path contains sub-directories in the form key=value.
Example:
s3://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/
2019-01-01_performance_fixed_tiles.parquet
ClickHouse 24.8:
SET use_hive_partitioning = 1;
— enables virtual columns from the paths: type, year, quarter
and automatically infers their types.
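A minimal sketch of a query over the layout above (the aggregation itself is illustrative):
SET use_hive_partitioning = 1;
SELECT year, quarter, count()
FROM s3('s3://ookla-open-data/parquet/performance/**')
WHERE type = 'fixed'
GROUP BY year, quarter
ORDER BY year, quarter;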
Demo
Developer: Yarik Briukhovetskyi.
But we already had a function for that, format:
:) SELECT format('Hello, {} {}!', 123.4, 'world') AS str
┌─str─────────────────┐
│ Hello, 123.4 world! │
└─────────────────────┘
The new function, printf, is added for compatibility with Spark:
:) SELECT printf('Hello, %g %s!', 123.4, 'world') AS str
┌─str─────────────────┐
│ Hello, 123.4 world! │
└─────────────────────┘
Developer: TaiYang Li (for the Gluten project, an accelerator for Spark).
SET use_query_cache = 1, query_cache_tag = 'abc';
Queries with different query_cache_tag values are cached independently.
Use case: save certain results to the cache for later access,
without interfering with other queries.
The cache for different users is already separated by default,
and queries with different significant settings are also cached separately.
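A minimal sketch (the hits table is illustrative): the same query gets independent cache entries under different tags:
SET use_query_cache = 1;
SELECT count() FROM hits SETTINGS query_cache_tag = 'abc';
SELECT count() FROM hits SETTINGS query_cache_tag = 'def'; -- cached separately from the query above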
Developer: Sakulali.
A table in ClickHouse can contain "projections": they represent the same data in a different physical order, or as an aggregation, to automatically optimize queries by using this data.
CREATE TABLE hits (CounterID UInt32, URL String, ...
PROJECTION totals (SELECT CounterID, count(), uniq(URL) GROUP BY CounterID)
) ENGINE = MergeTree ORDER BY (CounterID, EventTime);
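For example (a sketch), an aggregation matching the projection can be served from it instead of the main data:
SELECT CounterID, count(), uniq(URL)
FROM hits
GROUP BY CounterID; -- may read from the totals projection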
What happens in the case of non-trivial background merges
in ReplacingMergeTree, AggregatingMergeTree, and similar engines,
or with OPTIMIZE TABLE ... DEDUPLICATE queries?
Before version 24.8, projections became out of sync with the main data.
Since version 24.8, it is controlled by a new table-level setting:
deduplicate_merge_projection_mode = 'throw'/'drop'/'rebuild'
Since 24.7, we also have a setting to control the behavior w.r.t. lightweight deletes: lightweight_mutation_projection_mode.
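A minimal sketch of setting the new mode (table and projection are illustrative):
CREATE TABLE events
(
    key UInt64,
    value String,
    PROJECTION p (SELECT key, count() GROUP BY key)
)
ENGINE = ReplacingMergeTree ORDER BY key
SETTINGS deduplicate_merge_projection_mode = 'rebuild'; -- rebuild projections on deduplicating merges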
Developer: ShiChao Jin.
Complex data types are represented by multiple streams of data.
Example 1: an Array is represented by array elements and array sizes.
Example 2: a Nullable is represented by the values and a null mask.
The optimization allows reading only the required streams.
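A minimal sketch (table and columns are illustrative) of queries that need only one of the streams:
-- assuming t(arr Array(String), x Nullable(UInt64)):
SELECT length(arr) FROM t; -- needs only the array sizes stream
SELECT x IS NULL FROM t;   -- needs only the null mask stream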
Demo
In version 24.8 it is enabled by default!
Developer: Anton Popov.
Join tables contain a pre-computed right-hand side for JOINs
in the form of an in-memory hash table, and they are used
to improve query performance.
Now you can do OPTIMIZE TABLE on a Join table,
and it will lower its memory usage by better packing.
Before: total_rows: 1786034200, total_bytes: 257.06 GiB
After: total_rows: 1786034200, total_bytes: 184.63 GiB
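A minimal sketch (the table definition is illustrative):
CREATE TABLE j (k UInt64, v String) ENGINE = Join(ANY, LEFT, k);
-- ... after many inserts:
OPTIMIZE TABLE j; -- repacks the in-memory hash table to use less memory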
Developer: Duc Canh Le.
If there is a large number of columns and data parts, merges can spend a lot of time scheduling the next unit of work.
In version 24.8, merges are optimized by batching small units of work.
Example: a table with 250 sparse columns:
Before: 26.0 seconds.
After: 13.2 seconds.
It is controlled by the new table-level setting merge_preferred_step_execution_time_ms,
which is already set to an optimal value by default.
Developer: Anton Popov.
If you managed to create too many tables...
Dropping them is now faster by better parallelism.
Developer: Nikita Mikhaylov.
Writing arrays into ORC is now ~15% faster.
Developer: TaiYang Li.
Background:
The "plain_rewritable" disk type allows you to create MergeTree tables
on a shared storage, such as S3, with a single writer and any number of readers,
enabling data lake scenarios for data sharing with ClickHouse.
Disks created on 24.8 and later will use an optimal data layout for S3
(consolidating metadata in a single prefix).
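A minimal sketch of such a disk (bucket URL and credentials are illustrative):
$ cat /etc/clickhouse-server/config.d/s3_plain_rewritable.yaml
storage_configuration:
    disks:
        s3_plain_rw:
            type: s3_plain_rewritable
            endpoint: 'https://mybucket.s3.amazonaws.com/clickhouse/'
            use_environment_credentials: 1
    policies:
        s3_plain_rw:
            volumes:
                main:
                    disk: s3_plain_rw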
Demo
Developer: Julia Kartseva.
The analyzer is the new infrastructure for query analysis
and optimization in ClickHouse.
It provides better compatibility and feature completeness
and enables complex query optimizations.
22.10: experimental
24.3: beta
24.8: production
It is battle-tested and enabled on all new services in ClickHouse Cloud!
The Kafka engine has existed in ClickHouse since 2017
— it implements streaming consumption and data pipelines from Kafka.
Its downside: the commit to Kafka and to ClickHouse is not atomic,
which can lead to duplicates in the case of retries.
Now there is an option to manage the offsets in Keeper:
SET allow_experimental_kafka_offsets_storage_in_keeper = 1;
CREATE TABLE ... ENGINE = Kafka(
'localhost:19092', 'topic', 'consumer', 'JSONEachRow')
SETTINGS
kafka_keeper_path = '/clickhouse/{database}/kafka',
kafka_replica_name = 'r1';
Developer: János Benjamin Antal.
With the new option it does not rely on Kafka to track the offsets,
and does it by itself with ClickHouse Keeper.
If an insertion attempt fails, it will take exactly the same chunk of data and repeat the insertion, regardless of network or server failures.
This enables deduplication and makes the consumption exactly-once.
Developer: János Benjamin Antal.
Now ClickHouse supports Prometheus protocols for remote write and read.
The new TimeSeries engine implements storage for metrics.
SET allow_experimental_time_series_table = 1;
CREATE TABLE tbl ENGINE = TimeSeries; -- the default options.
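-- or with customized engines for the inner data, tags, and metrics tables: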
CREATE TABLE tbl ENGINE = TimeSeries
DATA ENGINE = MergeTree
TAGS ENGINE = ReplacingMergeTree
METRICS ENGINE = ReplacingMergeTree;
Developer: Vitaly Baranov.
$ cat /etc/clickhouse-server/config.d/prometheus.yaml
prometheus:
    port: 8053
    handlers:
        my_rule_1:
            url: '/write'
            handler:
                type: remote_write
                database: default
                table: tbl
        my_rule_2:
            url: '/read'
            handler:
                type: remote_read
                database: default
                table: tbl
        my_rule_3:
            url: '/metrics'
            handler:
                type: expose_metrics
Developer: Vitaly Baranov.
ClickHouse listens for the Prometheus protocol
and is ready to receive metrics.
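A minimal sketch of the Prometheus side (assuming ClickHouse listens on localhost:8053 as configured above):
# prometheus.yml (fragment)
remote_write:
    - url: 'http://localhost:8053/write'
remote_read:
    - url: 'http://localhost:8053/read'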
The TimeSeries engine is simple to use, but allows many customizations:
— put some tags (e.g., hostname) into separate columns;
— adjust the table's primary key;
— adjust column types;
— ...
But there is more work to do:
— support for PromQL;
Developer: Vitaly Baranov.
SET allow_experimental_json_type = 1;
CREATE TABLE test (time DateTime, data JSON) ORDER BY time;
-- or with parameters:
data JSON(
max_dynamic_paths = N,
max_dynamic_types = M,
some.path TypeName, -- a type hint
SKIP path.to.skip, -- ignore some paths
SKIP REGEXP 'paths_regexp')
Developer: Pavel Kruglov.
How it works:
— Analyzes the JSON and infers data types for every path.
— Stores every path and every distinct type as a subcolumn.
— Up to a maximum number of paths, after which it falls back
to storing the rest of the paths together.
It enables fast column-oriented storage and queries
on arbitrary semi-structured data!
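A sketch of inspecting the inferred paths, assuming the introspection functions JSONDynamicPaths and JSONSharedDataPaths that ship with the new type:
SELECT DISTINCT arrayJoin(JSONDynamicPaths(data)) FROM test;    -- paths stored as subcolumns
SELECT DISTINCT arrayJoin(JSONSharedDataPaths(data)) FROM test; -- paths that fell back to shared storage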
Developer: Pavel Kruglov.
How to insert:
— insert with the JSONEachRow format;
— insert a string containing JSON into a column of JSON type;
— insert with the JSONAsObject format
to put the whole object into the JSON column;
— cast from String to JSON;
CREATE TABLE test (data JSON) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : {"b" : 42}, "c" : [1, 2, 3]}');
INSERT INTO test FORMAT JSONEachRow
{"data": {"a" : {"b" : 42}, "c" : [1, 2, 3]}};
SELECT data FROM test;
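The cast from String (the last option above) looks like this:
SELECT '{"a" : {"b" : 42}, "c" : [1, 2, 3]}'::JSON AS data;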
Developer: Pavel Kruglov.
How to select:
— read a certain path as a Dynamic column:
SELECT data.a AS x, toTypeName(x) FROM test;
— read a certain path and cast to the desired data type:
SELECT data.a.b::UInt32 AS x, toTypeName(x) FROM test;
— read a certain path and assume its data type:
SELECT data.a.b.:Int64 AS x, toTypeName(x) FROM test;
— read a subobject as JSON:
SELECT data.^a AS x, toTypeName(x) FROM test;
Developer: Pavel Kruglov.
Demo
Developer: Pavel Kruglov.
Continuous insertion from S3 in ClickPipes.
Support for _path, _size, _timestamp virtual columns.
Kafka connector now supports schema ids inside messages
and JSON flattening.
The clickhouse-rs Rust driver is now official!
Updates for Fivetran and dbt integrations,
as well as the Java, Python, and Node.js drivers,
Grafana, and OTel.
Thanks to our contributors for many fixes:
the4thamigo-uk, rjoelnorgren, steffen030, dmitryax, pimeys, pan3793, vrischmann, jmaicher, ShoshinNikita, Fiery-Fenix, loyd, v3xro, Defman.
— Culver City, CA, Aug 21st
— Hong Kong Happy Hour, Aug 21st
— Melbourne, Aug 21st
— Guangzhou, Aug 25th
— Bellevue, WA, Aug 27th
— Sydney, Sept 5th
— San Francisco, CA, Sept 5th
— Raleigh, NC, Sept 9th
— Toronto, Sept 9th
— New York, Sept 10th
— Coffee AMA, Amsterdam, Sept 11th
— Chicago, IL, Sept 12th
— Austin, TX, Sept 17th
I will be at some of these meetups in person :)
— ClickHouse joins the AWS Activate program;
— Cognitiv builds better ML models with ClickHouse;
— Azur Games: 120 TB easy migration to ClickHouse Cloud;
— Postgres to ClickHouse replication using PeerDB;
— CryptoHouse: Free Blockchain Analytics;
— Introduction to Ibis with ClickHouse;
— ClickHouse for ML & AI;