ClickHouse: Release 23.6 Webinar


Release 23.6 Webinar

1. (50 min) What's new in ClickHouse 23.6.

2. (10 min) Q&A.

Release 23.6

ClickHouse Summer Release.

β€” 10 new features 🌞

β€” 12 performance optimizations πŸ•ΆοΈ

β€” 31 bug fixes ⛱️

All Data Types In Function transform

SELECT transform(what, [from1, from2, ...], [to1, to2, ...], default);

SELECT CASE what
           WHEN from1 THEN to1
           WHEN from2 THEN to2
           ELSE default
       END;

23.5: implemented only for numbers, strings, and date-times;
23.6: implemented for everything;

Note: from and to values are required to be constants even in 23.6.
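Since 23.6, transform should also accept previously unsupported types, such as arrays. A sketch with illustrative values (not taken from the release notes):

```sql
-- Assumed example: mapping array values, which 23.5 would reject
SELECT transform(x, [[1, 2], [3]], ['pair', 'single'], 'other')
FROM (SELECT [1, 2] AS x);
```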

Developer: Alexey Milovidov.

SELECT ... INTO OUTFILE '...' TRUNCATE

Version 23.5 introduced the APPEND modifier.
Version 23.6 adds the TRUNCATE modifier.

:) SELECT * FROM numbers(10) INTO OUTFILE 'numbers.tsv'
Ok.

:) SELECT * FROM numbers(10) INTO OUTFILE 'numbers.tsv'
Cannot open file numbers.tsv, errno: 17, strerror: File exists,
consider using APPEND or TRUNCATE. -- this is fine

:) SELECT * FROM numbers(10) INTO OUTFILE 'numbers.tsv' TRUNCATE
Ok.

Developer: Manas Alekar.

Skip Empty Files On Data Import

New settings: s3_skip_empty_files, hdfs_skip_empty_files,
engine_file_skip_empty_files, engine_url_skip_empty_files.

For what?

— empty CSV, TSV, JSONEachRow, RowBinary, and Native files
  are valid empty datasets;

— but empty Parquet or ORC files are invalid — we want to skip them.
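A sketch of how one of the new settings could be applied (the bucket URL is a placeholder):

```sql
-- Hypothetical bucket; zero-byte objects matched by the glob are skipped
SELECT count()
FROM s3('https://example-bucket.s3.amazonaws.com/data/*.parquet')
SETTINGS s3_skip_empty_files = 1;
```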

Q: Do we need a setting to skip any broken files on import?

Developer: Pavel Kruglov.

Rename Files After Processing

clickhouse-local --rename_files_after_processing '%f.processed' \
    --query "INSERT INTO table SELECT * FROM file('*.parquet')"

— to build simple data pipelines, when new files
  constantly appear in the directory;

— it can rename by arbitrary pattern and move
  into other directory, relative to the path;

— the pattern can include the timestamp of processing;

Demo

Developer: Aleksey Golub.

Rename Files After Processing

clickhouse-local --rename_files_after_processing '%f.processed' \
    --query "INSERT INTO table SELECT * FROM file('*.parquet')"

Feature requests for the next versions:

— extend it to S3 and HDFS;

— better guarantees for INSERT SELECT;

Connection String For clickhouse-client

The normal way:

clickhouse-client --host myhost.clickhouse.cloud --secure --password

Alternative way (since 23.6):

clickhouse-client 'clickhouse://user:[email protected]/database?secure'

Developer: Alexey Gerasimchuck.

session_timezone

:) SELECT now()

┌───────────────now()─┐
│ 2023-06-29 13:11:16 │
└─────────────────────┘

:) SELECT now('America/Los_Angeles')

┌─now('America/Los_Angeles')─┐
│        2023-06-29 04:11:21 │
└────────────────────────────┘

:) SET session_timezone = 'America/Los_Angeles';

:) SELECT now();

┌───────────────now()─┐
│ 2023-06-29 04:11:29 │
└─────────────────────┘

Developer: Andrey Zvonov.

Integration With MongoDB 5.1+

CREATE TABLE test_mongo ... ENGINE = MongoDB(
    'mongo1:27017', 'test', 'simple_table', 'testuser', 'clickhouse');

SELECT * FROM test_mongo;

SELECT * FROM mongodb(
    'mongo1:27017', 'test', 'simple_table', 'testuser', 'clickhouse',
    'x String, y UInt64, ...');

Was: worked perfectly before MongoDB 5.1.

But: they changed the protocol :(

Now: everything works up to MongoDB 6.

Developer: Nikolai Degterinskiy.

Performance Optimizations

"Too many parts" Is Relaxed

How many is too many?
— parts_to_delay_insert, parts_to_throw_insert;

How to perform backpressure?
— should it slow down long-running inserts?

"Too many parts" Is Relaxed

Example: inserting 400 billion records with 4 threads on a single machine,
7 million records/second, with concurrent SELECT queries for dashboards.

Decision: enable backpressure during insert
and increase the default value of max parts to 3000.
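Both thresholds remain tunable per table; a sketch with illustrative values (not the defaults):

```sql
-- Illustrative values: start delaying inserts at 1000 parts,
-- throw "Too many parts" only at 3000
CREATE TABLE events (ts DateTime, v UInt64)
ENGINE = MergeTree ORDER BY ts
SETTINGS parts_to_delay_insert = 1000, parts_to_throw_insert = 3000;
```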

Background Threads In ReplicatedMergeTree

Before 23.6:

— aggressive scheduling of background threads
(cleanup and merge selecting) was leading to
a high CPU usage and excessive requests to Keeper.

Release 23.6:

— introduce backoff for tables without inserts.

Result:

— moderate CPU usage on a server with 5000 ReplicatedMergeTree tables;
three times fewer requests to Keeper and less network traffic!

Developer: Alexander Tokmakov.

Sorting Almost Sorted Data

If the data is not known to be sorted in advance,
but turns out to be sorted within certain ranges — detect that and take advantage of it.

Demo

Developer: Maksim Kita, Alexey Milovidov.

Reading Parquet From Local Files

clickhouse-local over partitioned Parquet files improved 2 times 🔥
clickhouse-local over a single Parquet file improved 4 times ❤️‍🔥

Demo.

Developer: Michael Kolupaev. Already included in the release 23.5.

Something Interesting

Redis Tables

ClickHouse + Redis = ❀

CREATE TABLE test_redis (...) ENGINE = Redis('host:6379', 0, 'password')
PRIMARY KEY k;

SELECT * FROM test_redis;
SELECT * FROM test_redis WHERE k = 'whatever';
SELECT * FROM test_redis WHERE k IN ('a', 'b');

SELECT * FROM redis(
    'host:6379', k, 'k String, v String', 0, 'password');

Developer: Jacky Woo

Redis Tables

ClickHouse + Redis = ❀

Supports SELECT and INSERT,
SELECT by a key, set of keys, or full scan,
JOIN with Redis tables,
ALTER UPDATE and DELETE!
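A sketch of the mutation support, assuming a test_redis table with String key and value columns:

```sql
-- Assumes test_redis (k String, v String) created with the Redis engine
ALTER TABLE test_redis UPDATE v = 'updated' WHERE k = 'a';
ALTER TABLE test_redis DELETE WHERE k = 'b';
```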

What next? Support for hash data type in Redis?

Developer: Jacky Woo

A Secret Feature

It has a chance to ship in the release 23.6.

... but it is not guaranteed.

Overlay Databases

with Filesystem, S3, and HDFS database engines!

Integrations

ClickHouse is supported by Confluent Cloud:

The integration can be installed as a custom connector.

Updates for Metabase; Node.js, Python, and Go drivers.

Official Terraform provider for ClickHouse Cloud:
https://registry.terraform.io/providers/ClickHouse/clickhouse/latest

Cloud Updates

GCP

— ClickHouse Cloud is GA in GCP since Jun 20th!

Dedicated Tier

— dedicated environment;
— advanced isolation and security;
— custom compute options;
— scheduled upgrades;
— uptime SLAs;
— named lead support engineer;

"chDB is an embedded OLAP SQL engine powered by ClickHouse"

Available as a Python module, Rust, Go, Node.js, and Bun library.

Similar to clickhouse-local, but embedded into your app.

Works with local and remote datasets, and Pandas dataframes.

https://chdb.io/ Developers: Auxten, Lorenzo Mangani.

URLENG.com

"Unlimited ClickHouse URL tables. Like a pastie for SQL/JSON data."

Share your query results and small datasets
on a random server on the Internet.

No guarantees. Just a demo/experiment.

https://get.urleng.com/ Developer: Lorenzo Mangani.

Show Your Integration With ClickHouse

If you want to demonstrate your project on the ClickHouse Webinar,
write to [email protected] or connect with us on https://clickhouse.com/slack

Reading Corner

https://clickhouse.com/blog/

— Choosing the Right Join Algorithm
— Real-time event streaming with ClickHouse,
  Kafka Connect and Confluent Cloud
— CDC From Postgres To ClickHouse

ClickHouse for AI - Vectors, Embedding, Semantic Search, and more...

Video Recordings: https://www.youtube.com/c/ClickHouseDB

Summer Meetups

πŸ‡«πŸ‡· Paris, July 4th;

πŸ‡ΊπŸ‡Έ Boston, July 18th;

πŸ‡ΊπŸ‡Έ New York, July 19th;

πŸ‡¨πŸ‡¦ Toronto, July 20th;

πŸ‡ΈπŸ‡¬ Singapore, July 27th;

πŸ‡ΊπŸ‡Έ San Francisco, Aug 8th;

πŸ‡³πŸ‡± Amsterdam, Aug 31th;

Q&A