ClickHouse: Release 22.11 Webinar

ClickHouse
Release 22.11

Release 22.11 Webinar

1. (50 min) What's new in ClickHouse 22.11.

2. (10 min) Q&A.

🌀

Release 22.11

ClickHouse November release.

— 15 new features

— 5 performance optimizations

— 32 bug fixes 🔥

Composite Time Intervals

Examples:

SELECT now() + INTERVAL 1 MONTH;
SELECT now() + INTERVAL '1 MONTH';
SELECT now() + INTERVAL 1 MONTH - INTERVAL 2 DAY;
SELECT now() + INTERVAL '1 MONTH -2 DAY';
SELECT now() + (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH -2 DAY';
SELECT (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH 1 MONTH';
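
For instance, a composite interval applied to a fixed date (a hedged sketch; the expected result assumes standard month arithmetic):

SELECT toDate('2022-11-30') + INTERVAL '1 MONTH -2 DAY';
-- 2022-12-28 (one month forward, then two days back)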

Developer: Nikolai Degterinsky.

Support For ** Glob

Recursive directory traversal:

SELECT * FROM s3('https://mybucket.s3.amazonaws.com/**/*.tsv')

Now we support: *, **, ?, {abc,def}, {1..10}, {01..10}.
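
The new ** glob composes with the existing patterns; for example (a hedged sketch — the bucket path and layout are hypothetical):

SELECT count()
FROM s3('https://mybucket.s3.amazonaws.com/logs/{2021..2022}/**/*.{csv,tsv}')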

Demo

Developer: Smita Kulkarni.

Functions For Spark Compatibility

— pmod (positiveModulo);
— ascii;
— randCanonical;

SELECT modulo(-3, 10), -3 % 10;
-- -3, -3

SELECT pmod(-3, 10), positiveModulo(-3, 10);
-- 7, 7
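
The other two functions in brief (hedged examples; randCanonical returns a uniform Float64 in [0, 1), so its output differs on every call):

SELECT ascii('A');      -- 65
SELECT randCanonical(); -- e.g. 0.5263..., random each time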

Developer: TaiYang Li.

Functions For Convenience

SELECT filesystemCapacity() AS size,
    formatReadableSize(size) AS bin,
    formatReadableDecimalSize(size) AS dec

size: 1967317549056
bin:  1.79 TiB
dec:  1.97 TB

Developer: Alejandro.

milovidov-desktop :) SELECT displayName()

┌─displayName()─────┐
│ milovidov-desktop │
└───────────────────┘

Developer: HongBin.

Retries On INSERT

Session expired. Table is in readonly mode 😠

Never again:

SET insert_keeper_max_retries = 10;

INSERT will survive restarts of ClickHouse Keeper or ZooKeeper
and reconnections.
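
A minimal sketch (the table name is hypothetical; the setting itself is from this release):

SET insert_keeper_max_retries = 10;

-- If Keeper or ZooKeeper restarts mid-INSERT, the INSERT now retries
-- instead of failing with "Table is in readonly mode":
INSERT INTO my_replicated_table VALUES (1, 'hello');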

Developer: Igor Nikonov.

And reconnections are performed as quickly as possible, thanks to Raul Marin (see 22.10).

Data Lakes

ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).

ClickHouse can store the data
or process externally stored data on the fly.

External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: S3, HDFS, Azure, COSN, OSS...
— URLs and local files;

All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, Protobuf, MsgPack...
— schemaful and schemaless;
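
For example, external data can be processed in place, without loading it first (a hedged sketch; hits.parquet is a hypothetical local file, and the schema is inferred automatically):

SELECT count()
FROM file('hits.parquet')
WHERE URL LIKE '%google%'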

Data Lakes

SELECT count() AS c FROM s3(
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 5.924 sec.

SELECT count() AS c FROM s3Cluster('default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 2.560 sec.

SELECT count() AS c FROM s3(
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%';
-- 5.887 sec.

SELECT count() AS c FROM s3Cluster('default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%';
-- 2.935 sec.

SELECT count() AS c FROM hits
WHERE URL LIKE '%google%';
-- 0.913 sec.

Data Lakes

Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.

TODO: Apache Iceberg.

Advantages:
— these formats somewhat resemble MergeTree,
  allowing incremental data insertion
  and approaching ClickHouse's native data formats;

Disadvantages:
— alien data formats from the Apache/Hadoop/Java world;
— nothing works out of the box
  unless you really know how to deal with it;

Data Lakes

How to prepare data for Delta Lake?

wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
mkdir -p ~/hadoop/spark-3.3.1/
tar -xzf spark-3.3.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.3.1/ --strip 1
export SPARK_HOME=~/hadoop/spark-3.3.1
echo "spark.driver.host localhost" > $SPARK_HOME/conf/spark-defaults.conf
cd $SPARK_HOME
bin/spark-shell \
    --packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1

spark.read.load("/var/lib/clickhouse/user_files/hits.parquet")
    .write.format("delta")
    .mode("overwrite").save("file:///home/ubuntu/spark_result_hits")

22/11/17 00:25:07 ERROR SparkUncaughtExceptionHandler: [Container in shutdown]
Uncaught exception in thread Thread[Executor task launch worker for task 32.0
in stage 2.0 (TID 34),5,main]
java.lang.OutOfMemoryError: Java heap space
22/11/17 00:25:07 ERROR Utils: Aborting task

Hint: specify --driver-memory 100g for spark-shell.
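
With the hint applied, the invocation becomes:

bin/spark-shell --driver-memory 100g \
    --packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1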

Data Lakes

Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.

SELECT count() FROM deltaLake(
    'https://clickhouse-public-datasets.s3.amazonaws.com/delta_lake/hits/')
WHERE URL LIKE '%google%'
-- 4.396 sec.
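
Apache Hudi tables are queried the same way via the hudi table function (a hedged sketch; the bucket path is hypothetical):

SELECT count() FROM hudi('https://mybucket.s3.amazonaws.com/hudi/hits/')
WHERE URL LIKE '%google%'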

Developers: Daniil Rubin, Ksenia Sumarokova, Flynn ucasfl.

S3 Readonly Disk

Attach tables directly from S3. No local state.

A new disk type — s3_plain.

$ cat /etc/clickhouse-server/config.d/s3_plain.yaml
storage_configuration:
    disks:
        s3_plain:
            type: s3_plain
            endpoint: 'http://clickhouse-public-datasets.s3.amazonaws.com/s3_plain/'
    policies:
        s3_plain:
            volumes:
                main:
                    disk: s3_plain

Developer: Azat Khuzhin.

S3 Readonly Disk

ATTACH TABLE hits_worm UUID '5eb0149e-d250-4ec4-bfc7-a90ce167690f'
(
    `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16,
    `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32,
    `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8,
    `UserAgent` UInt8, `URL` String, `Referer` String, `Refresh` UInt8,
    `RefererCategoryID` UInt16, `RefererRegionID` UInt32, `URLCategoryID` UInt16,
    `URLRegionID` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8,
    `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8,
    `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8,
    `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8,
    `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32,
    `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String,
    `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16,
    `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8,
    `IsNotBounce` UInt8, `FUniqID` UInt64, `OriginalURL` String, `HID` UInt32,
    `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8,
    `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1),
    `LocalEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8,
    `Interests` UInt16, `Robotness` UInt8, `RemoteIP` UInt32, `WindowName` Int32,
    `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String,
    `HTTPError` UInt16, `SendTiming` UInt32, `DNSTiming` UInt32,
    `ConnectTiming` UInt32, `ResponseStartTiming` UInt32,
    `ResponseEndTiming` UInt32, `FetchTiming` UInt32,
    `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64,
    `ParamOrderID` String, `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16, `OpenstatServiceName` String,
    `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String,
    `UTMSource` String, `UTMMedium` String, `UTMCampaign` String,
    `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8,
    `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SETTINGS storage_policy = 's3_plain'

S3 Readonly Disk

SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 2.533 sec.

Developer: Azat Khuzhin.

S3 Readonly Disk

storage_configuration:
    disks:
        s3_plain_cache:
            type: cache
            disk: s3_plain
            path: /var/lib/clickhouse/cache/
            max_size: 1000000000000
    policies:
        s3_plain_cache:
            volumes:
                main:
                    disk: s3_plain_cache

SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 0.719 sec.
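
A hedged way to verify that both disks are registered (system.disks exists; the exact set of columns may vary by version):

SELECT name, path FROM system.disks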

Performance Improvements

SELECT path, sum(hits)
FROM wikistat
WHERE match(path, '^ClickH')
GROUP BY path
ORDER BY path

┌─path─────────────────┬─sum(hits)─┐
│ ClickHealthy         │        13 │
│ ClickHelp            │        68 │
│ ClickHere            │         1 │
│ ClickHole            │    402127 │
│ ClickHome            │       895 │
│ ClickHome_(software) │      1337 │
│ ClickHouse           │    361848 │
└──────────────────────┴───────────┘

Was: Elapsed: 415.423 sec. Processed 404.20 billion rows, 12.19 TB
(972.99 million rows/s., 29.35 GB/s.)

Now: Elapsed: 0.019 sec. Processed 303.10 thousand rows, 8.07 MB
(15.70 million rows/s., 418.05 MB/s.)
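
The speedup comes from index analysis of the regular expression: the fixed prefix 'ClickH' is extracted and used to narrow the primary key range, so conceptually the query behaves like this hedged sketch (assuming the table is ordered by path):

SELECT path, sum(hits)
FROM wikistat
WHERE path >= 'ClickH' AND path < 'ClickI'
GROUP BY path
ORDER BY path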

Developer: Clark CaoLiu.

Bonus

Allow subset of table functions in readonly mode:

:) SELECT name FROM system.table_functions WHERE allow_readonly ORDER BY name

┌─name───────────────┐
│ cluster            │
│ clusterAllReplicas │
│ generateRandom     │
│ null               │
│ numbers            │
│ numbers_mt         │
│ values             │
│ view               │
│ viewIfPermitted    │
└────────────────────┘
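
For example, this now works in readonly mode (a minimal sketch; raising readonly inside a session is allowed):

SET readonly = 1;
SELECT sum(number) FROM numbers(10); -- 45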

Developer: Smita Kulkarni.

Bonus

Query parameters for INSERT VALUES:

INSERT INTO test VALUES (123, {name:String})
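
Parameters are passed the same way as for SELECT queries, e.g. from the command line (a hedged sketch using the existing --param_<name> mechanism of clickhouse-client):

clickhouse-client --param_name="hello" \
    --query="INSERT INTO test VALUES (123, {name:String})"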

Developer: Nikolai Degterinsky.

Documentation for system.asynchronous_metrics:

SELECT * FROM system.asynchronous_metrics WHERE metric = 'Jitter'

metric:      Jitter
value:       0.000084
description: The difference in time the thread for calculation of the
             asynchronous metrics was scheduled to wake up and the time
             it was in fact, woken up. A proxy-indicator of overall
             system latency and responsiveness.

Developer: Alexey Milovidov.

Integrations

dbt-clickhouse: support for dbt v1.3;

Python client:
— python v3.11 support;
— improved insert performance;
— better Pandas dataframe compatibility
  and performance for date, time, null types;
— extended Numpy with big int support;

clickhouse-go: fix DateTime and Int64 support;

clickhouse-js: supports mutual SSL authentication;

clickhouse-kafka-connect: support for Array and Map types;

ClickHouse Cloud Beta

— available since Oct 4th;

— free 14-day trial up to 10 TB of data;

Try it! https://clickhouse.cloud/.

Q&A