ClickHouse: Release 22.11 Webinar

ClickHouse
Release 22.11

Release 22.11 Webinar

1. (50 min) What's new in ClickHouse 22.11.

2. (10 min) Q&A.

🌀

Release 22.11

ClickHouse November release.

— 15 new features

— 5 performance optimizations

— 32 bug fixes 🔥

Composite Time Intervals

Examples:

SELECT now() + INTERVAL 1 MONTH;
SELECT now() + INTERVAL '1 MONTH';
SELECT now() + INTERVAL 1 MONTH - INTERVAL 2 DAY;
SELECT now() + INTERVAL '1 MONTH -2 DAY';
SELECT now() + (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH -2 DAY';
SELECT (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH 1 MONTH';
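
For instance, a composite interval applied to a fixed date (a hedged sketch; the expected result assumes standard month arithmetic):

SELECT toDate('2022-11-30') + INTERVAL '1 MONTH -2 DAY';
-- 2022-12-28 (one month forward, then two days back)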

Developer: Nikolai Degterinsky.

Support For ** Glob

Recursive directory traversal:

SELECT * FROM s3('https://mybucket.s3.amazonaws.com/**/*.tsv')

Now we support: *, **, ?, {abc,def}, {1..10}, {01..10}.
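
The new ** glob composes with the existing patterns; for example (a hedged sketch — the bucket path and layout are hypothetical):

SELECT count()
FROM s3('https://mybucket.s3.amazonaws.com/logs/{2021..2022}/**/*.{csv,tsv}')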

Demo

Developer: Smita Kulkarni.

Functions For Spark Compatibility

— pmod (positiveModulo);
— ascii;
— randCanonical;

SELECT modulo(-3, 10), -3 % 10;
-- -3, -3

SELECT pmod(-3, 10), positiveModulo(-3, 10);
-- 7, 7
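
The other two functions in brief (hedged examples; randCanonical returns a uniform Float64 in [0, 1), so its output differs on every call):

SELECT ascii('A');      -- 65
SELECT randCanonical(); -- e.g. 0.5263..., random each time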

Developer: TaiYang Li.

Functions For Convenience

SELECT filesystemCapacity() AS size,
    formatReadableSize(size) AS bin,
    formatReadableDecimalSize(size) AS dec

size: 1967317549056
bin:  1.79 TiB
dec:  1.97 TB

Developer: Alejandro.

milovidov-desktop :) SELECT displayName()

┌─displayName()─────┐
│ milovidov-desktop │
└───────────────────┘

Developer: HongBin.

Retries On INSERT

Session expired. Table is in readonly mode 😠

Never again:

SET insert_keeper_max_retries = 10;

INSERT will survive restarts of ClickHouse Keeper or ZooKeeper
and reconnections.
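
A minimal sketch (the table name is hypothetical; the setting itself is from this release):

SET insert_keeper_max_retries = 10;

-- If Keeper or ZooKeeper restarts mid-INSERT, the INSERT now retries
-- instead of failing with "Table is in readonly mode":
INSERT INTO my_replicated_table VALUES (1, 'hello');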

Developer: Igor Nikonov.

And reconnections are performed as quickly as possible, thanks to Raul Marin (see 22.10).

Data Lakes

ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).

ClickHouse can store the data
or process externally stored data on the fly.

External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: S3, HDFS, Azure, COSN, OSS...
— URLs and local files;

All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, Protobuf, MsgPack...
— schemaful and schemaless;
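
For example, external data can be processed in place, without loading it first (a hedged sketch; hits.parquet is a hypothetical local file, and the schema is inferred automatically):

SELECT count()
FROM file('hits.parquet')
WHERE URL LIKE '%google%'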

Data Lakes

SELECT count() AS c FROM s3(
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 5.924 sec.

SELECT count() AS c FROM s3Cluster('default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 2.560 sec.

SELECT count() AS c FROM s3(
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%';
-- 5.887 sec.

SELECT count() AS c FROM s3Cluster('default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%';
-- 2.935 sec.

SELECT count() AS c FROM hits
WHERE URL LIKE '%google%';
-- 0.913 sec.

Data Lakes

Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.

TODO: Apache Iceberg.

Advantages:
— these formats somewhat resemble MergeTree,
  allowing incremental data insertion
  and approaching ClickHouse's native data formats;

Disadvantages:
— alien data formats from the Apache/Hadoop/Java world;
— nothing works out of the box
  unless you really know how to deal with it;

Data Lakes

How to prepare data for Delta Lake?

wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
mkdir -p ~/hadoop/spark-3.3.1/
tar -xzf spark-3.3.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.3.1/ --strip 1
export SPARK_HOME=~/hadoop/spark-3.3.1
echo "spark.driver.host localhost" > $SPARK_HOME/conf/spark-defaults.conf
cd $SPARK_HOME
bin/spark-shell \
    --packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1

spark.read.load("/var/lib/clickhouse/user_files/hits.parquet")
    .write.format("delta")
    .mode("overwrite").save("file:///home/ubuntu/spark_result_hits")

22/11/17 00:25:07 ERROR SparkUncaughtExceptionHandler: [Container in shutdown]
Uncaught exception in thread Thread[Executor task launch worker for task 32.0
in stage 2.0 (TID 34),5,main]
java.lang.OutOfMemoryError: Java heap space
22/11/17 00:25:07 ERROR Utils: Aborting task

Hint: specify --driver-memory 100g for spark-shell.
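
With the hint applied, the invocation becomes:

bin/spark-shell --driver-memory 100g \
    --packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1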

Data Lakes

Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.

SELECT count() FROM deltaLake(
    'https://clickhouse-public-datasets.s3.amazonaws.com/delta_lake/hits/')
WHERE URL LIKE '%google%'
-- 4.396 sec.
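
Apache Hudi tables are queried the same way via the hudi table function (a hedged sketch; the bucket path is hypothetical):

SELECT count() FROM hudi('https://mybucket.s3.amazonaws.com/hudi/hits/')
WHERE URL LIKE '%google%'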

Developers: Daniil Rubin, Ksenia Sumarokova, Flynn ucasfl.

S3 Readonly Disk

Attach tables directly from S3. No local state.

A new disk type — s3_plain.

$ cat /etc/clickhouse-server/config.d/s3_plain.yaml
storage_configuration:
    disks:
        s3_plain:
            type: s3_plain
            endpoint: 'http://clickhouse-public-datasets.s3.amazonaws.com/s3_plain/'
    policies:
        s3_plain:
            volumes:
                main:
                    disk: s3_plain

Developer: Azat Khuzhin.

S3 Readonly Disk

ATTACH TABLE hits_worm UUID '5eb0149e-d250-4ec4-bfc7-a90ce167690f'
(
    `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16,
    `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32,
    `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8,
    `UserAgent` UInt8, `URL` String, `Referer` String, `Refresh` UInt8,
    `RefererCategoryID` UInt16, `RefererRegionID` UInt32, `URLCategoryID` UInt16,
    `URLRegionID` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8,
    `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8,
    `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8,
    `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8,
    `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32,
    `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String,
    `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16,
    `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8,
    `IsNotBounce` UInt8, `FUniqID` UInt64, `OriginalURL` String, `HID` UInt32,
    `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8,
    `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1),
    `LocalEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8,
    `Interests` UInt16, `Robotness` UInt8, `RemoteIP` UInt32, `WindowName` Int32,
    `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String,
    `HTTPError` UInt16, `SendTiming` UInt32, `DNSTiming` UInt32,
    `ConnectTiming` UInt32, `ResponseStartTiming` UInt32,
    `ResponseEndTiming` UInt32, `FetchTiming` UInt32,
    `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64,
    `ParamOrderID` String, `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16, `OpenstatServiceName` String,
    `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String,
    `UTMSource` String, `UTMMedium` String, `UTMCampaign` String,
    `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8,
    `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SETTINGS storage_policy = 's3_plain'

S3 Readonly Disk

SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 2.533 sec.

Developer: Azat Khuzhin.

S3 Readonly Disk

storage_configuration:
    disks:
        s3_plain_cache:
            type: cache
            disk: s3_plain
            path: /var/lib/clickhouse/cache/
            max_size: 1000000000000
    policies:
        s3_plain_cache:
            volumes:
                main:
                    disk: s3_plain_cache

SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 0.719 sec.
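
A hedged way to verify that both disks are registered (system.disks exists; the exact set of columns may vary by version):

SELECT name, path FROM system.disks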

Performance Improvements

SELECT path, sum(hits)
FROM wikistat
WHERE match(path, '^ClickH')
GROUP BY path
ORDER BY path

┌─path─────────────────┬─sum(hits)─┐
│ ClickHealthy         │        13 │
│ ClickHelp            │        68 │
│ ClickHere            │         1 │
│ ClickHole            │    402127 │
│ ClickHome            │       895 │
│ ClickHome_(software) │      1337 │
│ ClickHouse           │    361848 │
└──────────────────────┴───────────┘

Was: Elapsed: 415.423 sec. Processed 404.20 billion rows, 12.19 TB
(972.99 million rows/s., 29.35 GB/s.)

Now: Elapsed: 0.019 sec. Processed 303.10 thousand rows, 8.07 MB
(15.70 million rows/s., 418.05 MB/s.)
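
The speedup comes from index analysis of the regular expression: the fixed prefix 'ClickH' is extracted and used to narrow the primary key range, so conceptually the query behaves like this hedged sketch (assuming the table is ordered by path):

SELECT path, sum(hits)
FROM wikistat
WHERE path >= 'ClickH' AND path < 'ClickI'
GROUP BY path
ORDER BY path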

Developer: Clark CaoLiu.

Bonus

Allow subset of table functions in readonly mode:

:) SELECT name FROM system.table_functions WHERE allow_readonly ORDER BY name

┌─name───────────────┐
│ cluster            │
│ clusterAllReplicas │
│ generateRandom     │
│ null               │
│ numbers            │
│ numbers_mt         │
│ values             │
│ view               │
│ viewIfPermitted    │
└────────────────────┘
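
For example, this now works in readonly mode (a minimal sketch; raising readonly inside a session is allowed):

SET readonly = 1;
SELECT sum(number) FROM numbers(10); -- 45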

Developer: Smita Kulkarni.

Bonus

Query parameters for INSERT VALUES:

INSERT INTO test VALUES (123, {name:String})
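
Parameters are passed the same way as for SELECT queries, e.g. from the command line (a hedged sketch using the existing --param_<name> mechanism of clickhouse-client):

clickhouse-client --param_name="hello" \
    --query="INSERT INTO test VALUES (123, {name:String})"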

Developer: Nikolai Degterinsky.

Documentation for system.asynchronous_metrics:

SELECT * FROM system.asynchronous_metrics WHERE metric = 'Jitter'

metric:      Jitter
value:       0.000084
description: The difference in time the thread for calculation of the
             asynchronous metrics was scheduled to wake up and the time
             it was in fact, woken up. A proxy-indicator of overall
             system latency and responsiveness.

Developer: Alexey Milovidov.

Integrations

dbt-clickhouse: support for dbt v1.3;

Python client:
— python v3.11 support;
— improved insert performance;
— better Pandas dataframe compatibility
  and performance for date, time, null types;
— extended Numpy with big int support;

clickhouse-go: fix DateTime and Int64 support;

clickhouse-js: supports mutual SSL authentication;

clickhouse-kafka-connect: support for Array and Map types;

ClickHouse Cloud Beta

— available since Oct 4th;

— free 14-day trial up to 10 TB of data;

Try it! https://clickhouse.cloud/.

Q&A