1. (50 min) What's new in ClickHouse 22.11.
2. (10 min) Q&A.
ClickHouse November release.
— 15 new features
— 5 performance optimizations
— 32 bug fixes
Examples:
SELECT now() + INTERVAL 1 MONTH;
SELECT now() + INTERVAL '1 MONTH';
SELECT now() + INTERVAL 1 MONTH - INTERVAL 2 DAY;
SELECT now() + INTERVAL '1 MONTH -2 DAY';
SELECT now() + (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH -2 DAY';
SELECT (INTERVAL 1 MONTH - INTERVAL 2 DAY);
SELECT INTERVAL '1 MONTH 1 MONTH';
Developer: Nikolai Degterinsky.
Recursive directory traversal:
SELECT * FROM
s3('https://mybucket.s3.amazonaws.com/**/*.tsv')
Now we support: *, **, ?, {abc,def}, {1..10}, {01..10}.
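For example, `**` descends into nested prefixes and `{01..10}` expands to zero-padded numbers (the bucket and paths below are hypothetical):

```sql
-- Hypothetical layout: logs/<year>/<month>/part_01.tsv ... part_10.tsv
SELECT count()
FROM s3('https://mybucket.s3.amazonaws.com/logs/**/part_{01..10}.tsv')
```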
Demo
Developer: Smita Kulkarni.
New functions:
— pmod (positiveModulo);
— ascii;
— randCanonical;
SELECT modulo(-3, 10), -3 % 10;
-3, -3
SELECT pmod(-3, 10), positiveModulo(-3, 10);
7, 7
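The other two additions in brief: `ascii` returns the code of the first character of a string, and `randCanonical` returns a uniform random Float64 in [0, 1):

```sql
SELECT ascii('A');       -- code of the first character: 65
SELECT randCanonical();  -- uniform Float64 in [0, 1)
```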
Developer: TaiYang Li.
SELECT filesystemCapacity() AS size,
formatReadableSize(size) AS bin,
formatReadableDecimalSize(size) AS dec
size: 1967317549056
bin: 1.79 TiB
dec: 1.97 TB
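The difference is binary units (1 KiB = 1024 bytes) versus decimal units (1 kB = 1000 bytes), e.g.:

```sql
SELECT formatReadableSize(1000000)        AS bin,  -- 976.56 KiB (1000000 / 1024)
       formatReadableDecimalSize(1000000) AS dec   -- 1.00 MB    (1000000 / 1000^2)
```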
Developer: Alejandro.
milovidov-desktop :) SELECT displayName()
┌─displayName()─────┐
│ milovidov-desktop │
└───────────────────┘
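The name comes from the server configuration, falling back to the hostname; something like the following (the file path is an example):

```yaml
# /etc/clickhouse-server/config.d/name.yaml
# Shown in the client prompt and returned by displayName()
display_name: production
```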
Developer: HongBin.
Session expired. Table is in readonly mode.
Never again:
SET insert_keeper_max_retries = 10;
INSERT will survive restarts of ClickHouse Keeper or ZooKeeper
and reconnections.
Developer: Igor Nikonov.
And reconnections are performed as quickly as possible, thanks to Raul Marin (see 22.10).
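A minimal sketch of using it with a replicated table (the backoff setting is an assumption; check the settings list of your version):

```sql
SET insert_keeper_max_retries = 10;               -- retry Keeper operations on failure
SET insert_keeper_retry_initial_backoff_ms = 100; -- assumed companion setting
INSERT INTO replicated_table VALUES (1);          -- now survives Keeper restarts
```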
ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).
ClickHouse can store the data
or process externally stored data on the fly.
External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: S3, HDFS, Azure, COSN, OSS...
— from URL and local files;
All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, Protobuf, MsgPack...
— schemaful and schemaless;
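For instance, clickhouse-local can query a local file in any of these formats without a server (the file name is an example):

```sql
-- Run as: clickhouse-local --query "..."
SELECT count() FROM file('hits.tsv', TSV)
```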
SELECT count() AS c FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 5.924 sec.
SELECT count() AS c FROM s3Cluster('default',
'https://clickhouse-public-datasets.s3.amazonaws.com/hits/native/*.native.zst')
WHERE URL LIKE '%google%';
-- 2.560 sec.
SELECT count() AS c FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%';
-- 5.887 sec.
SELECT count() AS c FROM s3Cluster('default',
'https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_*.parquet')
WHERE URL LIKE '%google%'
-- 2.935 sec.
SELECT count() AS c FROM hits WHERE URL LIKE '%google%';
-- 0.913 sec.
Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.
TODO: Apache Iceberg.
Advantages:
— these formats somewhat resemble MergeTree,
allowing incremental data insertion,
and are close to ClickHouse's own data formats;
Disadvantages:
— alien data formats from Apache/Hadoop/Java world;
— nothing works out of the box
unless you really know how to deal with it;
How to prepare data for Delta Lake?
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
mkdir -p ~/hadoop/spark-3.3.1/
tar -xzf spark-3.3.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.3.1/ --strip 1
export SPARK_HOME=~/hadoop/spark-3.3.1
echo "spark.driver.host localhost" > $SPARK_HOME/conf/spark-defaults.conf
cd $SPARK_HOME
bin/spark-shell \
--packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1
spark.read.load("/var/lib/clickhouse/user_files/hits.parquet").write.format("delta")
.mode("overwrite").save("file:///home/ubuntu/spark_result_hits")
22/11/17 00:25:07 ERROR SparkUncaughtExceptionHandler: [Container in shutdown]
Uncaught exception in thread Thread[Executor task launch worker for task 32.0 in stage 2.0 (TID 34),5,main]
java.lang.OutOfMemoryError: Java heap space
22/11/17 00:25:07 ERROR Utils: Aborting task
Hint: specify --driver-memory 100g for spark-shell.
Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.
SELECT count() FROM deltaLake(
'https://clickhouse-public-datasets.s3.amazonaws.com/delta_lake/hits/')
WHERE URL LIKE '%google%'
-- 4.396 sec.
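A Hudi table should be readable the same way with the hudi table function (the URL below is hypothetical):

```sql
SELECT count() FROM hudi(
    'https://mybucket.s3.amazonaws.com/hudi/hits/')
WHERE URL LIKE '%google%'
```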
Developers: Daniil Rubin, Ksenia Sumarokova, Flynn ucasfl.
Attach tables directly from S3. No local state.
A new disk type — s3_plain.
$ cat /etc/clickhouse-server/config.d/s3_plain.yaml
storage_configuration:
    disks:
        s3_plain:
            type: s3_plain
            endpoint: 'http://clickhouse-public-datasets.s3.amazonaws.com/s3_plain/'
    policies:
        s3_plain:
            volumes:
                main:
                    disk: s3_plain
Developer: Azat Khuzhin.
ATTACH TABLE hits_worm UUID '5eb0149e-d250-4ec4-bfc7-a90ce167690f'
(
`WatchID` UInt64,
`JavaEnable` UInt8,
`Title` String,
`GoodEvent` Int16,
`EventTime` DateTime,
`EventDate` Date,
`CounterID` UInt32,
`ClientIP` UInt32,
`RegionID` UInt32,
`UserID` UInt64,
`CounterClass` Int8,
`OS` UInt8,
`UserAgent` UInt8,
`URL` String,
`Referer` String,
`Refresh` UInt8,
`RefererCategoryID` UInt16,
`RefererRegionID` UInt32,
`URLCategoryID` UInt16,
`URLRegionID` UInt32,
`ResolutionWidth` UInt16,
`ResolutionHeight` UInt16,
`ResolutionDepth` UInt8,
`FlashMajor` UInt8,
`FlashMinor` UInt8,
`FlashMinor2` String,
`NetMajor` UInt8,
`NetMinor` UInt8,
`UserAgentMajor` UInt16,
`UserAgentMinor` FixedString(2),
`CookieEnable` UInt8,
`JavascriptEnable` UInt8,
`IsMobile` UInt8,
`MobilePhone` UInt8,
`MobilePhoneModel` String,
`Params` String,
`IPNetworkID` UInt32,
`TraficSourceID` Int8,
`SearchEngineID` UInt16,
`SearchPhrase` String,
`AdvEngineID` UInt8,
`IsArtifical` UInt8,
`WindowClientWidth` UInt16,
`WindowClientHeight` UInt16,
`ClientTimeZone` Int16,
`ClientEventTime` DateTime,
`SilverlightVersion1` UInt8,
`SilverlightVersion2` UInt8,
`SilverlightVersion3` UInt32,
`SilverlightVersion4` UInt16,
`PageCharset` String,
`CodeVersion` UInt32,
`IsLink` UInt8,
`IsDownload` UInt8,
`IsNotBounce` UInt8,
`FUniqID` UInt64,
`OriginalURL` String,
`HID` UInt32,
`IsOldCounter` UInt8,
`IsEvent` UInt8,
`IsParameter` UInt8,
`DontCountHits` UInt8,
`WithHash` UInt8,
`HitColor` FixedString(1),
`LocalEventTime` DateTime,
`Age` UInt8,
`Sex` UInt8,
`Income` UInt8,
`Interests` UInt16,
`Robotness` UInt8,
`RemoteIP` UInt32,
`WindowName` Int32,
`OpenerName` Int32,
`HistoryLength` Int16,
`BrowserLanguage` FixedString(2),
`BrowserCountry` FixedString(2),
`SocialNetwork` String,
`SocialAction` String,
`HTTPError` UInt16,
`SendTiming` UInt32,
`DNSTiming` UInt32,
`ConnectTiming` UInt32,
`ResponseStartTiming` UInt32,
`ResponseEndTiming` UInt32,
`FetchTiming` UInt32,
`SocialSourceNetworkID` UInt8,
`SocialSourcePage` String,
`ParamPrice` Int64,
`ParamOrderID` String,
`ParamCurrency` FixedString(3),
`ParamCurrencyID` UInt16,
`OpenstatServiceName` String,
`OpenstatCampaignID` String,
`OpenstatAdID` String,
`OpenstatSourceID` String,
`UTMSource` String,
`UTMMedium` String,
`UTMCampaign` String,
`UTMContent` String,
`UTMTerm` String,
`FromTag` String,
`HasGCLID` UInt8,
`RefererHash` UInt64,
`URLHash` UInt64,
`CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SETTINGS storage_policy = 's3_plain'
SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 2.533 sec.
Developer: Azat Khuzhin.
storage_configuration:
    disks:
        s3_plain_cache:
            type: cache
            disk: s3_plain
            path: /var/lib/clickhouse/cache/
            max_size: 1000000000000
    policies:
        s3_plain_cache:
            volumes:
                main:
                    disk: s3_plain_cache
SELECT count() FROM hits_worm WHERE URL LIKE '%google%'
-- 0.719 sec.
SELECT path, sum(hits) FROM wikistat WHERE match(path, '^ClickH')
GROUP BY path ORDER BY path
┌─path─────────────────┬─sum(hits)─┐
│ ClickHealthy         │        13 │
│ ClickHelp            │        68 │
│ ClickHere            │         1 │
│ ClickHole            │    402127 │
│ ClickHome            │       895 │
│ ClickHome_(software) │      1337 │
│ ClickHouse           │    361848 │
└──────────────────────┴───────────┘
Was: Elapsed: 415.423 sec. Processed 404.20 billion rows, 12.19 TB
(972.99 million rows/s., 29.35 GB/s.)
Now: Elapsed: 0.019 sec. Processed 303.10 thousand rows, 8.07 MB
(15.70 million rows/s., 418.05 MB/s.)
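The speedup presumably comes from extracting the fixed prefix of the regexp and using it for a primary-key range scan, roughly equivalent to:

```sql
SELECT path, sum(hits) FROM wikistat
WHERE path >= 'ClickH' AND path < 'ClickI'  -- range derived from the '^ClickH' prefix
GROUP BY path ORDER BY path
```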
Developer: Clark CaoLiu.
Allow subset of table functions in readonly mode:
:) SELECT name FROM system.table_functions
WHERE allow_readonly ORDER BY name
┌─name───────────────┐
│ cluster            │
│ clusterAllReplicas │
│ generateRandom     │
│ null               │
│ numbers            │
│ numbers_mt         │
│ values             │
│ view               │
│ viewIfPermitted    │
└────────────────────┘
Developer: Smita Kulkarni.
Query parameters for INSERT VALUES:
INSERT INTO test VALUES (123, {name:String})
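In clickhouse-client the parameter is supplied with `SET param_<name>` or the `--param_<name>` command-line flag, e.g.:

```sql
SET param_name = 'Hello';
INSERT INTO test VALUES (123, {name:String});
```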
Developer: Nikolai Degterinsky.
Documentation for system.asynchronous_metrics:
SELECT * FROM system.asynchronous_metrics WHERE metric = 'Jitter'
metric: Jitter
value: 0.000084
description: The difference in time the thread for calculation of the asynchronous metrics was scheduled to wake up and the time it was in fact, woken up. A proxy-indicator of overall system latency and responsiveness.
Developer: Alexey Milovidov.
dbt-clickhouse: support for dbt v1.3;
Python client:
— Python 3.11 support;
— improved insert performance;
— better Pandas dataframe compatibility
and performance for date, time, and null types;
— NumPy support extended with big integers;
clickhouse-go: fixed DateTime and Int64 support;
clickhouse-js: supports mutual SSL authentication;
clickhouse-kafka-connect: support for Array and Map types;
ClickHouse Cloud:
— available since Oct 4th;
— free 14-day trial with up to 10 TB of data;
Try it! https://clickhouse.cloud/