1. (35 min) What's new in ClickHouse 22.1.
2. (25 min) Q&A.
First monthly release in 2022.
Public roadmap for 2022:
https://github.com/ClickHouse/ClickHouse/issues/32513
And we already releasing new features from the roadmap!
For INTO OUTFILE, FROM INFILE:
Was:
SELECT URLDomain, UserID FROM test.hits
INTO OUTFILE 'hits.csv' FORMAT CSV
Now:
SELECT URLDomain, UserID FROM test.hits INTO OUTFILE 'hits.parquet'
SELECT URLDomain, UserID FROM test.hits INTO OUTFILE 'hits.csv'
SELECT URLDomain, UserID FROM test.hits INTO OUTFILE 'hits.ndjson'
SELECT URLDomain, UserID FROM test.hits INTO OUTFILE 'hits.native'
SELECT URLDomain, UserID FROM test.hits INTO OUTFILE 'hits.csv.gz'
For table functions and engines:
— file, url, s3, hdfs, s3Cluster, hdfsCluster...
Developers: Pavel Kruglov, ZhongYuanKai.
For formats that already contain structure:
— Native, Protobuf, Avro, Parquet, ORC, Arrow.
— CSVWithNamesAndTypes, TSVWithNamesAndTypes.
Was:
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native,
$$
file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
$$)
LIMIT 10
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native,
$$
file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
$$)
LIMIT 10
Now:
SELECT * FROM url(
'https://datasets.clickhouse.com/github_events_v2.native.xz') LIMIT 10
Read the structure:
DESCRIBE url('https://datasets.clickhouse.com/github_events_v2.native.xz')
Developer: Pavel Kruglov.
It works even for semistructured and unstructured formats!
— CSV, TSV, CSVWithNames, TSVWithNames,
JSONEachRow, Values, Regexp, MsgPack...
DESCRIBE file('hits.ndjson')
┌─name──────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ UserID │ Nullable(String) │ │ │ │ │ │
│ URLDomain │ Nullable(String) │ │ │ │ │ │
└───────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
— Nullable(String) if data is inside string in JSON.
— Nullable(Float64) if it's number.
— Arrays are also supported. Multidimensional arrays.
— Arrays of non-uniform types are parsed as Tuples.
Developer: Pavel Kruglov.
$ echo '{"x":[[123,456]]}' > test.ndjson
$ clickhouse-local --query "SELECT x, toTypeName(x) FROM file('test.ndjson')"
[[123,456]] Array(Array(Nullable(Float64)))
$ echo '{"x":[123,"Hello",["World"]]}' > test.ndjson
$ clickhouse-local --query "SELECT x, toTypeName(x) FROM file('test.ndjson')"
(123,'Hello',['World']) Tuple(Nullable(Float64), Nullable(String), Array(Nullable(String)))
$ echo '{"x":"Hello"} {"y":"World"}' > test.ndjson
$ clickhouse-local --query "SELECT * FROM file('test.ndjson')"
\N Hello
World \N
$ echo '{"x":[[123,"Hello"]]} {"x":[[123,"Hello",456]]}' > test.ndjson
$ clickhouse-local --query "SELECT * FROM file('test.ndjson')"
Code: 636. DB::Exception: Cannot extract table structure from JSONEachRow format file. Error: Automatically defined type Array(Tuple(Nullable(Float64), Nullable(String))) for column x in row 1 differs from type defined by previous rows: Array(Tuple(Nullable(Float64), Nullable(String), Nullable(Float64))).
CSV: String and Float64 types are inferred;
TSV: everything as String;
Column names: c1, c2, ...;
For CSVWithNames, TSVWithNames column names are extracted.
:) DESC file('hits.csv')
c1 Nullable(String)
c2 Nullable(Float64)
:) DESC file('hits.tsv')
c1 Nullable(String)
c2 Nullable(String)
:) DESC file('hits2.tsv', TSVWithNames)
URLDomain Nullable(String)
UserID Nullable(String)
We also support schema on demand!
SELECT c1 AS domain, uniq(c2), count() AS cnt
FROM file('hits.csv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
SELECT c1::String AS domain, uniq(c2::UInt64), count() AS cnt
FROM file('hits.csv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
SELECT URLDomain::String AS domain, uniq(UserID::UInt64), count() AS cnt
FROM file('hits.ndjson')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
Bonus: usage example of LineAsString and RawBLOB formats:
:) SELECT extractTextFromHTML(*)
FROM url('https://news.ycombinator.com/', LineAsString);
:) SELECT extractTextFromHTML(*)
FROM url('https://news.ycombinator.com/', RawBLOB);
:) DESC url('https://news.ycombinator.com/', LineAsString)
line String
Developer: Pavel Kruglov.
Formats with external schema: Protobuf and CapNProto:
SELECT * FROM file('data.protobuf')
SETTINGS format_schema = 'path_to_schema:message_name'
Developer: Pavel Kruglov.
Also works for Merge, Distributed and ReplicatedMegreTree!
Was:
CREATE TABLE hits (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, ... ParsedParams.Key4 Array(String), ParsedParams.Key5 Array(String), ParsedParams.ValueDouble Array(Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8)
ENGINE = ReplicatedMegreTree('/clickhouse/tables/{uuid}', '{replica}');
Now:
CREATE TABLE hits
ENGINE = ReplicatedMegreTree('/clickhouse/tables/{uuid}', '{replica}');
CPU and memory usage is displayed in realtime.
In distributed query, memory usage is displayed as sum and max per host.
Demo:
SELECT uniqExact(title)
FROM remoteSecure('play.clickhouse.com',
default.github_events,
'explorer')
Developer: Dmitriy Novik, Alexey Milovidov.
ClickHouse is a distributed MPP DBMS.
Distributed query is processed in parallel by multiple shards.
But only by one replica on each shard.
There is already the old max_parallel_replicas setting.
That requires SAMPLE key and is difficult to use.
New parallel processing option!
SET max_parallel_replicas = 2,
allow_experimental_parallel_reading_from_replicas = 1
Replicas will dynamically select the work to process.
Allows to spread the workload on non-uniform replicas
(e.g. more and less powerful machines).
Works even if some replicas have lag.
Developer: Nikita Mikhailov.
Replicas will register themself in the cluster automatically:
<allow_experimental_cluster_discovery>1
</allow_experimental_cluster_discovery>
<remote_servers>
<auto_cluster>
<discovery>
<path>/clickhouse/discovery/auto_cluster</path>
<shard>1</shard>
</discovery>
</auto_cluster>
</remote_servers>
No need to edit config on adding new replicas!
Developer: Vladimir Cherkasov.
If a column contains mostly zeros, we can encode it in sparse format
and automatically optimize calculations!
CREATE TABLE test.hits ...
ENGINE = MergeTree ORDER BY ...
SETTINGS ratio_of_defaults_for_sparse_serialization = 0.9
Special column encoding like LowCardinality
but it's completely transparent and automatic.
Developer: Anton Popov.
Demo.
A tool to collect information about running ClickHouse instance for support.
Already used in production for support in Yandex Cloud.
We decided to make it open-source as a gift! 🎁
Python3 script, https://github.com/ClickHouse/ClickHouse/tree/master/
utils/clickhouse-diagnostics
Developer: Alexander Burmak.
Go port is being prepared for single-binary distribution.
Integration with Hive as foreign table engine for SELECT queries.
Developers: Taiyang Li, Ksenia Sumarokova.
Integration with Azure Blob Storage similar to S3.
Developers: Jakub Kuklis, Ksenia Sumarokova.
Support for hdfsCluster table function similar to s3Cluster.
Developers: Zhichang Yu, Nikita Mikhailov.
Cramer's V and Theil's U and contingency coefficient.
— measure of dependency between values.
— somewhat like correlation but for discrete values.
:) SELECT cramersV(URL, URLDomain) FROM test.hits
0.98
:) SELECT cramersV(URLDomain, ResolutionWidth) FROM test.hits
0.27
Developers: Artem Tsyganov, Ivan Belyaev, Alexey Milovidov.
First ClickHouse builds for Aarch64 (ARM64) — in Feb 2016.
ClickHouse builds have been tested on:
— APM X-Gene;
— Cavium ThunderX 1, 2;
— Raspberry Pi;
— Pinebook;
— Google Pixel;
— Apple M1;
— Huawei Taishan;
— AWS Graviton 1, 2;
— Ampere Altra;
for basic queries and performance.
We need full coverage by test suite and feature completeness.
3855 of 3887 tests from the main test suite are passing.
Release builds and Docker containers starting from version 22.2.
Remaining: Int128/256, base64, JSON functions.
ClickHouse is already being used in production on AArch64
Resource management:
— memory overcommit for flexible memory limits;
— make scalar subqueries accountable;
— IO priorities for queries;
Object storage support:
— local cache for remote filesystem.
SQL compatibility:
— GROUPING SETS support.
Text classification:
— charset and language detection functions.
Roadmap 2022 is published:
https://github.com/ClickHouse/ClickHouse/issues/32513
— open for discussions.
... all great stuff is discussed there, don't miss.