Author: Alexey Milovidov, 2023-01-17.
Now:
SELECT * FROM url(
'https://datasets.clickhouse.com/github_events_v2.native.xz') LIMIT 10
Read the structure:
DESCRIBE url('https://datasets.clickhouse.com/github_events_v2.native.xz')
Developer: Pavel Kruglov.
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz', Native,
$$
file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
$$)
LIMIT 10
For formats that already contain structure:
— Native, Protobuf, Avro, Parquet, ORC, Arrow.
— CSVWithNamesAndTypes, TSVWithNamesAndTypes.
It even works for semi-structured and unstructured formats!
— CSV, TSV, CSVWithNames, TSVWithNames,
JSONEachRow, Values, Regexp, MsgPack...
DESCRIBE file('hits.ndjson')
┌─name──────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ UserID │ Nullable(String) │ │ │ │ │ │
│ URLDomain │ Nullable(String) │ │ │ │ │ │
└───────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
$ echo '{"x":[[123,456]]}' > test.ndjson
$ clickhouse-local --query "SELECT x, toTypeName(x) FROM file('test.ndjson')"
[[123,456]] Array(Array(Nullable(Float64)))
$ echo '{"x":[123,"Hello",["World"]]}' > test.ndjson
$ clickhouse-local --query "SELECT x, toTypeName(x) FROM file('test.ndjson')"
(123,'Hello',['World']) Tuple(Nullable(Float64), Nullable(String), Array(Nullable(String)))
$ echo '{"x":"Hello"} {"y":"World"}' > test.ndjson
$ clickhouse-local --query "SELECT * FROM file('test.ndjson')"
\N Hello
World \N
$ echo '{"x":[[123,"Hello"]]} {"x":[[123,"Hello",456]]}' > test.ndjson
$ clickhouse-local --query "SELECT * FROM file('test.ndjson')"
Code: 636. DB::Exception: Cannot extract table structure from JSONEachRow format file. Error: Automatically defined type Array(Tuple(Nullable(Float64), Nullable(String))) for column x in row 1 differs from type defined by previous rows: Array(Tuple(Nullable(Float64), Nullable(String), Nullable(Float64))).
We also support schema on demand!
SELECT c1 AS domain, uniq(c2), count() AS cnt
FROM file('hits.csv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
SELECT c1::String AS domain, uniq(c2::UInt64), count() AS cnt
FROM file('hits.csv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
SELECT URLDomain::String AS domain, uniq(UserID::UInt64), count() AS cnt
FROM file('hits.ndjson')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
Also works for Merge, Distributed and ReplicatedMergeTree!
Was:
CREATE TABLE hits (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, ... ParsedParams.Key4 Array(String), ParsedParams.Key5 Array(String), ParsedParams.ValueDouble Array(Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}', '{replica}');
Now:
CREATE TABLE hits
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}', '{replica}');
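The same shortcut should apply to Distributed tables, with the column list taken from the underlying table; a minimal sketch (cluster and database names are hypothetical, assuming the 'hits' table already exists on the cluster):
CREATE TABLE hits_distributed
ENGINE = Distributed('my_cluster', 'default', 'hits');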
— full and incremental backups;
— tables, partitions, databases, all databases;
— full or partial restore;
— a bunch of files or a single archive;
— atomic snapshot for MergeTree,
best effort snapshot for multiple tables;
Developer — Vitaliy Baranov.
BACKUP TABLE t TO File('backup_20220629');
BACKUP TABLE t TO File('backup_20220629.zip');
BACKUP TABLE t TO File('backup_20220630')
SETTINGS base_backup = File('backup_20220629');
BACKUP TABLE system.users, system.grants TO ...
BACKUP TABLE system.functions TO ...
Developer — Vitaliy Baranov.
BACKUP|RESTORE
TABLE [db.]table_name [AS [db.]table_name_in_backup]
[PARTITION[S] partition_expr [,...]] |
DICTIONARY [db.]dictionary_name [AS [db.]name_in_backup] |
DATABASE database_name [AS database_name_in_backup]
[EXCEPT TABLES ...] |
TEMPORARY TABLE table_name [AS table_name_in_backup] |
ALL TEMPORARY TABLES [EXCEPT ...] |
ALL DATABASES [EXCEPT ...] } [,...]
[ON CLUSTER 'cluster_name']
TO|FROM File('path/') | Disk('disk_name', 'path/') | S3(...)
[SETTINGS base_backup = File(...) | Disk(...)]
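The reverse direction uses the same syntax; a minimal sketch reusing the backup name from the example above (the restored table name is hypothetical):
RESTORE TABLE t AS t_restored FROM File('backup_20220629');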
Developer — Vitaliy Baranov.
BACKUP ALL DATABASES ON CLUSTER TO ...
Developer — Vitaliy Baranov.
Full compatibility with ZooKeeper 3.5 in protocol and data model.
Can run embedded in clickhouse-server. Or separately.
Or replace ZooKeeper in other software stacks.
Passing Jepsen tests, ClickHouse functional and integration tests.
Deployed in production.
Faster than ZooKeeper, consuming less memory.
Compact logs and snapshots. And it makes ClickHouse faster.
Developers: Alexander Sapin, Antonio Andelic.
First ClickHouse builds for AArch64 (ARM64) — in Feb 2016.
ClickHouse builds have been tested on:
— APM X-Gene;
— Cavium ThunderX 1, 2;
— Raspberry Pi;
— Pinebook;
— Google Pixel;
— Apple M1;
— Huawei Taishan;
— AWS Graviton 1, 2 and 3;
— Ampere Altra;
— and on secret Google CPUs;
— automated builds, packages, Docker;
— full functional test runs;
— automated performance tests;
— optimizations with ARM NEON;
— query profiling and introspection;
— gRPC;
What's next? PowerPC? RISC-V?
ClickHouse should always work.
If ZooKeeper is unavailable at server startup,
the ReplicatedMergeTree tables will start in read-only mode
and initialize asynchronously in the background
as soon as ZooKeeper becomes available.
— the same applies to ClickHouse Keeper as well;
— especially useful for embedded ClickHouse Keeper;
Developer: Antonio Andelic. Since 22.9.
Session expired. Table is in readonly mode 😠
Never again:
SET insert_keeper_max_retries = 10;
INSERT will survive restarts of ClickHouse Keeper or ZooKeeper
and reconnections.
Developer: Igor Nikonov. Since 22.11.
Table is in readonly mode 😠
Was: around one minute for reconnection, for no reason.
Now: milliseconds 🥲
Developer: Raul Marin. Since 22.10.
Was: strict per-query limit, max_memory_usage = 10 GB by default.
Now: a query can use the memory if it's available;
in case of memory shortage, the most "overcommitted" query is terminated with an exception.
Developer: Dmitry Novik.
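A hedged sketch of the per-query knobs involved (setting names are taken from the memory overcommit feature as I understand it; defaults and availability may differ by version):
SET memory_overcommit_ratio_denominator = 1073741824; -- soft per-query allowance used to rank queries by overcommit ratio
SET memory_usage_overcommit_max_wait_microseconds = 5000000; -- how long to wait for memory to be freed before throwing
SELECT URLDomain, count() FROM hits GROUP BY URLDomain ORDER BY count() DESC LIMIT 10;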
ClickHouse should support everything you expect.
SELECT
y::String || '.'
|| (y < toYear(today()) - 2000 - 1 ? '*' : m::String) AS Version,
(n <= 3 OR (is_lts AND lts_n <= 2)) ? '✔️' : '❌' AS Supported
FROM (
SELECT y, m,
count() OVER (ORDER BY y DESC, m DESC) AS n,
m IN (3, 8) AS is_lts,
countIf(is_lts) OVER (ORDER BY y DESC, m DESC) AS lts_n
FROM (
WITH extractGroups(version, 'v(\d+)\.(\d+)') AS v,
v[1]::INT AS y, v[2]::INT AS m
SELECT y, m
FROM file('version_date.tsv', TSV, 'version String, date String')
ORDER BY y DESC, m DESC
LIMIT 1 BY y, m)
)
LIMIT 1 BY Version
FORMAT Markdown
SET mutations_sync = 1;
ALTER TABLE hits
DELETE WHERE Title LIKE '%Mongo%';
— 205 sec (for a table with 100 million records).
DELETE FROM hits
WHERE Title LIKE '%Mongo%';
— ??? sec.
Developers: Alexander Gololobov, Jianmei Zhang.
SELECT DISTINCT repo_name, title
FROM github_events
WHERE title ILIKE (
repo_name LIKE '%ClickHouse%' ? '%fast%' : '%slow%')
AND repo_name IN ('ClickHouse/ClickHouse', 'elastic/elasticsearch')
Now I can put LIKE inside LIKE, and it looks like you're going to like it.
Developer: Robert Schulze. Since 22.6.
ClickHouse never slows down!
Speed-up of SELECT with FINAL modifier.
It "simply" improves performance up to 4 times.
Especially for complex transforms like Collapsing and Replacing.
Developer: Nikita Taranov.
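For reference, a minimal query with the modifier; the speed-up applies transparently (the table name is hypothetical, e.g. a ReplacingMergeTree):
SELECT CounterID, count() AS cnt
FROM hits_replacing FINAL
GROUP BY CounterID ORDER BY cnt DESC LIMIT 10;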
Optimize COUNT(DISTINCT ...) for a low number of GROUP BY keys.
Optimize GROUP BY with CPU prefetcher.
Optimize GROUP BY with better block sizes.
Developer: Nikita Taranov.
— "direct" algorithm:
joins with key-value tables by direct lookups, a.k.a. nested loops.
Good if the left table is small, but the right table is like a large dictionary.
Good to use in MATERIALIZED VIEW queries.
— "parallel_hash" algorithm:
a speed-up when the right-hand table is large.
— "full_sorting_merge" algorithm:
when the right-hand side is large,
does not fit in memory, and does not allow lookups.
— "grace_hash" algorithm:
since 22.12.
Developers: Vladimir Cherkasov, lgbo-ustc.
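The algorithm is chosen with the join_algorithm setting; a minimal sketch (table names are hypothetical):
SET join_algorithm = 'grace_hash';
SELECT l.key, count() AS cnt
FROM big_left AS l
INNER JOIN big_right AS r ON l.key = r.key
GROUP BY l.key ORDER BY cnt DESC LIMIT 10;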
ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).
ClickHouse can store the data
or process externally stored data on the fly.
External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: S3, HDFS, Azure, COSN, OSS...
— URLs and local files;
All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, Protobuf, MsgPack...
— schemaful and schemaless;
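For example, a remote database can be queried in place through a table function; a minimal sketch (host, credentials and table are hypothetical):
SELECT count()
FROM mysql('mysql-host:3306', 'shop', 'orders', 'reader', 'secret')
WHERE status = 'paid';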
Now ClickHouse supports Apache Hudi and Delta Lake
for SELECT queries.
SELECT count() FROM deltaLake(
'https://clickhouse-public-datasets.s3.amazonaws.com/delta_lake/hits/')
WHERE URL LIKE '%google%'
-- 4.396 sec.
Developers: Daniil Rubin, Ksenia Sumarokova, Flynn ucasfl. Since 22.11.
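The Hudi side looks analogous through the hudi table function; a hedged sketch (the bucket URL is hypothetical):
SELECT count() FROM hudi(
'https://example-bucket.s3.amazonaws.com/hudi/hits/')
WHERE URL LIKE '%google%'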
Visualizations:
— official ClickHouse plugin for Grafana;
— official support for Superset;
— HEX and Deepnote support.
Data ingestion and processing:
— Kafka Connect integration;
— Airflow, dbt support.
Language drivers:
— official Node.js driver;
— optimized Go driver;
— a new Python client.
ClickHouse is easy to configure for your needs.
The simplest way to install ClickHouse:
curl https://clickhouse.com/ | sh
Single binary package. Installs the latest version. Includes debug info.
Works on every Linux (x86_64, aarch64, powerpc64le),
macOS (x86_64, M1), FreeBSD and Windows (WSL2).
Was: 2.1 GB.
Now: 446 MB, takes ~5 seconds to decompress on first run.
Developers: Arthur Filatenkov, Yakov Olkhovskiy.
So, ClickHouse supports a lot of protocols:
— HTTP
— HTTPS
— Native TCP
— Native TCP wrapped in PROXYv1
— Native TCP with TLS
— MySQL (with TLS support)
— PostgreSQL (with TLS support)
— gRPC (with TLS)
— Replication protocol over HTTP
— Replication protocol over HTTPS
— Keeper client-server protocol;
— Keeper consensus protocol;
— ...
So, ClickHouse supports a lot of protocols.
How to configure all of them? What if:
— the server has multiple network interfaces?
— I want to enable one protocol on multiple ports?
— I want native TCP for localhost only and HTTPS from everywhere?
— I want different TLS certificates for different protocols?
— I want to wrap one protocol in another?
Developer: Yakov Olkhovskiy. Since 22.10.
<protocols>
<tcp>
<type>tcp</type>
<host>::</host>
<port>9000</port>
<description>native protocol</description>
</tcp>
<tcp_secure>
<type>tls</type>
<impl>tcp</impl>
<port>9440</port>
<description>secure native protocol</description>
</tcp_secure>
<tcp_endpoint>
<impl>tcp</impl>
<host>0.0.0.0</host>
<port>9001</port>
<description>native protocol, another</description>
</tcp_endpoint>
<tcp_proxy>
<type>proxy1</type>
<impl>tcp</impl>
<port>9100</port>
<description>native protocol with PROXYv1</description>
</tcp_proxy>
</protocols>

+ penetration testing, bug bounty program, audit reports...
— early access since May;
— beta since Oct 4th;
— GA since Dec 5th;
— free 14-day trial up to 10 TB of data;
Try it! https://clickhouse.cloud/.
Published at https://github.com/ClickHouse/ClickHouse/issues/44767,
open for discussion.
Lightweight DELETE
Replicated database engine
Parallel reading from replicas
Transactions for Replicated tables
Object data type
Enable "Analyzer" by default
JOINs reordering and extended pushdown
Automatic selection of the JOIN algorithm
Correlated subqueries (with decorrelation)
Recursive CTE
Optimization of reading for Parquet
Support for embedded indices inside Parquet
Integration with Apache Iceberg
Streaming consumption from a bunch of files
Asynchronous inserts by default
Query results cache
Regexp-Tree dictionaries
Batch jobs and refreshable materialized views
Streaming queries
Freeform text format
Websocket protocol for the server
ssh protocol for the server
Key-value data marts
Unique key constraint
PRQL as a dialect
Kusto support