Does ClickHouse support Window Functions?
— Yes.
Initial support in version 21.1. Full support in 21.4.
SET allow_experimental_window_functions = 1
Implemented according to the SQL standard:
— OVER (PARTITION BY ... ORDER BY ...)
— aggregate functions over windows;
— WINDOW clause;
— frame specifications;
— non-aggregate window functions
(rank, dense_rank, row_number, lagInFrame, leadInFrame, etc...);
Developer — Alexander Kuzmenkov.
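A minimal sketch of the supported syntax; the table and column names (visits, user_id, ts, amount) are hypothetical:

```sql
SET allow_experimental_window_functions = 1;

SELECT
    user_id,
    ts,
    sum(amount) OVER w AS running_total,  -- aggregate function over a window
    row_number() OVER w AS n              -- non-aggregate window function
FROM visits
WINDOW w AS (PARTITION BY user_id ORDER BY ts);  -- WINDOW clause
```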
You can query (read and write) PostgreSQL in ClickHouse queries*.
You can JOIN with PostgreSQL tables inside your queries to ClickHouse!
You can create a database that will have all the tables in PostgreSQL schema.
You can use PostgreSQL as a dictionary source.
And ClickHouse can also pretend to be PostgreSQL!
<!-- Compatibility with PostgreSQL protocol.
ClickHouse will pretend to be PostgreSQL
for applications connecting to this port.
-->
<postgresql_port>9005</postgresql_port>
* — earlier this was also available via ODBC.
— PostgreSQL storage engine;
— postgresql table function (supports shards and replicas!);
SELECT * FROM postgresql(`postgres{1|2|3}:5432`,
'clickhouse', 'test_replicas', 'postgres', 'password');
— postgresql dictionary source;
— PostgreSQL database engine as a view to all tables in PG database;
Similar to MySQL integration... but for PostgreSQL.
Upcoming: replication from PostgreSQL.
Developer — Ksenia Sumarokova. Available since 21.2.
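A sketch of the database engine plus a cross-database JOIN; the connection parameters and table names here are hypothetical:

```sql
-- Expose every table of a PostgreSQL database as a ClickHouse database.
CREATE DATABASE pg
ENGINE = PostgreSQL('postgres1:5432', 'clickhouse', 'postgres', 'password');

-- JOIN a local ClickHouse table with a live PostgreSQL table.
SELECT e.user_id, u.name
FROM events AS e
INNER JOIN pg.users AS u ON e.user_id = u.id;
```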
The problem:
SELECT CAST(0.1 AS Decimal(30, 10))
0.1000000000 - Ok.
SELECT CAST(0.1 AS Decimal(30, 30))
0.099999999992113607938064642130 - not Ok.
SELECT CAST('0.1' AS Decimal(30, 30))
0.100000000000000000000000000000 - Ok.
The solution:
SELECT 0.1::Decimal(30, 30)
0.100000000000000000000000000000 - Ok.
Developer — Anton Popov. Available since 21.6.
ClickHouse 💖 PostgreSQL
ClickHouse 💖 MySQL equally well
...
And also ClickHouse is trying very hard to love
s3, HDFS, ODBC, JDBC, Mongo, Redis, Cassandra,
Arrow, GRPC, Kafka, RabbitMQ...
Uncomment <grpc_port>9100</grpc_port>
in config.
Feature parity with the native protocol:
— TLS, compression, query progress,
query cancellation, sessions, external data...
Example: clickhouse-client with GRPC in Python.
utils/grpc-client/clickhouse-grpc-client.py
Developer — Vitaly Baranov. Available since 21.1.
Was: years 1970–2106.
Now: years 1925–2283.
+222 years for free. No performance drawbacks.
Our grandchildren will use ClickHouse.
Developer — Vasily Nemkov, Alexey Milovidov. Available since 21.4.
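Assuming the extended range applies to DateTime64 (plain DateTime keeps its 32-bit range), a quick sketch:

```sql
-- A timestamp beyond year 2106, representable with the extended range.
SELECT toDateTime64('2280-01-01 00:00:00', 3, 'UTC');
```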
extractTextFromHTML function
— best effort text extraction from HTML and XHTML.
ClickHouse can parse HTML, and it does so at 20 gigabytes per second
on 40 Intel Cascade Lake cores.
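A minimal illustration (extraction is best-effort, so exact whitespace handling may vary):

```sql
SELECT extractTextFromHTML('<html><body><p>Hello, <b>world</b>!</p></body></html>');
```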
Trace your ClickHouse queries within your infrastructure:
— enabled by default;
— supports opentelemetry HTTP headers;
— multiple spans are annotated inside ClickHouse;
— data is written to system.opentelemetry_log
Developer — Alexander Kuzmenkov. Available since 20.11.
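To inspect collected spans, one can query the log table named on the slide; the column names below are an assumption, so check DESCRIBE against your version:

```sql
-- Column names are an assumption; verify with DESCRIBE system.opentelemetry_log.
SELECT trace_id, operation_name, finish_time_us - start_time_us AS duration_us
FROM system.opentelemetry_log
ORDER BY start_time_us DESC
LIMIT 10;
```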
Transparent import/export of compressed files:
— gz, brotli;
— xz, zstd;
Example:
CREATE TABLE github_events_url ( ... )
ENGINE = URL('https://datasets.clickhouse.tech/github_events_v2.native.xz', Native);
Developer — Abi Palagashvili. Available since 21.1.
Example: map Map(String, String)
SELECT map['hello']
Developer — Hexiaoting.
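A fuller sketch with a hypothetical table:

```sql
CREATE TABLE events
(
    ts DateTime,
    attrs Map(String, String)
)
ENGINE = MergeTree ORDER BY ts;

INSERT INTO events VALUES (now(), map('os', 'linux', 'browser', 'firefox'));

-- Subscript access, as on the slide:
SELECT attrs['browser'] FROM events WHERE attrs['os'] = 'linux';
```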
— good for key-value queries;
— intended for use as a dictionary source;
Developer — Sundy Li.
— UNION DISTINCT;
— REPLACE TABLE and CREATE OR REPLACE TABLE;
— aggregate_functions_null_for_empty;
— Extended CTE;
— Type cast for IN subquery; type cast for JOIN;
— SHOW [CHANGED] SETTINGS;
— POSITION(needle IN haystack);
— DIV / MOD;
— SELECT ALL;
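A few of these in action (note that in the POSITION(needle IN haystack) form the needle comes first):

```sql
SELECT POSITION('world' IN 'hello world');  -- 7
SELECT 10 DIV 3, 10 MOD 3;                  -- 3, 1
SELECT 1 UNION DISTINCT SELECT 1;           -- a single row
```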
Text analysis with Min-Hash and Sim-Hash algorithms
to find similar / copy-pasted content.
Min-Hash algorithm:
1. Extract every N-word shingle from the text.
2. Calculate hashes of every shingle.
3. Select M hashes with minimum value and M hashes with maximum value.
4. Calculate two total hashes of min-hashes and max-hashes.
5. Texts are duplicates if at least one of their total hashes matches.
Developer — ucasFL. Available since 21.1.
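ClickHouse exposes this as functions such as wordShingleMinHash, which returns a tuple of two combined hashes; per step 5, texts are near-duplicates if either component matches. A sketch:

```sql
WITH
    wordShingleMinHash('ClickHouse finds similar texts with min-hash') AS h1,
    wordShingleMinHash('ClickHouse finds the similar texts with min-hash') AS h2
SELECT tupleElement(h1, 1) = tupleElement(h2, 1)
    OR tupleElement(h1, 2) = tupleElement(h2, 2) AS looks_similar;
```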
Test that two samples have different means, useful for A/B testing.
— studentTTest;
— welchTTest;
— mannWhitneyUTest;
Rank correlation:
— rankCorr.
Upcoming: Cramer's V, Theil's U.
Developer — Nikita Mikhailov et al. Available since 21.1.
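A sketch for an A/B test; ab_data(value, variant) is a hypothetical table where variant is 0 or 1:

```sql
-- Both functions take (sample_value, sample_index)
-- and return a (statistic, p-value) tuple.
SELECT
    welchTTest(value, variant) AS welch,
    mannWhitneyUTest(value, variant) AS mann_whitney
FROM ab_data;
```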
Multiple data representations inside a single table.
— different data order;
— subset of columns;
— subset of rows;
— aggregation.
Difference to materialized views:
— projections data is always consistent;
— updated atomically with the table;
— replicated in the same way as the table;
— projection can be automatically used for SELECT query.
Developer — Amos Bird. Available since 21.6.
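A sketch of adding an aggregating projection to a hypothetical events(user_id, ts, amount) table, assuming the ALTER TABLE ... ADD PROJECTION syntax:

```sql
ALTER TABLE events ADD PROJECTION daily_totals
(
    SELECT toDate(ts) AS day, sum(amount)
    GROUP BY day
);

-- Build the projection for already-existing parts.
ALTER TABLE events MATERIALIZE PROJECTION daily_totals;

-- This query can now be answered from the projection automatically.
SELECT toDate(ts) AS day, sum(amount) FROM events GROUP BY day;
```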
Send distributed query to multiple replicas — to mitigate tail latencies.
This is needed for distributed queries on large clusters (with large "fanout").
* The largest ClickHouse cluster in Yandex is 630+ servers,
but there are many larger clusters in other companies.
Available in 21.4. Developer — Pavel Kruglov and Nikolai Kochetov.
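If this is the hedged-requests mechanism, it is toggled by a setting along these lines; the setting name is my assumption, so verify it against system.settings:

```sql
-- Assumed setting name; the queried Distributed table is hypothetical.
SET use_hedged_requests = 1;
SELECT count() FROM distributed_table;
```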
ClickHouse Roadmap is publicly available on GitHub:
https://github.com/ClickHouse/ClickHouse/issues/17623
I will show you only some highlights and examples.
— Provide alternative for ZooKeeper;
— Nested and semistructured data;
— Enable JIT compilation by default;
— Limited support for transactions;
— Backups;
— Separation of storage and compute;
— Short-circuit evaluation;
— Lightweight DELETE/UPDATE;
— Workload management;
— User Defined Functions;
— Simplify replication;
— JOIN improvements;
— Embedded documentation;
— Pluggable auth with tokens;
— YAML configuration;
— Full disk encryption.
A working prototype:
CREATE TABLE test (id UInt64, data Object('JSON'))
ENGINE = MergeTree;
INSERT INTO test FORMAT JSONEachRow
{"id": 1, "data": {"k1": "aa", "k2": {"k3": "bb", "k4": 123}}}
{"id": 2, "data": {"k1": "ee", "k5": "ff"}};
SELECT id, data.k1, data.k2.*, data.k2.k4::String FROM test;
Nested JSON is automatically mapped to dynamic subcolumns
and is stored in an efficient columnar format.
Bonus: multiple levels of Nested types are already supported.
Developer — Anton Popov.
How can analytical queries be processed quickly?
There are two techniques:
1. Vectorized query execution.
2. Runtime code generation (JIT).
ClickHouse implements both.
Most of its performance comes from (1) — vectorized processing.
The JIT (2) has been available since version 1.1.54388, released in 2018,
but it was disabled by default.
Developers — Denis Skorobogatov, Alexander Sapin, Alexey Milovidov, Maksim Kita.
Enabled by default since 21.6.
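JIT compilation of expressions is toggled per query by settings; the names below are my assumption, so verify them against system.settings:

```sql
-- Assumed setting names.
SET compile_expressions = 1;
SET min_count_to_compile_expression = 3;  -- compile after the expression is seen this many times
```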
Synthetic example: murmur hash in SQL.
WITH
bitXor(number, 5544725790478674055) AS x0,
bitXor(x0, bitShiftRight(x0, 33)) AS x1,
x1 * 18397679294719823053 AS x2,
bitXor(x2, bitShiftRight(x2, 33)) AS x3,
x3 * 14181476777654086739 AS x4,
bitXor(x4, bitShiftRight(x4, 33)) AS x5
SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(x5)
Before: 34.96 GB/s. After: 87.23 GB/s.
— 2.5 times performance improvement.
Work in Progress.
— ZooKeeper network protocol is implemented;
— Abstraction layer over ZooKeeper is used;
— ZooKeeper data model is implemented for testing;
— TestKeeperServer: a server with ZooKeeper data model for testing;
— Jepsen tests continuously run in CI with various failure models;
Benefits:
— less operational complexity;
— fix "zxid overflow" issue;
— fix the issue with max packet size;
— fix "session expired" due to GC pauses;
— improve memory usage;
— allow compressed snapshots;
— allow embedding into clickhouse-server.
Developer — Alexander Sapin.
SELECT IF(number = 0, 0, 123 % number) FROM numbers(10)
— division by zero.
SELECT * FROM numbers(10) WHERE number > 0 AND 10 % number > 0
— division by zero.
— both branches of IF, AND, OR are always evaluated.
SELECT * FROM ( SELECT * FROM numbers(10) WHERE number > 0 ) WHERE 10 % number > 0
— division by zero.
We are considering five ways to implement UDFs, two of which are mandatory:
1. UDF as SQL expressions.
CREATE FUNCTION f AS x -> x + 1
2. UDF as executable script.
Interaction via pipes, data is serialized using supported formats.
Before: you have to CREATE every replicated table on every replica.
After: you can CREATE DATABASE ... ENGINE = Replicated(
'/clickhouse/databases/test1', 'shard', 'replicaN');
and every table operation is automatically replicated on all replicas.
Available since 21.3 as experimental feature. Developer — Alexander Tokmakov.
SELECT SearchPhrase, count() FROM s3Cluster('cluster_name',
'http://minio1:9001/root/data/{2010..2021}/*.csv',
'user', 'password', 'CSV', 'SearchPhrase String, ...')
The query will fetch and process data from S3
in a parallel, distributed fashion, using the compute power
and network bandwidth of all servers in the cluster.
Upcoming: MergeTree tables on external storage
(available as experimental feature).
Upcoming: processing shared MergeTree tables
with dynamically allocated compute nodes.
Available since 21.5. Developer — Nikita Mikhailov.
I heard you don't like XML...
Then ClickHouse will support YAML as an alternative for you.
logger:
level: trace
log: /var/log/clickhouse-server/clickhouse-server.log
errorlog: /var/log/clickhouse-server/clickhouse-server.err.log
size: 1000M
count: 10
http_port: 8123 # Don't worry, YAML configuration
tcp_port: 9000 # will also contain comments.
mysql_port: 9004 #
postgresql_port: 9005 # It's not JSON after all.
max_connections: 4096
Developer — Denis Bolonin.
Read the official roadmap and ask your questions: