ClickHouse, Spring 2021

Author: Alexey Milovidov, 2021-03-11.

ClickHouse Meetup Online

YouTube live stream:
https://www.youtube.com/c/ClickHouseDB:
https://youtu.be/HPJOgzQkRls

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

T-shirts for questions!

If you asked a question via Zoom
— write to me after the event via Telegram direct message:
your question, t-shirt size and courier delivery address.

ClickHouse Features, Spring 2021

Previously on: Autumn/Winter 2020

EXPLAIN queries Compact and In-memory data parts PostgreSQL wire protocol LDAP authentication Atomic database Background data recompression Column transformers RabbitMQ integration 256 bit Decimal Kerberos for Kafka and HDFS Query obfuscation and normalization Embedded Web UI

If You Missed...

External dictionaries improvements — Maksim Kita.

Native PostgreSQL integration — Ksenia Sumarokova.

Replicated databases — Alexander Tokmakov.

GRPC Protocol

Enable in config: <grpc_port>9100</grpc_port>.

All features of the native protocol are available:

— TLS, compression, query progress,
  query cancellation, sessions, external data...

Example: clickhouse-client using GRPC on Python.

utils/grpc-client/clickhouse-grpc-client.py

Developer — Vitaly Baranov. Available since version 21.1.

Hedged Queries

Allows sending a query to another replica,
if one of the replicas takes too long to respond,
and choosing the fastest replica from several.

— to eliminate tail latencies on very large clusters.

Available since version 21.3.

* The largest ClickHouse cluster at Yandex is >630 servers,
but other companies have much larger clusters.

Developer — Pavel Kruglov and Nikolai Kochetov.

OpenTelemetry Support

For tracing ClickHouse queries within a large infrastructure:

— enabled by default;

— understands opentelemetry HTTP headers;

— many spans are already marked inside ClickHouse;

— data is written to system.opentelemetry_log;

Developer — Alexander Kuzmenkov. Available since version 20.11.

Compression Formats for Import and Export

Transparent work with compressed files:

— gz, brotli;
— xz, zstd;

Example:

CREATE TABLE github_events_url
(
...
) ENGINE = URL(
  'https://datasets.clickhouse.com/github_events_v2.native.xz',
  Native);

Developer — Abi Palagashvili. Available since version 21.1.

SQL Compatibility Improvements

— UNION DISTINCT;

— REPLACE TABLE and CREATE OR REPLACE TABLE;

— aggregate_functions_null_for_empty;

— Extended CTE;

— Type cast for IN subquery;

— SHOW [CHANGED] SETTINGS;

— POSITION(needle IN haystack);

— DIV / MOD;

— SELECT ALL;

ANTLR Grammar

— experimental and incomplete;

— will help in developing third-party tools,
  working with ClickHouse queries;

For example: syntax highlighting in code editor.

Developer — Ivan Lezhankin. Available since version 21.1.

Map Data Type

Example: map Map(String, String)

SELECT map['hello']

Developer — Hexiaoting.

EmbeddedRocksDB Table Engine

— for key-value queries;

— ideal as a dictionary source;

Developer — Sundy Li.

ALTER Improvements

ALTER UPDATE / DELETE IN PARTITION:

— limiting the scope of mutations.

ALTER DROP PART:

— now also for Replicated tables.

TTL now removes empty parts:

— no confusion with remaining parts.

Semi-Duplicate Text Search

Text analysis using Min-Hash and Sim-Hash algorithms
to search for similar or copied pieces.

Min-Hash algorithm:

1. Extract all shingles of N consecutive words from the text.

2. Hash the shingles!

3. Select M hashes with minimum values and the same amount with maximum.

4. Hash the hashes!

5. Texts are considered duplicates if at least one of the two hashes matches.

Developer — ucasFL. Available since version 21.1.

Statistical Tests

Tests to check if distributions have different means,
useful for A/B testing.

— studentTTest;

— welchTTest;

— mannWitneyUTest;


Rank correlation:

— rankCorr.

Developer — Nikita Mikhailov and others. Available since version 21.1.

Spring/Summer 2021

ClickHouse Roadmap is publicly available on GitHub:

https://github.com/ClickHouse/ClickHouse/issues/17623

There's too much there... I'll only tell you a little bit.

Main Tasks

Provide alternative for ZooKeeper Nested and semistructured data Limited support for transactions Backups Hedged requests Window functions Separation of storage and compute Short-circuit evaluation Projections Lightweight DELETE/UPDATE Workload management User Defined Functions Simplify replication JOIN improvements Embedded documentation Pluggable auth with tokens

https://github.com/ClickHouse/ClickHouse/issues/17623

Window Functions

In development. Initial support in version 21.1.

SET allow_experimental_window_functions = 1

Already in release:
— OVER (PARTITION BY ... ORDER BY ...)
— aggregate functions over windows;
— WINDOW clause; — frame specifications;

In development:
— non-aggregate window functions (rank, etc...);

Developer — Alexander Kuzmenkov.

Nested and Semistructured Data

In development. Initial support in version 21.1.

Multiple nesting:

cart Nested(
    item_id UInt64,
    item_price Decimal(20, 5),
    features Nested(
        ...))

SELECT cart.item_id, cart.features.f1 FROM table

SELECT cart.* FROM table

Support for nested JSON and Protobuf.

Developer — Anton Popov. Available since version 21.1.

Projections: Developer — Amos Bird.

Multiple data representations in one table.

— different sort keys;
— column subset;
— row subset;
— pre-aggregations.

Under review.

Differences from materialized views:

— data in projections is guaranteed to be consistent;
— updated atomically with the table;
— replicated the same way as the table;
— best projections are automatically chosen for SELECT.

Alternative for ZooKeeper

In development.
— ZooKeeper network protocol implemented;
— abstraction layer over ZooKeeper is used;
— ZooKeeper data model implemented;
— TestKeeperServer: server with ZooKeeper data model for tests;
— NuKeeperServer: server with distributed RAFT consensus;
— snapshots, logs, crash recovery implemented;

Advantages:
— simpler operation;
— fixing "zxid overflow";
— fixing max packet size issues;
— fixing "session expired" due to GC pauses;
— reduced memory consumption;
— compressed snapshots and logs;
— embedding into clickhouse-server.

Developer — Alexander Sapin.

Short-Circuit Expression Evaluation

SELECT IF(number = 0, 0, 123 % number) FROM numbers(10)

— division by zero.

SELECT * FROM numbers(10) WHERE number > 0 AND 10 % number > 0

— division by zero.

— both branches of IF, AND and OR are always executed.

SELECT * FROM
(
    SELECT * FROM numbers(10)
    WHERE number > 0
)
WHERE 10 % number > 0

— division by zero.

User Defined Functions

Considering 5 ways to implement UDF, two of them are mandatory:

1. UDF as SQL expressions.

CREATE FUNCTION f AS x -> x + 1

2. UDF as an executable program.

Interaction via pipes, data serialized in any format.

Support for Very Frequent Inserts

What happens if you send INSERTs to ClickHouse one row at a time?

Before: nothing good.
— you had to buffer data yourself, or use Kafka, or install clickhouse-bulk, kittenhouse, ...

Will be: all good.
— ClickHouse will automatically collect data in batches and perform asynchronous INSERT.

New ClickHouse Versions

21.4 — testing.

21.3 — prestable, LTS until 2022-03-01.

21.2 — stable.

21.1 — stable.

20.12 — stable.

20.12 ... 20.9 — obsolete.

20.8 — LTS until 2021-09-30.

...

20.3 — obsolete.

?

Public roadmap 2021:

https://github.com/ClickHouse/ClickHouse/issues/17623

.