Even More New Features of ClickHouse

Previously on: Summer 2020

PostgreSQL wire protocol Geographical dictionaries Data coarsening with TTL expressions Parallel calculation of FINAL Insertion into ODBC and JDBC Cassandra dictionary source Dictionary layout for SSD Settings for data formats in dictionaries Support for Distributed over Distributed queries Query metrics from PMU GROUP BY optimization on sorted data Apache Arrow import/export MsgPack format "direct" dictionary source

EXPLAIN Query

EXPLAIN [AST|SYNTAX|PLAN|PIPELINE] SELECT ...

EXPLAIN PLAN header = 0, description = 1, actions = 0, optimize = 1 SELECT ...

EXPLAIN PIPELINE header = 0, graph = 1, compact = 1 SELECT ...

Available since version 20.6. Developer — Nikolai Kochetov.

WITH subqueries (CTE)

— for subqueries:

WITH a AS (SELECT ...), b AS (SELECT ...), c AS (SELECT ...) SELECT * FROM a, b WHERE x IN c

— global scope:

WITH a AS (SELECT ...) SELECT * FROM (SELECT * FROM a)

Available since version 20.10. Developer — Amos Bird.

LDAP Authentication

— for authentication of existing users (20.7);

— LDAP as a user directory (20.11);

— LDAP as a role directory;

— Kerberos authentication;

Developers — Denis Glazachev, Vitaly Baranov.

Atomic Database

— DROP TABLE, RENAME TABLE queries are now non-blocking;

— EXCHANGE TABLES query;

Developer — Alexander Tokmakov.
Default since version 20.10. Backward compatibility since version 20.4.

Replication Improvements

— removal of leader selection (20.5);

— parallel insert with quorum (20.10);

— ability to retry insert on different replicas (20.10);

— ALTER DROP/DETACH PART query (20.12);

Developers:
Alexey Milovidov, Aleksandra Latysheva, Alexander Sapin, Nicolae Vartolomei.

New Data Format in MergeTree

— Compact data parts (since version 20.3).

— In-memory data parts (since version 20.6).

— Write-Ahead Log (since version 20.6).

— Durability settings (since version 20.10).

Developer — Anton Popov.

Three Data Part Formats

1. Wide — classic format.

2. Compact — all columns in one file.

3. Memory — data in RAM.

Three Data Part Formats

Controlled by settings:

— min_bytes_for_wide_part,
min_rows_for_wide_part:

if size is larger — use wide format.

— min_bytes_for_compact_part,
min_rows_for_compact_part:

if size is larger — use compact format,
if smaller — use memory format.

Wide > Compact > Memory

Wide Parts

Classic format.

Each column and index in a separate file.

Optimal for reading data from disks, including slow ones.

Cheap ALTER for adding or removing columns.

Compact Parts

All columns in one file.

Optimal for inserting new data.
Especially on slow filesystems.

Less optimal for reading.

Recommended for small data parts.

Not recommended for all parts.

Available since version 20.3, but disabled by default.
Since version 20.10, it will be used for parts up to 10 MB.

In-Memory Parts

Data in RAM.

+ Write-Ahead Log, can be disabled.

— in_memory_parts_enable_wal;
— write_ahead_log_max_bytes.

Even more optimal for inserting new data...
... if write-ahead log is disabled.

More efficient for reading...
... but data in RAM is uncompressed.

All parts are replicated as usual.

Experimental feature.

Durability

If all your data is located on one server...

If you don't use replication...

If replication exists, but within one region...

— should we just call fsync before responding to INSERT?

Durability

Eat My Data (2007):
https://www.youtube.com/watch?v=LMe7hf2G1po

Files Are Hard (2015):
https://danluu.com/file-consistency/

PostgreSQL "Fsyncgate" (2018):
https://lwn.net/Articles/752063/

Durability

If all your data is located on one server...
If you don't use replication...
If replication exists, but within one region...

Then here you go:

— min_rows_to_fsync_after_merge;
— min_compressed_bytes_to_fsync_after_merge;
— min_compressed_bytes_to_fsync_after_fetch;
— fsync_after_insert;
— fsync_part_directory;
— write_ahead_log_bytes_to_fsync;
— write_ahead_log_interval_ms_to_fsync;
— in_memory_parts_insert_sync.

Query Normalization and Obfuscation

Functions normalizeQuery, normalizedQueryHash.

SELECT normalizeQuery(query) FROM system.query_log

— replacement of literals with ?

— replacement of literal lists with ?..

— replacement of complex aliases with `?`

Available since version 20.8.

Query Obfuscation

A friend of mine had a slow query...
... but he didn't want to show it.

Solution:

clickhouse-format --obfuscate < query.sql

Available since version 20.10.

Data obfuscation: https://www.youtube.com/watch?v=2iR7i4akL44

Recompression of Old Data

CREATE TABLE hits
(
  event_time DateTime,
  ...
) ENGINE MergeTree ORDER BY ...

TTL event_time + INTERVAL 1 MONTH
        RECOMPRESS CODEC(ZSTD(1)),
    event_time + INTERVAL 1 YEAR
        RECOMPRESS CODEC(ZSTD(6))

Developer — Alexander Sapin. Available since version 20.10.

Minor Features for ALTER

CREATE TABLE hits
(
  event_time DateTime CODEC(Delta, Default),
  ...
) ENGINE MergeTree ORDER BY ...

ALTER TABLE hits MODIFY COLUMN c
    REMOVE COMMENT|CODEC|TTL
        |DEFAULT|MATERIALIZED|ALIAS

Developer — Alexander Sapin. Available since version 20.10.

Column List Transformations

SELECT * EXCEPT(secret_column) FROM table;

SELECT table.* REPLACE(
  (URL LIKE '%yandex%' ? '' : URL) AS URL) FROM table;

SELECT COLUMNS('^packet_') FROM table; SELECT t.* APPLY(sum) FROM table AS t; SELECT COLUMNS(x, y, z) APPLY(sum) FROM table;

Developer — Amos Bird, mfridental. Available since version 20.10.

COLUMNS('regexp'): developer — mfridental. Available since version 19.12.

New Versions of ClickHouse

20.12 — testing.

20.11 — prestable.

20.10 — stable.

20.9 — stable.

20.8 — LTS until 2021-09-30.

20.7 ... 20.4 — obsolete.

20.3 — LTS until 2021-03-12.

...

19.14 — obsolete.

What Else?

Data import from RabbitMQ.

Kerberos authentication for Kafka.

Beta unlimited storage in Yandex Cloud.

Regexp, RawBLOB, JSONAsString, LinesAsString formats.

Running ClickHouse without packages and configuration.

system.crash_log table, Sentry.

Rank correlation calculation.

256bit Decimal.

What's Next?

Backups.

Projections.

Moving away from ZooKeeper.

OpenTracing support.

Query hedging.

Reading column slices.

Geodata analysis functions.

Semi-duplicate text analysis.

Public Roadmap 2020.

https://clickhouse.tech/docs/ru/whats-new/extended-roadmap/

~ 500 tasks with detailed description — dependencies, assignees...

Public Roadmap 2021.

— expected at the end of November.