Even Newer ClickHouse Features

ClickHouse Online Meetup

YouTube stream: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

T-shirts for questions!

If you asked a question via Zoom
— write to me in Telegram after the event:
your question, t-shirt size and courier delivery address.

Even Newer ClickHouse Features

New Data Format in MergeTree

— Compact data parts (since version 20.3).

— Data parts in memory (since version 20.6).

— Write-Ahead Log (since version 20.6).

— Durability settings (since version 20.10).

Developer — Anton Popov.

Three Data Part Formats

1. Wide — the classic format.

2. Compact — all columns in one file.

3. Memory — data in RAM.

Three Data Part Formats

Controlled by settings:

min_bytes_for_wide_part,
min_rows_for_wide_part:

  if the part is larger, use the wide format.

min_bytes_for_compact_part,
min_rows_for_compact_part:

  if the part is larger, use the compact format;
  if smaller, use the memory format.

Wide > Compact > Memory
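
A minimal sketch of setting these thresholds per table (the numbers are illustrative, not the defaults):

CREATE TABLE hits
(
    event_time DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY event_time
SETTINGS
    min_bytes_for_wide_part = 10485760,  -- parts of 10 MB and larger use the wide format
    min_bytes_for_compact_part = 4096;   -- parts between 4 KB and 10 MB use the compact format,
                                         -- smaller ones stay in memory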

Wide Parts

The classic format.

Each column and index in a separate file.

Optimal for reading data from disks, including slow ones.

Cheap ALTER for adding or removing columns.
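
On disk, a wide part looks roughly like this (column names are illustrative):

hits/all_1_1_0/
    event_time.bin   event_time.mrk2
    url.bin          url.mrk2
    primary.idx      checksums.txt   columns.txt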

Compact Parts

All columns in one file.

Optimal for inserting new data.
Especially on slow filesystems.

Less optimal for reading.

Recommended for small data parts.

Not recommended for all parts.

Available since version 20.3, but disabled by default.
Since version 20.10, it will be used by default for parts up to 10 MB.
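
You can check which format each part actually got with a query like this against system.parts:

SELECT name, part_type, rows, bytes_on_disk
FROM system.parts
WHERE table = 'hits' AND active;

-- part_type is one of: Wide, Compact, InMemory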

In-Memory Parts

Data in RAM.

+ Write-Ahead Log, can be disabled.

— in_memory_parts_enable_wal;
— write_ahead_log_max_bytes.

Even more optimal for inserting new data...
... if write-ahead log is disabled.

More efficient for reading...
... but data in RAM is uncompressed.

All parts are replicated as usual.

Experimental feature.
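
A hedged sketch of enabling in-memory parts for a table (thresholds are illustrative, and the feature is experimental):

CREATE TABLE events
(
    event_time DateTime,
    message String
)
ENGINE = MergeTree
ORDER BY event_time
SETTINGS
    min_rows_for_compact_part = 1000,  -- parts with fewer rows stay in memory
    in_memory_parts_enable_wal = 1;    -- keep the write-ahead log for them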

Durability

If all your data is on one server...

If you don't use replication...

If replication exists, but within one region...

— should we just call fsync before responding to the INSERT?

Durability

Eat My Data (2007):
https://www.youtube.com/watch?v=LMe7hf2G1po

Files Are Hard (2015):
https://danluu.com/file-consistency/

PostgreSQL "Fsyncgate" (2018):
https://lwn.net/Articles/752063/

Durability

If all your data is on one server...
If you don't use replication...
If replication exists, but within one region...

Then these settings are for you:

— min_rows_to_fsync_after_merge;
— min_compressed_bytes_to_fsync_after_merge;
— min_compressed_bytes_to_fsync_after_fetch;
— fsync_after_insert;
— fsync_part_directory;
— write_ahead_log_bytes_to_fsync;
— write_ahead_log_interval_ms_to_fsync;
— in_memory_parts_insert_sync.
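
For example, a conservative table that fsyncs on every insert (a sketch; expect slower INSERTs):

CREATE TABLE payments
(
    event_time DateTime,
    amount Decimal(18, 2)
)
ENGINE = MergeTree
ORDER BY event_time
SETTINGS
    fsync_after_insert = 1,    -- fsync part data before confirming each INSERT
    fsync_part_directory = 1;  -- fsync the part directory after writing a part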

Query Normalization and Obfuscation

Functions normalizeQuery, normalizedQueryHash.

SELECT normalizeQuery(query) FROM system.query_log

— replace literals with ?

— replace lists of literals with ?..

— replace complex aliases with `?`

Available since version 20.8.
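
A small illustration of what the normalization produces, based on the rules above:

SELECT normalizeQuery('SELECT count() FROM hits WHERE UserID = 12345 AND URL IN (\'a\', \'b\', \'c\')')

-- SELECT count() FROM hits WHERE UserID = ? AND URL IN (?..)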

Query Obfuscation

A friend of mine had a slow query...
... but he didn't want to show it.

Solution:

clickhouse-format --obfuscate < query.sql

Available since version 20.10.
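
A hedged usage example (identifier renaming is randomized, so the actual output will differ):

$ echo "SELECT user_id, count() FROM payments GROUP BY user_id" | clickhouse-format --obfuscate
SELECT captain_id, count() FROM voyages GROUP BY captain_id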

Data obfuscation: https://www.youtube.com/watch?v=2iR7i4akL44

Recompression of Old Data

CREATE TABLE hits
(
    event_time DateTime,
    ...
)
ENGINE = MergeTree
ORDER BY ...
TTL event_time + INTERVAL 1 MONTH RECOMPRESS CODEC(ZSTD(1)),
    event_time + INTERVAL 1 YEAR RECOMPRESS CODEC(ZSTD(6))

Developer — Alexander Sapin. Available since version 20.10.

ALTER Improvements

CREATE TABLE hits
(
    event_time DateTime CODEC(Delta, Default),
    ...
)
ENGINE = MergeTree
ORDER BY ...

ALTER TABLE hits MODIFY COLUMN c
    REMOVE COMMENT|CODEC|TTL|DEFAULT|MATERIALIZED|ALIAS

Developer — Alexander Sapin. Available since version 20.10.

Column List Transformations

SELECT * EXCEPT(secret_column) FROM table;

SELECT table.* REPLACE((URL LIKE '%yandex%' ? '' : URL) AS URL) FROM table;

SELECT COLUMNS('^packet_') FROM table;

SELECT t.* APPLY(sum) FROM table AS t;

SELECT COLUMNS(x, y, z) APPLY(sum) FROM table;

Developers — Amos Bird, mfridental. Available since version 20.10.

COLUMNS('regexp'): developer — mfridental. Available since version 19.12.

New ClickHouse Versions

20.10 — testing.

20.9 — stable.

20.8 — LTS until 2021-09-30.

20.7 — stable.

20.6 ... 20.4 — obsolete.

20.3 — LTS until 2021-03-12.

...

19.14 — obsolete.

What Else?

Data import from RabbitMQ.

Kerberos authentication for Kafka.

Beta unlimited storage in Yandex.Cloud.

WITH for subqueries (CTEs) and with global scope — see the sketch at the end of this list.

Formats Regexp, RawBLOB, JSONAsString, LinesAsString.

Running clickhouse without packages and configuration.

system.crash_log table, Sentry.

Rank correlation calculation.

256-bit Decimal.
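
As mentioned above, WITH now works for subqueries; a minimal sketch (table and column names are illustrative):

WITH top_users AS
(
    SELECT UserID
    FROM hits
    GROUP BY UserID
    ORDER BY count() DESC
    LIMIT 10
)
SELECT count()
FROM hits
WHERE UserID IN (SELECT UserID FROM top_users);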

What's Next?

Backups.

Projections.

Move away from ZooKeeper.

OpenTracing support.

Query hedging.

Reading column slices.

Functions for geodata analysis.

Text near-duplicate analysis.

Public roadmap 2020.

https://clickhouse.tech/docs/ru/whats-new/extended-roadmap/

~500 tasks with detailed descriptions — dependencies, assignees...
