Amazing ClickHouse Features

ClickHouse Online Meetup

YouTube stream: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

T-shirts for questions!

If you asked a question via Zoom
— write to me in Telegram after the event:
your question, t-shirt size and courier delivery address.

ClickHouse Features:
Spring/Summer 2020

Downsampling and Aggregation of Old Data

... using TTL expressions.
Are you already familiar with TTL expressions?

CREATE TABLE (d Date, ip IPv4 TTL d + INTERVAL 1 MONTH, ...) ENGINE = MergeTree ... CREATE TABLE t (d Date, ...) ENGINE = MergeTree ... TTL d + INTERVAL 1 MONTH TTL d + INTERVAL 1 MONTH DELETE TTL d + INTERVAL 1 MONTH TO DISK disk_name TTL d + INTERVAL 1 MONTH TO VOLUME volume_name

Downsampling and Aggregation of Old Data

... using TTL expressions.

Downsampling:

TTL d + INTERVAL 1 MONTH DELETE WHERE is_yandex = 0

Aggregation:

CREATE TABLE t (d Date, a UInt64, b String, c String... ENGINE = MergeTree ORDER BY (a, b, c) TTL d + INTERVAL 1 MONTH GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)

a, b - prefix of the sorting key

Downsampling and Aggregation of Old Data

Aggregation:

CREATE TABLE t (d Date, a UInt64, b String, c String... ENGINE = MergeTree ORDER BY (a, b, c) TTL d + INTERVAL 1 MONTH GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)

— GROUP BY by prefix of the sorting key;

— aggregate functions can be calculated
from any combination of columns: SET x = argMax(x, time);

— если выражение SET для столбца
не задано — эквивалентно SET x = any(x);

— not yet possible: combine WHERE and GROUP BY.

Downsampling and Aggregation of Old Data

... using TTL expressions.

Well suited for time series
— coarse-graining of old data.

Similar to GraphiteMergeTree,
but applicable to all MergeTree tables.

Similar to AggregatingMergeTree и SummingMergeTree,
but with time thresholds before aggregation.

Доступно в версии 20.5. Разработчик — Николай Сорокин.

Geographical Functions and Dictionaries

Remember pointInPolygon...

demo

Geographical Dictionaries

Reverse geocoding task:

— finding polygon by coordinates;

— find the smallest polygon,
  return its properties;

— among sets (millions) of polygons;

— with reasonable speed
  and memory consumption;

Geographical Dictionaries

CREATE DICTIONARY polygon_dict ( key Array(Array(Array(Array(Float64)))), name String, value UInt64 ) PRIMARY KEY key SOURCE(...) LAYOUT(POLYGON_INDEX_CELL())

key - multi-polygon:
Array(Float64) - coordinate pair, also can be Tuple(Float64, Float64);
Array(Array(Float64)) - ring;
Array(Array(Array(Float64))) - polygon with holes;
Array(Array(Array(Array(Float64)))) - multi-polygon.

Geographical Dictionaries

SELECT dictGet('polygon_dict', 'name', (x, y))

Performance:

Russian district boundaries (20,000 pieces, 12 mln. points)
from OpenStreetMap.

— 2 GB RAM;
— loading - 2.5 minutes;
— speed 900,000 lookups/sec * number of CPU cores.

Non-constant polygons in pointInPolygon:
Доступно в версии 20.5. Разработчики — Алексей Илюхов, Алексей Коряков.

Polygon dictionaries:
Ожидается в версии 20.6. Разработчики — Андрей Чулков, Антон Кваша, Артур Петуховский.

ClickHouse Interfaces

— Native

— HTTP

  — ODBC

  — JDBC

— MySQL

— ???

ClickHouse Interfaces

— Native

— HTTP

  — ODBC

  — JDBC

— MySQL

— PostgreSQL!

PostgreSQL Interface in ClickHouse

Tested with psql и JDBC Postgres driver.

Works for both read and write.

Not supported:
extended query (parameterized queries) and COPY.

demo

Ожидается в версии 20.6. Разработчик — Мовсес Элбакян.

ClickHouse Interfaces

— Native

— HTTP

  — ODBC

  — JDBC

— MySQL — 19.8

— PostgreSQL — 20.6

— GRPC — prototype, in development

Formats

Apache Arrow

ORC for read and for write

MsgPack, Regexp, Markdown, JSONAsString...

Which format to implement tomorrow?

Apache Arrow:
Доступно в версии 20.5. Разработчики — Жанна Зосимова, hczhcz.

MsgPack, Regexp, Markdown:
Доступно в версии 20.4. Разработчик — Павел Круглов.

JSONAsString, ORC output:
Доступно в версии 20.5. Разработчик — Павел Круглов.

Dictionaries

MongoDB source

Redis source

Cassandra source — 20.5

Layout "direct" и "complex_key_direct" — 20.5

Format settings for dictionaries — 20.4

CREATE DICTIONARY ... SETTINGS format_csv_delimiter = '|'

+ format settings for Kafka tables — 20.5

Caching Dictionaries on SSD

Problem example:

Large constantly updated
banner reference in MySQL (~10 billion rows).

If using cache-dictionaries, cache size is not enough
and too many requests go to MySQL.

Solution:

Гибридная структура данных — оперативка + SSD.
In RAM 16 bytes per key + fresh data buffer.

All data still doesn't fit in cache...
but now hit rate is about 100% and everything is fast.

Ожидается в версии 20.6. Разработчик — Никита Васильев.

Integrations

Writing to ODBC and JDBC Tables

SELECT * FROM odbc('DSN', table); INSERT INTO TABLE FUNCTION odbc('DSN', table);

Доступно в версии 20.5. Разработчик — Артемий Бобровский.

mongo table function

Streaming from RabbitMQ

Replication from MySQL

Optimizations

GROUP BY and DISTINCT on Sorted Data

demo

Доступно в версии 20.5. Разработчик — Дмитрий Рубашкин.

Optimizations

FINAL parallelization.

Optimizations сортировок.

Using RAM in two-level aggregation.

Parallel and distributed INSERT SELECT.

SQL Compatibility

NULL in CREATE TABLE. Type synonyms (20.5):

SET data_type_default_nullable = 1;

CREATE TEMPORARY TABLE test ( x INT NULL, y VARCHAR(255) NOT NULL, z BLOB);

Comparing any types with string constants (20.5):

SELECT ... FROM table WHERE x = '1'

Preserving Nullable on CAST (20.6):

SELECT toTypeName(x), toTypeName(CAST(x AS FLOAT)); Nullable(UInt8) Nullable(Float32)

SQL Compatibility

Aggregate functions count / count distinct / uniq from NULL. (20.5)

SELECT COUNT(DISTINCT x) FROM (SELECT NULL AS x) 0

String and FixedString comparison with padding: (20.4)

SELECT 'x' = CAST('x' AS FixedString(10))

DATE and TIMESTAMP operators (20.4):

SELECT DATE '2020-06-18'

More Features

No more leader election in replication.
Distributed over Distributed.
perf events metrics.

Data parts in RAM and WAL.
Opt-in error sending to Sentry.
Aggregate functions for statistical tests.
Decimal256, Decimal512, Float16 types.
Text near-duplicate search.
OpenTracing support.

Public roadmap 2020!

https://clickhouse.tech/docs/ru/whats-new/extended-roadmap/

~ 500 задач с подробным описанием — зависимости, исполнители...

.