ClickHouse Online Meetup

YouTube stream: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

T-shirts for questions!

If you asked a question via Zoom
— write to me in Telegram after the event:
your question, t-shirt size and courier delivery address.

ClickHouse Features:
Spring/Summer 2020

Downsampling and Aggregation of Old Data

... using TTL expressions.
Are you already familiar with TTL expressions?

CREATE TABLE (d Date, 
    ip IPv4 TTL d + INTERVAL 1 MONTH, ...)
ENGINE = MergeTree ...

CREATE TABLE t (d Date, ...) ENGINE = MergeTree ...
TTL d + INTERVAL 1 MONTH

TTL d + INTERVAL 1 MONTH DELETE TTL d + INTERVAL 1 MONTH TO DISK disk_name TTL d + INTERVAL 1 MONTH TO VOLUME volume_name

Downsampling and Aggregation of Old Data

... using TTL expressions.

Downsampling:

TTL d + INTERVAL 1 MONTH 
    DELETE WHERE is_yandex = 0

Aggregation:

CREATE TABLE t (d Date, a UInt64, b String, c String...
    ENGINE = MergeTree ORDER BY (a, b, c)
    TTL d + INTERVAL 1 MONTH
    GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)

a, b - prefix of the sorting key

Downsampling and Aggregation of Old Data

Aggregation:

CREATE TABLE t (d Date, a UInt64, b String, c String...
    ENGINE = MergeTree ORDER BY (a, b, c)
    TTL d + INTERVAL 1 MONTH
    GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)

— GROUP BY by prefix of the sorting key;

— aggregate functions can be calculated
from any combination of columns: SET x = argMax(x, time);

— если выражение SET для столбца
не задано — эквивалентно SET x = any(x);

— not yet possible: combine WHERE and GROUP BY.

Downsampling and Aggregation of Old Data

... using TTL expressions.

Well suited for time series
— coarse-graining of old data.

Similar to GraphiteMergeTree,
but applicable to all MergeTree tables.

Similar to AggregatingMergeTree и SummingMergeTree,
but with time thresholds before aggregation.

Доступно в версии 20.5. Разработчик — Николай Сорокин.

Geographical Functions and Dictionaries

Remember pointInPolygon...

demo

Geographical Dictionaries

Reverse geocoding task:

— finding polygon by coordinates;

— find the smallest polygon,
return its properties;

— among sets (millions) of polygons;

— with reasonable speed
and memory consumption;

Geographical Dictionaries

CREATE DICTIONARY polygon_dict
(
    key Array(Array(Array(Array(Float64)))),
    name String,
    value UInt64
)
PRIMARY KEY key
SOURCE(...)
LAYOUT(POLYGON_INDEX_CELL())

key - multi-polygon:
Array(Float64) - coordinate pair, also can be Tuple(Float64, Float64);
Array(Array(Float64)) - ring;
Array(Array(Array(Float64))) - polygon with holes;
Array(Array(Array(Array(Float64)))) - multi-polygon.

Geographical Dictionaries

SELECT dictGet('polygon_dict', 'name', (x, y))

Performance:

Russian district boundaries (20,000 pieces, 12 mln. points)
from OpenStreetMap.

— 2 GB RAM;
— loading - 2.5 minutes;
— speed 900,000 lookups/sec * number of CPU cores.

Non-constant polygons in pointInPolygon:
Доступно в версии 20.5. Разработчики — Алексей Илюхов, Алексей Коряков.

Polygon dictionaries:
Ожидается в версии 20.6. Разработчики — Андрей Чулков, Антон Кваша, Артур Петуховский.

ClickHouse Interfaces

— Native

— HTTP

— ODBC

— JDBC

— MySQL

— ???

ClickHouse Interfaces

— Native

— HTTP

— ODBC

— JDBC

— MySQL

— PostgreSQL!

PostgreSQL Interface in ClickHouse

Tested with psql и JDBC Postgres driver.

Works for both read and write.

Not supported:
extended query (parameterized queries) and COPY.

demo

Ожидается в версии 20.6. Разработчик — Мовсес Элбакян.

ClickHouse Interfaces

— Native

— HTTP

— ODBC

— JDBC

— MySQL — 19.8

— PostgreSQL — 20.6

— GRPC — prototype, in development

Formats

Apache Arrow

ORC for read and for write

MsgPack, Regexp, Markdown, JSONAsString...

Which format to implement tomorrow?

Apache Arrow:
Доступно в версии 20.5. Разработчики — Жанна Зосимова, hczhcz.

MsgPack, Regexp, Markdown:
Доступно в версии 20.4. Разработчик — Павел Круглов.

JSONAsString, ORC output:
Доступно в версии 20.5. Разработчик — Павел Круглов.

Dictionaries

MongoDB source

Redis source

Cassandra source — 20.5

Layout "direct" и "complex_key_direct" — 20.5

Format settings for dictionaries — 20.4

CREATE DICTIONARY ...
    SETTINGS format_csv_delimiter = '|'

+ format settings for Kafka tables — 20.5

Caching Dictionaries on SSD

Problem example:

Large constantly updated
banner reference in MySQL (~10 billion rows).

If using cache-dictionaries, cache size is not enough
and too many requests go to MySQL.

Solution:

Гибридная структура данных — оперативка + SSD.
In RAM 16 bytes per key + fresh data buffer.

All data still doesn't fit in cache...
but now hit rate is about 100% and everything is fast.

Ожидается в версии 20.6. Разработчик — Никита Васильев.

Integrations

Writing to ODBC and JDBC Tables

SELECT * FROM odbc('DSN', table);
INSERT INTO TABLE FUNCTION odbc('DSN', table);

Доступно в версии 20.5. Разработчик — Артемий Бобровский.

mongo table function

Streaming from RabbitMQ

Replication from MySQL

Optimizations

GROUP BY and DISTINCT on Sorted Data

demo

Доступно в версии 20.5. Разработчик — Дмитрий Рубашкин.

Optimizations

FINAL parallelization.

Optimizations сортировок.

Using RAM in two-level aggregation.

Parallel and distributed INSERT SELECT.

SQL Compatibility

NULL in CREATE TABLE. Type synonyms (20.5):

SET data_type_default_nullable = 1;

CREATE TEMPORARY TABLE test ( x INT NULL, y VARCHAR(255) NOT NULL, z BLOB);

Comparing any types with string constants (20.5):

SELECT ... FROM table WHERE x = '1'

Preserving Nullable on CAST (20.6):

SELECT toTypeName(x), toTypeName(CAST(x AS FLOAT));

Nullable(UInt8) Nullable(Float32)

SQL Compatibility

Aggregate functions count / count distinct / uniq from NULL. (20.5)

SELECT COUNT(DISTINCT x) FROM (SELECT NULL AS x)

0

String and FixedString comparison with padding: (20.4)

SELECT 'x' = CAST('x' AS FixedString(10))

DATE and TIMESTAMP operators (20.4):

SELECT DATE '2020-06-18'

More Features

No more leader election in replication.
Distributed over Distributed.
perf events metrics.

Data parts in RAM and WAL.
Opt-in error sending to Sentry.
Aggregate functions for statistical tests.
Decimal256, Decimal512, Float16 types.
Text near-duplicate search.
OpenTracing support.

Public roadmap 2020!

https://clickhouse.tech/docs/ru/whats-new/extended-roadmap/

~ 500 задач с подробным описанием — зависимости, исполнители...

ClickHouse Online Meetup

ClickHouse Features:Spring/Summer 2020

Downsampling and Aggregation of Old Data

Downsampling and Aggregation of Old Data

Downsampling and Aggregation of Old Data

Downsampling and Aggregation of Old Data

Geographical Functions and Dictionaries

Geographical Dictionaries

Geographical Dictionaries

Geographical Dictionaries

ClickHouse Interfaces

ClickHouse Interfaces

PostgreSQL Interface in ClickHouse

ClickHouse Interfaces

Formats

Dictionaries

Caching Dictionaries on SSD

Integrations

Optimizations

Optimizations

SQL Compatibility

SQL Compatibility

More Features

Public roadmap 2020!

.

ClickHouse Features:
Spring/Summer 2020