YouTube stream: https://www.youtube.com/c/ClickHouseDB
Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en
T-shirts for questions!
If you asked a question via Zoom
— write to me in Telegram after the event:
your question, t-shirt size and courier delivery address.
... using TTL expressions.
Are you already familiar with TTL expressions?
CREATE TABLE (d Date,
ip IPv4 TTL d + INTERVAL 1 MONTH, ...)
ENGINE = MergeTree ...
CREATE TABLE t (d Date, ...) ENGINE = MergeTree ...
TTL d + INTERVAL 1 MONTH
TTL d + INTERVAL 1 MONTH DELETE
TTL d + INTERVAL 1 MONTH TO DISK disk_name
TTL d + INTERVAL 1 MONTH TO VOLUME volume_name
... using TTL expressions.
Downsampling:
TTL d + INTERVAL 1 MONTH
DELETE WHERE is_yandex = 0
Aggregation:
CREATE TABLE t (d Date, a UInt64, b String, c String...
ENGINE = MergeTree ORDER BY (a, b, c)
TTL d + INTERVAL 1 MONTH
GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)
a, b - prefix of the sorting key
Aggregation:
CREATE TABLE t (d Date, a UInt64, b String, c String...
ENGINE = MergeTree ORDER BY (a, b, c)
TTL d + INTERVAL 1 MONTH
GROUP BY a, b SET x = sum(x), y = quantile(0.9)(y)
— GROUP BY by prefix of the sorting key;
— aggregate functions can be calculated
from any combination of columns: SET x = argMax(x, time);
— если выражение SET для столбца
не задано — эквивалентно SET x = any(x);
— not yet possible: combine WHERE and GROUP BY.
... using TTL expressions.
Well suited for time series
— coarse-graining of old data.
Similar to GraphiteMergeTree,
but applicable to all MergeTree tables.
Similar to AggregatingMergeTree и SummingMergeTree,
but with time thresholds before aggregation.
Доступно в версии 20.5. Разработчик — Николай Сорокин.
Remember pointInPolygon...
demo
Reverse geocoding task:
— finding polygon by coordinates;
— find the smallest polygon,
return its properties;
— among sets (millions) of polygons;
— with reasonable speed
and memory consumption;
CREATE DICTIONARY polygon_dict
(
key Array(Array(Array(Array(Float64)))),
name String,
value UInt64
)
PRIMARY KEY key
SOURCE(...)
LAYOUT(POLYGON_INDEX_CELL())
key - multi-polygon:
Array(Float64) - coordinate pair, also can be Tuple(Float64, Float64);
Array(Array(Float64)) - ring;
Array(Array(Array(Float64))) - polygon with holes;
Array(Array(Array(Array(Float64)))) - multi-polygon.
SELECT dictGet('polygon_dict', 'name', (x, y))
Performance:
Russian district boundaries (20,000 pieces, 12 mln. points)
from OpenStreetMap.
— 2 GB RAM;
— loading - 2.5 minutes;
— speed 900,000 lookups/sec * number of CPU cores.
Non-constant polygons in pointInPolygon:
Доступно в версии 20.5. Разработчики — Алексей Илюхов, Алексей Коряков.
Polygon dictionaries:
Ожидается в версии 20.6. Разработчики — Андрей Чулков, Антон Кваша, Артур Петуховский.
— Native
— HTTP
— ODBC
— JDBC
— MySQL
— ???
— Native
— HTTP
— ODBC
— JDBC
— MySQL
— PostgreSQL!
Tested with psql и JDBC Postgres driver.
Works for both read and write.
Not supported:
extended query (parameterized queries) and COPY.
demo
Ожидается в версии 20.6. Разработчик — Мовсес Элбакян.
— Native
— HTTP
— ODBC
— JDBC
— MySQL — 19.8
— PostgreSQL — 20.6
— GRPC — prototype, in development
Apache Arrow
ORC for read and for write
MsgPack, Regexp, Markdown, JSONAsString...
Which format to implement tomorrow?
Apache Arrow:
Доступно в версии 20.5. Разработчики — Жанна Зосимова, hczhcz.
MsgPack, Regexp, Markdown:
Доступно в версии 20.4. Разработчик — Павел Круглов.
JSONAsString, ORC output:
Доступно в версии 20.5. Разработчик — Павел Круглов.
MongoDB source
Redis source
Cassandra source — 20.5
Layout "direct" и "complex_key_direct" — 20.5
Format settings for dictionaries — 20.4
CREATE DICTIONARY ...
SETTINGS format_csv_delimiter = '|'
+ format settings for Kafka tables — 20.5
Problem example:
Large constantly updated
banner reference in MySQL (~10 billion rows).
If using cache-dictionaries, cache size is not enough
and too many requests go to MySQL.
Solution:
Гибридная структура данных — оперативка + SSD.
In RAM 16 bytes per key + fresh data buffer.
All data still doesn't fit in cache...
but now hit rate is about 100% and everything is fast.
Ожидается в версии 20.6. Разработчик — Никита Васильев.
Writing to ODBC and JDBC Tables
SELECT * FROM odbc('DSN', table);
INSERT INTO TABLE FUNCTION odbc('DSN', table);
Доступно в версии 20.5. Разработчик — Артемий Бобровский.
mongo table function
Streaming from RabbitMQ
Replication from MySQL
GROUP BY and DISTINCT on Sorted Data
demo
Доступно в версии 20.5. Разработчик — Дмитрий Рубашкин.
FINAL parallelization.
Optimizations сортировок.
Using RAM in two-level aggregation.
Parallel and distributed INSERT SELECT.
NULL in CREATE TABLE. Type synonyms (20.5):
SET data_type_default_nullable = 1;
CREATE TEMPORARY TABLE test (
x INT NULL, y VARCHAR(255) NOT NULL, z BLOB);
Comparing any types with string constants (20.5):
SELECT ... FROM table WHERE x = '1'
Preserving Nullable on CAST (20.6):
SELECT toTypeName(x), toTypeName(CAST(x AS FLOAT));
Nullable(UInt8) Nullable(Float32)
Aggregate functions count / count distinct / uniq from NULL. (20.5)
SELECT COUNT(DISTINCT x) FROM (SELECT NULL AS x)
0
String and FixedString comparison with padding: (20.4)
SELECT 'x' = CAST('x' AS FixedString(10))
DATE and TIMESTAMP operators (20.4):
SELECT DATE '2020-06-18'
No more leader election in replication.
Distributed over Distributed.
perf events metrics.
Data parts in RAM and WAL.
Opt-in error sending to Sentry.
Aggregate functions for statistical tests.
Decimal256, Decimal512, Float16 types.
Text near-duplicate search.
OpenTracing support.
https://clickhouse.tech/docs/ru/whats-new/extended-roadmap/
~ 500 задач с подробным описанием — зависимости, исполнители...