ClickHouse Features You Have Never Heard Of

Author: Alexey Milovidov, 2020-05-20.

ClickHouse Features 2020

ClickHouse Features

You Have Never Heard Of

ClickHouse Features

You Have Never Heard Of

or: How To Use ClickHouse In the Most Unusual Ways

ClickHouse as a Hardware Benchmark

— How to choose appropriate server configurations for ClickHouse?

— What are the difference between cloud and baremetal setup?


Realistic test for hardware based on real production data.

ClickHouse as a Hardware Benchmark

1. Anonymized production data
 available to download for everyone.

Data from web analytics system (click stream).
Prepared with clickhouse-obfuscator tool.

2. ClickHouse is built as a single binary,
 and doesn't depend on Linux distribution.

For Linux x86_64, Linux AArch64, Mac OS, FreeBSD.
For every commit and every pull request.

ClickHouse as a Hardware Benchmark

ClickHouse can utilize all available CPU,
memory bandwidth and IO resources.

— then we can use it for hardware benchmark!

Guide:
https://clickhouse.com/docs/en/operations/performance_test/

Test will take about 30 minutes. Doesn't require ClickHouse install.

ClickHouse as a Hardware Benchmark

Available results:
— baremetal;
— cloud instances (AWS, Azure);
— laptops and workstations.

CPU:
— x86_64: Intel, AMD;
— aarch64: AWS Graviton, Pinebook, Huawei Taishan;

IO:
— HDD, SSD, NVM-e, Optane.

All results are croudsourced.

ClickHouse Without a Server

cat code.txt \ | sed -r -e 's/^ +//' \ | grep . | sort | uniq -c \ | sort -rn | head -n10 38215 { 33495 } 4133 }; 3085 else 2447 namespace DB 1911 */ 1761 return false; 1718 public: 1448 break; 1210 #pragma once

1.665 sec.
export LC_ALL=C: 0.376 sec.

ClickHouse Without a Server

clickhouse-local --structure 's String' --query " SELECT count() AS c, trimLeft(s) AS key FROM table WHERE key != '' GROUP BY key ORDER BY c DESC LIMIT 10" \ < code.txt 38215 { 33466 } 4128 }; 3085 else 2447 namespace DB 1911 */ 1761 return false; 1718 public: 1433 break; 1210 #pragma once

0.103 sec. — 3.7–16 times faster.

ClickHouse Without a Server

(demo)

Processing GitHub Archive data:
https://www.gharchive.org/

ClickHouse Without a Server

SELECT * FROM file('*.json.gz', TSV, 'data String') WHERE JSONExtractString(data, 'actor', 'login') = 'alexey-milovidov' LIMIT 10

1. Table function file.
2. Support for glob patterns.
3. Support for processing compressed files.
4. Functions for JSON.
5. Parallel processing of a set of files.
6. Parallel parsing of data formats.

ClickHouse Without a Server

Use cases:

— processing of text files;
— transformation of data formats;
— preparation of partitions of MergeTree tables;
— data preprocessing before insert.

Serverless ClickHouse

https://mybranch.dev/posts/clickhouse-on-cloud-run/
— Alex Reid.

Text Formats

CustomSeparated
format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before/between/after_delimiter
format_custom_result_before/after_delimiter

Template
format_template_resultset
format_template_row
format_template_rows_between_delimiter

Regexp
format_regexp
format_regexp_escaping_rule
format_regexp_skip_unmatched

ClickHouse For Semistructured Data

CREATE TABLE logs ( time EventTime, message String, ) ... SELECT JSONExtractString(message, 'actor', 'login') FROM logs; ALTER TABLE logs ADD COLUMN actor_login MATERIALIZED JSONExtractString(message, 'actor', 'login'); SELECT actor_login FROM logs; ALTER TABLE logs UPDATE actor_login = actor_login;

ClickHouse Accelerates MySQL

CREATE TABLE counters AS mysql( 'host:3306', conv_main, counters, 'metrika', 'password') SELECT name, count() FROM counters GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 4951979 │ │ ᴺᵁᴸᴸ │ 1318393 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 15.131 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

mysql> SELECT name, count(*) FROM conv_main.counters GROUP BY name ORDER BY count(*) DESC LIMIT 10 -> ; +-----------------+----------+ | name | count(*) | +-----------------+----------+ | | 4952010 | | NULL | 1318393 | | 1 | 53305 | | счетчик | 44744 | | Дзен | 31120 | | лендинг | 17989 | | Мой сайт | 15815 | | Сайт | 15407 | | test | 11775 | | Метрика | 11213 | +-----------------+----------+ 10 rows in set (5 min 41.79 sec)
SELECT any(name), count() FROM counters GROUP BY lowerUTF8(name) ORDER BY count() DESC LIMIT 10 ┌─any(name)─┬─count()─┐ │ │ 4951978 │ │ ᴺᵁᴸᴸ │ 1318393 │ │ 1 │ 53300 │ │ счетчик │ 34675 │ │ Дзен │ 31113 │ │ лендинг │ 17982 │ │ Мой сайт │ 15773 │ │ Сайт │ 15363 │ │ test │ 11773 │ │ Метрика │ 11189 │ └───────────┴─────────┘ 10 rows in set. Elapsed: 14.551 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DICTIONARY counters_dict ( `counter_id` UInt32, `name` String ) PRIMARY KEY counter_id SOURCE(MYSQL( HOST 'host' PORT '3306' DB 'conv_main' TABLE 'counters' USER 'metrika' PASSWORD 'password')) LIFETIME(MIN 3600 MAX 7200) LAYOUT(HASHED())
SELECT name, count() FROM counters_dict GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 6.216 sec. Processed 19.37 million rows...
SELECT CounterID, count(), dictGet('default.counters_dict', 'name', toUInt64(CounterID)) AS name FROM test.hits_orig WHERE Title LIKE '%Яндекс%' GROUP BY CounterID ORDER BY count() DESC LIMIT 20 ┌─CounterID─┬─count()─┬─name──────────────────────────────────────────┐ │ 160656 │ 21258 │ Яндекс Маркет │ │ 10849243 │ 7686 │ images.yandex.ru – контекст – новый интерфейс │ │ 158751 │ 6145 │ Директ на Погоде │ │ 16227523 │ 3250 │ maps.yandex.ru - RTB │ │ 115080 │ 3129 │ Погода КУБР │ │ 20793439 │ 3051 │ fotki.yandex.ru │

ClickHouse Accelerates MySQL

CREATE TABLE counters_mergetree AS counters_dict ENGINE = MergeTree ORDER BY counter_id INSERT INTO counters_mergetree SELECT * FROM counters_dict -- 11 sec. SELECT name, count() FROM counters_mergetree GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

SELECT name, count() FROM counters_mergetree GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DATABASE conv_main ENGINE = MySQL('host:3306', conv_main, 'metrika', 'password'); SHOW TABLES FROM conv_main ┌─name───────────────────────────────────────┐ │ AdvEngines2 │ │ AdvEnginesPlaces │ │ AdvEnginesPlacesNames │ │ AdvPlaces │ │ BadUniqIDs │ ...

ClickHouse For Data Integration

SELECT * FROM mysql(...) AS a LEFT JOIN odbc(...) AS b ON ... LEFT JOIN url(...) AS c ON ... ...

Cascade Distribution of Dictionaries

A task:

Suppose you have a ClickHouse cluster, say 100 servers.
Every server should load a dictionary from MySQL.
But if you create it on every server, MySQL
 will not keep the load on dictionary update.

Solution:

Create MySQL dictionary on just two ClickHouse nodes.
Dictionaries are available as tables in ClickHouse.
Create Distributed table that will look to them as replicas for reading.
Create dictionaries with ClickHouse engine on every ClickHouse server.

Dictionaries can look to another dictionaries!

Dictionaries For Sharding Key

CREATE TABLE distr ENGINE = Distributed( cluster, db, table, dictGet('sharding', 'shard_num', id)); SET optimize_skip_unused_shards = 1;

Machine Learning in ClickHouse

1. Applying pre-learned CatBoost models.
 Example: data gaps filling.

2. Machine learning in ClickHouse directly.

https://presentations.clickhouse.com/meetup31/ml.pdf

Machine Learning In ClickHouse Directly

Machine learning model is an aggregate function.
It learns while aggregates data.

SELECT stochasticLogisticRegression( 0.1, 0.0, 1.0, 'Adam')(target, param1, param2) FROM table GROUP BY k

— learns logistic regression for every k.

SELECT stochasticLogisticRegressionState(...

— learns logistic regression for every k and returns internal state of the model.

The state has type AggregateFunction(stochasticLogisticRegression(0.1, 0.0, 1.0, 'Adam'), ...).

Machine Learning Model is an Aggregate Function.

The internal state of aggregate functiuon
— is a first class data type, it can be stored in tables.

You can apply this state as learned model for inference.

INSERT INTO models SELECT stochasticLogisticRegressionState( 0.1, 0.0, 1.0, 'Adam')(target, param1, param2) AS state FROM table; WITH (SELECT state FROM models) AS model SELECT applyMLModel(model, param1, param2) AS predict FROM table;

ClickHouse as a Graph Database

Example: get the top 5 pagerank value from graph “wz”

SELECT vp(v, 'name’), -- retrieve the vertex property “name” pagerank FROM pagerank( -- the graph algorithm wz, -- the graph table 5, -- iterations 0.85, -- damping factor 0.01 -- epsilon ) ORDER BY pagerank DESC LIMIT 5

https://presentations.clickhouse.com/meetup24/2.%20SQLGraph%20--%20When%20ClickHouse%20marries%20graph%20processing%20Amoisbird.pdf

Developer: Amos Bird. Experimental, not available in open-source :(

UDF in ClickHouse

Solution: dictionary of type cache with executable data source.

It works like this:

./executable < keys > result

Results are cached in memory.

ClickHouse on GPU

https://github.com/ClickHouse/ClickHouse/pull/9503

Developed in nVidia. Prototype stage.

https://github.com/ClickHouse/ClickHouse/pull/10232

Sort on GPU. A prototype, actually slower.

ClickHouse as an "Application Server"

https://github.com/ClickHouse/ClickHouse/pull/7572

– predefined handlers for parametrized queries:

https://clickhouse.example.net/report/?user_id=123


Developer: Zhang2014, TCeason and Alexander Tokmakov.

Summary

You can always find something new in ClickHouse!

Or develop and contribute something new.

Something new even for ClickHouse developers.

.