Extraordinary Capabilities of ClickHouse

Author: Alexey Milovidov, 2020-04-06.

Extraordinary
Capabilities of ClickHouse

Alexey Milovidov

Extraordinary Capabilities of ClickHouse,

or How to Use ClickHouse Not for Its Intended Purpose.

ClickHouse for Hardware Benchmarking

1. Anonymized production data
 available externally for testing.

https://youtu.be/2iR7i4akL44

2. ClickHouse is built as a binary
 independent of the distribution.

For Linux x86_64, Linux AArch64, Mac OS.
For every commit and every pull request.

ClickHouse for Hardware Benchmarking

ClickHouse uses all CPU resources, memory bus, disks?
— then it can be used for hardware benchmarking.

Guide:
https://clickhouse.com/docs/en/operations/performance_test/

Test will take about 30 minutes. Doesn't require ClickHouse installation.

ClickHouse for Hardware Benchmarking

Available results:
— bare metal servers;
— cloud instances (AWS, Azure);
— laptops and workstations.

CPU:
— x86_64: Intel, AMD;
— aarch64: AWS Graviton, Pinebook, Huawei Taishan;

Disks:
— HDD, SSD, NVM-e, Optane.

All results are crowdsourced from users.

ClickHouse Without a Server

cat code.txt \ | sed -r -e 's/^ +//' \ | grep . | sort | uniq -c \ | sort -rn | head -n10 38215 { 33495 } 4133 }; 3085 else 2447 namespace DB 1911 */ 1761 return false; 1718 public: 1448 break; 1210 #pragma once

1.665 sec.
export LC_ALL=C: 0.376 sec.

ClickHouse Without a Server

clickhouse-local --structure 's String' --query " SELECT count() AS c, trimLeft(s) AS key FROM table WHERE key != '' GROUP BY key ORDER BY c DESC LIMIT 10" \ < code.txt 38215 { 33466 } 4128 }; 3085 else 2447 namespace DB 1911 */ 1761 return false; 1718 public: 1433 break; 1210 #pragma once

0.103 sec. — 3.7–16 times faster.

ClickHouse Without a Server

(demo)

Processing GitHub Archive data:
https://www.gharchive.org/

ClickHouse Without a Server

SELECT * FROM file('*.json.gz', TSV, 'data String') WHERE JSONExtractString(data, 'actor', 'login') = 'alexey-milovidov' LIMIT 10

1. Table function file.
2. Support for glob patterns.
3. Support for compressed files.
4. Functions for working with JSON.
5. Parallel file processing.
6. Parallel parsing.

ClickHouse Without a Server

Use cases:

— processing text files;
— format conversion;
— preparing partitions for MergeTree;
— data preprocessing before insertion.

Serverless ClickHouse

https://mybranch.dev/posts/clickhouse-on-cloud-run/
— Alex Reid.

Text Formats

CustomSeparated
format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before/between/after_delimiter
format_custom_result_before/after_delimiter

Template
format_template_resultset
format_template_row
format_template_rows_between_delimiter

Regexp
format_regexp
format_regexp_escaping_rule
format_regexp_skip_unmatched

ClickHouse for Semi-Structured Data

CREATE TABLE logs ( time EventTime, message String, ) ... SELECT JSONExtractString(message, 'actor', 'login') FROM logs; ALTER TABLE logs ADD COLUMN actor_login MATERIALIZED JSONExtractString(message, 'actor', 'login'); SELECT actor_login FROM logs; ALTER TABLE logs UPDATE actor_login = actor_login;

ClickHouse Accelerates MySQL

CREATE TABLE counters AS mysql( 'host:3306', conv_main, counters, 'metrika', 'password') SELECT name, count() FROM counters GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 4951979 │ │ ᴺᵁᴸᴸ │ 1318393 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 15.131 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

mysql> SELECT name, count(*) FROM conv_main.counters GROUP BY name ORDER BY count(*) DESC LIMIT 10 -> ; +-----------------+----------+ | name | count(*) | +-----------------+----------+ | | 4952010 | | NULL | 1318393 | | 1 | 53305 | | счетчик | 44744 | | Дзен | 31120 | | лендинг | 17989 | | Мой сайт | 15815 | | Сайт | 15407 | | test | 11775 | | Метрика | 11213 | +-----------------+----------+ 10 rows in set (5 min 41.79 sec)
SELECT any(name), count() FROM counters GROUP BY lowerUTF8(name) ORDER BY count() DESC LIMIT 10 ┌─any(name)─┬─count()─┐ │ │ 4951978 │ │ ᴺᵁᴸᴸ │ 1318393 │ │ 1 │ 53300 │ │ счетчик │ 34675 │ │ Дзен │ 31113 │ │ лендинг │ 17982 │ │ Мой сайт │ 15773 │ │ Сайт │ 15363 │ │ test │ 11773 │ │ Метрика │ 11189 │ └───────────┴─────────┘ 10 rows in set. Elapsed: 14.551 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DICTIONARY counters_dict ( `counter_id` UInt32, `name` String ) PRIMARY KEY counter_id SOURCE(MYSQL( HOST 'host' PORT '3306' DB 'conv_main' TABLE 'counters' USER 'metrika' PASSWORD 'password')) LIFETIME(MIN 3600 MAX 7200) LAYOUT(HASHED())
SELECT name, count() FROM counters_dict GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 6.216 sec. Processed 19.37 million rows...
SELECT CounterID, count(), dictGet('default.counters_dict', 'name', toUInt64(CounterID)) AS name FROM test.hits_orig WHERE Title LIKE '%Яндекс%' GROUP BY CounterID ORDER BY count() DESC LIMIT 20 ┌─CounterID─┬─count()─┬─name──────────────────────────────────────────┐ │ 160656 │ 21258 │ Яндекс Маркет │ │ 10849243 │ 7686 │ images.yandex.ru – контекст – новый интерфейс │ │ 158751 │ 6145 │ Директ на Погоде │ │ 16227523 │ 3250 │ maps.yandex.ru - RTB │ │ 115080 │ 3129 │ Погода КУБР │ │ 20793439 │ 3051 │ fotki.yandex.ru │

ClickHouse Accelerates MySQL

CREATE TABLE counters_mergetree AS counters_dict ENGINE = MergeTree ORDER BY counter_id INSERT INTO counters_mergetree SELECT * FROM counters_dict -- 11 sec. SELECT name, count() FROM counters_mergetree GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

SELECT name, count() FROM counters_mergetree GROUP BY name ORDER BY count() DESC LIMIT 10 ┌─name────────────────────────────────────────┬─count()─┐ │ │ 6270371 │ │ 1 │ 53300 │ │ Дзен │ 22817 │ │ Счетчик │ 20577 │ │ Лендинг │ 17232 │ │ счетчик │ 13772 │ │ null - Красноярский край, Эвенкийский район │ 10735 │ │ 123 │ 10495 │ │ Мой сайт │ 9454 │ │ test │ 8598 │ └─────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DATABASE conv_main ENGINE = MySQL('host:3306', conv_main, 'metrika', 'password'); SHOW TABLES FROM conv_main ┌─name───────────────────────────────────────┐ │ AdvEngines2 │ │ AdvEnginesPlaces │ │ AdvEnginesPlacesNames │ │ AdvPlaces │ │ BadUniqIDs │ ...

ClickHouse for Data Integration

SELECT * FROM mysql(...) AS a LEFT JOIN odbc(...) AS b ON ... LEFT JOIN url(...) AS c ON ... ...

Cascade Distribution of Dictionaries

Task:

Suppose there is a ClickHouse cluster, 100 servers.
Each needs a dictionary from MySQL.
If we create it on each one, MySQL
 won't handle the load when dictionaries are updated.

Solution:

On two ClickHouse servers, create a dictionary from MySQL.
Dictionaries are available as tables in ClickHouse.
Create a Distributed table that looks at them as replicas.
On all ClickHouse servers, create dictionaries from ClickHouse.

Dictionaries for Sharding

CREATE TABLE distr ENGINE = Distributed( cluster, db, table, dictGet('sharding', 'shard_num', id)); SET optimize_skip_unused_shards = 1;

Machine Learning in ClickHouse

1. Applying pre-learned CatBoost models.
 Example: filling gaps in data.

2. Training models directly in ClickHouse.

https://presentations.clickhouse.com/meetup31/ml.pdf

Training Models Directly in ClickHouse

Machine learning models as aggregate functions.

SELECT stochasticLogisticRegression( 0.1, 0.0, 1.0, 'Adam')(target, param1, param2) FROM table GROUP BY k

— trains logistic regression for each k.

SELECT stochasticLogisticRegressionState(...

— trains logistic regression for each k and returns the state.

The state has type AggregateFunction(stochasticLogisticRegression(0.1, 0.0, 1.0, 'Adam'), ...).

Machine Learning Models as Aggregate Functions

The state of an aggregate function
— is a full-fledged data type, can be stored in tables.

States of machine learning functions
can be used to apply the model.

INSERT INTO models SELECT stochasticLogisticRegressionState( 0.1, 0.0, 1.0, 'Adam')(target, param1, param2) AS state FROM table; WITH (SELECT state FROM models) AS model SELECT applyMLModel(model, param1, param2) AS predict FROM table;

ClickHouse as a Graph Database

Example: get the top 5 pagerank value from graph "wz"

SELECT vp(v, 'name'), -- retrieve the vertex property "name" pagerank FROM pagerank( -- the graph algorithm wz, -- the graph table 5, -- iterations 0.85, -- damping factor 0.01 -- epsilon ) ORDER BY pagerank DESC LIMIT 5

https://presentations.clickhouse.com/meetup24/2.%20SQLGraph%20--%20When%20ClickHouse%20marries%20graph%20processing%20Amoisbird.pdf

Developed by Amos Bird. Not available in open-source :(

UDF in ClickHouse

Solution: cache-dictionary with executable source.

ClickHouse on GPU

https://github.com/ClickHouse/ClickHouse/pull/9503

Developed by nVidia.

ClickHouse as Application Server

https://github.com/ClickHouse/ClickHouse/pull/7572

Developed by Zhang2014.

Conclusions

You can always find something new in ClickHouse.

Or implement something new.

New even for its developers.

.