Extraordinary Capabilities of ClickHouse

ClickHouse for Hardware Benchmarking

1. Anonymized production data
available externally for testing.

https://youtu.be/2iR7i4akL44

2. ClickHouse is built as a binary
independent of the distribution.

For Linux x86_64, Linux AArch64, Mac OS.
For every commit and every pull request.

ClickHouse for Hardware Benchmarking

ClickHouse uses all CPU resources, memory bus, disks?
— then it can be used for hardware benchmarking.

Guide:
https://clickhouse.com/docs/en/operations/performance_test/

Test will take about 30 minutes. Doesn't require ClickHouse installation.

ClickHouse for Hardware Benchmarking

Available results:
— bare metal servers;
— cloud instances (AWS, Azure);
— laptops and workstations.

CPU:
— x86_64: Intel, AMD;
— aarch64: AWS Graviton, Pinebook, Huawei Taishan;

Disks:
— HDD, SSD, NVM-e, Optane.

All results are crowdsourced from users.

ClickHouse Without a Server

cat code.txt \
    | sed -r -e 's/^ +//' \
    | grep . | sort | uniq -c \
    | sort -rn | head -n10

  38215 {
  33495 }
   4133 };
   3085 else
   2447 namespace DB
   1911 */
   1761 return false;
   1718 public:
   1448 break;
   1210 #pragma once

1.665 sec.
export LC_ALL=C: 0.376 sec.

ClickHouse Without a Server

clickhouse-local --structure 's String' --query "
    SELECT count() AS c, trimLeft(s) AS key
    FROM table WHERE key != ''
    GROUP BY key ORDER BY c DESC LIMIT 10" \
    < code.txt

38215   {
33466   }
4128    };
3085    else
2447    namespace DB
1911    */
1761    return false;
1718    public:
1433    break;
1210    #pragma once

0.103 sec. — 3.7–16 times faster.

ClickHouse Without a Server

(demo)

Processing GitHub Archive data:
https://www.gharchive.org/

ClickHouse Without a Server

SELECT * FROM
    file('*.json.gz', TSV, 'data String')
WHERE JSONExtractString(data, 'actor', 'login')
    = 'alexey-milovidov'
LIMIT 10

1. Table function file.
2. Support for glob patterns.
3. Support for compressed files.
4. Functions for working with JSON.
5. Parallel file processing.
6. Parallel parsing.

ClickHouse Without a Server

Use cases:

— processing text files;
— format conversion;
— preparing partitions for MergeTree;
— data preprocessing before insertion.

Serverless ClickHouse

https://mybranch.dev/posts/clickhouse-on-cloud-run/
— Alex Reid.

Text Formats

— CustomSeparated
format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before/between/after_delimiter
format_custom_result_before/after_delimiter

— Template
format_template_resultset
format_template_row
format_template_rows_between_delimiter

— Regexp
format_regexp
format_regexp_escaping_rule
format_regexp_skip_unmatched

ClickHouse for Semi-Structured Data

CREATE TABLE logs
(
    time EventTime,
    message String,
) ...

SELECT JSONExtractString(message, 'actor', 'login') FROM logs;

ALTER TABLE logs ADD COLUMN actor_login
    MATERIALIZED JSONExtractString(message, 'actor', 'login');

SELECT actor_login FROM logs;

ALTER TABLE logs UPDATE actor_login = actor_login;

ClickHouse Accelerates MySQL

CREATE TABLE counters AS mysql(
    'host:3306', conv_main, counters, 'metrika', 'password')

SELECT name, count() FROM counters
GROUP BY name ORDER BY count() DESC LIMIT 10

┌─name────────────────────────────────────────┬─count()─┐
│                                             │ 4951979 │
│ ᴺᵁᴸᴸ                                        │ 1318393 │
│ 1                                           │   53300 │
│ Дзен                                        │   22817 │
│ Счетчик                                     │   20577 │
│ Лендинг                                     │   17232 │
│ счетчик                                     │   13772 │
│ null - Красноярский край, Эвенкийский район │   10735 │
│ 123                                         │   10495 │
│ Мой сайт                                    │    9454 │
└─────────────────────────────────────────────┴─────────┘

10 rows in set. Elapsed: 15.131 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

mysql> SELECT name, count(*) FROM conv_main.counters
GROUP BY name ORDER BY count(*) DESC LIMIT 10
    -> ;
+-----------------+----------+
| name            | count(*) |
+-----------------+----------+
|                 |  4952010 |
| NULL            |  1318393 |
| 1               |    53305 |
| счетчик         |    44744 |
| Дзен            |    31120 |
| лендинг         |    17989 |
| Мой сайт        |    15815 |
| Сайт            |    15407 |
| test            |    11775 |
| Метрика         |    11213 |
+-----------------+----------+
10 rows in set (5 min 41.79 sec)

SELECT any(name), count()
FROM counters
GROUP BY lowerUTF8(name)
ORDER BY count() DESC LIMIT 10

┌─any(name)─┬─count()─┐
│           │ 4951978 │
│ ᴺᵁᴸᴸ      │ 1318393 │
│ 1         │   53300 │
│ счетчик   │   34675 │
│ Дзен      │   31113 │
│ лендинг   │   17982 │
│ Мой сайт  │   15773 │
│ Сайт      │   15363 │
│ test      │   11773 │
│ Метрика   │   11189 │
└───────────┴─────────┘

10 rows in set. Elapsed: 14.551 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DICTIONARY counters_dict
(
    `counter_id` UInt32,
    `name` String
)
PRIMARY KEY counter_id
SOURCE(MYSQL(
    HOST 'host' PORT '3306'
    DB 'conv_main' TABLE 'counters'
    USER 'metrika' PASSWORD 'password'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED())

SELECT name, count() FROM counters_dict
GROUP BY name ORDER BY count() DESC LIMIT 10

┌─name────────────────────────────────────────┬─count()─┐
│                                             │ 6270371 │
│ 1                                           │   53300 │
│ Дзен                                        │   22817 │
│ Счетчик                                     │   20577 │
│ Лендинг                                     │   17232 │
│ счетчик                                     │   13772 │
│ null - Красноярский край, Эвенкийский район │   10735 │
│ 123                                         │   10495 │
│ Мой сайт                                    │    9454 │
│ test                                        │    8598 │
└─────────────────────────────────────────────┴─────────┘

10 rows in set. Elapsed: 6.216 sec. Processed 19.37 million rows...

SELECT
    CounterID,
    count(),
    dictGet('default.counters_dict', 'name', toUInt64(CounterID)) AS name
FROM test.hits_orig
WHERE Title LIKE '%Яндекс%'
GROUP BY CounterID
ORDER BY count() DESC
LIMIT 20

┌─CounterID─┬─count()─┬─name──────────────────────────────────────────┐
│    160656 │   21258 │ Яндекс Маркет                                 │
│  10849243 │    7686 │ images.yandex.ru – контекст – новый интерфейс │
│    158751 │    6145 │ Директ на Погоде                              │
│  16227523 │    3250 │ maps.yandex.ru - RTB                          │
│    115080 │    3129 │ Погода КУБР                                   │
│  20793439 │    3051 │ fotki.yandex.ru                               │

ClickHouse Accelerates MySQL

CREATE TABLE counters_mergetree AS counters_dict
    ENGINE = MergeTree ORDER BY counter_id

INSERT INTO counters_mergetree SELECT * FROM counters_dict -- 11 sec.

SELECT name, count() FROM counters_mergetree
GROUP BY name ORDER BY count() DESC LIMIT 10

┌─name────────────────────────────────────────┬─count()─┐
│                                             │ 6270371 │
│ 1                                           │   53300 │
│ Дзен                                        │   22817 │
│ Счетчик                                     │   20577 │
│ Лендинг                                     │   17232 │
│ счетчик                                     │   13772 │
│ null - Красноярский край, Эвенкийский район │   10735 │
│ 123                                         │   10495 │
│ Мой сайт                                    │    9454 │
│ test                                        │    8598 │
└─────────────────────────────────────────────┴─────────┘

10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

SELECT name, count() FROM counters_mergetree
GROUP BY name ORDER BY count() DESC LIMIT 10

┌─name────────────────────────────────────────┬─count()─┐
│                                             │ 6270371 │
│ 1                                           │   53300 │
│ Дзен                                        │   22817 │
│ Счетчик                                     │   20577 │
│ Лендинг                                     │   17232 │
│ счетчик                                     │   13772 │
│ null - Красноярский край, Эвенкийский район │   10735 │
│ 123                                         │   10495 │
│ Мой сайт                                    │    9454 │
│ test                                        │    8598 │
└─────────────────────────────────────────────┴─────────┘

10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...

ClickHouse Accelerates MySQL

CREATE DATABASE conv_main
    ENGINE = MySQL('host:3306', conv_main, 'metrika', 'password');

SHOW TABLES FROM conv_main

┌─name───────────────────────────────────────┐
│ AdvEngines2                                │
│ AdvEnginesPlaces                           │
│ AdvEnginesPlacesNames                      │
│ AdvPlaces                                  │
│ BadUniqIDs                                 │
...

ClickHouse for Data Integration

SELECT *
    FROM mysql(...) AS a
    LEFT JOIN odbc(...) AS b ON ...
    LEFT JOIN url(...) AS c ON ...
...

Cascade Distribution of Dictionaries

Task:

Suppose there is a ClickHouse cluster, 100 servers.
Each needs a dictionary from MySQL.
If we create it on each one, MySQL
won't handle the load when dictionaries are updated.

Solution:

On two ClickHouse servers, create a dictionary from MySQL.
Dictionaries are available as tables in ClickHouse.
Create a Distributed table that looks at them as replicas.
On all ClickHouse servers, create dictionaries from ClickHouse.

Dictionaries for Sharding

CREATE TABLE distr ENGINE = Distributed(
    cluster, db, table,
    dictGet('sharding', 'shard_num', id));

SET optimize_skip_unused_shards = 1;

Machine Learning in ClickHouse

1. Applying pre-learned CatBoost models.
Example: filling gaps in data.

2. Training models directly in ClickHouse.

https://presentations.clickhouse.com/meetup31/ml.pdf

Training Models Directly in ClickHouse

Machine learning models as aggregate functions.

SELECT stochasticLogisticRegression(
    0.1, 0.0, 1.0, 'Adam')(target, param1, param2)
FROM table GROUP BY k

— trains logistic regression for each k.

SELECT stochasticLogisticRegressionState(...

— trains logistic regression for each k and returns the state.

The state has type AggregateFunction(stochasticLogisticRegression(0.1, 0.0, 1.0, 'Adam'), ...).

Machine Learning Models as Aggregate Functions

The state of an aggregate function
— is a full-fledged data type, can be stored in tables.

States of machine learning functions
can be used to apply the model.

INSERT INTO models
SELECT stochasticLogisticRegressionState(
    0.1, 0.0, 1.0, 'Adam')(target, param1, param2) AS state
    FROM table;

WITH (SELECT state FROM models) AS model
SELECT applyMLModel(model, param1, param2) AS predict
FROM table;

ClickHouse as a Graph Database

Example: get the top 5 pagerank value from graph "wz"

SELECT vp(v, 'name'), -- retrieve the vertex property "name"
    pagerank
FROM pagerank( -- the graph algorithm
    wz,        -- the graph table
    5,         -- iterations
    0.85,      -- damping factor
    0.01       -- epsilon
)
ORDER BY pagerank DESC
LIMIT 5

https://presentations.clickhouse.com/meetup24/2.%20SQLGraph%20--%20When%20ClickHouse%20marries%20graph%20processing%20Amoisbird.pdf

Developed by Amos Bird. Not available in open-source :(

UDF in ClickHouse

Solution: cache-dictionary with executable source.

ClickHouse on GPU

https://github.com/ClickHouse/ClickHouse/pull/9503

Developed by nVidia.

ClickHouse as Application Server

https://github.com/ClickHouse/ClickHouse/pull/7572

Developed by Zhang2014.

Conclusions

You can always find something new in ClickHouse.

Or implement something new.

New even for its developers.

ExtraordinaryCapabilities of ClickHouse