Author: Alexey Milovidov, 2020-05-20.
or: How To Use ClickHouse In the Most Unusual Ways
— How to choose appropriate server configurations for ClickHouse?
— What are the differences between cloud and bare-metal setups?
Realistic test for hardware based on real production data.
1. Anonymized production data
available for everyone to download.
Data from a web analytics system (clickstream).
Prepared with clickhouse-obfuscator tool.
2. ClickHouse is built as a single binary,
and doesn't depend on the Linux distribution.
For Linux x86_64, Linux AArch64, Mac OS, FreeBSD.
For every commit and every pull request.
ClickHouse can utilize all available CPU,
memory bandwidth and IO resources.
— so we can use it as a hardware benchmark!
Guide:
https://clickhouse.com/docs/en/operations/performance_test/
The test takes about 30 minutes and doesn't require installing ClickHouse.
Available results:
— baremetal;
— cloud instances (AWS, Azure);
— laptops and workstations.
CPU:
— x86_64: Intel, AMD;
— aarch64: AWS Graviton, Pinebook, Huawei Taishan;
IO:
— HDD, SSD, NVMe, Optane.
All results are crowdsourced.
cat code.txt \
| sed -r -e 's/^ +//' \
| grep . | sort | uniq -c \
| sort -rn | head -n10
38215 {
33495 }
4133 };
3085 else
2447 namespace DB
1911 */
1761 return false;
1718 public:
1448 break;
1210 #pragma once
1.665 sec.
The same pipeline with export LC_ALL=C: 0.376 sec.
clickhouse-local --structure 's String' --query "
SELECT count() AS c, trimLeft(s) AS key
FROM table WHERE key != ''
GROUP BY key ORDER BY c DESC LIMIT 10" \
< code.txt
38215 {
33466 }
4128 };
3085 else
2447 namespace DB
1911 */
1761 return false;
1718 public:
1433 break;
1210 #pragma once
0.103 sec. — 3.7–16 times faster.
(demo)
Processing GitHub Archive data:
https://www.gharchive.org/
SELECT * FROM
file('*.json.gz', TSV, 'data String')
WHERE JSONExtractString(data, 'actor', 'login')
= 'alexey-milovidov'
LIMIT 10
1. Table function file.
2. Support for glob patterns.
3. Support for processing compressed files.
4. Functions for JSON.
5. Parallel processing of a set of files.
6. Parallel parsing of data formats.
Use cases:
— processing of text files;
— transformation of data formats;
— preparation of partitions of MergeTree tables;
— data preprocessing before insert (see the sketch below).
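A rough sketch of the last use case (the github_events table and the extracted fields are hypothetical): parse raw JSON lines and insert the cleaned columns into a MergeTree table in one pass.

INSERT INTO github_events
SELECT
    JSONExtractString(data, 'type') AS event_type,
    JSONExtractString(data, 'actor', 'login') AS actor_login,
    parseDateTimeBestEffort(JSONExtractString(data, 'created_at')) AS created_at
FROM file('*.json.gz', TSV, 'data String');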
https://mybranch.dev/posts/clickhouse-on-cloud-run/
— Alex Reid.
— CustomSeparated
format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before/between/after_delimiter
format_custom_result_before/after_delimiter
— Template
format_template_resultset
format_template_row
format_template_rows_between_delimiter
— Regexp
format_regexp
format_regexp_escaping_rule
format_regexp_skip_unmatched
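A minimal sketch of the Regexp format using the settings above (the log file name and the pattern are hypothetical): each input line is matched against the regular expression, and capture groups are mapped to columns in order.

SET format_regexp = '^(\\S+) \\[(\\w+)\\] (.+)$',
    format_regexp_escaping_rule = 'Raw',
    format_regexp_skip_unmatched = 1;

SELECT *
FROM file('app.log', Regexp, 'time String, level String, message String')
LIMIT 10;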
CREATE TABLE logs
(
    time DateTime,
    message String
) ...
SELECT JSONExtractString(message, 'actor', 'login') FROM logs;
ALTER TABLE logs ADD COLUMN actor_login
MATERIALIZED JSONExtractString(message, 'actor', 'login');
SELECT actor_login FROM logs;
ALTER TABLE logs UPDATE actor_login = actor_login WHERE 1;
CREATE TABLE counters AS mysql(
'host:3306', conv_main, counters, 'metrika', 'password')
SELECT name, count() FROM counters
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 4951979 │
│ ᴺᵁᴸᴸ │ 1318393 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 15.131 sec. Processed 19.37 million rows...
mysql> SELECT name, count(*) FROM conv_main.counters
GROUP BY name ORDER BY count(*) DESC LIMIT 10
-> ;
+-----------------+----------+
| name | count(*) |
+-----------------+----------+
| | 4952010 |
| NULL | 1318393 |
| 1 | 53305 |
| счетчик | 44744 |
| Дзен | 31120 |
| лендинг | 17989 |
| Мой сайт | 15815 |
| Сайт | 15407 |
| test | 11775 |
| Метрика | 11213 |
+-----------------+----------+
10 rows in set (5 min 41.79 sec)
SELECT any(name), count()
FROM counters
GROUP BY lowerUTF8(name)
ORDER BY count() DESC LIMIT 10
┌─any(name)─┬─count()─┐
│ │ 4951978 │
│ ᴺᵁᴸᴸ │ 1318393 │
│ 1 │ 53300 │
│ счетчик │ 34675 │
│ Дзен │ 31113 │
│ лендинг │ 17982 │
│ Мой сайт │ 15773 │
│ Сайт │ 15363 │
│ test │ 11773 │
│ Метрика │ 11189 │
└───────────┴─────────┘
10 rows in set. Elapsed: 14.551 sec. Processed 19.37 million rows...
CREATE DICTIONARY counters_dict
(
`counter_id` UInt32,
`name` String
)
PRIMARY KEY counter_id
SOURCE(MYSQL(
HOST 'host' PORT '3306'
DB 'conv_main' TABLE 'counters'
USER 'metrika' PASSWORD 'password'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED())
SELECT name, count() FROM counters_dict
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 6270371 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
│ test │ 8598 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 6.216 sec. Processed 19.37 million rows...
SELECT
CounterID,
count(),
dictGet('default.counters_dict', 'name', toUInt64(CounterID)) AS name
FROM test.hits_orig
WHERE Title LIKE '%Яндекс%'
GROUP BY CounterID
ORDER BY count() DESC
LIMIT 20
┌─CounterID─┬─count()─┬─name──────────────────────────────────────────┐
│ 160656 │ 21258 │ Яндекс Маркет │
│ 10849243 │ 7686 │ images.yandex.ru – контекст – новый интерфейс │
│ 158751 │ 6145 │ Директ на Погоде │
│ 16227523 │ 3250 │ maps.yandex.ru - RTB │
│ 115080 │ 3129 │ Погода КУБР │
│ 20793439 │ 3051 │ fotki.yandex.ru │
CREATE TABLE counters_mergetree AS counters_dict
ENGINE = MergeTree ORDER BY counter_id
INSERT INTO counters_mergetree SELECT * FROM counters_dict -- 11 sec.
SELECT name, count() FROM counters_mergetree
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 6270371 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
│ test │ 8598 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...
CREATE DATABASE conv_main
ENGINE = MySQL('host:3306', conv_main, 'metrika', 'password');
SHOW TABLES FROM conv_main
┌─name───────────────────────────────────────┐
│ AdvEngines2 │
│ AdvEnginesPlaces │
│ AdvEnginesPlacesNames │
│ AdvPlaces │
│ BadUniqIDs │
...
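Tables of the attached MySQL database can then be queried transparently; a minimal sketch using a table name from the listing above:

SELECT count() FROM conv_main.AdvEngines2;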
SELECT *
FROM mysql(...) AS a
LEFT JOIN odbc(...) AS b ON ...
LEFT JOIN url(...) AS c ON ...
...
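A minimal sketch of such a federated query (hosts, credentials, and the URL are hypothetical):

SELECT a.id, a.name, b.region
FROM mysql('host:3306', 'db', 'users', 'user', 'password') AS a
LEFT JOIN url('https://example.com/regions.csv', CSV, 'id UInt32, region String') AS b
    ON a.id = b.id;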
A task:
Suppose you have a ClickHouse cluster, say 100 servers.
Every server should load a dictionary from MySQL.
But if you create it on every server, MySQL
will not handle the load when the dictionary is updated.
Solution:
Create the MySQL dictionary on just two ClickHouse nodes.
Dictionaries are available as tables in ClickHouse.
Create a Distributed table that treats these two nodes as replicas for reading.
Create dictionaries with a ClickHouse source on every ClickHouse server.
Dictionaries can read from other dictionaries — see the sketch below.
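A minimal sketch of this scheme (the cluster, database, and table names are hypothetical; counters_dict is the MySQL-backed dictionary created earlier, present only on the two dictionary nodes):

-- dict_cluster: one shard whose two replicas are the dictionary-holding servers.
CREATE TABLE counters_dict_distr
(
    counter_id UInt32,
    name String
)
ENGINE = Distributed(dict_cluster, default, counters_dict);

-- On every server: a dictionary with a ClickHouse source that reads from the
-- Distributed table, so only the two dictionary nodes ever query MySQL.
CREATE DICTIONARY counters_dict_local
(
    counter_id UInt32,
    name String
)
PRIMARY KEY counter_id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD ''
    DB 'default' TABLE 'counters_dict_distr'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());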
CREATE TABLE distr AS db.table ENGINE = Distributed(
    cluster, db, table,
    dictGet('sharding', 'shard_num', id));
SET optimize_skip_unused_shards = 1;
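A usage sketch (the id value is hypothetical): with the setting enabled and a constant condition on id, ClickHouse can evaluate the sharding key expression for that constant and send the query only to the matching shard.

SELECT count()
FROM distr
WHERE id = 123;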
1. Applying pre-trained CatBoost models.
Example: filling gaps in data (see the sketch after this list).
2. Machine learning directly in ClickHouse.
A machine learning model is an aggregate function.
It learns while it aggregates data.
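A minimal sketch of the first item (the model name and the feature columns are hypothetical; the model itself has to be registered in the server's CatBoost model configuration):

SELECT modelEvaluate('gap_filling_model',
    feature1, feature2, feature3) AS prediction
FROM table;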
SELECT stochasticLogisticRegression(
0.1, 0.0, 1.0, 'Adam')(target, param1, param2)
FROM table GROUP BY k
— learns logistic regression for every k.
SELECT stochasticLogisticRegressionState(...
— learns a logistic regression for every k and returns the internal state of the model.
The state has the type AggregateFunction(stochasticLogisticRegression(0.1, 0.0, 1.0, 'Adam'), ...).
The internal state of an aggregate function
is a first-class data type: it can be stored in tables.
You can apply this state as a trained model for inference.
INSERT INTO models
SELECT stochasticLogisticRegressionState(
0.1, 0.0, 1.0, 'Adam')(target, param1, param2) AS state
FROM table;
WITH (SELECT state FROM models) AS model
SELECT evalMLMethod(model, param1, param2) AS predict
FROM table;
Example: get the top 5 PageRank values from the graph “wz”
SELECT vp(v, 'name'), -- retrieve the vertex property “name”
pagerank
FROM pagerank( -- the graph algorithm
wz, -- the graph table
5, -- iterations
0.85, -- damping factor
0.01 -- epsilon
)
ORDER BY pagerank DESC
LIMIT 5
Developer: Amos Bird. Experimental, not available in open-source :(
Solution: a dictionary of type cache with an executable data source (sketch below).
It works like this:
./executable < keys > result
Results are cached in memory.
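A minimal sketch of such a dictionary (the dictionary name, key type, and executable path are hypothetical): requested keys are written to the program's stdin, its output is parsed in the given format, and the results are kept in the in-memory cache.

CREATE DICTIONARY lookup_cache
(
    key UInt64,
    value String
)
PRIMARY KEY key
SOURCE(EXECUTABLE(COMMAND './executable' FORMAT TabSeparated))
LIFETIME(MIN 300 MAX 600)
LAYOUT(CACHE(SIZE_IN_CELLS 1000000));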
https://github.com/ClickHouse/ClickHouse/pull/9503
Developed at NVIDIA. Prototype stage.
https://github.com/ClickHouse/ClickHouse/pull/10232
Sort on GPU. A prototype, actually slower.
https://github.com/ClickHouse/ClickHouse/pull/7572
— predefined handlers for parameterized queries:
https://clickhouse.example.net/report/?user_id=123
Developers: Zhang2014, TCeason, and Alexander Tokmakov.
You can always find something new in ClickHouse!
Or develop and contribute something new.
Something new even for ClickHouse developers.