Author: Alexey Milovidov, 2020-05-20.
or: How To Use ClickHouse In the Most Unusual Ways
— How to choose appropriate server configurations for ClickHouse?
— What are the differences between cloud and bare-metal setups?
Realistic test for hardware based on real production data.
1. Anonymized production data
available for everyone to download.
Data from a web analytics system (clickstream).
Prepared with clickhouse-obfuscator tool.
2. ClickHouse is built as a single binary,
and doesn't depend on the Linux distribution.
For Linux x86_64, Linux AArch64, Mac OS, FreeBSD.
For every commit and every pull request.
ClickHouse can utilize all available CPU,
memory bandwidth and IO resources.
— so we can use it as a hardware benchmark!
Guide:
https://clickhouse.com/docs/en/operations/performance_test/
The test takes about 30 minutes and doesn't require installing ClickHouse.
Available results:
— baremetal;
— cloud instances (AWS, Azure);
— laptops and workstations.
CPU:
— x86_64: Intel, AMD;
— aarch64: AWS Graviton, Pinebook, Huawei Taishan;
IO:
— HDD, SSD, NVMe, Optane.
All results are crowdsourced.
cat code.txt \
| sed -r -e 's/^ +//' \
| grep . | sort | uniq -c \
| sort -rn | head -n10
38215 {
33495 }
4133 };
3085 else
2447 namespace DB
1911 */
1761 return false;
1718 public:
1448 break;
1210 #pragma once
1.665 sec.
The same pipeline with export LC_ALL=C: 0.376 sec.
clickhouse-local --structure 's String' --query "
SELECT count() AS c, trimLeft(s) AS key
FROM table WHERE key != ''
GROUP BY key ORDER BY c DESC LIMIT 10" \
< code.txt
38215 {
33466 }
4128 };
3085 else
2447 namespace DB
1911 */
1761 return false;
1718 public:
1433 break;
1210 #pragma once
0.103 sec. — 3.7–16 times faster.
(demo)
Processing GitHub Archive data:
https://www.gharchive.org/
SELECT * FROM
file('*.json.gz', TSV, 'data String')
WHERE JSONExtractString(data, 'actor', 'login')
= 'alexey-milovidov'
LIMIT 10
1. Table function file.
2. Support for glob patterns.
3. Support for processing compressed files.
4. Functions for JSON.
5. Parallel processing of a set of files.
6. Parallel parsing of data formats.
Use cases:
— processing of text files;
— transformation of data formats;
— preparation of partitions of MergeTree tables;
— data preprocessing before insert (see the sketch below).
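A rough sketch of the last use case (the github_events table and the extracted fields are hypothetical): parse raw JSON lines and insert the cleaned columns into a MergeTree table in one pass.

INSERT INTO github_events
SELECT
    JSONExtractString(data, 'type') AS event_type,
    JSONExtractString(data, 'actor', 'login') AS actor_login,
    parseDateTimeBestEffort(JSONExtractString(data, 'created_at')) AS created_at
FROM file('*.json.gz', TSV, 'data String');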
https://mybranch.dev/posts/clickhouse-on-cloud-run/
— Alex Reid.
— CustomSeparated
format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before/between/after_delimiter
format_custom_result_before/after_delimiter
— Template
format_template_resultset
format_template_row
format_template_rows_between_delimiter
— Regexp
format_regexp
format_regexp_escaping_rule
format_regexp_skip_unmatched
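A minimal sketch of the Regexp format using the settings above (the log file name and the pattern are hypothetical): each input line is matched against the regular expression, and capture groups are mapped to columns in order.

SET format_regexp = '^(\\S+) \\[(\\w+)\\] (.+)$',
    format_regexp_escaping_rule = 'Raw',
    format_regexp_skip_unmatched = 1;

SELECT *
FROM file('app.log', Regexp, 'time String, level String, message String')
LIMIT 10;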
CREATE TABLE logs
(
    time DateTime,
    message String
) ...
SELECT JSONExtractString(message, 'actor', 'login') FROM logs;
ALTER TABLE logs ADD COLUMN actor_login
MATERIALIZED JSONExtractString(message, 'actor', 'login');
SELECT actor_login FROM logs;
ALTER TABLE logs UPDATE actor_login = actor_login WHERE 1;
CREATE TABLE counters AS mysql(
'host:3306', conv_main, counters, 'metrika', 'password')
SELECT name, count() FROM counters
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 4951979 │
│ ᴺᵁᴸᴸ │ 1318393 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 15.131 sec. Processed 19.37 million rows...
mysql> SELECT name, count(*) FROM conv_main.counters
GROUP BY name ORDER BY count(*) DESC LIMIT 10
-> ;
+-----------------+----------+
| name | count(*) |
+-----------------+----------+
| | 4952010 |
| NULL | 1318393 |
| 1 | 53305 |
| счетчик | 44744 |
| Дзен | 31120 |
| лендинг | 17989 |
| Мой сайт | 15815 |
| Сайт | 15407 |
| test | 11775 |
| Метрика | 11213 |
+-----------------+----------+
10 rows in set (5 min 41.79 sec)
SELECT any(name), count()
FROM counters
GROUP BY lowerUTF8(name)
ORDER BY count() DESC LIMIT 10
┌─any(name)─┬─count()─┐
│ │ 4951978 │
│ ᴺᵁᴸᴸ │ 1318393 │
│ 1 │ 53300 │
│ счетчик │ 34675 │
│ Дзен │ 31113 │
│ лендинг │ 17982 │
│ Мой сайт │ 15773 │
│ Сайт │ 15363 │
│ test │ 11773 │
│ Метрика │ 11189 │
└───────────┴─────────┘
10 rows in set. Elapsed: 14.551 sec. Processed 19.37 million rows...
CREATE DICTIONARY counters_dict
(
`counter_id` UInt32,
`name` String
)
PRIMARY KEY counter_id
SOURCE(MYSQL(
HOST 'host' PORT '3306'
DB 'conv_main' TABLE 'counters'
USER 'metrika' PASSWORD 'password'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED())
SELECT name, count() FROM counters_dict
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 6270371 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
│ test │ 8598 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 6.216 sec. Processed 19.37 million rows...
SELECT
CounterID,
count(),
dictGet('default.counters_dict', 'name', toUInt64(CounterID)) AS name
FROM test.hits_orig
WHERE Title LIKE '%Яндекс%'
GROUP BY CounterID
ORDER BY count() DESC
LIMIT 20
┌─CounterID─┬─count()─┬─name──────────────────────────────────────────┐
│ 160656 │ 21258 │ Яндекс Маркет │
│ 10849243 │ 7686 │ images.yandex.ru – контекст – новый интерфейс │
│ 158751 │ 6145 │ Директ на Погоде │
│ 16227523 │ 3250 │ maps.yandex.ru - RTB │
│ 115080 │ 3129 │ Погода КУБР │
│ 20793439 │ 3051 │ fotki.yandex.ru │
CREATE TABLE counters_mergetree AS counters_dict
ENGINE = MergeTree ORDER BY counter_id
INSERT INTO counters_mergetree SELECT * FROM counters_dict -- 11 sec.
SELECT name, count() FROM counters_mergetree
GROUP BY name ORDER BY count() DESC LIMIT 10
┌─name────────────────────────────────────────┬─count()─┐
│ │ 6270371 │
│ 1 │ 53300 │
│ Дзен │ 22817 │
│ Счетчик │ 20577 │
│ Лендинг │ 17232 │
│ счетчик │ 13772 │
│ null - Красноярский край, Эвенкийский район │ 10735 │
│ 123 │ 10495 │
│ Мой сайт │ 9454 │
│ test │ 8598 │
└─────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 0.619 sec. Processed 19.37 million rows...
CREATE DATABASE conv_main
ENGINE = MySQL('host:3306', conv_main, 'metrika', 'password');
SHOW TABLES FROM conv_main
┌─name───────────────────────────────────────┐
│ AdvEngines2 │
│ AdvEnginesPlaces │
│ AdvEnginesPlacesNames │
│ AdvPlaces │
│ BadUniqIDs │
...
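Tables of the attached MySQL database can then be queried transparently; a minimal sketch using a table name from the listing above:

SELECT count() FROM conv_main.AdvEngines2;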
SELECT *
FROM mysql(...) AS a
LEFT JOIN odbc(...) AS b ON ...
LEFT JOIN url(...) AS c ON ...
...
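A minimal sketch of such a federated query (hosts, credentials, and the URL are hypothetical):

SELECT a.id, a.name, b.region
FROM mysql('host:3306', 'db', 'users', 'user', 'password') AS a
LEFT JOIN url('https://example.com/regions.csv', CSV, 'id UInt32, region String') AS b
    ON a.id = b.id;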
A task:
Suppose you have a ClickHouse cluster, say 100 servers.
Every server should load a dictionary from MySQL.
But if you create it on every server, MySQL
will not handle the load when the dictionary is updated.
Solution:
Create the MySQL dictionary on just two ClickHouse nodes.
Dictionaries are available as tables in ClickHouse.
Create a Distributed table that treats these two nodes as replicas for reading.
Create dictionaries with a ClickHouse source on every ClickHouse server.
Dictionaries can read from other dictionaries — see the sketch below.
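A minimal sketch of this scheme (the cluster, database, and table names are hypothetical; counters_dict is the MySQL-backed dictionary created earlier, present only on the two dictionary nodes):

-- dict_cluster: one shard whose two replicas are the dictionary-holding servers.
CREATE TABLE counters_dict_distr
(
    counter_id UInt32,
    name String
)
ENGINE = Distributed(dict_cluster, default, counters_dict);

-- On every server: a dictionary with a ClickHouse source that reads from the
-- Distributed table, so only the two dictionary nodes ever query MySQL.
CREATE DICTIONARY counters_dict_local
(
    counter_id UInt32,
    name String
)
PRIMARY KEY counter_id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD ''
    DB 'default' TABLE 'counters_dict_distr'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());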
CREATE TABLE distr AS db.table ENGINE = Distributed(
    cluster, db, table,
    dictGet('sharding', 'shard_num', id));
SET optimize_skip_unused_shards = 1;
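A usage sketch (the id value is hypothetical): with the setting enabled and a constant condition on id, ClickHouse can evaluate the sharding key expression for that constant and send the query only to the matching shard.

SELECT count()
FROM distr
WHERE id = 123;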
1. Applying pre-trained CatBoost models.
Example: filling gaps in data (see the sketch after this list).
2. Machine learning directly in ClickHouse.
A machine learning model is an aggregate function.
It learns while it aggregates data.
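A minimal sketch of the first item (the model name and the feature columns are hypothetical; the model itself has to be registered in the server's CatBoost model configuration):

SELECT modelEvaluate('gap_filling_model',
    feature1, feature2, feature3) AS prediction
FROM table;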
SELECT stochasticLogisticRegression(
0.1, 0.0, 1.0, 'Adam')(target, param1, param2)
FROM table GROUP BY k
— learns logistic regression for every k.
SELECT stochasticLogisticRegressionState(...
— learns a logistic regression for every k and returns the internal state of the model.
The state has the type AggregateFunction(stochasticLogisticRegression(0.1, 0.0, 1.0, 'Adam'), ...).
The internal state of an aggregate function
is a first-class data type: it can be stored in tables.
You can apply this state as a trained model for inference.
INSERT INTO models
SELECT stochasticLogisticRegressionState(
0.1, 0.0, 1.0, 'Adam')(target, param1, param2) AS state
FROM table;
WITH (SELECT state FROM models) AS model
SELECT evalMLMethod(model, param1, param2) AS predict
FROM table;
Example: get the top 5 PageRank values from the graph “wz”
SELECT vp(v, 'name'), -- retrieve the vertex property “name”
pagerank
FROM pagerank( -- the graph algorithm
wz, -- the graph table
5, -- iterations
0.85, -- damping factor
0.01 -- epsilon
)
ORDER BY pagerank DESC
LIMIT 5
Developer: Amos Bird. Experimental, not available in open-source :(
Solution: a dictionary of type cache with an executable data source (sketch below).
It works like this:
./executable < keys > result
Results are cached in memory.
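A minimal sketch of such a dictionary (the dictionary name, key type, and executable path are hypothetical): requested keys are written to the program's stdin, its output is parsed in the given format, and the results are kept in the in-memory cache.

CREATE DICTIONARY lookup_cache
(
    key UInt64,
    value String
)
PRIMARY KEY key
SOURCE(EXECUTABLE(COMMAND './executable' FORMAT TabSeparated))
LIFETIME(MIN 300 MAX 600)
LAYOUT(CACHE(SIZE_IN_CELLS 1000000));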
https://github.com/ClickHouse/ClickHouse/pull/9503
Developed at NVIDIA. Prototype stage.
https://github.com/ClickHouse/ClickHouse/pull/10232
Sort on GPU. A prototype, actually slower.
https://github.com/ClickHouse/ClickHouse/pull/7572
— predefined handlers for parameterized queries:
https://clickhouse.example.net/report/?user_id=123
Developers: Zhang2014, TCeason, and Alexander Tokmakov.
You can always find something new in ClickHouse!
Or develop and contribute something new.
Something new even for ClickHouse developers.