SQL User Defined Functions (since 21.10)
CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2;
CREATE FUNCTION getRegionName AS (x)
-> dictGet('region', 'name', x);
— the functions are polymorphic;
— they are substituted as expressions;
— queries are optimized with common subexpression elimination (CSE) and JIT compilation;
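For example (a minimal sketch; getRegionName assumes the 'region' dictionary above is loaded):
SELECT ch(1), getRegionName(toUInt64(213));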
Limitations:
— no loops;
— no recursion;
Introspection:
SELECT name, create_query
FROM system.functions
WHERE origin = 'SQLUserDefined'
┌─name─────┬─create_query────────────────────────────────────────┐
│ ch       │ CREATE FUNCTION ch AS x -> ((exp(x) + exp(-x)) / 2) │
└──────────┴─────────────────────────────────────────────────────┘
Persistence:
— functions are stored in /var/lib/clickhouse/user_defined/.
Distributed queries:
— functions should be available on the initiator node.
Executable User Defined Functions (since 21.11)
<function>
<type>executable</type>
<name>myPlus</name>
<return_type>UInt64</return_type>
<argument><type>UInt64</type></argument>
<argument><type>UInt64</type></argument>
<format>TabSeparated</format>
<command>awk '{ print $1 + $2 }'</command>
<execute_direct>0</execute_direct>
</function>
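Once this config is loaded, the function is called like a built-in one (a minimal sketch):
SELECT myPlus(number, 1) FROM numbers(3);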
Developer: Maksim Kita.
Two variants:
executable — a worker process is run for every block of data;
executable_pool — a pool of long-running worker processes.
Multiple processes are run concurrently — as many as the number of query threads.
Data is serialized/deserialized in specified format via stdin/stdout.
#!/usr/bin/python3
import sys
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
if __name__ == '__main__':
    sentiment_analyzer = SentimentIntensityAnalyzer()
    # Read chunk length
    for number in sys.stdin:
        length = int(number)
        # Read lines from chunk
        for _ in range(0, length):
            line = sys.stdin.readline()
            score = sentiment_analyzer.polarity_scores(line)
            print(str(score['compound']) + '\n', end='')
        # Flush results to stdout
        sys.stdout.flush()
<function>
<type>executable_pool</type>
<name>sentenceScore</name>
<return_type>Float64</return_type>
<argument>
<type>String</type>
</argument>
<format>TabSeparated</format>
<command>sentence_analyzer</command>
<send_chunk_header>1</send_chunk_header>
</function>
SELECT sentenceScore('ClickHouse is fast') as score
┌─score─┐
│     0 │
└───────┘
SELECT avg(sentenceScore(sentence)) AS avg_sentence_score
FROM executable('sentence_generator.py 10000',
'TabSeparated',
'sentence String')
┌───avg_sentence_score─┐
│ 0.030663238759543694 │
└──────────────────────┘
Data generation and import:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV);
— the script should write data to stdout, similar to the "file" engine.
Data transformation:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV, (SELECT ...));
— the script receives the source stream on stdin and writes data to stdout.
Custom JOIN logic:
CREATE TABLE test ... ENGINE =
Executable('my_script', TSV, (SELECT ...), (SELECT ...), ...);
— the script receives the source stream on stdin and additional streams on file descriptors 3, 4...
Two variants:
— Executable: the script is run on every query;
— ExecutablePool: a pool of persistent processes;
Table function:
SELECT * FROM
executable('my_script', TSV, 'x UInt64...');
SELECT * FROM
executable('my_script', TSV, 'x UInt64...', (SELECT ...) ...);
Applications:
— loading data from foreign data sources;
— applying machine-learned models to datasets;
— custom processing, e.g. DNS resolution, whois...
clickhouse-local is like clickhouse-client and clickhouse-server together
but without the server.
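For example (a sketch; the inline data and column names are made up):
echo -e "1\t2\n3\t4" | clickhouse-local \
    --input-format TSV \
    --structure 'a UInt32, b UInt32' \
    --query 'SELECT a + b FROM table'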
Demo.
Allows doing many small INSERTs.
From many concurrent connections.
Without Kafka and Buffer tables!
Many small INSERTs are combined together
and batched in memory.
Safe by default: every client gets a response when its data is inserted.
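A minimal sketch of how a client opts in (the table and values are hypothetical):
SET async_insert = 1, wait_for_async_insert = 1;
INSERT INTO metrics VALUES (now(), 42);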
Developer: Anton Popov.
Demo.
For external data sources.
Was:
SELECT * FROM mysql('myserver:3306', 'Northwind', 'Orders', 'user', 'password')
Now:
SELECT * FROM mysql(mydb, table = 'Orders')
<named_collections>
<postgres1>
<user>postgres</user>
<password>mysecretpassword</password>
<host>postgres1</host>
<port>5432</port>
<database>postgres</database>
<table>test_table</table>
</postgres1>
<postgres2>
<user>postgres</user>
<password>mysecretpassword</password>
<host>postgres1</host>
</postgres2>
</named_collections>
— Allows predefining connection parameters in the config file.
— Any parameters (address, database/table name, user/password, etc.).
— Parameters can be overridden in the query (see the example after this list).
— Works for table engines, table functions and dictionaries.
— Works for mysql, postgresql, mongo, s3, remote, hdfs...
— Configuration is updated on the fly.
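For example, the partial postgres2 collection above can be completed right in the query (a sketch; the database and table names are assumed):
SELECT * FROM postgresql(postgres2, database = 'postgres', table = 'test_table');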
Developer: Ksenia Sumarokova.
— Full Disk Encryption:
<disk_local_encrypted>
<type>encrypted</type>
<disk>disk_local</disk>
<path>encrypted/</path>
<key from_env="DISK_ENCRYPTION_KEY" />
</disk_local_encrypted>
— Per-Column Encryption:
secret String CODEC(LZ4, AES_128_GCM_SIV),
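A minimal sketch of a table using it (assumes encryption keys are configured on the server; the table and column names are made up):
CREATE TABLE users
(
    id UInt64,
    secret String CODEC(LZ4, AES_128_GCM_SIV)
)
ENGINE = MergeTree ORDER BY id;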
Developers: Vitaly Baranov, Arthur Filatenkov, depressed-pho.
SQL standard way for database introspection
— for compatibility with tools and protocols.
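For example (a minimal sketch):
SELECT table_schema, table_name
FROM information_schema.tables
LIMIT 5;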
Demo
Developer: Alexander Tokmakov.
Unicode normalization:
normalizeUTF8NFC, normalizeUTF8NFD,
normalizeUTF8NFKC, normalizeUTF8NFKD.
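For example, comparing composed and decomposed forms (a minimal sketch):
SELECT lengthUTF8(normalizeUTF8NFC('ö')) AS nfc_length,
       lengthUTF8(normalizeUTF8NFD('ö')) AS nfd_length;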
Demo
Developer: Slava Boben.
— L1, L2, Lp, L∞ spaces, any dimension;
— distance, norm, cosine distance functions;
— scalar product on tuples and arrays.
Applications:
— analysis of scientific datasets;
— analysis of vector embeddings;
Example:
SELECT L2Norm((1, 2, 3)), LpNorm((1, 2, 3), 10), L1Distance((1, 2, 3), (3, 2, 1))
Developer: Alexey Boykov.
I heard you don't like XML...
Then ClickHouse will support YAML as an alternative for you.
logger:
level: trace
log: /var/log/clickhouse-server/clickhouse-server.log
errorlog: /var/log/clickhouse-server/clickhouse-server.err.log
size: 1000M
count: 10
http_port: 8123 # Don't worry, YAML configuration
tcp_port: 9000 # will also contain comments.
mysql_port: 9004 #
postgresql_port: 9005 # It's not JSON after all.
max_connections: 4096
Developer — Denis Bolonin.
Was:
SELECT IF(number = 0, 0, 123 % number) FROM numbers(10)
— division by zero.
SELECT * FROM numbers(10) WHERE number > 0 AND 10 % number > 0
— division by zero.
— both branches of IF, AND, OR were always evaluated.
SELECT * FROM ( SELECT * FROM numbers(10) WHERE number > 0 ) WHERE 10 % number > 0
— division by zero.
Now:
— works just fine!
Available in 21.9.
Developer — Pavel Kruglov.
1. clickhouse-keeper — the protocol is 100% compatible with ZooKeeper.
— compressed logs and snapshots;
— no issues with zxid overflow;
— no issues with large packets;
— better memory usage;
— no issues with GC and Java heap;
2. Can run embedded into clickhouse-server.
— no need for a separate process;
Production ready in 21.12.
JSON_EXISTS, JSON_VALUE, JSON_QUERY
SELECT JSON_QUERY(
'{"array":[[0, 1, 2, 3, 4, 5],
[0, -1, -2, -3, -4, -5]]}',
'$.array[*][0 to 2, 4]');
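JSON_EXISTS and JSON_VALUE follow the same pattern (a minimal sketch):
SELECT JSON_EXISTS('{"a": {"b": 1}}', '$.a.b'),
       JSON_VALUE('{"a": {"b": 1}}', '$.a.b');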
Available in 21.8.
Developers — Ksenia Sumarokova, Konstantin Rudenskii, Denis Semenov.
— INTERSECT, EXCEPT, ANY, ALL, EXISTS operators.
— JOIN condition supports OR.
— CapnProto as an output format.
— OPTIONS HTTP request for CORS.
— Binary literals: SELECT 0b101010.
— Allow changing log levels without a restart.
— experimental feature!
Streaming consumption of log files.
It's like the Kafka and RabbitMQ engines,
but for local append-only logs, e.g. application logs.
Supports log rotation, tracks offsets...
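A sketch of what this could look like (assuming the FileLog table engine with path and format arguments; the path and table name are made up):
CREATE TABLE app_log (line String)
ENGINE = FileLog('/var/lib/clickhouse/user_files/app_logs/', 'LineAsString');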
Developers: ucasFL (flynn) and Ksenia Sumarokova.
JSON data type:
CREATE TABLE games (data JSON) ENGINE = MergeTree ORDER BY tuple();
You can insert arbitrary nested JSONs.
Types are automatically inferred on INSERT and merge.
Data is stored in columnar format: columns and subcolumns.
Query nested data naturally.
Example: NBA games dataset
CREATE TABLE games (data String)
ENGINE = MergeTree ORDER BY tuple();
SELECT JSONExtractString(data, 'teams', 1, 'name')
FROM games;
— 0.520 sec.
CREATE TABLE games (data JSON)
ENGINE = MergeTree ORDER BY tuple();
SELECT data.teams.name[1] FROM games;
— 0.015 sec.
DESCRIBE TABLE games
SETTINGS describe_extend_object_types = 1
name: data
type: Tuple( <-- inferred type
`_id.$oid` String,
`date.$date` String,
`teams.abbreviation` Array(String),
`teams.city` Array(String),
`teams.home` Array(UInt8),
`teams.name` Array(String),
`teams.players.ast` Array(Array(Int8)),
`teams.players.blk` Array(Array(Int8)),
`teams.players.drb` Array(Array(Int8)),
`teams.players.fg` Array(Array(Int8)),
`teams.players.fg3` Array(Array(Int8)),
`teams.players.fg3_pct` Array(Array(String)),
`teams.players.fg3a` Array(Array(Int8)),
`teams.players.fg_pct` Array(Array(String)),
`teams.players.fga` Array(Array(Int8)),
`teams.players.ft` Array(Array(Int8)),
`teams.players.ft_pct` Array(Array(String)),
`teams.players.fta` Array(Array(Int8)),
`teams.players.mp` Array(Array(String)),
`teams.players.orb` Array(Array(Int8)),
`teams.players.pf` Array(Array(Int8)),
`teams.players.player` Array(Array(String)),
`teams.players.plus_minus` Array(Array(String)),
`teams.players.pts` Array(Array(Int8)),
`teams.players.stl` Array(Array(Int8)),
`teams.players.tov` Array(Array(Int8)),
`teams.players.trb` Array(Array(Int8)),
`teams.results.ast` Array(Int8),
`teams.results.blk` Array(Int8),
`teams.results.drb` Array(Int8),
`teams.results.fg` Array(Int8),
`teams.results.fg3` Array(Int8),
`teams.results.fg3_pct` Array(String),
`teams.results.fg3a` Array(Int8),
`teams.results.fg_pct` Array(String),
`teams.results.fga` Array(Int16),
`teams.results.ft` Array(Int8),
`teams.results.ft_pct` Array(String),
`teams.results.fta` Array(Int8),
`teams.results.mp` Array(Int16),
`teams.results.orb` Array(Int8),
`teams.results.pf` Array(Int8),
`teams.results.plus_minus` Array(String),
`teams.results.pts` Array(Int16),
`teams.results.stl` Array(Int8),
`teams.results.tov` Array(Int8),
`teams.results.trb` Array(Int8),
`teams.score` Array(Int16),
`teams.won` Array(Int8))
Flexible schema.
You can have columns with strict and flexible schemas in one table.
Queries work as fast as with predefined types!
Planned for Q1 2022.
Developer: Anton Popov.
Before:
clickhouse-local --input-format Parquet \
--query 'SELECT * FROM table' \
--structure '
id UInt32, deleted UInt8,
type String, by String, time DateTime,
text String, dead UInt8, parent UInt32,
poll UInt32, kids Array(UInt32),
url String, score Int32, title String,
parts Array(UInt32), descendants Int32'
After:
clickhouse-local --input-format Parquet \
--query 'SELECT * FROM table'
1. No need to specify the schema if the data already contains it:
Native, Arrow, Parquet, ORC, Avro.
2. Allows inferring the schema from text formats:
TSV, CSV, JSONEachRow.
It works for:
— clickhouse-local;
— url;
— file;
— s3.
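For example, a sketch of the intended usage (the file name is hypothetical):
DESCRIBE file('hits.parquet');
SELECT * FROM file('hits.parquet') LIMIT 10;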
Planned for Q1 2022.
Developer: Pavel Kruglov.
ClickHouse is not a transactional DBMS, is it?
But we need transactions to:
— do atomic INSERTs to multiple partitions;
— do atomic INSERTs to multiple tables and materialized views;
— make multiple SELECT queries from one snapshot.
Planned for Q2 2022.
... And Incremental Aggregation In Memory
Run a GROUP BY query continuously.
Aggregated data is accumulated in memory.
It can be queried as a table.
The server can serve key-value requests with the Redis protocol.
Example applications:
— realtime antifraud;
— user profiles and personalization.
Planned for summer 2022.
ClickHouse over S3:
— highly concurrent asynchronous reads (21.11);
— parallel processing on multiple compute nodes (22.1);
— caching in RAM or local disk with node affinity (Q1 2022);
— easy coordination of inserts and merges
without ReplicatedMergeTree (Q1 2022);
Lower cost and higher throughput than EBS volumes.
Bonus: "web" disk for static hosted datasets (21.10).