ClickHouse: New And Upcoming Features


User Defined Functions

SQL User Defined Functions (since 21.10)

CREATE FUNCTION ch AS (x) -> (exp(x) + exp(-x)) / 2;

CREATE FUNCTION getRegionName AS (x) -> dictGet('region', 'name', x);

— the functions are polymorphic;
— they are substituted as expressions;
— queries using them benefit from common subexpression elimination (CSE) and JIT compilation;
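
Since the functions are substituted as expressions, one definition works for any argument type the body supports (a minimal illustration):

SELECT ch(1), ch(1.5), ch(toFloat32(2));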

Limitations:
— no loops;
— no recursion;

User Defined Functions

Introspection:

SELECT name, create_query FROM system.functions
WHERE origin = 'SQLUserDefined'

┌─name─┬─create_query────────────────────────────────────────┐
│ ch   │ CREATE FUNCTION ch AS x -> ((exp(x) + exp(-x)) / 2) │
└──────┴──────────────────────────────────────────────────────┘

Persistence:
— functions are stored in /var/lib/clickhouse/user_defined/.

Distributed queries:
— the functions must be available on the initiator node.

User Defined Functions

Executable User Defined Functions (since 21.11)

<function>
    <type>executable</type>
    <name>myPlus</name>
    <return_type>UInt64</return_type>
    <argument><type>UInt64</type></argument>
    <argument><type>UInt64</type></argument>
    <format>TabSeparated</format>
    <command>awk '{ print $1 + $2 }'</command>
    <execute_direct>0</execute_direct>
</function>
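
With this configuration, myPlus can be called like a built-in function; ClickHouse pipes both arguments to the awk command in TabSeparated format and reads the result back (a minimal sketch):

SELECT myPlus(1, 2) AS sum  -- returns 3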

Developer: Maksim Kita.

Executable UDF

Two variants:

executable — a worker process is run for every block of data;
executable_pool — a pool of long running worker processes.

Multiple processes run concurrently, as many as there are query threads.

Data is serialized and deserialized in the specified format via stdin/stdout.

Scriptable UDF Example

#!/usr/bin/python3

import sys
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

if __name__ == '__main__':
    sentiment_analyzer = SentimentIntensityAnalyzer()
    # Read chunk length
    for number in sys.stdin:
        length = int(number)
        # Read lines from chunk
        for _ in range(0, length):
            line = sys.stdin.readline()
            score = sentiment_analyzer.polarity_scores(line)
            print(str(score['compound']) + '\n', end='')
        # Flush results to stdout
        sys.stdout.flush()

Executable UDF Example

<function>
    <type>executable_pool</type>
    <name>sentenceScore</name>
    <return_type>Float64</return_type>
    <argument><type>String</type></argument>
    <format>TabSeparated</format>
    <command>sentence_analyzer</command>
    <send_chunk_header>1</send_chunk_header>
</function>

Executable UDF Example

SELECT sentenceScore('ClickHouse is fast') AS score

┌─score─┐
│     0 │
└───────┘

SELECT avg(sentenceScore(sentence)) AS avg_sentence_score
FROM executable('sentence_generator.py 10000', 'TabSeparated', 'sentence String')

┌───avg_sentence_score─┐
│ 0.030663238759543694 │
└──────────────────────┘

Scriptable Tables

Data generation and import:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV);

— the script should return data to stdout, similar to the "file" engine.

Data transformation:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV, (SELECT ...));

— the script receives the source stream on stdin and returns data to stdout.

Custom JOIN logic:

CREATE TABLE test ... ENGINE = Executable('my_script', TSV, (SELECT ...), (SELECT ...), ...);

— the script receives the source stream on stdin and additional streams on file descriptors 3, 4...
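
For instance, a transformation table wrapping a hypothetical script (the script and table names are illustrative):

CREATE TABLE filtered (x UInt64)
ENGINE = Executable('my_filter.sh', TSV, (SELECT number FROM numbers(10)));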

Scriptable Tables

Two variants:

Executable: script is run on every query;

ExecutablePool: a pool of persistent processes;

Table function:

SELECT * FROM executable('my_script', TSV, 'x UInt64...');

SELECT * FROM executable('my_script', TSV, 'x UInt64...', (SELECT ...) ...);

Scriptable Tables

Applications:

— loading data from a foreign data source;

— applying machine-learned models to a dataset;

— custom processing, e.g. DNS resolution, whois...

Interactive mode for clickhouse-local

clickhouse-local is like clickhouse-client and clickhouse-server together
but without the server.

Demo.
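
A minimal sketch of the idea (assuming a recent build; starting the binary without --query opens an interactive session):

$ clickhouse-local
:) SELECT version();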

Asynchronous INSERTs

Allows many small INSERTs.

From many concurrent connections.

Without Kafka and Buffer tables!

Many small INSERTs are combined together
and batched in memory.

Safe by default: every client gets a response when its data is inserted.
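
A minimal sketch of how a client opts in (async_insert enables server-side batching; wait_for_async_insert keeps the safe acknowledgement behaviour; the table name is illustrative):

SET async_insert = 1;
SET wait_for_async_insert = 1;  -- respond only after the data is actually inserted
INSERT INTO metrics VALUES (now(), 42);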

Developer: Anton Popov.

Asynchronous INSERTs

Demo.

Predefined Connections

For external data sources.

Was:

SELECT * FROM mysql('myserver:3306',
                    'Northwind', 'Orders',
                    'user', 'password')

Now:

SELECT * FROM mysql(mydb, table = 'Orders')

Predefined Connections

<named_collections>
    <postgres1>
        <user>postgres</user>
        <password>mysecretpassword</password>
        <host>postgres1</host>
        <port>5432</port>
        <database>postgres</database>
        <table>test_table</table>
    </postgres1>
    <postgres2>
        <user>postgres</user>
        <password>mysecretpassword</password>
        <host>postgres1</host>
    </postgres2>
</named_collections>

Predefined Connections

— Connection parameters can be predefined in the config file.

— Any parameters (address, database/table name, user/password, etc).

— Parameters can be overridden in the query.

— Works for table engines, table functions and dictionaries.

— Works for mysql, postgresql, mongo, s3, remote, hdfs...

— Configuration is updated on the fly.
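
For example, with the postgres2 collection above, which omits the database and table, the missing parameters can be supplied (or existing ones overridden) in the query itself (a sketch):

SELECT * FROM postgresql(postgres2, database = 'postgres', table = 'test_table');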

Developer: Ksenia Sumarokova.

Data Encryption At Rest

— Full Disk Encryption:

<disk_local_encrypted>
    <type>encrypted</type>
    <disk>disk_local</disk>
    <path>encrypted/</path>
    <key from_env="DISK_ENCRYPTION_KEY" />
</disk_local_encrypted>

— Per-Column Encryption:

secret String CODEC(LZ4, AES_128_GCM_SIV),
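
In context, a hedged DDL sketch (the table is illustrative; the codec also requires an encryption key in the server configuration):

CREATE TABLE users
(
    id UInt64,
    secret String CODEC(LZ4, AES_128_GCM_SIV)
)
ENGINE = MergeTree
ORDER BY id;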

Developer: Vitaly Baranov, Arthur Filatenkov, depressed-pho.

INFORMATION_SCHEMA

SQL standard way for database introspection

— for compatibility with tools and protocols.
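
For example (a minimal sketch):

SELECT table_schema, table_name
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'default';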

Demo

Developer: Alexander Tokmakov.

Text Processing

Unicode normalization:
normalizeUTF8NFC, normalizeUTF8NFD,
normalizeUTF8NFKC, normalizeUTF8NFKD.
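
A quick illustration (a sketch): NFC composes 'ö' into a single code point, while NFD decomposes it into 'o' plus a combining mark, so the byte lengths differ:

SELECT length(normalizeUTF8NFC('ö')) AS nfc_bytes,  -- 2 bytes: U+00F6
       length(normalizeUTF8NFD('ö')) AS nfd_bytes   -- 3 bytes: 'o' + U+0308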

Demo

Developer: Slava Boben.

Multidimensional Distances and Norms

— L1, L2, Lp, L∞ spaces, any dimension;

— distance, norm, cosine distance functions;

— scalar product on tuples and arrays.

Applications:

— analysis of scientific datasets;

— analysis of vector embeddings;

Example:
SELECT L2Norm((1, 2, 3)), LpNorm((1, 2, 3), 10), L1Distance((1, 2, 3), (3, 2, 1))
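
And a hedged sketch of the cosine distance mentioned above:

SELECT cosineDistance((1, 2, 3), (3, 2, 1))  -- 1 - 10/14 ≈ 0.2857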

Developer: Alexey Boykov.

YAML configuration

I heard you don't like XML...

Then ClickHouse will support YAML as an alternative for you.

logger:
    level: trace
    log: /var/log/clickhouse-server/clickhouse-server.log
    errorlog: /var/log/clickhouse-server/clickhouse-server.err.log
    size: 1000M
    count: 10

http_port: 8123        # Don't worry, YAML configuration
tcp_port: 9000         # will also contain comments.
mysql_port: 9004       #
postgresql_port: 9005  # It's not JSON after all.

max_connections: 4096

Developer — Denis Bolonin.

Short-circuit evaluation

Was:

SELECT IF(number = 0, 0, 123 % number) FROM numbers(10)

— division by zero.

SELECT * FROM numbers(10) WHERE number > 0 AND 10 % number > 0

— division by zero.

— both branches of IF, AND, OR are calculated.

SELECT * FROM
(
    SELECT * FROM numbers(10)
    WHERE number > 0
)
WHERE 10 % number > 0

— division by zero.

Short-circuit evaluation

Now:

— works just fine!
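
For example, the first query above now evaluates the division only where it is needed (a sketch):

SELECT IF(number = 0, 0, 123 % number) FROM numbers(3)  -- returns 0, 0, 1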

Available in 21.9.

Developer — Pavel Kruglov.

ZooKeeper Goes Away

1. clickhouse-keeper — 100% compatible with the ZooKeeper protocol.

— compressed logs and snapshots;
— no issues with zxid overflow;
— no issues with large packets;
— better memory usage;
— no issues with GC and Java heap;

2. Can run embedded into clickhouse-server.

— no need for a separate process;

Production ready in 21.12.

Support For SQL/JSON

JSON_EXISTS, JSON_VALUE, JSON_QUERY

SELECT JSON_QUERY(
    '{"array":[[0, 1, 2, 3, 4, 5], [0, -1, -2, -3, -4, -5]]}',
    '$.array[*][0 to 2, 4]');
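
Hedged sketches of the other two functions (same json-then-path argument order as above):

SELECT JSON_EXISTS('{"hello": 1}', '$.hello');       -- 1
SELECT JSON_VALUE('{"hello": "world"}', '$.hello');  -- world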

Available in 21.8.
Developers — Ksenia Sumarokova, Konstantin Rudenskii, Denis Semenov.

Small and Neat Things

— INTERSECT, EXCEPT, ANY, ALL, EXISTS operators.

— JOIN condition supports OR.

— CapnProto as an output format.

— OPTIONS HTTP request for CORS.

— Binary literals: SELECT 0b101010.

— Allow users to change log levels without a restart.

FileLog Table Engine

— experimental feature!

Streaming consumption of log files.

It's like the Kafka and RabbitMQ engines,
but for local append-only logs, e.g. application logs.

Supports log rotation, tracks offsets...
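
A hedged DDL sketch (the path, format and columns are illustrative; verify against the docs since the engine is experimental):

CREATE TABLE app_log
(
    timestamp DateTime,
    message String
)
ENGINE = FileLog('/var/log/my_app/app.log', 'JSONEachRow');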

Developers: ucasFL (flynn) and Ksenia Sumarokova.

Upcoming in 2022

Support For Semistructured Data

JSON data type:

CREATE TABLE games (data JSON) ENGINE = MergeTree;

You can insert arbitrary nested JSONs.

Types are automatically inferred on INSERT and merge.

Data is stored in columnar format: columns and subcolumns.

Query nested data naturally.

Support For Semistructured Data

Example: NBA games dataset

CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple();

SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;

— 0.520 sec.

CREATE TABLE games (data JSON) ENGINE = MergeTree;

SELECT data.teams.name[1] FROM games;

— 0.015 sec.

Support For Semistructured Data

DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1

name: data
type: Tuple(  <-- inferred type
    `_id.$oid` String,
    `date.$date` String,
    `teams.abbreviation` Array(String),
    `teams.city` Array(String),
    `teams.home` Array(UInt8),
    `teams.name` Array(String),
    `teams.players.ast` Array(Array(Int8)),
    `teams.players.blk` Array(Array(Int8)),
    `teams.players.drb` Array(Array(Int8)),
    `teams.players.fg` Array(Array(Int8)),
    `teams.players.fg3` Array(Array(Int8)),
    `teams.players.fg3_pct` Array(Array(String)),
    `teams.players.fg3a` Array(Array(Int8)),
    `teams.players.fg_pct` Array(Array(String)),
    `teams.players.fga` Array(Array(Int8)),
    `teams.players.ft` Array(Array(Int8)),
    `teams.players.ft_pct` Array(Array(String)),
    `teams.players.fta` Array(Array(Int8)),
    `teams.players.mp` Array(Array(String)),
    `teams.players.orb` Array(Array(Int8)),
    `teams.players.pf` Array(Array(Int8)),
    `teams.players.player` Array(Array(String)),
    `teams.players.plus_minus` Array(Array(String)),
    `teams.players.pts` Array(Array(Int8)),
    `teams.players.stl` Array(Array(Int8)),
    `teams.players.tov` Array(Array(Int8)),
    `teams.players.trb` Array(Array(Int8)),
    `teams.results.ast` Array(Int8),
    `teams.results.blk` Array(Int8),
    `teams.results.drb` Array(Int8),
    `teams.results.fg` Array(Int8),
    `teams.results.fg3` Array(Int8),
    `teams.results.fg3_pct` Array(String),
    `teams.results.fg3a` Array(Int8),
    `teams.results.fg_pct` Array(String),
    `teams.results.fga` Array(Int16),
    `teams.results.ft` Array(Int8),
    `teams.results.ft_pct` Array(String),
    `teams.results.fta` Array(Int8),
    `teams.results.mp` Array(Int16),
    `teams.results.orb` Array(Int8),
    `teams.results.pf` Array(Int8),
    `teams.results.plus_minus` Array(String),
    `teams.results.pts` Array(Int16),
    `teams.results.stl` Array(Int8),
    `teams.results.tov` Array(Int8),
    `teams.results.trb` Array(Int8),
    `teams.score` Array(Int16),
    `teams.won` Array(Int8))

Support For Semistructured Data

Flexible schema.

You can have columns with strict and flexible schema in one table.

Queries work as fast as with predefined types!

Planned for Q1 2022.
Developer: Anton Popov.

Schema Inference From Formats

Before:

clickhouse-local --input-format Parquet \
    --query 'SELECT * FROM table' \
    --structure '
        id UInt32, deleted UInt8, type String, by String,
        time DateTime, text String, dead UInt8, parent UInt32,
        poll UInt32, kids Array(UInt32), url String, score Int32,
        title String, parts Array(UInt32), descendants Int32'

After:

clickhouse-local --input-format Parquet \
    --query 'SELECT * FROM table'

Schema Inference From Formats

1. No need to specify schema if data already contains it:
Native, Arrow, Parquet, ORC, Avro.

2. Schema can be inferred from text formats:
TSV, CSV, JSONEachRow.

It works for:

— clickhouse-local;
— url;
— file;
— s3.
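
For example, the structure argument becomes optional (a sketch; 'hits.parquet' is a hypothetical file):

SELECT * FROM file('hits.parquet', Parquet);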

Planned for Q1 2022.
Developer: Pavel Kruglov.

Support For Transactions

ClickHouse is not a transactional DBMS, is it?

But we need transactions to:

— do atomic INSERTs to multiple partitions;

— do atomic INSERTs to multiple tables and materialized views;

— make multiple SELECT queries from one snapshot.

Planned for Q2 2022.

Key-Value Data Marts

... And Incremental Aggregation In Memory

Run GROUP BY query continuously.

Aggregated data is accumulated in memory.

It can be queried as a table.

The server can serve key-value requests via the Redis protocol.

Example applications:
— realtime antifraud;
— user profiles and personalization.

Planned for summer 2022.

Decoupled Storage

ClickHouse over S3:

— highly concurrent asynchronous reads (21.11);

— parallel processing on multiple compute nodes (22.1);

— caching in RAM or local disk with node affinity (Q1 2022);

— easy coordination of inserts and merges
  without ReplicatedMergeTree (Q1 2022);

Lower cost and higher throughput than EBS volumes.

Bonus: "web" disk for static hosted datasets (21.10).

Roadmap 2022

https://github.com/ClickHouse/ClickHouse/issues/32513
