ClickHouse: Release 21.11 Webinar

ClickHouse
Release 21.11
Webinar

Release 21.11 Webinar

1. (30 min) What's new in ClickHouse 21.11.

2. (30 min) Q&A.

What's New in 21.11

Monthly stable release for November.

Interactive mode for clickhouse-local

clickhouse-local is like clickhouse-client and clickhouse-server together
but without the server.

Demo.

Asynchronous INSERTs

Allows to do many small INSERTS.

From many concurrent connections.

Without Kafka and Buffer tables!

Many small INSERTs are combined together
and batched in memory.

Safe by default: every client get response when data is inserted.

Developer: Anton Popov.

Asynchronous INSERTs

Demo.

Scriptable UDF

In 21.10 we have introduced SQL UDFs
and scriptable tables.

Now we have scriptable UDFs.

<function> <type>executable</type> <name>myPlus</name> <return_type>UInt64</return_type> <argument><type>UInt64</type></argument> <argument><type>UInt64</type></argument> <format>TabSeparated</format> <command>awk '{ print $1 + $2 }'</command> <execute_direct>0</execute_direct> </function>

Developer: Maksim Kita.

Scriptable UDF

Two variants:

executable — a worker process is run for every block of data;
executable_pool — a pool of long running worker processes.

Multiple processes are run concurrently — as the number of query threads.

Data is serialized/deserialized in specified format via stdin/stdout.

Scriptable UDF, Example

#!/usr/bin/python3 import sys import nltk from nltk.sentiment import SentimentIntensityAnalyzer if __name__ == '__main__': sentiment_analyzer = SentimentIntensityAnalyzer() # Read chunk length for number in sys.stdin: length = int(number) # Read lines from chunk for _ in range(0, length): line = sys.stdin.readline() score = sentiment_analyzer.polarity_scores(line) print(str(score['compound']) + '\n', end='') # Flush results to stdout sys.stdout.flush()

Executable UDF Example

<function> <type>executable_pool</type> <name>sentenceScore</name> <return_type>Float64</return_type> <argument> <type>String</type> </argument> <format>TabSeparated</format> <command>user_scripts/sentence_analyzer</command> <send_chunk_header>1</send_chunk_header> </function>

Executable UDF Example

function: type: executable_pool name: sentenceScore return_type: Float64 arguments: - argument: type: UInt64 format: TSV command: user_scripts/sentence_analyzer

Executable UDF Example

SELECT sentenceScore('ClickHouse is fast') as score ┌─score─┐ │ 0 │ └───────┘
SELECT avg(sentenceScore(sentence)) AS avg_sentence_score FROM executable('sentence_generator.py 10000', 'TabSeparated', 'sentence String') ┌────────────avg_score─┐ │ 0.030663238759543694 │ └──────────────────────┘

User Defined Functions

Upcoming:

— User Defined Types (21.12)

Predefined Connections

For external data sources.

Was:

SELECT * FROM mysql('myserver:3306',
                    'Northwind', 'Orders',
                    'user', 'password')

Now:

SELECT * FROM mysql(mydb, table = 'Orders')

Predefined Connections

<named_collections> <postgres1> <user>postgres</user> <password>mysecretpassword</password> <host>postgres1</host> <port>5432</port> <database>postgres</database> <table>test_table</table> </postgres1> <postgres2> <user>postgres</user> <password>mysecretpassword</password> <host>postgres1</host> </postgres2> </named_collections>

Predefined Connections

— Allow to predefine connection parameters in config file.

— Any parameters (address, database/table name, user/password, etc).

— Parameters can be overridden in the query.

— Work for table engines, table functions and dictionaries.

— Work for mysql, postgresql, mongo, s3, remote, hdfs...

— Configuration is updated on the fly.

Developer: Ksenia Sumarokova.

INFORMATION_SCHEMA

SQL standard way for database introspection

— for compatibility with tools and protocols.

Demo

Developer: Alexander Tokmakov.

Text Processing

1. Tokenization, stemming, lemmatization, synonims (21.9..21.11).
— SELECT arrayMap(word -> lemmatize(word), tokens(text)).

New function: tokens.

SELECT tokens('I 💖 ClickHouse') ['I','💖','ClickHouse']

Developers: Nikolai Degterinskiy, Maksim Kita.

Text Processing

2. Unicode normalization:
normalizeUTF8NFC, normalizeUTF8NFD,
normalizeUTF8NFKC, normalizeUTF8NFKD.

Demo

Developer: Slava Boben.

Multidimensional Distances and Norms

— L1, L2, Lp, L spaces, any dimension;

— distance, norm, cosine distance functions;

— scalar product on tuples and arrays.

Applications:

— analysis of scientific datasets;

— analysis of vector embeddings;

Example:
SELECT L2Norm((1, 2, 3)), LpNorm((1, 2, 3), 10), L1Distance((1, 2, 3), (3, 2, 1))

Developer: Alexey Boykov.

Small and Neat Things

— EXISTS (subquery).

— JOIN condition supports OR.

— CapNProto as output format.

— OPTIONS HTTP request for CORS.

— Binary literals: SELECT 0b101010.

— Allow user to change log levels without restart.

FileLog Table Engine

— experimental feature!

Streming consumption of log files.

It's like Kafka and RabbitMQ engine
but for local append-only logs, e.g. application logs.

Supports log rotation, tracks offsets...

Developers: ucasFL (flynn) and Ksenia Sumarokova.

Performance And Reliability

Preemptive scheduler for background merges.

— lower the chance to get "Too many parts" on insertion.

Developer: Nikita Mikhaylov.

Make reading from MergeTree over s3 to work properly.

— it's x100 performance improvement,
like 40 sec -> 394 ms for point query with 100 columns.

Developers: Alexander Sapin, Ksenia Sumarokova, Alexey Milovidov.

What Else?

Learning portal: https://clickhouse.com/learn/

— step-by-step tutorials and examples!

Full changelog:

https://github.com/ClickHouse/ClickHouse/blob/master/CHANGELOG.md

ClickHouse Is Hiring!

Calling all visionaries and builders!
ClickHouse is growing and expects to double in size. Get in early:

Apply here: clickhouse.com/careers or [email protected]

Global, fully remote roles available:
Cloud Software Engineer (Go, Kubernetes)
— Support Engineer
— Core Software Engineer (C++)

Send feedback: [email protected]
Sign up to our newsletter.

Q&A