The Most Fresh ClickHouse Features

The Most Fresh
ClickHouse Features

In previous episodes:

https://clickhouse.yandex/presentations/meetup36/new_features/

CONSTRAINTs Prepared Statements Format ORC Format Template ORDER BY optimization WITH FILL text_log metric_log Query Profiler Globs for File/URL/HDFS RPM packages neighbor os_schedule_priority ALTER MODIFY SETTING SELECT FROM VALUES COLUMNS matcher LIVE VIEW Settings constraints Row Level Security Secondary Indices TTL expressions MySQL protocol Adaptive index granularity Advanced text search JSON functions Cascade MATERIALIZED VIEWs WITH ROLLUP/CUBE ASOF JOIN

I won't tell you...

Tiered Storage
— Robert Hodges already presented it.

Managing Dictionaries With SQL Queries

External Dictionaries (was)

campaigns_dictionary.xml:

<yandex> <dictionary> <name>campaigns</name> <source> <mysql> <password>...</password> <port>3306</port> <user>metrika</user> <replica> <host>mtexample.metr.yandex.net</host> <priority>1</priority> </replica> <db>dicts</db> <table>campaign_names</table> </mysql> </source> <lifetime> <min>14400</min> <max>18000</max> </lifetime> <layout> <cache> <size_in_cells>1048576</size_in_cells> </cache> </layout> <structure> <id> <name>OrderID</name> </id> <attribute> <name>cid</name> <type>UInt32</type> <null_value>0</null_value> </attribute> <attribute> <name>name</name> <type>String</type> <null_value /> </attribute> </structure> </dictionary> </yandex>

Managing Dictionaries With SQL Queries

CREATE DICTIONARY db.dict ( region_id Int64 DEFAULT -1, name String DEFAULT 'unknown', population UInt64 ) PRIMARY KEY region_id SOURCE(MYSQL( USER 'root' PASSWORD '...' DB 'dictionaries' TABLE 'regions' REPLICA(PRIORITY 1 HOST 'server1' PORT 3306) REPLICA(PRIORITY 2 HOST 'server2' PORT 3306))) LAYOUT(HASHED()) LIFETIME(MIN 300 MAX 600);

Managing Dictionaries With SQL Queries

SELECT dictGetString('db.dict', 'name', toInt64(213));

Created dictionaries are also available as tables to read.

SELECT * FROM db.dict;

— available since 19.17.

Developers: — Alexander Sapin, Maxim Sabyanin

Redis as Dictionary Source

— available since 19.16.

Developers — Alexey Basmanov, Anton Popov

Layout "sparse_hashed"

— available since 19.15.

Developer — Azat Khuzhin

Parallel Parsing Of Data Formats

clickhouse-local \ -S 'WatchID UInt64, JavaEnable UInt8, Title String, ...' \ --query "SELECT count() FROM table" < hits.tsv

Was: 46.2 sec., 192 096 rows/sec., 168 MB/sec.
Now: 4.5 sec., 1 935 419 rows/sec., 1.7 GB/sec.

Works for TSV, CSV, JSONEachRow.

Works in clickhouse-client, clickhouse-local,
HTTP interface; File, URL, HDFS, S3 tables.

Enabled with input_format_parallel_parsing setting.
And it's already enabled by default!

— available since 20.1.

Developers — Nikita Mikhailov, Oleg Ershov

Import/Export Compressed Files Directly

For tables with File, URL, HDFS, S3 engines:

SELECT * FROM url('https://site.ru/data.gz', TSV, 'phrase String, counter_id UInt32, ...') INSERT INTO TABLE FUNCTION url( 'https://site.ru/data.gz', TSV, 'phrase String, counter_id UInt32, ...')


Autodetect by file extension or explicit: gz/gzip, br/brotli, none, auto:

SELECT * FROM url('https://site.ru/data', TSV, 'phrase String, counter_id UInt32, ...', 'gzip')

Developer — Andrey Bodrov

Demo

Processing ~TB of data from GHArchive.

Optimization of VALUES With Expressions

CREATE TABLE test.null (x UInt64, d DateTime) ENGINE = Null clickhouse-client --query "INSERT INTO test.null VALUES" <...

Inserting 1 mln. records like (1, '2020-02-04 00:00:00'):
0.108 sec.

Inserting 1 mln. records like (1, now()):
40.4 sec. — 400 times slower.

Inserting 1 mln. records like (1, now()) with enabled setting
input_format_values_deduce_templates_of_expressions:
0.112 sec. — almost without performance regression.

— available since 19.16.

Developer — Alexander Tokmakov

FORMAT JSONCompactEachRow

[11825667, "закрепление журные людьми и"] [7588965, "выпискусстановке фото 3д фирма"] [30622428, "губка перерьер ванновгород главный"] [15009727, "uhfvvf yfbck.n gtycb. ltym hj;ltyb 3"] [31235604, "порно друзья дачные коллетение спит 66"] [1036200, "oomph"] [22446879, "ионистранны в туре 17 март 2014 смотреть"] [792887, "пп оптики билей ремиум новый фасонвар"] [315961, "рангику"] [1151353, "дом-2 последние смотреть онлайн фигур"]

— available since 20.1.

Developer — Mikhail Korotov

FORMAT CustomSeparated

Controlled by the settings:

format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before_delimiter
format_custom_row_after_delimiter
format_custom_row_between_delimiter
format_custom_result_before_delimiter
format_custom_result_after_delimiter

Template Format

Allow to define a template for data formatting/parsing.

A template contains substitutions and delimieters.

Each substitution specifies data escaping rule:
Quoted, Escaped, CSV, JSON, XML, Raw.

Website ${domain:Quoted} has ${count:Raw} pageviews.

You can specify a template for rows, a delimiter between rows
and a template to wrap resultset.

Example: to parse web access logs.
Example: to parse deeply nested JSON.
Example: generate HTML right in ClickHouse.

Improvements of Text Formats

input_format_csv_unquoted_null_literal_as_null
— parse unquoted NULL in CSV as NULL (otherwise you have to \N).

input_format_null_as_default
— parse NULL as default value for non-Nullable columns.

input_format_tsv_empty_as_default
— parse empty values in TSV as default values.

input_format_with_names_use_header
— use headers in TSV and CSV to determine column order;
— fill missing columns with default values.

input_format_defaults_for_omitted_fields
— calculate complex default values for JSONEachRow, TSV, CSV.

FORMAT Avro

Apache Avro — row-oriented schemaful binary format with embedded schema.

When data is exported in Avro format, the schema is derived from the table definition.

When data is imported in Avro format, we read it according to the schema and convert types.

Developer — Andrew Onyshchuk

— available since 20.2.

FORMAT AvroConfluent

Schema is not embedded directly but loaded from schema registry.

The format is designed specifically to use for Kafka messages.

Do you remember that we also support Parquet, ORC, Protobuf?

s3 Import/Export

Table function s3 and table engine S3:

SELECT * FROM s3(...) INSERT INTO TABLE FUNCTION s3(...) CREATE TABLE table ENGINE = S3(...)

Supports wildcards (globs) to read multiple files.

Supports _file and _path virtual columns.

Supports authentication, compressed files...

— full support in 20.2.

Developer — Vladimir Chebotarev, Olga Khvostikova and Pavel Kovalenko

JOINs In External Memory

Allows to execute JOIN even when right hand doesn't fit in memory.

SET partial_merge_join = 1;

Developer — Artem Zuikov

Geographical Data Processing

greatCircleDistance
— improved performance 3 times
with the code from Andrey Aksenov;

greatCircleAngle
— the distance on sphere in degrees;
— useful for astronomical data;

geoDistance
— calculates the distance on WGS-84 ellipsoid
(under certain assumptions);

— available since 20.1.

Integration With H3 Library

DateTime64 Data Type

DateTime with arbitrary subsecond precision.

It's like DECIMAL for DateTime.

Developer — Vasily Nemkov

Open-Source Hardware Benchmark

To compare different hardware and cloud providers.

(demo)

More Various Functions...

arraySplit multiMatchAllIndices multiFuzzyMatchAllIndices arrayCompact isValidJSON arrayFill arrayReverseFill JSONExtractArrayRaw FQDN
char repeat

More Various Features...

Handler for Prometheus.

Aggregate function combiners: -OrNull and -OrDefault.

Bloom filter secondary index for arrays.

Virtual column _shard_num.

Skipping unknown settings in distributed queries.

Allow using FINAL and SAMPLE simultaneously.

Optimization of uniqCombined; uniqCombined64.

Instant count() in trivial queries.

What's Next?

February 2020

RBAC

Polymorphic Data Parts

Lock-free ALTERs

Custom HTTP handlers

Query cache for IN subqueries

Spring 2020

VFS for storage/compute decoupling

Pluggable authentication and access control

Adaptive merge JOIN algorithm

Workload management

Public Roadmap 2020!

https://clickhouse.yandex/docs/ru/extended_roadmap/

~ 500 tasks with detailed description — status, dependencies, assignees...

~ 50 are already done :)

.

.

Web site: https://clickhouse.yandex/

Maillist: [email protected]

YouTube: https://www.youtube.com/c/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_ru, clickhouse_en

GitHub: https://github.com/ClickHouse/ClickHouse/

Twitter: https://twitter.com/ClickHouseDB

Google groups: https://groups.google.com/forum/#!forum/clickhouse