ClickHouse meetup in Ekaterinburg

ClickHouse: Present and Future

What is ClickHouse?

ClickHouse - distributed analytical column-oriented DBMS

Why column-oriented?

How row-oriented systems work:

Why column-oriented?

How column-oriented systems work:
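
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration, not how ClickHouse stores data: the table, the column names (user_id, url, duration) and the values are all made up for the example.

```python
# Illustrative sketch only: the table, column names and values are hypothetical.
rows = [
    {"user_id": 1, "url": "/a", "duration": 12},
    {"user_id": 2, "url": "/b", "duration": 30},
    {"user_id": 1, "url": "/c", "duration": 7},
]

# Row-oriented layout: all values of one row are stored together,
# so a query over a single column still walks through every whole row.
def total_duration_row_store(table):
    return sum(row["duration"] for row in table)

# Column-oriented layout: all values of one column are stored together,
# so a query touches only the columns it actually mentions.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def total_duration_column_store(cols):
    return sum(cols["duration"])
```

Both functions return the same total; the point is that the column layout never has to look at user_id or url to answer the query.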

Why ClickHouse?

No existing solution was suitable.

So we made ClickHouse.

«Evolution of Data Structures in Yandex.Metrica»

https://habrahabr.ru/company/yandex/blog/273305/

Metrica 2.0

In Brief

column-oriented
linear scalability
fault tolerance
real-time data loading
online (sub-second) queries
SQL dialect support + extensions
(arrays, nested data structures, domain-specific functions, sampling)

Main Metrica Cluster

>22 trillion rows
472 servers
data processing speed up to two terabytes per second

* If you want to try ClickHouse, one server is enough.

ClickHouse at Yandex

We managed to make the system relatively user-friendly.

From the very beginning we had detailed documentation.

Within a couple of years, ClickHouse spread to other Yandex departments.

Mail, Market, Direct, Webmaster, AdFox, Infrastructure, Business Analytics...

In some cases, analysts installed ClickHouse on virtual machines on their own and used it successfully without asking us a single question.

Open-source

Then we decided — ClickHouse is too good a system for us to keep it to ourselves.

To make things more fun, we wanted people outside the company to get hooked on ClickHouse and enjoy it too. So we decided to go open-source.

Open-source

Apache 2.0 license — minimum restrictions.

Goal: the widest possible adoption of the product.

We want Yandex's product to be used worldwide.

See "Yandex Opens ClickHouse"

https://habrahabr.ru/company/yandex/blog/303282/

When to Use ClickHouse

Well-structured, cleaned, immutable events.

Click stream. Web analytics. Ad networks. RTB. E-commerce.

Online game analytics. Sensor and monitoring data. Telecom data.

Financial transactions. Stock market analytics.

When NOT to Use ClickHouse

OLTP
ClickHouse has no UPDATE and no full-fledged transactions.

Key-Value
If you need frequent update queries by key, use another solution.

Blob-store, document oriented
ClickHouse is designed for large volumes of fine-grained data.

Over-normalized data
Better to make a wide fact table.

Why is ClickHouse so Fast?

— out of desperation.

Yandex.Metrica must work.

Why is ClickHouse so Fast?

Algorithmic optimization.

MergeTree, data locality on disk
— fast range queries.
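
Why sorted, locally stored data makes range queries fast can be sketched as follows. This is a simplified model under stated assumptions, not the MergeTree implementation: data is kept sorted by key, a small index remembers the key of every GRANULE-th row, and a range query uses binary search over that index to read only the granules that may contain matching rows. The names GRANULE, build_sparse_index and range_scan are hypothetical.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # hypothetical granule size: one index entry per 4 rows

def build_sparse_index(sorted_keys, granule=GRANULE):
    """Remember the key of the first row of every granule."""
    return sorted_keys[::granule]

def range_scan(sorted_keys, index, lo, hi, granule=GRANULE):
    """Return keys in [lo, hi] and the number of rows that had to be read."""
    # Last granule whose first key is < lo: it may still contain keys >= lo.
    first = max(bisect_left(index, lo) - 1, 0)
    # First granule whose first key is > hi: nothing from here on can match.
    last = bisect_right(index, hi)
    start, stop = first * granule, min(last * granule, len(sorted_keys))
    scanned = sorted_keys[start:stop]
    return [k for k in scanned if lo <= k <= hi], len(scanned)
```

For 100 sorted keys, a query for the range 10..13 reads 8 rows (two granules) instead of all 100.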

Example: the uniqCombined function consists of a combination of three different data structures suitable for different cardinality ranges.
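
The idea of switching data structures as cardinality grows can be sketched like this. It is a toy model, not the uniqCombined implementation: the class name HybridUniq, the thresholds and the bitmap size are assumptions, and the third tier uses a simple linear-counting bitmap as a stand-in rather than whatever structure ClickHouse actually uses for large cardinalities.

```python
import hashlib
import math

class HybridUniq:
    """Exact for small inputs, approximate for large ones (illustrative only)."""

    def __init__(self, small_limit=16, exact_limit=1024, bitmap_bits=1 << 16):
        self.small_limit = small_limit   # hypothetical thresholds
        self.exact_limit = exact_limit
        self.bitmap_bits = bitmap_bits
        self.small = []                  # tier 1: tiny flat array
        self.exact = None                # tier 2: hash set
        self.bitmap = None               # tier 3: linear-counting bitmap (stand-in)

    @staticmethod
    def _hash(value):
        # Deterministic 64-bit hash so results do not vary between runs.
        digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        h = self._hash(value)
        if self.bitmap is not None:
            self.bitmap[h % self.bitmap_bits] = 1
        elif self.exact is not None:
            self.exact.add(h)
            if len(self.exact) > self.exact_limit:
                # Too many values for the set: move everything into the bitmap.
                self.bitmap = bytearray(self.bitmap_bits)
                for x in self.exact:
                    self.bitmap[x % self.bitmap_bits] = 1
                self.exact = None
        else:
            if h not in self.small:
                self.small.append(h)
            if len(self.small) > self.small_limit:
                # Array got too long for linear search: switch to a hash set.
                self.exact = set(self.small)
                self.small = []

    def count(self):
        if self.bitmap is not None:
            zeros = max(self.bitmap.count(0), 1)  # avoid log(0) on a full bitmap
            # Linear counting: n is roughly -m * ln(fraction of empty slots).
            return round(-self.bitmap_bits * math.log(zeros / self.bitmap_bits))
        if self.exact is not None:
            return len(self.exact)
        return len(self.small)
```

Small inputs stay exact and cheap; only large inputs pay for (and accept the error of) the approximate structure.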

Low-level optimization.

Example: vectorized query execution.
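
The shape of vectorized execution can be sketched as below: instead of evaluating the whole expression one row at a time, each simple operation runs over a block of column values. This is only a structural illustration with hypothetical names (BLOCK_SIZE, prices, quantities); pure Python will not show the speed-up, which in a real engine comes from tight loops over arrays that the CPU handles efficiently.

```python
BLOCK_SIZE = 4  # hypothetical; kept small so the example is easy to follow

def blocks(column, size=BLOCK_SIZE):
    """Split a column into consecutive blocks of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def filtered_sum_row_at_a_time(prices, quantities, min_qty):
    total = 0
    for p, q in zip(prices, quantities):  # whole expression evaluated per row
        if q >= min_qty:
            total += p * q
    return total

def filtered_sum_vectorized(prices, quantities, min_qty):
    total = 0
    for p_block, q_block in zip(blocks(prices), blocks(quantities)):
        # Step 1: evaluate the filter over the whole block.
        mask = [q >= min_qty for q in q_block]
        # Step 2: evaluate the multiplication over the whole block.
        products = [p * q for p, q in zip(p_block, q_block)]
        # Step 3: aggregate only the values that passed the filter.
        total += sum(x for x, keep in zip(products, mask) if keep)
    return total
```

Both functions compute the same result; they differ only in how the work is organized.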

Specialization and attention to detail.

Example: we have 17 different GROUP BY algorithms. The best one is chosen for your query.

ClickHouse Adoption

Hundreds of companies in Russia and neighboring countries
Yandex, Mail.ru, Rambler, SKB Kontur…

Dozens of companies in Europe, USA, China
Cloudflare, Wikimedia, Lifestreet…

Unusual ClickHouse Use Cases

Search engine and analytics for Bitcoin transactions:
https://blockchair.com/

"Pretty large tables are running, only one server is used and everything works very quickly — with any filters and sorting, everything is almost instantaneous."

Bioinformatics - evolutionary genetics:
https://github.com/msestak/FindOrigin

"We are exploring evolution of novel genes in genomes because it seems that genomes are far from being static as previously believed and what actually happens is that new genes are constantly being added and old genes are lost."

LHCb experiment at LHC:
https://www.yandex.com/company/press_center/press_releases/2012/2012-04-10/

What's New in ClickHouse

Usability Improvements

Type casting for Merge-type tables

input_format_allow_errors_* settings

Ability to create more than 16 dictionaries with ODBC source

Loading part of configuration from ZK

OPTIMIZE DEDUPLICATE

ALTER of primary key: Enum, Date <-> UInt16, DateTime <-> UInt32

clickhouse --extract-from-config

Distributed Queries

Disabling lagging replicas

Disabling replicas without the table

Original query source in system.processes, system.query_log

INSERT SELECT Convenience

Type casting in INSERT SELECT

INSERT SELECT: by positions instead of names

GIS Functions

pointInEllipses

greatCircleDistance
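
What these two functions compute can be sketched in Python. This is a rough illustration, not the ClickHouse code: it assumes the haversine formula, a mean Earth radius of 6,371 km, a (longitude, latitude) argument order in degrees, and axis-aligned ellipses given as (x0, y0, a, b). The function names are Python stand-ins.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius; ClickHouse may use another value

def great_circle_distance(lon1, lat1, lon2, lat2):
    """Haversine distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding errors for near-antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def point_in_ellipses(x, y, ellipses):
    """True if (x, y) lies inside at least one ellipse (x0, y0, a, b)."""
    return any(
        ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2 <= 1
        for x0, y0, a, b in ellipses
    )
```

For example, two points on the equator 90 degrees of longitude apart are a quarter of a great circle away from each other.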

Aggregate Functions

-ForEach combinator

groupArrayInsertAt

topK (beta)
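
The semantics of these three features can be sketched in Python. The functions below are stand-ins with hypothetical names, not ClickHouse code. In particular, top_k here is exact, while the ClickHouse function may use an approximate algorithm.

```python
from collections import Counter
from itertools import zip_longest

def sum_for_each(arrays):
    """-ForEach applied to sum: add up corresponding elements of the arrays.

    Shorter arrays simply contribute nothing at the missing positions.
    """
    return [
        sum(v for v in column if v is not None)
        for column in zip_longest(*arrays)
    ]

def group_array_insert_at(pairs, default=0):
    """Build an array from (value, position) pairs; gaps get the default.

    Expects at least one pair.
    """
    size = max(pos for _, pos in pairs) + 1
    result = [default] * size
    for value, pos in pairs:
        result[pos] = value
    return result

def top_k(values, k):
    """The k most frequent values, most frequent first (exact stand-in)."""
    return [v for v, _ in Counter(values).most_common(k)]
```

For the arrays [1, 2], [3, 4, 5] and [6, 7], sum_for_each returns [10, 13, 5].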

Interfaces

ODBC driver - build and functionality on Windows

HTTPS server

Introspection

system.parts - exact row count

system.columns - uncompressed size

system.part_log

NULLs (beta)

NULL for JOIN: join_use_nulls setting

NULLS FIRST, LAST for ORDER BY

NULL support in IN

NULL support in higher-order functions

if, multiIf, ifNull, nullIf, coalesce

toNullable, assumeNotNull

Nullable type support in aggregate functions

NULL as result of subquery returning empty set
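
The behavior of several NULL-related functions listed above, and of NULLS FIRST / NULLS LAST ordering, can be sketched with Python's None standing in for NULL. The function names are Python stand-ins, and edge cases of the real implementation are not modeled.

```python
def if_null(x, alt):
    """ifNull: return alt when x is NULL, otherwise x."""
    return alt if x is None else x

def null_if(x, y):
    """nullIf: return NULL when x equals y, otherwise x."""
    return None if x == y else x

def coalesce(*args):
    """coalesce: first non-NULL argument, or NULL if there is none."""
    for arg in args:
        if arg is not None:
            return arg
    return None

def order_by(values, nulls_first=False, descending=False):
    """Sort values, placing NULLs at the start or at the end as requested."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [None] * (len(values) - len(non_null))
    return nulls + non_null if nulls_first else non_null + nulls
```

With this model, order_by([3, None, 1]) gives [1, 3, None], and passing nulls_first=True moves the NULL to the front.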

And More

ALTER ... DROP COLUMN ... FROM PARTITION

preferred_block_size_bytes setting

Ability to enable result buffering in HTTP interface

But That's Not All

KILL QUERY
LIMIT BY
SELECT INTO OUTFILE
clickhouse-local
cross-replication
UUID and MAC encoding functions
Proper HTTP response codes
Proper comparison operation logic
Progress in HTTP headers
Cache stampede fix
system.build_options, system.graphite
Distributed query tracing in system.processes, system.query_log
Ability to skip errors in text formats
fsync_metadata setting
timezone config parameter, timezone() function
decodeURLComponent
Faster gzip in HTTP interface
max_table_size_to_drop
proper build and packages
DISTINCT speedup
Buffer table optimization
FixedString optimization
RIGHT/FULL JOIN improvements
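
Of the items above, LIMIT BY is perhaps the least familiar from other SQL dialects: it keeps at most N rows for each distinct value of a key. A minimal Python sketch of that behavior, with a hypothetical limit_by helper and made-up rows:

```python
from collections import defaultdict

def limit_by(rows, n, key):
    """Keep the first n rows for each distinct key, preserving input order."""
    seen = defaultdict(int)  # how many rows have been kept per key so far
    result = []
    for row in rows:
        k = key(row)
        if seen[k] < n:
            seen[k] += 1
            result.append(row)
    return result

# Hypothetical (domain, hits) rows, already sorted as the query would sort them.
rows = [
    ("a.com", 1), ("a.com", 2), ("b.com", 3),
    ("a.com", 4), ("b.com", 5), ("b.com", 6),
]
top_two_per_domain = limit_by(rows, 2, key=lambda r: r[0])
```

Here top_two_per_domain keeps the first two rows for a.com and the first two for b.com and drops the rest.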

Roadmap

End of May — Beginning of June 2017

Distributed DDL queries

Dictionary table engine, Dictionary database engine

Reloading dictionaries using user-defined state query

June–July 2017

SYSTEM queries

Session concept in HTTP interface

Limit on concurrent replica downloads

NULLs: fix almost all remaining issues

July 2017

SELECT `db`.`table`.`column`

Q3-Q4 2017

Custom partitioning key for MergeTree

Ability to write JOIN as in regular SQL

Resource pools (CPU, disk IO, network bandwidth) for queries

Q4-Q5 2017

Initial UPDATE/DELETE support

Community

Website: https://clickhouse.com/

Google groups: https://groups.google.com/forum/#!forum/clickhouse

Mailing list: [email protected]

Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (already 668 participants)

GitHub: https://github.com/ClickHouse/ClickHouse/

Plus meetups: Moscow, Saint Petersburg, Novosibirsk,
Ekaterinburg, San Francisco... Next: Kiev, Minsk...

That's All for Now