UWDC 2017

A Bit About ClickHouse

About Me

Alexey, ClickHouse developer.

Since 2008, I have worked on the data processing engine for Yandex.Metrica.

History

Yandex.Metrica is a web analytics service.

First in Russia, second in the world.

About 25 billion events arrive every day.

We need to show reports in real time.

Old Metrica (2008–2014)

Everything worked great. Users could get approximately 50 different reports.

But there's a problem. We want more. We want each report to be infinitely customizable.

Report Constructor

We quickly built a prototype and used it to implement the "Report Constructor".

This was 2010.

It became clear where to move next.

We need a good column-oriented DBMS.

Why column-oriented?

This is how row-oriented systems work:

Why column-oriented?

This is how column-oriented systems work:
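The difference between the two layouts can be sketched in a few lines. This is only an illustration, not ClickHouse code: the table, its column names, and the helper functions are hypothetical.

```python
from array import array

# Hypothetical events table: (user_id, url_id, duration_ms).
rows = [(1, 10, 250), (2, 11, 90), (1, 12, 400), (3, 10, 30)]

# Row-oriented layout: all values of one row are stored together.
row_store = list(rows)

# Column-oriented layout: all values of one column are stored together.
column_store = {
    "user_id": array("q", (r[0] for r in rows)),
    "url_id": array("q", (r[1] for r in rows)),
    "duration_ms": array("q", (r[2] for r in rows)),
}


def total_duration_rows(store):
    # Every full row is visited even though only one field is needed.
    return sum(row[2] for row in store)


def total_duration_columns(store):
    # Only one contiguous column is read; the others are never touched.
    return sum(store["duration_ms"])
```

An analytical query usually needs a handful of columns out of hundreds, so the column layout reads far less data for the same answer.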

Why ClickHouse?

No existing solution suited us.

So we made ClickHouse.

«Evolution of Data Structures in Yandex.Metrica»

https://habrahabr.ru/company/yandex/blog/273305/

Metrica 2.0

In Brief

column-oriented
linear scalability
fault tolerance
real-time data loading
online (sub-second) queries
SQL dialect support + extensions
(arrays, nested data structures, domain-specific functions, sampling)

Main Metrica Cluster

>22 trillion rows
472 servers
data processing speed up to two terabytes per second

* If you want to try ClickHouse, one server is enough.

ClickHouse at Yandex

We managed to make the system relatively convenient.

From the very beginning we had detailed documentation.

Over a couple of years, ClickHouse spread to other Yandex departments.

Mail, Market, Direct, Webmaster, AdFox, Infrastructure, Business Analytics...

In some cases, analysts installed ClickHouse on virtual machines on their own and used it successfully without asking us any questions.

Open-source

Then we decided that ClickHouse is too good a system for us alone to use.

To make things more fun, we wanted people outside Yandex to get hooked on ClickHouse and enjoy it. So we decided to make it open-source.

Open-source

Apache 2.0 license — minimum restrictions.

Goal — maximum product distribution.

We want a Yandex product to be used worldwide.

See "Yandex Opens ClickHouse"

https://habrahabr.ru/company/yandex/blog/303282/

When to Use ClickHouse

Well-structured, cleaned, immutable events.

Click stream. Web analytics. Ad networks. RTB. E-commerce.

Online game analytics. Sensor and monitoring data. Telecom data.

Financial transactions. Stock market analytics.

When NOT to Use ClickHouse

OLTP
ClickHouse has no UPDATE and no full transactions.

Key-Value
If you need frequent updates by key, use another solution.

Blob store, document-oriented
ClickHouse is designed for large amounts of fine-grained data.

ClickHouse Adoption

Hundreds of companies in Russia and neighboring countries
Yandex, Mail.ru, Rambler, SKB Kontur…

Dozens of companies in Europe, USA, China
Cloudflare, Wikimedia, Lifestreet…

Unusual ClickHouse Use Cases

Search engine and analytics for Bitcoin transactions:
https://blockchair.com/

"Quite large tables are running, using only one server and everything works very fast — with any filters and sorting, almost instantaneous."

Bioinformatics, evolutionary genetics:
https://github.com/msestak/FindOrigin

"We are exploring evolution of novel genes in genomes because it seems that genomes are far from being static as previously believed and what actually happens is that new genes are constantly being added and old genes are lost."

LHCb experiment at CERN:
https://www.yandex.com/company/press_center/press_releases/2012/2012-04-10/

Why is ClickHouse so fast?

— out of desperation.

Yandex.Metrica must work.

Why is ClickHouse so fast?

To quickly process an analytical query, the system must:

1. Read fast.

2. Compute fast.

Why is ClickHouse so fast?

1. Read fast.

– locality by primary key;
– columns: read only what's needed;
– strict typing;
– data compression.

2. Compute fast.

– vectorized engine;
– specialization of data structures;
– low-level optimizations.
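One reason the "read fast" points reinforce each other: a strictly typed column that is sorted by key compresses much better than the same values in arrival order. A minimal sketch, assuming zlib as a stand-in codec and synthetic data (neither comes from the talk):

```python
import random
import struct
import zlib

random.seed(0)

# Synthetic column: 10,000 unsigned 32-bit values drawn from 100 distinct keys.
values = [random.randrange(100) for _ in range(10_000)]


def pack(column):
    # Strict typing: every value occupies exactly 4 bytes, with no per-value tags.
    return struct.pack(f"<{len(column)}I", *column)


# Same values, same codec; only the physical order differs.
unsorted_size = len(zlib.compress(pack(values)))
sorted_size = len(zlib.compress(pack(sorted(values))))
```

Comparing `sorted_size` with `unsorted_size` shows how much the sort order alone contributes; the exact ratio depends on the data and the codec.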

Why is ClickHouse so fast?

Algorithmic optimization.

MergeTree, data locality on disk
— fast range queries.
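A rough sketch of why data sorted by primary key gives fast range queries: keep one index entry ("mark") per block of rows, binary-search the marks, and scan only the blocks that can contain the range. The block size, the data, and the function names below are hypothetical; this is not the MergeTree implementation.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # hypothetical block size; real blocks hold thousands of rows

# Primary key column, stored sorted on disk.
keys = [1, 3, 3, 5, 8, 9, 12, 15, 15, 20, 22, 30]

# Sparse index: the first key of every block.
marks = keys[::GRANULE]


def rows_to_scan(lo, hi):
    """Return (start, stop) row offsets of the blocks that may hold keys in [lo, hi]."""
    # The block before the first mark >= lo may still end with keys >= lo.
    first = max(bisect_left(marks, lo) - 1, 0)
    # Blocks whose first key is > hi cannot contain matching rows.
    last = bisect_right(marks, hi)
    return first * GRANULE, min(last * GRANULE, len(keys))


def range_query(lo, hi):
    start, stop = rows_to_scan(lo, hi)
    # Only the selected blocks are read and filtered.
    return [k for k in keys[start:stop] if lo <= k <= hi]
```

For example, `range_query(9, 12)` scans only rows 4 to 7 instead of the whole column. The index stays small enough to live in memory because it has one entry per block, not per row.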

Example: the uniqCombined function combines three different data structures, each suited to a different cardinality range.
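The idea of switching data structures as cardinality grows can be sketched as follows. The slide does not name the three structures; this sketch assumes a flat array, a hash set, and HyperLogLog registers. The thresholds, the hash function, and the class name are hypothetical, and the bias correction a production implementation would need is omitted.

```python
import hashlib
import math

SMALL_LIMIT = 16     # hypothetical threshold: array -> hash set
MEDIUM_LIMIT = 1024  # hypothetical threshold: hash set -> HyperLogLog
HLL_BITS = 12        # 4096 registers


def hash64(value):
    digest = hashlib.sha1(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


class CombinedUniq:
    def __init__(self):
        self.small = []        # tier 1: flat array, linear search
        self.medium = None     # tier 2: hash set, exact
        self.registers = None  # tier 3: HyperLogLog, approximate

    def add(self, value):
        h = hash64(value)
        if self.registers is not None:
            self._hll_add(h)
        elif self.medium is not None:
            self.medium.add(h)
            if len(self.medium) > MEDIUM_LIMIT:
                # Too many distinct values: fold the set into registers.
                self.registers = [0] * (1 << HLL_BITS)
                for x in self.medium:
                    self._hll_add(x)
                self.medium = None
        elif h not in self.small:
            self.small.append(h)
            if len(self.small) > SMALL_LIMIT:
                self.medium = set(self.small)
                self.small = []

    def _hll_add(self, h):
        rest_bits = 64 - HLL_BITS
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        # Position of the first 1 bit in the remaining bits, counted from 1.
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        if self.registers is None:
            return len(self.medium) if self.medium is not None else len(self.small)
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            # Linear counting is more accurate while many registers are empty.
            estimate = m * math.log(m / zeros)
        return round(estimate)
```

Small cardinalities stay exact and cheap, and only large ones pay for (and benefit from) the fixed-size approximate structure.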

Low-level optimization.

Example: vectorized query execution.
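The shape of vectorized execution, as a toy sketch: each operator is called once per block of column values and runs a tight loop inside, instead of being called once per row. The block size, columns, and operator names are hypothetical; a real engine does this in compiled code, often with SIMD instructions.

```python
from array import array

BLOCK_SIZE = 4  # hypothetical; real blocks hold thousands of values

price = array("d", [2.0, 5.0, 1.5, 4.0, 3.0, 10.0])
qty = array("q", [1, 3, 2, 1, 4, 2])


def blocks(n, size):
    # Yield (start, stop) offsets that cut a column of n values into blocks.
    for start in range(0, n, size):
        yield start, min(start + size, n)


def multiply(a, b):
    # One operator call processes a whole block of values.
    return array("d", (x * y for x, y in zip(a, b)))


def filter_gt(values, condition_column, threshold):
    # Keep the values whose matching condition value exceeds the threshold.
    return array("d", (v for v, c in zip(values, condition_column) if c > threshold))


def revenue_vectorized():
    # Equivalent to: SELECT sum(price * qty) WHERE qty > 1
    total = 0.0
    for start, stop in blocks(len(price), BLOCK_SIZE):
        p, q = price[start:stop], qty[start:stop]
        total += sum(filter_gt(multiply(p, q), q, 1))
    return total
```

The per-call overhead of interpreting the query is paid once per block rather than once per row, which is where most of the gain comes from.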

Specialization and attention to detail.

Example: we have 17 different algorithms for GROUP BY. The best one is chosen for your query.
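What "the best one is chosen" can mean in practice, as a simplified sketch with just two variants. The dispatch rule and function names are hypothetical and are not ClickHouse's actual selection logic.

```python
from collections import defaultdict


def count_by_small_int(keys, key_space=256):
    # Keys fit in one byte: a flat lookup table replaces hashing entirely.
    counts = [0] * key_space
    for k in keys:
        counts[k] += 1
    return {k: c for k, c in enumerate(counts) if c}


def count_by_hash(keys):
    # General case: a hash table keyed by the grouping value.
    counts = defaultdict(int)
    for k in keys:
        counts[k] += 1
    return dict(counts)


def group_count(keys):
    # Hypothetical dispatch: pick the cheapest structure the key type allows.
    if keys and all(isinstance(k, int) and 0 <= k < 256 for k in keys):
        return count_by_small_int(keys)
    return count_by_hash(keys)
```

Usage: `group_count([1, 2, 1, 255])` takes the lookup-table path, while `group_count(["a", "b", "a"])` falls back to the hash table. A real engine would decide from the column types before reading any data, not by inspecting the values.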

ClickHouse vs. Spark

https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/

ClickHouse vs. typical row-oriented DBMS

Itai Shirav:

«I haven't made a rigorous comparison, but I did convert a time-series table with 9 million rows from Postgres to ClickHouse.

Under ClickHouse queries run about 100 times faster, and the table takes 20 times less disk space. Which is pretty amazing if you ask me».

Bao Dang:

«Obviously, ClickHouse outperformed PostgreSQL at any metric».

https://github.com/AnalyticsGo/AnalyticsGo/issues/1

ClickHouse vs. Vertica

Timur Shenkao:

«ClickHouse is extremely fast at simple SELECTs without joins, much faster than Vertica».

ClickHouse vs. PrestoDB

Ömer Osman Koçak:

«When we evaluated ClickHouse the results were great compared to Prestodb. Even though the columnar storage optimizations for ORC and Clickhouse is quite similar, Clickhouse uses CPU and Memory resources more efficiently (Presto also uses vectorized execution but cannot take advantage of hardware level optimizations such as SIMD instruction sets because it's written in Java so that's fair) so we also wanted to add support for Clickhouse for our open-source analytics platform Rakam (https://github.com/rakam-io/rakam)»

ClickHouse vs. Google BigQuery

«ClickHouse shows comparable speed on such query for 30 days and 8 times faster (!) on such query. We plan to test other queries as well, haven't gotten to it yet.

Query execution speed is stable. In Google BigQuery during peak loads, for example at 4:00 p.m. PDT or at the beginning of the month, query execution time can noticeably increase».

ClickHouse vs. Druid

«This year we deployed a setup based on Druid — Imply Analytics Platform, as well as Tranquility, and were ready to launch to production… But after ClickHouse came out we immediately abandoned Druid, even though we spent two months studying and implementing it».

https://habrahabr.ru/company/smi2/blog/314558/