PG Day 2017

A Bit About ClickHouse

About Me

Alexey, ClickHouse developer.

Since 2008, I've been working on the data processing engine for Yandex.Metrica.

History

Yandex.Metrica is a web analytics service.

First in Russia, second in the world.

About ~25 billion events arrive daily.

We need to show reports in real time.

Old Metrica (2008–2014)

Everything worked great. Users could get about 50 different reports.

But there's a problem. We wanted more. For each report to be arbitrarily customizable.

Report Builder

We quickly made a prototype and implemented a "Report Builder" based on it.

This was in 2010.

It became clear where to move forward.

We needed a good column-oriented DBMS.

Why Column-Oriented?

This is how row-oriented systems work:

Why Column-Oriented?

This is how column-oriented systems work:

Why ClickHouse?

Nothing ready-made was suitable.

So we made ClickHouse.

"Evolution of Data Structures in Yandex.Metrica"

https://habrahabr.ru/company/yandex/blog/273305/

Metrica 2.0

In Brief

column-oriented
linear scalability
fault tolerance
real-time data loading
online (sub-second) queries
support for SQL dialect + extensions
(arrays, nested data structures, domain-specific functions, sampling)

Metrica's Main Cluster

>23 trillion rows
500 servers
data processing speed up to two terabytes per second

* If you want to try ClickHouse, one server is enough.

ClickHouse at Yandex

We managed to make a relatively convenient system.

From the very beginning, we had detailed documentation.

Within a couple of years, ClickHouse spread to other departments of Yandex.

Mail, Market, Direct, Webmaster, AdFox, Infrastructure, Business Analytics...

There are cases where analysts independently installed ClickHouse on virtual machines and successfully used it without any questions.

Open-Source

Then we decided — ClickHouse is too good a system for us to keep to ourselves.

To make it more fun, we need to get people outside to use ClickHouse, let them enjoy it. We decided to make it open-source.

Open-Source

Apache 2.0 license — minimal restrictions.

Goal — maximum product distribution.

We want Yandex's product to be used worldwide.

See "Yandex Opens ClickHouse"

https://habrahabr.ru/company/yandex/blog/303282/

When to Use ClickHouse

Well-structured, cleaned, immutable events.

Click stream. Web analytics. Ad networks. RTB. E-commerce.

Online game analytics. Sensor and monitoring data. Telecom data.

Financial transactions. Stock market analytics.

When Not to Use ClickHouse

OLTP
ClickHouse has no UPDATE and full-fledged transactions.

Key-Value
If you need frequent update queries by key, use another solution.

Blob-store, document oriented
ClickHouse is intended for large amounts of fine-grained data.

ClickHouse Adoption

Hundreds of companies in Russia and nearby
Yandex, Mail.ru, Rambler, SKB Kontur, SMI2…

Dozens of companies in Europe, USA, China
Cloudflare, Wikimedia, Lifestreet, Vertamedia, Carto, Nvidia, Infinidat…

How do we analyze over O(100B) DNS requests daily.

"ClickHouse enables us and our customers to explore the dataset in real time to get operational insights. Due to many of the optimizations built into ClickHouse we are able to store the data for a long time allowing us to look at events in perspective and at historical trends."

Wikimedia.org

"clickhouse is a columnar datastore that we are using as an aid to run complex SQL queries on the edit data "lake" that we have as a result of the edit reconstruction project. It is similar to Druid but faster for complex queries."

https://phabricator.wikimedia.org/T150343

Unusual Examples of ClickHouse Use

Search engine and analytics for Bitcoin transactions:
https://blockchair.com/

"Pretty large tables are running, only one server is used and everything works very fast — with any filters and sorting everything is almost instant."

Bioinformatics - evolutionary genetics:
https://github.com/msestak/FindOrigin

"We are exploring evolution of novel genes in genomes because it seems that genomes are far from being static as previously believed and what actually happens is that new genes are constantly being added and old genes are lost."

LHCb experiment at LHC:
https://www.yandex.com/company/press_center/press_releases/2012/2012-04-10/

Why Is ClickHouse So Fast?

— out of necessity.

Yandex.Metrica must work.

Why Is ClickHouse So Fast?

To quickly process an analytical query, the system must:

1. Read quickly.

2. Compute quickly.

Why Is ClickHouse So Fast?

1. Read quickly.

– locality by primary key;
– columns - we read only what's needed;
– strong typing;
– data compression.

2. Compute quickly.

– vectorized engine;
– specialization of data structures;
– low-level optimizations.

Why Is ClickHouse So Fast?

Algorithmic optimization.

MergeTree, data locality on disk
— fast range queries.

Example: the uniqCombined function consists of a combination of three different data structures, suitable for different cardinality ranges.

Low-level optimization.

Example: vectorized query execution.

Specialization and attention to detail.

Example: we have 17 different algorithms for executing GROUP BY. The best one is chosen for your query.

Independent Benchmarks

ClickHouse vs. Spark & MariaDB Column Store

https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/

Independent Benchmarks

ClickHouse vs. Redshift

https://www.altinity.com/blog/2017/6/20/clickhouse-vs-redshift

ClickHouse vs. Greenplum

Significant Limitations of ClickHouse

Limited JOIN support

No UPDATE, DELETE

Weak SQL standard compatibility

ClickHouse vs. Vertica

Timur Shenkao:

«ClickHouse is extremely fast at simple SELECTs without joins, much faster than Vertica».

ClickHouse vs. PrestoDB

Ömer Osman Koçak:

«When we evaluated ClickHouse the results were great compared to Prestodb. Even though the columnar storage optimizations for ORC and Clickhouse is quite similar, Clickhouse uses CPU and Memory resources more efficiently (Presto also uses vectorized execution but cannot take advantage of hardware level optimizations such as SIMD instruction sets because it's written in Java so that's fair) so we also wanted to add support for Clickhouse for our open-source analytics platform Rakam (https://github.com/rakam-io/rakam)»

ClickHouse vs. Google BigQuery

«ClickHouse shows comparable speed on this query for 30 days and 8 times faster (!) on this query. We plan to test other queries as well, haven't gotten to it yet.

Query execution speed is stable. In Google BigQuery during peak loads, for example at 4:00 p.m. PDT or at the beginning of the month, query execution time can noticeably increase».

ClickHouse vs. Druid

«This year we deployed a Druid-based build — Imply Analytics Platform, as well as Tranquility, and were ready to launch into production… But after ClickHouse came out, we immediately abandoned Druid, even though we spent two months studying and implementing it».

https://habrahabr.ru/company/smi2/blog/314558/