Author: Alexey Milovidov, 2017-11-09.
Alexey, ClickHouse developer.
1. I'm tired of it.
2. Everyone already knows.
Common mistakes we've encountered in practice.
Example:
Table hits: 105 columns,
row size in uncompressed binary format — 718 bytes.
Batch size 1000000 rows:
Insert into MergeTree table
— 506,650 rows per second;
Insert into ReplicatedMergeTree table with two replicas
— 393,940 rows per second;
Insert into ReplicatedMergeTree table
with two replicas and insert_quorum = 2
— 259,660 rows per second;
Example:
Table hits: 105 columns,
row size in uncompressed binary format — 718 bytes.
Batch size — one row:
Insert into MergeTree table
— 59 rows per second;
Insert into ReplicatedMergeTree table with two replicas
— 6 rows per second;
Insert into ReplicatedMergeTree table
with two replicas and insert_quorum = 2
— 2 rows per second;
— 10,000 .. 100,000 times slower.
Why does this happen?
This is a limitation of ClickHouse:
MergeTree tables have no write-ahead log and no memtable.
Data is written to the file system immediately,
already arranged by columns.
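Hence, insert in large batches. A minimal sketch with a hypothetical table and columns (the measurements above used batches of a million rows):

INSERT INTO events (EventDate, UserID, Message) VALUES
    ('2017-11-09', 1, 'first'),
    ('2017-11-09', 2, 'second'),
    ('2017-11-09', 3, 'third');

The methods below are ways to accumulate such batches.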
Method 1.
Use a distributed queue (Kafka).
Extract data from the queue in batches once per second
and write to ClickHouse.
Disadvantages:
— a cumbersome setup;
— yet another complex distributed system to operate.
Method 2.
Write logs to a file.
Once per second, rename the file and open a new one.
A separate process takes the oldest file and inserts it into ClickHouse.
Disadvantages:
— possibility of data loss in case of hardware failure.
Method 3.
Server process accumulates a batch of logs in an internal buffer.
Periodically, the accumulated buffer is swapped with an empty one
and a separate thread writes the accumulated data to ClickHouse.
Disadvantages:
— data loss if the process is killed with kill -9;
— data loss or OOM if unable to write to ClickHouse;
Method 4.
The server process keeps an open connection to ClickHouse,
issues an INSERT query over it, and sends rows one at a time
or in small batches using Transfer-Encoding: chunked.
Periodically, the stream is closed
and a new INSERT query is started.
In this case, ClickHouse buffers the data
on its side.
Disadvantages:
— data loss if the process is killed with kill -9;
Method 5.
https://github.com/nikepan/clickhouse-bulk
— intermediate server for grouping INSERTs into batches;
Method 6.
Buffer tables.
Advantages:
— very easy to start using;
Disadvantages:
— doesn't completely solve the problem, batches are still needed;
— introduces other problems (no log, possible data loss).
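Starting with a Buffer table is one CREATE query. A minimal sketch, assuming the target table default.hits already exists (the parameters are num_layers, then min/max thresholds for time in seconds, rows, and bytes):

CREATE TABLE hits_buffer AS default.hits
ENGINE = Buffer(default, hits, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

INSERTs go to hits_buffer; ClickHouse flushes the buffer to default.hits in the background once the thresholds are reached.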
In the plans: support for direct import from Kafka inside ClickHouse.
What else makes INSERTs slow:
— inserting data in VALUES format with computed expressions;
— inserting data that belongs to a large
number of partitions in each batch.
Example:
Country String,
City String,
Browser String,
IPAddress String,
...
Strings compress almost as well as numeric identifiers.
There's almost no difference in disk and IO.
But during query processing, the CPU difference is at least several times.
Example:
SELECT uniqExact(IPAddress) FROM ip_address_num
— 0.073 sec, 137.79 million rows/s.
SELECT uniqExact(IPAddress) FROM ip_address_string
— 0.265 sec, 37.70 million rows/s.
SELECT formatReadableSize(sum(data_compressed_bytes)) FROM system.columns WHERE table = 'ip_address_num' (and likewise for 'ip_address_string')
— 30.92 MiB vs. 43.65 MiB.
1. If there are few different values,
just store numeric identifiers.
Converting to strings and back can be done on the application side.
This is a typical practice applicable to almost all DBMSs.
Example:
Region String
'Moscow and Moscow Region'
RegionID UInt32
250
2. If there are few different values, you can use the Enum data type.
Enum works as efficiently as a numeric identifier.
Disadvantage: adding a new value requires an ALTER (sketched below).
The ALTER itself is free, but it waits on a lock for queries that are already running.
Example:
DeviceType String
'mobile'
DeviceType Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3, 'tv' = 4)
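Appending a new value to the Enum later is a cheap metadata change. A sketch of such an ALTER, with a hypothetical 'watch' value:

ALTER TABLE hits MODIFY COLUMN
    DeviceType Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3, 'tv' = 4, 'watch' = 5);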
3. If there aren't too many different values,
just store numeric identifiers.
For convenience, you can connect external dictionaries.
From ClickHouse, MySQL, Mongo, Postgres, ODBC, file, HTTP server...
Example: advertising campaigns and banners in Yandex.Direct.
ClickHouse stores CampaignID UInt32, BannerID UInt64.
Dictionaries are connected from MySQL.
Campaigns — always in memory. Banners — cache dictionary.
Bonus: easy to change strings.
Features: cache dictionary from MySQL
works normally only with hit rate ~ 100%.
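A sketch of querying through such a dictionary, assuming a dictionary named 'campaigns' with a String attribute 'name' and a hypothetical direct_stats table:

SELECT
    dictGetString('campaigns', 'name', toUInt64(CampaignID)) AS CampaignName,
    count() AS c
FROM direct_stats
GROUP BY CampaignName
ORDER BY c DESC;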
4. If you can't do auto-increment, you can hash.
With a 64-bit hash there will be collisions
(by the birthday bound, p ≈ 1/2 at N ≈ 5 billion).
To stop worrying about them, add
the client identifier to the dictionary key.
Bonus — for many operations, hashes alone are sufficient.
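A small sketch with literal values; hash functions take any number of arguments, so the client identifier simply becomes part of the key:

SELECT
    cityHash64('Some page title') AS TitleHash,        -- 64-bit hash instead of the string
    cityHash64(42, 'Some page title') AS ScopedHash;   -- collisions now only matter within client 42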
5. Short strings are fine to store as is.
Example:
URLDomain String
'yandex.ru'
BrowserLanguage String
'ru'
6. If the set of strings is large and unlimited,
and a significant portion is unique.
Example: URL, Referer, SearchPhrase, Title.
Then making dictionaries is completely pointless and contraindicated.
Store as is!
Don't try to put them in a separate table and normalize the data.
7. Sometimes it's reasonable to store redundant columns with extracted data.
Example: there's a URL column.
Add another URLDomain column.
SELECT domain(URL) AS k, count() AS c FROM test.hits GROUP BY k
— 0.166 sec. Processed 762.68 MB (53.60 million rows/s., 4.61 GB/s.)
SELECT URLDomain AS k, count() AS c FROM test.hits GROUP BY k
— 0.067 sec. Processed 188.06 MB (132.44 million rows/s., 2.81 GB/s.)
On disk: URL - 126.80 MiB, URLDomain - 5.61 MiB.
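One way to maintain such a column is a MATERIALIZED expression, so it is filled automatically on insert. A sketch against test.hits:

ALTER TABLE test.hits ADD COLUMN URLDomain String MATERIALIZED domain(URL);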
8. Use the correct data types.
ClientIP String '5.255.255.5'
ClientIPv6 String '2a02:6b8:a::a'
DeviceID String 'c9792d6e-ab40-42e6-b171-aa872e880dbd'
ClientIP UInt32
ClientIPv6 FixedString(16)
DeviceID FixedString(16)
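The conversions are built in; a sketch with the values above (IPv4NumToString, IPv6NumToString, and UUIDNumToString convert back for display):

SELECT
    IPv4StringToNum('5.255.255.5') AS ip4,                                 -- UInt32
    IPv6StringToNum('2a02:6b8:a::a') AS ip6,                               -- FixedString(16)
    UUIDStringToNum('c9792d6e-ab40-42e6-b171-aa872e880dbd') AS device_id;  -- FixedString(16)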
9. Sometimes strings are better preprocessed in advance.
BrowserVersion String
'12.3.67542.29'
splitByChar('.', BrowserVersion)[1]
BrowserVersion1 UInt8
BrowserVersion2 UInt16
BrowserVersion3 UInt32
BrowserVersion4 UInt16
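A sketch of the split, using the *OrZero conversions so that a malformed version yields 0 instead of an error:

SELECT
    toUInt8OrZero(splitByChar('.', '12.3.67542.29')[1])  AS BrowserVersion1,
    toUInt16OrZero(splitByChar('.', '12.3.67542.29')[2]) AS BrowserVersion2,
    toUInt32OrZero(splitByChar('.', '12.3.67542.29')[3]) AS BrowserVersion3,
    toUInt16OrZero(splitByChar('.', '12.3.67542.29')[4]) AS BrowserVersion4;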
10. Very long strings and very long arrays
shouldn't be stored in ClickHouse as-is.
Two options:
1. Truncate on insert.
Example: in one service, only
the first 512 user event parameters are stored;
2. Store externally, and in ClickHouse — hashes.
You can load data into a Log table with String fields,
then analyze them in ClickHouse,
and determine the correct types before loading further.
:) SHOW TABLES

┌─name──────┐
│ stats_0   │
│ stats_1   │
│ stats_2   │
│ stats_3   │
│ stats_4   │
│ stats_5   │
│ stats_6   │
│ stats_7   │
│ stats_8   │
│ stats_9   │
│ stats_10  │
...
│ stats_999 │
└───────────┘
Old habits from MySQL experience:
— lack of clustered primary key in MyISAM tables in MySQL;
— simpler maintenance operations with small tables;
— microsharding.
1. Data in the table is ordered by primary key.
Reading a range is efficient regardless of the total
amount of data in the table. Primary key is clustered.
2. Table already contains internal partitioning.
Each table is a rather bulky object.
For each table, some background merge work
must be constantly performed.
3. No problems with maintenance operations.
ALTER ADD/DROP COLUMN executes in O(1).
A table consists of immutable parts, so recovery after failure is trivial.
4. ClickHouse is about throughput.
Processing 10 rows takes about the same time as processing 10,000 rows.
If your tables contain fewer than 10,000 rows,
splitting them into smaller ones gains nothing.
One big table! Don't be afraid.
Example: we store a table with Graphite metrics,
on one server — 2,000,000,000,000 rows.
Bonus:
Support for arbitrary partitioning key in ClickHouse.
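With the new syntax it looks roughly like this (hypothetical table; PARTITION BY takes an arbitrary expression):

CREATE TABLE graphite_data
(
    Metric String,
    Time DateTime,
    Value Float64,
    Date Date
) ENGINE = MergeTree
PARTITION BY toYYYYMM(Date)
ORDER BY (Metric, Time);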
If you still need many small tables without an index
— use StripeLog.
A reasonable approach to sharding, but from another world.
Don't use this approach in ClickHouse.
The essence — data is divided into 1000 virtual shards
in any convenient way.
Each virtual shard is mapped to a set of servers arbitrarily, and the mapping is stored separately.
On each physical server, several instances run:
in different databases or on different ports.
Reason — simplify the task of data rebalancing.
Why shouldn't you use this with ClickHouse?
1. If data is located in different ClickHouse instances,
then query processing by different instances requires
interprocess communication via TCP.
Even over localhost, this is much slower than in-process.
2. Within a single instance, threads pick up work as they go, so it is evenly distributed across threads.
With separate instances, one instance finishes before the others: tail latency.
3. Even if tables are located in one ClickHouse instance,
you get problems due to a large number of small tables.
When combining tables with a Merge table, some optimizations are disabled.
Preaggregation is normal. Even in ClickHouse.
But too granular preaggregation is pointless.
Example*:
SummingMergeTree(EventDate,
(OrderID, EventDate, CampaignID, BannerID, AdID, TypeID,
DisplayID, ContextID, TargetType, AbcID, IsGood, CostType, PageID),
8192)
* column names changed.
Disadvantages:
— data volume decreases, but only by a few times compared to the non-aggregated data;
— developers constantly ask to extend the primary key, but ClickHouse has no support for ALTER PRIMARY KEY;
— among aggregates, only sums are kept (this can be worked around with AggregatingMergeTree, sketched below);
— it suits only primitive reports, not real analytics.
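A sketch of the AggregatingMergeTree workaround (hypothetical table; columns are filled with -State combinators and read back with -Merge):

CREATE TABLE campaign_stats
(
    EventDate Date,
    CampaignID UInt32,
    Users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree(EventDate, (CampaignID, EventDate), 8192);

SELECT CampaignID, uniqMerge(Users) AS users
FROM campaign_stats
GROUP BY CampaignID;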
For ClickHouse, non-aggregated data is the best scenario.
Preaggregation can also be used for simple cases.
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=1764026490'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=132741461'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=4897535466'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=1301256137'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=2149858559'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=544614111'
...
If the index isn't used, then each query
is a separate pass through the data;
If the index is used, then each query
reads slightly more data than needed,
for example, reads 100,000 rows instead of one row
— again many passes through the same data.
They can be rewritten as one query:
SELECT URL, count() FROM hits WHERE URL IN (...) GROUP BY URL
Bonuses:
— in ClickHouse, you don't need to worry about
passing megabytes of data in the IN section;
— in ClickHouse, the index always works no slower than a full scan;
Features:
— IN (SELECT ...) doesn't use the index;
— in distributed query processing,
query text is transmitted without compression.
What if queries come from an API?
— limit API users:
quotas, adaptive throttling.
Bonus: you can set up quotas per user in ClickHouse by passing quota_key.
If the API is accessible externally, quotas are mandatory!
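Over the HTTP interface, the key is just a request parameter (hypothetical host and key; the quota itself must be declared as keyed in the server's users config):

http://clickhouse-host:8123/?user=api_user&quota_key=client_42&query=SELECT+1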
What is this?
You write the same data identically
to completely independent servers.
Advantages:
— no ZooKeeper in infrastructure;
— ease of use, integration into existing pipeline.
Disadvantages:
— need to restore replicas manually;
— due to errors and non-deterministic behavior on the application side, data diverges during operation;
— you can't determine which copy of data is correct.
Bonus:
In ClickHouse, it's easy to create a Distributed table
that points to the "manual" replicas.
With load_balancing = 'nearest_hostname',
you can avoid flapping between diverged replicas.
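A sketch, assuming a cluster named manual_replicas is described in the server config and the local table is default.hits:

CREATE TABLE hits_all AS default.hits
ENGINE = Distributed(manual_replicas, default, hits);

SET load_balancing = 'nearest_hostname';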
Correct solution:
Use ReplicatedMergeTree.
Advantages:
— recovery after failure;
— replicas monitor data consistency themselves.
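A sketch in the old-style syntax (hypothetical columns; the {replica} macro must be defined in the server config):

CREATE TABLE hits_replicated
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/hits', '{replica}',
                               EventDate, (CounterID, EventDate), 8192);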
For primary data — use the MergeTree family.
— the table doesn't have to contain a real Date;
— you can use Date DEFAULT '2000-01-01' (sketched below),
— or tables without partitioning.
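A sketch of the dummy-date variant (hypothetical table, old-style syntax):

CREATE TABLE app_events
(
    EventDate Date DEFAULT '2000-01-01',
    EventTime DateTime,
    Message String
) ENGINE = MergeTree(EventDate, (EventTime, EventDate), 8192);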
Reasons:
— sorting by key, index;
— atomic insert;
— no locks with concurrent INSERT and SELECT;
— immutable data structures on disk;
— replication;
For one-time loading of small-medium volumes
— Log.
For temporary batches for intermediate processing
— StripeLog, TinyLog.
For small volume of temporary data
— Memory.
Example:
SELECT url, count() FROM stats
ANY INNER JOIN urls USING (url_id)
GROUP BY url
Why is this bad?
ClickHouse only supports Hash JOIN. No support for Merge JOIN.
Hash JOIN is random access in memory.
For high cardinality, this is worse than reading the data in place.
Solution: for joined data with high cardinality,
it's better to store it in the main table.
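With url stored directly in stats, the join disappears and the query becomes a plain aggregation:

SELECT url, count() AS c FROM stats GROUP BY url;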
10. Using OPTIMIZE queries to
simulate UPDATE functionality.
11. Using distributed JOIN
with large right table.
ClickHouse builds poor query plans for distributed JOINs.
12. Using ClickHouse only to
read data back with SELECT *.
13. Performing excessively heavy computations
inside ClickHouse (a high cycles-per-byte cost).
Website: https://clickhouse.com/
Google groups: https://groups.google.com/forum/#!forum/clickhouse
Mailing list: [email protected]
Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (already 1041 members)
GitHub: https://github.com/ClickHouse/ClickHouse/
+ meetups. Moscow, St. Petersburg, Novosibirsk, Yekaterinburg, Minsk, Berlin... Next: Nizhny Novgorod, Moscow.