Author: Alexey Milovidov, 2017-12-06.
Alexey Milovidov, ClickHouse developer.
Common mistakes we've encountered in practice.
Example:
Table hits: 105 columns,
row size in uncompressed binary format — 718 bytes.
Batch size — 1,000,000 rows:
Insert into MergeTree table
— 506,650 rows per second;
Insert into ReplicatedMergeTree with two replicas
— 393,940 rows per second;
Insert into ReplicatedMergeTree
with two replicas and insert_quorum = 2
— 259,660 rows per second;
Example:
Table hits: 105 columns,
row size in uncompressed binary format — 718 bytes.
Batch size — one row:
Insert into MergeTree table
— 59 rows per second;
Insert into ReplicatedMergeTree with two replicas
— 6 rows per second;
Insert into ReplicatedMergeTree
with two replicas and insert_quorum = 2
— 2 rows per second;
That is, 10,000 to 100,000 times slower.
Why does this happen?
This is a ClickHouse limitation:
MergeTree tables have no write-ahead log and no memtable.
Each INSERT is immediately written to the file system as a new part,
laid out by columns.
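The takeaway: accumulate data and insert it in large batches.
A minimal sketch, assuming a hits table with the three columns below
(in practice, accumulate at least thousands of rows per INSERT):

INSERT INTO hits (EventDate, CounterID, URL) VALUES
    ('2017-12-06', 1, 'https://example.com/a'),
    ('2017-12-06', 1, 'https://example.com/b'),
    ('2017-12-06', 2, 'https://example.com/c')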
Method 1.
Use a distributed queue (Kafka).
Pull data from the queue in batches once per second
and write to ClickHouse.
Disadvantages:
— a cumbersome setup;
— yet another complex distributed system to operate.
Method 2.
Write logs to a file.
Once per second, rename the file and open a new one.
A separate process takes the oldest file and inserts it into ClickHouse.
Disadvantages:
— possibility of data loss on hardware failure.
Method 3.
Server process accumulates a batch of logs in an internal buffer.
Periodically, the accumulated buffer swaps with an empty one
and a separate thread writes the accumulated data to ClickHouse.
Disadvantages:
— data loss on kill -9 of the process;
— data loss or OOM when unable to write to ClickHouse;
Method 4.
The server process keeps an open connection to ClickHouse
with a running INSERT query, sending rows one by one
or in small batches using Transfer-Encoding: chunked.
Periodically, the data stream ends
and a new INSERT query is executed.
In this case, ClickHouse buffers
the data on its own side.
Disadvantages:
— data loss on kill -9 of the process;
Method 5.
https://github.com/nikepan/clickhouse-bulk
— intermediate server for grouping INSERTs into batches;
Method 6.
Buffer tables.
Advantages:
— very easy to start using;
Disadvantages:
— doesn't fully solve the problem, batches still needed;
— adds other problems (no log, possible data loss).
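For reference, a minimal Buffer table sketch over an existing hits table
(the numeric thresholds follow the documentation example and are not tuned):

CREATE TABLE hits_buffer AS hits
ENGINE = Buffer(default, hits, 16, 10, 100, 10000, 1000000, 10000000, 100000000)

Writes go to hits_buffer; data is flushed to hits
when the time/rows/bytes thresholds are reached.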
Method 7.
Support for importing from Kafka directly inside ClickHouse.
What else slows down INSERTs:
— inserting data in VALUES format with computed expressions;
— inserting data that belongs to a large
number of partitions in each batch.
Example:
Country String,
City String,
Browser String,
IPAddress String,
...
Strings compress almost as well as numeric identifiers.
There's almost no difference in disk and IO.
But at query time, the difference in CPU cost is at least severalfold.
Example:
SELECT uniqExact(IPAddress) FROM ip_address_num
— 0.073 sec, 137.79 million rows/s.
SELECT uniqExact(IPAddress) FROM ip_address_string
— 0.265 sec, 37.70 million rows/s.
SELECT formatReadableSize(sum(data_compressed_bytes))
FROM system.columns WHERE table = 'ip_address_num'
— 30.92 MiB vs. 43.65 MiB for the string variant.
1. If there are few different values,
just write numeric identifiers.
Converting to strings and back can be done on the application side.
This is standard practice, applicable to almost any DBMS.
Example:
Instead of:
Region String — 'Moscow and Moscow Region'
use:
RegionID UInt32 — 250
2. If there are few different values, you can use the Enum data type.
Enum works as efficiently as a numeric identifier.
Disadvantage: sometimes you need to do ALTER.
ALTER is free, but it waits on a lock for currently executing queries.
Example:
Instead of:
DeviceType String — 'mobile'
use:
DeviceType Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3, 'tv' = 4)
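A sketch of such an ALTER, assuming a table named hits
(the new value 'watch' is hypothetical; appending values at the end of an Enum is a cheap metadata change):

ALTER TABLE hits MODIFY COLUMN
    DeviceType Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3, 'tv' = 4, 'watch' = 5)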
3. If there aren't too many different values,
just write numeric identifiers.
For convenience, you can connect external dictionaries.
From ClickHouse, MySQL, Mongo, Postgres, ODBC, file, HTTP server...
Example: advertising campaigns and banners in Yandex.Direct.
ClickHouse stores CampaignID UInt32, BannerID UInt64.
Dictionaries from MySQL are connected.
Campaigns — always in RAM. Banners — cache dictionary.
Bonus: easy to change strings.
Caveat: a cache dictionary over MySQL
performs well only with a hit rate of ~100%.
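A sketch of resolving names through a dictionary at query time
(the dictionary name 'campaigns' and its attribute 'name' are assumptions):

SELECT
    dictGetString('campaigns', 'name', toUInt64(CampaignID)) AS CampaignName,
    count()
FROM hits
GROUP BY CampaignName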
4. If there's no way to do auto-increment, you can hash.
With a 64-bit hash there will be collisions
(by the birthday bound, p ≈ 1/2 already at N ≈ 5 billion).
To be safe, also add
a client identifier to the dictionary key.
Bonus — for many operations, hashes alone are enough.
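A sketch, assuming the table stores a URLHash column
computed at insert time instead of the full string:

-- at insert time: URLHash = sipHash64(URL)
SELECT count() FROM hits WHERE URLHash = sipHash64('https://yandex.ru/')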
5. Short strings are fine to store as is.
Example:
URLDomain String
'yandex.ru'
BrowserLanguage String
'ru'
6. If the set of strings is large and unbounded,
and a significant portion of the values is unique.
Example: URL, Referer, SearchPhrase, Title.
Then making dictionaries is completely pointless and contraindicated.
Store as is!
No need to try to move to a separate table and normalize data.
7. Sometimes it's reasonable to store redundant columns with data extracts.
Example: there is a URL column.
Add a URLDomain column.
SELECT domain(URL) AS k, count() AS c FROM test.hits GROUP BY k
— 0.166 sec. Processed 762.68 MB (53.60 million rows/s., 4.61 GB/s.)
SELECT URLDomain AS k, count() AS c FROM test.hits GROUP BY k
— 0.067 sec. Processed 188.06 MB (132.44 million rows/s., 2.81 GB/s.)
On disk: URL - 126.80 MiB, URLDomain - 5.61 MiB.
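One way to add such a column so it is computed automatically on INSERT
(a sketch; MATERIALIZED columns are also excluded from SELECT *):

ALTER TABLE hits ADD COLUMN URLDomain String MATERIALIZED domain(URL)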
8. Use the right data types.
Instead of:
ClientIP String — '5.255.255.5'
ClientIPv6 String — '2a02:6b8:a::a'
DeviceID String — 'c9792d6e-ab40-42e6-b171-aa872e880dbd'
use:
ClientIP UInt32
ClientIPv6 FixedString(16)
DeviceID FixedString(16)
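Built-in functions handle the conversions; a quick sketch:

SELECT
    IPv4StringToNum('5.255.255.5') AS ip4,    -- UInt32
    IPv6StringToNum('2a02:6b8:a::a') AS ip6,  -- FixedString(16)
    UUIDStringToNum('c9792d6e-ab40-42e6-b171-aa872e880dbd') AS device  -- FixedString(16)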
9. Sometimes it's better to preprocess strings in advance.
Instead of:
BrowserVersion String — '12.3.67542.29'
split the string on insert, e.g. splitByChar('.', BrowserVersion)[1], and store:
BrowserVersion1 UInt8
BrowserVersion2 UInt16
BrowserVersion3 UInt32
BrowserVersion4 UInt16
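A sketch of the conversion (the toUInt*OrZero variants can guard against malformed input):

SELECT
    toUInt8(splitByChar('.', '12.3.67542.29')[1]) AS v1,
    toUInt16(splitByChar('.', '12.3.67542.29')[2]) AS v2,
    toUInt32(splitByChar('.', '12.3.67542.29')[3]) AS v3,
    toUInt16(splitByChar('.', '12.3.67542.29')[4]) AS v4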
10. Very long strings and very long arrays.
Don't store them in ClickHouse at all.
Two options:
1. Truncate on insert.
Example: in one service, only
the first 512 user event parameters are stored;
2. Store outside, and in ClickHouse — hashes.
You can load data into a Log-type table with String fields,
analyze it right in ClickHouse,
and determine the correct types before loading any further data.
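A minimal sketch of that workflow (table and column names are illustrative;
import is assumed to be a MergeTree table with the final types):

CREATE TABLE import_raw (EventTime String, UserID String, URL String) ENGINE = Log

-- after inspecting value ranges with a few SELECTs:
INSERT INTO import SELECT toDateTime(EventTime), toUInt64(UserID), URL FROM import_raw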
:) SHOW TABLES

┌─name──────┐
│ stats_0   │
│ stats_1   │
│ stats_2   │
│ stats_3   │
│ stats_4   │
│ stats_5   │
│ stats_6   │
│ stats_7   │
│ stats_8   │
│ stats_9   │
│ stats_10  │
...
│ stats_999 │
└───────────┘
Old habits from MySQL experience:
— absence of clustered primary key in MyISAM tables in MySQL;
— simpler maintenance operations with small tables;
— microsharding.
1. Data in the table is ordered by primary key.
Reading a range is efficient regardless of the total
amount of data in the table. The primary key is clustered.
2. The table already contains partitioning inside itself.
Each table is a rather heavyweight object:
for every table, some background merge work
must constantly be performed.
3. No problems with maintenance operations.
ALTER ADD/DROP COLUMN executes in O(1).
The table consists of immutable parts, and recovery after failure is trivial.
4. ClickHouse is about throughput.
Processing 10 rows takes about the same time as processing 10,000.
If your tables contain fewer than 10,000 rows,
making them smaller no longer matters.
One big table! Don't be afraid.
Example: we store a table with Graphite metrics,
on one server — 2,000,000,000,000 rows.
Bonus:
Support for custom partitioning key in ClickHouse.
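A sketch using that syntax (illustrative schema):
one table with a partitioning key instead of a thousand tables:

CREATE TABLE stats
(
    EventDate Date,
    CounterID UInt32,
    Value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)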
If you really need many small tables without index
— use StripeLog.
A sharding approach that is reasonable in other systems.
Don't use this approach in ClickHouse.
The idea: data is divided into 1000 virtual shards
in any convenient way.
Each virtual shard maps to a set of servers arbitrarily, and the mapping is stored separately.
On each physical server, multiple instances
— in different databases or on different ports.
The rationale: it simplifies data rebalancing.
Why not use this for ClickHouse?
1. If data is located in different ClickHouse instances,
then query processing by different instances requires
inter-process communication over TCP.
Even over localhost, this is much slower than working in-process.
2. With a single instance, threads pick up data as they become free,
so work is evenly distributed across threads.
With separate instances, some instances finish
earlier than others, and the query waits for the slowest one: tail latency.
3. Even if tables are located in one ClickHouse instance,
you get problems due to a large number of small tables.
When combining tables using a Merge table, some optimizations are disabled.
Pre-aggregation is normal, even in ClickHouse.
But overly granular pre-aggregation is pointless.
Example*:
SummingMergeTree(EventDate,
(OrderID, EventDate, CampaignID, BannerID, AdID, TypeID,
DisplayID, ContextID, TargetType, AbcID, IsGood, CostType, PageID),
8192)
* column names changed.
Disadvantages:
— data volume decreases, but only severalfold compared to non-aggregated data;
— developers constantly ask to extend the primary key, but ClickHouse has no ALTER PRIMARY KEY support;
— among the aggregates, only sums are computed (this can be worked around with AggregatingMergeTree; see the sketch after this list);
— suitable only for primitive reports, not for real analytics.
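A minimal AggregatingMergeTree sketch (the schema is illustrative
and assumes a hits table with Impressions UInt64 and UserID UInt64):

CREATE TABLE campaign_stats
(
    EventDate Date,
    CampaignID UInt32,
    Impressions AggregateFunction(sum, UInt64),
    Users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree(EventDate, (CampaignID, EventDate), 8192)

INSERT INTO campaign_stats
SELECT EventDate, CampaignID, sumState(Impressions), uniqState(UserID)
FROM hits GROUP BY EventDate, CampaignID

SELECT CampaignID, sumMerge(Impressions) AS TotalImpressions, uniqMerge(Users) AS Uniques
FROM campaign_stats GROUP BY CampaignID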
For ClickHouse, non-aggregated data is the best scenario.
Still, pre-aggregation can be useful in simple cases.
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=1764026490'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=132741461'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=4897535466'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=1301256137'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=2149858559'
SELECT count() FROM hits WHERE URL = 'https://yandex.ru/?cid=544614111'
...
SELECT URL, count() FROM hits WHERE URL IN (...) GROUP BY URL
If the index is not used, each query
is a separate full pass over the data.
If the index is used, each query
still captures a bit of extra data,
for example, reading 100,000 rows instead of one row.
Again: many passes over the same data.
All of this can be rewritten as the single GROUP BY query shown above.
Bonuses:
— in ClickHouse, you can safely pass
megabytes of data to the IN section;
— in ClickHouse, the index never works slower than a full scan;
Caveats:
— IN (SELECT ...) doesn't use index;
— in distributed query processing,
query text is transmitted without compression.
What if the queries come from an API?
— throttle API users:
quotas, adaptive throttling.
Bonus: in ClickHouse, you can configure per-user quotas by passing quota_key.
If the API is accessible externally, quotas are mandatory!
«Manual» replication: what is this?
You write the same data in the same way
to completely independent servers.
Advantages:
— absence of ZooKeeper in infrastructure;
— ease of use, integration into existing pipeline.
Disadvantages:
— need to restore replicas manually;
— due to bugs and non-determinism on the application side, the data diverges over time;
— there is no way to decide which copy of the data is correct.
Bonus:
In ClickHouse, it's easy to create a Distributed table,
that looks at «manual» replicas.
Using load_balancing = 'nearest_hostname',
you can avoid result flapping when replicas diverge.
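A sketch (the cluster name manual_replicas is an assumption;
clusters are defined in the server config):

CREATE TABLE hits_all AS hits
ENGINE = Distributed(manual_replicas, default, hits)

SET load_balancing = 'nearest_hostname'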
Proper solution:
Use ReplicatedMergeTree.
Advantages:
— recovery after failure;
— replicas monitor data consistency themselves.
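A minimal sketch with the classic engine syntax
({shard} and {replica} are macro substitutions from the server config; the schema is illustrative):

CREATE TABLE hits_replicated
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits', '{replica}',
    EventDate, (CounterID, EventDate), 8192)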
For main data — use the MergeTree family.
— the table doesn't have to contain a meaningful Date:
you can use Date DEFAULT '2000-01-01',
or create tables without partitioning at all.
Reasons:
— sorting by key, index;
— atomic insert;
— no locks during concurrent INSERT and SELECT;
— immutable data structures on disk;
— replication;
For one-time loads of small-to-medium volumes
— Log.
For temporary batches in intermediate processing
— StripeLog, TinyLog.
For small amounts of temporary data
— Memory.
Example:
SELECT url, count() FROM stats
ANY INNER JOIN urls USING (url_id)
GROUP BY url
Why is this bad?
ClickHouse only supports Hash JOIN; there is no Merge JOIN.
Hash JOIN means random access in RAM.
For high-cardinality joined data, this is worse than reading the data in place.
Solution: for high-cardinality joined data,
it's better to store it directly in the main table (see the sketch below).
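That is, instead of the join above, write the url string into stats at insert time
(a sketch, assuming such a column was added):

SELECT url, count() FROM stats GROUP BY url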
10. Using OPTIMIZE queries to
simulate UPDATE functionality.
11. Using distributed JOINs
with a large right-hand table.
ClickHouse composes query plans for distributed JOINs poorly.
12. Using ClickHouse only to
read data back with SELECT *.
13. Performing excessively heavy computations
inside ClickHouse (a high cycles-per-byte workload).
Website: https://clickhouse.com/
Google groups: https://groups.google.com/forum/#!forum/clickhouse
Mailing list: [email protected]
Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (already 1041 participants)
GitHub: https://github.com/ClickHouse/ClickHouse/
+ meetings. Moscow, Saint Petersburg, Novosibirsk, Ekaterinburg, Minsk, Berlin... Next: Nizhny Novgorod, Moscow.