Uncovering the Real-World Performance of AWS Graviton4 with ClickHouse

What Is ClickHouse?

An open-source analytic DBMS.

Used everywhere: Microsoft, Spotify, Cloudflare, Lyft, Deutsche Bank,
and thousands of other companies.

Open-source since 2016, currently the most popular analytic DBMS.

Scalable from a laptop or a server to datacenters.

1,482 contributors and 34,700 stars.

What Is ClickHouse?

ClickHouse is like Postgres, but for analytics.

Fast SQL queries with low latency and high concurrency
with real-time insertion.

Available in ClickHouse Cloud, on AWS Marketplace.

Machines

AWS has a lot of instance types:
— m, c, r, x, z, u; t, a; i, im, d; g, p, f, inf, trn, vt, hpc, mac;

With tweaks:
— -d with local disks; -n with faster network; -flex;

Different types and generations of CPU:
— i - Intel, a - AMD; up to 7th generation as of May 2024

Different CPU architectures:
— x86_64 - Intel and AMD; AArch64 (aka ARM64) - Graviton;

Example: r6idn.24xlarge:
RAM-optimized (8 GB per vCPU), 6th-generation Intel CPU,
with local SSD, network-optimized, 4 * 24 = 96 vCPU.
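The naming scheme above can be decoded mechanically. Below is a minimal sketch, assuming names in AWS's dotted form (for example r6idn.24xlarge) and the slide's rule of 4 vCPU per "xlarge" unit. The function name parse_instance_type and the returned field names are hypothetical, and only the letters mentioned on this slide (plus g for Graviton) are recognized.

```python
import re

# Letters after the generation number that name the CPU vendor.
# "i" and "a" come from the slide; "g" marks Graviton instances.
CPU_VENDORS = {"i": "Intel", "a": "AMD", "g": "Graviton"}


def parse_instance_type(name):
    """Split a name like 'r6idn.24xlarge' into its parts (illustrative only)."""
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z]*)(-flex)?\.(\d*)xlarge", name)
    if match is None:
        raise ValueError(f"unsupported instance type name: {name}")
    family, generation, attrs, flex, multiplier = match.groups()
    # The first vendor letter found among the attributes, if any.
    cpu = next((CPU_VENDORS[c] for c in attrs if c in CPU_VENDORS), None)
    return {
        "family": family,
        "generation": int(generation),
        "cpu": cpu,
        "local_disk": "d" in attrs,
        "fast_network": "n" in attrs,
        "flex": flex is not None,
        # Assumption from the slide: an Nxlarge size has 4 * N vCPU.
        "vcpus": 4 * int(multiplier or 1),
    }


print(parse_instance_type("r6idn.24xlarge"))
```

Sizes such as "large" or "metal" are deliberately rejected here to keep the sketch short.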

Graviton

AWS Graviton is an AArch64 CPU, custom-built by AWS.

2018 - 1st generation; ... 2024 - 4th generation (currently in preview!)

It is a different architecture, so not all software works out of the box.
Some of it has to be adapted, compiled, and tested on ARM.

Good news: ClickHouse works! deb, rpm, tgz are available for AArch64

Quick installation autodetects the architecture:

curl https://clickhouse.com/ | sh
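To illustrate what "autodetects the architecture" means in general, here is a small sketch that maps the machine name reported by the operating system to one of the two architectures discussed in this talk. It is not the logic of the ClickHouse install script. The alias table and the function name normalize_arch are assumptions made for this example.

```python
import platform

# Common spellings reported by different operating systems (assumed list).
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "AArch64",
    "arm64": "AArch64",
}


def normalize_arch(machine):
    """Map a raw machine string to 'x86_64' or 'AArch64'."""
    try:
        return ARCH_ALIASES[machine.lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


if __name__ == "__main__":
    # platform.machine() returns e.g. 'x86_64' on Intel/AMD Linux hosts
    # and 'aarch64' on Graviton Linux hosts.
    print(normalize_arch(platform.machine()))
```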

Let's Play!

Step 1: run a benchmark on every generation.

ClickBench: https://benchmark.clickhouse.com/hardware

— fully automated benchmark, runs on all instance types;
— attempts to mimic a clickstream analytics workload.

Btw, you can run it yourself and submit the results.

Results (demo): https://pastila.nl/?000b1ba6/c224ddf960900f4f2d0d9e100cef5445.html

How To Decide?

Throughput on a single query:
— how quickly a massively parallel query runs, e.g. 1 second vs 1.5 seconds;
— depends on the number of CPU cores, aggregate CPU performance, and memory bandwidth,
and on how well the software is optimized for a particular instruction set.

Latency on short queries:
— how quickly a small query runs, e.g. 25 ms vs 50 ms;
— depends on the speed of a single CPU core.

Total load capacity:
— how many concurrent users and QPS we can sustain;
— depends on the aggregate CPU performance and memory bandwidth.
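Latency and total capacity are separate numbers, and a toy calculation makes the difference concrete. The sketch below assumes each completed query is recorded as a (start, end) pair in seconds. The function summarize and the sample timings are made up for illustration and are not benchmark data.

```python
from statistics import median


def summarize(runs):
    """runs: list of (start_s, end_s) pairs for completed queries."""
    latencies = [end - start for start, end in runs]
    # Wall-clock span from the first start to the last finish.
    wall = max(end for _, end in runs) - min(start for start, _ in runs)
    return {
        "median_latency_s": median(latencies),
        "qps": len(runs) / wall,
    }


# One client running four 1-second queries back to back.
one_client = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Two clients running the same four queries concurrently.
two_clients = [(0, 1), (0, 1), (1, 2), (1, 2)]

print(summarize(one_client))
print(summarize(two_clients))
```

In this toy case the per-query latency is the same in both runs, while the sustained QPS doubles with the second client. A machine can therefore win on one metric and not on the other.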

How To Decide?

Availability of the instances:
— in particular regions; in particular configurations.
— ask your AWS architect.

Software compatibility:
— and how well it is tested on this architecture.

Cost/performance:
— this also depends on which performance metric is being compared.

Let's Play!

Step 2: estimate total capacity.

By running all ClickBench's queries in parallel:

clickhouse-benchmark -c32 -i1000 < queries.sql

machine        QPS             cost
r7i.8xlarge    2.800           $2.0160
r7g.8xlarge    3.500 (+25%)    $1.7136 (-15%)
r8g.8xlarge    4.595 (+64%)    $1.8851 (-9%)
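The two columns can be combined into a single cost/performance figure. This is a sketch, assuming the cost column is an hourly price in USD, which the slide does not state. The function name qps_per_dollar is hypothetical.

```python
# (QPS, cost) pairs copied from the table above.
machines = {
    "r7i.8xlarge": (2.800, 2.0160),
    "r7g.8xlarge": (3.500, 1.7136),
    "r8g.8xlarge": (4.595, 1.8851),
}


def qps_per_dollar(qps, cost):
    """Sustained QPS bought per unit of cost (assumed to be USD per hour)."""
    return qps / cost


baseline = qps_per_dollar(*machines["r7i.8xlarge"])
for name, (qps, cost) in machines.items():
    value = qps_per_dollar(qps, cost)
    print(f"{name}: {value:.2f} QPS per dollar, {value / baseline:.2f}x the r7i baseline")
```

Because the Graviton machines are both faster and cheaper in this table, the combined ratio improves by more than either column alone.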

Should we use Graviton today?

Summary comparison with contemporary Intel/AMD machines:

Graviton 1 (2018):
— low-powered machines, not comparable in performance.

Graviton 2 (2020):
— comparable throughput, but single-core performance is lower.

Graviton 3 (2022):
— better throughput and comparable single-core performance.

Graviton 4 (2024):
— even better throughput, lower latency, and more cores 😋.

Can we use Graviton today?

In ClickHouse Cloud.

It looks obvious: we can get more power for a lower price!

But it is not so obvious...

Availability of disk instances
ClickHouse Cloud uses S3, with local SSDs for cache. But Graviton 3 instances with local SSDs became available in the required regions only recently*.
* We are introducing a "distributed cache" to decouple disks and remove this requirement.

Live migration
A cluster should be able to run in a hybrid mode — some replicas x86_64, some AArch64.

Can we use Graviton today?

Orchestration and infrastructure
All components have to be ported to AArch64 as well.

Full continuous integration with all test suites
had to be enabled on Graviton instance types.

Feature parity
Every existing feature should work on AArch64, even rarely used ones.
Especially our own debugging and introspection capabilities.

Pricing and performance consistency
We cannot randomly give out machines that are 2x more powerful in a subset of regions,
as it could lead to surprises for customers.

Can we use Graviton today?

We have to do it! The advantages are overwhelming.

So we prepared everything and migrated
our staging environment to Graviton :) (m7gd)

We use the staging environment for testing and for personal, internal,
and demo projects.

Case 1: CI Logs Cluster

A cluster in the Cloud that collects logs from all builds and tests.

We run ~2,000,000 tests every day, and each test generates a lot of logs.

4.3 trillion rows, 65 TiB compressed data, 1.39 PiB uncompressed.

Let's run a heavy query... Scan a table with 1.08 trillion records.

SELECT sum(cityHash64(*)) AS x FROM build_time_trace

Case 1: CI Logs Cluster

Let's run a heavy query... Scan a table with 1.08 trillion records:

clickhouse-cloud :) SELECT sum(cityHash64(*)) AS x FROM build_time_trace

┌────────────────────x─┐
│ 18424954377991503633 │
└──────────────────────┘

Elapsed: 3745.202 sec. Processed 1.07 trillion rows, 374.14 TB (285.45 million rows/s., 99.90 GB/s.)

This is before the migration to Graviton.

How much faster is it after the migration?

Case 1: CI Logs Cluster

How much faster is it after the migration to Graviton?

clickhouse-cloud :) SELECT sum(cityHash64(*)) AS x FROM build_time_trace

┌────────────────────x─┐
│ 18424954377991503633 │
└──────────────────────┘

Elapsed: 3395.191 sec. Processed 1.08 trillion rows, 376.63 TB (316.99 million rows/s., 110.93 GB/s.)

— about 10% faster.

It was mostly network-bound, reading from S3,
and we should rather have used network-optimized instances.
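The "about 10%" figure follows directly from the two query summaries. A small sketch of the arithmetic is below. The helper names are hypothetical, and TB and GB are taken as decimal units (1 TB = 1000 GB), which is an assumption about how the client reports them.

```python
def throughput_gb_per_s(terabytes, seconds):
    """Average scan rate, assuming decimal units (1 TB = 1000 GB)."""
    return terabytes * 1000 / seconds


def speedup_percent(before_s, after_s):
    """How much faster the second run finished, in percent."""
    return (before_s / after_s - 1) * 100


# Elapsed times and data volumes copied from the two runs above.
before = throughput_gb_per_s(374.14, 3745.202)
after = throughput_gb_per_s(376.63, 3395.191)

print(f"before: {before:.2f} GB/s, after: {after:.2f} GB/s")
print(f"elapsed time improvement: {speedup_percent(3745.202, 3395.191):.1f}%")
```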

Case 2: A Public Demo

Demo: https://adsb.exposed/

r6i.metal: 16.27 GB/sec;

r8g.24xlarge (Graviton 4): 26.71 GB/sec;

— 64% faster!

Takeaways

If you can use Graviton, you should already do so.

Graviton 4 is going to be amazing... the only question is availability.

We will make ClickHouse Cloud even faster!

Q&A