Building an Analytic Application With ClickHouse

Building an Analytic Application With ClickHouse

Photo by Spicypepper999, 2014

Let's build a demo app on ClickHouse

What for?

— Testing.

— An internal company contest.

— I always like to play with ClickHouse and have fun :)

How?

We will need:

— an idea.

— a nice dataset.

— a way to analyze and visualize it.

A Dataset

I need a large and interesting dataset. How to get it? 🤔

Examples:

— social networks (Reddit, Twitter, Stackoverflow);
— web crawl (CommonCrawl);
— public sector (NYPD crime reports, UK house prices);
— wikipedia dumps and page views;
— maps (OpenStreetMap);
— sensor data (sensor.community);

https://clickhouse.com/docs/en/getting-started/example-datasets

ADS-B ✈️

Automatic Dependent Surveillance–Broadcast

The data is broadcast by transponders
installed on aircraft (and not only airplanes).

Contains a lot of telemetry (coordinates, velocity, heading, roll angle)
and metadata (aircraft type, owner, year).

Unencrypted and open for use without restrictions.

You can collect the data using a software-defined radio! 📡

I want it in ClickHouse!

There are many ADS-B data exchanges:

— adsb.fi
— adsb.one
— adsb.lol
— adsbhub.org
— airplanes.live
— openflights.org
— flightaware.com
— adsbexchange.com
— flightradar24.com
— opensky-network.org

The Easy Part

Download all the data.

Insert into ClickHouse.

The Easy Part

The data is just a bunch of JSON files, produced by the readsb tool.

{"icao":"020100",
 "r":"CN-COH",
 "t":"AT76",                          <-- metadata
 "dbFlags":0,
 "desc":"ATR-72-600",
 "timestamp": 1680307200.000,
 "trace":[                            <-- traces
  [44993.52,36.589873,-8.205326,20000,200.9,340.2,1,0,null,"adsb_icao",21050,32,175,0.7],
  [45022.97,36.615921,-8.216915,20000,200.9,340.2,1,0,null,"adsb_icao",21050,null,null,null],
  [45063.81,36.651352,-8.232708,20000,200.9,340.2,1,0,{"type":"adsb_icao","flight":"RAM980E ","alt_geom":21050,"ias":175,"tas":242,"mach":0.384,"wd":304,"ws":50,"track":340.21,"roll":0.00,"mag_heading":334.34,"true_heading":333.25,"baro_rate":0,"geom_rate":32,"squawk":"7737","category":"A2","nav_qnh":1013.6,"nav_altitude_mcp":20000,"nic":8,"rc":186,"version":2,"nic_baro":1,"nac_p":9,"nac_v":2,"sil":3,"sil_type":"perhour","alert":0,"spi":0},"adsb_icao",21050,32,175,0.0],
  [45081.72,36.667191,-8.239746,20000,200.9,340.2,1,0,null,"adsb_icao",21050,32,175,0.0],
  [45094.87,36.678589,-8.244839,20000,199.6,340.4,0,64,null,"adsb_icao",21050,64,173,0.0],
  [45120.17,36.700993,-8.254765,20000,199.9,340.1,1,0,null,"adsb_icao",21050,0,173,0.0],
  ...

The Easy Part

Join two JSON files and cleanup the data:

clickhouse-local --query "
    INSERT INTO FUNCTION remoteSecure('...aws.clickhouse.cloud',
        default.planes_mercator, 'default', '')
    SELECT
        CAST(timestamp + time_offset AS DateTime64(3)) AS time,
        time::Date AS date,
        icao, r, t, dbFlags, noRegData, ownOp, year, desc,
        lat, lon, toInt32OrZero(altitude), ...
    FROM file('${TMPFILE}.data.jsonl', JSONCompactEachRow) AS d,
         file('${TMPFILE}.meta.jsonl', JSONEachRow) AS m
"

A Table

CREATE TABLE planes_mercator
(
    mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((lon + 180) / 360),
    mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()),
    INDEX idx_x (mercator_x) TYPE minmax,
    INDEX idx_y (mercator_y) TYPE minmax,
    time DateTime64(3),
    date Date,
    icao String,
    r String,
    t LowCardinality(String),
    dbFlags Int32,
    noRegData Bool,
    ownOp LowCardinality(String),
    year UInt16,
    desc LowCardinality(String),
    lat Float64,
    lon Float64,
    altitude Int32,
    ground_speed Float32,
    track_degrees Float32,
    flags UInt32,
    vertical_rate Int32,
    aircraft_alert Int64,
    aircraft_alt_geom Int64,
    aircraft_gva Int64,
    aircraft_nac_p Int64,
    aircraft_nac_v Int64,
    aircraft_nic Int64,
    aircraft_nic_baro Int64,
    aircraft_rc Int64,
    aircraft_sda Int64,
    aircraft_sil Int64,
    aircraft_sil_type LowCardinality(String),
    aircraft_spi Int64,
    aircraft_track Float64,
    aircraft_type LowCardinality(String),
    aircraft_version Int64,
    aircraft_category Enum8(
        'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7',
        'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7',
        'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7',
        'D0', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', ''),
    aircraft_emergency Enum8('', 'none', 'general', 'downed', 'lifeguard',
        'minfuel', 'nordo', 'unlawful', 'reserved'),
    aircraft_flight LowCardinality(String),
    aircraft_squawk String,
    aircraft_baro_rate Int64,
    aircraft_nav_altitude_fms Int64,
    aircraft_nav_altitude_mcp Int64,
    aircraft_nav_modes Array(Enum8('althold', 'approach', 'autopilot', 'lnav', 'tcas', 'vnav')),
    aircraft_nav_qnh Float64,
    aircraft_geom_rate Int64,
    aircraft_ias Int64,
    aircraft_mach Float64,
    aircraft_mag_heading Float64,
    aircraft_oat Int64,
    aircraft_roll Float64,
    aircraft_tas Int64,
    aircraft_tat Int64,
    aircraft_true_heading Float64,
    aircraft_wd Int64,
    aircraft_ws Int64,
    aircraft_track_rate Float64,
    aircraft_nav_heading Float64,
    source LowCardinality(String),
    geometric_altitude Int32,
    geometric_vertical_rate Int32,
    indicated_airspeed Int32,
    roll_angle Float32,
    data_source Enum8('', 'adsb.lol', 'adsbexchange', 'airplanes.live')
)
ENGINE = MergeTree
ORDER BY (mortonEncode(mercator_x, mercator_y), time);
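The two MATERIALIZED columns are a standard Web Mercator projection scaled to the UInt32 range. A sketch of the same arithmetic in Python (the function name is mine):

```python
import math

MAX_UINT32 = 0xFFFFFFFF

def mercator(lat: float, lon: float) -> tuple[int, int]:
    """Project (lat, lon) onto a 32-bit Web Mercator grid,
    mirroring the MATERIALIZED column expressions in the table."""
    x = MAX_UINT32 * ((lon + 180) / 360)
    y = MAX_UINT32 * (0.5 - math.log(math.tan((lat + 90) / 360 * math.pi)) / (2 * math.pi))
    return int(x), int(y)

# The equator at the prime meridian lands in the middle of the grid:
x, y = mercator(0.0, 0.0)
```

Storing the projected coordinates as UInt32 keeps the columns small and makes the Morton-encoded sort key (next slides) a cheap bit operation.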

I have a table, now what?

After a day, the table contains 50 billion records, around 1.6 TB.

What to do with this data?

— something beautiful!

Visualize it on the fly in the browser.

Aggregate and generate reports.

Demo

https://adsb.exposed/

How Is It Possible?

REST API

ClickHouse has an HTTP API included.

const url = `${host}/?user=website&default_format=JSON&param_x=${x}`;
const response = await fetch(url, { method: 'POST', body: sql });
const data = await response.json();

You can query it directly from the web page.

Advanced usage? — use the TypeScript driver.

Any other language? — there are drivers for all of them.
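For a quick script, the same HTTP API can be called from Python using nothing but the standard library. The host name below is a placeholder; `param_<name>` is how ClickHouse binds `{name:Type}` placeholders in the SQL:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_query_url(host: str, user: str, **params: str) -> str:
    """Build a ClickHouse HTTP API URL; each query parameter is passed
    as param_<name>, matching a {name:Type} placeholder in the SQL."""
    query = {'user': user, 'default_format': 'JSON'}
    query.update({f'param_{k}': v for k, v in params.items()})
    return f'{host}/?{urlencode(query)}'

url = build_query_url('https://example.clickhouse.cloud:8443', 'website', x='123')

# The SQL itself goes in the POST body (not executed here):
# sql = 'SELECT {x:UInt16} AS x'
# with urlopen(Request(url, data=sql.encode())) as response:
#     print(response.read().decode())
```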

Parameterized Queries

Type-safe parameterized queries:

WITH
    tile_size * {x:UInt16} AS tile_x_begin,
    tile_size * {y:UInt16} AS tile_y_begin,
    ...
SELECT round(red)::UInt8, round(green)::UInt8, round(blue)::UInt8
FROM {table:Identifier}

Bonus: parameterized views.

CREATE VIEW example AS SELECT ...;

SELECT * FROM example(x = 123, y = 456);

Materialized Views

CREATE TABLE planes_mercator_sample10 AS planes_mercator;
CREATE TABLE planes_mercator_sample100 AS planes_mercator;

CREATE MATERIALIZED VIEW view_sample10 TO planes_mercator_sample10
AS SELECT * FROM planes_mercator WHERE rand() % 10 = 0;

CREATE MATERIALIZED VIEW view_sample100 TO planes_mercator_sample100
AS SELECT * FROM planes_mercator WHERE rand() % 100 = 0;

Can be used for incremental (continuous) data transformations on inserts,
such as aggregation, filtering, and indexing data in different ways.

Role-Based Access Control

CREATE USER website IDENTIFIED WITH sha256_hash BY '...'
SETTINGS
    add_http_cors_header = 1,
    max_result_rows = 1048576,
    ...
    readonly = 1;

GRANT SELECT ON default.planes_mercator TO website;
GRANT SELECT ON default.planes_mercator_sample10 TO website;
GRANT SELECT ON default.planes_mercator_sample100 TO website;

CREATE USER website_progress IDENTIFIED WITH sha256_hash BY ...;
GRANT SELECT ON system.processes TO website_progress;
GRANT REMOTE ON *.* TO website_progress;

Quotas and Query Complexity Limits

CREATE QUOTA website KEYED BY ip_address
    FOR RANDOMIZED INTERVAL 1 MINUTE
        MAX query_selects = 1000, read_rows = 100_000_000_000,
    FOR RANDOMIZED INTERVAL 1 HOUR
        MAX query_selects = 10000, read_rows = 1_000_000_000_000,
    FOR RANDOMIZED INTERVAL 1 DAY
        MAX query_selects = 50000, read_rows = 5_000_000_000_000
    TO website;

CREATE USER ... SETTINGS
    max_result_rows = 1048576,
    enable_http_compression = 1,
    max_execution_time = 180;

Indices On Expressions

CREATE TABLE planes_mercator
(
    mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((lon + 180) / 360),
    mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()),
    INDEX idx_x (mercator_x) TYPE minmax,
    INDEX idx_y (mercator_y) TYPE minmax,
    time DateTime64(3),
    lat Float64,
    lon Float64,
    ...
)
ENGINE = MergeTree
ORDER BY (mortonEncode(mercator_x, mercator_y), time);
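Sorting by mortonEncode(mercator_x, mercator_y) arranges rows along a Z-order curve, so points that are close on the map tend to land in the same granules and a tile query reads only a small slice of the table. A minimal Python sketch of the bit interleaving (the exact bit order of ClickHouse's mortonEncode may differ):

```python
def morton_encode(x: int, y: int, bits: int = 32) -> int:
    """Interleave the bits of x and y (x in the even positions),
    so points near each other in 2D stay near each other in sort order."""
    result = 0
    for i in range(bits):
        result |= ((x >> i) & 1) << (2 * i)      # bit i of x -> position 2i
        result |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> position 2i+1
    return result
```

Combined with the minmax indices on mercator_x and mercator_y, this lets the server skip whole ranges of the table for any rectangular tile.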

Modern, Friendly SQL

WITH
    -- can be used for CTEs and for scalar aliases
    tile_size * {x:UInt16} AS tile_x_begin,
    tile_size * {y:UInt16} AS tile_y_begin,
    ...
    mercator_x >= tile_x_begin AND mercator_x < tile_x_end
        AND mercator_y >= tile_y_begin AND mercator_y < tile_y_end AS in_tile,
    ...
    bitShiftRight(mercator_x - tile_x_begin, 32 - 10 - {z:UInt8}) AS x,
    bitShiftRight(mercator_y - tile_y_begin, 32 - 10 - {z:UInt8}) AS y,
    ...
    -- aliases can reference each other in expressions
    y * 1024 + x AS pos,
    ...
SELECT round(red)::UInt8, round(green)::UInt8, round(blue)::UInt8
FROM {table:Identifier}                      -- query parameters
WHERE in_tile                                -- aliases in any place of a query
GROUP BY pos
ORDER BY pos WITH FILL FROM 0 TO 1024*1024;  -- gap filling
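The tile arithmetic in the WITH clause can be sketched in Python. I assume tile_size = 2**(32 - z), i.e. at zoom level z each axis of the 32-bit Mercator plane is split into 2^z tiles, each rendered as a 1024x1024 bitmap:

```python
def tile_pixel(mercator_x: int, mercator_y: int,
               tile_x: int, tile_y: int, z: int) -> int:
    """Map a 32-bit Mercator point to a flat pixel index inside
    a 1024x1024 tile, mirroring the bitShiftRight expressions."""
    tile_size = 1 << (32 - z)                       # assumed tile extent at zoom z
    px = (mercator_x - tile_size * tile_x) >> (32 - 10 - z)
    py = (mercator_y - tile_size * tile_y) >> (32 - 10 - z)
    return py * 1024 + px                           # as in GROUP BY pos
```

The shift by 32 - 10 - z drops exactly the bits below pixel resolution, so px and py fall in [0, 1024) for any point inside the tile.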

Input/Output Formats

TSV, CSV, CustomSeparated... — text formats;

RowBinary — compact, raw bitmap (we use it for pictures);

JSON, JSONLines, JSONCompact... — good for web UIs;

Parquet, ORC, Arrow, Native... — column-oriented (for data lakes);

Protobuf, Avro, MsgPack... — import/export (for Kafka and similar);

... every format you might need is already supported by ClickHouse!

Compression Out Of The Box

Data inside ClickHouse is compressed
in a column-oriented format (MergeTree);

Compression for external data is automatically recognized
(gz, lz4, zstd, br, snappy, bz2, xz; zip, 7z, tar.*);

Compression over network is supported as well
(Accept-Encoding, Content-Encoding in the HTTP interface);

Infinite Storage

ClickHouse supports:

— local storage;

— object storage (AWS S3);

— hybrid storage (tiering and/or caching);

— external tables (data lakes and querying other databases);

* ClickHouse Cloud has unlimited storage.

Parallel And Distributed Queries

Add CPU and/or machines, and your queries run faster:

— queries parallelize across all CPU cores on the machine
  (you can also limit it);

— queries parallelize using shards or replicas;

Query Result Cache

$ cat /etc/clickhouse-server/config.d/query_cache.yaml

query_result_cache:
    size: 1073741824
    max_entries: 1024
    max_entry_size: 104857600
    max_entry_records: 30000000

SET use_query_cache = 1;

Local And Server Modes

ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).

clickhouse-local
— a small tool for data analytics on local and remote datasets.

All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, BSON, Protobuf, MsgPack...
— schemaful and schemaless;

External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: AWS S3, ...
— external tables: Hudi, Delta Lake...
— from URL and local files; from SQLite.

Preprocessing the Data

With clickhouse-local
— a small tool for data analytics on local and remote datasets.

Similar tools:
textql, octosql, dsq, duckdb, glaredb, trdsql, q, datafusion-cli, spyql.

Install it:

curl https://clickhouse.com/ | sh
./clickhouse local

What Else?

ClickHouse is the best DBMS for analytic applications:

— Fast, scalable, and resource efficient;

— Easy to use and pleasant to work with;

— Robust and reliable;

— Batteries included.

Friends tell friends to use ClickHouse!

Q&A

Photo by SounderBruce, 2016, CC-BY-SA-4.0