What for?
— Testing.
— An internal company contest.
— I always like to play with ClickHouse and have fun :)
We will need:
— an idea.
— a nice dataset.
— a way to analyze and visualize it.
I need a large and interesting dataset. How to get it? 🤔
Examples:
— social networks (Reddit, Twitter, Stackoverflow);
— web crawl (CommonCrawl);
— public sector (NYPD crime reports, UK house prices);
— Wikipedia dumps and page views;
— maps (OpenStreetMap);
— sensor data (sensor.community);
https://clickhouse.com/docs/en/getting-started/example-datasets
Automatic Dependent Surveillance–Broadcast
The data is broadcast by "transponders"
installed on airplanes (and not only airplanes).
Contains a lot of telemetry (coordinates, velocity, heading, roll angle)
and metadata (aircraft type, owner, year).
Unencrypted and open for use without restrictions.
You can collect the data using a software-defined radio! 📡
There are many ADS-B data exchanges:
— adsb.fi
— adsb.one
— adsb.lol
— adsbhub.org
— airplanes.live
— openflights.org
— flightaware.com
— adsbexchange.com
— flightradar24.com
— opensky-network.org
Download all the data.
Insert into ClickHouse.
The data is just a bunch of JSON files, produced by the readsb tool.
{"icao":"020100",
"r":"CN-COH",
"t":"AT76", <-- metadata
"dbFlags":0, traces
"desc":"ATR-72-600", |
"timestamp": 1680307200.000, v
"trace":[
[44993.52,36.589873,-8.205326,20000,200.9,340.2,1,0,null,"adsb_icao",21050,32,175,0.7],
[45022.97,36.615921,-8.216915,20000,200.9,340.2,1,0,null,"adsb_icao",21050,null,null,null],
[45063.81,36.651352,-8.232708,20000,200.9,340.2,1,0,{"type":"adsb_icao","flight":"RAM980E ","alt_geom":21050,"ias":175,"tas":242,"mach":0.384,"wd":304,"ws":50,"track":340.21,"roll":0.00,"mag_heading":334.34,"true_heading":333.25,"baro_rate":0,"geom_rate":32,"squawk":"7737","category":"A2","nav_qnh":1013.6,"nav_altitude_mcp":20000,"nic":8,"rc":186,"version":2,"nic_baro":1,"nac_p":9,"nac_v":2,"sil":3,"sil_type":"perhour","alert":0,"spi":0},"adsb_icao",21050,32,175,0.0],
[45081.72,36.667191,-8.239746,20000,200.9,340.2,1,0,null,"adsb_icao",21050,32,175,0.0],
[45094.87,36.678589,-8.244839,20000,199.6,340.4,0,64,null,"adsb_icao",21050,64,173,0.0],
[45120.17,36.700993,-8.254765,20000,199.9,340.1,1,0,null,"adsb_icao",21050,0,173,0.0],
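Each trace element is a positional array. Judging by readsb's JSON format notes and the table columns below, the fields decode roughly as:

[seconds_after_timestamp, lat, lon, altitude, ground_speed, track_degrees,
 flags, vertical_rate, aircraft_metadata_or_null, source,
 geometric_altitude, geometric_vertical_rate, indicated_airspeed, roll_angle]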
Join the two JSON files and clean up the data:
clickhouse-local --query "
INSERT INTO FUNCTION remoteSecure('...aws.clickhouse.cloud',
default.planes_mercator, 'default', '')
SELECT
CAST(timestamp + time_offset AS DateTime64(3)) AS time,
time::Date AS date,
icao, r, t, dbFlags, noRegData, ownOp, year, desc,
lat, lon,
toInt32OrZero(altitude),
...
FROM
file('${TMPFILE}.data.jsonl', JSONCompactEachRow) AS d,
file('${TMPFILE}.meta.jsonl', JSONEachRow) AS m
"
CREATE TABLE planes_mercator
(
mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((lon + 180) / 360),
mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()),
INDEX idx_x (mercator_x) TYPE minmax,
INDEX idx_y (mercator_y) TYPE minmax,
time DateTime64(3),
date Date,
icao String,
r String,
t LowCardinality(String),
dbFlags Int32,
noRegData Bool,
ownOp LowCardinality(String),
year UInt16,
desc LowCardinality(String),
lat Float64,
lon Float64,
altitude Int32,
ground_speed Float32,
track_degrees Float32,
flags UInt32,
vertical_rate Int32,
aircraft_alert Int64,
aircraft_alt_geom Int64,
aircraft_gva Int64,
aircraft_nac_p Int64,
aircraft_nac_v Int64,
aircraft_nic Int64,
aircraft_nic_baro Int64,
aircraft_rc Int64,
aircraft_sda Int64,
aircraft_sil Int64,
aircraft_sil_type LowCardinality(String),
aircraft_spi Int64,
aircraft_track Float64,
aircraft_type LowCardinality(String),
aircraft_version Int64,
aircraft_category Enum8(
'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7',
'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7',
'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7',
'D0', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7',
''),
aircraft_emergency Enum8('', 'none', 'general', 'downed', 'lifeguard', 'minfuel', 'nordo', 'unlawful', 'reserved'),
aircraft_flight LowCardinality(String),
aircraft_squawk String,
aircraft_baro_rate Int64,
aircraft_nav_altitude_fms Int64,
aircraft_nav_altitude_mcp Int64,
aircraft_nav_modes Array(Enum8('althold', 'approach', 'autopilot', 'lnav', 'tcas', 'vnav')),
aircraft_nav_qnh Float64,
aircraft_geom_rate Int64,
aircraft_ias Int64,
aircraft_mach Float64,
aircraft_mag_heading Float64,
aircraft_oat Int64,
aircraft_roll Float64,
aircraft_tas Int64,
aircraft_tat Int64,
aircraft_true_heading Float64,
aircraft_wd Int64,
aircraft_ws Int64,
aircraft_track_rate Float64,
aircraft_nav_heading Float64,
source LowCardinality(String),
geometric_altitude Int32,
geometric_vertical_rate Int32,
indicated_airspeed Int32,
roll_angle Float32,
data_source Enum8('', 'adsb.lol', 'adsbexchange', 'airplanes.live')
) ENGINE = MergeTree ORDER BY (mortonEncode(mercator_x, mercator_y), time);
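mortonEncode interleaves the bits of mercator_x and mercator_y, so points that are close on the map end up close in the primary key. A rough sketch of the kind of query this layout accelerates (the tile bounds are made up):
SELECT count()
FROM planes_mercator
WHERE mercator_x BETWEEN 0x40000000 AND 0x40400000
  AND mercator_y BETWEEN 0x40000000 AND 0x40400000;
-- the minmax indexes plus the Morton-ordered key let ClickHouse
-- skip most granules outside the bounding box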
After a day, the table contains 50 billion records, around 1.6 TB.
What to do with this data?
— something beautiful!
Visualize it on the fly in the browser.
Aggregate and generate reports.
ClickHouse has an HTTP API included.
const url = `${host}/?user=website&default_format=JSON&param_x=${x}`;
const response = await fetch(url, { method: 'POST', body: sql });
const data = await response.json();
You can query it directly from the web page.
Advanced usage? — use the TypeScript driver.
Any other language? — there are drivers for all of them.
Type-safe parameterized queries:
WITH
tile_size * {x:UInt16} AS tile_x_begin,
tile_size * {y:UInt16} AS tile_y_begin,
...
SELECT round(red)::UInt8, round(green)::UInt8, round(blue)::UInt8
FROM {table:Identifier}
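Parameters are typed and substituted on the server, so there is no string splicing. A minimal self-contained example (made-up values):
SELECT {x:UInt16} * {y:UInt16} AS product;
-- run with: clickhouse-client --param_x=6 --param_y=7 --query "..."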
Bonus: parameterized views, e.g. (a minimal sketch):
CREATE VIEW example AS SELECT {x:UInt32} + {y:UInt32} AS total;
SELECT * FROM example(x = 123, y = 456);
CREATE TABLE planes_mercator_sample10 AS planes_mercator;
CREATE TABLE planes_mercator_sample100 AS planes_mercator;
CREATE MATERIALIZED VIEW view_sample10 TO planes_mercator_sample10
AS SELECT * FROM planes_mercator WHERE rand() % 10 = 0;
CREATE MATERIALIZED VIEW view_sample100 TO planes_mercator_sample100
AS SELECT * FROM planes_mercator WHERE rand() % 100 = 0;
Can be used for incremental (continuous) data transformations on inserts,
such as aggregation, filtering, and indexing data in different ways.
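For example, a continuously maintained daily counter could look like this (a sketch; flights_per_day and its view are made-up names):
CREATE TABLE flights_per_day (date Date, c UInt64)
ENGINE = SummingMergeTree ORDER BY date;

CREATE MATERIALIZED VIEW flights_per_day_mv TO flights_per_day
AS SELECT date, count() AS c FROM planes_mercator GROUP BY date;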
CREATE USER website IDENTIFIED WITH sha256_hash BY '...'
SETTINGS
add_http_cors_header = 1,
max_result_rows = 1048576,
...
readonly = 1;
GRANT SELECT ON default.planes_mercator TO website;
GRANT SELECT ON default.planes_mercator_sample10 TO website;
GRANT SELECT ON default.planes_mercator_sample100 TO website;
CREATE USER website_progress IDENTIFIED WITH sha256_hash BY ...;
GRANT SELECT ON system.processes TO website_progress;
GRANT REMOTE ON *.* TO website_progress;
CREATE QUOTA website
KEYED BY ip_address
FOR RANDOMIZED INTERVAL 1 MINUTE
MAX query_selects = 1000, read_rows = 100_000_000_000,
FOR RANDOMIZED INTERVAL 1 HOUR
MAX query_selects = 10000, read_rows = 1_000_000_000_000,
FOR RANDOMIZED INTERVAL 1 DAY
MAX query_selects = 50000, read_rows = 5_000_000_000_000
TO website;
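To check that the quota is actually enforced, consumption is visible from SQL (a sketch):
SHOW QUOTA;                        -- usage for the current user
SELECT * FROM system.quota_usage;  -- the same, as a table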
CREATE USER ... SETTINGS
max_result_rows = 1048576,
enable_http_compression = 1,
max_execution_time = 180;
CREATE TABLE planes_mercator
(
mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((lon + 180) / 360),
mercator_y UInt32 MATERIALIZED
0xFFFFFFFF * (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()),
INDEX idx_x (mercator_x) TYPE minmax,
INDEX idx_y (mercator_y) TYPE minmax,
time DateTime64(3),
lat Float64,
lon Float64,
...
) ENGINE = MergeTree
ORDER BY (mortonEncode(mercator_x, mercator_y), time);
WITH -- can be used for CTE and for scalar aliases
tile_size * {x:UInt16} AS tile_x_begin,
tile_size * {y:UInt16} AS tile_y_begin,
...
mercator_x >= tile_x_begin AND mercator_x < tile_x_end
AND mercator_y >= tile_y_begin AND mercator_y < tile_y_end AS in_tile,
...
bitShiftRight(mercator_x - tile_x_begin, 32 - 10 - {z:UInt8}) AS x,
bitShiftRight(mercator_y - tile_y_begin, 32 - 10 - {z:UInt8}) AS y,
...
-- aliases can reference each other in expressions;
y * 1024 + x AS pos,
...
SELECT round(red)::UInt8, round(green)::UInt8, round(blue)::UInt8
FROM {table:Identifier} -- query parameters
WHERE in_tile -- aliases in any place of a query
GROUP BY pos
ORDER BY pos WITH FILL FROM 0 TO 1024*1024 -- gap filling;
TSV, CSV, CustomSeparated... — text formats;
RowBinary — compact, raw bitmap (we use it for pictures);
JSON, JSONLines, JSONCompact... — good for web UIs;
Parquet, ORC, Arrow, Native... — column-oriented (for data lakes);
Protobuf, Avro, MsgPack... — import/export (for Kafka and similar);
... every format you might need is already supported by ClickHouse!
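This also makes one-line format conversion possible; a sketch with made-up file names (both formats are inferred from the extensions):
SELECT * FROM file('input.csv')
INTO OUTFILE 'output.parquet';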
Data inside ClickHouse is compressed
in a column-oriented format (MergeTree);
Compression for external data is automatically recognized
(gz, lz4, zstd, br, snappy, bz2, xz; zip, 7z, tar.*);
Compression over network is supported as well
(Accept-Encoding, Content-Encoding in the HTTP interface);
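So a gzipped CSV needs no special handling; a sketch with a made-up file name:
SELECT count() FROM file('events.csv.gz');
-- format and compression both inferred from the extension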
ClickHouse supports:
— local storage;
— object storage (AWS S3);
— hybrid storage (tiering and/or caching);
— external tables (data lakes and querying other databases);
* ClickHouse Cloud has unlimited storage.
Add CPU and/or machines, and your queries run faster:
— queries parallelize across all CPU cores on the machine
(you can also limit it);
— queries parallelize using shards or replicas;
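The per-query limit mentioned above is a single setting; for example:
SET max_threads = 4;  -- cap this session's queries at 4 threads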
$ cat /etc/clickhouse-server/config.d/query_cache.yaml
query_cache:
    max_size_in_bytes: 1073741824
    max_entries: 1024
    max_entry_size_in_bytes: 104857600
    max_entry_size_in_rows: 30000000
SET use_query_cache = 1;
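It can also be enabled for a single query with a SETTINGS clause:
SELECT count() FROM planes_mercator SETTINGS use_query_cache = 1;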
ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).
clickhouse-local
— a small tool for data analytics on local and remote datasets.
All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, BSON, Protobuf, MsgPack...
— schemaful and schemaless;
External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: AWS S3, ...
— external tables: Hudi, Delta Lake...
— from URL and local files; from SQLite.
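For example, counting the rows of a remote compressed CSV is one invocation (the URL is made up):
clickhouse-local --query "
    SELECT count()
    FROM url('https://example.com/data.csv.gz', CSVWithNames)
"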
Similar tools:
textql, octosql, dsq, duckdb, glaredb, trdsql, q, datafusion-cli, spyql.
Install it:
curl https://clickhouse.com/ | sh
./clickhouse local
ClickHouse is the best DBMS for analytic applications:
— Fast, scalable, and resource-efficient;
— Easy to use and pleasant to work with;
— Robust and reliable;
— Batteries included.
Friends tell friends to use ClickHouse!