— data collection;
— data preparation;
— data loading;
— data visualization;
Apache Flink, Apache Beam, Apache Kafka, Apache Pulsar, Apache Spark, Debezium, Decodable, Airflow, Airbyte, dbt, Grafana, Apache Superset, Metabase, AWS Lambda, Prometheus, Apache Iceberg, Databricks, Datacracks, Kubernetes, Envoy, Istio, Node.js, Deno, React, Typescript, Webpack
— I have no idea what I want to get...
— let's take some interesting dataset,
and we will find something!
Logs, Metrics, Time-Series, Events, Clicks, Posts...
... we want to analyze everything!
Plenty of good, public, regularly updated datasets exist:
— Internet Archive
— Zenodo.org
— Common Crawl, HTTP Archive
— Wikipedia
— GitHub Archive
— Blockchains
— Network scans
Wikimedia Downloads:
https://dumps.wikimedia.org/
— wiki dumps;
— edits history;
— pageview statistics;
https://dumps.wikimedia.org/other/pageviews/
— public domain!
— updated hourly.
— ~70 000 files, 3.5 TB gzip.
Let's download it.
Get the list of links:
for i in {2015..2023}; do
    for j in {01..12}; do
        echo "${i}-${j}" >&2
        curl -sSL "https://dumps.wikimedia.org/other/pageviews/$i/$i-$j/" \
            | grep -oE 'pageviews-[0-9]+-[0-9]+\.gz'
    done
done | sort | uniq | tee links.txt
Download them (rate limited, max 3 connections):
sed -r 's!pageviews-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz!https://dumps.wikimedia.org/other/pageviews/\1/\1-\2/\0!' \
    links.txt | xargs -P3 wget --continue
Preview the data:
en.m FORU_Oceania_Cup 1 0
en.m FORscene 1 0
en.m FOSB 1 0
en.m FOSDEM 1 0
en.m FOSD_metamodels 1 0
(columns: language.subproject, page title, pageviews, unused)
What is the data format and how to process it?
Not CSV, not TSV, not JSON... whitespace-separated.
How to load it? I don't want to use sed, awk, perl.
With clickhouse-local
— a small tool for data analytics on local and remote datasets.
ClickHouse can work as a server (clickhouse-server)
or as a tool without installation (clickhouse-local).
All possible data formats:
— text: CSV, TSV, JSON, Values, MySQLDump, Regexp...
— binary: Parquet, Arrow, ORC, Avro, BSON, Protobuf, MsgPack...
— schemaful and schemaless;
External data:
— remote databases: MySQL, PostgreSQL, MongoDB, ODBC, JDBC...
— object storages: S3, HDFS, Azure...
— external tables: Hudi, Delta Lake...
— from URL and local files; from SQLite.
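As an aside, the whitespace-separated pageview lines could also be parsed declaratively with the Regexp input format from the list above (a sketch with illustrative column names; the rest of this talk reads the lines as LineAsString and splits them in SQL instead):
SELECT *
FROM file('pageviews*.gz', Regexp,
    'project String, path String, hits UInt64')
LIMIT 10
SETTINGS format_regexp = '([^ ]+) ([^ ]+) (\\d+) \\d+',
    format_regexp_escaping_rule = 'Raw',
    format_regexp_skip_unmatched = 1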
With clickhouse-local
— a small tool for data analytics on local and remote datasets.
Similar tools: textql, octosql, dsq, duckdb, trdsql, q, datafusion-cli, spyql.
Install it:
curl https://clickhouse.com/ | sh
./clickhouse local
$ clickhouse-local
ClickHouse local version 23.2.1.726 (official build).
:) SELECT * FROM url(
'https://dumps.wikimedia.org/.../pageviews-20230203-{00..23}0000.gz',
LineAsString)
LIMIT 10
┌─line───────────────────────────────┐
│ aa Category:Top_Level 1 0 │
│ aa File:Bilgine-sor.jpg 6 0 │
│ aa Main_Page 26 0 │
│ aa Main_page 2 0 │
│ aa Special:UserLogin 2 0 │
│ aa Talk:Main_Page 1 0 │
│ aa User:127.0.0.1 1 0 │
│ aa User:Afar_god 2 0 │
│ aa Wikipedia:Community_Portal 1 0 │
│ aa.b File:Incubator-notext.svg 1 0 │
└────────────────────────────────────┘
:) SELECT _file, * FROM file('*.gz', LineAsString) WHERE line LIKE '%FOSDEM%'
┌─_file────────────────────────┬─line──────────────────────────────────────────────────────────────┐
│ pageviews-20150501-030000.gz │ commons.m File:FOSDEM2007Logo.png 1 0 │
│ pageviews-20150501-030000.gz │ commons.m Image:Dries_Buytaert_at_FOSDEM_2008_by_Wikinews.jpg 1 0 │
└──────────────────────────────┴───────────────────────────────────────────────────────────────────┘
Data cleaning, preparation, conversion:
WITH
parseDateTimeBestEffort(extract(_file, '^pageviews-([\\d\\-]+)\\.gz$')) AS time,
splitByChar(' ', line) AS values,
splitByChar('.', values[1]) AS projects
SELECT
time,
projects[1] AS project,
projects[2] AS subproject,
decodeURLComponent(values[2]) AS path,
CAST(values[3], 'UInt64') AS hits
FROM file('pageviews*.gz', LineAsString)
WHERE length(values) >= 3
LIMIT 10
┌────────────────time─┬─project─┬─subproject─┬─path────────────────────────────┬─hits─┐
│ 2015-07-03 12:00:00 │ aa │ │ Main_Page │ 2 │
│ 2015-07-03 12:00:00 │ aa │ │ Talk:Main_Page │ 1 │
│ 2015-07-03 12:00:00 │ aa │ m │ Main_Page │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ 2005 │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ 2017 │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Акатегориа:Имереҭи │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Акатегориа:Польша │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:Everton_da_Costa_Neves │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:Vikipedisssttt │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:WikiDreamer_Bot │ 1 │
└─────────────────────┴─────────┴────────────┴─────────────────────────────────┴──────┘
10 rows in set. Elapsed: 0.138 sec.
Data cleaning, preparation, conversion:
WITH
parseDateTimeBestEffort(extract(_file, '^pageviews-([\\d\\-]+)\\.gz$')) AS time,
splitByChar(' ', line) AS values,
splitByChar('.', values[1]) AS projects
SELECT
time,
projects[1] AS project,
projects[2] AS subproject,
decodeURLComponent(values[2]) AS path,
CAST(values[3], 'UInt64') AS hits
FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews*.gz',
LineAsString)
WHERE length(values) >= 3
LIMIT 10
┌────────────────time─┬─project─┬─subproject─┬─path────────────────────────────┬─hits─┐
│ 2015-07-03 12:00:00 │ aa │ │ Main_Page │ 2 │
│ 2015-07-03 12:00:00 │ aa │ │ Talk:Main_Page │ 1 │
│ 2015-07-03 12:00:00 │ aa │ m │ Main_Page │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ 2005 │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ 2017 │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Акатегориа:Имереҭи │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Акатегориа:Польша │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:Everton_da_Costa_Neves │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:Vikipedisssttt │ 1 │
│ 2015-07-03 12:00:00 │ ab │ │ Алахәыла:WikiDreamer_Bot │ 1 │
└─────────────────────┴─────────┴────────────┴─────────────────────────────────┴──────┘
10 rows in set. Elapsed: 0.138 sec.
Install a real ClickHouse server instead of clickhouse-local.
— but clickhouse-server, clickhouse-local, and clickhouse-client
are actually the same single binary!
./clickhouse server # run server in the current directory
./clickhouse install # install ClickHouse with configs, logs, etc
sudo clickhouse start # run the installed server
clickhouse-client # connect the client
CREATE TABLE wikistat
(
    time DateTime CODEC(Delta, ZSTD(3)),   -- timestamps are mostly sequential, so Delta + ZSTD compresses them well
    project LowCardinality(String),        -- few distinct values: dictionary-encoded
    subproject LowCardinality(String),
    path String CODEC(ZSTD(3)),
    hits UInt64 CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (path, time);                     -- rows for the same page are stored together, ordered by time
Kafka, Pulsar, Airbyte, dbt...?
INSERT SELECT ... FROM s3!
INSERT INTO wikistat WITH
parseDateTimeBestEffort(extract(_file, '^pageviews-([\\d\\-]+)\\.gz$')) AS time,
splitByChar(' ', line) AS values,
splitByChar('.', values[1]) AS projects
SELECT
time,
projects[1] AS project,
projects[2] AS subproject,
decodeURLComponent(values[2]) AS path,
CAST(values[3], 'UInt64') AS hits
FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews*.gz',
LineAsString)
WHERE length(values) >= 3
It took 8 hours to load. The table size on disk is 720 GB (original 3.5 TB).
The speed degraded to 14 million rows/second over time.
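While such a long INSERT is running, its progress can be checked from another client session, for example by summing the table's active data parts (a sketch, not part of the original workflow):
SELECT
    formatReadableQuantity(sum(rows)) AS rows_inserted,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active AND database = 'default' AND table = 'wikistat'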
:) SELECT count() FROM wikistat
┌──────count()─┐
│ 380325960756 │
└──────────────┘
:) SELECT sum(hits) FROM wikistat
┌─────sum(hits)─┐
│ 1340877721420 │
└───────────────┘
:) SELECT total_bytes / total_rows FROM system.tables
WHERE name = 'wikistat' AND database = 'default'
┌─divide(total_bytes, total_rows)─┐
│ 2.032924892694963 │
└─────────────────────────────────┘
:) SELECT name,
formatReadableSize(data_compressed_bytes) AS compr,
formatReadableSize(data_uncompressed_bytes) AS orig,
round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns WHERE table = 'wikistat' AND database = 'default'
ORDER BY data_compressed_bytes DESC
┌─name───────┬─compr──────┬─orig───────┬──ratio─┐
│ time │ 258.32 GiB │ 1.38 TiB │ 5.48 │
│ subproject │ 170.14 GiB │ 452.88 GiB │ 2.66 │
│ hits │ 114.95 GiB │ 2.77 TiB │ 24.65 │
│ project │ 84.25 GiB │ 709.66 GiB │ 8.42 │
│ path │ 29.00 GiB │ 7.67 TiB │ 270.85 │
└────────────┴────────────┴────────────┴────────┘
I still have no idea why I need these
hundreds of billions of records of something...
— let's make realtime dashboards!
Grafana? Superset? Metabase? Tableau? Observable?
— why use Grafana if we can write our own Grafana in a day!
What JS framework to use? React? Vue? Svelte?
Should I use Typescript? npm?
— no, we will use modern JavaScript
and a random library for charts.
You write everything in a single static .html file.
No frameworks. No build systems. No friends.
What charting library to use?
— https://github.com/leeoniya/uPlot/
μPlot is a fast, memory-efficient library... for plotting time series
— sold.
How to query the database?
— directly from web browser, with REST API.
async function load() {
    // POST the query text to the ClickHouse HTTP interface;
    // FORMAT JSON returns {meta, data, rows, statistics}
    const response = await fetch(
        clickhouse_url,
        { method: "POST", body: `
            SELECT ...
            FROM wikistat
            FORMAT JSON` });
    ...
}
— authentication and access control;
— quotas, rate limiting, query complexity limiting;
— parameterized queries and views;
— configurable URL handlers;
— CORS configuration;
Open ClickHouse to the Internet and get away with it.
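For example, the dashboard can be given its own restricted, rate-limited user (a sketch; the user name, password, and quota numbers are made up for illustration):
CREATE USER dashboard IDENTIFIED BY 'change-me' SETTINGS readonly = 1;
GRANT SELECT ON default.wikistat TO dashboard;
CREATE QUOTA dashboard_quota FOR INTERVAL 1 minute MAX queries = 100 TO dashboard;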
SELECT toMonday(time), sum(hits)
FROM wikistat WHERE path = {path:String}
GROUP BY 1 ORDER BY 1
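The same query can also be wrapped into a parameterized view, a recent ClickHouse feature, so the dashboard only has to pass the parameter (a sketch; the view name is illustrative):
CREATE VIEW page_hits_weekly AS
    SELECT toMonday(time) AS week, sum(hits) AS hits
    FROM wikistat
    WHERE path = {path:String}
    GROUP BY week ORDER BY week;

SELECT * FROM page_hits_weekly(path = 'FOSDEM');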
https://pastila.nl/?0160574b/cec44b20f21da275fcbd8c1d2f3708d9.link
ClickHouse Playground: https://play.clickhouse.com/play?user=play
ClickHouse CI status: https://aretestsgreenyet.com/
Pastila.nl: https://pastila.nl/
Dashboards source code: https://github.com/ClickHouse/ClickHouse/blob/master/programs/server/dashboard.html
* Alternative services do exist. Example: https://www.wikishark.com/
It uses a hand-crafted data structure instead of ClickHouse;
the implementation is non-trivial, and it is less space- and time-efficient.
* All the tools, e.g. Grafana, Superset, Metabase, dbt, Airbyte
— work perfectly well with ClickHouse, and using them is the preferred approach.