Author: Alexey Milovidov, 2018-05-17.
On disk — columns.
Data is stored by columns.
In memory — columns.
Data is processed by columns.
As pieces of columns — for example, 65,536 elements.
The size of a piece — depends on many things.
For SELECT — see the max_block_size setting.
Represented as objects with IColumn interface.
Variants — ColumnVector<T>, ColumnString, ColumnArray...
ColumnVector<T> — almost like std::vector<T>.
But under the IColumn interface.
And instead of std::vector<T> — PODArray<T> (why?).
Previously it was std::vector. PODArray — just an optimization.
PODArray:
— no unnecessary memset;
— has 15 bytes padding at the end;
— uses an allocator interface different from std::allocator, which sometimes allows mremap.
ColumnString — consists of two components:
1. Bytes laid out sequentially.
2. Offsets to the i+1 string.
h e l l o \0 w o r l d \0
6 12
ColumnConst
From one nested column,
containing one value.
Basic operations:
— cut — extract part of a column, to implement LIMIT;
— filter — to implement WHERE;
— compareAt, permute — to implement ORDER BY;
...
Ownership — using COWPtr<IColumn>.
Previously it was std::shared_ptr<IColumn>.
virtual Ptr filter(const Filter & filt, ssize_t result_size_hint) const = 0;
Instead of modifying content, they create
and return a new column object.
This is normal, as operations are "coarse-grained".
But there are also "fine-grained", mutating operations.
— storing data in memory;
— common operations on columns.
— Apache Arrow;
— arrays in NumPy;
— arrays in APL, J, K.
Isolate maximally efficient
inner loops from wrapper code.
Code doesn't have to be efficient as a whole.
Optimizable places should be localizable.
«vectorized engine»
Bonus:
— SIMD instructions;
— clever optimizations for homogeneous data
(IColumn::filter, LIKE function implementation);
IDataType
— binary serialization and deserialization to data streams;
— one column can be written to multiple physical streams, example: Array(Array(UInt8));
— serialization and deserialization in text form for different data formats;
— data type properties;
— completely immutable: std::shared_ptr<const IDataType>.
DataTypeUInt32 \ / ColumnUInt32
X
DataTypeDateTime / \ ColumnConst(ColumnUInt32)
A piece of a table: a set of { ColumnPtr, DataTypePtr, name }
Data processing in the query execution pipeline
is performed on blocks.
... here there is an architectural error that will need to be fixed.
Split into
Header: a set of { ColumnPtr, DataTypePtr, name }
Block: { size_t num_rows, std::vector
IBlockInputStream: Block read();
IBlockOutputStream: void write(Block);
They implement:
— data formats (CSV, JSON, Native...)
— reading and writing to tables;
— transformations on data (Limit, Filter, Expression, ...)
Strictly typed — blocks have the same data types
and constant values.
/** Get data structure of the stream... */
virtual Block getHeader() const = 0;
(Very recently this method didn't exist.
Question — what did it allow to do?)
Interface — IStorage.
Implementations — StorageMemory, StorageMergeTree...
Class instances — are tables.
virtual BlockInputStreams read(
const Names & /*column_names*/,
const SelectQueryInfo & /*query_info*/,
const Context & /*context*/,
QueryProcessingStage::Enum & /*processed_stage*/,
size_t /*max_block_size*/,
unsigned /*num_streams*/)
Why BlockInputStreams?
It's just a BlockInputStream (for SELECT)
or BlockOutputStream (for INSERT),
which performs all necessary transformations
when calling the read or write method.
Question — what about the pipeline for INSERT SELECT?
SELECT works on pull principle, INSERT on push principle.
Actually this — is an architectural error.
— control flow is not controlled from outside;
— no ability to glue together the execution of multiple queries;
— more complex implementation of query cancellation, priorities and cooperative execution;
— hard to attach operations to the side of the pipeline — scalar subqueries execution, window functions, etc.
— no transparent access to algebraic properties of pipeline elements.
— hard to perform transformations on the pipeline;
— query execution plan is not available in declarative form;
— for distributed queries only a fragment of the execution plan is visible.
Parser — recursive descent parser, hand-written.
Ready-made SQL parsers:
— parser from PostgreSQL (library available);
— parser from sqlite;
— ready-made variants for ANTLR;
— Apache Calcite;
Features of ClickHouse parser:
— Nested columns;
— lambda functions;
— aliases and expressions anywhere in the query;
InterpreterSelectQuery
ExpressionAnalyzer
Mostly — rule-based optimizations:
— constant folding;
— merging identical expressions;
— removing unnecessary calculations;
— removing unnecessary columns;
...
— pushing ARRAY JOIN closer to the end;
— converting OR chains to IN;
Need to rewrite everything :)
Separate individual optimizations by their places.
Make query optimizations easily pluggable.
Extract common interfaces for some types of optimizations.
(example — peephole optimizations)
Work on an entire block at once.
The code implements not one function application,
but an entire loop over arrays.
Inner loop (usually) is its own
for each combination of argument types.
Inside the loop (usually) there are no virtual calls,
type checks, unnecessary branches.
Example: for the addition operator there are
UInt8 UInt16 UInt32 UInt64 UInt8 UInt16 UInt32 UInt64
Int8 Int16 Int32 Int64 ✕ Int8 Int16 Int32 Int64
Float32 Float64 Float32 Float64
combinations.
And also one of the arguments can be constant:
10 * 10 * 3 = 300 implementations.
Advantages:
— maximum specialization;
— (almost) maximum possible efficiency;
— sufficiently isolated code.
Disadvantages:
— inconvenient interface for implementation;
— bulky code (C++ templates);
— significant increase in binary size;
— no ability to optimize fused operations, example: x * y + z;
1. Vectorized engine.
Examples:
APL, A+, J (jd), K (kdb+, q)
Vectorwise (Actian Vector, VectorH), initial support in Hive
2. Runtime compilation of expressions.
Examples:
Impala, MemSQL, DBToaster
1. Vectorized engine.
2. Runtime compilation of expressions.
Best of all — both!
— SIMD;
— Instruction Level Parallelism;
— Out Of Order Execution;
IAggregateFunction
create — initialize state
in a pre-prepared piece of memory;
update — update state with argument values;
merge — merge two states into one;
serialize, deserialize
— write to I/O stream (network, file, table)
insertResultInto — get final value.
Aggregate function state is a first class citizen in ClickHouse!
— regular function;
https://github.com/ClickHouse/ClickHouse/pull/1535
gcd, lcm — Maks Skorokhod
— aggregate function;
https://github.com/ClickHouse/ClickHouse/pull/2352
windowFunnel — Sundy Li
— table function;
https://github.com/ClickHouse/ClickHouse/pull/2164/
file — decaseal (Topvisor)
— data format;
https://github.com/ClickHouse/ClickHouse/pull/1387
Cap'n'Proto — Marek Vavruša
— table engine;
https://github.com/ClickHouse/ClickHouse/pull/1331
Kafka — Marek Vavruša
— database engine;
https://github.com/ClickHouse/ClickHouse/pull/914
Dictionary — Nicolai Kochetov
— external dictionary source;
https://github.com/ClickHouse/ClickHouse/pull/204
http, executable — Oleg Alexeenkov
— external dictionary layout;
https://github.com/ClickHouse/ClickHouse/pull/785
iptrie — Marek Vavruša
— syntax construct;
https://github.com/ClickHouse/ClickHouse/pull/2134
parentheses for UNION ALL elements — zhang2014
— query section;
https://github.com/ClickHouse/ClickHouse/pull/293
LIMIT BY — Artemkin Pavel
— query type;
https://github.com/ClickHouse/ClickHouse/pull/1163
SYSTEM — Vitaliy Lyudvichenko
— data type;
https://github.com/ClickHouse/ClickHouse/pull/945
UUID — Guillaume Tassery
— compression algorithm;
https://github.com/ClickHouse/ClickHouse/pull/1045
none — Paweł Róg
— server protocol;

Web site: https://clickhouse.com/
Google groups: https://groups.google.com/forum/#!forum/clickhouse
Maillist: [email protected]
Telegram chat: https://telegram.me/clickhouse_ru (more than 1500 participants) and https://telegram.me/clickhouse_en
GitHub: https://github.com/ClickHouse/ClickHouse/
Twitter: https://twitter.com/ClickHouseDB
+ meetups. Moscow, St. Petersburg, Novosibirsk, Yekaterinburg, Minsk, Nizhny Novgorod, Berlin, Palo Alto, Beijing, Sunnyvale, San Francisco...