Author: Nikolai Kochetov, 2018-10-19.
Nikolay Kochetov, Yandex
Nikolay Kochetov
ClickHouse Developer
SELECT avg(length(URL)) FROM hits WHERE URL != ''URL columnURL != ''length(URL)Each operation can be executed independently
What benefits does using a computational pipeline provide?
Pipeline represented as a graph of operators (Exchange operator)
Each operator (of relational algebra) supports the interface:
Row-by-row query execution (tuple-at-time)
Advantages
Disadvantages
MonetDB/X100: Hyper-Pipelining Query Execution
Idea
Result:
Disadvantages
Pipeline can be executed differently
Let's consider the task in a generalized form
Data structure iteration
External
for (element : array) /// Outer iteration.
{
foo(element);
if (!element) /// Early exit.
break;
}Internal
void Tree::inOrder(auto & function)
{
if (left_child)
left_child->inOrder(function);
function(this);
if (right_child)
right_child->inOrder(function);
}push and pull approaches are implemented with internal iteration
Graph representation allows external iteration
Example: LocustDB
Mixed scheme
Query execution - transferring from InputStream to OutputStream
Some block operations are the same for InputStream and OutputStream
Approaches:
Volcano: exchange operator
ClickHouse: AsynchronousBlockInputStream
Data parallelism
Example
SELECT length(URL) AS l, count() FROM hits GROUP BY l ORDER BY lLet's look at query execution result

URL length distribution has a heavy tail
Data is processed at different speeds

Solution - "stealing" tasks from neighboring thread during reading

Task stealing doesn't solve all problems
Example: reading in primary key order (optimization)
SELECT value FROM table LIMIT 10 -- Any 10 rows
SELECT value FROM table ORDER BY value LIMIT 10 -- All dataWhat if data is already partially sorted (within a part)?
Representing query processing pipeline as a graph
Components:
Types of processors
Processors are in different states
Pipeline execution - also a processor
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports
Processors
Ports

Example from QueryLog
SELECT sum(value) FROM table WHERE foo(x) = 1;
SELECT sum(value) FROM table WHERE foo(x) = 5;
SELECT sum(value) FROM table WHERE foo(x) = 7;Can compose shared pipeline for multiple queries
Query optimizations can be performed at pipeline construction stage
Examples:
SELECT value IN (SELECT x FROM table) WHERE (value + 1) IN (SELECT x FROM table)Unified pipeline allows transparent CPU sharing between users
Idea: create language for describing query execution pipeline
Additional capabilities
External
Internal