ClickHouse Query Execution Pipeline

Author: Nikolai Kochetov, 2018-10-19.

Nikolay Kochetov, Yandex

ClickHouse Query Execution Pipeline

Nikolay Kochetov

ClickHouse Developer

Why Is Pipeline Needed
in a DBMS

Chain of Independent Computations

SELECT avg(length(URL)) FROM hits WHERE URL != ''

Each operation can be executed independently

Pipeline Execution Tasks

What benefits does using a computational pipeline provide?

Volcano — An Extensible and Parallel Query Evaluation System

Pipeline represented as a graph of operators (Exchange operator)

Each operator (of relational algebra) supports the interface:

Row-by-row query execution (tuple-at-time)

Volcano

Advantages

Disadvantages

Column Computations

MonetDB/X100: Hyper-Pipelining Query Execution

Idea

Column Computations

Result:

Disadvantages

How Query Pipeline Is Executed

Query Pipeline Execution

Pipeline can be executed differently


Query Pipeline Execution

Let's consider the task in a generalized form

Data structure iteration

Types of Data Iteration

External

for (element : array)  /// Outer iteration.
{
    foo(element);

    if (!element)  /// Early exit.
        break;
}

Types of Data Iteration

Internal

void Tree::inOrder(auto & function)
{
    if (left_child)
        left_child->inOrder(function);

    function(this);

    if (right_child)
        right_child->inOrder(function);
}

Query Pipeline Execution

push and pull approaches are implemented with internal iteration

Graph representation allows external iteration

Example: LocustDB

Pipeline Execution in ClickHouse

Mixed scheme

Query execution - transferring from InputStream to OutputStream

Some block operations are the same for InputStream and OutputStream

How to Parallelize Query Execution

Parallel Query Execution

Approaches:

Vertical Pipeline Parallelization

Volcano: exchange operator

ClickHouse: AsynchronousBlockInputStream

Parallel Execution in ClickHouse

Data parallelism

Example

SELECT length(URL) AS l, count() FROM hits GROUP BY l ORDER BY l

Parallel Execution in ClickHouse

Let's look at query execution result

URL length distribution has a heavy tail

Parallel Execution in ClickHouse

Data is processed at different speeds

Parallel Execution in ClickHouse

Solution - "stealing" tasks from neighboring thread during reading

Parallel Execution in ClickHouse

Task stealing doesn't solve all problems

Example: reading in primary key order (optimization)

SELECT value FROM table LIMIT 10                 -- Any 10 rows
SELECT value FROM table ORDER BY value LIMIT 10  -- All data

What if data is already partially sorted (within a part)?

How Pipeline Execution Will Change in ClickHouse

Processors in ClickHouse

Representing query processing pipeline as a graph

Components:

Processors in ClickHouse

Types of processors


Pipeline Execution

Processors are in different states

Pipeline execution - also a processor

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Pipeline Execution

Processors

Ports

Shared Pipeline Execution

Example from QueryLog

SELECT sum(value) FROM table WHERE foo(x) = 1;
SELECT sum(value) FROM table WHERE foo(x) = 5;
SELECT sum(value) FROM table WHERE foo(x) = 7;

Can compose shared pipeline for multiple queries

Pipeline Optimizations

Query optimizations can be performed at pipeline construction stage

Examples:

SELECT value IN (SELECT x FROM table) WHERE (value + 1) IN (SELECT x FROM table)

Resource Allocation

Unified pipeline allows transparent CPU sharing between users

Pipeline Description Language

Idea: create language for describing query execution pipeline

Additional capabilities

Introspection

External

Internal

What We Expect from New Pipeline

Contacts