New features of external dictionaries

New features of external dictionaries

Dictionaries introduction

How to create dictionary:

CREATE DICTIONARY dict_name(
    ... - attributes
)
PRIMARY KEY ... - complex or single key configuration
SOURCE(...) - source configuration
LAYOUT(...) - memory layout configuration
LIFETIME(...) - lifetime of dictionary in memory

https://clickhouse.tech/docs/en/sql-reference/dictionaries/

Dictionary attributes and primary key

Example xml configuration:

<dictionary>
    <name>dictionary_name</name>

    <structure>
        <key>
            key attributes
        </key>

        attributes
    </structure>

    <source>
        dictionary source configuration
    </source>

    <layout><dictionary_layout /></layout>
    <lifetime>dictionary_lifetime</lifetime>
</dictionary>

https://clickhouse.tech/docs/en/sql-reference/dictionaries/

Dictionary attributes and primary key

Example configuration:

CREATE DICTIONARY dict_name (
    id UInt64,
    value String,
    ... attributes
)
PRIMARY KEY id
...

https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-sources/

Dictionary source

MySQL PostgreSQL Mongo Redis Cassandra ClickHouse File HTTP Executable program/script Any database throught ODBC driver

MySQL source configuration:

    SOURCE(MYSQL(
    port 3306
    user 'clickhouse'
    password 'qwerty'
    db 'db_name'
    table 'table_name'))

https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-sources/

Dictionary layout

Flat Hashed Cache SSDCache Direct IPTrie Polygon

Cache layout configuration:

    LAYOUT(CACHE(SIZE_IN_CELLS 100000))
    LAYOUT(SSD_CACHE(
        BLOCK_SIZE 4096
        FILE_SIZE 16777216
        READ_BUFFER_SIZE 1048576
        PATH /path/))

https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-layout

Dictionary updates

ClickHouse periodically updates the dictionaries. Dictionary updates (other than loading for first use) do not block queries.

    LIFETIME(MIN 300 MAX 360)

https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-lifetime/

Functions for working with dictionaries

Get dictionary attribute for specific keys

For primary key from single attribute

dictGet(dictionary_name, attribute_name, keys)

For primary key from multiple attributes

dictGet(dictionary_name, attribute_name, (key1, key2, ..., keyN))

For primary key from single attribute

dictHas(dictionary_name, keys)

For primary key from multiple attributes

dictHas(dictionary_name, (key1, key2, ..., keyN))

https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-lifetime/

Example MySQL dictionary

CREATE TABLE dictionaries.dictionary_example (id Int, value TEXT);
INSERT INTO dictionaries.dictionary_example
VALUES (1, 'First Value'), (2, 'Second Value'), (3, 'Third Value');
CREATE DICTIONARY mysql_dictionary_example (
    id UInt64,
    value String DEFAULT 'Default Value'
)
PRIMARY KEY id
SOURCE(MYSQL(
    port 3306
    user 'test'
    password 'test'
    db 'dictionaries'
    table 'dictionary_example'
))
LIFETIME(MIN 300 MAX 300)
LAYOUT(HASHED());

Example MySQL dictionary

SELECT number, dictGet('mysql_dictionary_example', 'value', number) as values
FROM system.numbers
LIMIT 5
┌─number─┬─values────────┐
│      0 │ Default Value │
│      1 │ First Value   │
│      2 │ Second Value  │
│      3 │ Third Value   │
│      4 │ Default Value │
└────────┴───────────────┘
SELECT number, dictHas('mysql_dictionary_example', number) as values
FROM system.numbers
LIMIT 5
┌─number─┬─values─┐
│      0 │      0 │
│      1 │      1 │
│      2 │      1 │
│      3 │      1 │
│      4 │      0 │
└────────┴────────┘

Source Executable

File data.tsv in TabSeparated format

1	First Value
2	Second Value
3	Third Value
<name>executable_tsv</name>

<structure>
    <id><name>x</name></id>

    <attribute>
        <name>value</name>
        <type>String</type>
        <null_value>No such value</null_value>
    </attribute>
</structure>

<source>
    <executable>
        <command>cat data.tsv</command>
        <format>TabSeparated</format>
    </executable>
</source>

<layout><direct /></layout>

<lifetime>300</lifetime>

Source Executable

SELECT number, dictGet('executable_tsv_simple', 'value', number) as values
FROM system.numbers
LIMIT 5
┌─number─┬─values────────┐
│      0 │ No such value │
│      1 │ First Value   │
│      2 │ Second Value  │
│      3 │ Third Value   │
│      4 │ No such value │
└────────┴───────────────┘
SELECT number, dictHas('executable_tsv_simple', number) as values
FROM system.numbers
LIMIT 5
┌─number─┬─values─┐
│      0 │      0 │
│      1 │      1 │
│      2 │      1 │
│      3 │      1 │
│      4 │      0 │
└────────┴────────┘

Source Executable implicit key option

Available since 21.3.

File data.tsv in TabSeparated format

First Value
Second Value
Third Value
<source>
    <executable>
        <command>cat data.tsv</command>
        <format>TabSeparated</format>
        <implicit_key>1</implicit_key>
    </executable>
</source>

Source Executable problems

ClickHouse will vfork + exec to create new process for each block to process during dictGet, or dictHas

1. User program has long initialization

2. User program need to save state

Source ExecutablePool

Available since 21.3.

Create processes on demand, put them in pool with some size, and then reuse them during dictGet, or dictHas

<source>
    <executable_pool>
        <command>executable-program</command>
        <format>TabSeparated</format>
        <size>10</size>
    </executable_pool>
</source>

Source ExecutablePool benchmarks

int main(int argc, char **argv)
{
    int result;
    int identifier;

    std::this_thread::sleep_for(std::chrono::milliseconds(50));

    while ((result = scanf("%i", &identifier) != EOF))
        std::cout << identifier << "\t" << (identifier + 1) << "\n" << std::flush;

    return 0;
}

Source ExecutablePool benchmarks

Executable pool:

clickhouse-benchmark --query="SELECT dictGet('executable_pool', 'value', number)
FROM system.numbers_mt LIMIT 100000 FORMAT Null" --concurrency=10
localhost:9000, queries 189, QPS: 187.377, MiB/s: 2996.612

Executable:

clickhouse-benchmark --query="SELECT dictGet('executable', 'value', number)
FROM system.numbers_mt LIMIT 100000 FORMAT Null" --concurrency=10
localhost:9000, queries 268, QPS: 92.657, MiB/s: 1481.805

Bonus UDF based on executable pool

<dictionary>
    <name>user_custom_function</name>

    <structure>
        <key>
            <attribute>
                <name>first_word</name>
                <type>String</type><null_value></null_value>
            </attribute>
            <attribute>
                <name>second_word</name>
                <type>String</type><null_value></null_value>
            </attribute>
        </key>
        <attribute>
            <name>distance</name>
            <type>Float64</type> <null_value>0.0</null_value>
        </attribute>
    </structure>

    <source>
        <executable_pool>
            <format>TabSeparated</format>
            <command>word2vec-function</command>
            <implicit_key>1</implicit_key>
            <size>5</size>
        </executable_pool>
    </source>

    <layout><complex_key_direct /></layout>
    <lifetime>300</lifetime>
</dictionary>

Bonus UDF based on executable pool

SELECT dictGet('user_custom_function', 'distance', ('FirstWord', 'SecondWord'));
┌─dictGet('user_custom_function', 'distance', tuple('FirstWord', 'SecondWord'))─┐
│                                                                          0.46 │
└───────────────────────────────────────────────────────────────────────────────┘

CacheDictionary added LRU support

Available since 21.3.

1. Current implementation use fixed hash table with max collision resolution chain = 10

2. Cache attributes allocated memory for all elements on initialization

Issues:

https://github.com/ClickHouse/ClickHouse/issues/20194

https://github.com/ClickHouse/ClickHouse/issues/20252

https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)

CacheDictionary problem example

clickhouse-benchmark
--query="SELECT dictGet('clickhouse_simple_cache_dictionary', 'value1', number)
FROM system.numbers LIMIT 80000 FORMAT Null"
localhost:9000, queries 270, QPS: 87.105, MiB/s: 87.064
clickhouse-benchmark
--query="SELECT dictGet('clickhouse_simple_cache_dictionary', 'value1', number)
FROM system.numbers LIMIT 80000 FORMAT Null"
localhost:9000, queries 60, QPS: 20.105, MiB/s: 22.024

CacheDictionary LRU remove with probabilities

Will be available since 21.4. or 21.5

https://redis.io/topics/lru-cache

http://antirez.com/news/109

Cache dictionaries support for async update

Available since 21.3.

If all keys are found in cache, but some of them are expired, return result to client

Implemented for both Cache, SSDCache dictionary layouts

Nullable attribute type support

Available since 21.3.

Added nullable type support for all dictionary layouts, except Polygon and IPTrie

Evalutate result type of function calling dictGet without dictionary loading

Available since 21.3.

Problem. If your table initialization depends on dictGet result type, and dictionary cannot be loaded because of some error server startup can fails.

Developer — Vitaliy Baranov

Plans. Hierarchy dictionary improvements

Will be available in 21.4. or 21.5

1. Support for single complex key hierarchy attribute

2. Add dictGetChildren, dictGetDescendants methods

3. Improve performance

Issues:

https://github.com/ClickHouse/ClickHouse/issues/21471

Plans. Minor tickets

Will be available in 21.4. or 21.5

1. Array dictionary attribute type support

2. Decimal256 dictionary attribute type support

3. Suport Enum type

4. Table function dictionary for dictionaries created using XML

5. Synchronous initialization of dictionaries created using DDL

6. Use system.dictionaries information without loading dictionary

7. Improve documentation with examples

8. Simplify simple, complex key dictionary configuration created with DDL, XML

?

Links

https://clickhouse.tech/docs/en/sql-reference/dictionaries/