Vector Search in ClickHouse

Author: Alexey Milovidov, 2022-02-04.

Vector Similarity Search in ClickHouse

VSS is the search for similar objects in unstructured data:

— Search for similar objects by their embeddings.

— Using indexes in multidimensional spaces.

— On data that exceeds the memory capacity of a single machine.

— Dynamically updatable indexes.

Relevance, Demand

Vector Similarity Search is an actively developing field
that combines two modern areas of engineering:
machine learning and big data processing.

Examples:

Google announcement, December 13, 2021:
"Find anything blazingly fast with Google's vector search technology"
— proprietary technology for Cloud customers.

New startups are forming around this technology, for example Cohere.ai.

What We Will Do

1. Dataset preparation:

— images from the internet (400 million images);

— social media comments (Reddit, Hacker News dumps);

— source code files (10 million repositories);

— HTML pages (6 million websites).

2. Study of algorithms and their implementations:

— hnswlib (Hierarchical Navigable Small World Graphs);
— redis-hnsw (implementation in Redis);
— nmslib (Non-Metric Space Library);
— Faiss;
— ScaNN.

3. Baseline — testing brute-force search:

— exhaustive search over all vectors, computing the Lp distance on the fly.
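
The brute-force baseline above can be sketched in a few lines. This is only an illustration in plain Python, not the ClickHouse implementation: the function names lp_distance and brute_force_search are hypothetical, and the vectors are assumed to fit in memory as lists of floats.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def lp_distance(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski (Lp) distance; p = math.inf gives the Chebyshev distance."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def brute_force_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int = 3,
    p: float = 2.0,
) -> List[Tuple[float, int]]:
    """Scan every vector, compute the distance on the fly, keep the k nearest.

    Returns (distance, row_index) pairs sorted by increasing distance.
    """
    scored = ((lp_distance(v, query, p), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: four 2-dimensional "embeddings".
data = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [10.0, 10.0]]
for dist, idx in brute_force_search(data, [0.0, 0.0], k=2):
    print(idx, dist)
```

Every query reads every vector, so the cost grows linearly with the dataset size. That makes this scan a natural reference point for both speed and accuracy when evaluating an index.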

4. Choosing data structure for the index:

— monolithic index, fully loaded into memory;

— sparse index with per-granule statistics,
  including variants stored in external memory;

— experiments with building locality-sensitive hashes
  using a grid in a VP-tree and space-filling curves.
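
To make the space-filling-curve idea concrete, here is a minimal sketch of a Z-order (Morton) key built on a uniform grid. The choice of the Z-order curve, the names quantize, morton_code and z_key, and the value range [lo, hi] are assumptions for illustration; the slides do not say which curve or grid is planned.

```python
from typing import Sequence


def quantize(value: float, lo: float, hi: float, bits: int) -> int:
    """Map a float in [lo, hi] to a grid cell in [0, 2**bits - 1] (clamped)."""
    cells = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return int((clamped - lo) / (hi - lo) * cells)


def morton_code(cells: Sequence[int], bits: int) -> int:
    """Interleave the bits of all grid coordinates into one Z-order key."""
    dims = len(cells)
    code = 0
    for bit in range(bits):
        for d, c in enumerate(cells):
            code |= ((c >> bit) & 1) << (bit * dims + d)
    return code


def z_key(vector: Sequence[float], lo: float, hi: float, bits: int) -> int:
    """Grid-quantize a vector and return its position on the Z-order curve."""
    return morton_code([quantize(v, lo, hi, bits) for v in vector], bits)


# Points that fall into the same grid cell receive the same key,
# so the key can serve as a coarse locality-preserving hash.
print(z_key([0.10, 0.20], 0.0, 1.0, bits=4))
print(z_key([0.11, 0.21], 0.0, 1.0, bits=4))
print(z_key([0.90, 0.95], 0.0, 1.0, bits=4))
```

Sorting rows by such a key places many nearby points close together on disk, although the locality is imperfect and weakens as the number of dimensions grows. This is one reason the approach is listed as an experiment.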

5. Implementation of data structures in ClickHouse.

--- this is where all the most interesting things happen ---

6. Conducting experiments and comparative testing:

— memory consumption;
— search performance;
— index loading speed;
— result accuracy.
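
Result accuracy for approximate search is commonly reported as recall@k against the exact brute-force answer. The sketch below assumes that metric; the slides do not name one, and the functions recall_at_k and mean_recall are hypothetical helpers.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the true k nearest neighbors that the index returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact = set(exact_ids[:k])
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids[:k])) / len(exact)


def mean_recall(
    approx_results: Sequence[Sequence[int]],
    exact_results: Sequence[Sequence[int]],
    k: int,
) -> float:
    """Average recall@k over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    if not pairs:
        raise ValueError("no queries to evaluate")
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)


# Hypothetical row ids: the index found 3 of the 4 true neighbors.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
```

Reporting recall together with query latency and memory use lets different indexes be compared at the same accuracy level rather than on speed alone.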

Performance optimization.

Development of tests and documentation.

Technology Stack

90% C++
10% Python (data preparation)

Knowledge areas:
— Databases and distributed systems;
— Algorithms and data structures;
— Machine learning.

Cloud resources will be provided for the work.

Requirements

— a team of 4 to 6 people;
— solid knowledge of C++.

Contacts

Team leader — Artur Filatyenkov,
Telegram @illusion_cat.

Project supervisor — Alexey Milovidov,
Telegram @milovidov_an.

Q&A