Vector Search in ClickHouse

Vector Similarity Search in ClickHouse

VSS — searching for similar objects in unstructured data:

— Search for similar objects by their embeddings.

— Using indexes in multidimensional spaces.

— On data that exceeds main memory and the capacity of a single machine.

— Dynamically updatable indexes.

Relevance, Demand

Vector similarity search is an actively developing field
that combines two modern areas of engineering:
machine learning and big data processing.

Examples:

Google announcement, December 13, 2021:
"Find anything blazingly fast with Google's vector search technology"
— proprietary technology for Cloud customers.

New startups are forming around this technology, for example Cohere.ai.

What We Will Do

1. Dataset preparation:

— images from the internet (400 million images);

— social media comments (Reddit, Hacker News dumps);

— source code files (10 million repositories);

— HTML pages (6 million websites).

2. Study of algorithms and their implementations:

— hnswlib (Hierarchical Navigable Small World Graphs);
— redis-hnsw (implementation in Redis);
— nmslib (Non-Metric Space Library);
— Faiss;
— ScaNN.
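
The graph-based libraries in this list share one core idea: walk a proximity graph greedily toward the query. The sketch below illustrates that idea only. It is a single-layer simplification and not the API or the algorithm of hnswlib or any other library above. HNSW itself adds a hierarchy of layers and keeps a list of candidates instead of a single current node. The names `build_knn_graph` and `greedy_search` are hypothetical, and the graph is built by brute force purely for the demonstration.

```python
import heapq
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points: Sequence[Vector], m: int = 2) -> Dict[int, List[int]]:
    """Link every point to its m nearest neighbours and make the links symmetric.

    Built by brute force, which is acceptable only for a small demo.
    """
    links = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in heapq.nsmallest(m, others, key=lambda j: l2(p, points[j])):
            links[i].add(j)
            links[j].add(i)
    return {i: sorted(ns) for i, ns in links.items()}


def greedy_search(
    points: Sequence[Vector],
    graph: Dict[int, List[int]],
    query: Vector,
    entry: int = 0,
) -> int:
    """Move to the neighbour closest to the query until no neighbour improves.

    Returns the index of a local minimum, which may not be the true nearest point.
    """
    current = entry
    current_dist = l2(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for n in graph[current]:
            d = l2(points[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            return current
        current, current_dist = best, best_dist
```

Because the walk stops at the first local minimum, the result is approximate. That is the reason the accuracy of such indexes has to be measured against an exact baseline.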

3. Baseline — testing brute-force search:

— exhaustive search over all vectors, computing the Lp distance on the fly.
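
A minimal sketch of such a baseline, in pure Python for readability (the project itself targets C++). Every stored vector is compared with the query and the k smallest distances are kept. The function names `lp_distance` and `brute_force_search` and the default parameters are illustrative assumptions, not part of ClickHouse.

```python
import heapq
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def lp_distance(a: Vector, b: Vector, p: float = 2.0) -> float:
    """L_p distance between two vectors of equal length, for p >= 1 or p = inf."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    if p == float("inf"):
        return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def brute_force_search(
    vectors: Sequence[Vector],
    query: Vector,
    k: int = 3,
    p: float = 2.0,
) -> List[Tuple[float, int]]:
    """Scan all vectors and return the k nearest as (distance, row index), nearest first."""
    scored = ((lp_distance(v, query, p), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```

The cost is linear in the number of vectors and in their dimension for every query. That makes it a useful reference point for both the speed and the accuracy of any index.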

4. Choosing data structure for the index:

— monolithic index, fully loaded into memory;

— sparse index with per-granule statistics,
with variants kept in external memory;

— experiments with building locality-sensitive hashes
using a grid in a VP-tree and space-filling curves.
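
One possible reading of "granule statistics" can be sketched as follows. This is an assumption for illustration, not the design chosen in the project. Each granule (a block of consecutive rows) stores a centroid and a radius. By the triangle inequality, no vector in the granule can be closer to the query than the distance to the centroid minus the radius, so granules whose bound is no better than the best match found so far can be skipped. The names `summarize` and `nearest_with_pruning` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize(granule: Sequence[Vector]) -> Tuple[List[float], float]:
    """Per-granule statistics: centroid and the largest distance from it."""
    dim = len(granule[0])
    centroid = [sum(v[d] for v in granule) / len(granule) for d in range(dim)]
    radius = max(l2(v, centroid) for v in granule)
    return centroid, radius


def nearest_with_pruning(
    granules: Sequence[Sequence[Vector]], query: Vector
) -> Tuple[Optional[Vector], float, int]:
    """Return (nearest vector, its distance, number of granules actually scanned)."""
    stats = [summarize(g) for g in granules]
    # Lower bound on the distance from the query to any vector in each granule.
    order = sorted(
        (max(0.0, l2(query, c) - r), gi) for gi, (c, r) in enumerate(stats)
    )
    best_vec, best_dist, scanned = None, math.inf, 0
    for lower_bound, gi in order:
        if lower_bound >= best_dist:
            break  # every remaining granule has an even larger bound
        scanned += 1
        for v in granules[gi]:
            d = l2(v, query)
            if d < best_dist:
                best_vec, best_dist = v, d
    return best_vec, best_dist, scanned
```

In this sketch only the small statistics need to stay in memory, while the granules themselves could be read from disk on demand. How well the pruning works depends on how tightly the vectors in each granule are clustered.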

5. Implementation of data structures in ClickHouse.

--- this is where all the most interesting things happen ---

6. Conducting experiments and comparative testing:

— memory consumption;
— search performance;
— index loading speed;
— result accuracy.
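
Result accuracy for approximate search is commonly reported as recall@k: the share of the true k nearest neighbours (taken from the brute-force baseline) that the index also returns. A minimal sketch, where the function name `recall_at_k` and the use of row identifiers as plain integers are illustrative assumptions:

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k neighbours that the approximate top-k also contains."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must not be empty")
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)
```

Averaging this value over a set of test queries gives a single accuracy figure that can be reported next to memory consumption and search latency.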

Performance optimization.

Development of tests and documentation.

Technology Stack

90% C++
10% Python (data preparation)

Knowledge areas:
— Databases and distributed systems;
— Algorithms and data structures;
— Machine learning.

Cloud resources will be provided for the work.

Requirements

— a team of 4 to 6 people;
— solid knowledge of C++.

Contacts

Team leader — Artur Filatyenkov,
Telegram @illusion_cat.

Project supervisor — Alexey Milovidov,
Telegram @milovidov_an.
