Author: Alexey Milovidov, 2022-02-04.
VSS — searching for similar objects in unstructured data:
— Search for similar objects by their embeddings.
— Using indexes in multidimensional spaces.
— On data that fits neither in memory nor on a single machine.
— Dynamically updatable indexes.
Vector Similarity Search is an actively developing field
that combines two modern areas of engineering:
machine learning and big data processing.
Examples:
Google announcement, December 13, 2021:
"Find anything blazingly fast with Google's vector search technology"
— proprietary technology for Cloud customers.
New startups built around this technology: Cohere.ai
1. Dataset preparation:
— images from the internet (400 million images);
— social media comments (Reddit, Hacker News dumps);
— source code files (10 million repositories);
— HTML pages (6 million websites).
2. Study of algorithms and their implementations:
— hnswlib (Hierarchical Navigable Small World Graphs);
— redis-hnsw (implementation in Redis);
— nmslib (Non-Metric Space Library);
— Faiss;
— ScaNN.
3. Baseline — testing brute-force search:
— exhaustive search with on-the-fly calculation of the Lp metric.
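A minimal sketch of such a baseline, written in plain Python for brevity (the project itself is mostly C++). The names `lp_distance` and `brute_force_search` and the toy data are hypothetical, chosen only for illustration. The sketch visits every vector, computes the Lp distance on the fly, and keeps the k nearest.

```python
import heapq
import math


def lp_distance(a, b, p=2.0):
    """L_p (Minkowski) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    if math.isinf(p):
        # L_inf is the limit case: the largest coordinate difference.
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def brute_force_search(vectors, query, k=1, p=2.0):
    """Return the k nearest vectors as (distance, index) pairs, closest first.

    Every vector is visited, so the cost is O(n * d) per query.
    """
    scored = ((lp_distance(v, query, p), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy data, for illustration only.
data = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (-1.0, 0.5)]
nearest = brute_force_search(data, (0.9, 1.2), k=2, p=2.0)
```

Because nothing is precomputed, this baseline is exact. That makes it useful both as a performance floor and as the source of ground truth when measuring the accuracy of approximate indexes.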
4. Choosing a data structure for the index:
— monolithic index, fully loaded into memory;
— sparse index with granule statistics,
variants in external memory;
— experiments with building locality-sensitive hashes
using a grid in a VP-tree and space-filling curves.
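One way to picture the grid and space-filling-curve idea is a Z-order (Morton) key: each coordinate is quantized onto a uniform grid, and the bits of the cell indexes are interleaved into one integer, so points that are close in space often receive nearby keys. This is a sketch under stated assumptions, not the project's design. The function names, the restriction to 2-D, the [0, 1) coordinate range, and the 16-bit resolution are all hypothetical.

```python
def quantize(value, bits=16, lo=0.0, hi=1.0):
    """Map a coordinate from [lo, hi) to an integer grid cell in [0, 2**bits)."""
    cells = 1 << bits
    cell = int((value - lo) / (hi - lo) * cells)
    return min(max(cell, 0), cells - 1)  # clamp out-of-range values


def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y: x takes even positions, y odd."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def morton_key(point, bits=16):
    """Z-order key of a 2-D point with coordinates in [0, 1)."""
    x, y = point
    return interleave_bits(quantize(x, bits), quantize(y, bits), bits)


# Toy points, for illustration only.
points = [(0.10, 0.10), (0.11, 0.12), (0.90, 0.85), (0.12, 0.09)]
# Sorting by key lays the points out along the curve; a range scan over
# keys then yields candidates that still need exact distance checks.
by_curve = sorted(points, key=morton_key)
```

Such a key can be stored in an ordinary sorted index. The trade-off is that neighbors on opposite sides of a large cell boundary can end up far apart on the curve, so the candidates it produces are approximate and the approach degrades as dimensionality grows.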
5. Implementation of data structures in ClickHouse.
--- this is where all the most interesting things happen ---
6. Conducting experiments and comparative testing:
— memory consumption;
— search performance;
— index loading speed;
— result accuracy.
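Result accuracy for approximate search is commonly reported as recall@k: the fraction of the true k nearest neighbors, taken from the brute-force baseline, that the index also returns. A minimal sketch follows; the helper name `recall_at_k` is hypothetical and the ids are made up for illustration, not measured results.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact = set(exact_ids[:k])
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    found = exact.intersection(approx_ids[:k])
    return len(found) / len(exact)


# Illustrative ids only, not measured results.
exact_top5 = [17, 4, 42, 8, 23]    # ground truth from exhaustive search
approx_top5 = [17, 42, 4, 99, 23]  # what an approximate index returned
recall = recall_at_k(approx_top5, exact_top5, k=5)
```

Recall is order-insensitive by design: an index that returns the right neighbors in a different order still scores 1.0. Averaging it over many queries, alongside memory and latency figures, gives a comparable accuracy number for each index variant.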
Performance optimization.
Development of tests and documentation.
90% C++
10% Python (data preparation)
Knowledge areas:
— Databases and distributed systems;
— Algorithms and data structures;
— Machine learning.
Resources in the Cloud will be provided for the work.
— 4 to 6 people on the team;
— solid knowledge of C++.
Team leader — Artur Filatyenkov,
Telegram @illusion_cat.
Project supervisor — Alexey Milovidov,
Telegram @milovidov_an.