David Okpare

AI Engineer

Image from Clipdrop by Stability AI

Building a Vector Database on SQLite, NumPy and KNNs

Not too long ago, vector databases were the talk of the tech world, and it seemed like every startup with a vector database concept was catching the attention of venture capitalists. Well, that fever certainly caught me too, and here’s my journey of delving into the world of vector databases and, yes, even attempting to secure some of that coveted VC funding (fingers crossed)! 🤞


What are vector databases?

Vector databases are purpose-built to tackle the problems that arise when managing vector embeddings in production scenarios.

Vector databases became a necessity in the wake of the AI revolution. As we harnessed the impressive powers of large language models (LLMs), we realized they needed to stretch their intellectual horizons beyond their initial training data. LLMs are like sponges soaking up language data, but they do not understand language the way humans do; they encode information as high-dimensional vectors. Think of them as language wizards speaking in numbers, or ndarrays to be precise.

Implementation

Now, the twist in the tale is that traditional relational databases like SQLite, PostgreSQL, and MySQL weren't exactly cut out for handling these ndarrays. However, we can serialize the data before it is stored and deserialize it after it is retrieved.

In our case, we use NumPy's np.save to serialize each ndarray into bytes and store the result in a binary (BLOB) column, then np.load to turn those bytes back into an array on the way out. The sqlite3 module's register_adapter and register_converter hooks let us store and restore these ndarray gems transparently. (A shoutout to the genius who shared this on Stack Overflow!)

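import io
import sqlite3

import numpy as np


def adapt_array(arr):
    """Serialize an ndarray to bytes in .npy format so SQLite can store it as a BLOB."""
    out = io.BytesIO()
    np.save(out, arr)
    return sqlite3.Binary(out.getvalue())


def convert_array(blob):
    """Rebuild an ndarray from the bytes read back from SQLite."""
    return np.load(io.BytesIO(blob))


# Called whenever an ndarray is passed as a query parameter.
sqlite3.register_adapter(np.ndarray, adapt_array)
# Called for columns declared with type "array"; the connection must be
# opened with detect_types=sqlite3.PARSE_DECLTYPES for this to take effect.
sqlite3.register_converter("array", convert_array)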
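
To see the adapter and converter working together, here is a minimal sketch of a round trip through an in-memory database. The table name embeddings, its column names, and the sample vector are made up for illustration. The two things that matter are that the column is declared with the type array (matching the name given to register_converter) and that the connection is opened with detect_types=sqlite3.PARSE_DECLTYPES.

```python
import io
import sqlite3

import numpy as np


def adapt_array(arr):
    # ndarray -> bytes in .npy format
    out = io.BytesIO()
    np.save(out, arr)
    return sqlite3.Binary(out.getvalue())


def convert_array(blob):
    # bytes in .npy format -> ndarray
    return np.load(io.BytesIO(blob))


sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)

# PARSE_DECLTYPES tells sqlite3 to look at the declared column type
# ("array") and run the matching converter on values read back.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)

# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, text TEXT, vector array)"
)

original = np.array([0.1, 0.2, 0.3], dtype=np.float32)
conn.execute(
    "INSERT INTO embeddings (text, vector) VALUES (?, ?)", ("hello", original)
)

# The value comes back as an ndarray, not as raw bytes.
restored = conn.execute("SELECT vector FROM embeddings").fetchone()[0]
conn.close()
```

Because np.save writes the dtype and shape alongside the data, the restored array should match the original without any extra bookkeeping.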

So, with SQLite on our side, we can now store and fetch ndarray treasures, but let's not forget, there's more to this story. How do we pluck out those crucial embeddings without summoning the whole blob?

For most relational databases, querying data is like saying, "Hey, give me everything that matches this condition."

SELECT * FROM db_table WHERE column_name = 'value';

But in our case, it's a bit different. We're on the hunt for similarities between our vector embeddings—enter, Similarity Search!

Similarity Search is like the matchmaking service of vector databases. It's all about finding the closest vectors to a given query vector. This is where kNN (k-nearest neighbors) comes into play. A classic in the machine learning world, kNN is like that friendly neighbor who's always there to help.

In a nutshell, kNN operates by calculating the distance between the query vector and every stored vector, then keeping the k closest. In our case, it's the good ol' Euclidean distance. But no need to whip out your calculators; NumPy swoops in to save the day with numpy.linalg.norm(vector1 - vector2).
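
As a quick sanity check, the snippet below compares the textbook formula (the square root of the summed squared differences) with the NumPy one-liner. The two sample vectors are arbitrary values chosen so the answer is easy to verify by hand (a 3-4-5 triangle).

```python
import numpy as np

# Arbitrary sample vectors, chosen for an easy-to-check result.
vector1 = np.array([0.0, 0.0])
vector2 = np.array([3.0, 4.0])

# Textbook definition: sqrt((0 - 3)^2 + (0 - 4)^2) = sqrt(9 + 16)
manual = np.sqrt(np.sum((vector1 - vector2) ** 2))

# Same thing via the L2 norm of the difference vector.
with_norm = np.linalg.norm(vector1 - vector2)

print(manual, with_norm)  # 5.0 5.0
```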

Here's a nifty implementation of the Euclidean distance and kNN in Python:


import numpy as np


def euclidean_distance(point1, point2):
    """
    Calculate the Euclidean distance between two points represented as NumPy arrays.
    """
    if point1.shape != point2.shape:
        raise ValueError("Input points must have the same shape.")

    return np.linalg.norm(point1 - point2)


def get_nearest_neighbor(train, test_row, num_neighbors: int = 1):
    """
    Find the nearest neighbors of a test data point in a dataset.

    Returns up to num_neighbors rows from train, closest first.
    """
    # Pair every stored row with its distance to the query row.
    distances = [(train_row, euclidean_distance(test_row, train_row)) for train_row in train]

    # Sort by distance, smallest first.
    distances.sort(key=lambda tup: tup[1])

    # Slicing avoids an IndexError when num_neighbors exceeds len(train).
    return [row for row, _ in distances[:num_neighbors]]
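
The loop above computes one distance at a time, which gets slow as the number of stored vectors grows. If all the embeddings fit in a single 2-D array, the same brute-force search can be written with broadcasting. This is only a sketch of an alternative, not the code from the repository; the function name nearest_neighbors_vectorized and the sample data are hypothetical.

```python
import numpy as np


def nearest_neighbors_vectorized(train, query, num_neighbors=1):
    """Brute-force kNN over a 2-D array of stored vectors (one per row)."""
    train = np.asarray(train)

    # Broadcasting subtracts the query from every row at once;
    # axis=1 gives one Euclidean distance per stored vector.
    distances = np.linalg.norm(train - query, axis=1)

    # Indices of the closest rows, nearest first.
    order = np.argsort(distances)[:num_neighbors]
    return train[order], distances[order]


# Hypothetical data: three stored 2-D vectors and one query.
stored = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
query = np.array([0.9, 1.2])

neighbors, dists = nearest_neighbors_vectorized(stored, query, num_neighbors=2)
```

Here neighbors holds the rows [1.0, 1.0] and [0.0, 0.0], in that order. This is still an exact search that touches every stored vector, so it scales linearly with the size of the table.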

Conclusion

So there you have it, a sneak peek into my journey of tackling vector databases and making sense of those high-dimensional treasures. Who knows, maybe my vector database venture will find its way to the VC radar one day. Until then, happy coding and may your vectors always align! 🚀

Find the complete source code and more details on my GitHub repository: https://github.com/DaveOkpare/sqlite_vector
