
Our simple vector database implementation

William Zeng - March 12th, 2024


(Image: a colorful illustration of a vector database)

Standard vector databases worked great for Sweep when we had a single index, but they became a pain as we started managing multiple indices, frequent updates, and self-hosting. Our backend now consists of a single docker image that handles vector search over hundreds of repositories, each with a few thousand files.
We use a simple Redis-based vector database that we can run on our own infrastructure, and we'd like to share it with the community!

Code Walkthrough

Here's our implementation: https://github.com/sweepai/sweep/blob/main/docs/public/redis_vector_db.py.
To use it, copy and paste our code into your project, then start Redis with docker run -p 6379:6379 -d redis.
You'll need to install the following packages:

numpy
redis
loguru
tiktoken
openai

Here's a usage example:

from redis_vector_db import get_most_similar_texts
 
query = "example_query"
document = "example_document"
texts = [document for _ in range(n)]
most_similar_texts = get_most_similar_texts(query, texts)
# most_similar_texts is a sorted list of the most similar texts to the query

We also added a couple of configuration options:

  • BATCH_SIZE (512): The number of texts fetched from Redis or embedded at once.
  • REDIS_URL (os.environ.get("REDIS_URL", "redis://0.0.0.0:6379/0")): You can point this at a remote Redis instance for greater performance.
  • CACHE_VERSION ("v0.0.1"): If you change the format or model for the embeddings, change this to avoid incorrect cache hits.
  • DIMENSIONS_TO_KEEP (512): The number of dimensions to keep from the OpenAI embeddings. 512 dimensions worked best in our case.
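
Spelled out as module-level constants, these options look roughly like this (a sketch using the values from the table; only REDIS_URL reads from the environment):

import os

# Number of texts fetched from Redis or embedded per batch
BATCH_SIZE = 512
# Point this at a remote Redis instance for greater performance
REDIS_URL = os.environ.get("REDIS_URL", "redis://0.0.0.0:6379/0")
# Bump this whenever the embedding model or format changes to avoid stale cache hits
CACHE_VERSION = "v0.0.1"
# Number of dimensions to keep from the OpenAI embeddings
DIMENSIONS_TO_KEEP = 512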

This is our main loop. It turns a batch of texts into embeddings, handling cache reads and writes.

# 0. If we don't have a redis client, just call openai
if not redis_client:
    return openai_call_embedding(batch)
 
# 1. Get embeddings from redis, using the hash of the text as the key
embeddings: list[np.ndarray] = [None] * len(batch)
cache_keys = [hash_text(text) + CACHE_VERSION for text in batch]
try:
    for i, cache_value in enumerate(redis_client.mget(cache_keys)):
        if cache_value:
            embeddings[i] = np.array(json.loads(cache_value))
except Exception as e:
    logger.exception(e)
 
# 2. If we have all the embeddings, return them
batch = [text for idx, text in enumerate(batch) if embeddings[idx] is None]
if len(batch) == 0:
    embeddings = np.array(embeddings)
    return embeddings
 
# 3. If we don't have all the embeddings, call openai for the missing ones
try:
    new_embeddings = openai_call_embedding(batch)
except requests.exceptions.Timeout as e:
    logger.exception(f"Timeout error occured while embedding: {e}")
except BadRequestError as e:
    try:
        # 4. If we get a BadRequestError, truncate the text and try again
        batch = [truncate_string_tiktoken(text) for text in batch] # truncation is slow, so we only do it if we have to
        new_embeddings = openai_call_embedding(batch)
    except Exception as e:
        logger.exception(e)
 
# 5. Place the new embeddings in the correct position
indices = [i for i, emb in enumerate(embeddings) if emb is None]
for i, index in enumerate(indices):
    embeddings[index] = new_embeddings[i]
 
# 6. Store the new embeddings in redis
redis_client.mset(
    {
        cache_key: json.dumps(embedding.tolist())
        for cache_key, embedding in zip(cache_keys, embeddings)
    }
)
return np.array(embeddings)
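
The loop above leans on a few helpers that live in the full redis_vector_db.py file. Their exact implementations may differ, but here is a minimal sketch of what they amount to, assuming the text-embedding-3-small model and re-normalizing after truncating to DIMENSIONS_TO_KEEP dimensions:

import hashlib

import numpy as np
import redis
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
redis_client = redis.Redis.from_url(REDIS_URL)  # REDIS_URL from the config above

def hash_text(text: str) -> str:
    # Key the cache on the content itself, so unchanged files are never re-embedded
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def openai_call_embedding(batch: list[str]) -> np.ndarray:
    # Embed a batch of texts; the model choice here is an assumption
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=batch)
    # Keep only the first DIMENSIONS_TO_KEEP dimensions of each embedding
    embeddings = np.array([item.embedding for item in response.data])[:, :DIMENSIONS_TO_KEEP]
    # Re-normalize after truncation so cosine similarity reduces to a dot product
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)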

Implementation Details

  1. The first nuance is that we only truncate the text when we get a BadRequestError, because truncation is slow and we only want to pay that cost when we have to. This is optional, and you can truncate beforehand if you prefer (a sketch of such a truncation helper follows this list).
  2. It's important to handle empty strings; otherwise OpenAI will throw an error without telling you why. We added this line to replace them with a single space: texts = [text if text else " " for text in texts]
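
A truncation helper like the one referenced above can be built directly on tiktoken. This is a sketch, assuming the cl100k_base encoding and an ~8k-token input limit for OpenAI embeddings; the limit used in the real file may differ:

import tiktoken

# cl100k_base is the encoding used by OpenAI's recent embedding models
encoding = tiktoken.get_encoding("cl100k_base")

def truncate_string_tiktoken(text: str, max_tokens: int = 8191) -> str:
    # Encoding and decoding is the slow part, which is why this only runs on a BadRequestError
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])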

Design Choices

We designed our vector database in this way because it's simple and reliable. We don't have to worry about managing separate infrastructure, and we can handle a lot of reads/writes.

Requirements

  • We need to handle hundreds of codebases with frequent updates

    • This architecture does not require any indices. In Pinecone, you'd typically put separate repositories under different namespaces (https://docs.pinecone.io/reference/fetch).
    • This makes a lot of sense for a small number of indices but becomes cumbersome with a large number of them. We'd also have to manage reindexing as each user pushes code to their repos (multiple times an hour).
  • Reliability and developer experience

    • Using an external store introduces another dependency for us. We'd have to handle Pinecone outages, and it's not as easily self-hostable.
    • Learning how to use a new API and manage new infrastructure was not worth it in this case.

Non-requirements

  • This approach is relatively slow (~1 second per query). We don't need sub-second latency for generating pull requests because our GPT-4 calls are much slower.

    • This is because we process/cache the entire index at runtime, keying the cache on the actual content's SHA256 hash (a sketch of this query path follows the list below).
    • A codebase doesn't tend to change a substantial percentage of its files day-to-day, so hashing the actual contents means we only have to re-embed the diffs. This lets us completely skip offline indexing, simplifying our vector DB implementation.
  • 1M+ embeddings in a single index

    • This approach doesn't scale well for indices with 1M+ embeddings. Fortunately, that hasn't been a problem for us yet because the codebases we work with don't usually exceed 30k files.
    • We recently began embedding larger chunks, giving us a chunks:doc ratio of about 1.5, or ~45k chunks. This takes about 1 second to serve, and we own all of the infrastructure!
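
For completeness, the query path mentioned above amounts to a brute-force dot product over the cached embedding matrix. The embed_texts name below is a placeholder for the batched, Redis-cached embedding loop walked through earlier (the real function name in redis_vector_db.py may differ), so treat this as a sketch rather than the exact implementation:

import numpy as np

def get_most_similar_texts(query: str, texts: list[str]) -> list[str]:
    # embed_texts is a placeholder for the cached embedding loop shown above
    query_embedding = embed_texts([query])[0]
    text_embeddings = embed_texts(texts)
    # With normalized embeddings, cosine similarity reduces to a dot product
    similarities = text_embeddings @ query_embedding
    # Return the texts sorted from most to least similar to the query
    ranked_indices = np.argsort(-similarities)
    return [texts[i] for i in ranked_indices]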

Conclusion

I don't think everyone needs a separate dependency for their vector database. Depending on how you value development speed, reliability, and latency, you might be better off with a simple self-hosted solution like ours. If you want to see more of Sweep, check out our GitHub repository at https://github.com/sweepai/sweep.