How to Choose the Right Vector Database
And what role do they play in Retrieval Augmented Generation (RAG)?
I’m writing a book on Retrieval Augmented Generation (RAG) for Wiley Publishing, and vector databases are an inescapable part of building a performant RAG system.
Over several blogs, I’ll teach you everything you need to know to start working with a vector database (in this series, you’ll learn to use Qdrant).
This blog focuses on an intuitive and conceptual understanding of vector databases and why they’re needed in RAG workflows. The rest of the series of blogs will be more technical and hands-on and require a lot of coding. By the end of the series, you will have acquired a valuable skill set that is highly sought-after in the market. I’m also writing the book in a way that minimizes reliance on frameworks like LlamaIndex and LangChain, so almost all of the code will be dealing directly with Qdrant and the OpenAI API.
I’ll batch-publish four blogs a month. Your feedback, comments, and questions are welcome. Your input will help make the book amazing!
Let’s get to it.
What’s in This Series?
You’re probably familiar with traditional databases, like relational databases or NoSQL databases.
They store data in tables, with each row representing a record and each column representing a particular type of data, like name, age, or address. Searching and querying these databases is straightforward. You can use SQL to retrieve records based on specific criteria.
For example:
-- Find all users with the name 'Harpreet'
SELECT * FROM users WHERE name = 'Harpreet';
-- Get all products with a price greater than $100
SELECT * FROM products WHERE price > 100;
But what if you need to store and query data that can’t be easily represented as rows and columns?
What if your data is more complex, like images, audio files, or abstract concepts like user preferences or semantic meanings? Imagine trying to find all images that look like a specific pair of shoes in a traditional database. You’d have to manually tag each image with relevant keywords and then search for those tags. Unstructured data like images, sounds, complex text documents, or even molecular data can’t easily be parsed into discrete rows and columns.
But what if there was a way to represent complex, unstructured data in a format that captures its inherent relationships and allows for efficient similarity-based searching? This is where vectors come to the rescue.
If the term sounds mathematical, that’s because it is.
A vector is a mathematical object with both magnitude (length) and direction. In practice, a vector is a sequence of floating-point numbers that locates a data point in a high-dimensional space. Each dimension in the space corresponds to a specific feature or attribute of the data, and the value along that dimension represents how strongly that feature is expressed for the particular data point. This high-dimensional representation allows complex relationships between data points to be captured and analyzed, enabling tasks such as similarity searches and clustering.
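If it helps to see "magnitude" and "direction" written down, here is the standard notation with a tiny worked example. The two-dimensional vector (3, 4) is just an illustration I picked because the arithmetic is clean; real embeddings have hundreds or thousands of dimensions.

```latex
\[
\mathbf{v} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n,
\qquad
\lVert \mathbf{v} \rVert = \sqrt{\sum_{i=1}^{n} v_i^2},
\qquad
\hat{\mathbf{v}} = \frac{\mathbf{v}}{\lVert \mathbf{v} \rVert}
\]

\[
\mathbf{v} = (3, 4)
\;\Rightarrow\;
\lVert \mathbf{v} \rVert = \sqrt{3^2 + 4^2} = 5,
\qquad
\hat{\mathbf{v}} = (0.6,\ 0.8)
\]
```

The length 5 is the magnitude, and the unit vector (0.6, 0.8) is the direction.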
For example:
- Text documents can be represented as vectors, with each dimension corresponding to a word (or, more specifically, a token) and the value indicating, to a certain degree, the importance of that word in the document, taken in context with the words around it.
- Images can be converted into vectors where each pixel is a dimension.
- Audio clips can be transformed into vectors based on various audio features.
This becomes especially handy when you want to represent and compare complex, unstructured data numerically. Vectors representing similar objects or concepts will be close to each other in multidimensional space.
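To make "close to each other" concrete, here is a minimal sketch in plain Python that measures closeness with cosine similarity, one common choice among several. The three vectors are made-up numbers I chose for illustration, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional vectors, purely for illustration.
sneaker_a = [0.9, 0.1, 0.8, 0.2]
sneaker_b = [0.8, 0.2, 0.9, 0.1]
sandal = [0.1, 0.9, 0.2, 0.8]

print(cosine_similarity(sneaker_a, sneaker_b))  # close to 1: very similar
print(cosine_similarity(sneaker_a, sandal))     # much lower: less similar
```

A score near 1 means the two vectors point in almost the same direction, which is how "similar" gets turned into a number you can sort by.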
Imagine you’re building a recommendation system for a music streaming service. You could represent each song as a vector, where the dimensions of the vector correspond to different features like genre, tempo, mood, and lyrics.
When you represent a song as a vector, each dimension captures some aspect or feature of the song.
However, the individual values in the vector don’t have an explicit, human-interpretable meaning on their own. For example, suppose a song vector has values [0.2, -0.5, 0.8, 0.1]. You can’t point to the 0.8 and definitively say, “This means the song is very danceable.” Instead, that 0.8 value contributes to the song’s “danceability” only in combination with all the other values in the vector.
This expressivity provided by a vector allows you to capture the complex relationships between songs and users in a way that’s impossible with traditional database rows.
So, why can’t you just use a regular database to store and query these vectors?
Traditional databases are designed for exact matches and range filters over discrete, well-defined fields, not for comparing continuous, high-dimensional data like vectors. They’re great for storing and querying structured data. But they’re not optimized for searching, filtering, or ranking data based on complex, high-dimensional relationships.
For example, suppose you wanted to find all songs in your music database that have a similar vibe to “Particles” by Lucy in Disguise (just a random song that I happen to be listening to while writing this). With a traditional database, you’d have to search through discrete fields like genre, artist, etc. However, songs with similar vibes may span multiple genres and artists. Not to mention that you’re assuming that you can get to the vibe of a particular track based on discrete attributes of a song.
Instead, represent each song as a high-dimensional vector that captures attributes like tempo, mood, and lyrics. Then, to find similar songs, you just find the vectors closest to your target vector in that space.
Traditional databases aren’t built to handle these queries efficiently because they’re designed to search through discrete, well-defined fields rather than multidimensional spaces.
Scale makes the problem harder: you might need to search through billions of high-dimensional vectors to find songs with a similar vibe. That’s computationally expensive, and traditional databases simply aren’t designed for this task.
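To see why this gets expensive, here is a rough sketch of the naive approach: compare the query with every stored vector and keep the closest few. The song ids, the 8 dimensions, and the random values are all invented for the example. The point is only that the work grows with the number of vectors multiplied by the number of dimensions.

```python
import heapq
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, vectors, k=3):
    """Return the ids of the k vectors closest to the query.

    Every stored vector is compared with the query, so the cost grows
    with (number of vectors) x (number of dimensions).
    """
    scored = (
        (euclidean_distance(query, vec), vec_id)
        for vec_id, vec in vectors.items()
    )
    return [vec_id for _, vec_id in heapq.nsmallest(k, scored)]


random.seed(0)
DIMENSIONS = 8  # real embeddings are far larger
songs = {
    f"song_{i}": [random.random() for _ in range(DIMENSIONS)]
    for i in range(1000)
}

target = songs["song_42"]
print(brute_force_search(target, songs, k=3))  # song_42 itself comes first
```

With a thousand songs this is instant. With billions of vectors and hundreds of dimensions, scanning everything for every query is the cost a vector database is built to avoid.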
This is where vector databases come in. They’re designed and built specifically to store, search, and efficiently query this type of data.
They’re optimized for high-performance similarity searches, clustering, and other critical operations in recommendation systems, computer vision, and natural language processing applications.
So, what is a vector database, exactly?
A vector database is a database that’s specifically designed to store, manage, query, and perform operations on large collections of vectors.
Vector databases use specialized indexing and search algorithms, like approximate nearest neighbour search (ANNS), to search, filter, and rank vectors quickly based on their similarity, distance, or other relationships. "Approximate" means they trade a small amount of accuracy for a large gain in speed.
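To give a feel for what "approximate" means, here is a toy sketch of one family of ANN techniques, locality-sensitive hashing with random hyperplanes. I'm using it only because it fits in a few lines. Many vector databases use graph-based indexes such as HNSW instead, and every function name below is my own invention for the example.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dimensions, rng):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    return [[rng.gauss(0, 1) for _ in range(dimensions)] for _ in range(num_planes)]


def bucket_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector ids by bucket so a query only inspects one bucket."""
    index = defaultdict(list)
    for vec_id, vec in vectors.items():
        index[bucket_key(vec, hyperplanes)].append(vec_id)
    return index


def approximate_candidates(query, index, hyperplanes):
    """Return only the ids that share the query's bucket."""
    return index.get(bucket_key(query, hyperplanes), [])


rng = random.Random(0)
DIMENSIONS = 8
vectors = {i: [rng.uniform(-1, 1) for _ in range(DIMENSIONS)] for i in range(1000)}
hyperplanes = make_hyperplanes(num_planes=4, dimensions=DIMENSIONS, rng=rng)
index = build_index(vectors, hyperplanes)

candidates = approximate_candidates(vectors[7], index, hyperplanes)
print(len(candidates), "of", len(vectors), "vectors need an exact comparison")
```

Vectors pointing in similar directions tend to land in the same bucket, so the exact comparison only runs on a fraction of the data. The trade-off is that a true neighbour sitting just across a hyperplane can be missed, which is exactly the accuracy you give up for speed.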
Unlike traditional databases that match exact queries, vector databases help you search through data in a way that mimics human-like perception, finding items that are conceptually “similar,” even if not identical. This is particularly important for applications like recommendation systems, semantic search, and other machine learning use cases.
While traditional databases are great for storing and querying structured data, vector databases are purpose-built for the unique challenges of managing, searching and analyzing vast amounts of unstructured, high-dimensional data. Compared to a traditional database, here’s what makes vector databases well-suited for AI workflows:
- Optimized for storing and querying high-dimensional vector data
- Support fast approximate similarity search
- Enable searching and recommendations based on semantic meaning and relevance
- Scale to massive datasets
- Integrate well with ML/AI workflows and libraries
As the volume of unstructured data expands, vector databases will play an important role in powering the current and next generations of context-aware applications.
The Role of Vector Databases in RAG
In a typical RAG workflow, the external data used to augment the LLM’s knowledge is first converted into vector embeddings.
These numerical representations capture the data’s semantic meaning and context, allowing similar items to be grouped closer together in vector space. The vector embeddings are then stored in a vector database optimized for performing fast and accurate similarity searches.
When a user submits a query, the RAG system follows these steps:
1. The query is converted into a vector embedding.
2. The vector database is searched to find the data points most similar to the query vector.
3. The retrieved data is used to augment the original query prompt.
4. The augmented prompt is fed to the LLM to generate a response.
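The four steps above can be sketched end to end in plain Python. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector database" is just a list, and `generate` is a placeholder where an LLM call would go. None of these names come from Qdrant or the OpenAI API.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Steps 1 and 2: embed the query, then rank documents by similarity."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context):
    """Step 3: augment the original query with the retrieved context."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def generate(prompt):
    """Step 4 placeholder: a real system would send the prompt to an LLM."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


documents = [
    "A vector database stores embeddings for similarity search.",
    "A novel needs a strong opening chapter and consistent pacing.",
    "Cosine similarity compares the direction of two vectors.",
]

query = "How do I keep the pacing of my novel consistent?"
context = retrieve(query, documents, k=1)
prompt = build_prompt(query, context)
print(generate(prompt))
```

Because the toy `embed` only counts shared words, it can't match "tips for writing a novel" to "creative writing techniques." A learned embedding model is what closes that gap.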
In traditional database systems, the search mechanism is designed to match exact queries to entries in the database. A RAG system, however, often needs to retrieve conceptually similar information that doesn’t match the query verbatim. For instance, a query about “tips for writing a novel” might benefit from retrieving information on “creative writing techniques,” even though that exact phrase wasn’t used in the query. This level of semantic understanding and contextual retrieval is beyond the capabilities of traditional databases.
This is where vector databases come in.
Vector databases address these challenges by using vectors to represent and understand the semantic similarities between different pieces of data. Using a vector database, the RAG system sifts through massive amounts of data to find the most relevant context for each query.
We’ll discuss how to do this in-depth as we progress. But for now, I want to discuss choosing a vector database.
How to Choose the Right Vector Database
With the growing popularity of Generative AI, the market for vector databases has exploded.
This hasn’t made my job as an author and course creator easier. There are a ton of options to choose from, and each vendor has particular strengths and weaknesses.
So, how do you pick the right one for your project? I think it comes down to a few important factors:
- Performance and scalability
- Data and query flexibility
- Community and ecosystem
- Cost and licensing
Performance and Scalability
To perform similarity searches over large datasets, you need a vector database that is fast at both indexing and querying, can handle high-dimensional vectors, and scales as your data grows.
If you expect massive datasets, look for a distributed architecture. Benchmark candidate databases on datasets similar to your production use case to get a realistic performance assessment.
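Here is a rough sketch of what such a benchmark could measure: recall@k (how many of the true nearest neighbours a search returns) and average latency. The data is random, and `search_under_test` is a hypothetical placeholder that simply reruns exact search. In a real benchmark it would call the database you are evaluating.

```python
import math
import random
import time


def exact_top_k(query, vectors, k):
    """Ground truth: compare the query with every vector."""
    ranked = sorted(vectors, key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return ranked[:k]


def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true nearest neighbours that the search returned."""
    return len(set(retrieved_ids) & set(true_ids)) / len(true_ids)


def benchmark(search_fn, queries, vectors, k=10):
    """Return (mean recall@k, mean latency in seconds) for a search function."""
    recalls, latencies = [], []
    for query in queries:
        start = time.perf_counter()
        retrieved = search_fn(query, k)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(retrieved, exact_top_k(query, vectors, k)))
    return sum(recalls) / len(recalls), sum(latencies) / len(latencies)


rng = random.Random(0)
vectors = {i: [rng.random() for _ in range(16)] for i in range(500)}
queries = [[rng.random() for _ in range(16)] for _ in range(20)]


def search_under_test(query, k):
    """Hypothetical placeholder: exact search again, so recall is perfect."""
    return exact_top_k(query, vectors, k)


mean_recall, mean_latency = benchmark(search_under_test, queries, vectors, k=10)
print(f"recall@10={mean_recall:.2f}, mean latency={mean_latency * 1000:.2f} ms")
```

Measuring recall alongside latency matters because an approximate index can look very fast simply by returning worse results.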
Data and Query Flexibility
Different vector databases offer varying levels of flexibility for data storage and queries. Consider your needs:
- Do you need to store and query metadata with vectors?
- Do you require advanced query types like k-NN or range searches?
- Are you working with dense vectors, sparse vectors, or a mix of both?
Choose a database that meets your needs to avoid costly workarounds or limitations.
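As an illustration of the first question, here is a minimal sketch of a search that combines a metadata filter with vector similarity. The data layout (an `id`, a `vector`, and a `metadata` dictionary per point) and the function names are my own assumptions for the example. How a real database combines filtering with its index is implementation-specific.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(query_vector, points, metadata_filter, k=2):
    """Keep points whose metadata matches every filter key, then rank by similarity."""
    matching = [
        p for p in points
        if all(p["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    matching.sort(
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["id"] for p in matching[:k]]


# Made-up points: a vector plus the metadata stored alongside it.
points = [
    {"id": "song_a", "vector": [0.9, 0.1, 0.3], "metadata": {"genre": "electronic", "year": 2021}},
    {"id": "song_b", "vector": [0.8, 0.2, 0.4], "metadata": {"genre": "rock", "year": 2021}},
    {"id": "song_c", "vector": [0.1, 0.9, 0.2], "metadata": {"genre": "electronic", "year": 2019}},
]

print(filtered_search([1.0, 0.0, 0.3], points, {"genre": "electronic"}, k=2))
# ['song_a', 'song_c']
```

Notice that song_b is very close to the query but never appears, because the filter excludes it before ranking. If your use case needs this kind of query, check that the database supports it natively rather than forcing you to filter results afterwards.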
Community and Ecosystem
Look for a product with active development, responsive forums, extensive docs, integrations, and community support. An offering with a lot of community member-generated content and tutorials tells me that practitioners love to use it. A thriving community will boost your project’s success because you have more help and resources to lean on.
Cost and Licensing
Consider the total cost of ownership and licensing of the vector database. Evaluate if it’s open-source or proprietary, the costs for development, production, and scaling, and any limitations in the free/community edition. Compare its pricing to similar options and ensure it aligns with your budget and long-term requirements.
That’s it for this one!
In this post, I broke down why vector databases are essential for RAG and how they handle complex data that traditional databases can’t.
Stay tuned!